W3C WHATWG

URL

Immutable Snapshot Edition

W3C Working Draft

This version:
http://www.w3.org/TR/2014/WD-url-20140923/
Latest published version:
http://www.w3.org/TR/url/
Latest editor's draft:
This snapshot contains the prose up to and including commit 72e58483bf3cfe2c773ba3b87a710ace0e11ff12 of the URL Living Standard. For the current version of the URL Living Standard, including significant errata to the contents of this specification, please see: https://url.spec.whatwg.org/.
Bug tracker:
file a bug (open bugs)
Previous version:
http://www.w3.org/TR/2012/WD-url-20120524/
Editor:
Anne van Kesteren, Mozilla (Upstream WHATWG version)
Publishing Editor:
Daniel Appelquist, Telefónica

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is published as a snapshot of the URL Living Standard with the intent of keeping the differences from the original to a strict minimum, and only through subsetting (only things that are not implemented were removed for this publication).

This document was published by the Technical Architecture Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to www-tag@w3.org (subscribe, archives). All comments are welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 August 2014 W3C Process Document.

Usage of this Document

This specification is available both as a live and immutable document. The live document endeavours to reflect the latest state of implementations and may be very relevant to implementors. The immutable document serves as a stable reference for those that require it for technical or legal reasons, such as for IPR commitments. Immutable documents may not reflect the state of implementation. Live documents may not have clear patent licensing commitments or may change at any time. Which version you refer to depends on your needs, and there are advantages and disadvantages to each.

This is the immutable version.

For the live version, see https://url.spec.whatwg.org/.

Table of Contents

  Goals
  1 Conformance
  2 Terminology
    2.1 Parsers
  3 Percent-encoded bytes
  4 Hosts (domains and IP addresses)
    4.1 IDNA
    4.2 Writing
    4.3 Parsing
    4.4 Serializing
  5 URLs
    5.1 Writing
    5.2 Parsing
    5.3 Serializing
    5.4 Origin
  6 application/x-www-form-urlencoded
    6.1 Parsing
    6.2 Serializing
    6.3 Hooks
  7 API
    7.1 Constructors
    7.2 URL statics
    7.3 URLUtils and URLUtilsReadOnly members
    7.4 Interface URLSearchParams
    7.5 URL APIs elsewhere
  References
  Acknowledgments

Goals

The URL standard standardizes URLs, aiming to make them fully interoperable. It does so as follows:

As the editor learns more about the subject matter the goals might increase in scope somewhat.

1 Conformance

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this specification are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]

2 Terminology

Some terms used in this specification are defined in the DOM, Encoding, and IDNA Standards. [DOM] [ENCODING] [IDNA]

The ASCII digits are code points in the range U+0030 to U+0039.

The ASCII hex digits are ASCII digits or are code points in the range U+0041 to U+0046 or in the range U+0061 to U+0066.

The ASCII alpha are code points in the range U+0041 to U+005A or in the range U+0061 to U+007A.

The ASCII alphanumeric are ASCII digits or ASCII alpha.
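
As a non-normative illustration, the code point classes above can be written as small predicates. The function names are assumptions made for this sketch and do not appear in this specification.

```python
def is_ascii_digit(cp: str) -> bool:
    """True for a single code point in the range U+0030 to U+0039."""
    return len(cp) == 1 and "\u0030" <= cp <= "\u0039"


def is_ascii_hex_digit(cp: str) -> bool:
    """True for ASCII digits and for U+0041 to U+0046 or U+0061 to U+0066."""
    return is_ascii_digit(cp) or (
        len(cp) == 1 and ("\u0041" <= cp <= "\u0046" or "\u0061" <= cp <= "\u0066")
    )


def is_ascii_alpha(cp: str) -> bool:
    """True for U+0041 to U+005A or U+0061 to U+007A."""
    return len(cp) == 1 and ("\u0041" <= cp <= "\u005A" or "\u0061" <= cp <= "\u007A")


def is_ascii_alphanumeric(cp: str) -> bool:
    """True for ASCII digits or ASCII alpha."""
    return is_ascii_digit(cp) or is_ascii_alpha(cp)
```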

2.1 Parsers

The EOF code point is a conceptual code point that signifies the end of a string or code point stream.

A parse error indicates a non-fatal mismatch between input and requirements. User agents are encouraged to expose parse errors somehow.

Within a parser algorithm that uses a pointer variable, c references the code point the pointer variable points to.

Within a string-based parser algorithm that uses a pointer variable, remaining references the substring after pointer in the string being processed.

If "mailto:username@example" is a string being processed and pointer points to "@", c is "@" and remaining is "example".
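
The example above can be reproduced with a short, non-normative sketch. The helper name `c_and_remaining` and the use of `None` for the EOF code point are assumptions of this sketch only.

```python
EOF = None  # stand-in for the conceptual EOF code point


def c_and_remaining(string: str, pointer: int):
    """Return (c, remaining) for a pointer into string."""
    c = string[pointer] if pointer < len(string) else EOF
    remaining = string[pointer + 1:]
    return c, remaining


string = "mailto:username@example"
c, remaining = c_and_remaining(string, string.index("@"))
```

With pointer at "@", `c` is "@" and `remaining` is "example"; with pointer past the end of the string, `c` is the EOF stand-in and `remaining` is empty.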

3 Percent-encoded bytes

A percent-encoded byte is "%", followed by two ASCII hex digits. Sequences of percent-encoded bytes, after conversion to bytes, should not cause a utf-8 decoder to run into any errors.

To percent encode a byte into a percent-encoded byte, return a string consisting of "%", followed by a double-digit, uppercase, hexadecimal representation of byte.

To percent decode a byte sequence input, run these steps:

Using anything but a utf-8 decoder when the input contains bytes outside the range 0x00 to 0x7F might be insecure and is not recommended.

  1. Let output be an empty byte sequence.

  2. For each byte byte in input, run these steps:

    1. If byte is not `%`, append byte to output.

    2. Otherwise, if byte is `%` and the next two bytes after byte in input are not in the ranges 0x30 to 0x39, 0x41 to 0x46, and 0x61 to 0x66, append byte to output.

    3. Otherwise, run these substeps:

      1. Let bytePoint be the two bytes after byte in input, decoded, and then interpreted as hexadecimal number.

      2. Append a byte whose value is bytePoint to output.

      3. Skip the next two bytes in input.

  3. Return output.
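
The percent encode and percent decode steps above can be sketched as follows. This is a non-normative illustration; the function names are assumptions of the sketch.

```python
HEX_BYTES = b"0123456789ABCDEFabcdef"


def percent_encode(byte: int) -> str:
    """Return "%" followed by the two-digit uppercase hex form of byte."""
    return "%{:02X}".format(byte)


def percent_decode(data: bytes) -> bytes:
    """Decode percent-encoded bytes; malformed "%" sequences pass through."""
    output = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        next_two = data[i + 1:i + 3]
        if byte != 0x25:  # not "%"
            output.append(byte)
        elif len(next_two) < 2 or any(b not in HEX_BYTES for b in next_two):
            # "%" not followed by two hex digits is copied unchanged.
            output.append(byte)
        else:
            output.append(int(next_two.decode("ascii"), 16))
            i += 2  # skip the two hex digits just consumed
        i += 1
    return bytes(output)
```

For instance, `percent_decode(b"caf%C3%A9")` yields the utf-8 bytes for "café", while `percent_decode(b"100%")` returns its input unchanged.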

The simple encode set are all code points less than U+0020 (i.e. excluding U+0020) and all code points greater than U+007E.

The default encode set is the simple encode set and code points U+0020, '"', "#", "<", ">", "?", and "`".

The password encode set is the default encode set and code points "/", "@", and "\".

The username encode set is the password encode set and code point ":".

To utf-8 percent encode a code point, using an encode set, run these steps:

  1. If code point is not in encode set, return code point.

  2. Let bytes be the result of running utf-8 encode on code point.

  3. Percent encode each byte in bytes, and then return them concatenated, in the same order.
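
A non-normative sketch of the four encode sets and of utf-8 percent encode follows. Each encode set is modelled as a predicate; the function names are assumptions of this sketch.

```python
def in_simple_encode_set(cp: str) -> bool:
    # Code points below U+0020 or above U+007E.
    return ord(cp) < 0x20 or ord(cp) > 0x7E


def in_default_encode_set(cp: str) -> bool:
    return in_simple_encode_set(cp) or cp in ' "#<>?`'


def in_password_encode_set(cp: str) -> bool:
    return in_default_encode_set(cp) or cp in "/@\\"


def in_username_encode_set(cp: str) -> bool:
    return in_password_encode_set(cp) or cp == ":"


def utf8_percent_encode(cp: str, in_encode_set) -> str:
    """Return cp unchanged, or its utf-8 bytes percent encoded."""
    if not in_encode_set(cp):
        return cp
    return "".join("%{:02X}".format(b) for b in cp.encode("utf-8"))
```

Because the sets nest, a code point such as ":" is left alone under the password encode set but becomes "%3A" under the username encode set.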

4 Hosts (domains and IP addresses)

A host is a network address in the form of a domain or an IPv6 address.

A domain identifies a realm within a network.

An IPv6 address is a 128-bit identifier and for the purposes of this specification represented as an ordered list of eight 16-bit pieces. [IPV6]

4.1 IDNA

The domain to ASCII given a domain domain, runs these steps:

  1. Let result be the result of running Unicode ToASCII with domain_name set to domain, UseSTD3ASCIIRules set to false, processing_option set to Transitional_Processing, and VerifyDnsLength set to false.

  2. If result is a failure value, return failure.

  3. Return result.

The domain to Unicode given a domain domain, runs these steps:

  1. Let result be the result of running Unicode ToUnicode with domain_name set to domain, UseSTD3ASCIIRules set to false.

  2. Return result, ignoring any returned errors.

    User agents are encouraged to report errors through a developer console.

4.2 Writing

A host must be either a domain or "[", followed by an IPv6 address, followed by "]".

A domain is a valid domain if these steps return success:

  1. Let result be the result of running Unicode ToASCII with domain_name set to domain, UseSTD3ASCIIRules set to true, processing_option set to Nontransitional_Processing, and VerifyDnsLength set to true.

  2. If result is a failure value, return failure.

  3. Set result to the result of running Unicode ToUnicode with domain_name set to result, UseSTD3ASCIIRules set to true.

  4. If result contains any errors, return failure.

  5. Return success.

Ideally we define this in terms of a sequence of code points that make up a valid domain rather than through a whack-a-mole: bug 25334.

A domain must be a string that is a valid domain.

An IPv6 address is defined in the "Text Representation of Addresses" chapter of IP Version 6 Addressing Architecture. [IPV6]

4.3 Parsing

The host parser takes a string input and optionally a Unicode flag, and then runs these steps:

  1. If input is the empty string, return failure.

  2. If input starts with "[", run these substeps:

    1. If input does not end with "]", parse error, return failure.

    2. Return the result of IPv6 parsing input with its leading "[" and trailing "]" removed.

  3. Let domain be the result of utf-8 decode without BOM on the percent decoding of utf-8 encode on input.

  4. Let asciiDomain be the result of running domain to ASCII on domain.

  5. If asciiDomain is failure, return failure.

  6. If asciiDomain contains one of U+0000, U+0009, U+000A, U+000D, U+0020, "#", "%", "/", ":", "?", "@", "[", "\", and "]", return failure.

  7. Return asciiDomain if the Unicode flag is unset, and the result of running domain to Unicode on asciiDomain otherwise.

The IPv6 parser takes a string input and then runs these steps:

  1. Let address be a new IPv6 address with its 16-bit pieces initialized to 0.

  2. Let piece pointer be a pointer into address's 16-bit pieces, initially zero (pointing to the first 16-bit piece), and let piece be the 16-bit piece it points to.

  3. Let compress pointer be another pointer into address's 16-bit pieces, initially null and pointing to nothing.

  4. Let pointer be a pointer into input, initially zero (pointing to the first code point).

  5. If c is ":", run these substeps:

    1. If remaining does not start with ":", parse error, return failure.

    2. Increase pointer by two.

    3. Increase piece pointer by one and then set compress pointer to piece pointer.

  6. Main: While c is not the EOF code point, run these substeps:

    1. If piece pointer is eight, parse error, return failure.

    2. If c is ":", run these inner substeps:

      1. If compress pointer is not null, parse error, return failure.

      2. Increase pointer and piece pointer by one, set compress pointer to piece pointer, and then jump to Main.

    3. Let value and length be 0.

    4. While length is less than 4 and c is an ASCII hex digit, set value to value × 0x10 + c interpreted as hexadecimal number, and increase pointer and length by one.

    5. Based on c:

      "."

      If length is 0, parse error, return failure.

      Decrease pointer by length.

      Jump to IPv4.

      ":"

      Increase pointer by one.

      If c is the EOF code point, parse error, return failure.

      Anything but the EOF code point

      Parse error, return failure.

    6. Set piece to value.

    7. Increase piece pointer by one.

  7. If c is the EOF code point, jump to Finale.

  8. IPv4: If piece pointer is greater than six, parse error, return failure.

  9. Let dots seen be 0.

  10. While c is not the EOF code point, run these substeps:

    1. Let value be null.

    2. If c is not an ASCII digit, parse error, return failure.

    3. While c is an ASCII digit, run these subsubsteps:

      1. Let number be c interpreted as decimal number.

      2. If value is null, set value to number.

        Otherwise, if value is 0, parse error, return failure.

        Otherwise, set value to value × 10 + number.

      3. Increase pointer by one.

      4. If value is greater than 255, parse error, return failure.

    4. If dots seen is less than 3 and c is not a ".", parse error, return failure.

    5. Set piece to piece × 0x100 + value.

    6. If dots seen is 1 or 3, increase piece pointer by one.

    7. Increase pointer by one.

    8. If dots seen is 3 and c is not the EOF code point, parse error, return failure.

    9. Increase dots seen by one.

  11. Finale: If compress pointer is not null, run these substeps:

    1. Let swaps be piece pointer − compress pointer.

    2. Set piece pointer to seven.

    3. While piece pointer is not zero and swaps is greater than zero, swap piece with the piece at pointer compress pointer + swaps − 1, and then decrease both piece pointer and swaps by one.

  12. Otherwise, if compress pointer is null and piece pointer is not eight, parse error, return failure.

  13. Return address.
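
The IPv6 parser steps above can be transcribed fairly directly. The following non-normative sketch assumes that failure is represented by `None`, that the address is a list of eight integers, and that parse errors are not reported separately; the name `parse_ipv6` is an assumption of the sketch.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"
DIGITS = "0123456789"


def parse_ipv6(text: str):
    """Return a list of eight 16-bit pieces, or None for failure."""
    address = [0] * 8
    piece_pointer = 0
    compress_pointer = None
    pointer = 0

    def c():
        # None stands in for the EOF code point.
        return text[pointer] if pointer < len(text) else None

    if c() == ":":
        if not text[pointer + 1:].startswith(":"):
            return None
        pointer += 2
        piece_pointer += 1
        compress_pointer = piece_pointer

    # Main
    while c() is not None:
        if piece_pointer == 8:
            return None
        if c() == ":":
            if compress_pointer is not None:
                return None
            pointer += 1
            piece_pointer += 1
            compress_pointer = piece_pointer
            continue
        value = length = 0
        while length < 4 and c() is not None and c() in HEX_DIGITS:
            value = value * 0x10 + int(c(), 16)
            pointer += 1
            length += 1
        if c() == ".":
            if length == 0:
                return None
            pointer -= length
            break  # jump to IPv4
        if c() == ":":
            pointer += 1
            if c() is None:
                return None
        elif c() is not None:
            return None
        address[piece_pointer] = value
        piece_pointer += 1

    # IPv4 (only reached with input left over, i.e. via the break above)
    if c() is not None:
        if piece_pointer > 6:
            return None
        dots_seen = 0
        while c() is not None:
            value = None
            if c() not in DIGITS:
                return None
            while c() is not None and c() in DIGITS:
                number = int(c())
                if value is None:
                    value = number
                elif value == 0:
                    return None
                else:
                    value = value * 10 + number
                pointer += 1
                if value > 255:
                    return None
            if dots_seen < 3 and c() != ".":
                return None
            address[piece_pointer] = address[piece_pointer] * 0x100 + value
            if dots_seen in (1, 3):
                piece_pointer += 1
            pointer += 1
            if dots_seen == 3 and c() is not None:
                return None
            dots_seen += 1

    # Finale
    if compress_pointer is not None:
        swaps = piece_pointer - compress_pointer
        piece_pointer = 7
        while piece_pointer != 0 and swaps > 0:
            other = compress_pointer + swaps - 1
            address[piece_pointer], address[other] = address[other], address[piece_pointer]
            piece_pointer -= 1
            swaps -= 1
    elif piece_pointer != 8:
        return None
    return address
```

The swap loop in Finale moves the pieces parsed after "::" to the end of the address, leaving the compressed run of zeros in the middle.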

4.4 Serializing

The host serializer takes null or a host host and then runs these steps:

  1. If host is null, return the empty string.

  2. If host is an IPv6 address, return "[", followed by the result of running the IPv6 serializer on host, followed by "]".

  3. Otherwise, host is a domain, return host.

The IPv6 serializer takes an IPv6 address address and then runs these steps:

  1. Let output be the empty string.

  2. Let compress pointer be a pointer to the first 16-bit piece in the first longest sequence of address's 16-bit pieces that are 0.

    In 0:f:0:0:f:f:0:0 it would point to the second 0.

  3. If there is no sequence of address's 16-bit pieces that are 0 longer than one, set compress pointer to null.

  4. For each piece in address's pieces, run these substeps:

    1. If compress pointer points to piece, append "::" to output if piece is address's first piece and append ":" otherwise, and then run these substeps again with all subsequent pieces in address's pieces that are 0 skipped, or go to the next step in the overall set of steps if that leaves no pieces.

    2. Append piece, represented as the shortest possible lowercase hexadecimal number, to output.

    3. If piece is not address's last piece, append ":" to output.

  5. Return output.

This algorithm requires the recommendation from A Recommendation for IPv6 Address Text Representation. [IPV6TEXT]
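
A non-normative sketch of the IPv6 serializer follows. It assumes the address is a list of eight integers; the name `serialize_ipv6` is an assumption of the sketch.

```python
def serialize_ipv6(address) -> str:
    """Serialize a list of eight 16-bit pieces, compressing zeros once."""
    # Locate the first longest run of zero pieces.
    best_start, best_length = None, 0
    i = 0
    while i < 8:
        if address[i] == 0:
            j = i
            while j < 8 and address[j] == 0:
                j += 1
            if j - i > best_length:
                best_start, best_length = i, j - i
            i = j
        else:
            i += 1
    # A run of a single zero piece is not compressed.
    compress_pointer = best_start if best_length > 1 else None

    output = ""
    i = 0
    while i < 8:
        if compress_pointer == i:
            output += "::" if i == 0 else ":"
            i += best_length  # skip the zero pieces in the run
            continue
        output += format(address[i], "x")
        if i != 7:
            output += ":"
        i += 1
    return output
```

For the address 0:f:0:0:f:f:0:0 mentioned above, this sketch compresses the run starting at the second 0 and produces "0:f::f:f:0:0".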

5 URLs

A URL is a universal identifier.

A URL consists of components, namely a scheme, scheme data, username, password, host, port, path, query, and fragment.

A URL's scheme is a string that identifies the type of URL and can be used to dispatch a URL for further processing after parsing. It is initially the empty string.

A URL's scheme data is a string holding the contents of a URL. It is initially the empty string.

A URL's scheme data will be its initial value if its scheme is a relative scheme, and otherwise will be the only component without an initial value.

A URL's username is a string identifying a user. It is initially the empty string.

A URL's password is either null or a string identifying a user's credentials. It is initially null.

A URL's host is either null or a host. It is initially null.

A URL's port is a string that identifies a networking port. It is initially the empty string.

A URL's path is a list of zero or more strings holding data, usually identifying a location in hierarchical form. It is initially the empty list.

A URL's query is either null or a string holding data. It is initially null.

A URL's fragment is either null or a string holding data that can be used for further processing on the resource the URL's other components identify. It is initially null.

A URL also has an associated relative flag. It is initially unset.

The relative flag exists as checking if a URL's scheme is a relative scheme can give incorrect results due to the protocol attribute.

A URL also has an associated object that is either null or a Blob. [FILEAPI]

At this point this is used primarily to support "blob" URLs, but others can be added going forward, hence "object".

A relative scheme is a scheme listed in the first column of the following table. A default port is a relative scheme's optional corresponding port and is listed in the second column on the same row.

scheme      port
"ftp"       "21"
"file"
"gopher"    "70"
"http"      "80"
"https"     "443"
"ws"        "80"
"wss"       "443"

A URL includes credentials if either its username is not the empty string or its password is non-null.
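
As a non-normative illustration, the relative scheme table and the credentials check can be modelled as follows. Ports are kept as strings, matching the port component. The names `DEFAULT_PORTS`, `is_relative_scheme`, `default_port`, and `includes_credentials` are assumptions of this sketch.

```python
# Relative schemes and their default ports; "file" has no default port.
DEFAULT_PORTS = {
    "ftp": "21",
    "file": None,
    "gopher": "70",
    "http": "80",
    "https": "443",
    "ws": "80",
    "wss": "443",
}


def is_relative_scheme(scheme: str) -> bool:
    return scheme in DEFAULT_PORTS


def default_port(scheme: str):
    """Return the default port string, or None if there is none."""
    return DEFAULT_PORTS.get(scheme)


def includes_credentials(username: str, password) -> bool:
    """True if username is non-empty or password is non-null."""
    return username != "" or password is not None
```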

A URL can be designated as base URL.

A base URL is useful for the URL parser when the input is potentially a relative URL.

5.1 Writing

A URL must be written as either a relative URL or an absolute URL, optionally followed by "#" and a fragment.

An absolute URL must be a scheme, followed by ":", followed by either a scheme-relative URL, if scheme is a relative scheme, or scheme data otherwise, optionally followed by "?" and a query.

A scheme must be one ASCII alpha, followed by zero or more of ASCII alphanumeric, "+", "-", and ".". A scheme must be registered ....
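
The syntactic part of the scheme rule can be checked with a regular expression. This non-normative sketch covers syntax only and does not attempt the registration requirement; the names `SCHEME_RE` and `has_scheme_syntax` are assumptions.

```python
import re

# One ASCII alpha, then zero or more ASCII alphanumeric, "+", "-", or ".".
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*\Z")


def has_scheme_syntax(candidate: str) -> bool:
    return SCHEME_RE.match(candidate) is not None
```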

The syntax of scheme data depends on the scheme and is typically defined alongside it. Standards must define scheme data within the constraints of zero or more URL units.

A relative URL must be either a scheme-relative URL, an absolute-path-relative URL, or a path-relative URL that does not start with a scheme and ":", optionally followed by a "?" and a query.

At the point where a relative URL is parsed, a base URL must be in scope.

A scheme-relative URL must be "//", optionally followed by userinfo and "@", followed by a host, optionally followed by ":" and a port, optionally followed by an absolute-path-relative URL.

Userinfo must be a username, optionally followed by a ":" and a password.

A username must be zero or more URL units, excluding "/", ":", "?", and "@".

A password must be zero or more URL units, excluding "/", "?", and "@".

A port must be zero or more ASCII digits.

An absolute-path-relative URL must be "/", followed by a path-relative URL that does not start with "/".

A path-relative URL must be zero or more path segments separated from each other by a "/".

A path segment must be zero or more URL units, excluding "/" and "?".

A query must be zero or more URL units.

A fragment must be zero or more URL units.

The URL code points are ASCII alphanumeric, "!", "$", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFFD, U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000 to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD, U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to U+FFFFD, U+100000 to U+10FFFD.

Code points higher than U+009F will be converted to percent-encoded bytes by the URL parser.

The URL units are URL code points and percent-encoded bytes.
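
The URL code point ranges and the URL unit rule can be sketched as predicates. This is a non-normative illustration; the names `is_url_code_point` and `is_url_units` are assumptions.

```python
import string

_ASCII_ALPHANUMERIC = string.ascii_letters + string.digits
_PUNCTUATION = "!$&'()*+,-./:;=?@_~"
# Ranges listed above: three in the first plane, then U+x0000 to U+xFFFD
# for the later planes, except that plane E starts at U+E1000.
_RANGES = [(0x00A0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD)]
_RANGES += [(p * 0x10000, p * 0x10000 + 0xFFFD) for p in range(1, 0x11) if p != 0xE]
_RANGES += [(0xE1000, 0xEFFFD)]


def is_url_code_point(cp: str) -> bool:
    if cp in _ASCII_ALPHANUMERIC or cp in _PUNCTUATION:
        return True
    return any(low <= ord(cp) <= high for low, high in _RANGES)


def is_url_units(text: str) -> bool:
    """True if text is zero or more URL code points or percent-encoded bytes."""
    i = 0
    while i < len(text):
        if text[i] == "%":
            digits = text[i + 1:i + 3]
            if len(digits) < 2 or any(d not in string.hexdigits for d in digits):
                return False
            i += 3
        elif is_url_code_point(text[i]):
            i += 1
        else:
            return False
    return True
```

Note that "%" is not itself a URL code point; it is only valid as the start of a percent-encoded byte.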

5.2 Parsing

Add the ability to halt on the first conformance error.

The URL parser takes a string input, optionally with a base URL base, and optionally with an encoding encoding override, and then runs these steps:

  1. Let url be the result of running the basic URL parser on input with base, and encoding override as provided.

  2. If url is failure, return failure.

  3. If url's scheme is not "blob", return url.

  4. If url's scheme data is not in the blob URL store, return url. [FILEAPI]

  5. Set url's object to a structured clone of the entry in the blob URL store corresponding to url's scheme data. [HTML]

  6. Return url.


The basic URL parser takes a string input, optionally with a base URL base, optionally with an encoding encoding override, optionally with a URL url and a state override state override, and then runs these steps:

The encoding override argument is a legacy concept only relevant for HTML. The url and state override arguments are only for use by methods of objects implementing the URLUtils interface. [HTML]

When the url and state override arguments are not passed the basic URL parser returns either a URL or failure. If they are passed the algorithm simply modifies the passed url and can terminate without returning anything.

  1. If url is not given:

    1. Set url to a new URL.

    2. Remove any leading and trailing ASCII whitespace from input.

  2. Let state be state override if given, or scheme start state otherwise.

  3. If base is not given, set it to null.

  4. If encoding override is not given, set it to utf-8.

  5. Let buffer be the empty string.

  6. Let the @ flag and the [] flag be unset.

  7. Let pointer be a pointer to the first code point in input.

  8. Keep running the following state machine by switching on state, increasing pointer by one after each time it is run, and if after a run pointer points to the EOF code point, go to the next step.

    scheme start state
    1. If c is an ASCII alpha, append c, lowercased, to buffer, and set state to scheme state.

    2. Otherwise, if state override is not given, set state to no scheme state, and decrease pointer by one.

    3. Otherwise, parse error, terminate this algorithm.

    scheme state
    1. If c is an ASCII alphanumeric, "+", "-", or ".", append c, lowercased, to buffer.

    2. Otherwise, if c is ":", set url's scheme to buffer, buffer to the empty string, and then run these substeps:

      1. If state override is given, terminate this algorithm.

      2. If url's scheme is a relative scheme, set url's relative flag.

      3. If url's scheme is "file", set state to relative state.

      4. Otherwise, if url's relative flag is set, base is not null and base's scheme is equal to url's scheme, set state to relative or authority state.

      5. Otherwise, if url's relative flag is set, set state to authority first slash state.

      6. Otherwise, set state to scheme data state.

    3. Otherwise, if state override is not given, set buffer to the empty string, state to no scheme state, and start over (from the first code point in input).

    4. Otherwise, if c is the EOF code point, terminate this algorithm.

    5. Otherwise, parse error, terminate this algorithm.

    scheme data state
    1. If c is "?", set url's query to the empty string and state to query state.

    2. Otherwise, if c is "#", set url's fragment to the empty string and state to fragment state.

    3. Otherwise, run these substeps:

      1. If c is not the EOF code point, not a URL code point, and not "%", parse error.

      2. If c is "%" and remaining does not start with two ASCII hex digits, parse error.

      3. If c is none of EOF code point, U+0009, U+000A, and U+000D, utf-8 percent encode c using the simple encode set, and append the result to url's scheme data.

    no scheme state

    If base is null, or base's scheme is not a relative scheme, parse error, return failure.

    Due to the protocol attribute's ability to change base's scheme, base's relative flag is not used here.

    Otherwise, set state to relative state, and decrease pointer by one.

    relative or authority state

    If c is "/" and remaining starts with "/", set state to authority ignore slashes state and increase pointer by one.

    Otherwise, parse error, set state to relative state and decrease pointer by one.

    relative state

    Set url's relative flag, set url's scheme to base's scheme if url's scheme is not "file", and then, based on c:

    EOF code point

    Set url's host to base's host, url's port to base's port, url's path to base's path, and url's query to base's query.

    "/"
    "\"
    1. If c is "\", parse error.

    2. Set state to relative slash state.

    "?"

    Set url's host to base's host, url's port to base's port, url's path to base's path, url's query to the empty string, and state to query state.

    "#"

    Set url's host to base's host, url's port to base's port, url's path to base's path, url's query to base's query, url's fragment to the empty string, and state to fragment state.

    Otherwise
    1. If url's scheme is not "file", or c is not an ASCII alpha, or remaining does not start with either ":" or "|", or remaining consists of one code point, or remaining's second code point is not one of "/", "\", "?", and "#", then set url's host to base's host, url's port to base's port, url's path to base's path, and then remove url's path's last entry.

      This is a (platform-independent) Windows drive letter quirk. When found at the start of a file URL it is treated as an absolute path rather than one relative to base's path.

    2. Set state to relative path state, and decrease pointer by one.

    relative slash state

    If c is either "/" or "\", run these steps:

    1. If c is "\", parse error.

    2. If url's scheme is "file", set state to file host state.

    3. Otherwise, set state to authority ignore slashes state.

    Otherwise, run these steps:

    1. If url's scheme is not "file", set url's host to base's host and url's port to base's port.

      file:/path/ will not inherit base's host.

    2. Set state to relative path state, and decrease pointer by one.

    authority first slash state

    If c is "/", set state to authority second slash state.

    Otherwise, parse error, set state to authority ignore slashes state, and decrease pointer by one.

    authority second slash state

    If c is "/", set state to authority ignore slashes state.

    Otherwise, parse error, set state to authority ignore slashes state, and decrease pointer by one.

    authority ignore slashes state

    If c is neither "/" nor "\", set state to authority state, and decrease pointer by one.

    Otherwise, parse error.

    authority state
    1. If c is "@", run these substeps:

      1. If the @ flag is set, parse error, prepend "%40" to buffer.

      2. Set the @ flag.

      3. For each code point in buffer, run these substeps:

        1. If code point is U+0009, U+000A, or U+000D, parse error, continue.

        2. If code point is not a URL code point and not "%", parse error.

        3. If code point is "%" and remaining does not start with two ASCII hex digits, parse error.

        4. If code point is ":" and url's password is null, set url's password to the empty string and continue.

        5. utf-8 percent encode code point using the default encode set and append the result to url's password if url's password is non-null, and to url's username otherwise.

      4. Set buffer to the empty string.

    2. Otherwise, if c is one of EOF code point, "/", "\", "?", and "#", decrease pointer by the number of code points in buffer plus one, set buffer to the empty string, and state to host state.

    3. Otherwise, append c to buffer.

    file host state
    1. If c is one of EOF code point, "/", "\", "?", and "#", decrease pointer by one, and run these substeps:

      1. If buffer consists of two code points, of which the first is an ASCII alpha and the second is either ":" or "|", set state to relative path state.

        This is a (platform-independent) Windows drive letter quirk. buffer is not reset here and instead used in the relative path state.

      2. Otherwise, if buffer is the empty string, set state to relative path start state.

      3. Otherwise, run these steps:

        1. Let host be the result of host parsing buffer.

        2. If host is failure, return failure.

        3. Set url's host to host, buffer to the empty string, and state to relative path start state.

    2. Otherwise, if c is U+0009, U+000A, or U+000D, parse error.

    3. Otherwise, append c to buffer.

    host state
    hostname state
    1. If c is ":" and the [] flag is unset, run these substeps:

      1. Let host be the result of host parsing buffer.

      2. If host is failure, return failure.

      3. Set url's host to host, buffer to the empty string, and state to port state.

      4. If state override is hostname state, terminate this algorithm.

    2. Otherwise, if c is the EOF code point, "/", "\", "?", or "#", decrease pointer by one, and run these substeps:

      1. Let host be the result of host parsing buffer.

      2. If host is failure, return failure.

      3. Set url's host to host, buffer to the empty string, and state to relative path start state.

      4. If state override is given, terminate this algorithm.

    3. Otherwise, if c is U+0009, U+000A, or U+000D, parse error.

    4. Otherwise, run these substeps:

      1. If c is "[", set the [] flag.

      2. If c is "]", unset the [] flag.

      3. Append c to buffer.

    port state
    1. If c is an ASCII digit, append c to buffer.

    2. Otherwise, if c is one of EOF code point, "/", "\", "?", and "#", or state override is given, run these substeps:

      1. Remove leading U+0030 code points from buffer until either the leading code point is not U+0030 or buffer is one code point.

        Input     Output
        "42"      "42"
        "031"     "31"
        "080"     "80"
        "0000"    "0"

      2. If buffer is equal to url's scheme's default port, set buffer to the empty string.

      3. Set url's port to buffer.

      4. If state override is given, terminate this algorithm.

      5. Set buffer to the empty string, state to relative path start state, and decrease pointer by one.

    3. Otherwise, if c is U+0009, U+000A, or U+000D, parse error.

    4. Otherwise, parse error, return failure.

    relative path start state
    1. If c is "\", parse error.

    2. Set state to relative path state and if c is neither "/" nor "\", decrease pointer by one.

    relative path state
    1. If either c is one of EOF code point, "/", and "\", or state override is not given and c is one of "?" and "#", run these substeps:

      1. If c is "\", parse error.

      2. If buffer, lowercased, matches any row in the first column of the following table, set buffer to the contents of the cell in the second column of the matched row:

        "%2e"       "."
        ".%2e"      ".."
        "%2e."      ".."
        "%2e%2e"    ".."

      3. If buffer is "..", remove url's path's last entry, if any, and then if c is neither "/" nor "\", append the empty string to url's path.

      4. Otherwise, if buffer is "." and c is neither "/" nor "\", append an empty string to url's path.

      5. Otherwise, if buffer is not ".", run these subsubsteps:

        1. If url's scheme is "file", url's path is the empty list, buffer consists of two code points, of which the first is an ASCII alpha, and the second is "|", replace the second code point in buffer with ":".

          This is a (platform-independent) Windows drive letter quirk. They are beautiful, no?

        2. Append buffer to url's path.

      6. Set buffer to the empty string.

      7. If c is "?", set url's query to the empty string, and state to query state.

      8. If c is "#", set url's fragment to the empty string, and state to fragment state.

    2. Otherwise, if c is U+0009, U+000A, or U+000D, parse error.

    3. Otherwise, run these steps:

      1. If c is not a URL code point and not "%", parse error.

      2. If c is "%" and remaining does not start with two ASCII hex digits, parse error.

      3. utf-8 percent encode c using the default encode set, and append the result to buffer.

    query state
    1. If c is the EOF code point or state override is not given and c is "#", run these substeps:

      1. If url's relative flag is unset or url's scheme is either "ws" or "wss", set encoding override to utf-8.

      2. Set buffer to the result of encoding buffer using encoding override.

      3. For each byte in buffer run these subsubsteps:

        1. If byte is less than 0x21, greater than 0x7E, or is one of 0x22, 0x23, 0x3C, 0x3E, and 0x60, append byte, percent encoded, to url's query.

        2. Otherwise, append a code point whose value is byte to url's query.

      4. Set buffer to the empty string.

      5. If c is "#", set url's fragment to the empty string, and state to fragment state.

    2. Otherwise, if c is U+0009, U+000A, or U+000D, parse error.

    3. Otherwise, run these substeps:

      1. If c is not a URL code point and not "%", parse error.

      2. If c is "%" and remaining does not start with two ASCII hex digits, parse error.

      3. Append c to buffer.

    fragment state

    Based on c:

    EOF code point

    Do nothing.

    U+0009
    U+000A
    U+000D

    Parse error.

    Otherwise
    1. If c is not a URL code point and not "%", parse error.

    2. If c is "%" and remaining does not start with two ASCII hex digits, parse error.

    3. utf-8 percent encode c using the simple encode set, and append the result to url's fragment.

  9. Return url.
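
The following non-normative sketch illustrates two pieces of the state machine above: how relative path state commits a collected segment (substeps 2 to 5), and how query state turns the encoded buffer into url's query. The function names commitSegment and encodeQueryBytes, the use of an array for url's path, and the empty string standing in for the EOF code point are assumptions made for illustration only. Parse errors are not reported.

```javascript
// Table from relative path state, substep 2 (keys are already lowercased).
const DOT_ALIASES = new Map([
  ["%2e", "."],
  [".%2e", ".."],
  ["%2e.", ".."],
  ["%2e%2e", ".."],
]);

// path: array of strings (url's path); buffer: the segment just collected;
// c: the code point that ended the segment ("" stands in for EOF here).
function commitSegment(path, buffer, c, scheme) {
  const alias = DOT_ALIASES.get(buffer.toLowerCase());
  if (alias !== undefined) buffer = alias;
  const atSlash = c === "/" || c === "\\";

  if (buffer === "..") {
    path.pop(); // removes the last entry, if any
    if (!atSlash) path.push("");
  } else if (buffer === ".") {
    if (!atSlash) path.push("");
  } else {
    // Windows drive letter quirk: "C|" becomes "C:" as the first file segment.
    if (scheme === "file" && path.length === 0 && /^[A-Za-z]\|$/.test(buffer)) {
      buffer = buffer[0] + ":";
    }
    path.push(buffer);
  }
  return path;
}

// bytes: the buffer after encoding (a Uint8Array); returns the query string.
function encodeQueryBytes(bytes) {
  const special = [0x22, 0x23, 0x3c, 0x3e, 0x60];
  let query = "";
  for (const byte of bytes) {
    if (byte < 0x21 || byte > 0x7e || special.includes(byte)) {
      query += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    } else {
      query += String.fromCharCode(byte);
    }
  }
  return query;
}
```

For the utf-8 case the buffer can be encoded with new TextEncoder().encode(buffer) before it is passed to encodeQueryBytes.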

5.3 Serializing

The URL serializer takes a URL url, optionally an exclude fragment flag, and then runs these steps:

  1. Let output be url's scheme and ":" concatenated.

  2. If url's relative flag is set:

    1. Append "//" to output.

    2. If url's username is not the empty string or url's password is non-null, run these substeps:

      1. Append url's username to output.

      2. If url's password is non-null, append ":" concatenated with url's password to output.

      3. Append "@" to output.

    3. Append url's host, serialized, to output.

    4. If url's port is not the empty string, append ":" concatenated with url's port to output.

    5. Append "/" concatenated with the strings in url's path (including empty strings), separated from each other by "/" to output.

  3. Otherwise, if url's relative flag is unset, append url's scheme data to output.

  4. If url's query is non-null, append "?" concatenated with url's query to output.

  5. If the exclude fragment flag is unset and url's fragment is non-null, append "#" concatenated with url's fragment to output.

  6. Return output.
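
As a non-normative illustration, the serializer steps can be sketched as below. The record shape (scheme, relativeFlag, username, password, host, port, path, schemeData, query, fragment) and the function name serializeURL are assumptions; in particular, host is assumed to hold an already serialized host.

```javascript
// url: a plain record mirroring the URL fields used by the serializer
// (hypothetical shape). excludeFragment mirrors the exclude fragment flag.
function serializeURL(url, excludeFragment = false) {
  let output = url.scheme + ":";
  if (url.relativeFlag) {
    output += "//";
    if (url.username !== "" || url.password !== null) {
      output += url.username;
      if (url.password !== null) output += ":" + url.password;
      output += "@";
    }
    output += url.host; // assumed to be serialized already
    if (url.port !== "") output += ":" + url.port;
    // Empty strings in path are kept, so ["a", "", "b"] yields "/a//b".
    output += "/" + url.path.join("/");
  } else {
    output += url.schemeData;
  }
  if (url.query !== null) output += "?" + url.query;
  if (!excludeFragment && url.fragment !== null) output += "#" + url.fragment;
  return output;
}
```

Note that a null query or fragment is omitted entirely, while an empty-string query or fragment still contributes a bare "?" or "#".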

5.4 Origin

See origin's definition in HTML for the necessary background information. [HTML]

A URL's origin is the origin returned by running these steps, switching on URL's scheme:

"blob"

Let url be the result of parsing URL's scheme data.

If url is failure, return an opaque identifier. Otherwise, return url's origin.

"ftp"
"gopher"
"http"
"https"
"ws"
"wss"

Return a tuple consisting of URL's scheme, its host, and its default port if its port is the empty string, and its port otherwise.

"file"

Unfortunate as it is, this is left as an exercise to the reader. When in doubt, return an opaque identifier.

Otherwise

Return an opaque identifier.
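
The switch above can be sketched, non-normatively, as follows. The record shape, the function name originOf, the parseSchemeData callback (which returns null to stand in for failure), and the use of a fresh Symbol as the opaque identifier are all assumptions. The default ports are those this standard assigns to the listed relative schemes.

```javascript
// Default ports for the schemes that yield a tuple origin.
const DEFAULT_PORTS = new Map([
  ["ftp", "21"],
  ["gopher", "70"],
  ["http", "80"],
  ["https", "443"],
  ["ws", "80"],
  ["wss", "443"],
]);

// url: record with scheme, host, port, schemeData (hypothetical shape).
// parseSchemeData: hypothetical parser callback; returns a record or null.
function originOf(url, parseSchemeData) {
  if (url.scheme === "blob") {
    const inner = parseSchemeData(url.schemeData);
    return inner === null ? Symbol("opaque origin") : originOf(inner, parseSchemeData);
  }
  if (DEFAULT_PORTS.has(url.scheme)) {
    const port = url.port === "" ? DEFAULT_PORTS.get(url.scheme) : url.port;
    return { scheme: url.scheme, host: url.host, port };
  }
  // "file" and every other scheme: a fresh opaque identifier (this sketch
  // takes the "when in doubt" route for "file").
  return Symbol("opaque origin");
}
```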

6 application/x-www-form-urlencoded

The application/x-www-form-urlencoded format is a simple way to encode name-value pairs in a byte sequence where all bytes are in the 0x00 to 0x7F range.

While this description makes application/x-www-form-urlencoded sound dated (and really, it is), the format is in widespread use due to the prevalence of HTML forms. [HTML]

6.1 Parsing

The features provided by the application/x-www-form-urlencoded parser are mainly relevant for server-oriented implementations. A browser-based implementation only needs what the application/x-www-form-urlencoded string parser requires.

The application/x-www-form-urlencoded parser takes a byte sequence input, optionally with an encoding encoding override, optionally with a use _charset_ flag, and optionally with an isindex flag, and then runs these steps:

  1. If encoding override is not given, set it to utf-8.

  2. If encoding override is not utf-8 and input contains bytes whose value is greater than 0x7F, return failure.

    This can only happen if input was not generated through the serializer or URLSearchParams.

  3. Let sequences be the result of splitting input on `&`.

  4. If the isindex flag is set and the first byte sequence in sequences does not contain a `=`, prepend `=` to the first byte sequence in sequences.

  5. Let pairs be an empty list of name-value pairs where both name and value hold a byte sequence.

  6. For each byte sequence bytes in sequences, run these substeps:

    1. If bytes is the empty byte sequence, run these substeps for the next byte sequence.

    2. If bytes contains a `=`, then let name be the bytes from the start of bytes up to but excluding its first `=`, and let value be the bytes, if any, after the first `=` up to the end of bytes. If `=` is the first byte, then name will be the empty byte sequence. If it is the last, then value will be the empty byte sequence.

    3. Otherwise, let name have the value of bytes and let value be the empty byte sequence.

    4. Replace any `+` in name and value with 0x20.

    5. If use _charset_ flag is set and name is `_charset_`, run these substeps:

      1. Let result be the result of getting an encoding for value, decoded.

      2. If result is not failure, unset use _charset_ flag and set encoding override to result.

    6. Add a pair consisting of name and value to pairs.

  7. Let output be an empty list of name-value pairs where both name and value hold a string.

  8. For each name-value pair in pairs, append a name-value pair to output where the new name and value appended to output are the result of running encoding override's decoder on the percent decoding of the name and value from pairs, respectively.

  9. Return output.
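
A minimal, non-normative sketch of the parser for the common case follows. It assumes encoding override is utf-8 and leaves out the use _charset_ flag, so steps 2 and 6.5 do not appear. The names percentDecodeBytes and parseUrlencoded are hypothetical, and TextDecoder is assumed to be available as a global.

```javascript
const isHexByte = (b) =>
  (b >= 0x30 && b <= 0x39) || (b >= 0x41 && b <= 0x46) || (b >= 0x61 && b <= 0x66);

// Percent decoding of a byte sequence; a stray "%" is kept as-is.
function percentDecodeBytes(bytes) {
  const out = [];
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] === 0x25 && i + 2 < bytes.length &&
        isHexByte(bytes[i + 1]) && isHexByte(bytes[i + 2])) {
      out.push(parseInt(String.fromCharCode(bytes[i + 1], bytes[i + 2]), 16));
      i += 2;
    } else {
      out.push(bytes[i]);
    }
  }
  return Uint8Array.from(out);
}

// input: Uint8Array; returns an array of [name, value] string pairs.
function parseUrlencoded(input, isindex = false) {
  // Step 3: split on `&` (0x26).
  const sequences = [];
  let start = 0;
  for (let i = 0; i <= input.length; i++) {
    if (i === input.length || input[i] === 0x26) {
      sequences.push(input.subarray(start, i));
      start = i + 1;
    }
  }
  // Step 4: isindex prepends `=` (0x3D) to a first sequence lacking one.
  if (isindex && !sequences[0].includes(0x3d)) {
    sequences[0] = Uint8Array.from([0x3d, ...sequences[0]]);
  }
  const decoder = new TextDecoder("utf-8");
  const plusToSpace = (seq) => seq.map((b) => (b === 0x2b ? 0x20 : b));
  const toText = (seq) => decoder.decode(percentDecodeBytes(plusToSpace(seq)));
  const output = [];
  for (const bytes of sequences) {
    if (bytes.length === 0) continue; // step 6.1
    const eq = bytes.indexOf(0x3d);
    const name = eq === -1 ? bytes : bytes.subarray(0, eq);
    const value = eq === -1 ? new Uint8Array(0) : bytes.subarray(eq + 1);
    output.push([toText(name), toText(value)]);
  }
  return output;
}
```

Because `+` is replaced before percent decoding, a literal plus sign has to arrive as %2B to survive parsing.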

6.2 Serializing

The application/x-www-form-urlencoded byte serializer takes a byte sequence input and then runs these steps:

  1. Let output be the empty string.

  2. For each byte in input, depending on byte:

    0x20

    Append U+002B to output.

    0x2A
    0x2D
    0x2E
    0x30 to 0x39
    0x41 to 0x5A
    0x5F
    0x61 to 0x7A

    Append a code point whose value is byte to output.

    Otherwise

    Append byte, percent encoded, to output.

  3. Return output.

The application/x-www-form-urlencoded serializer takes a list of name-value pairs pairs, optionally with an encoding encoding override, and then runs these steps:

  1. If encoding override is not given, set it to utf-8.

  2. Let output be the empty string.

  3. For each pair in pairs, run these substeps:

    1. Replace pair's name and value with the result of running encode on them using encoding override, respectively.

    2. Replace pair's name and value with their serialization.

    3. If this is not the first pair, append "&" to output.

    4. Append pair's name, followed by "=", followed by pair's value to output.

  4. Return output.
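
Both serializers can be sketched, non-normatively, as below. The sketch assumes encoding override is utf-8 (via the global TextEncoder) and represents pairs as two-element arrays; the names serializeBytes and serializePairs are hypothetical.

```javascript
// The byte serializer: bytes is a Uint8Array, the result is a string.
function serializeBytes(bytes) {
  let output = "";
  for (const byte of bytes) {
    if (byte === 0x20) {
      output += "+";
    } else if (
      byte === 0x2a || byte === 0x2d || byte === 0x2e ||
      (byte >= 0x30 && byte <= 0x39) ||
      (byte >= 0x41 && byte <= 0x5a) ||
      byte === 0x5f ||
      (byte >= 0x61 && byte <= 0x7a)
    ) {
      output += String.fromCharCode(byte);
    } else {
      output += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return output;
}

// The serializer: pairs is an array of [name, value] strings.
function serializePairs(pairs) {
  const encoder = new TextEncoder(); // utf-8 only in this sketch
  return pairs
    .map(([name, value]) =>
      serializeBytes(encoder.encode(name)) + "=" + serializeBytes(encoder.encode(value)))
    .join("&");
}
```

Characters such as "~" are percent encoded here even though they pass through unchanged in url's query; the two encodings are deliberately different.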

6.3 Hooks

The application/x-www-form-urlencoded string parser takes a string input, utf-8 encodes it, and then returns the result of application/x-www-form-urlencoded parsing it.

7 API

[Constructor(ScalarValueString url, optional ScalarValueString base = "about:blank"),
 Exposed=(Window,Worker)]
interface URL {
  static ScalarValueString domainToASCII(ScalarValueString domain);
  static ScalarValueString domainToUnicode(ScalarValueString domain);
};
URL implements URLUtils;

[NoInterfaceObject,
 Exposed=(Window,Worker)]
interface URLUtils {
  stringifier attribute ScalarValueString href;
  readonly attribute ScalarValueString origin;

           attribute ScalarValueString protocol;
           attribute ScalarValueString username;
           attribute ScalarValueString password;
           attribute ScalarValueString host;
           attribute ScalarValueString hostname;
           attribute ScalarValueString port;
           attribute ScalarValueString pathname;
           attribute ScalarValueString search;
           attribute URLSearchParams searchParams;
           attribute ScalarValueString hash;
};

[NoInterfaceObject,
 Exposed=(Window,Worker)]
interface URLUtilsReadOnly {
  stringifier readonly attribute ScalarValueString href;
  readonly attribute ScalarValueString origin;

  readonly attribute ScalarValueString protocol;
  readonly attribute ScalarValueString host;
  readonly attribute ScalarValueString hostname;
  readonly attribute ScalarValueString port;
  readonly attribute ScalarValueString pathname;
  readonly attribute ScalarValueString search;
  readonly attribute ScalarValueString hash;
};

Except where different, objects implementing URLUtilsReadOnly are identical to objects implementing URLUtils.

Since all members are readonly and certain members from URLUtils are not exposed, a number of potential optimizations are possible compared to objects implementing URLUtils. These are left as an exercise to the reader.

Specifications defining objects implementing URLUtils or URLUtilsReadOnly must define a get the base algorithm, which must return the appropriate base URL for the object.

Specifications defining objects implementing URLUtils may define update steps to make it possible for an underlying string (such as an attribute value) to be updated. The update steps are passed a string value for this purpose.

An object implementing URLUtils or URLUtilsReadOnly has an associated input (a string), query encoding (an encoding), query object (a URLSearchParams object or null), and a url (a URL or null). Unless stated otherwise, query encoding is utf-8. The others follow from the set the input algorithm.

The associated query encoding is a legacy concept only relevant for HTML. [HTML]

Specifications defining objects implementing URLUtils or URLUtilsReadOnly must use the set the input algorithm to set input, url, and query object. To set the input run these steps:

  1. Set url to null.

  2. Set input to the given value.

  3. Let parsedURL be the result of running the URL parser on input with base URL being the result of running get the base and query encoding as encoding override.

  4. If parsedURL is not failure, set url to parsedURL.

  5. If url is non-null and its relative flag is set, run these substeps:

    1. If query object is null, set query object to a new URLSearchParams object using url's query.

    2. Otherwise, set query object's associated list of name-value pairs to the result of parsing url's query.

  6. If url is null and query object is non-null, empty query object's associated list of name-value pairs.

To run the pre-update steps for an object implementing URLUtils, optionally given a value, run these steps:

  1. If value is not given, let value be the result of serializing the associated url.

  2. Run the update steps with value.

7.1 Constructors

The URL(url, base) constructor must run these steps:

  1. Basic URL parse base and set base to the result of that algorithm.

  2. If base is failure, throw a TypeError exception.

  3. Let result be a new URL object.

  4. Let result's get the base return base.

  5. Run result's set the input for url.

  6. If result's url is null, throw a TypeError exception.

  7. Return result.

To parse a string into a URL without using a base URL, invoke the constructor with a single argument:

var input = "http://example.org/💩",
    url = new URL(input)
url.pathname // "/%F0%9F%92%A9"

Alternatively you can use the base URL of a document through baseURI:

var input = "/💩",
    url = new URL(input, document.baseURI)
url.href // "http://url.spec.whatwg.org/%F0%9F%92%A9"

7.2 URL statics

The domainToASCII(domain) static method must run these steps:

  1. Let asciiDomain be the result of host parsing domain.

  2. If asciiDomain is an IPv6 address or failure, return the empty string.

  3. Return asciiDomain.

The domainToUnicode(domain) static method must run these steps:

  1. Let unicodeDomain be the result of host parsing domain with the Unicode flag set.

  2. If unicodeDomain is an IPv6 address or failure, return the empty string.

  3. Return unicodeDomain.

Add domainToUI() which follows the UA conventions for when to use the Unicode representation?

7.3 URLUtils and URLUtilsReadOnly members

The URLUtils and URLUtilsReadOnly interfaces are not exposed on the global object. They are meant to augment other interfaces, such as URL.

The href attribute must run these steps:

  1. If url is null, return input.

  2. Return the serialization of url.

Setting the href attribute must run these steps:

  1. Run the set the input algorithm for the given value.

  2. If the context object is a URL object and its url is null, throw a TypeError exception.

  3. Run the pre-update steps with the given value.

This means that if the href attribute is set to a value that would cause the URL parser to return failure, that value is still passed through unchanged. This is one of those unfortunate legacy incidents.

The origin attribute must run these steps:

  1. If url is null, return the empty string.

  2. Return the Unicode serialization of url's origin. [HTML]

It returns the Unicode rather than the ASCII serialization for compatibility with HTML's MessageEvent feature. [HTML]

The protocol attribute must run these steps:

  1. If url is null, return ":".

  2. Return scheme and ":" concatenated.

Setting the protocol attribute must run these steps:

  1. If url is null, terminate these steps.

  2. Basic URL parse the given value and ":" concatenated with url as url and scheme start state as state override.

  3. Run the pre-update steps.

The username attribute must run these steps:

  1. If url is null, return the empty string.

  2. Return username.

Setting the username attribute must run these steps:

  1. If url is null, or its relative flag is unset, terminate these steps.

  2. Set username to the empty string.

  3. For each code point in the given value, utf-8 percent encode it using the username encode set, and append the result to username.

  4. Run the pre-update steps.

The password attribute must run these steps:

  1. If url is null or its password is null, return the empty string.

  2. Return password.

Setting the password attribute must run these steps:

  1. If url is null, or its relative flag is unset, terminate these steps.

  2. If the given value is the empty string, set password to null, run the pre-update steps, and terminate these steps.

  3. Set password to the empty string.

  4. For each code point in the given value, utf-8 percent encode it using the password encode set, and append the result to password.

  5. Run the pre-update steps.

The host attribute must run these steps:

  1. If url is null, return the empty string.

  2. If port is the empty string, return host, serialized.

  3. Return host, serialized, ":", and port concatenated.

Setting the host attribute must run these steps:

  1. If url is null, or its relative flag is unset, terminate these steps.

  2. Basic URL parse the given value with url as url and host state as state override.

  3. Run the pre-update steps.

The hostname attribute must run these steps:

  1. If url is null, return the empty string.

  2. Return host, serialized.

Setting the hostname attribute must run these steps:

  1. If url is null, or its relative flag is unset, terminate these steps.

  2. Basic URL parse the given value with url as url and hostname state as state override.

  3. Run the pre-update steps.

The port attribute must run these steps:

  1. If url is null, return the empty string.

  2. Return port.

Setting the port attribute must run these steps:

  1. If url is null, its relative flag is unset, or its scheme is "file", terminate these steps.

  2. Otherwise, Basic URL parse the given value with url as url and port state as state override.

  3. Run the pre-update steps.

The pathname attribute must run these steps:

  1. If url is null, return the empty string.

  2. If the relative flag is unset, return scheme data.

  3. Return "/" concatenated with the strings in path (including empty strings), separated from each other by "/".

Setting the pathname attribute must run these steps:

  1. If url is null, or its relative flag is unset, terminate these steps.

  2. Set path to the empty list.

  3. Basic URL parse the given value with url as url and relative path start state as state override.

  4. Run the pre-update steps.

The search attribute must run these steps:

  1. If url is null, or its query is either null or the empty string, return the empty string.

  2. Return "?" concatenated with query.

Setting the search attribute must run these steps:

  1. If url is null, terminate these steps.

  2. If the given value is the empty string, set query to null, set query object's associated list of name-value pairs to the empty list, run its update steps, and terminate these steps.

  3. Let input be the given value with a single leading "?" removed, if any.

  4. Set query to the empty string.

  5. Basic URL parse input with url as url, query state as state override, and the associated query encoding as encoding override.

  6. Set query object's associated list of name-value pairs to the result of parsing input.

  7. Run query object's update steps.

The update steps of query object are run to ensure all url objects remain synchronized.

The searchParams attribute must return the query object.

Setting the searchParams attribute must run these steps:

  1. Let object be the given value.

  2. Remove the context object from query object's associated list of url objects.

  3. Append the context object to object's associated list of url objects.

  4. Set query object to object.

  5. Set query to the serialization of the query object's associated list of name-value pairs.

  6. Run the pre-update steps.

The hash attribute must run these steps:

  1. If url is null, or its fragment is either null or the empty string, return the empty string.

  2. Return "#" concatenated with fragment.

Setting the hash attribute must run these steps:

  1. If url is null, or its scheme is "javascript", terminate these steps.

  2. If the given value is the empty string, set fragment to null, run the pre-update steps, and terminate these steps.

  3. Let input be the given value with a single leading "#" removed, if any.

  4. Set fragment to the empty string.

  5. Basic URL parse input with url as url and fragment state as state override.

  6. Run the pre-update steps.
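
The following non-normative example exercises several of the getters and setters above through a URL object. It assumes a runtime with a global URL constructor (a current browser or Node.js); such runtimes follow the URL Living Standard, which may differ from this snapshot in details, so the example sticks to behavior on which the two are expected to agree.

```javascript
const url = new URL("http://user:pw@example.org:8080/a/b?x=1#frag");

// Snapshot of the getters before any setter runs.
const before = {
  origin: url.origin,
  protocol: url.protocol,
  username: url.username,
  password: url.password,
  host: url.host,         // host and port
  hostname: url.hostname, // host only
  port: url.port,
  pathname: url.pathname,
  search: url.search,     // includes the leading "?"
  hash: url.hash,         // includes the leading "#"
};

// Setting hash or search to the empty string removes the component.
url.hash = "";
url.search = "";

// Setting the scheme's default port leaves port as the empty string.
url.port = "80";

// The pathname setter reruns path parsing, so dot segments are resolved.
url.pathname = "/c/../d";

const after = url.href;
```

After these setters run, the password is still part of href; it is removed only by setting the password attribute to the empty string.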

7.4 Interface URLSearchParams

[Constructor(optional (ScalarValueString or URLSearchParams) init = ""),
 Exposed=(Window,Worker)]
interface URLSearchParams {
  void append(ScalarValueString name, ScalarValueString value);
  void delete(ScalarValueString name);
  ScalarValueString? get(ScalarValueString name);
  sequence<ScalarValueString> getAll(ScalarValueString name);
  boolean has(ScalarValueString name);
  void set(ScalarValueString name, ScalarValueString value);
  stringifier;
};

A URLSearchParams object has an associated list of name-value pairs, which is initially empty.

A URLSearchParams object has an associated list of zero or more url objects.

URLSearchParams objects always use utf-8 as encoding, despite the existence of concepts such as query encoding. This is to encourage developers to migrate towards utf-8, which they really ought to have done a long time ago now.

To create a new URLSearchParams object, optionally using init, run these steps:

  1. Let query be a new URLSearchParams object.

  2. If init is the empty string or null, return query.

  3. If init is a string, set query's associated list of name-value pairs to the result of parsing init.

  4. If init is a URLSearchParams object, set query's associated list of name-value pairs to a copy of init's associated list of name-value pairs.

  5. Return query.

A URLSearchParams object's update steps are to run these steps for each associated url object urlObject, in order:

  1. Set urlObject's url's query to the serialization of the URLSearchParams object's associated list of name-value pairs.

  2. Run urlObject's pre-update steps.

The URLSearchParams(init) constructor must return a new URLSearchParams object using init if given.

The append(name, value) method must run these steps:

  1. Append a new name-value pair whose name is name and value is value, to the list of name-value pairs.

  2. Run the update steps.

The delete(name) method must run these steps:

  1. Remove all name-value pairs whose name is name.

  2. Run the update steps.

The get(name) method must return the value of the first name-value pair whose name is name, and null if there is no such pair.

The getAll(name) method must return the values of all name-value pairs whose name is name, in list order, and the empty sequence otherwise.

The set(name, value) method must run these steps:

  1. If there are any name-value pairs whose name is name, set the value of the first such name-value pair to value and remove the others.

  2. Otherwise, append a new name-value pair whose name is name and value is value, to the list of name-value pairs.

  3. Run the update steps.

The has(name) method must return true if there is a name-value pair whose name is name, and false otherwise.

The stringifier must return the serialization of the URLSearchParams object's associated list of name-value pairs.
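
As a non-normative illustration, the methods above behave as follows in a runtime that provides global URL and URLSearchParams constructors (assumed here: a current browser or Node.js, both of which follow the URL Living Standard rather than this snapshot). The last part shows the update steps keeping an associated url object synchronized.

```javascript
const params = new URLSearchParams("a=1&b=2&a=3");

const firstA = params.get("a");     // value of the first pair named "a"
const allA = params.getAll("a");    // values of every pair named "a", in order
const missing = params.get("zzz");  // null when there is no such pair

params.set("a", "9");     // first "a" is overwritten, the other "a" is removed
params.append("c", "x y");
params.delete("b");
const hasB = params.has("b");

// The stringifier uses the application/x-www-form-urlencoded serializer,
// so the space in "x y" becomes "+".
const serialized = params.toString();

// A query object obtained from a URL object updates that object's query.
const linked = new URL("http://example.org/?q=1");
linked.searchParams.append("lang", "en");
const linkedHref = linked.href;
```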

7.5 URL APIs elsewhere

A standard that exposes URLs should expose the URL as a string (by serializing an internal URL). A standard should not expose a URL using a URL object. URL objects are meant for URL manipulation. In IDL the ScalarValueString type should be used.

The higher-level notion here is that values are to be exposed as immutable data structures.

If a standard decides to use a variant of the name "URL" for a feature it defines, it should name such a feature "url" (i.e. lowercase and with an "l" at the end). Names such as "URL", "URI", and "IRI" should not be used. However, if the name is a compound, "URL" (i.e. uppercase) is preferred, e.g. "newURL" and "oldURL".

The EventSource and HashChangeEvent interfaces in HTML are examples of proper naming. [HTML]

References

[DOM]
DOM, Anne van Kesteren, Aryeh Gregor and Ms2ger. WHATWG.
[ENCODING]
Encoding, Anne van Kesteren. WHATWG.
[FILEAPI]
File API, Arun Ranganathan and Jonas Sicking. W3C.
[HTML]
HTML, Ian Hickson. WHATWG.
[IDNA]
Unicode IDNA Compatibility Processing, Mark Davis and Michel Suignard. Unicode Consortium.
[IPV6]
IP Version 6 Addressing Architecture, R. Hinden and Steve Deering. IETF.
[IPV6TEXT]
(Non-normative) A Recommendation for IPv6 Address Text Representation, S. Kawamura and M. Kawashima. IETF.
[IRI]
(Non-normative) Internationalized Resource Identifiers (IRIs), Martin Dürst and Michel Suignard. IETF.
[ORIGIN]
(Non-normative) The Web Origin Concept, Adam Barth. IETF.
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels, Scott Bradner. IETF.
[URI]
(Non-normative) Uniform Resource Identifier (URI): Generic Syntax, Tim Berners-Lee, Roy Fielding and Larry Masinter. IETF.

Acknowledgments

Thanks to Adam Barth, Albert Wiersch, Alexandre Morgaut, Behnam Esfahbod, Bobby Holley, Boris Zbarsky, Brandon Ross, Daniel Bratell, David Sheets, Erik Arvidsson, Gavin Carothers, Geoff Richards, Glenn Maynard, Henri Sivonen, Ian Hickson, James Graham, James Manger, James Ross, Kevin Grandon, Marcos Cáceres, Martin Dürst, Mathias Bynens, Michael Peick, Michael™ Smith, Peter Occil, Rodney Rehm, Santiago M. Mola, Simon Pieters, Simon Sapin, Tab Atkins, Tantek Çelik, Vyacheslav Matva, and 成瀬ゆい (Yui Naruse) for being awesome!

While this standard has been written from scratch, special thanks should be extended to the editors of the various specifications that previously defined what we now call URLs: Larry Masinter, Martin Dürst, Michel Suignard, Roy Fielding, and Tim Berners-Lee.

Domenic Denicola, and Robin Berjon get a cookie for their extensive efforts in putting together a peace treaty.