
In our previous segment, “Server to Client,” we saw how a URL is requested from a server and learned all about the many conditions and caches that help optimize delivery of the associated resource. Once the browser engine finally gets the resource, it needs to start turning it into a rendered web page. In this segment, we focus primarily on HTML resources, and how the tags of HTML are transformed into the building blocks for what will eventually be presented on screen.

To use a construction metaphor, we’ve drafted the blueprints, acquired all the permits, and collected all the raw materials at the construction site; it’s time to start building!

Parsing

Once content gets from the server to the client through the networking system, its first stop is the HTML parser, which is composed of a few systems working together: encoding, pre-parsing, tokenization, and tree construction. In our construction metaphor, the parser is the stage where we work through all the raw materials: unpacking boxes; unbinding pallets of pipes, wiring, and the like; and pouring the foundation before handing everything off to the experts working on the framing, plumbing, and electrical systems.

Encoding

The payload of an HTTP response body can be anything from HTML text to image data. The first job of the parser is to figure out how to interpret the bits just received from the server. Assuming we’re processing an HTML document, the decoder must figure out how the text document was translated into bits in order to reverse the process.

Binary-to-text representation:

Characters      D         O         M
ASCII values    68        79        77
Binary values   01000100  01001111  01001101
Bits            8         8         8

(Remember that ultimately even text must be translated to binary in the computer. Encoding—in this case ASCII encoding—defines that a binary value such as “01000100” means the letter “D,” as shown in the figure above.) Many possible encodings exist for text—it’s the browser’s job to figure out how to properly decode the text. The server should provide hints via the Content-Type header, and the leading bits themselves can be analyzed (for a byte order mark, or BOM). If the encoding still cannot be determined, the browser can apply its best guess based on heuristics. Sometimes the only definitive answer comes from the (encoded) content itself in the form of a <meta> tag. In the worst-case scenario, the browser makes an educated guess and then later finds a contradicting <meta> tag after parsing has started in earnest. In these rare cases, the parser must restart, throwing away the previously decoded content. Browsers sometimes have to deal with old web content (using legacy encodings), and a lot of these systems are in place to support that.
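To make the byte-order-mark check concrete, here is a minimal JavaScript sketch of BOM sniffing. (This is an illustration only, not any browser’s actual implementation; real browsers follow the full encoding-sniffing algorithm in the Encoding standard.)

// Inspect the first bytes of the response body for a byte order mark.
// Returns the detected encoding label, or null if no BOM is present.
function sniffBOM(bytes) {
  if (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) return "UTF-8";
  if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "UTF-16BE";
  if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "UTF-16LE";
  return null; // no BOM: fall back to headers, a meta tag, or heuristics
}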

When saving your HTML documents for the web today, the choice is clear: use UTF-8 encoding. Why? It nicely supports the full Unicode range of characters, has good compatibility with ASCII for the single-byte characters common to languages like CSS, HTML, and JavaScript, and is likely to be the browser’s fallback default. You can tell when encoding goes wrong, because text won’t render properly (you will tend to get garbage characters or boxes where legible text should appear).
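Declaring the encoding explicitly spares the browser the guesswork. The standard way is a meta charset declaration near the top of the document (the HTML spec requires it to appear within the first 1024 bytes):

<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>My page</title>
</head>
<body>
...
</body>
</html>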

Pre-parsing/scanning

Once the encoding is known, the parser starts an initial pre-parsing step to scan the content, with the goal of minimizing round-trip latency for additional resources. The pre-parser is not a full parser; for example, it doesn’t understand nesting levels or parent/child relationships in HTML. However, the pre-parser does recognize specific HTML tag names and attributes, as well as URLs. For example, if you have an <img> tag referencing a picture of a dog somewhere in your HTML content, the pre-parser will notice the src attribute and queue a resource request for the dog picture via the networking system. The dog image is requested as quickly as possible, minimizing the time you need to wait for it to arrive from the network. The pre-parser may also notice certain explicit requests in the HTML, such as preload and prefetch directives, and queue these up for processing as well.
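As a hypothetical snippet (the filenames here are invented for illustration), the pre-parser could discover and queue requests for all three of these resources long before the full parser reaches them:

<link rel="preload" href="styles.css" as="style">
<link rel="prefetch" href="next-page.html">
<img src="dog.png" alt="A dog">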

Tokenization

Tokenization is the first half of parsing HTML. It involves turning the markup into individual tokens such as “begin tag,” “end tag,” “text run,” and “comment,” which are fed into the next stage of the parser. The tokenizer is a state machine that transitions between the different states of the HTML language, such as “in tag open state” (<|video controls>), “in attribute name state” (<video con|trols>), and “after attribute name state” (<video controls|>), doing so iteratively as each character in the HTML markup text document is read.

(In each of those example tags, the vertical pipe illustrates the tokenizer’s position.)

[Diagram: HTML tags being run through a tokenizer to create tokens]
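To give a feel for how such a character-driven state machine operates, here is a toy tokenizer in JavaScript with just two states. (It is only a sketch: it treats everything between < and > as the tag name, while the real tokenizer’s eighty states also handle attributes, comments, character references, and much more.)

function tokenize(html) {
  let state = "data";
  let buffer = "";
  const tokens = [];
  for (const ch of html) {
    if (state === "data") {
      if (ch === "<") {
        // Entering a tag: emit any buffered text as a token first.
        if (buffer) tokens.push({ type: "text", data: buffer });
        buffer = "";
        state = "tagOpen";
      } else {
        buffer += ch;
      }
    } else if (state === "tagOpen") {
      if (ch === ">") {
        // End of tag: emit a start-tag or end-tag token.
        const isEnd = buffer.startsWith("/");
        tokens.push({
          type: isEnd ? "endTag" : "startTag",
          data: isEnd ? buffer.slice(1) : buffer,
        });
        buffer = "";
        state = "data";
      } else {
        buffer += ch;
      }
    }
  }
  if (buffer && state === "data") tokens.push({ type: "text", data: buffer });
  return tokens;
}

// tokenize("<p>sincerely</p>") produces:
// [{ type: "startTag", data: "p" },
//  { type: "text", data: "sincerely" },
//  { type: "endTag", data: "p" }]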

The HTML spec (see “12.2.5 Tokenization”) currently defines eighty separate states for the tokenizer. The tokenizer and parser are very adaptable: both can handle and convert any text content into an HTML document, even if the text is not valid HTML. Resiliency like this is one of the features that has made the web so approachable to developers of all skill levels. However, the drawback of the tokenizer and parser’s resilience is that you may not always get the results you expect, which can lead to some subtle programming bugs. (Checking your code in an HTML validator can help you avoid bugs like this.)

For those who prefer a more black-and-white approach to markup correctness, browsers have an alternate parsing mechanism built in that treats any failure as catastrophic (meaning any failure causes the content not to render at all). This parsing mode uses the rules of XML to process HTML, and can be enabled by sending the document to the browser with the “application/xhtml+xml” MIME type (or any XML-based MIME type that uses elements in the HTML namespace).
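For example, served as text/html, the snippet below renders despite the unclosed <p>; served as application/xhtml+xml, the browser displays an XML parse error instead of any content. (The xmlns attribute is what places the elements in the HTML namespace.)

<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Strict parsing</title></head>
<body>
<p>This paragraph is never closed.
</body>
</html>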

Browsers may combine the pre-parser and tokenization steps together as an optimization.

Parsing/tree construction

The browser needs an internal (in-memory) representation of a web page, and the DOM standard defines exactly what shape that representation should take. The parser’s responsibility is to take the tokens created by the tokenizer in the previous step and create and insert the corresponding objects into the Document Object Model (DOM) in the appropriate way (specifically, using the twenty-three separate states of its state machine; see “12.2.6.4 The rules for parsing tokens in HTML content”). The DOM is organized into a tree data structure, so this process is sometimes referred to as tree construction. (As an aside, Internet Explorer did not use a tree structure for much of its history.)

[Diagram: tokens being turned into the DOM]
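Continuing the toy example from the tokenization section, a drastically simplified tree-construction step might look like the sketch below. (The real parser’s twenty-three insertion modes add implied tags, table fix-ups, and other error recovery on top of this basic stack discipline; unlike a browser, this toy would happily nest the second paragraph of the implied-end-tag example discussed next inside the first.)

function buildTree(tokens) {
  const documentNode = { tag: "#document", children: [] };
  const openElements = [documentNode]; // stack of currently open elements
  for (const token of tokens) {
    const parent = openElements[openElements.length - 1];
    if (token.type === "text") {
      parent.children.push({ tag: "#text", data: token.data, children: [] });
    } else if (token.type === "startTag") {
      const element = { tag: token.data, children: [] };
      parent.children.push(element); // insert under the deepest open element
      openElements.push(element);    // then descend into it
    } else if (token.type === "endTag") {
      openElements.pop(); // close the deepest open element
    }
  }
  return documentNode;
}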

HTML parsing is complicated by the variety of error-handling cases that ensure that legacy HTML content on the web continues to have compatible structure in today’s modern browsers. For example, many HTML tags have implied end tags, meaning that if you don’t provide them, the browser auto-closes the matching tag for you. Consider, for instance, this HTML:

<p>sincerely
<p>The authors

The parser has a rule that will create an implied end tag for the paragraph, like so:

<p>sincerely</p>
<p>The authors</p>

This ensures that the two paragraph objects in the resulting tree are siblings, rather than one paragraph object produced by ignoring the second open tag. HTML tables are perhaps the most complicated case, where the parser’s rules attempt to ensure that tables have the proper structure.

Despite all the complicated parsing rules, once the DOM tree is created, the rules that try to create a “correct” HTML structure are no longer enforced. Using JavaScript, a web page can rearrange the DOM tree in almost any way it likes, even if the result doesn’t make sense (for example, adding a table cell as the child of a <video> tag). The rendering system becomes responsible for figuring out how to deal with any weird inconsistencies like that.
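In fact, the following script builds exactly that structure, and the DOM APIs comply without complaint:

const video = document.createElement("video");
const cell = document.createElement("td");
video.appendChild(cell); // a table cell as the child of a video element
document.body.appendChild(video);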

Another complicating factor in HTML parsing is that JavaScript can add more content to be parsed while the parser is in the middle of doing its job.
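The classic example is document.write, which injects markup directly into the parser’s input stream while parsing is still underway:

<p>Before</p>
<script>
  document.write("<p>Injected while the parser is running!</p>");
</script>
<p>After</p>

The injected paragraph is tokenized and inserted as if it had appeared in the original markup, between “Before” and “After.”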