In our previous segment, “Server to Client,” we saw how a URL is requested from a server and learned all about the many conditions and caches that help optimize delivery of the associated resource. Once the browser engine finally gets the resource, it needs to start turning it into a rendered web page. In this segment, we focus primarily on HTML resources, and how the tags of HTML are transformed into the building blocks for what will eventually be presented on screen.
To use a construction metaphor, we’ve drafted the blueprints, acquired all the permits, and collected all the raw materials at the construction site; it’s time to start building!
Once content gets from the server to the client through the networking system, its first stop is the HTML parser, which is composed of a few systems working together: encoding, pre-parsing, tokenization, and tree construction. The parser is the part of the construction project metaphor where we walk through all the raw materials: unpacking boxes; unbinding pallets, pipes, wiring, etc.; and pouring the foundation before handing off everything to the experts working on the framing, plumbing, electrical, etc.
The payload of an HTTP response body can be anything from HTML text to image data. The first job of the parser is to figure out how to interpret the bits just received from the server. Assuming we’re processing an HTML document, the decoder must figure out how the text document was translated into bits in order to reverse the process.
(Remember that ultimately even text must be translated to binary in the computer. Encoding, in this case ASCII encoding, defines that a binary value such as “01000100” means the letter “D,” as shown in the figure above.) Many possible encodings exist for text, and it’s the browser’s job to figure out how to properly decode the text. The server should provide hints via Content-Type headers, and the leading bits themselves can be analyzed (for a byte order mark, or BOM). If the encoding still cannot be determined, the browser can apply its best guess based on heuristics. Sometimes the only definitive answer comes from the (encoded) content itself in the form of a <meta> HTML tag. Worst case scenario, the browser makes an educated guess and then later finds a contradicting <meta> tag after parsing has started in earnest. In these rare cases, the parser must restart, throwing away the previously decoded content. Browsers sometimes have to deal with old web content (using legacy encodings), and a lot of these systems are in place to support that.
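The chain of hints described above can be sketched in a few lines of Python. This is only an illustration, not the encoding sniffing algorithm from the HTML standard: the function name sniff_encoding, the order of the checks, the 1024-byte scan window, and the windows-1252 fallback are all assumptions made for the example.

```python
import codecs
import re

# Byte order marks this sketch knows about (a real browser handles more cases).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# charset=... inside a Content-Type header value (text).
_HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
# charset=... inside a <meta> tag in the still-undecoded bytes.
_META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def sniff_encoding(body, content_type=None, fallback="windows-1252"):
    """Return (encoding, source) for a response body given as bytes."""
    # 1. Look at the leading bytes for a byte order mark.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name, "bom"
    # 2. Use the hint from the server's Content-Type header, if any.
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower(), "header"
    # 3. Scan the start of the content itself for a <meta> declaration.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    # 4. Nothing definitive was found: fall back to a guess.
    return fallback, "guess"
```

For example, sniff_encoding(b"<p>hi", "text/html; charset=ISO-8859-1") reports the header as its source, while a body with no hints at all falls through to the guess.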
Once the encoding is known, the parser starts an initial pre-parsing step to scan the content with the goal of minimizing round-trip latency for additional resources. The pre-parser is not a full parser; for example, it doesn’t understand nesting levels or parent/child relationships in HTML. However, the pre-parser does recognize specific HTML tag names and attributes, as well as URLs. For example, if you have an <img> tag somewhere in your HTML content, the pre-parser will notice the src attribute, and queue a resource request for the dog picture via the networking system. The dog image is requested as quickly as possible, minimizing the time you need to wait for it to arrive from the network. The pre-parser may also notice certain explicit requests in the HTML such as preload and prefetch directives, and queue these up for processing as well.
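To make the idea concrete, here is a rough Python sketch of a pre-parser-style scan. It ignores nesting entirely and only looks for a few tag names and attributes. The function name prescan, the set of tags and rel values it recognizes, and the file names in the usage note are assumptions for illustration; real browsers use a far more careful scanner than a pair of regular expressions.

```python
import re

# Only a handful of tag names matter to this toy scanner.
_TAG = re.compile(r"<(img|script|link)\b([^>]*)>", re.IGNORECASE)
# Quoted attributes only; unquoted values are ignored in this sketch.
_ATTR = re.compile(r"([\w-]+)\s*=\s*[\"']([^\"']*)[\"']")
_LINK_RELS = {"preload", "prefetch", "stylesheet"}


def prescan(markup):
    """Return the URLs a pre-parser might queue, in document order."""
    urls = []
    for tag, attr_text in _TAG.findall(markup):
        attrs = {name.lower(): value for name, value in _ATTR.findall(attr_text)}
        tag = tag.lower()
        if tag in ("img", "script") and "src" in attrs:
            urls.append(attrs["src"])
        elif tag == "link" and attrs.get("rel", "").lower() in _LINK_RELS and "href" in attrs:
            urls.append(attrs["href"])
    return urls
```

Given markup containing an image with src="dog.png", a preload link to "font.woff2", and a script with src="app.js", prescan returns those three URLs in the order they appear, without ever building a tree.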
Tokenization is the first half of parsing HTML. It involves turning the markup into individual tokens such as “begin tag,” “end tag,” “text run,” “comment,” and so forth, which are fed into the next stage of the parser. The tokenizer is a state machine that transitions between the different states of the HTML language, such as “in tag open state” (<|video controls>), “in attribute name state” (<video con|trols>), and “after attribute name state” (<video controls|>), doing so iteratively as each character in the HTML markup text document is read.
(In each of those example tags, the vertical pipe illustrates the tokenizer’s position.)
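A quick way to see a token stream is Python’s standard html.parser module, which reports begin tags, end tags, text, and comments through callbacks. It is not a spec-compliant HTML tokenizer and does not expose the individual states described above, so treat this as an approximation. The class name TokenLogger and the helper tokenize are names invented for this sketch.

```python
from html.parser import HTMLParser


class TokenLogger(HTMLParser):
    """Collects the tokens reported by html.parser into a list."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; valueless attributes get None.
        self.tokens.append(("begin tag", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end tag", tag))

    def handle_data(self, data):
        self.tokens.append(("text run", data))

    def handle_comment(self, data):
        self.tokens.append(("comment", data))


def tokenize(markup):
    logger = TokenLogger()
    logger.feed(markup)
    logger.close()  # flush any buffered text
    return logger.tokens
```

Calling tokenize('<video controls>clip</video><!-- done -->') yields four tokens: a begin tag for video carrying the controls attribute, a text run, an end tag, and a comment.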
The HTML spec (see “12.2.5 Tokenization”) currently defines eighty separate states for the tokenizer. The tokenizer and parser are very adaptable: both can handle and convert any text content into an HTML document—even if code in the text is not valid HTML. Resiliency like this is one of the features that has made the web so approachable by developers of all skill levels. However, the drawback of the tokenizer and parser’s resilience is that you may not always get the results you expect, which can lead to some subtle programming bugs. (Checking your code in the HTML validator can help you avoid bugs like this.)
For those who prefer a more black-and-white approach to markup language correctness, browsers have an alternate parsing mechanism built in that treats any failure as a catastrophic failure (meaning any failure will cause the content to not render). This parsing mode uses the rules of XML to process HTML, and can be enabled by sending the document to the browser with the “application/xhtml+xml” MIME type (or any XML-based MIME type that uses elements in the HTML namespace).
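The all-or-nothing behavior of XML parsing is easy to observe outside the browser. The sketch below uses Python’s xml.etree.ElementTree as a stand-in for a browser’s XML parser, which is an assumption made purely for illustration; the helper name parses_as_xml is hypothetical.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # Any well-formedness error aborts the whole parse.
        return False
    return True
```

Markup such as "<div><p>one</p><p>two</p></div>" parses, while "<div><p>one<p>two</div>", which an HTML parser would quietly repair, is rejected outright.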
Browsers may combine the pre-parser and tokenization steps together as an optimization.
The browser needs an internal (in-memory) representation of a web page, and the DOM standard defines exactly what shape that representation should take. The parser’s responsibility is to take the tokens created by the tokenizer in the previous step, and create and insert the objects into the Document Object Model (DOM) in the appropriate way (specifically using the twenty-three separate states of its state machine; see “12.2.6.4 The rules for parsing tokens in HTML content”). The DOM is organized into a tree data structure, so this process is sometimes referred to as tree construction. (As an aside, Internet Explorer did not use a tree structure for much of its history.)
HTML parsing is complicated by the variety of error-handling cases that ensure that legacy HTML content on the web continues to have compatible structure in today’s modern browsers. For example, many HTML tags have implied end tags, meaning that if you don’t provide them, the browser auto-closes the matching tag for you. Consider, for instance, this HTML:
The parser has a rule that will create an implied end tag for the paragraph, like so:
This ensures the two paragraph objects in the resulting tree are siblings, rather than a single paragraph object produced by ignoring the second open tag. HTML tables are perhaps the most complicated case: the parser’s rules attempt to ensure that tables have the proper structure.
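The implied end tag rule for paragraphs can be imitated with a toy tree builder. The sketch below sits on top of Python’s html.parser and implements exactly one rule: a new paragraph start tag closes any paragraph that is still open. The real tree construction algorithm has many more rules and scoping conditions. The names TreeNode, ToyTreeBuilder, build_tree, and outline, and the sample markup in the usage note, are all invented for this example.

```python
from html.parser import HTMLParser


class TreeNode:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []


class ToyTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = TreeNode("#document")
        self.current = self.root  # innermost open element

    def _close(self, tag):
        # Walk up to the nearest open element with this name and close it.
        node = self.current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._close("p")  # implied end tag for an open paragraph
        node = TreeNode(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        self._close(tag)

    def handle_data(self, data):
        self.current.children.append(TreeNode("#text", self.current))


def build_tree(markup):
    builder = ToyTreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as a list of indented node names."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Feeding "<p>first<p>second" to build_tree produces two sibling p nodes under the document, each with its own text child, rather than one paragraph nested inside the other.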
tag). The rendering system becomes responsible for figuring out how to deal with any weird inconsistencies like that.
<script> tags contain text that the parser must collect and then send to a scripting engine for evaluation. While the script engine parses and evaluates the script text, the parser waits. If the script evaluation includes invoking the document.write API, a second instance of the HTML parser must start running (reentrantly). To quickly revisit our construction metaphor, <script> and document.write require stopping all in-progress work to go back to the store to get some additional materials that we hadn’t realized we needed. While we’re away at the store, all progress on the construction is stalled.
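The blocking and reentrant behavior can be modeled with a very small Python sketch. Here the “document” is just a list of chunks, scripts are plain Python functions, and the write callback stands in for document.write. The function name parse, the chunk format, and the script name in the usage note are assumptions made for this illustration; none of this reflects how a browser actually represents scripts.

```python
def parse(chunks, scripts, output=None):
    """Process chunks in order; scripts run to completion before parsing resumes.

    chunks  -- list of ("markup", text) or ("script", script_name) pairs
    scripts -- dict mapping a script name to a function that accepts `write`
    """
    output = [] if output is None else output
    for kind, value in chunks:
        if kind == "markup":
            output.append(value)
        else:
            def write(markup):
                # Reentrant call: written markup is parsed immediately,
                # before the outer loop moves on to the next chunk.
                parse([("markup", markup)], scripts, output)

            # The outer parser is stalled until the script returns.
            scripts[value](write)
    return output
```

With a script named "inject" that calls write("<p>written</p>"), parsing a heading chunk, the script, and then a paragraph chunk yields the written paragraph between the two, because nothing after the script is processed until the script has finished.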
All of these complications make writing a compliant HTML parser a non-trivial undertaking.
When the parser finishes, it announces its completion via an event called DOMContentLoaded. Beyond DOMContentLoaded, there are a variety of events that signal significant state changes in the web page, such as load (meaning parsing is done, and all the resources requested by the parser, like images, CSS, video, etc., have been downloaded) and unload (meaning the web page is about to be closed). Many events are specific to user input, such as the user touching the screen (pointerup, and others), using a mouse (mousemove, and others), or typing on the keyboard.
The tree structure of the DOM makes it convenient to “filter” how frequently code responds to an event by allowing events to be listened for at any level in the tree (i.e., at the root of the tree, in the leaves of the tree, or anywhere in between). The browser first determines where to fire the event in the tree (meaning which DOM object, such as a specific control), and then calculates a route for the event starting from the root of the tree, then down each branch until it reaches the target, and then back along the same path to the root. Each object along the route then has its event listeners triggered, so that listeners at the root of the tree will “see” more events than specific listeners at the leaves of the tree.
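The route described above (root down to the target, then back up) can be sketched in Python. The sketch follows the DOM convention that a listener is registered either for the downward (capture) leg or the upward (bubble) leg, and it assumes every event travels both legs, which is a simplification. The names EventNode, add_event_listener, and dispatch, and the three-node tree in the usage note, are invented for this example.

```python
class EventNode:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.listeners = []  # (event_type, callback, capture) triples

    def add_event_listener(self, event_type, callback, capture=False):
        self.listeners.append((event_type, callback, capture))

    def _fire(self, event_type, target, phase):
        for listener_type, callback, capture in self.listeners:
            if listener_type != event_type:
                continue
            # At the target every listener runs; elsewhere only listeners
            # registered for the current leg of the route run.
            if phase == "target" or capture == (phase == "capture"):
                callback(self, target)


def dispatch(target, event_type):
    # Compute the route from the root of the tree down to the target.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = node.parent
    path.reverse()
    ancestors = path[:-1]

    for node in ancestors:            # root -> parent of target
        node._fire(event_type, target, "capture")
    target._fire(event_type, target, "target")
    for node in reversed(ancestors):  # parent of target -> root
        node._fire(event_type, target, "bubble")
```

In a tree of root, form, and button, a listener on the root is invoked for clicks dispatched to either the form or the button, while a listener on the button only runs for clicks on the button itself.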
Some events can also be canceled, which provides, for example, the ability to stop a form submission if the form isn’t filled out properly. (A
submit event is fired from a