v6/site/posts/2023-12-27-parsing-html-fas...

12 KiB
Raw Permalink Blame History

title = "Parsing HTML Fast"
tags = ["swift"]
date = "2023-12-27 18:52:42 -0500"
short_desc = "Alternative title: Overengineering Is Fun."
slug = "parsing-html-fast"

The urge to overengineer something recently befell me, and the target of that urge was the way Tusker handles HTML parsing. Mastodon provides the content of posts (and profile descriptions, and some other things) as HTML, which means that, in order to display them properly, you need to parse the HTML, a notoriously straightforward and easy task. For a long time, the approach I took was to use SwiftSoup to parse the HTML and then walk the resulting node tree to build up a big NSAttributedString that could be displayed.

And while that technique worked perfectly well, there were a number of downsides that I wasn't entirely satisfied with. First: SwiftSoup alone is a big dependency, at around ten thousand lines of code. Since it's shipping in my app, I take the view that I'm responsible for making sure it doesn't break—and that's a lot of surface area to cover (moreover, since it was ported from a Java library, the code is extremely un-Swift-y). Second: it's not especially fast. Partly because of its aforementioned nature as a Java port (everything is a class, lots of unnecessary copies), but primarily because it's just doing more than I need it to (the first rule of optimization is do less work): it was parsing a string into a tree and then I was flattening that tree back into a string.

The last time I needed to do something similar, I used lol-html, Cloudflare's streaming HTML rewriter. But, while that worked (and is still in use for my RSS reader app), it's written in Rust, and I didn't want to introduce significant additional complexity to the build process for Tusker. So, writing a new HTML parser in Swift seemed like a great idea.

The Architecture

Parsing HTML is no small task, and the spec stands at some 42 thousand words. Fortunately, I get to ignore about half of that. Parsing HTML is divided into two primary stages: tokenizing and tree construction. The tokenization stage takes the character stream as input and produces a stream of tokens (text characters, comments, start/end tags, etc.). The tree construction stages then uses the token stream to build up the actual DOM tree. The main way I'm improving over SwiftSoup in terms of not doing unnecessary work is by skipping tree construction altogether.

This is based on lol-html's architecture, which usually uses a simulated tree construction stage (because the tokenizer gets feedback from the tree constructor and may have its state altered based on the current position in the tree) and only switches to the slower, real one when necessary. In my case, though, I don't even simulate tree construction, since everything I want from the output can be generated directly from the token stream.

This does mean that in certain circumstances the output of the tokenizer could be incorrect, but that's not a problem for my use case. For Tusker, I'm only ever going to use it to parse fragments—not whole documents—and the Mastodon backend sanitizes incoming HTML. What's more, from what I can tell of lol-html's simulator, most of the cases where (without the tree construction stage) ambiguity arises or feedback is needed aren't going to come up in my use case. They involve other tag namespaces (SVG, MathML), or things like <template> tags—both of which I'm comfortable with handling incorrectly.

Tokenizing

The tokenizer is the single biggest component of this project, weighing in at about 1500 lines of code. Fortunately though, the spec is quite clear. It describes in exacting detail the state machine that a spec-compliant tokenizer needs to mimic. To make things simple, all of my code is an almost one-for-one translation of the spec into Swift. There are a few places where I diverged, mostly in the name of performance, but where I didn't, the code is quite easy to follow.

Attributed Strings

Converting the token stream to an NSAttributedString is somewhat more complicated, but not terribly so. The gist of it is that the converter tracks what the style of the string should be at the current point in the token stream. Then, when a text character arrives from the tokenizer, it's added to a buffer representing characters for which the current styles apply (called a "run"). When an open/close tag token is emitted, the current style is adjusted. After the current styles change, the current run is finished and an attributed string is built for the current text and styles and is appended to the overall string.

Optimization

Named Character References

The spec requires HTML tokenizers to have some special handling for named character references (things like &amp;) which aren't well-formed (e.g., missing the trailing semicolon). Because I was sticking very closely to the spec, this was originally implemented rather inefficiently: when a named character reference began, it would consume further characters one by one while there were any named character references beginning with the accumulated characters. I was just using a regular Swift dictionary to store all of the valid character references, so this was effectively an O(n) operation for each character in the reference.

That is, to put it mildly, ridiculously inefficient. Instead, the new approach just consumes as many alphanumeric characters as possible until it hits a semicolon. Then, if there's no valid character reference, it backs up character by character until it finds a valid reference. This optimizes for the common case of a properly-formed reference, but (as far as I can tell) preserves the error-handling behavior that the spec requires.

Static Dispatch

The way I implemented the tokenizer, there's a separate method that follows the procedure for each state, each of which returns the next token. The tokenizer itself is also an Iterator that yields tokens. The Iterator.next implementation just switches over the current state and dispatches to the correct method.

It's fairly common for the procedure for tokenizing to involve switching to a different state. Originally, I did this by changing the state and then calling next(). One small optimization I made relatively early was, in instances where the new state is constant, replacing the next() call with a direct call to the appropriate method for the new state—effectively statically dispatching whenever the state changes.

state = .tagName
return next()
// vs.
state = .tagName
return tokenizeTagName()

UI/NSFont Caching

A number of the tags handled by the attributed string converter alter the current font. Originally, I implemented this by looking at the currently-applied styles and using them to assemble a SymbolicTraits that could be passed to UIFontDescriptor.withSymbolicTraits on the base font descriptor. When profiling, though, I found a non-trivial amount of time was being spent inside withSymbolicTraits to look up the actual font that should be used. So, the converter instead caches the known styles and their corresponding font objects.

Using Swift.Array

One of the first optimizations I'd implemented was a type called InlineArray3 which was a collection type that stored up to three elements inline. Growing beyond that would cause it to switch to a regular Swift array, the storage for which is out of line. The motivation for this was the fact that the vast majority of the HTML tags that the parser sees will have very few attributes, so you might be able to gain a little bit of performance by storing them inline.

After implementing a simple performance testing harness (discussed later), I went back and tested this, and it turned out that just using Array was a fair bit faster. Just goes to show that you shouldn't try to optimize things without having concrete data.

Looping in Hot Recursive Functions

For a number of states in the tokenizer (such as while tokenizing a tag name), the state machines stays in the same state for multiple iterations before emitting a token. Initially, I implemented this just by having the tokenization function recursively call itself. One small but measurable performance improvement I got came from turning hot recursive paths into loops. So, rather than the function calling itself recursively, the entire function body would be wrapped in an infinite loop and the "recursive call" would simply continue on to the next loop iteration.

This was one of the more surprising optimizations I made, since I'd expected this to make no difference. Given tail-call optimization, I'd think both approaches would generate basically identical code. And indeed, looking at the disassembly before and after this change seems to confirm that. There are differences, and though they seem very minor, I guess they're enough to make a difference (though a very, very small one—only about 5ms over 10k iterations).

Avoiding Enums with Associated Values

Lots of states of the tokenizer involve building up the current token until it's ready to emit. This means the parser needs to store the in-progress token. My Token data type is a Swift enum with a handful of different cases. So, the natural way of representing the current token was as an instance variable of an optional Token. Modifying the current token means, then, pattern matching against a Token and then constructing the updated Token and assigning it back.

Unfortunately, this does not permit in-place modification, meaning that all of the enum's associated values need to be copied. For primitive types, this might not be a problem. But for more complex types, like arrays and strings, this results in a bunch of extra copies. So, rather than having a single currentToken property, there are separate properties representing the data for each possible token—each of which can be modified in-place:

private var currentToken: Token?
// vs.
private var currentStartTag: (String, selfClosing: Bool, attributes: [Attribute])?
private var currentEndTag: String?
// etc.

Using Unicode.Scalar instead of Character

My tokenizer originally operated on a stream of Swift Characters, which represent extended grapheme clusters. The tokenizer, however, only cares about individual Unicode code points—and indeed is specified to operate on code points. So, I changed the tokenizer to operate on the Unicode.Scalar type. This resulted in a significant speed-up, since the there was no longer any time being spent on the grapheme breaking algorithm.

Grouping Character Tokens

When emitting the document text, the tokenizer is specified to emit individual characters. As an optimization, though, I introduced an additional token type which emits a whole string of characters in one go. Since long, uninterrupted runs of characters that don't require further parsing are quite common in actual HTML documents, this eliminates a bunch of the overhead associated with handling a token that comes from switching back and forth between the tokenizer and whatever's consuming the tokens (that overhead is amortized over the sequence of character tokens).

Conclusion

To both test the final performance of the new end-to-end HTML to NSAttributedString conversion process as well as guide most of the optimizations I described above, I wrote a small benchmarking harness that converts the HTML from the last ten thousand posts on my timeline to attributed strings using the old and new methods. This is basically the exact use case I have for this project, so I'm confident that the benchmarking results are fairly representative.

All told, the new process is about 2.7× faster when both are built in release mode, and a whopping 8× faster in debug (not terribly important, but nice for me since a debug build is often what's on my phone).

I'm very happy with how this project turned out. I greatly enjoyed overengineering/optimizing something fairly self-contained, and the end result meets all of the goals/requirements I had for my use case. If you're curious to see how it works in more detail, check out the source code.