``` title = "Parsing HTML Fast" tags = ["swift"] date = "2023-12-27 18:52:42 -0500" short_desc = "Alternative title: Overengineering Is Fun." slug = "parsing-html-fast" ``` The urge to overengineer something recently befell me, and the target of that urge was the way Tusker handles HTML parsing. Mastodon provides the content of posts (and profile descriptions, and some other things) as HTML, which means that, in order to display them properly, you need to parse the HTML, a notoriously straightforward and easy task. For a long time, the approach I took was to use [SwiftSoup](https://github.com/scinfu/SwiftSoup) to parse the HTML and then walk the resulting node tree to build up a big `NSAttributedString` that could be displayed. And while that technique worked perfectly well, there were a number of downsides that I wasn't entirely satisfied with. First: SwiftSoup alone is a _big_ dependency, at around ten thousand lines of code. Since it's shipping in my app, I take the view that I'm responsible for making sure it doesn't break—and that's a lot of surface area to cover (moreover, since it was ported from a Java library, the code is extremely un-Swift-y). Second: it's not especially fast. Partly because of its aforementioned nature as a Java port (everything is a class, lots of unnecessary copies), but primarily because it's just doing more than I need it to (the first rule of optimization is do less work): it was parsing a string into a tree and then I was flattening that tree back into a string. The [last time](/2022/swift-rust/) I needed to do something similar, I used lol-html, Cloudflare's streaming HTML rewriter. But, while that worked (and is still in use for my RSS reader app), it's written in Rust, and I didn't want to introduce significant additional complexity to the build process for Tusker. So, writing a new HTML parser in Swift seemed like a great idea. ## The Architecture Parsing HTML is no small task, and [the spec](https://html.spec.whatwg.org/multipage/parsing.html) stands at some 42 thousand words. Fortunately, I get to ignore about half of that. Parsing HTML is divided into two primary stages: tokenizing and tree construction. The tokenization stage takes the character stream as input and produces a stream of tokens (text characters, comments, start/end tags, etc.). The tree construction stages then uses the token stream to build up the actual DOM tree. The main way I'm improving over SwiftSoup in terms of not doing unnecessary work is by skipping tree construction altogether. This is based on lol-html's architecture, which usually uses a simulated tree construction stage (because the tokenizer gets feedback from the tree constructor and may have its state altered based on the current position in the tree) and only switches to the slower, real one when necessary. In my case, though, I don't even simulate tree construction, since everything I want from the output can be generated directly from the token stream. This does mean that in certain circumstances the output of the tokenizer could be incorrect, but that's not a problem for my use case. For Tusker, I'm only ever going to use it to parse fragments—not whole documents—and the Mastodon backend sanitizes incoming HTML. What's more, from what I can tell of lol-html's simulator, most of the cases where (without the tree construction stage) ambiguity arises or feedback is needed aren't going to come up in my use case. They involve other tag namespaces (SVG, MathML), or things like `