Shadowfacts 2024-05-22 19:24:59 -04:00
parent 3916a6f3cf
commit 60948118da
1 changed files with 1 additions and 3 deletions

@@ -7,13 +7,11 @@ slug = "parsing-html-slower"
[Last time](/2023/parsing-html-fast/), I wrote about how to parse HTML and convert it to `NSAttributedString`s quickly. Unfortunately, in the time since then, it's gotten slower. It's still a good deal faster than it was before all that work, mind you. At fault is not any of the optimizations I discussed last time, fortunately. Rather, to get the correct behavior across a whole slew of edge cases, there was more work that needed to be done.
The root of all this complexity is the fact that I'm essentially trying to replicate a portion of the CSS layout algorithm using only the information provided by the HTML tokenization process (that is, the text that is emitted and the start/end tags) while flattening into a single string all the structure used to achieve those results.
<!-- excerpt-end -->
-The previous version of this—which did correctly handle the initial test cases that I threw at it, but not what cropped up in the wild—worked by trying to keep track of when you had just finished one block element and then, before starting a new one, emitting like breaks to approximate the spacing between them that would otherwise be specified by CSS. Here are an assortment of issues that arise when using this strategy with real input:
+The previous version of this—which did correctly handle the initial test cases that I threw at it, but not what cropped up in the wild—worked by trying to keep track of when you had just finished one block element and then, before starting a new one, emitting line breaks to approximate the spacing between them that would otherwise be specified by CSS. Here are an assortment of issues that arise when using this strategy with real input:
### Blocks can start after a closing non-block element