```
metadata.title = "Writing a JavaScript Syntax Highlighter in Swift"
metadata.category = "swift"
metadata.date = "2020-04-09 11:48:42 -0400"
metadata.shortDesc = "Things I learned while building a tiny syntax highlighter."
metadata.slug = "syntax-highlighting-javascript"
```
For [a project](https://git.shadowfacts.net/shadowfacts/MongoView) I'm currently working on, I need to display some JavaScript code[^1], and because I'm a perfectionist, I want it to be nice and pretty and display it with syntax highlighting. Originally, I was planning to use John Sundell's [Splash](https://github.com/JohnSundell/Splash) Swift syntax highlighting library (both a "(Swift syntax) highlighting library" and a "Swift (syntax highlighting) library"). It can already render to my desired output format, an `NSAttributedString`, and it has an interface for defining new grammars, which I thought would make it relatively easy to extend to support JavaScript. After getting started, it quickly became apparent that it wouldn't be quite so easy. In addition to writing all the code to parse JavaScript, I'd have to go through the Splash codebase and understand a decent amount about how it works. This grew uninteresting pretty quickly, so I decided I would try just writing everything myself. My highlighting needs were fairly simple, so how hard could it be?
[^1]: Actually, some [not JavaScript code](/2020/faking-mongo-eval/) that looks for all intents and purposes like JavaScript code, so highlighting it is the same.
<!-- excerpt-end -->
The actual parse loop is fairly straightforward: it starts at the beginning of the string and tries to parse statements until it reaches the end of the string. Parsing a statement means looking at the next character and, depending on what it is, trying to parse something of the corresponding type. If it's a single or double quote, the highlighter tries to parse a string literal; if it's a digit, a number literal; if it's an alphabetical character, an identifier; and so on. Most of the things that can be parsed aren't all that complicated. The most difficult are template, object, and array literals, because they can all contain further expressions, and you need to be careful when recursing that parsing the inner expression doesn't start consuming part of the outer thing.
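As a rough sketch of that dispatch step (not the actual code from the project), the decision could look something like the following. The `TokenKind` enum and the `kind(startingWith:)` function are hypothetical names invented for illustration, and treating `_` and `$` as identifier starts is an assumption based on JavaScript's identifier rules rather than something described above.

```swift
import Foundation

// Hypothetical token kinds; the real highlighter's categories are not shown in this post.
enum TokenKind {
    case string, number, identifier, other
}

// Decides what to try parsing based on the first Unicode scalar of a statement.
func kind(startingWith scalar: Unicode.Scalar) -> TokenKind {
    if scalar == "\"" || scalar == "'" {
        // A quote opens a string literal.
        return .string
    } else if CharacterSet.decimalDigits.contains(scalar) {
        // A digit opens a number literal.
        return .number
    } else if CharacterSet.letters.contains(scalar) || scalar == "_" || scalar == "$" {
        // A letter (or, by assumption, _ or $) opens an identifier.
        return .identifier
    } else {
        // Everything else (punctuation, brackets, template literals, ...) is
        // lumped together here to keep the sketch short.
        return .other
    }
}
```

A real implementation would go on to call a dedicated parsing routine for each case; the sketch only covers the "look at the next character and decide" part.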
One simplifying factor is that there are a number of things my highlighter intentionally doesn't handle, including keywords and block statements. The main reason is that I expect those to come up rarely, if ever, in the context I'm using this in. I also purposely didn't touch a bunch of other things that an actual JavaScript parser/interpreter would have to be concerned with in order to actually execute code. At the top of that list are things like automatic semicolon insertion (JavaScript's weird way of making semicolons optional) and operator precedence, since they have no effect on the highlighted output.
One of the more annoying parts, completely unrelated to JavaScript, is dealing with strings in Swift. Sure, Swift's handling of strings is totally safe and correct, but it's an absolute pain in the ass to use. _Want to get the fifth character in a string? Just use `string[string.index(string.startIndex, offsetBy: 4)]`, it's super simple!_ So, the highlighter keeps track of a `String.Index` internally and has several helper methods for moving around within the string. Furthermore, the `CharacterSet` type is weird and doesn't work the way you'd expect. Because it's bridged from Objective-C, its `contains(_:)` method doesn't take a Swift `Character`, it takes a `Unicode.Scalar`. Because of this, the entire highlighter doesn't care about characters as Swift views them, it only cares about Unicode scalars, using the string's `String.UnicodeScalarView`.
Also, this may be the first time I've ever used while/let in Swift. The peek function returns the next character in the string, or `nil`, if there are none remaining, so, with while/let, consuming all characters in a set is as simple as:
```swift
while let char = peek(),
      CharacterSet.whitespacesAndNewlines.contains(char) {
    consume()
}
```
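For context, here is a minimal, self-contained sketch of what helpers like `peek` and `consume` might look like when built on the string's Unicode scalar view. Only the names `peek` and `consume` come from the snippet above; the `ScalarScanner` type, its properties, and the `skip(_:)` method are assumptions made up for this example, not the highlighter's real implementation.

```swift
import Foundation

// Hypothetical minimal scanner over a string's Unicode scalars.
struct ScalarScanner {
    let scalars: String.UnicodeScalarView
    var currentIndex: String.Index

    init(_ source: String) {
        let view = source.unicodeScalars
        scalars = view
        currentIndex = view.startIndex
    }

    // Returns the next scalar without advancing, or nil at the end of the input.
    func peek() -> Unicode.Scalar? {
        return currentIndex < scalars.endIndex ? scalars[currentIndex] : nil
    }

    // Advances past the next scalar. Callers are expected to check peek() first.
    mutating func consume() {
        currentIndex = scalars.index(after: currentIndex)
    }

    // Consumes a run of scalars that all belong to the given set.
    mutating func skip(_ set: CharacterSet) {
        while let scalar = peek(), set.contains(scalar) {
            consume()
        }
    }
}

var scanner = ScalarScanner("  \n\tfoo")
scanner.skip(.whitespacesAndNewlines)
// scanner.peek() now returns the scalar "f"
```

Keeping the index and the scalar view together means every lookup goes through `unicodeScalars`, so `CharacterSet.contains(_:)` can be called directly on whatever `peek` returns.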
I spent a couple days profiling it, trying to improve the performance to a point where it's usable for live-highlighting a decently large file. Right now, a full rehighlight of a 1200-line JSON object takes around 10 ms, which, while not spectacularly fast, is fast enough that there's no appreciable latency while typing. One of the single biggest changes I made was to ensure that I'm only ever using the string's Unicode scalar view. Just going from `string[currentIndex] == "\\"` to `string.unicodeScalars[currentIndex] == "\\"` in the JS-string handling code resulted in an 8 ms improvement. Another performance-driven change I made, though not to the syntax highlighter itself, was to try and only rehighlight when absolutely necessary. For the most common operations, typing or deleting a single character, I find the token that is being modified, and, if the added/removed character wouldn't cause a structural change to the rest of the text (e.g., inserting a character inside of a string), I can alter the length of the modified token and shift the locations of all subsequent tokens. This takes about 70 &mu;s for deleting a single character and 130 &mu;s for inserting a single character. Inserting, I think (but haven't verified), takes so much longer because I also have to add an attribute to the attributed string for the newly inserted character, which kicks off a bunch of work inside the text view.
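The token-shifting idea can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the `Token` struct, the `applyEdit(at:delta:to:)` function, and the use of `NSRange` for token positions are all hypothetical, the token array is assumed to be sorted by location, and the check for whether an edit would cause a structural change is left out entirely.

```swift
import Foundation

// Hypothetical token record; only its range matters for this sketch.
struct Token {
    var range: NSRange
}

// Adjusts the token list for a single-character edit at `location`.
// `delta` is +1 for an insertion and -1 for a deletion.
// Returns false when no token contains the location, in which case the
// caller would presumably fall back to a full rehighlight.
func applyEdit(at location: Int, delta: Int, to tokens: inout [Token]) -> Bool {
    guard let index = tokens.firstIndex(where: { NSLocationInRange(location, $0.range) }) else {
        return false
    }
    // Grow or shrink the token that was edited...
    tokens[index].range.length += delta
    // ...and shift every token after it by the same amount.
    for later in tokens.index(after: index)..<tokens.endIndex {
        tokens[later].range.location += delta
    }
    return true
}

var tokens = [
    Token(range: NSRange(location: 0, length: 5)),
    Token(range: NSRange(location: 6, length: 3)),
]
// Simulate inserting one character inside the first token.
let applied = applyEdit(at: 2, delta: 1, to: &tokens)
```

After the call, the first token is one unit longer and the second starts one position later, without anything having been re-parsed.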
## Conclusion
If you'd asked me a year ago, heck, even a couple months ago, if I'd ever think about undertaking a project like this myself, I'd have said absolutely not and proceeded to go find a third party library that could do the job adequately. But recently, I've been watching [Jonathan Blow](https://youtu.be/MnctEW1oL-E) talk about building parsers and [Andreas Kling](https://youtu.be/watch?v=byNwCHc_IIM) actually build a JavaScript interpreter starting from scratch, and there's one thing that they both mentioned on multiple occasions that really stuck with me: it's just code. Sure, its input is source code, but the operations it performs to produce syntax highlighted output aren't anything insanely complicated or out of the reach of any reasonably experienced programmer.
I'm not trying to claim that what I've written is anywhere near as complicated as a full-blown parser or interpreter that could be used to execute code. It isn't a trivial one either, though, and it is a project that, not too long ago, I wouldn't have willingly undertaken. Parsers, particularly parsers for programming language source code, carry this perception that only the best of the best can build them because they're so incredibly complicated. And that's not true at all. Sure, they're complex programs, because the problem they're solving is non-trivial. But the way you go about solving it isn't insanely difficult, doesn't require any specialized knowledge, and doesn't use any uncommon techniques. The most important thing is breaking down one big problem ("how do you parse source code?") into smaller and smaller chunks that can be solved individually and then combined together.