tree-sitter-elixir/docs/parser.md

11 KiB

Parser

The AST

When it comes to AST Elixir is a rather specific language due to its macro system. From the perspective of our parser, the important implication is that a seemingly invalid code can be a valid syntax when used in a macro (or just put in the quote expression). For example:

quote do
  def Bar.foo(x), definitely_not_do: 1
  %{a}
  */2
end

As opposed to other languages, core constructs like def, if and for are not particularly special either, since they are itself regular functions (or macros rather). As a result, these constructs can be used "improperly" in a quoted expression, as shown above.

Consequently, to correctly parse all Elixir code, we need the AST to closely match the Elixir AST. See Elixir / Syntax reference for more details.

Whenever possible, we try using a more specific nodes (like binary/unary operator), but only to the extent that doesn't lose on generality. To get a sense of what the AST looks like, have a look at the tests in test/corpus/.

Getting started with Tree-sitter

For detailed introduction see the official guide on Creating parsers.

Essentially, we define relevant language rules in grammar.js, based on which Tree-sitter generates parser code (under src/). In some cases, we want to write custom C++ code for tokenizing specific character sequences (in src/scanner.cc).

The grammar rules may often conflict with each other, meaning that the given sequence of tokens has multiple valid interpretations given one token of lookahead. In many conflicts we always want to pick one interpretation over the other and we can do this by assigning different precedence and associativity to relevant rules, which tells the parser which way to go.

For example given expression1 * expression2 • * the next token we see ahead is *. The parser needs to decide whether expression1 * expression2 is a complete binary operator node, or if it should await the next expression and interpret it as expression1 * (expression2 * expression3). Since the * operator is left-associative we can use prec.left on the corresponding grammar rule, to inform the parser how to resolve this conflict.

However, in some cases looking at one token ahead isn't enough, in which case we can add the conflicting rules to the conflicts list in the grammar. Whenever the parser stumbles upon this conflict it uses its GLR algorithm, basically considering both interpretations until one leads to parsing error. If both paths parse correctly (there's a genuine ambiguity) we can use dynamic precedence (prec.dynamic) to decide on the preferred path.

Using the CLI

tree-sitter

# See CLI usage
npx tree-sitter -h

# Generate the the parser code based on grammar.js
npx tree-sitter generate

# Run tests
npx tree-sitter test
npx tree-sitter test --filter "access syntax"

# Parse a specific file
npx tree-sitter parse tmp/test.ex
npx tree-sitter parse -x tmp/test.ex

# Parse codebase to verify syntax coverage
npx tree-sitter parse --quiet --stat 'tmp/elixir/**/*.ex*'

Whenever you make a change to grammar.js remember to run generate, before verifying the result. To test custom code, create an Elixir file like tmp/test.ex and use parse on it. The -x flag prints out the source grouped into AST nodes as XML.

Additional scripts

# Format the grammar.js file
npm run format

# Run parser against the given repository
scripts/parse_repo.sh elixir-lang/elixir

# Run parser against a predefined list of popular repositories
scripts/integration_test.sh

Implementation notes

This section covers some of the implementation decisions that have a more elaborated rationale. The individual subsections are referenced in the code.

Ref 1. External scanner for quoted content

We want to scan quoted content as a single token, but it requires lookahead. Specifically the # character may no longer be quoted content if followed by {. Also, inside heredoc string tokenizing " (or ') requires lookahead to know if it's already part of the end delimiter or not.

Since we need to use external scanner, we need to know the delimiter type. One way to achieve this is using external scanner to scan the start delimiter and then storing its type on the parser stack. This approach requires the parser to allocate enough memory upfront and implement serialization/deserialization, which ideally would be avoided. To avoid this, we use a different approach! Instead of having a single quoted_content token, we have specific tokens for each quoted content type, such as _quoted_content_i_single, _quoted_content_i_double. Once the start delimiter is tokenized, we know which quoted content should be tokenized next, and from the token we can infer the end delimiter and whether it supports interpolation. In other words, we extract the information from the parsing state, rather than maintaining custom parser state.

Ref 2. External scanner for newlines

Generally newlines may appear in the middle of expressions and we ignore them as long as the expression is valid, that's why we list newline under extras.

When a newline follows a complete expression, most of the time it should be treated as terminator. However, there are specific cases where the newline is non-breaking and treated as if it was just a space. This cases are:

  • call followed by newline and a do end block
  • expression followed by newline and a binary operator

In both cases we want to tokenize the newline as non-breaking, so we use external scanner for lookahead.

Note that the relevant rules already specify left/right associativity, so if we simply added optional("\n") the conflicts would be resolved immediately rather without using GLR.

Additionally, since comments may appear anywhere and don't change the context, we also tokenize newlines before comments as non-breaking.

Ref 3. External scanner for unary + and -

Plus and minus are either binary or unary operators, depending on the context. Consider the following variants

a + b
a+b
a+ b
a +b

In the first three expressions + is a binary operator, while in the last one + is an unary operator referring to local call argument.

To correctly tokenize all cases we use external scanner to tokenize a special empty token (_before_unary_operator) when the spacing matches a +b, which forces the parser to pick the unary operator path.

Ref 4. External scanner for not in

The not in operator may have an arbitrary inline whitespace between not and in.

We cannot use a regular expression like /not[ \t]+in/, because it would also match in expressions like a not inn as the longest matching token.

A possible solution could be seq("not", "in") with dynamic conflict resolution, but then we tokenize two separate tokens. Also to properly handle a not inn, we would need keyword extraction, which causes problems in our case (https://github.com/tree-sitter/tree-sitter/issues/1404).

In the end it's easiest to use external scanner, so that we can skip inline whitespace and ensure token ends after in.

Ref 5. External scanner for quoted atom start

For parsing quoted atom : we could make the " (or ') token immediate, however this would require adding immediate rules for single/double quoted content and listing them in relevant places. We could definitely do that, but using external scanner is actually simpler.

Ref 6. Identifier pattern

See Elixir / Unicode Syntax for official notes.

Tree-sitter already supports unicode properties in regular expressions, however character class subtraction is not supported.

For the base <Start> and <Continue> we can use [\p{ID_Start}] and [\p{ID_Continue}] respectively, since both are supported and according to the Unicode Annex #31 they match the ranges listed in the Elixir docs.

For atoms this translates to a clean regular expression.

For variables however, we want to exclude uppercase (\p{Lu}) and titlecase (\p{Lt}) categories from \p{ID_Start}. As already mentioned, we cannot use group subtraction in the regular expression, so instead we need to create a suitable group of characters on our own.

After removing the uppercase/titlecase categories from [\p{ID_Start}], we obtain the following group:

[\p{Ll}\p{Lm}\p{Lo}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]

At the time of writing the subtracted groups actually only remove a single character:

Mix.install([{:unicode_set, "~> 1.1"}])

Unicode.Set.to_utf8_char(
  "[[[:Ll:][:Lm:][:Lo:][:Nl:][:Other_ID_Start:]] & [[:Pattern_Syntax:][:Pattern_White_Space:]]]"
)
#=> {:ok, [11823]}

Consequently, by removing the subtraction we allow just one additional (not common) character, which is perfectly acceptable.

It's important to note that JavaScript regular expressions don't support the \p{Other_ID_Start} unicode category. Fortunately this category is a small set of characters introduces for backward compatibility, so we can enumerate it manually:

Mix.install([{:unicode_set, "~> 1.1"}])

Unicode.Set.to_utf8_char("[[[:Other_ID_Start:]] - [[:Pattern_Syntax:][:Pattern_White_Space:]]]")
|> elem(1)
|> Enum.flat_map(fn
  n when is_number(n) -> [n]
  range -> range
end)
|> Enum.map(&Integer.to_string(&1, 16))
#=> ["1885", "1886", "2118", "212E", "309B", "309C"]

Finally, we obtain this regular expression group for variable <Start>:

[\p{Ll}\p{Lm}\p{Lo}\p{Nl}\u1885\u1886\u2118\u212E\u309B\u309C]

Ref 7. Keyword token

We tokenize the whole keyword sequence like do: as a single token. Ideally we wouldn't include the whitespace, but since we use token it gets include. However, this is an intentionally accepted tradeoff, because using token significantly simplifies the grammar and avoids conflicts.

The alternative approach would be to define keyword as seq(alias(choice(...), $._keyword_literal), $._keyword_end), where we list all other tokens that make for for valid keyword literal and use custom scanner for _keyword_end to look ahead without tokenizing the whitespace. However, this approach generates a number of conflicts because : is tokenized separately and phrases like fun fun • do or fun • {} are ambiguous (interpretation depends on whether : comes next). Resolving some of these conflicts (for instance special keywords like {} or %{}) requires the use of external scanner. Given the complexities this approach brings to the grammar, and consequently the parser, we stick to the simpler approach.