262 lines
11 KiB
Markdown
262 lines
11 KiB
Markdown
# Parser
|
|
|
|
## The AST
|
|
|
|
When it comes to AST Elixir is a rather specific language due to its macro system.
|
|
From the perspective of our parser, the important implication is that a seemingly
|
|
invalid code can be a valid syntax when used in a macro (or just put in the `quote`
|
|
expression). For example:
|
|
|
|
```elixir
|
|
quote do
|
|
def Bar.foo(x), definitely_not_do: 1
|
|
%{a}
|
|
*/2
|
|
end
|
|
```
|
|
|
|
As opposed to other languages, core constructs like `def`, `if` and `for` are not
|
|
particularly special either, since they are itself regular functions (or macros rather).
|
|
As a result, these constructs can be used "improperly" in a quoted expression, as shown above.
|
|
|
|
Consequently, to correctly parse all Elixir code, we need the AST to closely match
|
|
the Elixir AST. See [Elixir / Syntax reference](https://hexdocs.pm/elixir/syntax-reference.html)
|
|
for more details.
|
|
|
|
Whenever possible, we try using a more specific nodes (like binary/unary operator), but only
|
|
to the extent that doesn't lose on generality. To get a sense of what the AST looks like, have
|
|
a look at the tests in `test/corpus/`.
|
|
|
|
## Getting started with Tree-sitter
|
|
|
|
For detailed introduction see the official guide on [Creating parsers](https://tree-sitter.github.io/tree-sitter/creating-parsers).
|
|
|
|
Essentially, we define relevant language rules in `grammar.js`, based on which
|
|
Tree-sitter generates parser code (under `src/`). In some cases, we want to write
|
|
custom C++ code for tokenizing specific character sequences (in `src/scanner.cc`).
|
|
|
|
The grammar rules may often conflict with each other, meaning that the given
|
|
sequence of tokens has multiple valid interpretations given one _token_ of lookahead.
|
|
In many conflicts we always want to pick one interpretation over the other and we can
|
|
do this by assigning different precedence and associativity to relevant rules, which
|
|
tells the parser which way to go.
|
|
|
|
For example given `expression1 * expression2 • *` the next token we _see_ ahead is `*`.
|
|
The parser needs to decide whether `expression1 * expression2` is a complete binary operator
|
|
node, or if it should await the next expression and interpret it as `expression1 * (expression2 * expression3)`.
|
|
Since the `*` operator is left-associative we can use `prec.left` on the corresponding
|
|
grammar rule, to inform the parser how to resolve this conflict.
|
|
|
|
However, in some cases looking at one token ahead isn't enough, in which case we can add
|
|
the conflicting rules to the `conflicts` list in the grammar. Whenever the parser stumbles
|
|
upon this conflict it uses its GLR algorithm, basically considering both interpretations
|
|
until one leads to parsing error. If both paths parse correctly (there's a genuine ambiguity)
|
|
we can use dynamic precedence (`prec.dynamic`) to decide on the preferred path.
|
|
|
|
## Using the CLI
|
|
|
|
### tree-sitter
|
|
|
|
```shell
|
|
# See CLI usage
|
|
npx tree-sitter -h
|
|
|
|
# Generate the the parser code based on grammar.js
|
|
npx tree-sitter generate
|
|
|
|
# Run tests
|
|
npx tree-sitter test
|
|
npx tree-sitter test --filter "access syntax"
|
|
|
|
# Parse a specific file
|
|
npx tree-sitter parse tmp/test.ex
|
|
npx tree-sitter parse -x tmp/test.ex
|
|
|
|
# Parse codebase to verify syntax coverage
|
|
npx tree-sitter parse --quiet --stat 'tmp/elixir/**/*.ex*'
|
|
```
|
|
|
|
Whenever you make a change to `grammar.js` remember to run `generate`,
|
|
before verifying the result. To test custom code, create an Elixir file
|
|
like `tmp/test.ex` and use `parse` on it. The `-x` flag prints out the
|
|
source grouped into AST nodes as XML.
|
|
|
|
### Additional scripts
|
|
|
|
```shell
|
|
# Format the grammar.js file
|
|
npm run format
|
|
|
|
# Run parser against the given repository
|
|
scripts/parse_repo.sh elixir-lang/elixir
|
|
|
|
# Run parser against a predefined list of popular repositories
|
|
scripts/integration_test.sh
|
|
```
|
|
|
|
## Implementation notes
|
|
|
|
This section covers some of the implementation decisions that have a more
|
|
elaborated rationale. The individual subsections are referenced in the code.
|
|
|
|
### Ref 1. External scanner for quoted content
|
|
|
|
We want to scan quoted content as a single token, but it requires lookahead.
|
|
Specifically the `#` character may no longer be quoted content if followed by `{`.
|
|
Also, inside heredoc string tokenizing `"` (or `'`) requires lookahead to know
|
|
if it's already part of the end delimiter or not.
|
|
|
|
Since we need to use external scanner, we need to know the delimiter type.
|
|
One way to achieve this is using external scanner to scan the start delimiter
|
|
and then storing its type on the parser stack. This approach requires the parser
|
|
to allocate enough memory upfront and implement serialization/deserialization,
|
|
which ideally would be avoided. To avoid this, we use a different approach!
|
|
Instead of having a single `quoted_content` token, we have specific tokens for
|
|
each quoted content type, such as `_quoted_content_i_single`, `_quoted_content_i_double`.
|
|
Once the start delimiter is tokenized, we know which quoted content should be
|
|
tokenized next, and from the token we can infer the end delimiter and whether
|
|
it supports interpolation. In other words, we extract the information from the
|
|
parsing state, rather than maintaining custom parser state.
|
|
|
|
### Ref 2. External scanner for newlines
|
|
|
|
Generally newlines may appear in the middle of expressions and we ignore them
|
|
as long as the expression is valid, that's why we list newline under extras.
|
|
|
|
When a newline follows a complete expression, most of the time it should be
|
|
treated as terminator. However, there are specific cases where the newline is
|
|
non-breaking and treated as if it was just a space. This cases are:
|
|
|
|
* call followed by newline and a `do end` block
|
|
* expression followed by newline and a binary operator
|
|
|
|
In both cases we want to tokenize the newline as non-breaking, so we use external
|
|
scanner for lookahead.
|
|
|
|
Note that the relevant rules already specify left/right associativity, so if we
|
|
simply added `optional("\n")` the conflicts would be resolved immediately rather
|
|
without using GLR.
|
|
|
|
Additionally, since comments may appear anywhere and don't change the context,
|
|
we also tokenize newlines before comments as non-breaking.
|
|
|
|
### Ref 3. External scanner for unary + and -
|
|
|
|
Plus and minus are either binary or unary operators, depending on the context.
|
|
Consider the following variants
|
|
|
|
```
|
|
a + b
|
|
a+b
|
|
a+ b
|
|
a +b
|
|
```
|
|
|
|
In the first three expressions `+` is a binary operator, while in the last one
|
|
`+` is an unary operator referring to local call argument.
|
|
|
|
To correctly tokenize all cases we use external scanner to tokenize a special empty
|
|
token (`_before_unary_operator`) when the spacing matches `a +b`, which forces the
|
|
parser to pick the unary operator path.
|
|
|
|
### Ref 4. External scanner for `not in`
|
|
|
|
The `not in` operator may have an arbitrary inline whitespace between `not` and `in`.
|
|
|
|
We cannot use a regular expression like `/not[ \t]+in/`, because it would also match
|
|
in expressions like `a not inn` as the longest matching token.
|
|
|
|
A possible solution could be `seq("not", "in")` with dynamic conflict resolution, but
|
|
then we tokenize two separate tokens. Also to properly handle `a not inn`, we would need
|
|
keyword extraction, which causes problems in our case (https://github.com/tree-sitter/tree-sitter/issues/1404).
|
|
|
|
In the end it's easiest to use external scanner, so that we can skip inline whitespace
|
|
and ensure token ends after `in`.
|
|
|
|
### Ref 5. External scanner for quoted atom start
|
|
|
|
For parsing quoted atom `:` we could make the `"` (or `'`) token immediate, however this
|
|
would require adding immediate rules for single/double quoted content and listing them
|
|
in relevant places. We could definitely do that, but using external scanner is actually
|
|
simpler.
|
|
|
|
### Ref 6. Identifier pattern
|
|
|
|
See [Elixir / Unicode Syntax](https://hexdocs.pm/elixir/unicode-syntax.html) for official
|
|
notes.
|
|
|
|
Tree-sitter already supports unicode properties in regular expressions, however character
|
|
class subtraction is not supported.
|
|
|
|
For the base `<Start>` and `<Continue>` we can use `[\p{ID_Start}]` and `[\p{ID_Continue}]`
|
|
respectively, since both are supported and according to the
|
|
[Unicode Annex #31](https://unicode.org/reports/tr31/#Table_Lexical_Classes_for_Identifiers)
|
|
they match the ranges listed in the Elixir docs.
|
|
|
|
For atoms this translates to a clean regular expression.
|
|
|
|
For variables however, we want to exclude uppercase (`\p{Lu}`) and titlecase (`\p{Lt}`)
|
|
categories from `\p{ID_Start}`. As already mentioned, we cannot use group subtraction
|
|
in the regular expression, so instead we need to create a suitable group of characters
|
|
on our own.
|
|
|
|
After removing the uppercase/titlecase categories from `[\p{ID_Start}]`, we obtain the
|
|
following group:
|
|
|
|
`[\p{Ll}\p{Lm}\p{Lo}\p{Nl}\p{Other_ID_Start}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]`
|
|
|
|
At the time of writing the subtracted groups actually only remove a single character:
|
|
|
|
```elixir
|
|
Mix.install([{:unicode_set, "~> 1.1"}])
|
|
|
|
Unicode.Set.to_utf8_char(
|
|
"[[[:Ll:][:Lm:][:Lo:][:Nl:][:Other_ID_Start:]] & [[:Pattern_Syntax:][:Pattern_White_Space:]]]"
|
|
)
|
|
#=> {:ok, [11823]}
|
|
```
|
|
|
|
Consequently, by removing the subtraction we allow just one additional (not common) character,
|
|
which is perfectly acceptable.
|
|
|
|
It's important to note that JavaScript regular expressions don't support the `\p{Other_ID_Start}`
|
|
unicode category. Fortunately this category is a small set of characters introduces for
|
|
[backward compatibility](https://unicode.org/reports/tr31/#Backward_Compatibility), so we can
|
|
enumerate it manually:
|
|
|
|
```elixir
|
|
Mix.install([{:unicode_set, "~> 1.1"}])
|
|
|
|
Unicode.Set.to_utf8_char("[[[:Other_ID_Start:]] - [[:Pattern_Syntax:][:Pattern_White_Space:]]]")
|
|
|> elem(1)
|
|
|> Enum.flat_map(fn
|
|
n when is_number(n) -> [n]
|
|
range -> range
|
|
end)
|
|
|> Enum.map(&Integer.to_string(&1, 16))
|
|
#=> ["1885", "1886", "2118", "212E", "309B", "309C"]
|
|
```
|
|
|
|
Finally, we obtain this regular expression group for variable `<Start>`:
|
|
|
|
`[\p{Ll}\p{Lm}\p{Lo}\p{Nl}\u1885\u1886\u2118\u212E\u309B\u309C]`
|
|
|
|
### Ref 7. Keyword token
|
|
|
|
We tokenize the whole keyword sequence like `do: ` as a single token.
|
|
Ideally we wouldn't include the whitespace, but since we use `token`
|
|
it gets include. However, this is an intentionally accepted tradeoff,
|
|
because using `token` significantly simplifies the grammar and avoids
|
|
conflicts.
|
|
|
|
The alternative approach would be to define keyword as `seq(alias(choice(...), $._keyword_literal), $._keyword_end)`,
|
|
where we list all other tokens that make for for valid keyword literal
|
|
and use custom scanner for `_keyword_end` to look ahead without tokenizing
|
|
the whitespace. However, this approach generates a number of conflicts
|
|
because `:` is tokenized separately and phrases like `fun fun • do` or
|
|
`fun • {}` are ambiguous (interpretation depends on whether `:` comes next).
|
|
Resolving some of these conflicts (for instance special keywords like `{}` or `%{}`)
|
|
requires the use of external scanner. Given the complexities this approach
|
|
brings to the grammar, and consequently the parser, we stick to the simpler
|
|
approach.
|