v6/site/posts/2021-04-14-parsing.md
2022-12-10 13:15:32 -05:00

4.1 KiB

title = "Part 2: A Primitive Parser"
tags = ["build a programming language", "rust"]
date = "2021-04-14 17:00:42 -0400"
short_desc = "Building a small AST from the stream of tokens."
slug = "parsing"
preamble = '<p style="font-style: italic;">This post is part of a <a href="/build-a-programming-language/" data-link="/build-a-programming-language/">series</a> about learning Rust and building a small programming language.</p><hr>'

Now that the lexer is actually lexing, we can start parsing. This is where the Tree in Abstract Syntax Tree really comes in. What the parser is going to do is take a flat sequence of tokens and transform it into a shape that represents the actual structure of the code.

The actual nodes in the AST are represented as an enum with a different case for each node type. The individual node types could also be structs that share a common trait, but so far the enum works fine.

enum Node {
	Integer(i64),
	BinaryOp {
		left: Box<Node>,
		right: Box<Node>,
	},
}

The BinaryOp node doesn't have an operator type for now because the only supported one is addition, but it'll be added in the future. Also, the operands of the binary operator are both boxed (stored on the heap) because otherwise the Node type would be recursive and its size not knowable at compile-time.

There's a helper function that takes a slice of tokens and calls another function that does all of the actual work of parsing.

The parser is a simple recursive descent parser, so the function that does the actual parsing mutably borrows (for the same reasons as the lexer) an iterator that iterates over tokens. The actual iterator it receives is a peekable one, with the underlying iterator type specified with a generic.

I haven't the faintest idea why the lifetime is needed here, because I still haven't read that chapter of the book. But hey, adding it fixed the compiler error and it seems to work fine.

fn do_parse<'a, T: Iterator<Item = &'a Token>>(it: &mut std::iter::Peekable<T>) -> Option<Node> {
}

The do_parse function first ensures that there is a node (otherwise it returns None) and then constructs the appropriate node based on the type of the token. For now, that means just integer literals. If there's any other type of token, it panics because it can't be turned into a node (again, actual error reporting will come later).

fn do_parse<'a, T: Iterator<Item = &'a Token>>(it: &mut std::iter::Peekable<T>) -> Option<Node> {
	if it.peek().is_none() {
		return None;
	}

	let mut node: Node = match it.next().unwrap() {
		Token::Integer(n) => Node::Integer(n),
		tok => panic!("unexpected token: {:?}", tok),
	};
}

After that, it checks if there are any tokens remaining. If there aren't, the node is returned as-is.

If there is another token, and it's a plus sign, it creates a new binary operator node, making the previously-created node to the left operand. To get the right operand, it calls do_parse (putting the recursion into recursive descent) to create a node from the remaining tokens (if it doesn't find a right-hand side token, it simply panics). If the next token is of any other type, it's not able to combine it with the existing node into a new node, so it panics.

fn do_parse<'a, T: Iterator<Item = &'a Token>>(it: &mut std::iter::Peekable<T>) -> Option<Node> {
	// ...
	match it.next() {
		Some(Token::Plus) => {
			node = Node::BinaryOp {
				left: Box::new(node),
				right: Box::new(do_parse(it).expect("expression after binary operator")),
			}
		}
		Some(tok) => panic!("unexpected token: {:?}", tok),
		None => (),
	}

	Some(node)
}

And with that, it can parse a simple expression!

fn main() {
	let tokens = tokenize("12 + 34 + 56");
	let node = parse(tokens);
	println!("node: {:#?}", node);
}
$ cargo run
node: Some(
    BinaryOp {
        left: Integer(12),
        right: BinaryOp {
            left: Integer(34),
            right: Integer(56),
        },
    },
)

The eagle-eyed may notice that while we have parsed the expression, we have not parsed it correctly. What's missing is operator precedence and associativity, but that will have to wait for next time.