The Lexer lives in `lexer.rs` and processes characters from source code into tokens that are easier to work with in the parsing stage. This mainly occurs in the Lexing Pipeline described below, which outputs `Token`s. These can be processed into more appropriate values depending on context.
The lexing consists of several stages that implement the rules described here. Each stage maintains a queue of incoming items so that it can peek ahead and then backtrack where necessary.
It’s probably not necessary to do this in multiple stages, but this is how it made sense to set it up at first; if we need to optimize, we can do that later.
Additionally, we might not need a full queue, since in theory we should only have to peek a single token ahead, but the grammar is still early in the drafting phase and I haven’t been continuously verifying that it’s LL(1).
Well, we do need a queue in at least one instance: when resolving `Punctuation` symbols, we need to find the longest possible match (since one operator can be a prefix of another, like `:` vs `::`), and we may need to backtrack more than once.
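For illustration, here's roughly how that longest-match resolution could work. This is only a sketch: the operator table and the function names are hypothetical, not the actual ones in `lexer.rs`.

```rust
use std::collections::VecDeque;

// Hypothetical operator table; the real set comes from the grammar.
const OPERATORS: &[&str] = &[":", "::", "=", "=="];

fn is_prefix_of_operator(candidate: &str) -> bool {
    OPERATORS.iter().any(|op| op.starts_with(candidate))
}

/// Consume characters from the backtrack queue and return the longest
/// operator that matches, pushing any overshoot back onto the front.
fn resolve_punctuation(pending: &mut VecDeque<char>) -> Option<String> {
    let mut candidate = String::new();
    let mut consumed = Vec::new();

    // Greedily extend while the text could still grow into a known operator.
    while let Some(&c) = pending.front() {
        let mut longer = candidate.clone();
        longer.push(c);
        if !is_prefix_of_operator(&longer) {
            break;
        }
        pending.pop_front();
        consumed.push(c);
        candidate = longer;
    }

    // Backtrack: shrink until we land on an exact match, returning the
    // extra characters to the queue (possibly more than once).
    while !candidate.is_empty() && !OPERATORS.contains(&candidate.as_str()) {
        pending.push_front(consumed.pop().unwrap());
        candidate.pop();
    }

    if candidate.is_empty() {
        None // the caller reports this as unrecognized punctuation
    } else {
        Some(candidate)
    }
}
```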
Incoming code is accepted as a `Chars` object, and is referred to as the Source Stream.
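Concretely, the entry point might look something like this; the `SourceStream` name and shape are assumptions for illustration, and the real code may just pass the `Chars` iterator around directly.

```rust
use std::str::Chars;

/// Hypothetical wrapper naming the Source Stream.
pub struct SourceStream<'a> {
    chars: Chars<'a>,
}

impl<'a> SourceStream<'a> {
    pub fn new(source: &'a str) -> Self {
        SourceStream { chars: source.chars() }
    }

    /// Pull the next raw character, or `None` at end of input.
    pub fn next(&mut self) -> Option<char> {
        self.chars.next()
    }
}
```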
UnitStream
The first stage operates on individual characters coming from the Source Stream and emits `Unit`s to the next stage. These are generally `char`s, with a few exceptions to make downstream processing more convenient:
- If the incoming character is whitespace, consume it and all of the following whitespace characters, and emit a `Whitespace` unit.
- If there is a `"`, consume all text up to the matching `"` and emit a `StringLiteral` unit. This step will also perform Escaping in string literals if we add that; for now, we assume that the next `"` will end the string.
- If there is a `/`, look for a subsequent `/` or `*` to process line and block comments respectively. If it’s a line comment, it skips to the end of the line; if it’s a block comment, it skips to the first occurrence of `*/`. It emits a `Whitespace` unit here as well. This step is where we will process Doc Comments.
- If the Source Stream ends, emit an `EOF` unit.
- Otherwise, emit a `Char` unit downstream.
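Taken together, the first stage might look roughly like the following sketch. The `Unit` variants and their payloads are assumptions, and doc comments and escaping are left out, matching the rules above.

```rust
use std::iter::Peekable;

/// Hypothetical shape of the units emitted by the first stage.
enum Unit {
    Whitespace,
    StringLiteral(String),
    Char(char),
    Eof,
}

fn next_unit<I: Iterator<Item = char>>(source: &mut Peekable<I>) -> Unit {
    match source.next() {
        None => Unit::Eof,
        Some(c) if c.is_whitespace() => {
            // Consume the whole whitespace run and collapse it into one unit.
            while source.peek().is_some_and(|c| c.is_whitespace()) {
                source.next();
            }
            Unit::Whitespace
        }
        Some('"') => {
            // No escaping yet: the next `"` always ends the string.
            let text: String = source.by_ref().take_while(|&c| c != '"').collect();
            Unit::StringLiteral(text)
        }
        Some('/') if source.peek() == Some(&'/') => {
            // Line comment: skip to the end of the line.
            while source.next().is_some_and(|c| c != '\n') {}
            Unit::Whitespace
        }
        Some('/') if source.peek() == Some(&'*') => {
            // Block comment: skip past the first `*/`.
            source.next();
            let mut prev = '\0';
            while let Some(c) = source.next() {
                if prev == '*' && c == '/' {
                    break;
                }
                prev = c;
            }
            Unit::Whitespace
        }
        Some(c) => Unit::Char(c),
    }
}
```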
RawTokenStream
The second stage operates on units from the `UnitStream` and emits `Token`s to the next stage. `StringLiteral` and `EOF` units correspond to their own `Token` kinds; other characters are chunked together based on the following rules:
- `Word`: alphabetic characters, plus `-` and `_`
- `Number`: numeric characters
- Anything else is resolved as `Punctuation` using the longest-match rule described earlier; a sequence that matches no known operator produces an `UnrecognizedPunctuation` error.

These `Token`s can therefore be delimited by whitespace, an EOF, or another character of a different type (this allows operators to be prefixed and postfixed without whitespace between them).
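As a sketch, the chunking rule might classify each `Char` unit and cut a token whenever the class changes. The classifier below is an assumption based on the kinds above; whitespace never reaches it, since the first stage already folded it into `Whitespace` units.

```rust
use std::iter::Peekable;

/// Hypothetical character classes mirroring the rules above.
#[derive(PartialEq)]
enum CharClass {
    Word,        // letters, plus `-` and `_`
    Number,      // digits
    Punctuation, // everything else (whitespace never appears here)
}

fn classify(c: char) -> CharClass {
    if c.is_alphabetic() || c == '-' || c == '_' {
        CharClass::Word
    } else if c.is_ascii_digit() {
        CharClass::Number
    } else {
        CharClass::Punctuation
    }
}

/// Collect a maximal run of characters with the same class. A run ends at
/// whitespace, EOF, or a character of a different class, which stays in
/// the stream for the next token.
fn chunk_run(
    first: char,
    rest: &mut Peekable<impl Iterator<Item = char>>,
) -> (CharClass, String) {
    let class = classify(first);
    let mut text = String::from(first);
    while rest.peek().is_some_and(|&c| classify(c) == class) {
        text.push(rest.next().unwrap());
    }
    (class, text)
}
```

A run classified as `Punctuation` would then go through the longest-match resolution sketched earlier, producing the `UnrecognizedPunctuation` error if nothing matches.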
TokenStream
The third and final stage is the exposed API of the lexer. It’s simply a wrapper around the `RawTokenStream` that handles peeking/backtracking for the rest of the parser. It exists as a separate stage because the `RawTokenStream`'s backtrack queue consists of `Unit`s, while this stage’s backtrack queue holds `Token`s. In my opinion, this makes it more readable and organized.
No additional processing is performed here; it only needs `next` and `peek` methods to expose to the parser.
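Such a wrapper could look roughly like this sketch, written generically over any iterator so it stands alone; the real type would wrap `RawTokenStream` and hold `Token`s in its queue.

```rust
use std::collections::VecDeque;

/// Sketch of the exposed stage: a thin wrapper whose backtrack queue holds
/// the items themselves.
pub struct BufferedStream<I: Iterator> {
    inner: I,
    queue: VecDeque<I::Item>,
}

impl<I: Iterator> BufferedStream<I> {
    pub fn new(inner: I) -> Self {
        BufferedStream { inner, queue: VecDeque::new() }
    }

    /// Take the next item, draining anything queued by a previous peek first.
    pub fn next(&mut self) -> Option<I::Item> {
        self.queue.pop_front().or_else(|| self.inner.next())
    }

    /// Look at the next item without consuming it.
    pub fn peek(&mut self) -> Option<&I::Item> {
        if self.queue.is_empty() {
            let item = self.inner.next()?;
            self.queue.push_back(item);
        }
        self.queue.front()
    }
}
```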