The Lexer lives in lexer.rs and processes characters from source code into tokens that are easier to work with in the parsing stage. This mainly occurs in the Lexing Pipeline described below, which outputs Tokens. These can be processed into more appropriate values depending on context.

Lexing Pipeline

The lexing consists of several stages that implement the rules described here. Each stage maintains a queue of incoming items so that it can peek ahead and then backtrack where necessary.

It’s probably not necessary to do it in multiple stages, but this is how it made sense to set it up at first. If we need to optimize we can do that later.

Additionally, we might not need a full queue since in theory we should only have to peek a single token in advance, but this grammar is still early in the drafting phase and I haven’t been continuously verifying that the grammar is LL(1).

Well, we do need a queue in at least one instance: when resolving Punctuation symbols, we need to find the longest possible match (since one operator can be a prefix of another, like : and ::), and we may need to backtrack more than once.
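The peek-ahead queue described above could look roughly like the following. This is only a sketch: the Lookahead name and its peek_nth and next methods are hypothetical and not taken from lexer.rs. It buffers items from any iterator so a stage can look several items ahead; "backtracking" then just means not consuming what was peeked.

```rust
use std::collections::VecDeque;

/// Hypothetical lookahead buffer; the name and methods are illustrative only.
struct Lookahead<I: Iterator> {
    source: I,
    queue: VecDeque<I::Item>,
}

impl<I: Iterator> Lookahead<I> {
    fn new(source: I) -> Self {
        Lookahead { source, queue: VecDeque::new() }
    }

    /// Look at the item `n` positions ahead without consuming anything.
    /// Items pulled from the source are parked in the queue.
    fn peek_nth(&mut self, n: usize) -> Option<&I::Item> {
        while self.queue.len() <= n {
            match self.source.next() {
                Some(item) => self.queue.push_back(item),
                None => break,
            }
        }
        self.queue.get(n)
    }

    /// Consume the next item, draining the queue before touching the source.
    fn next(&mut self) -> Option<I::Item> {
        match self.queue.pop_front() {
            Some(item) => Some(item),
            None => self.source.next(),
        }
    }
}
```

If the grammar does turn out to be LL(1), the standard library's Peekable adapter would probably be enough for the single-item case.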

Incoming code is accepted as a Chars object, and is referred to as the Source Stream.

1) UnitStream

The first stage operates on individual characters coming from the Source Stream and emits Units to the next stage. These are generally chars, with a few exceptions to make downstream processing more convenient:

  1. If the incoming character is whitespace, consume it and all of the following whitespace characters, and emit a single Whitespace unit.

  2. If there is a ", consume all text up to the matching " and emit a StringLiteral unit.

    This step will also perform Escaping in string literals if we add that. For now, we assume that the next " will end the string.

  3. If there is a /, look at the next character: a second / starts a line comment and a * starts a block comment. For a line comment, skip to the end of the line; for a block comment, skip to the first occurrence of */. In either case, emit a Whitespace unit.

    This step is where we will process Doc Comments.

  4. If the Source Stream ends, emit an EOF unit.

  5. Otherwise, emit a Char unit downstream.
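The five rules above can be sketched as a single function. This is an illustration under assumptions, not the code in lexer.rs: the Unit variant names follow the prose, the units function is hypothetical and collects eagerly into a Vec for brevity, and an unterminated string literal or block comment is assumed to simply run to the end of the input.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Unit {
    Whitespace,
    StringLiteral(String),
    Char(char),
    Eof,
}

/// Hypothetical helper: turns source text into Units following rules 1 to 5.
fn units(source: &str) -> Vec<Unit> {
    let mut chars = source.chars().peekable();
    let mut out = Vec::new();
    while let Some(c) = chars.next() {
        if c.is_whitespace() {
            // Rule 1: collapse a run of whitespace into one unit.
            while chars.next_if(|n| n.is_whitespace()).is_some() {}
            out.push(Unit::Whitespace);
        } else if c == '"' {
            // Rule 2: no escaping yet, so the next quote ends the literal.
            // Assumption: an unterminated literal runs to end of input.
            let text: String = chars.by_ref().take_while(|&n| n != '"').collect();
            out.push(Unit::StringLiteral(text));
        } else if c == '/' && chars.peek() == Some(&'/') {
            // Rule 3, line comment: skip up to (not past) the newline.
            while chars.next_if(|&n| n != '\n').is_some() {}
            out.push(Unit::Whitespace);
        } else if c == '/' && chars.peek() == Some(&'*') {
            // Rule 3, block comment: skip past the first "*/".
            chars.next(); // consume the opening '*'
            let mut prev = '\0';
            for n in chars.by_ref() {
                if prev == '*' && n == '/' {
                    break;
                }
                prev = n;
            }
            out.push(Unit::Whitespace);
        } else {
            // Rule 5: everything else passes through unchanged.
            out.push(Unit::Char(c));
        }
    }
    // Rule 4: the Source Stream has ended.
    out.push(Unit::Eof);
    out
}
```

Note that a lone / (one not followed by / or *) falls through to the last branch and is emitted as an ordinary Char unit.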

2) RawTokenStream

The second stage operates on units from the UnitStream and emits Tokens to the next stage. StringLiteral and EOF units each map to their own Token kind; Char units are chunked together based on the following rules:

  1. If a word starts with an alphabetic character, keep consuming characters of the following types and emit the result as a Word:
    1. Alphanumeric characters
    2. Dashes -
    3. Underscores _
  2. If a word starts with a numeric character, consume all of the following numeric characters and emit the result as a Number.
  3. If a word starts with a punctuation character, try to find the longest Punctuation symbol that matches. We do this by progressively consuming more and more characters until the next non-punctuation character, keeping track of the longest match we’ve found along the way, and then backtracking to the end of that match once we’re done. If no match is found, we raise an UnrecognizedPunctuation error.

These Tokens can therefore be delimited by whitespace, an EOF, or another character of a different type (this allows operators to be prefixed and postfixed without whitespace between them).
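As a rough sketch of the chunking rules and the longest-match search, the function below works over a slice of Units. Everything here is an assumption made for illustration: the OPERATORS table is a made-up subset, the Token, LexError, take_run and raw_tokens names are hypothetical, "punctuation" is taken to mean any Char unit that is neither alphabetic nor numeric, and Whitespace units are assumed to be dropped after acting as delimiters.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Unit { Whitespace, StringLiteral(String), Char(char), Eof }

#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Number(String),
    Punctuation(&'static str),
    StringLiteral(String),
    Eof,
}

#[derive(Debug, PartialEq)]
enum LexError { UnrecognizedPunctuation(String) }

// Hypothetical operator table; the real symbol set is not specified here.
const OPERATORS: &[&str] = &[":", "::", "=", "==", "+"];

fn char_at(units: &[Unit], i: usize) -> Option<char> {
    match units.get(i) {
        Some(Unit::Char(c)) => Some(*c),
        _ => None,
    }
}

/// Collect Char units from `start` while `keep` holds.
/// Returns the collected text and the index of the first unit not taken.
fn take_run(units: &[Unit], start: usize, keep: impl Fn(char) -> bool) -> (String, usize) {
    let mut text = String::new();
    let mut i = start;
    while let Some(c) = char_at(units, i).filter(|&c| keep(c)) {
        text.push(c);
        i += 1;
    }
    (text, i)
}

fn raw_tokens(units: &[Unit]) -> Result<Vec<Token>, LexError> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < units.len() {
        match &units[i] {
            // Assumption: whitespace only delimits and produces no Token.
            Unit::Whitespace => i += 1,
            Unit::StringLiteral(s) => {
                out.push(Token::StringLiteral(s.clone()));
                i += 1;
            }
            Unit::Eof => {
                out.push(Token::Eof);
                i += 1;
            }
            // Rule 1: words may continue with alphanumerics, '-' and '_'.
            Unit::Char(c) if c.is_alphabetic() => {
                let (text, next) =
                    take_run(units, i, |c| c.is_alphanumeric() || c == '-' || c == '_');
                out.push(Token::Word(text));
                i = next;
            }
            // Rule 2: numbers are runs of numeric characters.
            Unit::Char(c) if c.is_numeric() => {
                let (text, next) = take_run(units, i, |c| c.is_numeric());
                out.push(Token::Number(text));
                i = next;
            }
            // Rule 3: scan the whole punctuation run, remembering the
            // longest prefix that is a known operator.
            Unit::Char(_) => {
                let (run, _) = take_run(units, i, |c| !c.is_alphanumeric());
                let mut best: Option<&'static str> = None;
                for (idx, c) in run.char_indices() {
                    let candidate = &run[..idx + c.len_utf8()];
                    if let Some(op) = OPERATORS.iter().copied().find(|op| *op == candidate) {
                        best = Some(op);
                    }
                }
                match best {
                    // "Backtrack" by advancing only past the matched prefix.
                    Some(op) => {
                        out.push(Token::Punctuation(op));
                        i += op.chars().count();
                    }
                    None => return Err(LexError::UnrecognizedPunctuation(run)),
                }
            }
        }
    }
    Ok(out)
}
```

With this table, the characters foo-bar::=12 would chunk into a Word (foo-bar), the Punctuation ::, the Punctuation =, and a Number (12): the run ::= has no match of its own, so the search falls back to :: and leaves = to be lexed next.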

3) TokenStream

The third and final stage is the exposed API of the lexer. It’s simply a wrapper around the RawTokenStream that handles peeking/backtracking for the rest of the parser. It exists as a separate stage because the RawTokenStream's backtrack queue consists of Units, while this stage’s backtrack queue is Tokens. In my opinion, this makes it more readable and organized.

No additional processing is performed here; it only needs next and peek methods to expose to the parser.
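To make the shape of that API concrete, here is a minimal sketch of such a wrapper. It is an assumption-laden illustration: the Token enum is a two-variant stand-in, the stream is generic over any iterator of Tokens rather than the actual RawTokenStream, and the real signatures of next and peek in lexer.rs may differ (for example, they might return a Result).

```rust
use std::collections::VecDeque;

// Stand-in Token type for this sketch only.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Eof,
}

/// Hypothetical wrapper: a Token-level queue in front of the raw stage.
struct TokenStream<I: Iterator<Item = Token>> {
    raw: I,
    queue: VecDeque<Token>,
}

impl<I: Iterator<Item = Token>> TokenStream<I> {
    fn new(raw: I) -> Self {
        TokenStream { raw, queue: VecDeque::new() }
    }

    /// Look at the next Token without consuming it.
    fn peek(&mut self) -> Option<&Token> {
        if self.queue.is_empty() {
            if let Some(token) = self.raw.next() {
                self.queue.push_back(token);
            }
        }
        self.queue.front()
    }

    /// Consume the next Token, preferring anything already peeked.
    fn next(&mut self) -> Option<Token> {
        self.queue.pop_front().or_else(|| self.raw.next())
    }
}
```

A parser would typically call peek to decide which rule applies and then next to consume the Token once it has committed.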