Build Parser Using Rust 🛁


I previously released an INI parser implemented in Go on GitHub. This time, I want to implement a parser in Rust, using the tag feature of Bear, my favorite note-taking app, as the example. Bear's tags are both powerful and easy to use, which makes them a great subject for this project.

In the Bear app, tags are denoted by double hash symbols (##). For this parser, however, I have changed the rule: any content immediately following a single hash symbol (#) is treated as a tag, and the tag ends at the first space.
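For example, in the following note text, work/rust is parsed as a tag, and everything after the first space is ordinary content:

#work/rust learning notes about parsers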

Before building the parser, a quick overview of how it works. The processing steps are as follows:

Source Code -> Lexer -> Token -> Parser -> Abstract Syntax Tree
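To make the pipeline concrete, here is a hypothetical trace for a short note (the token names match the enum defined below):

"#tag hello" -> Lexer -> [SHARP('#'), IDENT("tag"), IDENT("hello"), EOF] -> Parser -> Document { name: "hello", tags: ["tag"] }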

So, let's begin building a parser with Rust.

Token

In Go, token types are generally defined as strings, like this:


type TokenType string

const (
    TokenTypeILLEGAL  TokenType = "ILLEGAL"
    TokenTypeEOF      TokenType = "EOF"
    TokenTypeIdent    TokenType = "Ident"
    TokenTypeCOMMENT  TokenType = "#"
    TokenTypeCOMMENT2 TokenType = ";"
    TokenTypeASSIGN   TokenType = "="
    TokenTypeLBRACKET TokenType = "["
    TokenTypeRBRACKET TokenType = "]"
    TokenTypeRETURN   TokenType = "RETURN"
)

Now let me share a powerful feature of the Rust programming language: enumerations. Rust's enums go beyond what you may be used to from other languages: each variant can carry its own associated data, and different variants can hold different types.

This opens up new possibilities for defining objects such as tokens. In Rust, we can define a Token like this:

#[derive(PartialEq, Debug, Clone)]
pub enum Token {
    ILLEGAL,
    EOF,
    IDENT(Vec<char>), // an identifier or plain text, stored as its characters

    SHARP(char), // #
    COLON(char), // :
    MINUS(char), // -
}
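To see why associated data is handy, here is a quick illustrative snippet (not from the original project): pattern matching destructures the variant and recovers the character buffer in one step.

let tok = Token::IDENT(vec!['t', 'a', 'g']);

// Destructure the variant and collect its characters into a String.
if let Token::IDENT(chars) = &tok {
    let text: String = chars.iter().collect();
    println!("identifier: {}", text); // prints "identifier: tag"
}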

Lexer

The next thing to implement is the lexical analyzer.

use super::token;

pub struct Lexer {
    input: Vec<char>,         // the source characters
    pub position: usize,      // index of the current character
    pub read_position: usize, // index of the next character to read
    pub ch: char,             // the current character ('\0' marks the end)
}

impl Lexer {
    pub fn new(input: Vec<char>) -> Self {
        let mut lexer = Self {
            input,
            position: 0,
            read_position: 0,
            ch: '\0',
        };
        lexer.read_char(); // load the first character
        lexer
    }
}

Tokens are read using a finite state machine: the lexer keeps reading the next character until it can recognize a token, and the process continues until a match is found or the end of the input is reached.

If you are interested, please refer to the Finite-state machine article on Wikipedia.


impl Lexer {
    // Advance to the next character; '\0' marks the end of input.
    pub fn read_char(&mut self) {
        if self.read_position >= self.input.len() {
            self.ch = '\0';
        } else {
            self.ch = self.input[self.read_position];
        }
        self.position = self.read_position;
        self.read_position += 1;
    }

    // Look at the character `offset` positions past the current one,
    // without consuming anything.
    fn peek_char(&self, offset: usize) -> char {
        let position = self.read_position + offset;

        if position >= self.input.len() {
            '\0'
        } else {
            self.input[position]
        }
    }

    pub fn next_token(&mut self) -> token::Token {
        let tok: token::Token;
        self.skip_whitespace();

        match self.ch {
            '#' => {
                tok = token::Token::SHARP(self.ch);
            }
            '\0' => {
                tok = token::Token::EOF;
            }
            _ => {
                let ident = self.read_statements();
                if !ident.is_empty() {
                    tok = token::Token::IDENT(ident);
                } else {
                    return token::Token::ILLEGAL;
                }
            }
        }

        self.read_char();
        tok
    }
}
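The skip_whitespace and read_statements helpers used by next_token are not shown in the article. A minimal sketch might look like the following, assuming a statement is any run of characters up to the next whitespace, '#', or end of input:

impl Lexer {
    // Skip over any whitespace before the next token.
    fn skip_whitespace(&mut self) {
        while self.ch.is_whitespace() {
            self.read_char();
        }
    }

    // Collect the current run of characters, stopping before the next
    // whitespace, '#', or end of input. The caller's trailing read_char()
    // then steps past the last character of the run.
    fn read_statements(&mut self) -> Vec<char> {
        let mut chars = vec![self.ch];
        loop {
            let next = self.peek_char(0);
            if next == '\0' || next == '#' || next.is_whitespace() {
                break;
            }
            self.read_char();
            chars.push(self.ch);
        }
        chars
    }
}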

Our focus is on identifying the tags within the content, so we do not need to fully classify or optimize everything else.

This example runs as a lexical analyzer, which is responsible for breaking the input down into Tokens.

In the previous implementation of our Go INI parser, we used a Parser to recognize the Tokens and turn them into an abstract syntax tree. However, for the purposes of this example, we will not be constructing an abstract syntax tree.

Parser

Now let's build the parser.

use super::lexer::Lexer;
use super::token::Token;

pub struct Parser {
    pub lexer: Lexer,
    current_token: Token,
    peek_token: Token,
    pub input: Vec<char>,
    tokens: Vec<Token>,   // simply store the tokens in a Vec
    position: usize,      // index of the current token in `tokens`
    read_position: usize, // index of the next token to read
}

impl Parser {
    pub fn new(input: Vec<char>) -> Parser {
        let l = Lexer::new(input.clone());
        Parser {
            lexer: l,
            current_token: Token::EOF,
            peek_token: Token::EOF,
            input,
            tokens: Vec::new(),
            position: 0,
            read_position: 0,
        }
    }
}
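The Document, Tag, and ParseError types are never shown in the article. Judging from the debug output at the end, minimal definitions could look like this (the field names are inferred from that output, so treat them as assumptions):

#[derive(Debug, Default)]
pub struct Document {
    pub id: String,
    pub name: String,
    pub create_time: u64,
    pub tags: Vec<Tag>,
}

#[derive(Debug, Default)]
pub struct Tag {
    pub id: String,
    pub parent: String,
    pub icon: String,
}

#[derive(Debug)]
pub struct ParseError;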
impl Parser {
    pub fn parse_document(&mut self) -> Result<Document, ParseError> {
        self.current_token = self.lexer.next_token();
        self.peek_token = self.lexer.next_token();

        self.tokens.push(self.current_token.clone());

        // Pull tokens from the lexer until EOF, collecting them into `tokens`.
        while self.current_token != Token::EOF {
            self.next_token();
            self.tokens.push(self.current_token.clone());
        }
        self.transform_to_document()
    }

    fn transform_to_document(&mut self) -> Result<Document, ParseError> {
        let mut document = Document {
            ..Default::default()
        };
        self.read_token();
        loop {
            let mut offset = 0;
            match &self.current_token {
                // A '#' immediately followed by an identifier is a tag.
                Token::SHARP(_) => {
                    let peek_token = self.peek_token_next1();
                    if let Token::IDENT(chs) = peek_token {
                        let mut tag = Tag::default();
                        tag.id = chs.into_iter().collect();
                        document.tags.push(tag);
                        offset += 1; // skip the identifier we just consumed
                    }
                }

                // Any other identifier is part of the note's name.
                Token::IDENT(ident) => {
                    let state_ident: String = ident.iter().collect();
                    if document.name.is_empty() {
                        document.name = state_ident;
                    } else {
                        document.name = format!("{} {}", document.name, state_ident);
                    }
                }
                Token::EOF => {
                    break;
                }
                _ => {
                    return Err(ParseError);
                }
            }

            self.read_position += offset;
            self.position = self.read_position;
            self.read_token();
        }

        Ok(document)
    }
}
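The code above relies on three helpers (next_token, read_token, and peek_token_next1) that the article does not show either. Here is one sketch consistent with how they are used above; it is an assumption, not necessarily the original implementation:

impl Parser {
    // Advance the lexer cursor: the peeked token becomes current.
    fn next_token(&mut self) {
        self.current_token = self.peek_token.clone();
        self.peek_token = self.lexer.next_token();
    }

    // Load the token at `read_position` into current_token and advance.
    fn read_token(&mut self) {
        if self.read_position >= self.tokens.len() {
            self.current_token = Token::EOF;
        } else {
            self.current_token = self.tokens[self.read_position].clone();
        }
        self.position = self.read_position;
        self.read_position += 1;
    }

    // Look at the token right after the current one without consuming it.
    fn peek_token_next1(&self) -> Token {
        if self.read_position >= self.tokens.len() {
            Token::EOF
        } else {
            self.tokens[self.read_position].clone()
        }
    }
}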

OK, let's test it:

#[test]
fn test_parse_note() {
    let case = "#tag/tag1 this is a note hahaha. #tag1/tag22";
    let mut parser = Parser::new(case.chars().collect());
    let document = parser.parse_document().unwrap();
    println!("{:?}", document);
}

Result:

Document { id: "", name: "this is a note hahaha.", create_time: 0, tags: [Tag { id: "tag/tag1", parent: "", icon: "" }, Tag { id: "tag1/tag22", parent: "", icon: "" }] }

We finished the task successfully. 🏄‍♀️
