Baduit

A young French developper who really likes (modern) C++

About me
08 January 2023

Use PEGTL to remove my clunky homemade parser

by Baduit

Article::Article

On my first article I talked about how I cleaned up my dependencies, now it is time to clean up my clunky homemade parser for my esoteric programming language PainPerdu.

Basic architecture of the project

We can split the project in three parts:

My goal

The goal is simple: I want to totally rewrite the parser without having to modify anything else in the project.

Concretely I will only touch the .cpp file, the .hpp with the declaration of the Parser class must stay the same:

1
2
3
4
5
class Parser
{
    public:
        Definitions operator()(std::string_view input);
};

Why did I initially make a parser from scratch

I want to have fun on my free time, and it was the funnier option at the time for me.

To develop a bit: most libraries I have looked had syntaxes I didn’t like at all (like Boost Spirit for example). Most of the time they seemed totally overkill for my little need and way too complicated and learning to used them did not seem fun at all. (Did I tell you I like having fun?)

I enjoy reinventing the wheel and creating my own library seemed really fun! And honestly it was, event if the solution was not optimal.

Why refactor the parser

Presentation of Pegtl

To quote Pegtl description from the project readme, Pegtl (Parsing Expression Grammar Template Library) is a zero-dependency C++ header-only parser combinator library for creating parsers according to a Parsing Expression Grammar (PEG).

Installation with vcpkg

That’s the easiest part: I added this line into my vcpkg.json

"pegtl"

Then I have to find the package in my cmake file:

find_package(pegtl CONFIG REQUIRED)

FInally I just needed to link against the library:

target_link_libraries(PainPerdu PRIVATE taocpp::pegtl)

Defining the grammar

If you are not familiar with the parsing expression grammar I advise you to look it up, reading the Wikipedia page should be enough to understand the rest of the article. At least for me the Wikipedia article was enough to use PEGTL and write the whole grammar.

For all the snippet below we will assume that there is the following lines above:

1
2
#include <tao/pegtl.hpp>
namespace pegtl = tao::pegtl; // namespace alias because i'm lazy

To define the grammar, we need to think about how the syntax works. Some instructions are just a single character like ] to write a character but most are a single character followed by an integer or an identifier like like + to increment the current case, or the annotation : to define a label. The last kind of instruction is something between 2 characters for example to read a file, you must write the file name between 2 double quotes . The comments are based on the same system but between a { and a }.

Now we know enough to build the 1st bricks of the language. Let’s begin by defining what is an integer: simple, it is a one or more digits. We can define it like that:

1
2
3
4
5
6
struct Integer :
	pegtl::plus // Any number of
		<
			pegtl::digit // digits (thanks captain obivous)
		>
{};

Each brick (or rule but I like the term brick) of our grammar will be a structure inheriting from other bricks, already defined in the library or in our code.

The identifier is a little more complex. In the specification of the language it is defined as an arbitrarily long sequence of digits, underscores, lowercase and uppercase Latin letters. A valid identifier must begin with a non-digit character.

We need to define 3 different bricks:

Translated into code it looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
struct AlphaAndUnderscore :
    pegtl::sor // Either
        <
            pegtl::alpha, // a letter of the latin alphabet
            pegtl::one<'_'> // or an _
        >
{};

struct AlphaNumAndUnderscore:
    pegtl::sor // Either
        <
            pegtl::alnum, // a letter of the latin alphabet or a digit
            pegtl::one<'_'> // or an _
        >
{};

struct Identifier :
    pegtl::seq// A sequence
        < 
            AlphaAndUnderscore, // starting with an _ or a letter
            pegtl::opt // which can be followed
                < 
                    pegtl::plus // by any number of
                        < 
                            AlphaNumAndUnderscore // _, letters or digits
                        >
                >
        >
{};

We have all the tools to build the bricks for all the instructions!
For simple instructions like ] to write a character it is as simple as that:

1
struct PutChar : pegtl::one<']'> {};

For the instructions like + to increment the current case it is almost as simple:

1
2
3
4
5
6
7
8
9
10
11
struct Increment :
    pegtl::seq // A sequence
        <
            pegtl::one<'+'>, // of one +
            pegtl::sor // Followed by either
                <
                    Integer, // an Integer
                    Identifier // or an Indentifier
                >
        >
{};

The instruction to read a file could looks difficult, but thanks of the bricks built in the library it is really easy:

1
2
3
4
5
6
7
8
9
10
11
struct ReadFile :
    pegtl::if_must // If we match with 
        <
            pegtl::one<'"'>, // a quote "
            pegtl::until // Until there is
                <
                    pegtl::one<'"'>, // an other quote "
                    pegtl::any // take any character
                >
         >
{};

All the other instructions and annotations will be defined the same way:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// List of all instructions and annotations
struct DefineLabel : pegtl::seq<pegtl::one<':'>, Identifier>  {};
struct MoveRight : pegtl::seq<pegtl::one<'>'>, pegtl::sor<Integer, Identifier>> {};
struct MoveLeft : pegtl::seq<pegtl::one<'<'>, pegtl::sor<Integer, Identifier>> {};
struct Increment : pegtl::seq<pegtl::one<'+'>, pegtl::sor<Integer, Identifier>> {};
struct Decrement : pegtl::seq<pegtl::one<'-'>, pegtl::sor<Integer, Identifier>> {};
struct ResetCase : pegtl::one<';'> {};
struct DefineReference : pegtl::seq<pegtl::one<'#'>, Identifier>  {};
struct UndefineReference : pegtl::seq<pegtl::one<'.'>, Identifier> {};
struct MoveToReference : pegtl::seq<pegtl::one<'@'>, Identifier>  {};
struct GoToLabel : pegtl::seq<pegtl::one<'*'>, Identifier>  {};
struct Rewind : pegtl::seq<pegtl::one<'&'>, Identifier> {};
struct IfCurrentValueDifferent : pegtl::seq<pegtl::one<'?'>, pegtl::opt<pegtl::sor<Integer, Identifier>>> {};
struct IfCursorIsAtReference : pegtl::seq<pegtl::one<'!'>, Identifier> {};
struct IfReferenceExists : pegtl::seq<pegtl::one<'$'>, Identifier> {};
struct GetChar : pegtl::one<'['> {};
struct PutChar : pegtl::one<']'> {};
struct ReadFile : pegtl::if_must<pegtl::one<'"'>, pegtl::until<pegtl::one<'"'>, pegtl::any>> {};
struct Comment : pegtl::if_must<pegtl::one<'{'>, pegtl::until<pegtl::one<'}'>, pegtl::any>> {};
struct Skip : pegtl::plus<pegtl::sor<pegtl::blank, pegtl::eol>> {};

Now that we have caught them all and completed our Pokedex, we can define what is an Expression: any instructions, comments and characters like spaces or tabulations between instructions

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
struct Expression : // An expression is
	pegtl::sor // Either
		<
			DefineLabel, // this instruction
			MoveRight, // or that instruction
			MoveLeft, // or that one
			Increment, // I think you got it
			Decrement,
			ResetCase,
			DefineReference,
			UndefineReference,
			MoveToReference,
			GoToLabel,
			Rewind,
			IfCurrentValueDifferent,
			IfCursorIsAtReference,
			IfReferenceExists,
			GetChar,
			PutChar,
			ReadFile,
			Comment, // Also comments
			Skip // And some space we can skip
		>
{};

Finally the grammar is just a list of expressions:

1
struct Grammar : pegtl::plus<Expression> {};

Make a parse tree

To create a parse tree we need 3 things:

Creating the input is trivial, you can do it just like this:

1
2
std::string_view input = "...";
pegtl::memory_input mem_input(input.data(), input.size(), "");

We already created defined the grammar above, the last thing we have to do is create the selector. But, What is a selector? It is just a list of the part of the grammar we want in our parse tree. For example in our grammar we want 4 types of thing:

Now we just need to define our selector by creation a type alias using the helper class template included in the library: pegtl::parse_tree::selector

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
template<typename Rule>
using Selector = pegtl::parse_tree::selector
    <
        Rule,
        pegtl::parse_tree::store_content::on
            <
                // Identifier and integer
                Identifier,
                Integer,
                // Annotation
                DefineLabel,
                // Instructions
                MoveRight,
                MoveLeft,
                Increment,
                Decrement,
                ResetCase,
                DefineReference,
                UndefineReference,
                MoveToReference,
                GoToLabel,
                Rewind,
                IfCurrentValueDifferent,
                IfCursorIsAtReference,
                IfReferenceExists,
                GetChar,
                PutChar,
                ReadFile
	    >
    >;

We can now assemble the 3 parts of the puzzle to create the tree:

1
const auto root = pegtl::parse_tree::parse<Grammar, Selector>(mem_input);

For a simple program looking like this:

+toto ] -55

We would have the following parse tree:

Convert the parse tree to something the interpreter understand

That’s cool, we have a parse tree, but the interpreter does not understand this parse tree and we need to return a Definitions object which is composed of a vector of annotations and a vector of instructions. We must process our parse tree, one simple way to do it is to create a visitor.

The visitor iterates over the nodes, if it is empty it does nothing, if it is an annotation, extract the information from the tree, create an Annotation, push it into the annotation vector and do the same thing with instructions. We can know the type of node with the is_type() method of the node. If it is not an instruction or an annotation, do the same routine on child nodes.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
struct Visitor
{
	Visitor(Definitions& defs):
		_defs(defs)
	{}

	void operator()(const pegtl::parse_tree::node* node)
	{
		if (!node)
			return;			

		if (node->is_type<Increment>())
		{
			// Extract information here
		}
		// A big tree of if and else...
		else
		{
			for (const auto& child: node->children)
				(*this)(child.get());
		}
	}

	Definitions& _defs;
};

Now that we have the global structure of the visitor we need to extract the information.
If it is an instruction like ] followed by nothing, it is simple, there is no information to extract.

If it is an instruction like + which can be followed by an identifier or an integer we must check the type of the child node which we can do easily with the ìs_type() method. Then if it is an identifier we can directly use the string representation of this node, if it is an integer we must convert to an integer first.

1
2
3
4
5
6
7
8
9
10
11
12
const auto& child = *(node.children[0]); 
if (child.is_type<Integer>())
{
    int n;
    std::from_chars(child.string_view().data(), child.string_view().data() + child.string_view().size(), n);
    // do what we want with n
}
else
{
    std::string str = child.string();
    // do what we want with str
}

The last step is to create the instruction. Each instruction followed by an identifier, or an integer has class for each. For example for +:

We can make a generic function that takes both classes as template argument and create the good one.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
template <typename InstructionN, typename InstructionId>
Instruction createInstruction(const pegtl::parse_tree::node& node)
{
    const auto& child = *(node.children[0]);
    if (child.is_type<Integer>())
    {
        decltype(std::declval<InstructionN>().value) n; // to be sure we match the exact type, handle all integers types of any size
        std::from_chars(child.string_view().data(), child.string_view().data() + child.string_view().size(), n);

        return InstructionN{ n };
    }
    else
    {
        return InstructionId{ child.string() };
    }
}

And this function can be called like that:

1
2
3
4
if (node->is_type<Increment>())
{
	_defs.recipe.push_back(createInstruction<instructions::Increment, instructions::IncrementRef>(*node));
}

You can find the complete code here: https://github.com/Baduit/PainPerdu/blob/master/lib/PainPerdu/parser/Parser.cpp

Handle syntax errors

What does happen when there is a syntax error in the PainPerdu code? For now, it just stops the parsing, but we can do better !

But, what is a syntax error? It is anything that is not an Expression.
What do we want to do with an error? Stop the parsing and print the whole content of the error, with the line and the column.

That’s why we will define the syntax error like this:

1
2
3
4
5
6
struct SyntaxError :
    pegtl::if_must // If we match
        <
            pegtl::any, // anything, 
            pegtl::until<Expression> // match anything until we get an Expression
        > {};

And then update our grammar:

1
2
3
4
5
6
7
8
9
struct Grammar : // A grammar is
    pegtl::plus // a list of
        <
            pegtl::sor // either
                <
                    Expression, // an expression
                    SyntaxError // or a syntax error
                >
        > {};

And it works because pegtl::sor will only try to match with a syntax error (which can match with anything) only if it is can’t match we an expression, which should not happen.

The next step is to do something when the parser encounters an error. With PEGTL in addition to create a tree, you can also make some callback on some specific brick.

For that we must define a class to define what we do by default on any rule:

1
2
template<typename Rule>
struct ParserAction {};

Then specialize this templated class for the SyntaxError and in the apply function, we can throw an exception we all the information we need in the apply static method:

1
2
3
4
5
6
7
8
9
10
template<>
struct ParserAction<SyntaxError>
{
	template<typename ParseInput>
	static void apply(const ParseInput& in)
	{
                // Note: I didn't used std::format because the version of clang used in emscripten does not support it yet and I did not want to add some preprocessor in my code
		throw std::runtime_error("Error at line = " + std::to_string(in.position().line) + ", column = " + std::to_string(in.position().column) + ".\nUnknown symbols : " + in.string());
	}
};

The last step is to use our callback when creating the tree, we just need to add a template argument while calling pegtl::parse_tree::parse:

1
const auto root = pegtl::parse_tree::parse<Grammar, Selector, ParserAction>(mem_input);

And Voilà!

Article::~Article

PainPerdu has now a shiny new parser, it works like a charm. The code is smaller and more readable (also probably faster, but I did not do any benchmark).
Learning to use Pegtl did not take long, and I probably took more time to write this article than writing my new parser.

Sources

tags: cpp - parser - parsing