Use PEGTL to remove my clunky homemade parser
by Baduit
Article::Article
On my first article I talked about how I cleaned up my dependencies, now it is time to clean up my clunky homemade parser for my esoteric programming language PainPerdu.
Basic architecture of the project
We can split the project in three parts:
- The definitions: it contains the data structures for all the PainPerdu instructions
- The parser: it takes a string and output the definitions
- The vm: it takes the definitions and execute them
My goal
The goal is simple: I want to totally rewrite the parser without having to modify anything else in the project.
Concretely I will only touch the .cpp file, the .hpp with the declaration of the Parser class must stay the same:
1
2
3
4
5
class Parser
{
public:
Definitions operator()(std::string_view input);
};
Why did I initially make a parser from scratch
I want to have fun on my free time, and it was the funnier option at the time for me.
To develop a bit: most libraries I have looked had syntaxes I didn’t like at all (like Boost Spirit for example). Most of the time they seemed totally overkill for my little need and way too complicated and learning to used them did not seem fun at all. (Did I tell you I like having fun?)
I enjoy reinventing the wheel and creating my own library seemed really fun! And honestly it was, event if the solution was not optimal.
Why refactor the parser
- There was a lot of boilerplate code
- I found a library I wanted to test: Pegtl
- Refactoring is satisfying
Presentation of Pegtl
To quote Pegtl description from the project readme, Pegtl (Parsing Expression Grammar Template Library) is a zero-dependency C++ header-only parser combinator library for creating parsers according to a Parsing Expression Grammar (PEG).
Installation with vcpkg
That’s the easiest part: I added this line into my vcpkg.json
Then I have to find the package in my cmake file:
FInally I just needed to link against the library:
Defining the grammar
If you are not familiar with the parsing expression grammar I advise you to look it up, reading the Wikipedia page should be enough to understand the rest of the article. At least for me the Wikipedia article was enough to use PEGTL and write the whole grammar.
For all the snippet below we will assume that there is the following lines above:
1
2
#include <tao/pegtl.hpp>
namespace pegtl = tao::pegtl; // namespace alias because i'm lazy
To define the grammar, we need to think about how the syntax works. Some instructions are just a single character like ] to write a character but most are a single character followed by an integer or an identifier like like + to increment the current case, or the annotation : to define a label. The last kind of instruction is something between 2 characters for example to read a file, you must write the file name between 2 double quotes “. The comments are based on the same system but between a { and a }.
Now we know enough to build the 1st bricks of the language. Let’s begin by defining what is an integer: simple, it is a one or more digits. We can define it like that:
1
2
3
4
5
6
struct Integer :
pegtl::plus // Any number of
<
pegtl::digit // digits (thanks captain obivous)
>
{};
Each brick (or rule but I like the term brick) of our grammar will be a structure inheriting from other bricks, already defined in the library or in our code.
The identifier is a little more complex. In the specification of the language it is defined as an arbitrarily long sequence of digits, underscores, lowercase and uppercase Latin letters. A valid identifier must begin with a non-digit character.
We need to define 3 different bricks:
- The first character which can be an _ or a letter of the alphabet. We will call it AlphaAndUnderscore
- The other characters which can be an _, a letter of the alphabet or a digit we will call it AlphaNumAndUnderscore
- The identifier itset witch is a sequence of an AlphaAndUnderscore which can be followed by any number of AlphaNumAndUnderscore
Translated into code it looks like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
struct AlphaAndUnderscore :
pegtl::sor // Either
<
pegtl::alpha, // a letter of the latin alphabet
pegtl::one<'_'> // or an _
>
{};
struct AlphaNumAndUnderscore:
pegtl::sor // Either
<
pegtl::alnum, // a letter of the latin alphabet or a digit
pegtl::one<'_'> // or an _
>
{};
struct Identifier :
pegtl::seq// A sequence
<
AlphaAndUnderscore, // starting with an _ or a letter
pegtl::opt // which can be followed
<
pegtl::plus // by any number of
<
AlphaNumAndUnderscore // _, letters or digits
>
>
>
{};
We have all the tools to build the bricks for all the instructions!
For simple instructions like ] to write a character it is as simple as that:
1
struct PutChar : pegtl::one<']'> {};
For the instructions like + to increment the current case it is almost as simple:
1
2
3
4
5
6
7
8
9
10
11
struct Increment :
pegtl::seq // A sequence
<
pegtl::one<'+'>, // of one +
pegtl::sor // Followed by either
<
Integer, // an Integer
Identifier // or an Indentifier
>
>
{};
The instruction to read a file could looks difficult, but thanks of the bricks built in the library it is really easy:
1
2
3
4
5
6
7
8
9
10
11
struct ReadFile :
pegtl::if_must // If we match with
<
pegtl::one<'"'>, // a quote "
pegtl::until // Until there is
<
pegtl::one<'"'>, // an other quote "
pegtl::any // take any character
>
>
{};
All the other instructions and annotations will be defined the same way:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// List of all instructions and annotations
struct DefineLabel : pegtl::seq<pegtl::one<':'>, Identifier> {};
struct MoveRight : pegtl::seq<pegtl::one<'>'>, pegtl::sor<Integer, Identifier>> {};
struct MoveLeft : pegtl::seq<pegtl::one<'<'>, pegtl::sor<Integer, Identifier>> {};
struct Increment : pegtl::seq<pegtl::one<'+'>, pegtl::sor<Integer, Identifier>> {};
struct Decrement : pegtl::seq<pegtl::one<'-'>, pegtl::sor<Integer, Identifier>> {};
struct ResetCase : pegtl::one<';'> {};
struct DefineReference : pegtl::seq<pegtl::one<'#'>, Identifier> {};
struct UndefineReference : pegtl::seq<pegtl::one<'.'>, Identifier> {};
struct MoveToReference : pegtl::seq<pegtl::one<'@'>, Identifier> {};
struct GoToLabel : pegtl::seq<pegtl::one<'*'>, Identifier> {};
struct Rewind : pegtl::seq<pegtl::one<'&'>, Identifier> {};
struct IfCurrentValueDifferent : pegtl::seq<pegtl::one<'?'>, pegtl::opt<pegtl::sor<Integer, Identifier>>> {};
struct IfCursorIsAtReference : pegtl::seq<pegtl::one<'!'>, Identifier> {};
struct IfReferenceExists : pegtl::seq<pegtl::one<'$'>, Identifier> {};
struct GetChar : pegtl::one<'['> {};
struct PutChar : pegtl::one<']'> {};
struct ReadFile : pegtl::if_must<pegtl::one<'"'>, pegtl::until<pegtl::one<'"'>, pegtl::any>> {};
struct Comment : pegtl::if_must<pegtl::one<'{'>, pegtl::until<pegtl::one<'}'>, pegtl::any>> {};
struct Skip : pegtl::plus<pegtl::sor<pegtl::blank, pegtl::eol>> {};
Now that we have caught them all and completed our Pokedex, we can define what is an Expression: any instructions, comments and characters like spaces or tabulations between instructions
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
struct Expression : // An expression is
pegtl::sor // Either
<
DefineLabel, // this instruction
MoveRight, // or that instruction
MoveLeft, // or that one
Increment, // I think you got it
Decrement,
ResetCase,
DefineReference,
UndefineReference,
MoveToReference,
GoToLabel,
Rewind,
IfCurrentValueDifferent,
IfCursorIsAtReference,
IfReferenceExists,
GetChar,
PutChar,
ReadFile,
Comment, // Also comments
Skip // And some space we can skip
>
{};
Finally the grammar is just a list of expressions:
1
struct Grammar : pegtl::plus<Expression> {};
Make a parse tree
To create a parse tree we need 3 things:
- an input
- a grammar
- a selector
Creating the input is trivial, you can do it just like this:
1
2
std::string_view input = "...";
pegtl::memory_input mem_input(input.data(), input.size(), "");
We already created defined the grammar above, the last thing we have to do is create the selector. But, What is a selector? It is just a list of the part of the grammar we want in our parse tree. For example in our grammar we want 4 types of thing:
- the annotations
- the instructions
- the integers associated with the instructions. It is good to know that we want to move the cursor to the right, but also knowing how much we want to move the cursor is better
- the identifiers, it is the same idea as the integers, we need to know the identifiers associated with the instructions or annotations.
Now we just need to define our selector by creation a type alias using the helper class template included in the library: pegtl::parse_tree::selector
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
template<typename Rule>
using Selector = pegtl::parse_tree::selector
<
Rule,
pegtl::parse_tree::store_content::on
<
// Identifier and integer
Identifier,
Integer,
// Annotation
DefineLabel,
// Instructions
MoveRight,
MoveLeft,
Increment,
Decrement,
ResetCase,
DefineReference,
UndefineReference,
MoveToReference,
GoToLabel,
Rewind,
IfCurrentValueDifferent,
IfCursorIsAtReference,
IfReferenceExists,
GetChar,
PutChar,
ReadFile
>
>;
We can now assemble the 3 parts of the puzzle to create the tree:
1
const auto root = pegtl::parse_tree::parse<Grammar, Selector>(mem_input);
For a simple program looking like this:
We would have the following parse tree:
Convert the parse tree to something the interpreter understand
That’s cool, we have a parse tree, but the interpreter does not understand this parse tree and we need to return a Definitions object which is composed of a vector of annotations and a vector of instructions. We must process our parse tree, one simple way to do it is to create a visitor.
The visitor iterates over the nodes, if it is empty it does nothing, if it is an annotation, extract the information from the tree, create an Annotation, push it into the annotation vector and do the same thing with instructions. We can know the type of node with the is_type()
method of the node. If it is not an instruction or an annotation, do the same routine on child nodes.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
struct Visitor
{
Visitor(Definitions& defs):
_defs(defs)
{}
void operator()(const pegtl::parse_tree::node* node)
{
if (!node)
return;
if (node->is_type<Increment>())
{
// Extract information here
}
// A big tree of if and else...
else
{
for (const auto& child: node->children)
(*this)(child.get());
}
}
Definitions& _defs;
};
Now that we have the global structure of the visitor we need to extract the information.
If it is an instruction like ] followed by nothing, it is simple, there is no information to extract.
If it is an instruction like + which can be followed by an identifier or an integer we must check the type of the child node which we can do easily with the ìs_type()
method. Then if it is an identifier we can directly use the string representation of this node, if it is an integer we must convert to an integer first.
1
2
3
4
5
6
7
8
9
10
11
12
const auto& child = *(node.children[0]);
if (child.is_type<Integer>())
{
int n;
std::from_chars(child.string_view().data(), child.string_view().data() + child.string_view().size(), n);
// do what we want with n
}
else
{
std::string str = child.string();
// do what we want with str
}
The last step is to create the instruction. Each instruction followed by an identifier, or an integer has class for each. For example for +:
- instructions::Increment with an integer
- instructions::IncrementRef with an identifier
We can make a generic function that takes both classes as template argument and create the good one.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
template <typename InstructionN, typename InstructionId>
Instruction createInstruction(const pegtl::parse_tree::node& node)
{
const auto& child = *(node.children[0]);
if (child.is_type<Integer>())
{
decltype(std::declval<InstructionN>().value) n; // to be sure we match the exact type, handle all integers types of any size
std::from_chars(child.string_view().data(), child.string_view().data() + child.string_view().size(), n);
return InstructionN{ n };
}
else
{
return InstructionId{ child.string() };
}
}
And this function can be called like that:
1
2
3
4
if (node->is_type<Increment>())
{
_defs.recipe.push_back(createInstruction<instructions::Increment, instructions::IncrementRef>(*node));
}
You can find the complete code here: https://github.com/Baduit/PainPerdu/blob/master/lib/PainPerdu/parser/Parser.cpp
Handle syntax errors
What does happen when there is a syntax error in the PainPerdu code? For now, it just stops the parsing, but we can do better !
But, what is a syntax error? It is anything that is not an Expression.
What do we want to do with an error? Stop the parsing and print the whole content of the error, with the line and the column.
That’s why we will define the syntax error like this:
1
2
3
4
5
6
struct SyntaxError :
pegtl::if_must // If we match
<
pegtl::any, // anything,
pegtl::until<Expression> // match anything until we get an Expression
> {};
And then update our grammar:
1
2
3
4
5
6
7
8
9
struct Grammar : // A grammar is
pegtl::plus // a list of
<
pegtl::sor // either
<
Expression, // an expression
SyntaxError // or a syntax error
>
> {};
And it works because pegtl::sor
will only try to match with a syntax error (which can match with anything) only if it is can’t match we an expression, which should not happen.
The next step is to do something when the parser encounters an error. With PEGTL in addition to create a tree, you can also make some callback on some specific brick.
For that we must define a class to define what we do by default on any rule:
1
2
template<typename Rule>
struct ParserAction {};
Then specialize this templated class for the SyntaxError and in the apply function, we can throw an exception we all the information we need in the apply static method:
1
2
3
4
5
6
7
8
9
10
template<>
struct ParserAction<SyntaxError>
{
template<typename ParseInput>
static void apply(const ParseInput& in)
{
// Note: I didn't used std::format because the version of clang used in emscripten does not support it yet and I did not want to add some preprocessor in my code
throw std::runtime_error("Error at line = " + std::to_string(in.position().line) + ", column = " + std::to_string(in.position().column) + ".\nUnknown symbols : " + in.string());
}
};
The last step is to use our callback when creating the tree, we just need to add a template argument while calling pegtl::parse_tree::parse:
1
const auto root = pegtl::parse_tree::parse<Grammar, Selector, ParserAction>(mem_input);
And Voilà!
Article::~Article
PainPerdu has now a shiny new parser, it works like a charm. The code is smaller and more readable (also probably faster, but I did not do any benchmark).
Learning to use Pegtl did not take long, and I probably took more time to write this article than writing my new parser.