06 March 2023

Create a VSCode Extension for my esoteric programming language

by Baduit

Article::Article

I cleaned up the project and compiled it to Wasm using emscripten. That’s nice, but something is missing: tools to develop in my language. Who would want to use a language without a highlighter and auto-completion? Even the brainfuck language has some VSCode extensions to highlight the code!

But this issue is already a thing of the past, because I created a VSCode extension providing auto-completion and semantic highlighting.

Why semantic highlighting instead of syntax highlighting?

VSCode offers two APIs to provide highlighting: syntactic and semantic. With the syntactic one, I need to provide a TextMate grammar, and that looks like the opposite of fun. With the semantic one, I need to provide a list of tokens with their meaning. It may seem more complicated, but I already have a complete parser, so it will be a piece of cake.

The information VSCode needs

For the highlighting

As I said, for the semantic highlighting VSCode needs a list of tokens. Each token must have the following information:

the line where it starts
the column where it starts
the length
the type

VSCode already provides some types by default:

variable: for PainPerdu references
function: for PainPerdu labels
comment: (do I really need to say it?)
string: for PainPerdu file inclusion (because it is between two double quotes)
number: for integers (there is no decimal in PainPerdu)
operator: for symbols

For the auto-completion

VSCode will only need the list of labels and references. That’s it.

The C++ implementation

Define the token

The token is defined with the following structure:

struct Token
{
    enum class Type
    {
        REFERENCE,
        LABEL,
        COMMENT,
        STRING,
        NUMBER,
        OPERATOR
    };

    bool operator==(const Token&) const = default;

    Type type;
    std::size_t line;
    std::size_t start_column;
    std::size_t length;
};

I won’t linger on this code; it is a very simple structure containing the information stated in the previous section.

Get the tokens

I just need the tokens, so I don’t need to build a whole parse tree. I will reuse the callback system that I already used to handle errors in a previous article, this time to fill a vector of tokens.

For the Move right operator > it would look like this:

template <>
struct ToTokenAction<operators::MoveRight>
{
    template <typename ParseInput>
    static void apply(const ParseInput& in, std::vector<Token>& tokens)
    {
        tokens.push_back(
            Token
            {
                .type = Token::Type::OPERATOR,
                .line = in.position().line,
                .start_column = in.position().column,
                .length = in.size()
            });
    }
};

But because I’m lazy and I don’t want to write almost the same code twenty times, I will use the following macro:

#define DefineAction(Match, TokenType)                                                \
    template <>                                                                       \
    struct ToTokenAction<Match>                                                       \
    {                                                                                 \
        template <typename ParseInput>                                                \
        static void apply(const ParseInput& in, std::vector<Token>& tokens)           \
        {                                                                             \
            tokens.push_back(                                                         \
                Token                                                                 \
                {                                                                     \
                    .type = TokenType,                                                \
                    .line = in.position().line,                                       \
                    .start_column = in.position().column,                             \
                    .length = in.size()                                               \
                });                                                                   \
        }                                                                             \
    };
// You can then use it like this:
DefineAction(operators::MoveRight, Token::Type::OPERATOR)

There is still one issue: I need to differentiate the labels from the references, because for now they are both just identifiers.

I know that a label must be preceded by some specific operators, and the same goes for references. Each time the parser sees one of these operators, it knows whether the next identifier will be a reference or a label.
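The idea is easier to see without the parsing library. The following is only a minimal sketch, not the extension’s code: classify, SimpleToken, Kind and Next are hypothetical names, positions are plain 0-based offsets, and the two operator sets are assumptions based on the definition patterns and trigger characters mentioned elsewhere in this article. Comments, strings and the remaining operators are ignored.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified token kinds for this sketch
enum class Kind { Operator, Label, Reference, Number };

struct SimpleToken
{
    Kind kind;
    std::size_t start;  // 0-based offset in the input
    std::size_t length;
};

// What the next identifier should be, decided by the last operator seen
enum class Next { None, Label, Reference };

inline Kind to_kind(Next next)
{
    // Same contract as get_token_type: None must never reach this point
    assert(next != Next::None);
    return next == Next::Label ? Kind::Label : Kind::Reference;
}

inline bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }
inline bool is_identifier_start(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0 || c == '_'; }
inline bool is_identifier_char(char c) { return is_identifier_start(c) || is_digit(c); }

inline std::vector<SimpleToken> classify(const std::string& input)
{
    // Assumed operator sets, not the full PainPerdu grammar
    const std::string label_operators = ":.*&";
    const std::string reference_operators = "#@$><+-?!";

    std::vector<SimpleToken> tokens;
    Next next = Next::None;
    std::size_t i = 0;
    while (i < input.size())
    {
        const char c = input[i];
        const bool is_label_op = label_operators.find(c) != std::string::npos;
        const bool is_reference_op = reference_operators.find(c) != std::string::npos;
        std::size_t end = i + 1;
        if (is_label_op || is_reference_op)
        {
            // The operator decides how the next identifier will be tagged
            tokens.push_back({Kind::Operator, i, 1});
            next = is_label_op ? Next::Label : Next::Reference;
        }
        else if (is_digit(c))
        {
            while (end < input.size() && is_digit(input[end]))
                ++end;
            tokens.push_back({Kind::Number, i, end - i});
            next = Next::None;
        }
        else if (is_identifier_start(c))
        {
            while (end < input.size() && is_identifier_char(input[end]))
                ++end;
            // An identifier without a preceding operator is silently skipped
            if (next != Next::None)
                tokens.push_back({to_kind(next), i, end - i});
            next = Next::None;
        }
        // Anything else (whitespace, unknown characters) is skipped
        i = end;
    }
    return tokens;
}
```

With the input `:loop #ref >ref +3 *loop`, the identifiers after `:` and `*` come out as labels, the ones after `#` and `>` as references, and `3` as a number.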

I can update the macro to reflect this and handle the labels and references correctly:

// Little helper struct to centralize some information
struct ToTokenState
{
    enum NextIdentifierType
    {
        NONE, // If there is an identifier with NONE, this is a bug
        LABEL,
        REFERENCE
    };

    std::vector<Token> tokens;
    NextIdentifierType next_identifier_type = NONE; // Never read an uninitialized state
};

// A little helper function
inline Token::Type get_token_type(ToTokenState::NextIdentifierType next_identifier_type)
{
    switch (next_identifier_type)
    {
        case ToTokenState::NextIdentifierType::LABEL :
        {
            return Token::Type::LABEL;
        }
        case ToTokenState::NextIdentifierType::REFERENCE :
        {
            return Token::Type::REFERENCE;
        }
        default:
        {
            throw std::runtime_error("Someone did a shitty code trololol");
        }
    }
}

// Empty default action
template<typename Rule>
struct ToTokenAction {};

// The macro has one more argument
#define DefineAction(Match, TokenType, IdentifierType)                                \
    template <>                                                                       \
    struct ToTokenAction<Match>                                                       \
    {                                                                                 \
        template <typename ParseInput>                                                \
        static void apply(const ParseInput& in, ToTokenState& state)                  \
        {                                                                             \
            state.tokens.push_back(                                                   \
                Token                                                                 \
                {                                                                     \
                    .type = TokenType,                                                \
                    .line = in.position().line,                                       \
                    .start_column = in.position().column,                             \
                    .length = in.size()                                               \
                });                                                                   \
            state.next_identifier_type = IdentifierType; /* Update accordingly */     \
        }                                                                             \
    };

// Now use the macro
DefineAction(operators::DefineLabel, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::LABEL)
DefineAction(operators::MoveRight, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::REFERENCE)
DefineAction(operators::MoveLeft, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::REFERENCE)
DefineAction(operators::Increment, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::REFERENCE)
DefineAction(operators::Decrement, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::REFERENCE)
DefineAction(ResetCase, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::NONE)
DefineAction(operators::DefineReference, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::REFERENCE)
DefineAction(operators::UndefineReference, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::REFERENCE)
DefineAction(operators::MoveToReference, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::REFERENCE)
DefineAction(operators::GoToLabel, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::LABEL)
DefineAction(operators::Rewind, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::LABEL)
DefineAction(operators::IfCurrentValueDifferent, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::REFERENCE)
DefineAction(operators::IfCursorIsAtReference, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::REFERENCE)
DefineAction(operators::IfReferenceExists, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::REFERENCE)
DefineAction(GetChar, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::NONE)
DefineAction(PutChar, Token::Type::OPERATOR, ToTokenState::NextIdentifierType::NONE)

DefineAction(Comment, Token::Type::COMMENT, ToTokenState::NextIdentifierType::NONE)
DefineAction(ReadFile, Token::Type::STRING, ToTokenState::NextIdentifierType::NONE)
DefineAction(Integer, Token::Type::NUMBER, ToTokenState::NextIdentifierType::NONE)

// Handle the identifier
template <>                                                                       
struct ToTokenAction<Identifier>
{
    template <typename ParseInput>
    static void apply(const ParseInput& in, ToTokenState& state)
    {
        if (state.next_identifier_type != ToTokenState::NextIdentifierType::NONE)
        {
            state.tokens.push_back(
                Token
                {
                    .type = get_token_type(state.next_identifier_type),
                    .line = in.position().line,
                    .start_column = in.position().column,
                    .length = in.size()
                });
            state.next_identifier_type = ToTokenState::NextIdentifierType::NONE;
        }
    }
};

// Because I'm not that dirty, I undef my macro since I don't need it anymore
#undef DefineAction

And finally, here is the code that uses this machinery to get all the tokens:

std::vector<Token> Parser::get_tokens(std::string_view input)
{
    ToTokenState token_state;

    pegtl::memory_input mem_input(input.data(), input.size(), "");
    pegtl::parse<Grammar, ToTokenAction>(mem_input, token_state);

    return token_state.tokens;
}

Get the labels and references

Now that I am able to get the tokens, this part will be very easy.

The only way to define a label is :my_label, and the only way to define a reference is #my_reference. I will detect these patterns to get the desired lists.

There are two ways to do it:

Build a parse tree with a custom Selector that keeps only the labels/references and their identifiers
Use the same technique as for the tokens

Both are really simple, but I chose the second one because the code is really compact. Here’s the code for the references; the code for the labels is almost the same:

template<typename Rule>
struct GetDefinedReferencesAction {};

template<>
struct GetDefinedReferencesAction<operators::DefineReference>
{
	// This is called when a # is encountered
	template <typename ParseInput>
	static void apply(const ParseInput&, std::vector<std::string>&, bool& is_next_reference)
	{
		is_next_reference = true;
	}
};

template<>
struct GetDefinedReferencesAction<Identifier>
{
	// This is called for all identifiers
	template <typename ParseInput>
	static void apply(const ParseInput& in, std::vector<std::string>& references, bool& is_next_reference)
	{
		// If the previous character was a #
		if (is_next_reference)
		{
			// Add it in the list
			references.push_back(in.string());
			is_next_reference = false;
		}
	}
};

That’s it!
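To make the expected result concrete without PEGTL, here is a small standalone sketch of the same flag technique. collect_defined is a hypothetical helper, not part of PainPerdu, and it assumes that the definition marker is immediately followed by the identifier. Like the code above, it does not remove duplicates.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Returns every identifier that directly follows `marker`
// (':' for label definitions, '#' for reference definitions).
inline std::vector<std::string> collect_defined(const std::string& input, char marker)
{
    assert(marker == ':' || marker == '#');

    std::vector<std::string> names;
    bool is_next_defined = false; // Same role as is_next_reference

    std::size_t i = 0;
    while (i < input.size())
    {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isalpha(c) != 0 || c == '_')
        {
            // Consume the whole identifier
            std::size_t end = i + 1;
            while (end < input.size()
                   && (std::isalnum(static_cast<unsigned char>(input[end])) != 0 || input[end] == '_'))
            {
                ++end;
            }
            if (is_next_defined)
            {
                names.push_back(input.substr(i, end - i));
            }
            is_next_defined = false;
            i = end;
        }
        else
        {
            // Only the marker arms the flag, any other character disarms it
            is_next_defined = (input[i] == marker);
            ++i;
        }
    }
    return names;
}
```

For the source `:start #a >a +1 :end *start #b`, the label list is start, end and the reference list is a, b: uses such as `>a` or `*start` are not definitions, so they are not collected.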

Create the bindings

Now, to create the bindings, I need a new CMake target. It is built the same way as the web interpreter, with one difference: the nodejs shipped with VSCode does not use the flag --experimental-wasm-eh. This means I won’t use the flag -fwasm-exceptions in the CMakeLists.txt:

# Create the target
add_executable(ExtentionBindings ExtentionBindings.cpp)
# Basic options
target_compile_options(ExtentionBindings PRIVATE -Wextra -Wall -Wsign-conversion -Wfloat-equal -pedantic -Wredundant-decls -Wshadow -Wpointer-arith -O3)
# Don't forget to link with --bind
target_link_options(ExtentionBindings PRIVATE --bind)
# Obviously I need my PainPerdu library
target_link_libraries(ExtentionBindings PRIVATE PainPerdu)
# Put the output files directly in the good directory
set_target_properties(ExtentionBindings PROPERTIES RUNTIME_OUTPUT_DIRECTORY "${CMAKE_CURRENT_LIST_DIR}/../../vscode_extention/painperdu-bakery/generated")

The same goes for the cpp files: they are very similar to what I did in my previous article about emscripten.

Add some getter functions to the Token structure:

struct Token
{
	enum class Type
	{
		REFERENCE,
		LABEL,
		COMMENT,
		STRING,
		NUMBER,
		OPERATOR
	};
	using TypeIntType = std::underlying_type_t<Type>;

    bool operator==(const Token&) const = default;

	Type type;
	std::size_t line;
	std::size_t start_column;
	std::size_t length;

	// Getter for the bindings
	Type get_type() const { return type; }
	// Little spoiler, I added this function because VSCode will only need an index :)
	TypeIntType get_type_index() const { return static_cast<TypeIntType>(type); }
	std::size_t get_line() const { return line; }
	std::size_t get_start_column() const { return start_column; }
	std::size_t get_length() const { return length; }

};

Then I can actually create the bindings:

#include <vector>
#include <string>

// Do not forget this include
#include <emscripten/bind.h>

#include <PainPerdu/PainPerdu.hpp>

using namespace emscripten;

// Little helper functions
std::vector<PainPerdu::parser::Token> get_tokens(const std::string& input)
{
	return PainPerdu::Parser().get_tokens(input);
}

std::vector<std::string> get_defined_labels(const std::string& input)
{
	return PainPerdu::Parser().get_defined_labels(input);
}

std::vector<std::string> get_defined_references(const std::string& input)
{
	return PainPerdu::Parser().get_defined_references(input);
}

// The module
EMSCRIPTEN_BINDINGS(PainPerduParserModule) {
	// I need to declare the vectors
	register_vector<std::string>("VectorString");

	// And the Token class with all its getters
	class_<PainPerdu::parser::Token>("PainPerduToken")
		.function("get_type_index", &PainPerdu::parser::Token::get_type_index)
		.function("get_line", &PainPerdu::parser::Token::get_line)
		.function("get_start_column", &PainPerdu::parser::Token::get_start_column)
		.function("get_length", &PainPerdu::parser::Token::get_length);

	// Then declare what a vector of Token is
	register_vector<PainPerdu::parser::Token>("VectorToken");

	// Last step, declare functions I will use in javascript
	function("get_tokens", &get_tokens);
	function("get_defined_labels", &get_defined_labels);
	function("get_defined_references", &get_defined_references);

}

The javascript implementation

The setup

I just followed the VSCode tutorial. To be honest, if I went any deeper I would either be paraphrasing it or be less clear than the tutorial itself. It is very clear and well written.

Basically, I just did:

npm install -g yo generator-code
yo code

This generated the structure of the extension. Then I created a folder for my wasm code and its js glue code (generated from the C++). In the rest of this article, all my code will be in a file named extension.js, which at the beginning looks like this:

// Include vscode stuff
const vscode = require('vscode');

// Use the bindings to my PainPerdu library
const bindings = require('./generated/ExtentionBindings')


/**
 * @param {vscode.ExtensionContext} context
 */
function activate(context) {

}

function deactivate() { }

module.exports = {
	activate,
	deactivate
}

The semantic highlighting

As I said earlier, I need to provide a list of tokens to VSCode, and it provides utility classes to build it: SemanticTokensBuilder and SemanticTokensLegend.

The SemanticTokensLegend is just a little class describing the token types that will be provided. There are a lot of predefined token types, like variable or function. You can see the list of predefined token types, and how to create your own, in the documentation. The SemanticTokensLegend is created from a list of strings, and I will use this legend from now on:

const tokenTypeStrings = [
	'variable',
	'function',
	'comment',
	'string',
	'number',
	'operator'
];
const legend = new vscode.SemanticTokensLegend(tokenTypeStrings);

The SemanticTokensBuilder can then be used with the legend previously created:

const builder = new vscode.SemanticTokensBuilder(legend);
builder.push(start_line, start_column, length, index_of_the_type);
return builder.build();

Note that VSCode considers that the lines and columns of a document start at 0. Also, it uses an index for the type: the index of variable is 0, function is 1, etc.
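Both remarks matter on the C++ side: the value returned by get_type_index only selects the right legend entry because the enumerators of Token::Type are declared in the same order as the legend strings, and the 1-based positions reported by the parser have to be shifted. The sketch below restates this in standalone C++. legend_names, legend_index, EditorPosition and to_editor_position are hypothetical names used for illustration only; they are not part of PainPerdu or of the VSCode API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Same enumerators, in the same order, as Token::Type
enum class Type { REFERENCE, LABEL, COMMENT, STRING, NUMBER, OPERATOR };

// Same strings, in the same order, as the legend given to VSCode
const std::array<std::string, 6> legend_names = {{
    "variable", "function", "comment", "string", "number", "operator"
}};

// The enumerator value is used directly as an index in the legend,
// so reordering one list without the other would mix up the colors
inline std::size_t legend_index(Type type)
{
    return static_cast<std::size_t>(type);
}

struct EditorPosition
{
    std::size_t line;
    std::size_t character;
};

// Parser positions are 1-based, the editor expects 0-based ones
inline EditorPosition to_editor_position(std::size_t line, std::size_t column)
{
    assert(line >= 1 && column >= 1);
    return {line - 1, column - 1};
}
```

For instance, a LABEL token maps to the legend entry function, and a token reported at line 1, column 5 by the parser is pushed at line 0, character 4.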

Now I need a DocumentSemanticTokensProvider. When using TypeScript, it is possible to implement the interface vscode.DocumentSemanticTokensProvider. It must have a method named provideDocumentSemanticTokens that returns the list of tokens. I used it like this:

class DocumentSemanticTokensProvider {
    async provideDocumentSemanticTokens(document) {
        // Use the C++ code to create the list of tokens
        const allTokens = bindings.get_tokens(document.getText());

        // Create the builder using the legend
        const builder = new vscode.SemanticTokensBuilder(legend);

        // Convert my list of tokens into one that VSCode understands
        for (let i = 0; i < allTokens.size(); ++i) {
            const token = allTokens.get(i);

            // In my C++ code, columns and lines start at 1, but with VSCode they start at 0
            builder.push(token.get_line() - 1, token.get_start_column() - 1, token.get_length(), token.get_type_index());

            // Objects handed over by embind live on the wasm heap and must be freed explicitly
            token.delete();
        }
        allTokens.delete();

        return builder.build();
    }
}

And the last step is to make VSCode use the DocumentSemanticTokensProvider:

function activate(context) {
    // I only want to use it for my language
    const selector = { language: 'painperdu' };
    // Some boilerplate code
    context.subscriptions.push(vscode.languages.registerDocumentSemanticTokensProvider(selector, new DocumentSemanticTokensProvider(), legend));
}

Here’s what it can look like with some code from my Brainfuck interpreter written in PainPerdu:

[Image: PainPerdu code with color highlighting]

The auto-completion

Auto-completion works almost the same way, except that instead of a list of tokens I need to return a list of elements with their kind: in this case either a variable (for a PainPerdu reference) or a function (for a PainPerdu label).

I will create two GoCompletionItemProvider classes, one giving the list of references, the other the list of labels:

class GoCompletionItemProviderLabels {
    provideCompletionItems(document) {
        const labels = bindings.get_defined_labels(document.getText());

        const result = [];
        for (let i = 0; i < labels.size(); ++i) {
            // A PainPerdu label is shown as a function
            result.push(new vscode.CompletionItem(labels.get(i), vscode.CompletionItemKind.Function));
        }
        // The vector handle returned by embind must be freed explicitly
        labels.delete();
        return result;
    }
}

class GoCompletionItemProviderReference {
    provideCompletionItems(document) {
        const refs = bindings.get_defined_references(document.getText());

        const result = [];
        for (let i = 0; i < refs.size(); ++i) {
            // A PainPerdu reference is shown as a variable
            result.push(new vscode.CompletionItem(refs.get(i), vscode.CompletionItemKind.Variable));
        }
        // The vector handle returned by embind must be freed explicitly
        refs.delete();
        return result;
    }
}

And now I just need VSCode to use them:

function activate(context) {
    const selector = { language: 'painperdu' };
    context.subscriptions.push(vscode.languages.registerDocumentSemanticTokensProvider(selector, new DocumentSemanticTokensProvider(), legend));

    // These characters can only be followed by a label
    context.subscriptions.push(vscode.languages.registerCompletionItemProvider(selector, new GoCompletionItemProviderLabels(), '.', '*', '&'));
    // These characters can only be followed by a reference
    context.subscriptions.push(vscode.languages.registerCompletionItemProvider(selector, new GoCompletionItemProviderReference(), '#', '@', '$', '>', '<', '+', '-', '?', '!'));
}

And voilà! It was that simple!

Article::~Article

This is my very first VSCode extension and, even if it may seem dumb, I’m proud of it. Don’t hesitate to point out errors or things I could have done better: I’m not very comfortable with Javascript and I don’t know the VSCode API well.

The extension could be improved, but I doubt that anybody will really use this extension. The whole project around this language is just an excuse to try some tools and have some fun.

The extension is named PainPerdu Bakery and you can find it on the vscode marketplace.

Sources

tags: cpp - vscode - emscripten - webassembly