Tag Archive for: EBNF

Posts

Developing our own programming language in Java + ANTLR: diagnostics of errors

This is the second article in the series “Developing our programming language in Java”, first article can be read here.

At the current stage, we have an interpreter capable of executing the commands of our language. However, this is not enough if we want to check the code for errors and display them in a clear way to the user. In this article we will consider adding error diagnostics to the language. Conducting error analysis in your own programming language is an important step in language development. Using powerful tools, such as ANTLR, allows you to implement rather efficient code analysis tools in a short time that will help you to detect potential problems in a program at early stages of development, which helps to improve software quality and increase developer’s productivity.

Classification of errors

There are different types of errors, but in general they can be divided into three categories: syntax, semantic and runtime errors.

Syntax errors occur due to violation of the syntax rules of a particular programming language. Syntax rules define how statements and expressions in code should be organized.

Example of a syntax error (missing closing quote):

println("Hello, World!)

Semantic errors occur when a program is compiled and even executed, but the result is different from what was expected. This type of error is the most difficult. Semantic errors can be caused by the programmer’s misunderstanding of the language or the task at hand. For example, if a programmer has a poor understanding of operator precedence, then he can write the following code:

var a = 1 + 2 * 3

One could expect the variable a to be equal to 9, but in fact it will be equal to 7. This happens because the multiplication operator has higher precedence than the addition operator. A semantic error can usually be discovered during debugging or extensive testing of the program.

Runtime errors, also known as Exceptions, occur during program execution. Such errors can occur due to incorrect data entry, attempting to access a non-existent file, and in many other scenarios. Some runtime errors can be handled in a program, but if this is not done, the program usually crash.

In addition to errors, it is also important to detect potential problems or non-obvious situations that are not errors in the strict sense, but may lead to undesirable consequences. For example, it could be an unused variable, use of deprecated functions, or a meaningless operation. In all such cases, warnings can be shown to the user.

JimpleBaseVisitor

To identify errors and warnings, we need the abstract class JimpleBaseVisitor (generated by ANTLR), familiar to us from the first article, which by default implements the JimleVisitor interface. It allows you to traverse the AST tree (Abstract Syntax Tree), and based on the analysis of its nodes we will decide upon the error, warning or normal part of the code. In essence, diagnosing errors is almost no different from interpreting code, except when we need to perform I/O or access external resources. For example, if a console output command is executed, then our task is to check whether the data type is passed as an argument is valid, without directly outputting to the console.

Let’s create the JimpleDiagnosticTool class, which inherits JimleBaseVisitor and encapsulates all the logic of finding and storing errors:

class JimpleDiagnosticTool extends JimpleBaseVisitor<ValidationInfo> {
    private Set<Issue> issues = new LinkedHashSet<>();
}

record Issue(IssueType type, String message, int lineNumber, int lineOffset, String details) {}

This class contains a list of type Issue, which represents information about a specific error.

It is known that each method of a given class must return a value of a certain type. In our case, we will return information about the type of node in the tree — ValidationInfo. This class also contains information about the possible value, this will help us identify some semantic or runtime errors.

record ValidationInfo(ValidationType type, Object value) {}

enum ValidationType {
    /**
     * Expression returns nothing.
     */
    VOID,

    /**
     * Expression is String
     */
    STRING,

    /**
     * Expression is double
     */
    DOUBLE,

    /**
     * Expression is long
     */
    NUMBER,

    /**
     * Expression is boolean
     */
    BOOL,

    /**
     * Expression contains error and analysing in another context no makes sense.
     */
    SKIP,

    /**
     * When object can be any type. Used only in Check function definition mode.
     */
    ANY,

    /**
     * Tree part is function declaration
     */
    FUNCTION_DEFINITION
}

You should pay attention to the value of ValidationType.SKIP. It will be used if an error has been found and already registered in a part of the tree, and further analysis of this tree node does not make sense. For example, if one argument in a sum expression contains an error, then the second argument of the expression will not be analyzed.

ValidationInfo checkBinaryOperatorCommon(ParseTree leftExp, ParseTree rightExp, Token operator) {
    ValidationInfo left = visit(leftExp);
    if (left.isSkip()) {
        return ValidationInfo.SKIP;
    }
    ValidationInfo right = visit(rightExp);
    if (right.isSkip()) {
        return ValidationInfo.SKIP;
    }
    // code omitted
}

Listers vs Visitors

Before moving on, let’s take a look at another ANTLR-generated interface JimpleListener (pattern Observer), which can also be used if we need traverse the AST tree. What is the difference between them? The biggest difference between these mechanisms is that listener methods are always called by ANTLR on a per-node basis, whereas visitor methods must bypass their child elements with explicit calls. And if the programmer does not call visit() on child nodes, then these nodes are not visited, i.e. we have the ability to control tree traversal. For example, in our implementation, the function body is first visited once in its entirety (mode checkFuncDefinition==true) to detect errors in the entire function (all if and else blocks), and several times with specific argument values:

@Override
ValidationInfo visitIfStatement(IfStatementContext ctx) {
    // calc expression in "if" condition
    ValidationInfo condition = visit(ctx.expression());

    if (checkFuncDefinition) {
        visit(ctx.statement());
        // as it's just function definition check, check else statement as well
        JimpleParser.ElseStatementContext elseStatement = ctx.elseStatement();
        if (elseStatement != null) {
            visit(elseStatement);
        }
        return ValidationInfo.VOID;
    }

    // it's not check function definition, it's checking of certain function call
    if (condition.isBool() && condition.hasValue()) {
        if (condition.asBoolean()) {
            visit(ctx.statement());
        } else {
            JimpleParser.ElseStatementContext elseStatement = ctx.elseStatement();
            if (elseStatement != null) {
                visit(elseStatement);
            }
        }
    }

    return ValidationInfo.VOID;
}

The Visitor pattern works very well if we need to map a specific value for each tree node. This is exactly what we need.

Catching syntactic errors

In order to find some syntax errors in the code, we need to implement the ANTLRErrorListener interface. This interface contains four methods that will be called (by the parser and/or lexer) in case of an error or undefined behavior:

interface ANTLRErrorListener {
    void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e);
    void reportAmbiguity(Parser recognizer, DFA dfa, int startIndex, int stopIndex, boolean exact, BitSet ambigAlts, ATNConfigSet configs);
    void reportAttemptingFullContext(Parser recognizer, DFA dfa, int startIndex, int stopIndex, BitSet conflictingAlts, ATNConfigSet configs);
    void reportContextSensitivity(Parser recognizer, DFA dfa, int startIndex, int stopIndex, int prediction, ATNConfigSet configs);
}

The name of the first method (syntaxError) speaks for itself; it will be called in case of a syntax error. The implementation is quite simple: we need to convert the error information into an object of type Issue and add it to the list of errors:

@Override
void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
    int offset = charPositionInLine + 1;
    issues.add(new Issue(IssueType.ERROR, msg, line, offset, makeDetails(line, offset)));
}

The remaining three methods can be ignored. ANTLR also implements this interface itself (see class ConsoleErrorListener) and sends errors to the standard error stream (System.err). To disable it and other standard handlers, we need to call the removeErrorListeners method on the parser and lexer:

    // remove default error handlers
    lexer.removeErrorListeners();
    parser.removeErrorListeners();

Another type of syntax error is based on the rules of a particular language. For example, in our language, a function is identified by its name and number of arguments. When the analyzer encounters a function call, it checks whether a function with the same name and number of arguments exists. If not, then it throws an error. To do this, we need to override the visitFunctionCall method:

@Override
ValidationInfo visitFunctionCall(FunctionCallContext ctx) {
    String funName = ctx.IDENTIFIER().getText();
    int argumentsCount = ctx.expression().size();
    var funSignature = new FunctionSignature(funName, argumentsCount, ctx.getParent());
    // find a function in the context by signature (name+number of arguments)
    var handler = context.getFunction(funSignature);

    if (handler == null) {
        addIssue(IssueType.ERROR, ctx.start, "Function with such signature not found: " + funName);
        return ValidationInfo.SKIP;
    }

    // code omitted
}

Let’s check the if construct. Jimple requires that the expression in the if condition is of boolean type:

@Override
ValidationInfo visitIfStatement(IfStatementContext ctx) {
    // visit expression
    ValidationInfo condition = visit(ctx.expression());
    // skip if expression contains error
    if (condition.isSkip()) {
        return ValidationInfo.SKIP;
    }

    if (!condition.isBool()) {
        addIssue(IssueType.WARNING, ctx.expression().start, "The \"if\" condition must be of boolean type only. But found: " + condition.type());
    }

    // code omitted
}

The attentive reader will notice that in this case we added a warning, not an error. This is done due to the fact that our language is dynamic, and we don’t always know the exact information about the type of the expression.

Identify semantic errors

As mentioned earlier, semantic errors are difficult to find and can often be found only during debugging or testing the program. However, some of them can be identified at the compilation stage. For example, if we know that the function X always returns 0, then we can show a warning if a division expression uses this function as a divisor. Division by zero is usually considered a semantic error, since division by zero makes no sense in mathematics.

An example of error detection “Division by zero”: triggered when an expression is used as a divisor, which always returns the value 0:

ValidationInfo checkBinaryOperatorForNumeric(ValidationInfo left, ValidationInfo right, Token operator) {
    if (operator.getType() == JimpleParser.SLASH && right.hasValue() && ((Number) right.value()).longValue() == 0) {
        // if we have value of right's part of division expression and it's zero
        addIssue(IssueType.WARNING, operator, "Division by zero");
    }

    // code omitted
}

Runtime errors

Runtime errors are also difficult or even impossible to detect at the compilation/interpretation stage. However, some such errors can still be identified. For example, if a function calls itself (either directly or through another function), this may result in a stack overflow error (StackOverflow). The first thing we need to do is to declare a list (Set) where we will store the functions that are in the process of being called at the moment. The check itself can (and should) be placed in the handleFuncInternal method of processing the function call. At the beginning of this method, we check whether the current FunctionDefinitionContext (function declaration context) is in the list of already called functions, and if that is the case, we log a Warning and interrupt further processing of the function. If not, then we add the current context to our list, and the rest of the logic follows. When exiting handleFuncInternal, you need to remove the current function context from the list. Here it should be noted that in this case we not only identified a potential StackOverflow, but also got rid of the same error when diagnosing error, namely when looping the handleFuncInternal method.

Set<FunctionDefinitionContext> calledFuncs = new HashSet<>();

ValidationInfo handleFuncInternal(List<String> parameters, List<Object> arguments, FunctionDefinitionContext ctx) {
    if (calledFuncs.contains(ctx)) {
        addIssue(IssueType.WARNING, ctx.name, String.format("Recursive call of function '%s' can lead to StackOverflow", ctx.name.getText()));
        return ValidationInfo.SKIP;
    }
    calledFuncs.add(ctx);
    
    // other checkings

    calledFuncs.remove(ctx);

    // return resulting ValidationInfo
}

Control/data-flow analysis

For a deeper study of program code, optimization and identification of complex errors, Control-flow analysis and Data-flow analysis are also used.

Control flow analysis is focused on understanding which parts of a program are executed depending on various conditions and control structures such as conditional (if-else) statements, loops, and branches. It allows you to identify the program execution paths and detect potential errors associated with incorrect flow control logic. For example, unreachable code or potential program hangpoints.

Data flow analysis, on the other hand, focuses on how data is distributed and used within a program. It helps identify potential data problems such as the use of uninitialized variables, data dependencies, and possible memory leaks. Data flow analysis can also detect errors associated with incorrect data operations, such as the use of incorrect types or incorrect (meaningless) calculations.

Summary

In this article, we have examined the process of adding error and warning diagnostics to your programming language. We have learned what does the ANTLR provides out of the box for logging syntax errors. We have alos implemented a handling of some errors and potential problems during program execution.

The entire interpreter source code can be viewed at link.

Links

April 25, 2024/by Ruslan Absaliamov

Technical en

Developing our own programming language in Java + ANTLR: interpreter

This is the first article in the series “Developing our programming language in Java” which is aimed to show the full path of creating a language, as well as writing and maintaining tools for it. By the end of this article, we will implement an interpreter with which it will be possible to execute programs in our language. Any programming language has a syntax that needs to be converted into a data structure that is convenient for validation, transformation and execution. As a rule, such a data structure is an abstract syntax tree (AST). Each tree node represents a construct that occurs in the source code. The source code is parsed by the parser, and the output is AST.

Languages have been developed for a long time, and therefore at the moment we have quite a few mature tools, including parser generators. Parser generators take a description of the grammar of a particular language as input, and as output we get parsers, interpreters and compilers.

In this article, we will consider the ANTLR tool. ANTLR is a utility that receives a grammar in the form EBNF as input, and interfaces/classes as output (in our case, this is java code) for parsing programs. The list of languages in which parsers are generated can be found here.

Grammar example

Before moving on to real grammar, let’s try to describe some of the rules of a typical programming language in words:

VARIABLE is an IDENTIFIER
DIGIT is one of the characters 0 1 2 3 4 5 6 7 8 9
NUMBER is one or more elements of type DIGIT
EXPRESSION is NUMBER
EXPRESSION is VARIABLE
EXPRESSION is EXPRESSION ‘+’ EXPRESSION
EXPRESSION is ‘(‘ EXPRESSION ’)’

As you can see from this list, a language grammar is a set of rules that can have recursive links. Each rule can refer to itself or to any other rule. ANTLR has many operators in its arsenal to describe such rules.

: marker of the beginning of the rule
; rule end marker
| alternative operator
.. range operator
~ denial
. any character
= assignment
(...) sub-rule
(...)* repeating of a subrule 0 or more times
(...)+  repeating of a subrule 1 or more times
(...)? subrule, may be missing
{...} semantic actions (in the language used as the output - for example, Java)
[...] rule parameters

ANTLR rule examples

The following example describes the rules for integers and floating-point numbers:

NUMBER : [0-9]+ ;
FLOAT : NUMBER '.' NUMBER ;

It is very important to understand that the grammar describes only the syntax of the language on the basis of which the parser will be generated. The parser will generate an AST, with which help it will be possible to implement the semantics of the language. In the previous example, we set a rule for parsing an integer, but did not describe how much memory the number occupies (8 bits, 16, …), and whether the number is signed or unsigned. For example, in some programming languages you can start using a variable without declaring it. You can also not declare the type of the variable; in this case the type will be determined automatically at runtime. All these rules of language semantics are not described in the grammar, but are implemented in another part of the language.

ANTLR tokens and expressions

ANTLR grammar consists of two types of rules: tokens and expressions, which are used to define the structure of the grammar and parse input.

Tokens are rules that define individual lexical elements of the input language, such as numbers, identifiers, operation signs, and so on. Each token corresponds to a certain type of token, which is used for further processing by the parser. The lexical analyzer scans the input text, parses it into tokens, and creates a sequence of tokens, which are then passed to the parser. They are written in uppercase (For example: NUMBER, IDENTIFIER).

Expressions are rules that define the grammar structure of the input language. They describe how tokens are related and how they can be combined into more complex structures. Expressions can contain references to tokens as well as to other expressions. Written in camelCase notation (For example: expression, functionDefinition).

Thus, the difference between tokens and expressions in ANTLR is that tokens define individual lexical elements of the input language and convert them into tokens, while expressions define the structure of the grammar and describe how tokens are related to each other in more complex structures.

Language Requirements

Before you start implementing a language, you need to determine the set of features it should support. For our task, for educational purposes, we will use a simple grammar. The language will support the following constructs:

Variables (types String, Long, Double);
Assignment operator (=);
Arithmetic operations (+, -, *, /);
Comparison operators (>, <, >=, <=, ==, !=);
Conditional statements (if , else);
Functions;
Print to the console (built-in println operator).

Grammar

And finally, a complete description of the grammar for the language:

grammar Jimple ;

// root rule of grammar
program: (statement)* EOF;

// list of possible statements
statement: variableDeclaration
         | assignment
         | functionDefinition
         | functionCall
         | println
         | return
         | ifStatement
         | blockStatement
         ;

// list of possible expressions
expression: '(' expression ')'                                      #parenthesisExpr
          | left=expression op=(ASTERISK | SLASH) right=expression  #mulDivExpr
          | left=expression op=(PLUS | MINUS) right=expression      #plusMinusExpr
          | left=expression compOperator right=expression           #compExpr
          | IDENTIFIER                                              #idExp
          | NUMBER                                                  #numExpr
          | DOUBLE_NUMBER                                           #doubleExpr
          | STRING_LITERAL                                          #stringExpr
          | functionCall                                            #funcCallExpr
          ;

// descriptions of individual expressions and statements
variableDeclaration: 'var' IDENTIFIER '=' expression ;

assignment: IDENTIFIER '=' expression ;

compOperator: op=(LESS | LESS_OR_EQUAL | EQUAL | NOT_EQUAL | GREATER | GREATER_OR_EQUAL) ;

println: 'println' expression ;

return: 'return' expression ;

blockStatement: '{' (statement)* '}' ;

functionCall: IDENTIFIER '(' (expression (',' expression)*)? ')' ;

functionDefinition: 'fun' name=IDENTIFIER '(' (IDENTIFIER (',' IDENTIFIER)*)? ')' '{' (statement)* '}' ;

ifStatement: 'if' '(' expression ')' statement  elseStatement? ;

elseStatement: 'else' statement ;

// list of tokens
IDENTIFIER          : [a-zA-Z_] [a-zA-Z_0-9]* ;
NUMBER              : [0-9]+ ;
DOUBLE_NUMBER       : NUMBER '.' NUMBER ;
STRING_LITERAL      : '"' (~["])* '"' ;

ASTERISK            : '*' ;
SLASH               : '/' ;
PLUS                : '+' ;
MINUS               : '-' ;

ASSIGN              : '=' ;
EQUAL               : '==' ;
NOT_EQUAL           : '!=' ;
LESS                : '<' ;
LESS_OR_EQUAL       : '<=' ;
GREATER             : '>' ;
GREATER_OR_EQUAL    : '>=' ;

SPACE               : [ \r\n\t]+ -> skip;
LINE_COMMENT        : '//' ~[\n\r]* -> skip;

As you have probably already guessed, our language is called Jimple (derived from Jvm simple). It’s probably worth explaining some points that may not be obvious when you first get acquainted with ANTLR.

Labels

op label has been used to describe the rules for some operations. This allows us to use this label later as the name of a variable that will contain the value of the operator. Theoretically, we could avoid specifying labels, but in this case, we would have to write additional code to get the value of the operator from the parse tree.

compOperator: op=(LESS | LESS_OR_EQUAL | EQUAL | NOT_EQUAL | GREATER | GREATER_OR_EQUAL) ;

Named rule alternatives

In ANTLR, by defining a rule with multiple alternatives, a user can assign a name to every of them, and then in the tree it will appear in a form of a separate processing node. That is very convenient when you need to assign to the processing of every rule option a separate method. It is important that the names must be given either to all alternatives or to none of them. The following example demonstrates what it looks like:

expression: '(' expression ')'  #parenthesisExpr
          | IDENTIFIER          #idExp
          | NUMBER              #numExpr

ANTLR will generate the following code:

public interface JimpleVisitor<T> {
    T visitParenthesisExpr(ParenthesisExprContext ctx);

    T visitIdExp(IdExpContext ctx);

    T visitNumExpr(NumExprContext ctx);
}

Channels

ANTLR has such a construct as channels. Usually, channels are used to work with comments, but since in most cases we do not need to check for comments, they must be discarded with -> skip, that we have already used. However, there are cases when we need to interpret the meaning of comments or other constructs, in this case you should use pipes. ANTLR has already a built-in channel called HIDDEN that you can use, or you can also declare your own channels for specific purposes. Later, by parsing the code, you will be able to access these channels.

An example of declaring and using a channel

channels { MYLINECOMMENT }

LINE_COMMENT : '//' ~[rn]+ -> channel(MYLINECOMMENT) ;

Fragments

Along with tokens, ANTLR has such an element as fragments. Rules with a fragment prefix can only be called from other lexer rules. They are not tokens per se. In the following example, we moved the definitions of numbers for different number systems into fragments.

NUMBER: DIGITS | OCTAL_DIGITS | HEX_DIGITS;
fragment DIGITS: '1'..'9' '0'..'9'*;
fragment OCTAL_DIGITS: '0' '0'..'7'+;
fragment HEX_DIGITS: '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+;

Thus, a number in any number system (for example: “123”, “0762”, or “0xac1”) will be treated as a NUMBER token, not DIGITS, OCTAL_DIGITS, or HEX_DIGITS. Jimple does not use fragments.

Tools

Before we start generating the parser, we need to set up the tools to work with ANTLR. As is known, a good and convenient tool makes half the success. To accomplish that, we need to download ANTLR library and to write scripts in order to run it. There are also maven/gradle/IntelliJ IDEA plugins, which we will not be using in this article, but for productive development, they can be found useful.

We need the following scripts:

antlr4.sh script

java -Xmx500M -cp ".:/usr/lib/antlr-4.12.0-complete.jar" org.antlr.v4.Tool $@

grun.sh script

java -Xmx500M -cp ".:/usr/lib/antlr-4.12.0-complete.jar" org.antlr.v4.gui.TestRig $@

Parser generation

Save the grammar in the file Jimple.g4. Next, run the script like this:

antlr4.sh Jimple.g4 -package org.jimple.lang -visitor

-package option allows you to specify the java package in which the code will be generated. The -visitor option allows you to generate a JimpleVisitor interface that implements the Visitor pattern.

After successful execution of the script, several files will appear in the current directory: JimpleParser.java, JimpleLexer.java, JimpleListener.java, JimpleVisitor.java. The first two files contain the generated parser and lexer code, respectively. The other two files contain interfaces for working with the parse tree. In this article, we will use the JimpleVisitor interface, more specifically JimpleBaseVisitor — this one is also a generated class that implements the JimpleVisitor interface and contains implementations of all methods. It allows us to override only the methods we need.

Interpreter Implementation

Finally, we have reached the most interesting part — the implementation of the interpreter. Although in this article we will not analyze the issue of checking the code for errors, some interpretation errors will still be implemented. First of all, let’s create a JimpleInterpreter class with an eval method, which will take a string with the code for Jimple as input. Next, we need to parse the source code into tokens using JimpleLexer, then to create an AST tree using JimpleParser.

public class JimpleInterpreter {
    public Object eval(final String input) {
        // parse initial code on tokens
        final JimpleLexer lexer = new JimpleLexer(CharStreams.fromString(input));
        // create AST tree
        final JimpleParser parser = new JimpleParser(new CommonTokenStream(lexer));
        // create an object class JimpleInterpreterVisitor
        final JimpleInterpreterVisitor interpreterVisitor = new JimpleInterpreterVisitor(new JimpleContextImpl(stdout));
        // run interpreter
        return interpreterVisitor.visitProgram(parser.program());
    }
}

We have a syntax tree. Let’s add semantics with the JimpleInterpreterVisitor class we wrote, which will bypass the AST by calling up the appropriate methods. Since the root rule of our grammar is the program rule (see program: (statement)* EOF above), the tree traversal starts from it. To accomplish this, we call the visitProgram method implemented by default on the JimpleInterpreterVisitor object, to the input of which we give an object of the ProgramContext class. The ANTLR implementation consists of calling the visitChildren (RuleNode node) that traverses all the descendant items of the given tree node, calling the visit method on each of them.

// code generated by ANTLR
public class JimpleBaseVisitor<T> extends AbstractParseTreeVisitor<T> implements JimpleVisitor<T> {
    @Override
    public T visitProgram(JimpleParser.ProgramContext ctx) {
        return visitChildren(ctx);
    }

    // other methods omitted for brevity
}

As you can see, JimpleBaseVisitor is a generic class for which you need to define the type of processing for each node. In our case, this is the Object class, since expressions can return values of different types. Typically, an expression must return a value, while a statement returns nothing – this is their difference. In case of statement, we can return null. However, in order not to accidentally encounter NullPointerException, instead of null we will return an object of type Object, which is globally defined in the JimpleInterpreter class:

public static final Object VOID = new Object();

JimpleInterpreterVisitor class extends class JimpleBaseVisitor, overriding only the methods we are interested in. Let’s consider the implementation of the built-in println operator, which is described in the grammar as println: 'println' expression ;. The first thing we need to complete is to evaluate the expression, for this we need to call the visit method and pass the expression object from the current PrintlnContext into it. In the visitPrintln method, we are absolutely not interested in a calculation process of an expression; the corresponding method is responsible for calculating each rule (context). For example, the visitStringExpr method is used to evaluate a string literal.

public class JimpleInterpreterVisitor extends JimpleBaseVisitor<Object> {
    @Override
    public Object visitPrintln(final JimpleParser.PrintlnContext ctx) {
        final Object result = visit(ctx.expression());
        System.out.println(result);
        return VOID;
    }

    @Override
    public Object visitStringExpr(final JimpleParser.StringExprContext ctx) {
        // return string literal
        return cleanStringLiteral(ctx.STRING_LITERAL().getText());
    }

    private String cleanStringLiteral(final String literal) {
        // clear string from quotation marks
        return literal.length() > 1 ? literal.substring(1, literal.length() - 1) : literal;
    }

    // other methods omitted for brevity
}

By implementing only these methods, the interpreter already supports println and string literals, which allows us to execute println "Hello, Jimple!" code.

Run interpreter

To launch the interpreter, you need to create a standard main method, which after some small checks with the use of the JimpleInterpreter class, will run our code:

public class MainApp {
    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("usage: jimple input.jimple");
            System.exit(1);
        }

        Path path = Paths.get(args[0]);
        if (!Files.exists(path)) {
            System.err.println("File not found: " + path);
            System.exit(1);
        }

        new JimpleInterpreter().eval(path);
    }
}

Implementation details

There is no need to provide the entire implementation code of the interpreter, a link to the source code can be found at the end of the article. However, I want to examine some interesting details.

As already mentioned, the interpreter is based on the Visitor pattern, which visits the nodes of the AST tree and executes the corresponding instructions. In the process of code execution in the current context it appears that new identifiers (names of variables and/or functions) need to be stored somewhere. To do this, we will write a JimpleContext class that will store not only these identifiers, but also the current context for executing nested code blocks and functions, since a local variable and/or function parameter must be deleted after leaving their scope.

@Override
public Object handleFunc(FunctionSignature func, List<String> parameters, List<Object> arguments, FunctionDefinitionContext ctx) {
    Map<String, Object> variables = new HashMap<>(parameters.size());
    for (int i = 0; i < parameters.size(); i++) {
        variables.put(parameters.get(i), arguments.get(i));
    }
    // create a new function scope and push it onto the stack
    context.pushCallScope(variables);

    // execution of function expressions omitted for brevity
    // remove the scope of function parameters from the stack
    context.popCallScope();

    return functionResult;
}

In our language, a variable stores a value of a type that is determined at runtime. Further, in the following instructions, this type may change. Thus, we got a dynamically typed language. However, some checking of types is still present in cases where performing the operation is pointless. For example, a number cannot be divided by a string.

Why do we need two passes?

The original version of the interpreter was to implement a method for each rule. For example, if a function declaration processing method finds a function with the same name (and number of parameters) in the current context, then an exception is thrown, otherwise the function is added to the current context. The function call method works the same way. If the function is not found, then an exception is thrown, otherwise the function is called. This approach works, but it does not allow you to call the function before it is defined. For example, the following code will not work:

var result = 9 + 10

println "Result is " + add(result, 34)

fun add(a, b) {
    return a + b
}

In this case, we have two approaches. The first one is to require functions to be defined before they are used (not very convenient for users of the language). The second approach is to perform two passes. The first pass is needed for finding out all the functions that have been defined in the code. And the second is directly for executing the code. In my implementation, I chose the second approach. The implementation of the visitFunctionDefinition method should be moved to a separate class that extends the generated class JimpleBaseVisitor<T> that is already known to us.

// class finds all functions in the code and registers them in the context
public class FunctionDefinitionVisitor extends JimpleBaseVisitor<Object> {
    private JimpleContext context;
    private FunctionCallHandler handler;

    public FunctionDefinitionVisitor(JimpleContext context, FunctionCallHandler handler) {
        this.context = context;
        this.handler = handler;
    }

    @Override
    public Object visitFunctionDefinition(JimpleParser.FunctionDefinitionContext ctx) {
        String name = ctx.name.getText();
        List<String> parameters = ctx.IDENTIFIER().stream().skip(1).map(ParseTree::getText).toList();
        var funcSig = new FunctionSignature(name, parameters.size());
        context.registerFunction(funcSig, (func, args) -> handler.handleFunc(func, parameters, args, ctx));
        return VOID;
    }
}

Now we have a class that we can use before executing the interpreter class directly. It will fill our context with the definitions of all the functions that we will call in the interpreter class.

What does an AST look like?

In order to visualize the AST, you need to use the grun utility (see above). To do this, run grun with the parameters Jimple program -gui (the first parameter is the name of the grammar, the second is the name of the rule). As a result, a window with the AST tree will be opened. Before executing this utility, it is important to compile the code generated by ANTLR.

 # generate a parser
antlr4.sh Jimple.g4
# compile the generated code
javac -cp ".:/usr/lib/jvm/antlr-4.12.0-complete.jar" Jimple*.java
# run grun
grun.sh Jimple program -gui
# enter code: `println "Hello, Jimple!"`
# press Ctrl+D (Linux) or Ctrl+Z (Windows)

For the Jimple code println "Hello, Jimple!", the following AST will be generated:

Summary

In this article you have got acquainted with such concepts as lexical and parser analyzers. We used the ANTLR tool to generate such analyzers. You have also learned how to write ANTLR grammar. As a result, we created a simple language, namely, developed an interpreter for it. As a bonus, we managed to visualize the AST.

The entire source code of the interpreter can be viewed by link.

Links

September 27, 2023/by Ruslan Absaliamov

Posts

Developing our own programming language in Java + ANTLR: diagnostics of errors

Classification of errors

JimpleBaseVisitor

Listers vs Visitors

Catching syntactic errors

Identify semantic errors

Runtime errors

Control/data-flow analysis

Summary

Links

Developing our own programming language in Java + ANTLR: interpreter

Grammar example

ANTLR rule examples

ANTLR tokens and expressions

Language Requirements

Grammar

Labels

Named rule alternatives

Channels

An example of declaring and using a channel

Fragments

Tools

Parser generation

Interpreter Implementation

Run interpreter

Implementation details

Why do we need two passes?

What does an AST look like?

Summary

Links

Phone

Fax

ADRESSE

BUSINESS HOURS

Tag Archive for: EBNF

Posts

Classi­fi­cation of errors

Jimple­Ba­se­Visitor

Listers vs Visitors

Catching syntactic errors

Identify semantic errors

Runtime errors

Control/data-flow analysis

Summary

Links

Grammar example

ANTLR rule examples

ANTLR tokens and expressions

Language Require­ments

Grammar

Labels

Named rule alternatives

Channels

An example of declaring and using a channel

Fragments

Tools

Parser gener­ation

Inter­preter Implementation

Run inter­preter

Imple­men­tation details

Why do we need two passes?

What does an AST look like?

Summary

Links

Phone

Fax

ADRESSE

BUSINESS HOURS

Classification of errors

JimpleBaseVisitor

Language Requirements

Parser generation

Interpreter Implementation

Run interpreter

Implementation details