Posts

This is the second article in the series “Devel­oping our programming language in Java”, first article can be read here.

At the current stage, we have an inter­preter capable of executing the commands of our language. However, this is not enough if we want to check the code for errors and display them in a clear way to the user. In this article we will consider adding error diagnostics to the language. Conducting error analysis in your own programming language is an important step in language development. Using powerful tools, such as ANTLR, allows you to implement rather efficient code analysis tools in a short time that will help you to detect potential problems in a program at early stages of development, which helps to improve software quality and increase developer’s productivity.

Classi­fi­cation of errors

There are different types of errors, but in general they can be divided into three categories: syntaxsemantic and runtime errors.

Syntax errors occur due to violation of the syntax rules of a particular programming language. Syntax rules define how state­ments and expres­sions in code should be organized.

Example of a syntax error (missing closing quote):

println("Hello, World!)

Semantic errors occur when a program is compiled and even executed, but the result is different from what was expected. This type of error is the most difficult. Semantic errors can be caused by the programmer’s misun­der­standing of the language or the task at hand. For example, if a programmer has a poor under­standing of operator prece­dence, then he can write the following code:

var a = 1 + 2 * 3

One could expect the variable a to be equal to 9, but in fact it will be equal to 7. This happens because the multi­pli­cation operator has higher prece­dence than the addition operator. A semantic error can usually be discovered during debugging or extensive testing of the program.

Runtime errors, also known as Excep­tions, occur during program execution. Such errors can occur due to incorrect data entry, attempting to access a non-existent file, and in many other scenarios. Some runtime errors can be handled in a program, but if this is not done, the program usually crash.

In addition to errors, it is also important to detect potential problems or non-obvious situa­tions that are not errors in the strict sense, but may lead to undesirable conse­quences. For example, it could be an unused variable, use of depre­cated functions, or a meaningless operation. In all such cases, warnings can be shown to the user.

Jimple­Ba­se­Visitor

To identify errors and warnings, we need the abstract class JimpleBaseVisitor (generated by ANTLR), familiar to us from the first article, which by default imple­ments the JimleVisitor interface. It allows you to traverse the AST tree (Abstract Syntax Tree), and based on the analysis of its nodes we will decide upon the error, warning or normal part of the code. In essence, diagnosing errors is almost no different from inter­preting code, except when we need to perform I/O or access external resources. For example, if a console output command is executed, then our task is to check whether the data type is passed as an argument is valid, without directly outputting to the console.

Let’s create the JimpleDiagnosticTool class, which inherits JimleBaseVisitor and encap­su­lates all the logic of finding and storing errors:

class JimpleDiagnosticTool extends JimpleBaseVisitor<ValidationInfo> {
    private Set<Issue> issues = new LinkedHashSet<>();
}

record Issue(IssueType type, String message, int lineNumber, int lineOffset, String details) {}

This class contains a list of type Issue, which repre­sents infor­mation about a specific error.

It is known that each method of a given class must return a value of a certain type. In our case, we will return infor­mation about the type of node in the tree — ValidationInfo. This class also contains infor­mation about the possible value, this will help us identify some semantic or runtime errors.

record ValidationInfo(ValidationType type, Object value) {}

enum ValidationType {
    /**
     * Expression returns nothing.
     */
    VOID,

    /**
     * Expression is String
     */
    STRING,

    /**
     * Expression is double
     */
    DOUBLE,

    /**
     * Expression is long
     */
    NUMBER,

    /**
     * Expression is boolean
     */
    BOOL,

    /**
     * Expression contains error and analysing in another context no makes sense.
     */
    SKIP,

    /**
     * When object can be any type. Used only in Check function definition mode.
     */
    ANY,

    /**
     * Tree part is function declaration
     */
    FUNCTION_DEFINITION
}

You should pay attention to the value of ValidationType.SKIP. It will be used if an error has been found and already regis­tered in a part of the tree, and further analysis of this tree node does not make sense. For example, if one argument in a sum expression contains an error, then the second argument of the expression will not be analyzed.

ValidationInfo checkBinaryOperatorCommon(ParseTree leftExp, ParseTree rightExp, Token operator) {
    ValidationInfo left = visit(leftExp);
    if (left.isSkip()) {
        return ValidationInfo.SKIP;
    }
    ValidationInfo right = visit(rightExp);
    if (right.isSkip()) {
        return ValidationInfo.SKIP;
    }
    // code omitted
}

Listers vs Visitors

Before moving on, let’s take a look at another ANTLR-generated interface JimpleListener (pattern Observer), which can also be used if we need traverse the AST tree. What is the difference between them? The biggest difference between these mecha­nisms is that listener methods are always called by ANTLR on a per-node basis, whereas visitor methods must bypass their child elements with explicit calls. And if the programmer does not call visit() on child nodes, then these nodes are not visited, i.e. we have the ability to control tree traversal. For example, in our imple­men­tation, the function body is first visited once in its entirety (mode checkFuncDefinition==true) to detect errors in the entire function (all if and else blocks), and several times with specific argument values:

@Override
ValidationInfo visitIfStatement(IfStatementContext ctx) {
    // calc expression in "if" condition
    ValidationInfo condition = visit(ctx.expression());

    if (checkFuncDefinition) {
        visit(ctx.statement());
        // as it's just function definition check, check else statement as well
        JimpleParser.ElseStatementContext elseStatement = ctx.elseStatement();
        if (elseStatement != null) {
            visit(elseStatement);
        }
        return ValidationInfo.VOID;
    }

    // it's not check function definition, it's checking of certain function call
    if (condition.isBool() && condition.hasValue()) {
        if (condition.asBoolean()) {
            visit(ctx.statement());
        } else {
            JimpleParser.ElseStatementContext elseStatement = ctx.elseStatement();
            if (elseStatement != null) {
                visit(elseStatement);
            }
        }
    }

    return ValidationInfo.VOID;
}

The Visitor pattern works very well if we need to map a specific value for each tree node. This is exactly what we need.

Catching syntactic errors

In order to find some syntax errors in the code, we need to implement the ANTLRErrorListener interface. This interface contains four methods that will be called (by the parser and/or lexer) in case of an error or undefined behavior:

interface ANTLRErrorListener {
    void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e);
    void reportAmbiguity(Parser recognizer, DFA dfa, int startIndex, int stopIndex, boolean exact, BitSet ambigAlts, ATNConfigSet configs);
    void reportAttemptingFullContext(Parser recognizer, DFA dfa, int startIndex, int stopIndex, BitSet conflictingAlts, ATNConfigSet configs);
    void reportContextSensitivity(Parser recognizer, DFA dfa, int startIndex, int stopIndex, int prediction, ATNConfigSet configs);
} 

The name of the first method (syntaxError) speaks for itself; it will be called in case of a syntax error. The imple­men­tation is quite simple: we need to convert the error infor­mation into an object of type Issue and add it to the list of errors:

@Override
void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String msg, RecognitionException e) {
    int offset = charPositionInLine + 1;
    issues.add(new Issue(IssueType.ERROR, msg, line, offset, makeDetails(line, offset)));
}

The remaining three methods can be ignored. ANTLR also imple­ments this interface itself (see class ConsoleErrorListener) and sends errors to the standard error stream (System.err). To disable it and other standard handlers, we need to call the removeErrorListeners method on the parser and lexer:

    // remove default error handlers
    lexer.removeErrorListeners();
    parser.removeErrorListeners();

Another type of syntax error is based on the rules of a particular language. For example, in our language, a function is identified by its name and number of arguments. When the analyzer encounters a function call, it checks whether a function with the same name and number of arguments exists. If not, then it throws an error. To do this, we need to override the visitFunctionCall method:

@Override
ValidationInfo visitFunctionCall(FunctionCallContext ctx) {
    String funName = ctx.IDENTIFIER().getText();
    int argumentsCount = ctx.expression().size();
    var funSignature = new FunctionSignature(funName, argumentsCount, ctx.getParent());
    // find a function in the context by signature (name+number of arguments)
    var handler = context.getFunction(funSignature);

    if (handler == null) {
        addIssue(IssueType.ERROR, ctx.start, "Function with such signature not found: " + funName);
        return ValidationInfo.SKIP;
    }

    // code omitted
}

Let’s check the if construct. Jimple requires that the expression in the if condition is of boolean type:

@Override
ValidationInfo visitIfStatement(IfStatementContext ctx) {
    // visit expression
    ValidationInfo condition = visit(ctx.expression());
    // skip if expression contains error
    if (condition.isSkip()) {
        return ValidationInfo.SKIP;
    }

    if (!condition.isBool()) {
        addIssue(IssueType.WARNING, ctx.expression().start, "The \"if\" condition must be of boolean type only. But found: " + condition.type());
    }

    // code omitted
}

The attentive reader will notice that in this case we added a warning, not an error. This is done due to the fact that our language is dynamic, and we don’t always know the exact infor­mation about the type of the expression.

Identify semantic errors

As mentioned earlier, semantic errors are difficult to find and can often be found only during debugging or testing the program. However, some of them can be identified at the compi­lation stage. For example, if we know that the function X always returns 0, then we can show a warning if a division expression uses this function as a divisor. Division by zero is usually considered a semantic error, since division by zero makes no sense in mathematics.

An example of error detection “Division by zero”: triggered when an expression is used as a divisor, which always returns the value 0:

ValidationInfo checkBinaryOperatorForNumeric(ValidationInfo left, ValidationInfo right, Token operator) {
    if (operator.getType() == JimpleParser.SLASH && right.hasValue() && ((Number) right.value()).longValue() == 0) {
        // if we have value of right's part of division expression and it's zero
        addIssue(IssueType.WARNING, operator, "Division by zero");
    }

    // code omitted
}

Runtime errors

Runtime errors are also difficult or even impos­sible to detect at the compilation/interpretation stage. However, some such errors can still be identified. For example, if a function calls itself (either directly or through another function), this may result in a stack overflow error (Stack­Overflow). The first thing we need to do is to declare a list (Set) where we will store the functions that are in the process of being called at the moment. The check itself can (and should) be placed in the handle­Func­In­ternal method of processing the function call. At the beginning of this method, we check whether the current Function­De­f­i­n­i­tion­Context (function decla­ration context) is in the list of already called functions, and if that is the case, we log a Warning and interrupt further processing of the function. If not, then we add the current context to our list, and the rest of the logic follows. When exiting handle­Func­In­ternal, you need to remove the current function context from the list. Here it should be noted that in this case we not only identified a potential Stack­Overflow, but also got rid of the same error when diagnosing error, namely when looping the handle­Func­In­ternal method.

Set<FunctionDefinitionContext> calledFuncs = new HashSet<>();

ValidationInfo handleFuncInternal(List<String> parameters, List<Object> arguments, FunctionDefinitionContext ctx) {
    if (calledFuncs.contains(ctx)) {
        addIssue(IssueType.WARNING, ctx.name, String.format("Recursive call of function '%s' can lead to StackOverflow", ctx.name.getText()));
        return ValidationInfo.SKIP;
    }
    calledFuncs.add(ctx);
    
    // other checkings

    calledFuncs.remove(ctx);

    // return resulting ValidationInfo
}

Control/data-flow analysis

For a deeper study of program code, optimization and identi­fi­cation of complex errors, Control-flow analysis and Data-flow analysis are also used.

Control flow analysis is focused on under­standing which parts of a program are executed depending on various condi­tions and control struc­tures such as condi­tional (if-else) state­ments, loops, and branches. It allows you to identify the program execution paths and detect potential errors associated with incorrect flow control logic. For example, unreachable code or potential program hangpoints.

Data flow analysis, on the other hand, focuses on how data is distributed and used within a program. It helps identify potential data problems such as the use of unini­tialized variables, data depen­dencies, and possible memory leaks. Data flow analysis can also detect errors associated with incorrect data opera­tions, such as the use of incorrect types or incorrect (meaningless) calculations.

Summary

In this article, we have examined the process of adding error and warning diagnostics to your programming language. We have learned what does the ANTLR provides out of the box for logging syntax errors. We have alos imple­mented a handling of some errors and potential problems during program execution.

The entire inter­preter source code can be viewed at link.

This is the first article in the series “Devel­oping our programming language in Java” which is aimed to show the full path of creating a language, as well as writing and maintaining tools for it. By the end of this article, we will implement an inter­preter with which it will be possible to execute programs in our language. Any programming language has a syntax that needs to be converted into a data structure that is conve­nient for validation, trans­for­mation and execution. As a rule, such a data structure is an abstract syntax tree (AST). Each tree node repre­sents a construct that occurs in the source code. The source code is parsed by the parser, and the output is AST.

Languages have been developed for a long time, and therefore at the moment we have quite a few mature tools, including parser gener­ators. Parser gener­ators take a description of the grammar of a particular language as input, and as output we get parsers, inter­preters and compilers.

In this article, we will consider the ANTLR tool. ANTLR is a utility that receives a grammar in the form EBNF as input, and interfaces/classes as output (in our case, this is java code) for parsing programs. The list of languages in which parsers are generated can be found here.

Grammar example

Before moving on to real grammar, let’s try to describe some of the rules of a typical programming language in words:

  • VARIABLE is an IDENTIFIER
  • DIGIT is one of the characters 0 1 2 3 4 5 6 7 8 9
  • NUMBER is one or more elements of type DIGIT
  • EXPRESSION is NUMBER
  • EXPRESSION is VARIABLE
  • EXPRESSION is EXPRESSION ‘+’ EXPRESSION
  • EXPRESSION is ‘(‘ EXPRESSION ’)’

As you can see from this list, a language grammar is a set of rules that can have recursive links. Each rule can refer to itself or to any other rule. ANTLR has many operators in its arsenal to describe such rules.

: marker of the beginning of the rule
; rule end marker
| alternative operator
.. range operator
~ denial
. any character
= assignment
(...) sub-rule
(...)* repeating of a subrule 0 or more times
(...)+  repeating of a subrule 1 or more times
(...)? subrule, may be missing
{...} semantic actions (in the language used as the output - for example, Java)
[...] rule parameters

ANTLR rule examples

The following example describes the rules for integers and floating-point numbers:

NUMBER : [0-9]+ ;
FLOAT : NUMBER '.' NUMBER ;

It is very important to under­stand that the grammar describes only the syntax of the language on the basis of which the parser will be generated. The parser will generate an AST, with which help it will be possible to implement the semantics of the language. In the previous example, we set a rule for parsing an integer, but did not describe how much memory the number occupies (8 bits, 16, …), and whether the number is signed or unsigned. For example, in some programming languages you can start using a variable without declaring it. You can also not declare the type of the variable; in this case the type will be deter­mined automat­i­cally at runtime. All these rules of language semantics are not described in the grammar, but are imple­mented in another part of the language.

ANTLR tokens and expressions

ANTLR grammar consists of two types of rules: tokens and expres­sions, which are used to define the structure of the grammar and parse input.

Tokens are rules that define individual lexical elements of the input language, such as numbers, identi­fiers, operation signs, and so on. Each token corre­sponds to a certain type of token, which is used for further processing by the parser. The lexical analyzer scans the input text, parses it into tokens, and creates a sequence of tokens, which are then passed to the parser. They are written in uppercase (For example: NUMBERIDENTIFIER).

Expres­sions are rules that define the grammar structure of the input language. They describe how tokens are related and how they can be combined into more complex struc­tures. Expres­sions can contain refer­ences to tokens as well as to other expres­sions. Written in camelCase notation (For example: expressionfunctionDefinition).

Thus, the difference between tokens and expres­sions in ANTLR is that tokens define individual lexical elements of the input language and convert them into tokens, while expres­sions define the structure of the grammar and describe how tokens are related to each other in more complex structures.

Language Require­ments

Before you start imple­menting a language, you need to determine the set of features it should support. For our task, for educa­tional purposes, we will use a simple grammar. The language will support the following constructs:

  • Variables (types StringLongDouble);
  • Assignment operator (=);
  • Arith­metic operations (+, -, *, /);
  • Comparison operators (>, <, >=, <=, ==, !=);
  • Condi­tional state­ments (if , else);
  • Functions;
  • Print to the console (built-in println operator).

Grammar

And finally, a complete description of the grammar for the language:

grammar Jimple ;

// root rule of grammar
program: (statement)* EOF;

// list of possible statements
statement: variableDeclaration
         | assignment
         | functionDefinition
         | functionCall
         | println
         | return
         | ifStatement
         | blockStatement
         ;

// list of possible expressions
expression: '(' expression ')'                                      #parenthesisExpr
          | left=expression op=(ASTERISK | SLASH) right=expression  #mulDivExpr
          | left=expression op=(PLUS | MINUS) right=expression      #plusMinusExpr
          | left=expression compOperator right=expression           #compExpr
          | IDENTIFIER                                              #idExp
          | NUMBER                                                  #numExpr
          | DOUBLE_NUMBER                                           #doubleExpr
          | STRING_LITERAL                                          #stringExpr
          | functionCall                                            #funcCallExpr
          ;

// descriptions of individual expressions and statements
variableDeclaration: 'var' IDENTIFIER '=' expression ;

assignment: IDENTIFIER '=' expression ;

compOperator: op=(LESS | LESS_OR_EQUAL | EQUAL | NOT_EQUAL | GREATER | GREATER_OR_EQUAL) ;

println: 'println' expression ;

return: 'return' expression ;

blockStatement: '{' (statement)* '}' ;

functionCall: IDENTIFIER '(' (expression (',' expression)*)? ')' ;

functionDefinition: 'fun' name=IDENTIFIER '(' (IDENTIFIER (',' IDENTIFIER)*)? ')' '{' (statement)* '}' ;

ifStatement: 'if' '(' expression ')' statement  elseStatement? ;

elseStatement: 'else' statement ;

// list of tokens
IDENTIFIER          : [a-zA-Z_] [a-zA-Z_0-9]* ;
NUMBER              : [0-9]+ ;
DOUBLE_NUMBER       : NUMBER '.' NUMBER ;
STRING_LITERAL      : '"' (~["])* '"' ;

ASTERISK            : '*' ;
SLASH               : '/' ;
PLUS                : '+' ;
MINUS               : '-' ;

ASSIGN              : '=' ;
EQUAL               : '==' ;
NOT_EQUAL           : '!=' ;
LESS                : '<' ;
LESS_OR_EQUAL       : '<=' ;
GREATER             : '>' ;
GREATER_OR_EQUAL    : '>=' ;

SPACE               : [ \r\n\t]+ -> skip;
LINE_COMMENT        : '//' ~[\n\r]* -> skip;

As you have probably already guessed, our language is called Jimple (derived from Jvm simple). It’s probably worth explaining some points that may not be obvious when you first get acquainted with ANTLR.

Labels

op label has been used to describe the rules for some opera­tions. This allows us to use this label later as the name of a variable that will contain the value of the operator. Theoret­i­cally, we could avoid speci­fying labels, but in this case, we would have to write additional code to get the value of the operator from the parse tree.

compOperator: op=(LESS | LESS_OR_EQUAL | EQUAL | NOT_EQUAL | GREATER | GREATER_OR_EQUAL) ;

Named rule alternatives

In ANTLR, by defining a rule with multiple alter­na­tives, a user can assign a name to every of them, and then in the tree it will appear in a form of a separate processing node. That is very conve­nient when you need to assign to the processing of every rule option a separate method. It is important that the names must be given either to all alter­na­tives or to none of them. The following example demon­strates what it looks like:

expression: '(' expression ')'  #parenthesisExpr
          | IDENTIFIER          #idExp
          | NUMBER              #numExpr

ANTLR will generate the following code:

public interface JimpleVisitor<T> {
    T visitParenthesisExpr(ParenthesisExprContext ctx);

    T visitIdExp(IdExpContext ctx);

    T visitNumExpr(NumExprContext ctx);
}

Channels

ANTLR has such a construct as channels. Usually, channels are used to work with comments, but since in most cases we do not need to check for comments, they must be discarded with -> skip, that we have already used. However, there are cases when we need to interpret the meaning of comments or other constructs, in this case you should use pipes. ANTLR has already a built-in channel called HIDDEN that you can use, or you can also declare your own channels for specific purposes. Later, by parsing the code, you will be able to access these channels.

An example of declaring and using a channel

channels { MYLINECOMMENT }

LINE_COMMENT : '//' ~[rn]+ -> channel(MYLINECOMMENT) ;

Fragments

Along with tokens, ANTLR has such an element as fragments. Rules with a fragment prefix can only be called from other lexer rules. They are not tokens per se. In the following example, we moved the defin­i­tions of numbers for different number systems into fragments.

NUMBER: DIGITS | OCTAL_DIGITS | HEX_DIGITS;
fragment DIGITS: '1'..'9' '0'..'9'*;
fragment OCTAL_DIGITS: '0' '0'..'7'+;
fragment HEX_DIGITS: '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+;

Thus, a number in any number system (for example: “123”, “0762”, or “0xac1”) will be treated as a NUMBER token, not DIGITSOCTAL_DIGITS, or HEX_DIGITS. Jimple does not use fragments.

Tools

Before we start gener­ating the parser, we need to set up the tools to work with ANTLR. As is known, a good and conve­nient tool makes half the success. To accom­plish that, we need to download ANTLR library and to write scripts in order to run it. There are also maven/gradle/IntelliJ IDEA plugins, which we will not be using in this article, but for productive development, they can be found useful.

We need the following scripts:

antlr4.sh script

java -Xmx500M -cp ".:/usr/lib/antlr-4.12.0-complete.jar" org.antlr.v4.Tool $@

grun.sh script

java -Xmx500M -cp ".:/usr/lib/antlr-4.12.0-complete.jar" org.antlr.v4.gui.TestRig $@

Parser gener­ation

Save the grammar in the file Jimple.g4. Next, run the script like this:

antlr4.sh Jimple.g4 -package org.jimple.lang -visitor

-package option allows you to specify the java package in which the code will be generated. The -visitor option allows you to generate a JimpleVisitor interface that imple­ments the Visitor pattern.

After successful execution of the script, several files will appear in the current directory: JimpleParser.javaJimpleLexer.javaJimpleListener.javaJimpleVisitor.java. The first two files contain the generated parser and lexer code, respec­tively. The other two files contain inter­faces for working with the parse tree. In this article, we will use the JimpleVisitor interface, more specif­i­cally JimpleBaseVisitor — this one is also a generated class that imple­ments the JimpleVisitor interface and contains imple­men­ta­tions of all methods. It allows us to override only the methods we need.

Inter­preter Implementation

Finally, we have reached the most inter­esting part — the imple­men­tation of the inter­preter. Although in this article we will not analyze the issue of checking the code for errors, some inter­pre­tation errors will still be imple­mented. First of all, let’s create a JimpleInterpreter class with an eval method, which will take a string with the code for Jimple as input. Next, we need to parse the source code into tokens using JimpleLexer, then to create an AST tree using JimpleParser.

public class JimpleInterpreter {
    public Object eval(final String input) {
        // parse initial code on tokens
        final JimpleLexer lexer = new JimpleLexer(CharStreams.fromString(input));
        // create AST tree
        final JimpleParser parser = new JimpleParser(new CommonTokenStream(lexer));
        // create an object class JimpleInterpreterVisitor
        final JimpleInterpreterVisitor interpreterVisitor = new JimpleInterpreterVisitor(new JimpleContextImpl(stdout));
        // run interpreter
        return interpreterVisitor.visitProgram(parser.program());
    }
}

We have a syntax tree. Let’s add semantics with the JimpleInterpreterVisitor class we wrote, which will bypass the AST by calling up the appro­priate methods. Since the root rule of our grammar is the program rule (see program: (statement)* EOF above), the tree traversal starts from it. To accom­plish this, we call the visitProgram method imple­mented by default on the JimpleInterpreterVisitor object, to the input of which we give an object of the ProgramContext class. The ANTLR imple­men­tation consists of calling the visitChildren (RuleNode node) that traverses all the descendant items of the given tree node, calling the visit method on each of them.

// code generated by ANTLR
public class JimpleBaseVisitor<T> extends AbstractParseTreeVisitor<T> implements JimpleVisitor<T> {
    @Override
    public T visitProgram(JimpleParser.ProgramContext ctx) {
        return visitChildren(ctx);
    }

    // other methods omitted for brevity
}

As you can see, JimpleBaseVisitor is a generic class for which you need to define the type of processing for each node. In our case, this is the Object class, since expres­sions can return values of different types. Typically, an expression must return a value, while a statement returns nothing – this is their difference. In case of statement, we can return null. However, in order not to acciden­tally encounter NullPointerException, instead of null we will return an object of type Object, which is globally defined in the JimpleInterpreter class:

public static final Object VOID = new Object();

JimpleInterpreterVisitor class extends class JimpleBaseVisitor, overriding only the methods we are inter­ested in. Let’s consider the imple­men­tation of the built-in println operator, which is described in the grammar as println: 'println' expression ;. The first thing we need to complete is to evaluate the expression, for this we need to call the visit method and pass the expression object from the current PrintlnContext into it. In the visitPrintln method, we are absolutely not inter­ested in a calcu­lation process of an expression; the corre­sponding method is respon­sible for calcu­lating each rule (context). For example, the visitStringExpr method is used to evaluate a string literal.

public class JimpleInterpreterVisitor extends JimpleBaseVisitor<Object> {
    @Override
    public Object visitPrintln(final JimpleParser.PrintlnContext ctx) {
        final Object result = visit(ctx.expression());
        System.out.println(result);
        return VOID;
    }

    @Override
    public Object visitStringExpr(final JimpleParser.StringExprContext ctx) {
        // return string literal
        return cleanStringLiteral(ctx.STRING_LITERAL().getText());
    }

    private String cleanStringLiteral(final String literal) {
        // clear string from quotation marks
        return literal.length() > 1 ? literal.substring(1, literal.length() - 1) : literal;
    }

    // other methods omitted for brevity
}

By imple­menting only these methods, the inter­preter already supports println and string literals, which allows us to execute println "Hello, Jimple!" code.

Run inter­preter

To launch the inter­preter, you need to create a standard main method, which after some small checks with the use of the JimpleInterpreter class, will run our code:

public class MainApp {
    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("usage: jimple input.jimple");
            System.exit(1);
        }

        Path path = Paths.get(args[0]);
        if (!Files.exists(path)) {
            System.err.println("File not found: " + path);
            System.exit(1);
        }

        new JimpleInterpreter().eval(path);
    }
}

Imple­men­tation details

There is no need to provide the entire imple­men­tation code of the inter­preter, a link to the source code can be found at the end of the article. However, I want to examine some inter­esting details.

As already mentioned, the inter­preter is based on the Visitor pattern, which visits the nodes of the AST tree and executes the corre­sponding instruc­tions. In the process of code execution in the current context it appears that new identi­fiers (names of variables and/or functions) need to be stored somewhere. To do this, we will write a JimpleContext class that will store not only these identi­fiers, but also the current context for executing nested code blocks and functions, since a local variable and/or function parameter must be deleted after leaving their scope.

@Override
public Object handleFunc(FunctionSignature func, List<String> parameters, List<Object> arguments, FunctionDefinitionContext ctx) {
    Map<String, Object> variables = new HashMap<>(parameters.size());
    for (int i = 0; i < parameters.size(); i++) {
        variables.put(parameters.get(i), arguments.get(i));
    }
    // create a new function scope and push it onto the stack
    context.pushCallScope(variables);

    // execution of function expressions omitted for brevity
    // remove the scope of function parameters from the stack
    context.popCallScope();

    return functionResult;
}

In our language, a variable stores a value of a type that is deter­mined at runtime. Further, in the following instruc­tions, this type may change. Thus, we got a dynam­i­cally typed language. However, some checking of types is still present in cases where performing the operation is pointless. For example, a number cannot be divided by a string.

Why do we need two passes?

The original version of the inter­preter was to implement a method for each rule. For example, if a function decla­ration processing method finds a function with the same name (and number of parameters) in the current context, then an exception is thrown, otherwise the function is added to the current context. The function call method works the same way. If the function is not found, then an exception is thrown, otherwise the function is called. This approach works, but it does not allow you to call the function before it is defined. For example, the following code will not work:

var result = 9 + 10

println "Result is " + add(result, 34)

fun add(a, b) {
    return a + b
}

In this case, we have two approaches. The first one is to require functions to be defined before they are used (not very conve­nient for users of the language). The second approach is to perform two passes. The first pass is needed for finding out all the functions that have been defined in the code. And the second is directly for executing the code. In my imple­men­tation, I chose the second approach. The imple­men­tation of the visitFunctionDefinition method should be moved to a separate class that extends the generated class JimpleBaseVisitor<T> that is already known to us.

// class finds all functions in the code and registers them in the context
public class FunctionDefinitionVisitor extends JimpleBaseVisitor<Object> {
    private JimpleContext context;
    private FunctionCallHandler handler;

    public FunctionDefinitionVisitor(JimpleContext context, FunctionCallHandler handler) {
        this.context = context;
        this.handler = handler;
    }

    @Override
    public Object visitFunctionDefinition(JimpleParser.FunctionDefinitionContext ctx) {
        String name = ctx.name.getText();
        List<String> parameters = ctx.IDENTIFIER().stream().skip(1).map(ParseTree::getText).toList();
        var funcSig = new FunctionSignature(name, parameters.size());
        context.registerFunction(funcSig, (func, args) -> handler.handleFunc(func, parameters, args, ctx));
        return VOID;
    }
}

Now we have a class that we can use before executing the inter­preter class directly. It will fill our context with the defin­i­tions of all the functions that we will call in the inter­preter class.

What does an AST look like?

In order to visualize the AST, you need to use the grun utility (see above). To do this, run grun with the parameters Jimple program -gui (the first parameter is the name of the grammar, the second is the name of the rule). As a result, a window with the AST tree will be opened. Before executing this utility, it is important to compile the code generated by ANTLR.

 # generate a parser
antlr4.sh Jimple.g4
# compile the generated code
javac -cp ".:/usr/lib/jvm/antlr-4.12.0-complete.jar" Jimple*.java
# run grun
grun.sh Jimple program -gui
# enter code: `println "Hello, Jimple!"`
# press Ctrl+D (Linux) or Ctrl+Z (Windows)
For the Jimple code println "Hello, Jimple!", the following AST will be generated:

Summary

In this article you have got acquainted with such concepts as lexical and parser analyzers. We used the ANTLR tool to generate such analyzers. You have also learned how to write ANTLR grammar. As a result, we created a simple language, namely, developed an inter­preter for it. As a bonus, we managed to visualize the AST.

The entire source code of the inter­preter can be viewed by link.