The coursework required building a Lexer and a Parser for an imaginary language, Z. The specification for the language can be found here. As I mentioned in the title, the tools that I used were JFlex and CUP.

Let’s now dive into the theoretical aspects of the project. First, what is a compiler?

What is a compiler?

A compiler is a computer program that translates code from a source language (usually a high-level programming language) into a target language (usually a low-level language), in order to create an executable program.

A typical compiler achieves this by following a series of steps:

  1. Lexical Analysis: it takes a stream of characters (the source code) and outputs a stream of tokens.

  2. Syntax Analysis: analyses the syntactic structure of the given token stream in order to check that the source code is syntactically correct according to the rules of the source language, and outputs an abstract syntax tree.

  3. Semantic Analysis: takes the abstract syntax tree and ‘decorates’ it, checking for type errors and performing object bindings.

  4. Intermediate Code Generation: converts the decorated abstract syntax tree into an intermediate representation (a lower-level form such as three-address code) so that it can be optimized.

  5. Optimization: applies different optimization techniques to the intermediate representation of the source code.

  6. Code Generation: takes the optimized intermediate representation and converts it into the target language.

The diagram below helps illustrate this process.

High-Level Structure Of A Compiler

Now that we have a brief overview of how a compiler works, let’s dive deeper into the two phases that we are interested in for this article: Lexical Analysis and Syntax Analysis.

Lexical Analysis


Lexical Analysis is the process of converting a stream of characters into a sequence of tokens. A token represents a string with an assigned meaning that describes a series of related lexemes. A lexeme is a contiguous sequence of characters that form a lexical unit in the grammar of a language. (Intuitively, it can be viewed as a “word” in English). In order to understand these definitions, let’s consider the table below:

Lexemes                  Token Type                      Meaning
if, while, for, class    IF, WHILE, FOR, CLASS           Keywords
aux, it, temp            IDENTIFIER                      Identifiers
192, 15                  INT                             Integers
(, }, ;                  LPAREN, RBRACE, SEMICOLON       Symbols
%, +, -, ==, =           MOD, PLUS, MINUS, EQ, ASSIGN    Operators
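
To make the lexeme/token distinction concrete, here is one way it could be modelled in Java (purely illustrative and hypothetical; the actual lexer we build later reuses java_cup.runtime.Symbol instead of a custom class):

public class TokenDemo {
    // A made-up token type mirroring the table above.
    enum TokenType { IF, WHILE, FOR, CLASS, IDENTIFIER, INT, LPAREN, RBRACE, SEMICOLON, MOD, PLUS, MINUS, EQ, ASSIGN }

    // A token pairs a token type with the lexeme it was produced from.
    record Token(TokenType type, String lexeme) { }

    public static void main(String[] args) {
        Token keyword    = new Token(TokenType.IF, "if");
        Token identifier = new Token(TokenType.IDENTIFIER, "aux");
        Token integer    = new Token(TokenType.INT, "192");
        System.out.println(keyword + " " + identifier + " " + integer);
    }
}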

Reiterating, a lexer should take a stream of characters from the source code and output a sequence of tokens. For example, consider the following source code:

if ( a % 2 == 0 ) { b = 5 }

Our lexer should produce the sequence:

IF LPAREN ID(a) MOD INT(2) EQ INT(0) RPAREN LBRACE ID(b) ASSIGN INT(5) RBRACE

So how can our lexer do this? In other words, how can we describe and recognize the lexemes in a programming language?

How to describe and recognize the lexemes in a programming language?

Lexemes are described using Regular Expressions and recognized by Finite State Machines. For instance, the following regular expressions could be used to define a language:

Regular Expression R              Language of R
(if | while | for | class …)      {if, while, for, class …}
[a-zA-Z_][a-zA-Z0-9_]*            C++ Identifiers
nat  = 0 | [1-9][0-9]*            {0, 1, 2 …}
int  = -?nat                      {… -2, -1, 0, 1, 2 …}
real = int(.nat)?                 The set of real numbers

Note: ‘*’ means zero or more repetitions, ‘+’ means one or more repetitions and ‘?’ means the preceding element may or may not appear.

The union of all of these regular expressions forms the regular expression for an imaginary language. Let’s call this new regular expression R.
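
As a quick sanity check of the nat, int and real expressions, we can try them out with java.util.regex (a small sketch of mine, not part of any JFlex file; JFlex has its own regex syntax, but these particular patterns behave the same in Java):

import java.util.regex.Pattern;

public class RegexCheck {
    public static void main(String[] args) {
        Pattern nat  = Pattern.compile("0|[1-9][0-9]*");
        Pattern intP = Pattern.compile("-?(0|[1-9][0-9]*)");
        Pattern real = Pattern.compile("-?(0|[1-9][0-9]*)(\\.(0|[1-9][0-9]*))?");

        System.out.println(nat.matcher("12").matches());     // true
        System.out.println(nat.matcher("012").matches());    // false: no leading zeros
        System.out.println(intP.matcher("-7").matches());    // true
        System.out.println(real.matcher("3.14").matches());  // true
        System.out.println(real.matcher("12").matches());    // true: 12 also matches real
    }
}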

Ambiguity

It can be seen from the table that R matches overlapping sets (like int and nat), causing ambiguity. For instance, the number 12 can be recognized in three ways: as nat, as int or as real. This is not a problem for our lexer, as parsing and semantic analysis will take care of this case, but what if we encounter the following character stream:

fora=0;

This can be interpreted in two ways, each leading to different token streams:

for | a | = | 0 | ;   => FOR IDENTIFIER(a) ASSIGN NUM(0) SEMICOLON

or

fora | = | 0 | ;     => IDENTIFIER(fora) ASSIGN NUM(0) SEMICOLON

How can our lexer decide which rule to use (either IDENTIFIER or FOR)? An intuitive fix would be to use the rule that matches more characters in the source code. In our case, IDENTIFIER will be used because it matches 4 characters (fora), compared to FOR, which only matches 3 characters (for). This solution is called maximal munch. However, there is another case of ambiguity that can trick our lexer. For the stream:

for=0;

What token stream should the lexer produce?

FOR ASSIGN NUM(0) SEMICOLON

or

IDENTIFIER(for) ASSIGN NUM(0) SEMICOLON

Note that even though the above code is syntactically invalid, we are currently lexing, so we do not check for syntax errors at this stage; the above example is therefore a case that our lexer must handle.

To solve this, we are going to use rule priority, that is, specifying which rule should be applied when two or more regular expressions match the same number of characters from the string.
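
To make maximal munch and rule priority concrete, here is a small sketch of the matching loop that a generated lexer effectively performs (a simplification of mine that uses java.util.regex instead of the DFA that a real lexer generator builds; the rule names are made up):

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunchDemo {
    // Rules in priority order: on a tie in match length, the earlier rule wins.
    record Rule(String token, Pattern pattern) { }

    static final List<Rule> RULES = List.of(
            new Rule("FOR",        Pattern.compile("for")),
            new Rule("IDENTIFIER", Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*")),
            new Rule("NUM",        Pattern.compile("0|[1-9][0-9]*")),
            new Rule("ASSIGN",     Pattern.compile("=")),
            new Rule("SEMICOLON",  Pattern.compile(";")));

    public static void main(String[] args) {
        tokenize("fora=0;");   // IDENTIFIER(fora) ASSIGN(=) NUM(0) SEMICOLON(;)  -> maximal munch
        tokenize("for=0;");    // FOR(for) ASSIGN(=) NUM(0) SEMICOLON(;)          -> rule priority
    }

    static void tokenize(String input) {
        int pos = 0;
        StringBuilder out = new StringBuilder();
        while (pos < input.length()) {
            Rule best = null;
            int bestLength = 0;
            for (Rule rule : RULES) {
                Matcher m = rule.pattern().matcher(input).region(pos, input.length());
                // Longest match wins; ties go to the earlier rule because of the strict '>'.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    best = rule;
                    bestLength = m.end() - pos;
                }
            }
            if (best == null) throw new Error("Illegal character <" + input.charAt(pos) + ">");
            out.append(best.token()).append("(").append(input, pos, pos + bestLength).append(") ");
            pos += bestLength;
        }
        System.out.println(out);
    }
}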

At this stage, we have a more detailed picture of how our Lexer works. Starting from a regular expression R that describes a language’s rules, our Lexer will create a Finite State Machine that can recognize every string in R’s language. From R, an NFA that recognizes its language can be constructed (using Thompson’s Construction). However, computing NFA acceptance is more expensive than performing the same operation on a DFA, so the generated NFA is converted into an equivalent DFA. This DFA then gets minimized, due to the same performance consideration. The minimized DFA decides whether or not a character stream is part of R’s language and, based on the simulation of this DFA, the token stream and potential errors are output.
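
For intuition, this is what simulating such a DFA looks like in code, here for nothing more than the nat = 0 | [1-9][0-9]* expression (a hand-built toy of mine; JFlex constructs and encodes the tables for the whole of R automatically):

public class NatDfa {
    // States: 0 = start, 1 = seen "0" (accepting), 2 = seen [1-9][0-9]* (accepting), 3 = dead.
    static int step(int state, char c) {
        switch (state) {
            case 0:  return c == '0' ? 1 : (c >= '1' && c <= '9' ? 2 : 3);
            case 2:  return (c >= '0' && c <= '9') ? 2 : 3;
            default: return 3;   // from state 1 or the dead state, any character leads to the dead state
        }
    }

    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) state = step(state, c);
        return state == 1 || state == 2;
    }

    public static void main(String[] args) {
        System.out.println(accepts("0"));    // true
        System.out.println(accepts("120"));  // true
        System.out.println(accepts("012"));  // false: leading zero
        System.out.println(accepts(""));     // false
    }
}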

Visual representation of the steps described above

Lexer

Now that we understand what we want to achieve and roughly how we can achieve it, let’s get to the implementation. We are going to use JFlex for our lexer.

JFlex

According to the official documentation,

JFlex is a lexical analyzer generator for Java, written in Java. It takes as input a specification with a set of regular expressions and corresponding actions. It generates a program (a lexer) that reads input, matches the input against the regular expressions in the spec file, and runs the corresponding action if a regular expression matched.

Therefore, in order to produce our lexer, we have to provide a specification file containing a set of regular expressions and corresponding actions. An action is Java code that will be executed when a certain rule is matched. A JFlex specification file contains 3 sections, separated by ‘%%’, so it has the following structure:

user code                 -> Copied to the top of the generated lexer class (usually import statements)
%%
JFlex directives          -> Contains options and macros 
%%
regular expression rules  -> Rules and the corresponding actions

A rule in JFlex has the syntax [<states>] <expression> {<action>}, where the (optional) lexical states are written in angle brackets. For example, the following would be a valid rule: <YYINITIAL> -?[0-9]+ { return INT_TOKEN; }.

To enforce rule priority, which is used for dealing with ambiguity, JFlex has a simple mechanism: the earlier a rule appears, the higher its priority. JFlex also has lexical states that specify in which states a certain rule can be matched, the default state being YYINITIAL; this is an advanced feature that goes beyond the scope of this article.

We are now ready to implement the lexer for our language Z. But where to start? A wise decision would be to follow the structure of a JFlex specification file. Considering this, our first section would only need the following import statement:

import java_cup.runtime.*;

Next, we can start defining our options.

Defining options

We have a series of requirements for our lexer: the generated class should be named Lexer; there must be support for Unicode characters; JFlex must know that we are going to use CUP to create the parser; and, because the exact location (line and column) of potential syntax errors should be reported by our parser, our lexer must record the line and column of each token. This can be achieved by the following code sequence:

%class Lexer                              
%unicode
%cup
%line
%column

%{
    private Symbol symbol(int type) {
        return new Symbol(type, yyline, yycolumn);
    }

    private Symbol symbol(int type, Object value) {
        return new Symbol(type, yyline, yycolumn, value);
    }
%}

We can now begin defining macros. The first logical step would be to define the types from our language.

Defining Types

From the specification file we can see that our language has the following types: bool, char, int, rat and float. We can look at points 6, 7 and 8 of the file to see what values each type contains and then come up with valid regular expressions that match those values. So let’s define them:

Character = '[a-zA-Z]' | '!\"#\$%&\'()\*\+\,\-\.\/:;<=>\?@\[\]\\\^_`{}\~�' | '[0-9]'
Integer = 0|[1-9][0-9]*
Float = {Integer}(\.[0-9]*)?
Rational = {Integer}"/"[0-9][0-9]* | ({Integer}_)[0-9]"/"[0-9][0-9]*
Number = {Integer} | {Rational} | {Float}
Boolean = T | F 
String = \"(\\.|[^\"])*\"

Note that \ is an escape character, ‘|’ represents alternation (or) and [^"] means any character except ".

The above definitions are pretty straightforward. Let’s consider the regex for Character. Our specification file says that:

A character is a single letter, punctuation symbol, or digit wrapped in ‘’ and has type char S. The allowed punctuation symbols are space (See http://en.wikipedia.org/wiki/Punctuation) and the ASCII symbols, other than digits, on this page http://www.kerryr.net/pioneers/ascii3.htm .

So our definition starts with single letters wrapped in ‘ ’, followed by the allowed punctuation symbols, again wrapped in ‘ ’, and of course by digits. An Integer is either a 0, or a number that starts with a digit from 1 to 9 followed by zero or more other digits, so it is clear that our regex matches the definition of integers correctly.

Notice that we have used [0-9] multiple times in our regexes. We would gain a lot of readability by creating a macro named Digit and using that instead of [0-9]. The same goes for letters and punctuation, so let’s do that.

Letter = [a-zA-Z]
Digit = [0-9]
Punctuation = [!\"#\$%&\'()\*\+\,\-\.\/:;<=>\?@\[\]\\\^_`{}\~�]

Character = '{Letter}' | '{Punctuation}' | '{Digit}'
Integer = 0|[1-9]{Digit}*
Float = {Integer}(\.{Digit}*)?
Rational = {Integer}"/"{Digit}{Digit}* | ({Integer}_){Digit}"/"{Digit}{Digit}*
Number = {Integer} | {Rational} | {Float}
Boolean = T | F 
String = \"(\\.|[^\"])*\"

Now that our code is easier to read and understand, let’s continue by defining identifiers and comments.

Defining identifiers and comments

Point 4 of the specification file says that

An identifier starts with a letter, followed by an arbitrary number of underscores, letters, or digits. Identifiers are case-sensitive. Punctuation other than underscore is not allowed.

By using the expressions defined above, we can easily find a regex for the identifiers in our language (i.e. Identifier = {Letter}({Letter}|{Digit}|"_")*), which can be rewritten as:

IdentifierCharacter = {Letter} | {Digit} | "_"
Identifier = {Letter}{IdentifierCharacter}*

A comment can be either a multi-line comment or a one-line comment (let’s call this an end-of-line comment). So, Comment = {MultilineComment} | {EndOfLineComment}. A one-line comment starts with ‘#’ and can contain any character, but no more than one line terminator, because it can only span one line. Therefore, we have to define the line terminators and the characters that can appear inside a comment separately. Multi-line comments start with ‘/#’, end with ‘#/’ and can contain any character, but they require 2 separate definitions. Putting it all together, we get the following lines:

LineTerminator = \r|\n|\r\n
InputCharacter = [^\r\n]
Comment = {MultilineComment} | {EndOfLineComment}
MultilineComment = "/#" [^#] ~"#/" | "/#" "#" + "/"
EndOfLineComment = "#" {InputCharacter}* {LineTerminator}?

There may be a case in which the comment is the last line in the file; that is why the LineTerminator may or may not appear.

The only macro left for us to define is whitespace, which has an easy definition:

WhiteSpace     = {LineTerminator} | [   \t\f]   // \t means tab and \f is a Java escape character called form feed, used to indicate to a printer that it should start a new page.

With this extra definition, we also completed the second part of our specification, leaving only the rule definitions to complete the entire file.

Defining rules

We must keep in mind that the order of the rules matters: the earlier a rule is declared, the higher its priority. Also, as mentioned before, our entire logic happens inside one state, the initial state. The first thing we should do is define the rules for our operators. Below you will find the rules for a couple of them:

":="               { return symbol(sym.EQ); }
"::"               { return symbol(sym.CONCAT); }
":"                { return symbol(sym.POINTS); }
"="                { return symbol(sym.EQEQ); }
"!"                { return symbol(sym.NOT); }
"+"                { return symbol(sym.PLUS); }

Note that the symbols that we are returning (sym.EQ, sym.PLUS and so on) will be later defined in our CUP specification.

Then we will need to define the LOW and HIGH labels and the list of keywords in our language.

"L"                { return symbol(sym.L); }
"H"                { return symbol(sym.H); }

"int"              { return symbol(sym.INTEGER); }
"char"             { return symbol(sym.CHARACTER); }
"rat"              { return symbol(sym.RATIONAL); }
"seq"              { return symbol(sym.SEQ); }
"top"              { return symbol(sym.TOP); }
"main"             { return symbol(sym.MAIN); }
"tdef"             { return symbol(sym.TYPE_DEF); }
"fdef"             { return symbol(sym.FUNCTION_DEF); }

// Only a couple of the keywords are declared in this section

The last section of our rules definitions is for the macros that we previously specified:

{Boolean}          { return symbol(sym.BOOLEAN); }
{String}           { return symbol(sym.STRING); }
{Number}           { return symbol(sym.NUMBER); }
{Character}        { return symbol(sym.CHARACTER); }
{Identifier}       { return symbol(sym.IDENTIFIER);}
{Comment}          { /* ignore*/ }
{WhiteSpace}       { /* ignore */ }

And to conclude our document, we need the following line, to cover illegal characters:

[^]                    { throw new Error("Illegal character <"+yytext()+">"); }

Let’s check an example of how our lexer handles ambiguity. Consider the code:

loop:=0;

Our lexer will output the following token stream, because the rule for the keyword ‘loop’ is declared earlier than the rule for IDENTIFIER, and therefore has higher priority:

LOOP EQ NUMBER(0)

This is what we want because, by using LOOP and not IDENTIFIER to match the string “loop”, we can detect during syntax analysis that there is an error. If the lexer had chosen IDENTIFIER instead, the source code would have been accepted as valid, which would be incorrect. That is why all the rules for keywords must appear before the rule for identifiers. As for the other example given earlier, JFlex will always use maximal munch.

Now all we have left to do is combine every section into our final specification and complete the rules for the other operators and keywords. The final version of our lexer can be found here.
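
Before moving on to parsing, here is a minimal sketch of how the generated lexer could be driven on its own to print the token stream (my own test harness, not part of the coursework code; it assumes the Lexer class defined above and the sym class that CUP will generate later, and relies on the next_token() method that JFlex provides because of the %cup option):

import java.io.FileReader;
import java_cup.runtime.Symbol;

public class LexerDemo {
    public static void main(String[] args) throws Exception {
        Lexer lexer = new Lexer(new FileReader(args[0]));   // path to a Z source file
        for (Symbol s = lexer.next_token(); s.sym != sym.EOF; s = lexer.next_token()) {
            // s.sym is the numeric token code, matching the constants of the CUP-generated sym class
            System.out.print(s.sym + " ");
        }
        System.out.println();
    }
}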

Syntax Analysis (Parsing)


As a general definition, parsing is the process of breaking a data block into smaller chunks by following a specific set of rules, so that it can be more easily managed or interpreted. In our context, we can define parsing as the process of taking a linear sequence of characters and building a data structure from it, usually in the form of a parse tree or an abstract syntax tree, in order to facilitate translation into another language.

In compilers, the goal of a parser is to take a token stream and check that the tokens form an allowable expression according to the source code language specification. This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear.

So, a parser must determine if the language’s grammar accepts the given token stream, then build a parse tree and pass it on to the rest of the compiler. How can this be done? Let’s first consider context-free grammars and see the relation between them and parsing.

Context-free grammar

A context-free grammar is a 4-tuple G = (N, T, S, R), where:

  1. N is a finite set of nonterminals (by convention, uppercase letters)
  2. T is a finite set of terminals (by convention, lowercase or punctuation)
  3. S ∈ N is the start symbol
  4. R ⊆ N × (N ∪ T)* is a finite relation whose elements are called the productions or rules of the grammar

Let’s see an example:

E -> ε
E -> Y1Y2...Yn, where E ∈ N and Yi ∈ N ∪ T ∪ {ε}

Note that the production E -> Y1Y2…Yn means that E can be replaced by Y1Y2…Yn.

A quick note on terminals: they are so called because there are no rules for replacing them (they only appear on the right-hand side of rules). Once generated, they are permanent, and they are often lexemes in the language!

The language of a grammar is the set of all strings of terminals that the grammar can generate. For instance:

Let G = ({S}, {a,b}, S, P), with productions

S -> aSa
S -> bSb
S -> a
S -> b
S -> ε

The language of this grammar is the set of all palindromes over the alphabet {a, b}.

Derivations and Parse trees

A derivation is a sequence of sentential forms connected by the application of a single production rule, starting from S.

X1 ··· Xa ··· Xb ··· Xn → X1 ··· Yc ··· Xb ··· Xn → X1 ··· Yc ··· Yd ··· Xn → ··· → Y1 ··· Ym

where each Xi → Yj ∈ R

Note: the intermediate strings that mix terminals and nonterminals are called sentential forms.

A derivation can be drawn as a tree: the start symbol is the root and, for each production X → Y1 ··· Yn that is applied, the symbols Y1, ··· , Yn are added as children of X.

Let’s consider an example:

E -> E + E 
E -> E * E
E -> (E)
E -> id

And the string id * id + id

Derivation of id * id + id

There are two types of derivations:

  1. Left-most derivation -> at each step, replace the left-most nonterminal (as in the example above)
  2. Right-most derivation -> at each step, replace the right-most nonterminal

Why is this important? For unambiguous grammars, left-most and right-most derivations produce the same parse tree, but there is a difference in the order in which the branches are added, so the intermediate sentential forms can be quite different.

Ambiguity

Considering the same example grammar and the same input string, we can see that there are several possible parse trees, depending on which of E’s productions we choose first.

Two possible parse trees for id * id + id

A grammar is said to be ambiguous if it has more than one leftmost (or rightmost) derivation, and hence more than one parse tree, for some input string. Ambiguity is bad, because it can leave programs ill-defined. For example, consider:

id * id + id instantiated as 6 * 4 + 2
E + E first results in 24 + 2 = 26
E * E first results in 6 * 6 = 36

Ambiguity can be dealt with by rewriting the grammar to be unambiguous. For example, this definition of the same grammar enforces the precedence of ‘*’ over ‘+’:

E  -> E' + E | E'
E' -> id * E' | id | (E) * E' | (E)

However, it is not always possible to convert an ambiguous grammar to an unambiguous one. Instead of rewriting the grammar, we can use disambiguating methods (defining precedence and associativity annotations).

So, how do we build such a parse tree? There are two main methods: top-down parsing and bottom-up parsing. Even though CUP uses bottom-up parsing, and top-down parsing is therefore not the main focus of this article, I will still provide a small overview of top-down parsing before focusing on bottom-up parsing.

Top-Down Parsing

Terminals are seen in order of appearance in the token stream, and the parse tree is constructed from the top and from left to right. Let’s consider an example:

E -> T + E | T
T -> int | int * T | (E)

And the token stream int1 * int2.

Top-down parsing starts with the top-level nonterminal E and then tries the rules for E in order. That is, first try E0 -> T1 + E2. Then try a rule for T1 -> (E3), but ‘(’ does not match the input token int1.

Now try T1 -> int. This matches the token, but the ‘+’ after T1 does not match the input token ‘*’.

Try T1 -> int * T2; this matches the token int1, but the ‘+’ after T1 will again be unmatched.

So, having exhausted the choices for T1, we now backtrack to another choice for E0. Repeat this process until the whole token sequence is matched.

There are cases in which this simple method does not work: for left-recursive grammars (like S -> Sa | b), the recursion never ends. This can be solved by eliminating left recursion and rewriting the grammar to use right recursion (for our example, S -> bS' and S' -> aS' | ε). This procedure is detailed here.

In summary, top-down parsing is a simple and general parsing strategy, but it is unpopular because of backtracking. Backtracking can often be avoided by using predictive parsers. For this article, we are not interested in more details about predictive parsers, but further reading can be done here.
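
To see the idea in code, here is a tiny backtracking recognizer of mine for the toy grammar above (E -> T + E | T and T -> int | int * T | (E)). It only answers yes/no rather than building a tree, and the alternatives inside parseT are tried longest-first because this naive scheme never re-tries choices inside a call that has already succeeded:

import java.util.List;

public class TopDownDemo {
    private final List<String> tokens;   // e.g. ["int", "*", "int"]
    private int pos;

    TopDownDemo(List<String> tokens) { this.tokens = tokens; }

    private boolean eat(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equals(expected)) { pos++; return true; }
        return false;
    }

    // E -> T + E | T
    private boolean parseE() {
        int save = pos;
        if (parseT() && eat("+") && parseE()) return true;   // try the first alternative
        pos = save;                                          // backtrack
        return parseT();                                     // then the second
    }

    // T -> int * T | int | ( E )
    private boolean parseT() {
        int save = pos;
        if (eat("int") && eat("*") && parseT()) return true;
        pos = save;
        if (eat("int")) return true;
        pos = save;
        return eat("(") && parseE() && eat(")");
    }

    boolean parse() { pos = 0; return parseE() && pos == tokens.size(); }

    public static void main(String[] args) {
        System.out.println(new TopDownDemo(List.of("int", "*", "int")).parse());  // true
        System.out.println(new TopDownDemo(List.of("int", "+")).parse());         // false
    }
}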

Bottom-Up Parsing

Bottom-up parsing handles more grammars than top-down parsing, whilst being just as efficient. Moreover, it does not require left-factored grammars and can handle left-recursive grammars. It is also called LR parsing, where:

  • L means that tokens are read from left to right
  • R means that it constructs a rightmost derivation

The Idea

Bottom-up parsing reduces an input string to the start symbol by inverting productions.

str <- input string of terminals
repeat
    Identify the rightmost substring b of the examined part of str such that there is a production A -> b.
    Replace b by A in str.
until str = S (starting symbol)

This can be achieved by splitting the string into two substrings:

  • Right substring (a string of terminals), representing the unexamined part
  • Left substring, containing terminals and non-terminals, representing the examined part

The dividing point is marked by a special character (let’s use |), which is not part of the string. There are two kinds of actions that need to be performed: Shift and Reduce.

Shift: Shifts a terminal to the left string (Moves the special character | one place to the right).

Reduce: Applies an inverse production at the right end of the left string. If E -> E + (E) is a production, then reducing the string E + (E + (E)|) gives E + (E |).

Let’s consider a shift reduce example for the following grammar:

E -> E + (E) | int

And the string:

int + (int) + (int)
1. | int + (int) + (int)$       shift 
2. int | + (int) + (int)$       reduce, using production E -> int
3.  E | + (int) + (int)$        shift
4.  E + | (int) + (int)$        shift
5.  E + (| int) + (int)$        shift
6.  E + (int |) + (int)$        reduce, using production E -> int
7.  E + (E |) + (int)$          shift
8.  E + (E) | + (int)$          reduce, using production E -> E + (E)
9.  E | + (int)$                shift
10. E + | (int)$                shift
11. E + ( |int)$                shift
12. E + (int |)$                reduce, using production E -> int
13. E + (E |)$                  shift
14. E + (E)|$                   reduce, using production E -> E + (E)
15. E |$                        accept

The left string can be implemented as a stack, with shift pushing a terminal onto the stack. Reduce then pops zero or more symbols from the stack and pushes a non-terminal onto the stack. The only issue is deciding when to shift and when to reduce. This can be done based on the stack’s content and the lookahead terminal, using a finite state machine (FSM). The input of the FSM is the stack, read as a string of terminals and non-terminals from the grammar. The FSM is run on the stack, and the resulting state and the token after the special character | (let’s call them S and t) are examined. If S has a transition with the label t, then shift; if S is labeled with A -> b on t, then reduce. Let’s see an example:

Example of the parsing FSM

This state machine can be implemented as a two-dimensional table, with the states being rows and the terminals and non-terminals being columns. We will not go into further detail on this topic, but more information can be found on the web and useful examples can be found here and here. More than this, there can be conflicts while constructing the finite state machine, caused by ambiguity in the grammar. We are not going to cover conflicts, but examples can be found here. For our article, it is important to know that conflicts are resolved by declaring precedence and associativity.
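
To make the stack-based formulation concrete, here is a toy shift-reduce recognizer of mine for the grammar E -> E + (E) | int. The shift/reduce decisions are hard-wired for this one grammar (reduce whenever a complete right-hand side sits on top of the stack), whereas a real LR parser reads these decisions off the table built from the FSM:

import java.util.ArrayList;
import java.util.List;

public class ShiftReduceDemo {
    public static void main(String[] args) {
        // The token stream for: int + (int) + (int)
        List<String> input = new ArrayList<>(List.of("int", "+", "(", "int", ")", "+", "(", "int", ")"));
        List<String> stack = new ArrayList<>();   // the "left string" from the description above

        while (true) {
            if (topIs(stack, List.of("int"))) {                              // reduce E -> int
                reduce(stack, 1);
            } else if (topIs(stack, List.of("E", "+", "(", "E", ")"))) {     // reduce E -> E + (E)
                reduce(stack, 5);
            } else if (!input.isEmpty()) {                                   // otherwise shift
                stack.add(input.remove(0));
            } else {
                break;                                                       // nothing left to do
            }
            System.out.println(stack + " | " + input);
        }
        System.out.println(stack.equals(List.of("E")) ? "accept" : "reject");
    }

    // True if the top of the stack ends with the given right-hand side.
    static boolean topIs(List<String> stack, List<String> rhs) {
        int n = stack.size(), k = rhs.size();
        return n >= k && stack.subList(n - k, n).equals(rhs);
    }

    // Pop the k symbols of the right-hand side and push the non-terminal E.
    static void reduce(List<String> stack, int k) {
        for (int i = 0; i < k; i++) stack.remove(stack.size() - 1);
        stack.add("E");
    }
}

Running it goes through the same sequence of shifts and reductions as the trace above, ending with just E on the stack and an empty input.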

Parser generators construct the parsing FSM (which is a DFA) given a context-free grammar. However, they do not usually create the DFA for LR(1) parsing, because it has too many states, even for simple languages.

LALR(1)

Instead of having many similar states, like state 1 and state 5 in our example, that differ only in the lookahead tokens, LALR(1) parsers combine these similar states, resulting in roughly ten times fewer states than LR(1). Again, we will not cover how LALR(1) parsers work and how they are created; we just need to know that CUP generates an LALR(1) parser.

Let’s turn our attention to CUP and start writing our implementation of the parser.

CUP

Citing the documentation:

CUP stands for Construction of Useful Parsers and is an LALR parser generator for Java. It was developed by C. Scott Ananian, Frank Flannery, Dan Wang, Andrew W. Appel and Michael Petter. It implements standard LALR(1) parser generation. As a parser author, you specify the symbols of Your grammar (terminal T1,T2; non terminal N1, N2), as well as the productions (LHS :== RHS1 | RHS2 ).

A CUP specification contains 4 sections:

  1. Misc declarations and user code
  2. Symbol lists (those are terminals and non-terminals)
  3. Precedence and associativity declarations
  4. The grammar itself

Let’s start with the first section.

Misc declarations and user code

This section usually contains import statements and user-specific behaviour. For our parser, we just need to import the whole java_cup.runtime library and add code to report the exact position of a possible syntax error. So we begin by adding these lines to the specification file:

import java_cup.runtime.*;
/* Parser code to change the way the parser reports errors (include
   line and column number of the error). */
parser code {:
    public boolean syntaxErrors;

    public void syntax_error(Symbol current_token) {
        report_error(
            "Syntax error at line " + (current_token.left+1) + ", column "
            + current_token.right, null
        );
    }
:};

Symbol lists

Let’s first define the terminals, because they are the most straightforward to define. In our case, the terminals will be the tokens for operators and symbols, as well as the tokens for keywords, types and literals. This section is created in close relation to the rule definitions in the JFlex file. In fact, all the values returned by the rules in JFlex are defined here as terminals:

terminal        END;                                                                                                            // End Statement
terminal        EQ, CONCAT, POINTS, EQEQ, NOT_EQ, AND, OR, NOT, PLUS, MINUS, TIMES, DIVIDE, CARET;                              // Operators and Syntax
terminal        PARENTHESESLEFT, PARENTHESESRIGHT, BRACESLEFT, BRACESRIGHT,BRACKETSLEFT, BRACKETSRIGHT, LESS_THAN_EQUAL;        // Operators and Syntax
terminal        LESS_THAN, BIGGER_THAN, COMMA, DOT;                                                                             // Operators and Syntax
terminal        L, H;                                                                                                           // Security Labels
terminal        INTEGER, CHARACTER, RATIONAL, FLOAT, BOOLEAN, SEQ, TOP;                                                         // Types
terminal        MAIN, TYPE_DEF, FUNCTION_DEF, IN, ALIAS, IF, FI, THEN, ELSE, READ, PRINT, RETURN, BREAK, LOOP, POOL;            // Keywords
terminal        IDENTIFIER, NUMBER, STRING;                                                                                     // Literals

We will add non-terminals as we progress towards our final grammar, so, for the moment, we can look at the next section.

Precedence and associativity declarations

From the official documentation we can see that in CUP there are three types of precedence/associativity declarations:

precedence left     terminal[, terminal...];
precedence right    terminal[, terminal...];
precedence nonassoc terminal[, terminal...];

In contrast to JFlex, precedence in CUP works from the bottom up. That is, the earlier the declaration, the lower the precedence. For example, the definitions:

precedence left ADD, SUBTRACT;
precedence left TIMES, DIVIDE;

These say that multiplication and division have higher precedence and will therefore be performed before addition and subtraction. The terms ‘left’, ‘right’ and ‘nonassoc’ declare the three types of associativity in CUP. A very easy way to understand these types is by considering this example:

precedence left     SUBTRACT;      id-id-id => (id-id)-id
precedence right    POWER;         id^id^id => id^(id^id)
precedence nonassoc EQ;            id=id=id => ERROR!

Note: a more detailed explanation of precedence and associativity in CUP can be found here.

In our language, we have the following operations:

Operators                     Associativity
OR, AND                       Left
PLUS, MINUS, DIVIDE, TIMES    Left
CONCAT                        Left
CARET                         Right
IN                            Right
NOT                           Nonassoc
LESS_THAN, LESS_THAN_EQUAL    Nonassoc
BIGGER_THAN                   Nonassoc
EQEQ, NOT_EQ                  Nonassoc

From the list, we would like the nonassoc operators to have the highest priority, followed by the right-associative operators. When it comes to the left-associative operators, there are three different levels that we want to consider. First of all, OR, PLUS and MINUS should have the lowest priority, followed by AND, TIMES and DIVIDE. Finally, CONCAT must have the highest priority of the left-associative operators. This translates into:

precedence left OR, PLUS, MINUS;
precedence left AND, TIMES, DIVIDE;
precedence left CONCAT;
precedence right IN, CARET;
precedence nonassoc NOT, LESS_THAN, LESS_THAN_EQUAL, BIGGER_THAN, EQEQ, NOT_EQ;

Grammar

The final section, the grammar definition, is the most complex part of the whole project. Let’s break it into smaller parts and build up from there.

First of all, how to declare a rule? The syntax is as follows:

NON_TERMINAL ::= V1|V2|...|Vn;

where each of V1, V2, ..., Vn contains zero or more terminals or non-terminals

For example, defining the grammar:

E -> E + (E) | int

//CUP declaration, assuming the terminals are declared as PLUS, LPAREN, RPAREN and INT:
E ::= E PLUS LPAREN E RPAREN | INT;

Let’s go through the specification file for our language. The second section describes declarations. It says that:

The syntax of field or variable declaration is “id : type”. A data type declaration is tdef type_id { declaration_list } ; where declaration_list is a comma-separated list of field/variable declarations

Variable declaration is pretty straightforward: we need the token for an identifier, which we get from lexing and which is named IDENTIFIER, then ‘:’ followed by a type. Let’s try to come up with a non-terminal to match the data type declaration, which we are going to call a type definition to differentiate it from variable declaration. First of all we want the corresponding token for ‘tdef’, and that is TYPE_DEF, then an identifier followed by the token for ‘{’, a rule for the declaration_list, the token for ‘}’ and finally the token for ‘;’. This translates into:

typeDeclaration ::= IDENTIFIER POINTS type;
typeDefinition ::= TYPE_DEF IDENTIFIER BRACESLEFT declaration_list BRACESRIGHT END;

We now need to define a rule for type, and one for declaration_list. We can split our types into sequence types and basic types. Let’s call those two seqType and securityType (because of the security label). We come up with this:

type ::= seqType | securityType;
securityType ::= CHARACTER L | CHARACTER H | INTEGER L | INTEGER H | RATIONAL  L | RATIONAL H | FLOAT L | FLOAT H | BOOLEAN L | BOOLEAN H | IDENTIFIER ;
seqType ::= SEQ LESS_THAN auxType BIGGER_THAN;
auxType ::= type | TOP L | TOP H;

For declaration_list, the grammar is not very complicated. The declaration list is either empty, or it contains one or more type declarations separated by commas. As we already have a rule for type declaration, the declaration_list rule is:

declaration_list ::= typeDeclaration | typeDeclaration COMMA declaration_list |;

Point number 15 in the specification file describes function definitions, which can be translated into:

function ::= FUNCTION_DEF IDENTIFIER PARENTHESESLEFT parameters PARENTHESESRIGHT BRACESLEFT codeBlock BRACESRIGHT returnType END;

If we pay attention, parameters have the same syntax as the declaration list from type definitions. Therefore, we can use a single rule for both. So, let’s rename the declaration_list rule to parameters and modify the typeDefinition rule as well.

type ::= seqType | securityType;
securityType ::= CHARACTER L | CHARACTER H | INTEGER L | INTEGER H | RATIONAL  L | RATIONAL H | FLOAT L | FLOAT H | BOOLEAN L | BOOLEAN H | IDENTIFIER ;
seqType ::= SEQ LESS_THAN auxType BIGGER_THAN;
auxType ::= type | TOP L | TOP H;
typeDeclaration ::= IDENTIFIER POINTS type;
typeDefinition ::= TYPE_DEF IDENTIFIER BRACESLEFT parameters BRACESRIGHT END;

function ::= FUNCTION_DEF IDENTIFIER PARENTHESESLEFT parameters PARENTHESESRIGHT BRACESLEFT codeBlock BRACESRIGHT returnType END;
parameters ::= typeDeclaration | typeDeclaration COMMA parameters |;

Now a codeBlock can contain both declarations and statements. So, we’ll get this:

codeBlock ::= declarationBlock statementBlock;

A declarationBlock is a list containing typeDefinitions, typeDeclarations, aliases, functions or initializations.

declarationBlock ::= declarationBlock declaration |;
declaration ::= typeDefinition
              | typeDeclaration END
              | typeDeclaration EQ expressionList END
              | alias
              | function
              ;
alias ::= ALIAS type type END;

Let’s go further and define the expressionList. What does it mean? An expressionList contains one or more expressions. Expressions are defined at point 16 in the Z language specification.

expressionList ::= expression | expression COMMA expressionList;
expression ::= MINUS expression
             | NOT expression
             | expression operator expression
             | expression comparator expression
             | PARENTHESESLEFT expression PARENTHESESRIGHT
             | seqValue
             | BOOLEAN
             | NUMBER
             | CHARACTER
             | identifier
             ;
	     
seqValue ::= BRACKETSLEFT seqReference BRACKETSRIGHT
           | BRACKETSLEFT BRACKETSRIGHT
           | STRING
           ;
	   
operator ::= PLUS | MINUS | TIMES | DIVIDE | CARET | CONCAT | IN;
comparator ::= AND | OR | EQEQ | LESS_THAN | BIGGER_THAN | LESS_THAN_EQUAL  | NOT_EQ;
seqReference ::= expression | expression COMMA seqReference;

Now, what is an identifier? An identifier is a fieldReference, optionally followed by ‘.’ and another identifier, so that fields of a seq or of a user-defined type can be referenced.

identifier ::= fieldReference DOT identifier | fieldReference ;
fieldReference ::= IDENTIFIER
             | IDENTIFIER BRACKETSLEFT expression BRACKETSRIGHT
             | IDENTIFIER BRACKETSLEFT expressionOrNull POINTS expressionOrNull BRACKETSRIGHT
             | functionCall
             | functionCall BRACKETSLEFT expression BRACKETSRIGHT
             | functionCall BRACKETSLEFT expressionOrNull POINTS expressionOrNull BRACKETSRIGHT
             ;
	     
expressionOrNull ::= expression |;	     
functionCall ::= IDENTIFIER PARENTHESESLEFT arguments PARENTHESESRIGHT;
arguments ::= expression | expression COMMA arguments |;	     

And a statementBlock is a block of one or more statements, with statements being described at section 2.3.

statementBlock ::= statement statementBlock | statement;
statement ::= if
            | loop
            | read
            | print
            | assignment
            | functionCall END
            | return
            ;

The next stage is to describe each statement. The exact syntax for the statements is found at point 17. The rules are pretty straightforward:

if ::= IF PARENTHESESLEFT expression PARENTHESESRIGHT THEN codeBlock FI
     | IF PARENTHESESLEFT expression PARENTHESESRIGHT THEN codeBlock ELSE codeBlock FI
     ;

assignment ::= identifier EQ expressionList END;
read ::= READ identifier END;
print ::= PRINT expression END;
loop ::= LOOP codeBlockBreak POOL;
return ::= RETURN END | RETURN expression END;     

We must also define the code block with break, which is very similar to a normal code block. That is:

codeBlockBreak ::= declarationBlock statementBlockBreak;

And now the statements with break:

statementBlockBreak ::= statementBreak statementBlockBreak | statementBreak;
statementBreak ::= loop
            | break
            | ifBreak
            | read
            | print
            | assignment
            | return
            | functionCall END
            ;
	    
ifBreak ::= IF PARENTHESESLEFT expression PARENTHESESRIGHT THEN codeBlockBreak FI
          | IF PARENTHESESLEFT expression PARENTHESESRIGHT THEN codeBlockBreak ELSE codeBlockBreak FI
          ;    

break ::= BREAK END | BREAK NUMBER END;

All we have left to do is to provide a declaration for the main function, and then for the entire ‘program’:

mainFunction ::= MAIN BRACESLEFT codeBlock BRACESRIGHT END;
mostAbstract ::= declarationBlock mainFunction declarationBlock;

We are now done with the parser; all the code put together can be found here.
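
As a final illustration, here is a minimal sketch of how the two generated classes could be wired together into a tiny front-end driver (my own illustration, not part of the coursework code; it assumes JFlex and CUP have already been run on the two specification files, producing the Lexer class and CUP’s default parser class):

import java.io.FileReader;

public class Main {
    public static void main(String[] args) throws Exception {
        Lexer lexer = new Lexer(new FileReader(args[0]));   // token stream from the Z source file
        parser p = new parser(lexer);                       // the LALR(1) parser generated by CUP
        try {
            p.parse();                                      // run syntax analysis
            System.out.println("parsing successful");
        } catch (Exception e) {
            System.out.println("parsing failed: " + e.getMessage());
        }
    }
}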

Conclusion

This article described how to write a lexer and a parser for an imaginary language. It also covered some theoretical aspects that I consider essential to understanding how lexing and parsing work. The GitHub repository will soon contain a method to test our code, by providing a series of example programs and comparing them with the output that we would expect. I hope that the code and the explanations help people understand the front end of a compiler and how such a front end works and is implemented. As a note, the code in the two files managed to pass all the tests provided by my module leader and scored 100%.