In the previous example, the scanner and parser generators LEX and YACC were used for the compiler implementation, almost without explanation. The input language has no function definitions, but its expressions and conditions may be complex.
Here are a few implementations of a syntax-driven parser for the Tiny Context language, perhaps altered, but no third-party tools are used:
Most of the explanations are on this page.
To omit unnecessary details, the compiled program should be placed in a file named c.prg; the result is written to a file named c.com. The grammar definition should be placed in a file named c.def. When DOS is not available, the DOSBox emulator may be used.
The first compilation of the examples may be performed in two ways. The easiest way is to use the Tiny Context 1.18 (c.118) compiler; to build it, the command file c.bat must be executed. Compiling with the DOS version of the Context programming language is also possible, but it requires a few changes in the source code:
Add to the beginning of the text a declaration of a Temp array (because this compiler produces a different memory allocation):
In the headers of all functions with two or more arguments, commas must be replaced with semicolons, for example:
All of the presented compilers may compile each other (and themselves), but the appropriate grammar file (c.def) must be used.
As in Tiny Context, function parameters and local variables are placed in fixed memory locations, but here this is not a problem because no recursive functions are used.
Grammar rules that are only implied in the original language implementation are used explicitly in these implementations. The grammar rules are given in Backus normal form.
The initial grammar is listed below. The first part is the token list (the approximate equivalent of YACC's %token directives); the second part is the rules. Each token has a corresponding code and value. A non-zero code is used to distinguish type names from all other tokens; the value contains an opcode or the size in bytes of the specified type. Each rule has the code of the action that must be performed when the rule is applied (in YACC, a C program fragment may be written). Action codes will be assigned later.
Some rules may seem excessive, for example
or
They are needed because when the begin and while symbols are recognized, the current address must be saved in order to jump to it later.
By substituting one rule into another, symbol sequences (programs) acceptable in terms of the grammar may be produced. Some of these programs will be correct, others may be correct under certain circumstances, and others are obviously incorrect. The following three short examples demonstrate this:
A number of issues are associated with the grammar, including questions about its ambiguity.
Apparently, unambiguity is desirable, but it is not necessarily achieved by a grammar. After replacing the Expr rule with the following
(Op is any arithmetic operator)
not only will operator precedence be lost, but the modified grammar will also be ambiguous. To prove that the ambiguity exists, one example is sufficient. The expression
may be derived in two different ways
and
Only the parser (and a primitive code generator) will be presented here. The scanner, because of its simplicity, will be written manually.
The scanner looks one character (i.e. one letter, digit, etc.) ahead while reading the input file. For the presented language this is enough; for the Pascal language the scanner needs to look two characters ahead, and for the Fortran language it needs to look many characters ahead, their number depending on the input.
It should be noted that a scanner may be generated by a special program, for example LEX (FLEX), which uses regular expressions.
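The one-character-lookahead scanning just described can be sketched as follows (in Python rather than the article's own language; the token tags name and numb follow the grammar in the text, while treating any other character as a one-character token is an illustrative simplification):

```python
# A minimal sketch of a hand-written scanner with one character of lookahead.
def scan(text):
    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        if c.isspace():                       # skip whitespace
            i += 1
        elif c.isalpha():                     # a name: letter, then letters/digits
            j = i + 1
            while j < len(text) and text[j].isalnum():
                j += 1                        # text[j] is the lookahead character
            tokens.append(('name', text[i:j]))
            i = j
        elif c.isdigit():                     # a number: a run of digits
            j = i + 1
            while j < len(text) and text[j].isdigit():
                j += 1
            tokens.append(('numb', text[i:j]))
            i = j
        else:                                 # anything else: a one-character token
            tokens.append((c, c))
            i += 1
    return tokens
```

For example, scan("a1 + 22") returns [('name', 'a1'), ('+', '+'), ('numb', '22')]. One character of lookahead is enough here because the end of every token is visible from the very next character.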
To implement this plan, some issues must be solved. The first (and easiest) is reading the above grammar definition and filling the appropriate data structures. The corresponding code is shown here. It will be used in all of the presented compilers.
The grammar loader assigns a unique numeric identifier to each symbol (a terminal symbol, i.e. a token, or a nonterminal symbol, i.e. a name on the left side of a rule); this replaces all string comparisons with number comparisons. Accordingly, the scanner should return the ID, which is determined by searching the symbol table.
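The interning done by the loader can be sketched like this (Python; the real loader reads c.def, which is not reproduced here):

```python
# Sketch of symbol interning: each distinct grammar symbol gets a unique
# numeric ID, so the parser compares integers instead of strings.
symbol_table = {}

def symbol_id(name):
    # Assign the next free ID on first sight, reuse it afterwards.
    if name not in symbol_table:
        symbol_table[name] = len(symbol_table)
    return symbol_table[name]

# A rule such as  TypeName: Type name  is then stored as IDs:
rule = (symbol_id('TypeName'), [symbol_id('Type'), symbol_id('name')])
```

The scanner performs the same lookup, so a token it returns is directly comparable with the symbols stored in the rules.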
Using the grammar is much more complex. Probably under the influence of available but useless literature, the first idea was to read tokens and transfer them to the stack (this operation is called a shift), comparing the top of the stack with the right-hand sides of the rules. When the top of the stack contains the same symbols as one of the rules, the appropriate number of symbols is replaced with the single symbol from the left side of the rule (this operation is called a reduction, or reduce). But for the above grammar this is impossible, because in some cases more than one rule may be applied, for example:
In some other cases, a proper ordering of the rules allows the conflict to be solved, for example:
Here, applying the second rule when the first rule is also applicable gives an error, because the resulting symbol sequence
cannot be reduced to anything.
Changing the grammar, rejecting some useful things, introducing additional tokens, and giving the scanner access to the dictionary (not only to the symbol table) make it possible to implement this primitive algorithm. All of the rules that are parts of other rules are replaced. For example, the rule for Ref is replaced by
In addition to this, it was necessary to make a number of other changes to eliminate the overlapping of rules. To do this, the prior or subsequent symbols are used. For example, the fact that an expression is an actual parameter of a function is determined by the presence of a comma after it:
These changes increase the number of rules and make the grammar more complex and tangled, but it can be done. An example is shown here.
To exclude matching of parts of the rules, the available previous symbols may be added to the rules explicitly. For example, replacing the rule
by the following set of rules
will eliminate the reduction of numb to Value while parsing an array definition.
An example is shown here. Clearly the repetition of the same rules is bad, but it can add some insight into the matter. As shown below, the parser can deal with the previous symbols itself. This example was made from the SLR(0)-parser presented below.
At least in theory, a full search of all possible reductions can be made. In reality this search is feasible only for very short programs. Also at least in theory, one could build the set of all possible correct programs not exceeding the length of the compiled program and search this set for that program. In reality this too is possible only for very short programs.
Significantly better results may be obtained by determining the allowable rules and allowable symbols. At the start of parsing a correct program is expected (it corresponds to the symbol Program), and at the end of the parsing one of the following rules will be applied
The symbols Declarations and Block can arise only as a result of the following rules
The symbol Declarations has already been taken into account and gives nothing new.
The symbols Declaration and Main can arise only as a result of the following rules
The symbols TypeName and Header can arise only as a result of the following rules
The symbol TypeName has already been taken into account and gives nothing new.
The new symbol Type can arise only as a result of the following rules
No more non-terminal symbols can be met; accordingly, no other rules can be added. Here is the complete set of rules available at the beginning of a correct program
Four rules start with terminal symbols; accordingly, a correct program can start only with one of these symbols, i.e. with char, byte, word or begin. Suppose the symbol char is met. This symbol uniquely defines the rule that should be applied
Similarly, the symbol Type uniquely identifies the rule that should be applied
But this is possible only when the next symbol is name.
The TypeName symbol does not uniquely determine a valid rule, but it defines the set of available rules
To reflect the fact that the symbol TypeName has been taken into account, the following notation is used:
The dots indicate positions in the rules. A rule of N symbols has N + 1 positions. Positions are called points, and the set of points available at the same time is called a state.
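The points-and-states bookkeeping can be sketched like this (Python; the rule fragment is illustrative). A point is a rule plus a dot position, and taking the closure pulls in, for every nonterminal standing right after a dot, the rules defining that nonterminal with the dot at the start:

```python
# Illustrative rule fragment: (left side, right side) pairs.
rules = [('TypeName', ('Type', 'name')),
         ('Type',     ('char',)),
         ('Type',     ('byte',))]

def closure(points):
    # A point is (left, right, dot); keep adding defining rules for the
    # nonterminal after the dot until nothing new appears.
    points = set(points)
    changed = True
    while changed:
        changed = False
        for left, right, dot in list(points):
            if dot < len(right):
                after = right[dot]
                for l, r in rules:
                    if l == after and (l, r, 0) not in points:
                        points.add((l, r, 0))
                        changed = True
    return points

start = closure({('TypeName', ('Type', 'name'), 0)})
# start now also contains the two points for the rules defining Type.
```

Moving a symbol onto the stack (a shift) corresponds to advancing the dot in every point where that symbol stands after the dot, which yields the next state.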
If the next encountered symbol is semi or lsb, only one rule remains possible. If the next encountered symbol is lcb, two rules remain possible, but the next symbol will reject one of them:
In this state the non-terminal symbol Args must be analyzed in the same way as the Declarations symbol above.
It should be noted that a separate rule is added for a function header without parameters. An alternative to this is an empty rule for Args:
An empty rule can be recognized by the following symbol (rcb in this case), but the first version of the parser is not able to do this.
Unfortunately, this is insufficient to deal with the original grammar, but the only trouble is operator precedence. To get started, all operators in expressions will have the same precedence; only a little effort is required to restore operator precedence later. The modified grammar and the compiler are here. This is an SLR(0)-parser.
More or less accurate accounting of the symbols that can appear after a given symbol allows building a parser for real programming languages without resorting to a significant complication of the grammar. For some grammars, including the presented one, the correct action (shift or reduce) can be determined by taking into account the following terminal symbol without the context of its appearance. The reduction to the left-side symbol of a rule is performed when the following terminal symbol can appear after the left-side symbol of the rule. For example, in the expressions part of the given grammar
one of the possible states is
In this state an SLR(0)-parser reports an error (a conflict between a shift and the reduction of Term to Expr). But in the given grammar only the plus, minus, lt, le, eq, ne, ge, gt, rsb, rcb, then, do, comma and semi symbols can occur after Expr. Accordingly, if the following terminal symbol is one of the above, the shift operation would produce an empty state, and the reduction of Term to Expr is the only valid action. If the following symbol is star, slash or pct, which cannot appear after Expr, a shift must be done (followed by the reduction of the terminal symbol to Op2 and the next shift). If the following symbol differs from all of the symbols listed above, the parsed text does not match the grammar.
This is an SLR(1)-parser. Its implementation requires very few changes, and these changes are here.
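The decision for this particular state can be written out directly (Python; the symbol names are those used in the text):

```python
# SLR(1) decision after a complete Term on the stack: reduce Term to Expr
# only when the next terminal may follow Expr, shift when the next terminal
# continues the Term, report an error otherwise.
FOLLOW_EXPR = {'plus', 'minus', 'lt', 'le', 'eq', 'ne', 'ge', 'gt',
               'rsb', 'rcb', 'then', 'do', 'comma', 'semi'}

def action_after_term(next_symbol):
    if next_symbol in FOLLOW_EXPR:
        return 'reduce'               # Term -> Expr
    if next_symbol in ('star', 'slash', 'pct'):
        return 'shift'                # keep building the Term
    return 'error'                    # text does not match the grammar
```

The set FOLLOW_EXPR is exactly the list of symbols that can occur after Expr, copied from the paragraph above.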
To understand how it works, it makes sense to consider the grammar of arithmetic expressions without operator precedence and with precedence separately, and to draw on paper all the possible parser states and all the possible following symbols. To separate this part of the grammar, the Value non-terminal symbol can be replaced with the numb terminal symbol, and an extra rule (instead of Program) can be added at the beginning:
There are unambiguous grammars for which such an analysis does not eliminate conflicts. An SLR(1)-parser is not suitable for them, but for some of them a parser can be built that takes into account the dependence of the valid following symbols on the preceding context.
Its idea is as follows. As before, at the beginning of parsing a correct program is expected, and at the end one of the rules is applied:
The Declarations symbol can appear only as a result of one of the following rules
In addition, and this is most important, the subsequent reduction to Program will not be possible if the following symbol is not the first symbol of Block (i.e. begin). Adding these two rules to the list of possible rules, with the expected following symbol begin in curly braces, gives:
In fact, other symbols can follow Declarations, but this is not evident from the first grammar rule.
Further, the Block symbol can appear only as a result of the rule
What symbol can come after Block? The answer is either nothing (end of file) or any symbol, including ones missing from the language. For now a question mark is used instead of the possible symbol or symbols. The correct following symbol will be determined later. Consider the first of the added rules
This rule cannot be used if the symbol following Declarations does not match one of the start symbols of Declaration (i.e. char, byte or word). Adding this to the list gives:
The second rule that determines Declarations
consists of a single symbol, Declaration, which is determined by three rules:
After the Declaration symbol the begin symbol is expected, so begin is expected after those three rules:
This is like the reduction of Block to Program. If trailing symbols are formally attached to the rules that determine Program, the structure of all valid rules becomes the same. If we decide that no symbols can come after Program (this reduces the length of the resulting list), the first two rules are:
and
Here, the symbol # is used to indicate the end of the file.
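The attachment of expected following symbols can be sketched like this (Python; the grammar fragment and the FIRST sets are a hand-coded illustrative subset, not computed from the full grammar):

```python
# LR(1) closure sketch: each point carries its own expected following symbol
# (shown in curly braces in the text). When the dot stands before a
# nonterminal, its defining rules are added with the first symbols of
# whatever follows the dot, or with the inherited symbol if nothing follows.
rules = {'Program': [('Declarations', 'Block')],
         'Declarations': [('Declaration',), ('Declarations', 'Declaration')]}
FIRST = {'Block': {'begin'}, 'Declaration': {'char', 'byte', 'word'}}

def closure(points):                      # a point is (left, right, dot, la)
    points = set(points)
    changed = True
    while changed:
        changed = False
        for left, right, dot, la in list(points):
            if dot < len(right) and right[dot] in rules:
                rest = right[dot + 1:]
                lookaheads = FIRST.get(rest[0], {rest[0]}) if rest else {la}
                for r in rules[right[dot]]:
                    for t in lookaheads:
                        item = (right[dot], r, 0, t)
                        if item not in points:
                            points.add(item)
                            changed = True
    return points

state = closure({('Program', ('Declarations', 'Block'), 0, '#')})
```

Running it reproduces the reasoning above: the Declarations rules appear with the expected following symbol begin (from Block) and also with char, byte and word (from Declaration), and repeated items are eliminated by the set.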
The result of this process is the closure of the initial state (to stop the process, repeating rules must be eliminated):
Four rules (one of which is repeated eight times) begin with terminal symbols; accordingly, a correct program can start only with one of them, i.e. with char, byte, word or begin. Suppose the symbol char is met. It uniquely identifies a suitable rule
The shift action (transferring the char symbol to the stack) leads to a state of one point:
If the next symbol is name, then the reduction of char to Type is performed and Type is transferred to the stack (a shift is performed). If not, any further action is meaningless: the text does not match the grammar. Thus the parser goes to a state of one point that is repeated three times:
Transferring the symbol name to the stack (shift) leads to the state:
If the next symbol is semi, lsb or lcb, then the reduction to TypeName is performed and TypeName is transferred to the stack; otherwise an error is reported, since no other symbols can appear after TypeName. In the normal course of things the parser goes to a state of four points repeated twenty-four times:
If the following symbol is semi, the parser goes to a state of one point repeated four times:
If the next symbol is begin, char, byte or word, then the reduction to Declaration is performed; otherwise an error is reported.
This process will end with the Program symbol (or an error will occur). This is an LR(1)-parser. Its implementation requires changes in the functions that build the state closure and analyse the possibility of a reduction (which is very simple: a reduction is possible only when the actual symbol is the same as the expected one); they are here.
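The walkthrough above corresponds to a driver loop of roughly this shape (Python; the rule table covers only the char name semi fragment, with each rule's expected following symbols taken from the text):

```python
# Driver sketch: shift the next symbol, then reduce while the top of the
# stack matches a complete rule whose expected following symbol equals the
# current lookahead. '#' marks the end of the input.
rules = [('Type',        ('char',),            ('name',)),
         ('TypeName',    ('Type', 'name'),     ('semi', 'lsb', 'lcb')),
         ('Declaration', ('TypeName', 'semi'), ('begin', 'char', 'byte', 'word'))]

def parse(symbols):
    stack = []
    while symbols:
        stack.append(symbols.pop(0))           # shift
        reduced = True
        while reduced:                         # reduce while possible
            reduced = False
            look = symbols[0] if symbols else '#'
            for left, right, follow in rules:
                if tuple(stack[-len(right):]) == right and look in follow:
                    del stack[-len(right):]
                    stack.append(left)
                    reduced = True
                    break
    return stack
```

For example, parse(['char', 'name', 'semi', 'begin']) returns ['Declaration', 'begin'], matching the sequence of reductions traced above.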
To understand how it works, it makes sense to consider the separated grammar of expressions with operator precedence.
It should also be noted that expressions can appear in different contexts in a program, and the sets of following symbols differ between contexts. For example, if an expression appears on the left side of a comparison in the head of a while loop, the set of valid following symbols consists of the six comparison symbols, but if an expression appears on the right side of a comparison, only one following symbol (do) is valid.
For the given grammar these actions are excessive. Moreover, storing information about the states requires a large amount of memory, so an implementation on a machine with little RAM (for example, an 8- or 16-bit machine with 64 kilobytes of memory) may be difficult. The redundancy of an LR(1)-parser can be decreased by decreasing its generality (an LALR(1)-parser). With gigabytes of memory available this problem is insignificant, but LALR(1)-parsers like YACC/BISON have not disappeared.
In principle, it is possible to include rules for the language's words into the grammar and remove the scanner (instead, only a Read() function is needed that returns on each call the next character of the program source). Unfortunately, an LR(1)-parser is not well suited for this. The following rules lead to a shift-reduce conflict:
The presented grammar loader cannot read ranges of characters ('a'-'z'), so either every such rule must be repeated for each of the characters in the range, or the loader must be changed.
After reading an 'i' character followed by an 'f', 'n' or 's' character, it is impossible to choose between a shift and a reduction to name; to make a decision, more than one next character must be seen. This is an example of a non-LR(1) grammar.
The conflict is solved after the grammar conversion shown below:
The new symbol temp cannot begin with an 'i', nor with the first character of any other language word. This must also be written down, so the last two rules must be replaced. Rules that contain uppercase letters, digits, and possibly other characters may also be added.
Names like inli look strange, but they are formally allowed. It is possible to disable them and slightly reduce the number of these rules.
Only a small part of the rules is shown above, and the benefits are questionable.