Syntax-driven parser

In previous example scanner and parser generators LEX and YACC are used for compiler implementation, almost without explanation. The input language have not functions definition, but the expressions and conditions may be complex.

Here are a few implementations of syntax-driven parser for Tiny Context language, perhaps altered, but no third-party tools are used:

Most part of the explanations are on this page.

To omit unnecessary details compiled program should be placed in a file named c.prg, the result are written to the file named c.com. Grammar definition should be placed in the file named c.def. When DOS is not available, DOSBox emulator may be used.

First compilation of the examples may be performed by two ways. The easiest way use Tiny Context 1.18 (c.118) compiler, to build it command file c.bat must be executed. Compiling with DOS-version of Context programing languagealso possible, but it requires a few changes in the source code:

Add to the beginning of the text declaration of an Temp array (because this compiler produce different memory allocation):

char Temp [16640]; // Before char Text [16384];

In the headers of all functions with two or more arguments commas must be replaced with semicolons, for example:

word Digraph(char C1, char C2) is -> word Digraph(char C1; char C2) is

All of presented compilers may compile each other (and themselves), but appropriate grammar file (c.def) must be used.

As in Tiny Context function parameters and local variables are placed in a fixed memory locations, but here it is not a problem because no recursive functions are used.

If the original language implementation grammar rule are only implied in these implementations are used explicitly. Grammar rules given in Backus normal form. Initial grammar are listed below. The first part are token list (the approximate equivalent of YACC directives %token), the second part are the rules. Each token have corresponding code and value. A non-zero code are used to distinguish type names form all other tokens, the value contains an opcode or the size in bytes of the specified type. Each role have corresponding code of action which must be performed when rule are applied (in YACC C program fragment may be writeen). Action codes will be assigned later.

symb ; numb ; name ; lsb "[" 0 0; rsb "]" 0 0; lcb "(" 0 0; rcb ")" 0 0; assign ":=" 0 0; plus "+" 0 0; minus "-" 0 1; star "*" 0 2; slash "/" 0 3; pct "%" 0 4; lt "<" 0 0; le "<=" 0 1; eq "=" 0 2; ne "!=" 0 3; ge ">=" 0 4; gt ">" 0 5; comma "," 0 0; semi ";" 0 0; is 0 0; begin 0 0; if 0 0; then 0 0; while 0 0; do 0 0; inline 0 0; return 0 0; end 0 0; char 1 1; byte 1 1; word 1 2. Program : 0 Declarations Block | 0 Block ; Declarations : 0 Declarations Declaration | 0 Declaration ; Block : 0 Main Stmts end ; Declaration : 0 TypeName semi | 0 TypeName lsb numb rsb semi | 0 Header Stmts end ; Header : 0 TypeName lcb rcb is | 0 TypeName lcb Args rcb is ; Args : 0 Args comma Arg | 0 Arg ; Arg : 0 TypeName ; TypeName : 0 Type name ; Type : 0 char | 0 byte | 0 word ; Main : 0 begin ; Stmts : 0 Stmts Stmt | 0 Stmt ; Stmt : 0 IfBlk | 0 WhileBlk | 0 Inlines | 0 Ret | 0 Local | 0 Assign | 0 PCall ; IfBlk : 0 IfHdr Stmts end ; IfHdr : 0 if Cond then ; WhileBlk : 0 WhileHdr Stmts end ; WhileHdr : 0 Loop Cond do ; Loop : 0 while ; Cond : 0 Expr RelOp Expr ; Inlines : 0 inline OpCodes semi ; OpCodes : 0 OpCodes comma OpCode | 0 OpCode ; OpCode : 0 numb ; Ret : 0 return Expr semi ; Local : 0 LocalVar assign Expr semi | 0 LocalVar semi ; LocalVar : 0 TypeName ; Assign : 0 Ref assign Expr semi ; PCall : 0 Call semi ; Expr : 0 Expr Op1 Term | 0 Term ; Term : 0 Term Op2 Value | 0 Value ; Value : 0 symb | 0 numb | 0 Ref | 0 Call | 0 Cast | 0 lcb Expr rcb ; Ref : 0 Name | 0 Name lsb Expr rsb ; Call : 0 Name lcb Params rcb | 0 Name lcb rcb ; Name : 0 name ; Params : 0 Params comma Param | 0 Param ; Param : 0 Expr ; Cast : 0 Type lcb Expr rcb ; Op1 : 0 plus | 0 minus ; Op2 : 0 star | 0 slash | 0 pct ; RelOp : 0 lt | 0 le | 0 eq | 0 ne | 0 ge | 0 gt .

Some rules may seem excessive, for example

Main : 0 begin ;

или

Loop : 0 while ;

They are needed because when begin and while symbols are recognized the current address must be saved to do jump to it later.

By performing the substitution of one rules to another acceptable in terms of the grammar symbol sequences (programs) may be produced. Some of these programs will be correct, others may be so under certain circumstances, others obviously incorrect. The following three short examples demonstrate this:

begin word name assign numb semi end

word name semi begin name assign numb semi end

begin name assign numb semi end

With the grammar are associated a number of issues, including questions about

Apparently, the unambiguity is desireable, but it is not necessary achieved by grammar. After replacing the Expr rule by the following (Op - any arithmetic operator)

Expr : 0 Expr Op Expr | 0 Value ;

not only the opertor precedence will be lose but. Grammar Modified grammar are umbiguous. To prove that the ambiguity exists one example are sufficient. Expression

3 - 2 + 1

may be given by two different ways

(3 - 2) + 1 // (Expr - Expr) + Expr

and

3 - (2 + 1) // Expr - (Expr + Expr)

Here will be presented only the parser (and primitive code generator). The scanner because of its simplicity will be written manually.

Scanner look for one character (i.e. one letter, number, etc.) forward while reading the input file. For presented langhuage this is enough, for Pascal language scanner need to look for two characters forward, for Fortran language scanner need to look for many characters forward, and it number depends on the input.

It should be noted that scanner may be generated by special program. For example LEX (FLEX) that uses a regular expressions.

To implement this plan some issues must be solved. The first (easiest) are reading the above grammar definition and filling appropriate data structures. The corresponding code is shown here. It will be used in all presented compilers.

Grammar loader assign to each symbol (terminal symbol, i.e. token, nonterminal symbol, i.e. name in the left side of the rule) a unique numeric identifier it will replace all string comparisons by number comparisons. Respectively scanner should return the ID, it will be determined by searching in the symbol table.

Grammar usage much more complex. Probably, under the influence of available, but useless literature, the first idea was reading tokens, transferring them to the stack (this operatoin called shift) and comparing top of the stack with right parts of the rules. When top of the stack contain same symbols as one of the rules the appropriate number of symbols are replaced with one symbol from the left side of the rule (this operaion called reduction, reduce). But for the above grammar this is impossible, because in some cases more then one rule may be used, for example:

Ref : 0 Name | 0 Name lsb Expr rsb ;

In some other cases, the proper rules ordering allows to solve conflict, for example:

Declarations : 0 Declarations Declaration | 0 Declaration ;

Here, the appying of the second rule when applying the first rule also possible gives an error because of next symbol sequence

Declarations Declarations

can not be reduced to anything.

Changing the grammar, rejecting some useful things, introducing of additional tokens

type // Type var // Simple variable array // Array fn // Function

and allowing to the scanner access to the dictionary (not only to the symbol table) help to implement this primitive algorithm. All of the rules that are parts of other rules will be replaced. For example, rule for Ref will be replaced by

Ref : 0 var | 0 array lsb Expr rsb ;

In addition to this it was necessary to make a number other changes to eliminate overlapping of rules. Do do this the prior or subsequent symbols are used. For example, the fact that the expression are the actual parameter of function is determined by the presence of a comma after it:

Param : 410 symb comma | 410 numb comma | 420 Ref comma | 430 Call comma | 430 Cast comma | 430 Expr comma ;

This changes increase number of rules and make grammar more complex and tangled, but it may be done. An example is shown here.

To exclure matching of the parts of the rules available previous symbols may be added to rules explicitly. For example, replacement of rule

Value : 0 numb ;

by next set of rules

Value : 0 if < numb | 0 Loop < numb | 0 assign < numb | 0 return < numb | 0 array lsb < numb | 0 lcb < numb | 0 Op < numb | 0 RelOp < numb | 0 Params < numb ;

will eliminate the reduction numb to Value while parsing array definition. An example is shown here. Clearly the repetition of the same rules is bad, but can be add some insight into the matter. As shown below, the parser can deal with the previous symbols themselves.

This example was made from the presented below SLR(0)-parser.

At least in theory the full search of all possible reduces can be made. In real this search can be made only for very short programs.

Also at least in theory building set of all possible correct programs, that not exceed length of compiled program, and search this set for the same program can be made. In real it is possible only for very short programs.

Significantly better results may be obtained by determining the allowable rules and allowable symbols. At the start of parsing correct program are expected (it corresponds to the symbol Program) and at the the end of the parsing one of the next rules will be applied

Program : 0 Declarations Block | 0 Block ;

Symbols Declarations and Block can arise only as result of the next rules

Declarations : 0 Declarations Declaration | 0 Declaration ; Block : 0 Main Stmts end ;

Symbol Declarations ara already taken into account and give nothing new. Symbols Declaration and Main can arise only as result of the next rules

Declaration : 0 TypeName semi | 0 TypeName lsb numb rsb semi | 0 Header Stmts end ; Main : 0 begin ;

Symbols TypeName and Header can arise only as result of the next rules

TypeName : 0 Type name ; Header : 0 TypeName lcb rcb is | 0 TypeName lcb Args rcb is ;

Symbol TypeName ara already taken into account and give nothing new. New symbol Type can arise only as result of the next rules

Type : 0 char | 0 byte | 0 word ;

No more non-terminal symbols can be met, respectively adding of other rules are impossible. Here is a complete set of rules available at the beginning of correct program

Program : 0 Declarations Block | 0 Block ; Declarations : 0 Declarations Declaration | 0 Declaration ; Block : 0 Main Stmts end ; Declaration : 0 TypeName semi | 0 TypeName lsb numb rsb semi | 0 Header Stmts end ; Main : 0 begin ; TypeName : 0 Type name ; Header : 0 TypeName lcb rcb is | 0 TypeName lcb Args rcb is ; Type : 0 char | 0 byte | 0 word ;

Four rules start with the terminal symbols, respectively, correct program can start only from one of these symbols, i.e. from char, byte, word or begin. Let's meet symbol char. This symbol clearly defines the rule that should be applied

Type : 0 char ;

Similarly, the symbol Type uniquely identifies rule that should be applied

TypeName : 0 Type name ;

But it is possible only when next symbol are name.

TypeName symbol does not uniquely determine a valid rule, but defines the set of available rules

Declaration : 0 TypeName semi | 0 TypeName lsb numb rsb semi ; Header : 0 TypeName lcb rcb is | 0 TypeName lcb Args rcb is ;

To reflect the fact that the symbol TypeName are taken into account the following notation are used:

Declaration : 0 TypeName . semi | 0 TypeName . lsb numb rsb semi ; Header : 0 TypeName . lcb rcb is | 0 TypeName . lcb Args rcb is ;

Dots are indicates the positions in the rules. Rule of N symbols have N + 1 position. Positions are called points, the set of available at the same time points are called a state.

If next encountered symbol are semi or lsb only one rule remains possible. If next encountered symbol are lcb, two rules remains possible, but the next symbol will reject one ot them:

Header : 0 TypeName lcb . rcb is | 0 TypeName lcb . Args rcb is ;

In this state non-terminal symbol Args must be analyzed like Declarations symbol above.

It should be noted that for the function header without parameters separate rule added. An alternative to this - the empty rule for Args:

Args : 0 Args comma Arg | 0 Arg | 0 ;

Empty rule can be recognized by following symbol (rcb in this case), but the first version of parser can not be able to do this.

Unfortunately, this is insufficient to deal with the original grammar, but trouble only in operator precedence. To get started, all operators in expressions will have the same precedence, but only little effort are required to return operator precedence. The modified grammar and the compiler are here. This is SLR(0)-parser.

More or less accurate accounting of symbols that could appears after a given symbol allow bulid parser for real programming languages without resorting to a significant complication of grammar. For some grammars including presented correct action (shift or reduce) can be determined by taking into account the following terminal symbol without context of its appearance. The reduce to left-side symbol of rule are performed when following terminal symbol can appears after left-side symbol of the rule. For example, in the expressions part of the given grammar

Expr : 0 Expr Op1 Term | 0 Term ; Term : 0 Term Op2 Value | 0 Value ; Op1 : 0 plus | 0 minus ; Op2 : 0 star | 0 slash | 0 pct ;

one of posiiisible states are

Expr : 0 Term . ; Term : 0 Term . Op2 Value ; Op2 : 0 . star | 0 . slash | 0 . pct ;

In this state SLR(0)-parser reports an error (conflict between shift and reduction of Term to Expr). But in the given grammar after Expr can occurs only plus, minus, lt, le, eq, ne, ge, gt, rsb, rcb, then, do, comma and semi symbols. Accordingly, if the following terminal symbol are one of above, the shift operaion produce an empty state and reduction of Term to Expr are only valid action. If the following symbol are star, slash or pct, which can not appears after Expr, shift must be done (then reduction of terminal symbol to Op2 and next shift). If the folowing symbol differ from all of symbols listed above, the parsed text does not match the grammar. This is SLR(1)-parser. Implementation requires a very little changes and this changes are here.

To understand how it works makes sense to consider the grammar for arithmetic expressions without operator precedence and with precedence separately and draw on paper all the possible parser states and all possible following symbols. To separate this part of the grammar Value non-terminal symbol can be replaced with the numb terminal symbol and add at the beginning extra rule (instead of Program):

Result : 0 Expr semi ;

There are unambiguous grammars, for which such analysis does not eliminate conflicts. SLR(1)-parser are not suitable for them, but for some of them can be built parser that takes into account the dependence of the valid folowing symbols from the preceding context.

Its idea is as follows. As before, at the beginning of parsing correct program are expected and at the end one of the rules are applied:

Program : 0 . Declarations Block | 0 . Block ;

Declarations symbol can only appears as a result of one of the following rules

Declarations : 0 . Declarations Declaration | 0 . Declaration ;

In addition, and this is most important, the subsequent reduction to the Program would not be possible when the following symbol are not the first symbol of Block (i.e. begin). Addition this two rules to the list of possible rules and in curly braces expected following symbol begin gives:

Declarations : 0 . Declarations Declaration { begin } | 0 . Declaration { begin } ;

In fact, after the Declarations can follows other symbols, but from the first grammar rule it is not clear.

Further, the Block symbol can only appears as a result the rule

Block : 0 . Main Stmts end ;

What symbol can be after Block? The answer might be or nothing (end of file), or any symbol, including missing in a language. Now question mark are used insted of possible symbol or symbols.

Block : 0 . Main Stmts end {?} ;

Correct following symbol will be determined later. Consider the first of the added rules

Declarations : 0 . Declarations Declaration { begin } ;

This rule can not be not be used if the following after Declarations symbol does not meet one of the start symbols of Declaration (i.e. char, byte or word). Addition of this to the list gives:

Declarations : 0 . Declarations Declaration { char } | 0 . Declarations Declaration { byte } | 0 . Declarations Declaration { word } | 0 . Declaration { char } | 0 . Declaration { byte } | 0 . Declaration { word } ;

The second rule thet determine Declarations

Declarations : 0 . Declaration { begin } ;

consists of a single symbol Declaration, which is determined by three rules:

Declaration : 0 TypeName semi | 0 TypeName lsb numb rsb semi | 0 Header Stmts end ;

After the Declaration symbol the begin symbol are expected, so begin are expected after those three rules:

Declaration : 0 . TypeName semi { begin } | 0 . TypeName lsb numb rsb semi { begin } | 0 . Header Stmts end { begin } ;

It's like a reduction of Block to Program. If formally attach to the rules that determine Program trailing symbols, the structure of all valid rules became the same. If decide that, after the Program can nor be symbols (this reduces the length of the resulting list) first two rules are:

Program : 0 . Declarations Block {#} | 0 . Block {#} ;

and

Block : 0 . Main Stmts end {#} ;

Here, the symbol # are used to indicate the end of the file.

The result of this process are closure of initial state (to stop tte process repeating rules must be eliminated):

Program : 0 . Declarations Block { # } | 0 . Block { # } ; Declarations : 0 . Declarations Declaration { begin } | 0 . Declaration { begin } ; Block : 0 . Main Stmts end { # } ; Declarations : 0 . Declarations Declaration { char } | 0 . Declarations Declaration { byte } | 0 . Declarations Declaration { word } | 0 . Declaration { char } | 0 . Declaration { byte } | 0 . Declaration { word } ; Declaration : 0 . TypeName semi { begin } | 0 . TypeName lsb numb rsb semi { begin } | 0 . Header Stmts end { begin } ; Main : 0 . begin { inline } | 0 . begin { return } | 0 . begin { if } | 0 . begin { while } | 0 . begin { name } | 0 . begin { char } | 0 . begin { byte } | 0 . begin { word } ; Declaration : 0 . TypeName semi { char } : 0 . TypeName lsb numb rsb semi { char } : 0 . Header Stmts end { char } : 0 . TypeName semi { byte } : 0 . TypeName lsb numb rsb semi { byte } : 0 . Header Stmts end { byte } : 0 . TypeName semi { word } : 0 . TypeName lsb numb rsb semi { word } : 0 . Header Stmts end { word } ; TypeName : 0 . Type name { semi } | 0 . Type name { lsb } ; Header : 0 . TypeName lcb rcb is { inline } | 0 . TypeName lcb rcb is { return } | 0 . TypeName lcb rcb is { if } | 0 . TypeName lcb rcb is { while } | 0 . TypeName lcb rcb is { name } | 0 . TypeName lcb rcb is { char } | 0 . TypeName lcb rcb is { byte } | 0 . TypeName lcb rcb is { word } | 0 . TypeName lcb Args rcb is { inline } | 0 . TypeName lcb Args rcb is { return } | 0 . TypeName lcb Args rcb is { if } | 0 . TypeName lcb Args rcb is { while } | 0 . TypeName lcb Args rcb is { name } | 0 . TypeName lcb Args rcb is { char } | 0 . TypeName lcb Args rcb is { byte } | 0 . TypeName lcb Args rcb is { word } ; Type : 0 . char { name } | 0 . byte { name } | 0 . word { name } ; TypeName : 0 . Type name { lcb } ;

Four Rules (one of which are repeated eight times) begin with the terminal symbols, accordingly correct program can only start with one of them, i.e. with char, byte, word or begin. Let's meet symbol char. It uniquely identifies a suitable rule

Type : 0 . char { name } ;

The shift action (and transfering char symbol to stack) leads to a state of one point:

Type : 0 char . { name } ;

If the next symbol are name, then reduction of char to the Type are performed and Type are transferred to the stack (shift are performed). If not, any further action is meaningless - the text does not match to the grammar. Thus parser are going to state of one point that repeated three times:

TypeName : 0 Type . name { semi } | 0 Type . name { lsb } | 0 Type . name { lcb } ;

Transferring a symbol name to the stack (shift) will lead to state:

TypeName : 0 Type name . { semi } | 0 Type name . { lsb } | 0 Type name . { lcb } ;

If the next symbol are semi, lsb or lcb, then reduction to TypeName are performed and TypeName and transfered to the stack, otherwise error are fixed - no other symbols can appeas after Typename. In the normal course of things parser goes to state of four points which are repeated twenty four times:

Declaration : 0 TypeName . semi { begin } | 0 TypeName . lsb numb rsb semi { begin } | 0 TypeName . semi { char } | 0 TypeName . lsb numb rsb semi { char } | 0 TypeName . semi { byte } | 0 TypeName . lsb numb rsb semi { byte } | 0 TypeName . semi { word } | 0 TypeName . lsb numb rsb semi { word } ; Header : 0 TypeName . lcb rcb is { inline } | 0 TypeName . lcb rcb is { return } | 0 TypeName . lcb rcb is { if } | 0 TypeName . lcb rcb is { while } | 0 TypeName . lcb rcb is { name } | 0 TypeName . lcb rcb is { char } | 0 TypeName . lcb rcb is { byte } | 0 TypeName . lcb rcb is { word } | 0 TypeName . lcb Args rcb is { inline } | 0 TypeName . lcb Args rcb is { return } | 0 TypeName . lcb Args rcb is { if } | 0 TypeName . lcb Args rcb is { while } | 0 TypeName . lcb Args rcb is { name } | 0 TypeName . lcb Args rcb is { char } | 0 TypeName . lcb Args rcb is { byte } | 0 TypeName . lcb Args rcb is { word } ;

If the following symbol are semi, parser goes to state of one point repeated four times:

Declaration : 0 TypeName semi . { begin } | 0 TypeName semi . { char } | 0 TypeName semi . { byte } | 0 TypeName semi . { word } ;

If the next symbol are begin, char, byte or word, then reduction to Declaration are performed, otherwise the error are fixed.

This process will end up with a Program symbol (or error occurs). This is LR(1)-parser. Its implementation requires a changes in the functions that built state closure and and analyse recuction possibility (its very simple - reduction are possible only when actual symbol are teh same to expected), they are here.

To understand how it works makes sense to consider the sparated grammar of expressions with operator precedence.

Also it should be noted that the program expressions can appears in different contexts, and sets of following symbols are different in different contexts. For example, if expression appears in the left side of comparison in the head of while loop, the set of valid following symbols consists of six comparisons symbols, but if expression appears in the right side of comparison - only one follownig symbol (do) are valid.

For a given grammar these actions are excessive. Moreover, for storing information about the states large amount of memory are required an implementation on machine with small RAM (for example, on 8- or 16-bit machine with 64 kilobytes of memory) may be difficult. Redundancy of an LR(1)-parser can be decreased by decreasing its generality (LALR(1)-parser). In the presence of gigabytes of memory, this problem is insignificant, but the LALR(1)-parsers like YACC/BISON now not disappeared.

In principle, it is possible to include into the grammar rules for language words and remove the scaner (instead we need only function Read() that return on each call next character of the program source). Unfortunately LR(1)-parser are not very fitted for this. Subsequent rules follow to a shift-reduce conflict:

if : 0 'i' 'f' ; inline : 0 'i' 'n' 'l' 'i' 'n' 'e' ; is : 0 'i' 's' ; name : 0 name 'a'-'z' | 0 'a'-'z' ;

Presented grammar loader cannot read ranges of characters ('a'-'z'), so either every such rule must be repeated for each of the characters in the range, or loader must be changed.

After reading 'i' character and following 'f', 'n' or 's' character it is impossible to choose between shift and reduce to name, to make a decision we need to see more than one next character. This is example of non-LR(1)-grammar.

Conflict will be solved after grammar conversion as shown below:

i : 0 'i' ; if : 0 i 'f' ; in : 0 i 'n' ; inl : 0 in 'l' ; inli : 0 inl 'i' ; inlin : 0 inli 'n' ; inline : 0 inlin 'e' ; is : 0 i 's' ; name : 0 temp | 0 i | 0 in | 0 inl | 0 inli | 0 inlin ; temp : 0 temp 'a'-'z' | 0 i 'a'-'e' | 0 i 'g'-'m' | 0 i 'o'-'r' | 0 i 't'-'z' | 0 if 'a'-'z' | 0 in 'a'-'k' | 0 in 'm'-'z' | 0 inl 'a'-'h' | 0 inl 'j'-'z' | 0 inli 'a'-'m' | 0 inli 'o'-'z' | 0 inlin 'a'-'d' | 0 inlin 'f'-'z' | 0 inline 'a'-'z' | 0 is 'a'-'z' | 0 'a'-'h' | 0 'j'-'z' ;

The new symbol temp can not begin with an 'i', as well as other first character of any language word. This is also must be written, so the last two rules must be replaced. And rules that contains uppercase letters, numbers, and possibly other characters also may be added.

Names like inli look strange, but formally allowed. It is possible to disable them and slightly reduce the number of these rules.

Only small part of rules are shown above and benefits are questionable.

Сайт создан в системе uCoz