previous         next         contents

2. An example: nanoPascal

This chapter illustrates the construction of a language processor by use of Depot4 for a tiny language. We start from a parser for that language and continue by developing it into a translator, which produces semantically equivalent C code.

2.1. The language

We take a small language called nanoPascal (a very poor subset of Pascal) from R. Gerady: Experimental Comparison of Some Parsing Methods. SIGPLAN Notices 22, 79-88 (1987) . Its syntax is given in Listing 1.

The taken EBNF notation bases on N. Wirth's suggestions. All literal terminals must be quoted. Identifiers for productions (= nonterminals) must start with a letter and can contain digits or letters. There are some restrictions regarding the number of significant characters and forbidden names. See the installation guide for detailed infromation. (Identifiers are kept into the host language programs to make them better readable. Thus, one has to avoid clashes with the host langauge's keywords.)

	program = block '.'.
	block = ['VAR' ident {',' ident} ';']
	    'BEGIN' statement {';' statement } 'END' .
	statement = 'WRITE' expression |
	    'READ' ident|
	     ident ':=' expression |
	    'BEGIN' statement {';' statement } 'END' |
	    'IF' expression 'THEN' statement ['ELSE' statement] |
	    'WHILE' expression 'DO' statement | .
	expression = simpleExpr
	    [('='|'<'|'>'|'<='|'>='|'<>') simpleExpr ].
	simpleExpr = ['+'|'-'] term
	    {('+'|'-'|'OR') term }.
	term = factor {('*'|'DIV'|'AND') factor}.
	factor = ident | num | '(' expression ')'| 'NOT' factor.
Listing 1: Syntax of nanoPascal

2.2. The Parser

In producing a parser, it is the first step to adjust the grammar due to the restrictions of the parsing scheme. First, this means elimination of left recursion for which some algorithms are known. The second restriction comes from the sequential evaluation of alternative branches: If two (or more) of them start with equal prefixes then the longer must precede to shorter. Our example needs only two changes with regard to these restrictions:

(1) In expression the order of comparisons must be rearranged to:

expression = simpleExpr [('='|'<='|'>='|'<'|'>'|'<>') simpleExpr ].

(2) Rather similar, in factor one has to move the ident branch to the end (or at least after the 'NOT' factor branch):

factor = num | '(' expression ')'| 'NOT' factor | ident.

Here, this is enough to create a parser for the nanoPascal language from the slightly modified EBNF description.

Now you can generate a parser for this simple language by processing each of its rules, i.e. by their translation. In fact, you are producing a bundle of mutually dependent parsers as - in principle - every nonterminal can act as root of a (sub-)language.

If this language processor parses some input it delivers just a boolean value (true - text belongs to the defined language resp. false - otherwise). This state may be sent to the log.

2.3. Translation

2.3.1. Basics

The description of translations is based on a fairly singular approach. Its main focus is the similarity between source and target description. This is achieved by recalling that grammar productions can serve not only for parsing but also for synthesing text. Each rule of the source language is supplied with a production for the target language. By this, the application of a rule in parsing gets a value: its translation, which is defined by the associated target production. To distinguish this value from the rule, an underscore is added. For simple tokens such as identifier, number, etc. this value is equal to the accepted source text. E.g., a rule ex1= ident -> ident_. would accept an identifier and translate it into itself. The translation of sequence of numbers like 1 2 3 into a reverted list 3,2,1 would be described by
ex2= num [ ex2] -> [ex2_ ','] num_.
Although we are speaking about structural translations, every implementation must start with an attempt to map the semantic concepts of the source onto those of the target. Obviously, this holds without regard to the tools applied. The easiness with that structural translations may be achieved must not suppress a predecessing conceptional design.

2.3.2. Defining the translation

In our case this conceptional consideration is simple. All involved concepts, e.g., variable declaration and statement types, have clearly their respective counterparts in C.

We will demonstrate the step by step implementation of a translator. It is useful to insert some default (maybe empty) target generation into each rule first to avoid error messages when applying productions, which have not been supplemented yet. Now it is possible to check the target description of every translation rule individually.

Let us begin with the translation of variable declarations, which are an optional component of block. Here nothing is known about their type but it seems sensible to assume them as integer ones. I.e., some text VAR a,b,c; should be translated into int a,b,c;.

Thus we will get

block = ['VAR' id:ident {',' ident[i]} ';']
    'BEGIN' statement {';' statement} 'END'
-> ['int ' id_ {',' ident_[i]} ';\n'].
Remember, spaces and line feeds have to be added explicitly. The notion ident occurs twice in the source part of rule block, once in an iterative clause, where the single incarnations are distinguished by indexing. This pattern is also applied for statement, but as the default variable for number of iterations N may already be used by declarations, here the value is stored in the (otherwise unused) c. Now the complete translation for block is:

block = ['VAR' ident {',' ident[i]} ';']
    'BEGIN' statement {/..c/';' statement[i] } 'END' 
-> ['int ' ident_ {',' ident_[i]} ';\n']
    'main() \n{\n' statement_ {/..c/ '\n' statement_[i] } '}\n'.
Making block the root production, one gets the following translation (assuming ->'STATEMENT' the default target of statement):
VAR a,b,c;		int a,b,c;
BEGIN		main()
 a:= 2;	into	{
 IF b<3 THEN c:= a		STATEMENT
END		STATEMENT
		}
Next is to figure out a translation for statement. One also may start with a simplified translation like
statement = 'WRITE' expression |
    'READ' ident|
    'BEGIN' statement[1] N:=0;  {';' statement[i+1] } 'END' |
    'IF' expression 'THEN' st:statement ['ELSE' ste:statement] |
    'WHILE' expression 'DO' st:statement | 
    ident ':=' expression |
-> 'WRITE ' |
   'READ ' |
   'COMPOUND ' |
   'IF '|
   'WHILE ' |
   ':= '|
   'EMPTY '.
and then substitute the strings by correct descriptions. The handling of semicolons and line feeds needs some care for a fairly pretty looking result. The translation of the compound statement uses an index shift to avoid renaming. The empty statement is silently included as last (empty) branch in both the source and the target.

In the same way a translator for expression etc. can be developed. Listing 2 shows the complete description. The rule for term illustrates the use of an additional data structure for preserving branch information in an iterative alternative structure.

program = block '.' -> '#include <stdio.h>\n' block_ .
block = ['VAR' ident {',' ident[i]} ';']
    'BEGIN' statement {/..c/';' statement[i] } 'END' 
->  ['int ' ident_ {',' ident_[i]} ';\n']
    'main() \n{\n'
    statement_ {/..c/ '\n' statement_[i] }
    '}\n'.
statement = 'WRITE' expression |
    'READ' ident|
    'BEGIN' statement[1] {';' statement[i+1] } 'END' |
    'IF' expression 'THEN' st:statement ['ELSE' ste:statement] |
    'WHILE' expression 'DO' st:statement | 
    ident ':=' expression |
-> 'printf("%d",' expression_ ');' |
   'scanf("%d", &' ident_ ');' |
   '{'  {/..N+1/ statement_[i] '\n' } '}' |
   'if (' expression_ ') ' st_  ['else ' ste_ ]|
   'while (' expression_ ') ' st_ |
   ident_ '=' expression_ ';'| .
expression = simpleExpr
   [('='|'<='|'>='|'<>'|'<'|'>') sE: simpleExpr ]
-> simpleExpr_
   [('=='|'<='|'>='|'!='|'<'|'>') sE_ ].
simpleExpr =
   VAR op: FLEX OF INT;
   ['+'|'-'] term
   {(/op[i]/'+'|'-'|'OR') term[i] }
-> ['+'|'-'] term_
   {(/op[i]/'+'|'-'|'||') term_[i] }.
term =
   VAR op: FLEX OF INT;
   factor {(/op[i]/'*'|'DIV'|'AND') f2:factor[i]}
-> factor_ {(/op[i]/ '*'|'/'|'&&') f2_[i]}.
factor = num | '(' expression ')'| 'NOT' factor | ident
->  num_ |'(' expression_ ')'|'!' factor_ | ident_ .
Listing 2: Complete nanoPascal to C translator

A small example translation is given in Listing 3. Of course, some things could still be improved, but let us mention that up to now this translation is defined on a purely syntactic level. Whether this is a pro or con, depends from the application context. The main advantage of the restriction to syntax is its simplicity. A possible drawback may be that so called semantic errors, e.g., missing declaration, in the source are not detected. The consequences may vary depending from how strict generated targets are checked in subsequent steps. If errors are reliably caught later on, it may be a matter of taste where to find them out in the most sensible way.

      SOURCE                      TARGET

VAR X, Y, H, GGT;               #include <stdio.h>
BEGIN                           int X,Y,H,GGT;
  READ X; READ Y;               main()
  WHILE X<>0 DO                 {
    BEGIN                       scanf("%d", &X);
     IF X<Y THEN BEGIN          scanf("%d", &Y);
       H:= X; X:= Y; Y:= H      while (X!=0) {if (X<Y) {H=X;
     END;                       X=Y;
     X:= X-Y;                   Y=H;
     WHILE X<Y DO               X:= X-Y }
       X:= X-Y                  X=X-Y;
    END;                        while (X<Y) X=X-Y;
  GGT:= Y; WRITE GGT            }
END.                            GGT=Y;
                                printf("%d",GGT);}
Listing 3: Example translation


    previous         next         contents

© J. Lampe 1997-2002   juergen_lampe@firemail.de               (14-May-2002)