
3. The Ml4 language

3.1. Structure of Ml4

Ml4 (Meta-Language of Depot4) is based on EBNF. In fact, it is a true extension of one of its variants (introduced by N. Wirth).

There is no Ml4 program as such; instead there is a set of Ml4 productions, which can be translated independently of each other. Thus, Ml4 features production-based (or rule-based) modularization. Translators are configured dynamically by selecting one of the rules as the root production, i.e., the nonterminal on the left-hand side of that production is declared the start symbol of the grammar. In this way an applicable language processor is formed. The choice of the language root can be changed dynamically at any time. Together with the dynamic loading of modules, this makes it possible to test parts of the language processor before all productions have been implemented.
The formal description of the EBNF in section 3.3. is already a set of valid Ml4 productions, which can be translated by the Depot4 metalanguage translator into executable code. By choosing Rule as start symbol we get an acceptor for EBNF productions.
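
The dynamic choice of a root production can be pictured with a small Python sketch. It is not Depot4's implementation: the registry PRODUCTIONS, the helpers define, call and make_translator, and the toy productions Digit, Pair and Triple are all hypothetical names chosen for illustration. The sketch only shows that a processor rooted at one production can be tested while another production is still missing.

```python
# Hypothetical registry: maps a nonterminal name to its acceptor function.
PRODUCTIONS = {}

def define(name, acceptor):
    """Register (or replace) the acceptor for one production."""
    PRODUCTIONS[name] = acceptor

def call(name, tokens, pos):
    """Look a nonterminal up only at the moment it is actually needed."""
    if name not in PRODUCTIONS:
        raise LookupError(f"production {name!r} is not implemented yet")
    return PRODUCTIONS[name](tokens, pos)

def make_translator(root):
    """Form a language processor by declaring `root` the start symbol."""
    def translate(tokens):
        # An acceptor returns the next position, or None on failure.
        return call(root, tokens, 0) == len(tokens)
    return translate

def digit(tokens, pos):      # Digit = '0' | '1'.
    return pos + 1 if pos < len(tokens) and tokens[pos] in ("0", "1") else None

def pair(tokens, pos):       # Pair = Digit Digit.
    mid = call("Digit", tokens, pos)
    return None if mid is None else call("Digit", tokens, mid)

def triple(tokens, pos):     # Triple = Pair Missing.  (Missing is undefined)
    mid = call("Pair", tokens, pos)
    return None if mid is None else call("Missing", tokens, mid)

define("Digit", digit)
define("Pair", pair)
define("Triple", triple)
```

Here make_translator("Pair") already yields a usable acceptor, whereas make_translator("Triple") fails only when the undefined nonterminal is actually reached.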

An Ml4 production has the general structure:

   identifier = sourceExpression -> targetExpression .
where the part starting with -> is called the target production and may occur repeatedly.
The possibility to describe the source as well as the target language by similar means is one of Ml4's unique features. All the structure operators of EBNF are available on both the source and the target side. There are further extensions such as declarations, assignment and call statements, etc.
Elements of Ml4 that are not part of the basic EBNF (extensions) are separated from each other and from the basic elements by semicolons. (In fact, semicolons may be used within the EBNF parts, too.)

3.2. Lexical elements

The Ml4 language is free-format, that is, whitespace and newlines can be used anywhere between lexemes.
Ml4 is case sensitive.
Comments are delimited by (* and *) and may be nested.

3.2.1 Identifiers

Identifiers (in the meta-language) must start with a letter and can contain only letters and digits. They can be of arbitrary length, but there may be an (implementation-dependent) limit on the number of characters that are recognised as significant.
To make the intermediate code as readable as possible, most identifiers are kept during translation into the host language. Therefore it is wise to avoid identifiers that may conflict with keywords of possible host languages. (E.g., DO or do are very likely to clash and thus should be avoided.)
There is also a set of reserved identifiers of the Ml4 language:
ARR DCL END FLEX GLOBVAR IMPORTS INIT MODULE REC TYPE TYPEND USE VAR

3.2.2 Literal terminals

Literals are written as strings and have to be enclosed in apostrophes, e.g. ':=' or 'BEGIN'. If the character ' itself is needed in a literal, it has to be written twice, e.g. '''Hallo!'''.
For special (non-printable) symbols there are substitutions:
	\n	newline		chosen corresponding to the actual operating system
	\c	carriage return
	\l	line feed
	\f	form feed
	\t	horizontal tabulator
	\v	vertical tabulator
	\B	bell
	\b	backspace
	\\	\
	\0	null byte
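
The quoting and substitution rules above can be sketched in Python as follows. This is an illustration only, not the Depot4 scanner: the names ESCAPES and decode_literal are hypothetical, rejecting an unknown substitution is an assumption (the text does not say what happens), and the $ features of source literals are not handled.

```python
import os

# Substitutions from the table above; "\n" depends on the operating system.
ESCAPES = {
    "n": os.linesep, "c": "\r", "l": "\n", "f": "\f", "t": "\t",
    "v": "\v", "B": "\a", "b": "\b", "\\": "\\", "0": "\0",
}

def decode_literal(text):
    """Return the characters denoted by an apostrophe-quoted literal."""
    if len(text) < 2 or text[0] != "'" or text[-1] != "'":
        raise ValueError("literal must be enclosed in apostrophes")
    body, out, i = text[1:-1], [], 0
    while i < len(body):
        ch = body[i]
        if ch == "'":
            # An apostrophe inside the literal has to be written twice.
            if body[i + 1:i + 2] != "'":
                raise ValueError("single apostrophe inside literal")
            out.append("'")
            i += 2
        elif ch == "\\":
            code = body[i + 1:i + 2]
            if code not in ESCAPES:
                # Assumption: unknown substitutions are treated as errors.
                raise ValueError(f"unknown substitution \\{code}")
            out.append(ESCAPES[code])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For instance, decode_literal("'''Hallo!'''") gives the seven characters 'Hallo!' including both apostrophes.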
In the source part of an Ml4 production one can make use of an additional feature:
For literals consisting of several characters it is possible to allow abbreviations. For that the string has to start with $i, where i is one of the digits 1...9 giving the minimum number of characters required. So '$3INTEGER' accepts the strings INT, INTE,... INTEGER, but not INTEGERS. If a literal starts with the character '$', then it has to be written twice, e.g. '$$a$' accepts the string $a$. To guarantee the separation of a literal from the succeeding text, the separating symbol $ can be used after the string. For instance, 'REAL' also accepts the beginning of the string REALUM, but 'REAL' $ does not accept it.
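
The abbreviation rule can be illustrated by a short Python sketch. The function names parse_pattern and accepts are hypothetical, and the sketch assumes that the candidate word has already been separated from the following text, so it models only which spellings are acceptable, not how the scanner finds the end of the word.

```python
def parse_pattern(literal):
    """Split a literal body such as $3INTEGER into (full_text, minimum_length)."""
    if literal.startswith("$$"):
        # A leading '$' is written twice: $$a$ denotes the text $a$.
        return literal[1:], len(literal) - 1
    if len(literal) >= 2 and literal[0] == "$" and literal[1] in "123456789":
        return literal[2:], int(literal[1])
    # No abbreviation marker: the whole text is required.
    return literal, len(literal)

def accepts(literal, word):
    """True if `word` is an allowed spelling of the (possibly abbreviated) literal."""
    full, minimum = parse_pattern(literal)
    return len(word) >= minimum and full.startswith(word)
```

With this reading, accepts("$3INTEGER", "INTE") holds, while "IN" is too short and "INTEGERS" is not a prefix of INTEGER.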
Literals may also be given as value of a variable of type SYM.

3.3. Base EBNF

EBNF is a production-based notation, i.e., the grammar is described by the set of productions P. The sets N (nonterminals) and T (terminals) are implicitly given by the occurrence of their elements. The start symbol n0 has to be marked. All productions with the same nonterminal on the left-hand side are collected into one EBNF production. An EBNF production has the general form:
   identifier = expression.
identifier is the name of a nonterminal. The dot marks the end of the production. expression is the collection of all right-hand sides of the productions with identifier on the left-hand side.
For this it is possible to use the following structure operators:
  1. Sequencing
    A sequence is simply represented by concatenation of elements (terminals and nonterminals).
    	B = A1 A2 ... An.
  2. Alternation
    Alternates are separated by vertical bars.
    	B = A1 | A2 | ... | An.
  3. Option (Zero-or-one occurrence)
    Optional parts are enclosed in square brackets.
       B = [ A ].
    Because square brackets are also used for indexing in the enhanced language (Ml4), an option following an identifier must be separated from it by at least one space (or other delimiter).

  4. Iteration (Zero-or-many occurrence)
    Iteration may be directly represented (without using recursion) by curly braces. Iteration is useful to express left association when left recursion is forbidden (as in Depot4).
       B = { A }.
  5. Parentheses (grouping)
    Parentheses may be used to override the relative priority of alternation and sequencing. E.g.
       B = 'a' ('b'|'c') 'd'.
    describes the language {'abd', 'acd'}.
These structure operators can be combined in any production. An example is the syntax of the EBNF itself:
	Rule   = ident '=' Expr '.'.
	Expr   = Term { '|' Term }.
	Term   = { Factor }.
	Factor = string | ident |'('Expr')'|'['Expr']'|'{'Expr'}'.
Ml4 allows empty productions, i.e. empty = . is valid.
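
To show how the four productions above act as an acceptor, here is a recursive-descent sketch in Python. It is not generated Depot4 code: the names TOKEN, tokenize, Acceptor and accepts_rule are hypothetical, strings are assumed to be quoted by apostrophes only, and comments are not handled.

```python
import re

TOKEN = re.compile(r"""\s*(?:(?P<ident>[A-Za-z][A-Za-z0-9]*)
                          |(?P<string>'(?:[^']|'')*')
                          |(?P<sym>[=.|()\[\]{}]))""", re.VERBOSE)
BRACKETS = {"(": ")", "[": "]", "{": "}"}

def tokenize(text):
    """Split the text into (kind, value) pairs; kind is ident, string or sym."""
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

class Acceptor:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else (None, None)

    def sym(self, value):
        if self.peek() == ("sym", value):
            self.pos += 1
            return True
        return False

    def rule(self):      # Rule = ident '=' Expr '.'.
        if self.peek()[0] != "ident":
            return False
        self.pos += 1
        return self.sym("=") and self.expr() and self.sym(".")

    def expr(self):      # Expr = Term { '|' Term }.
        if not self.term():
            return False
        while self.sym("|"):
            if not self.term():
                return False
        return True

    def term(self):      # Term = { Factor }.
        while True:
            kind, value = self.peek()
            starts_factor = kind in ("ident", "string") or (
                kind == "sym" and value in BRACKETS)
            if not starts_factor:
                return True          # zero further factors is fine
            if not self.factor():
                return False

    def factor(self):    # Factor = string | ident | '('Expr')' | '['Expr']' | '{'Expr'}'.
        kind, value = self.peek()
        if kind in ("ident", "string"):
            self.pos += 1
            return True
        if kind == "sym" and value in BRACKETS:
            self.pos += 1
            return self.expr() and self.sym(BRACKETS[value])
        return False

def accepts_rule(text):
    """True if `text` is exactly one production of the base EBNF."""
    acceptor = Acceptor(tokenize(text))
    return acceptor.rule() and acceptor.pos == len(acceptor.tokens)
```

Each method mirrors one production, so choosing rule as the entry point corresponds to choosing Rule as start symbol.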

3.4. Types

There are three kinds of data types in Ml4: primitive types, structured types, and opaque types. The latter are of interest only in connection with the import feature and allow a simple handling (declaration, parameter passing) of foreign data.

3.4.1 Primitive types

These types are predefined in every Ml4 production.
INT - actually $3INTEGER
The integer type is mapped onto the respective type of the host language.
REAL
Floating point type, likewise mapped onto a real type of the host language. There is only limited support for this type, e.g., no conversions are available.
BOOL - actually $4BOOLEAN
The boolean type, whose values are TRUE and FALSE
SYM
A type, whose values are symbols, i.e., possibly limited strings of characters. They can be, at least, concatenated and compared.
TXT
This is the basic target type. Values of this type can only be concatenated.

3.4.2 Structured types

There are three kinds of data structures: records, arrays, and flexible arrays.
RECORD - actually $3RECORD
The syntax of a record definition follows that of Pascal/Modula (without variants).
ARRAY - actually $3ARRAY
An array is a fixed-size vector of elements (which may in turn be of array type). Only the number of elements is given; the indices start at zero.
FLEX - also FLEX1, resp. FLEX2
Flexible arrays (FLEXes) are suited to store information in connection with EBNF's iteration construct. They have no upper limit for the number of their elements. Accessing a non-existing element f[i] will create it.
The index range of FLEX starts with one.
The use of this data type requires runtime management of the associated data structures and is thus expected to be less efficient than ordinary arrays in most host languages.
Flexible arrays may be of dimension one (FLEX/FLEX1) or two (FLEX2).
Examples:
ARR 20 OF INT
REC name, town: SYM; age: INT; gen: BOOLEAN END
FLEX OF SYM
FLEX2 OF RECORD F: FLEX OF INTEGER;
                AAR: ARRAY 10 OF ARRAY 5 OF REAL END
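
The behaviour described for flexible arrays (index range starting with one, no upper limit, creation on access) can be modelled by the following Python sketch. The class name Flex and its default factory are hypothetical; in particular, the initial value of a newly created element and the fact that only the accessed element is created are assumptions, since the text does not specify them.

```python
class Flex:
    """One-dimensional flexible array: 1-based, unbounded, created on access."""

    def __init__(self, default):
        self._default = default     # assumption: factory for a new element's value
        self._items = {}

    def _check(self, index):
        if index < 1:
            raise IndexError("FLEX indices start with one")

    def __getitem__(self, index):
        self._check(index)
        if index not in self._items:
            # Accessing a non-existing element creates it.
            self._items[index] = self._default()
        return self._items[index]

    def __setitem__(self, index, value):
        self._check(index)
        self._items[index] = value

    def __len__(self):
        return len(self._items)     # number of elements created so far

# FLEX OF SYM, and a two-dimensional FLEX2 OF INT built from nested Flex objects.
names = Flex(str)
table = Flex(lambda: Flex(int))
```

Reading names[3] on the fresh array creates that element, and table[2][5] += 7 works without any prior sizing.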

3.5. Productions and modules

Productions are the standard units of Ml4. For efficiency reasons, Ml4 allows several productions to be combined into a module. This is restricted to groups of nonterminals of which only one is called from outside, while the others are needed only locally. The name of the module has to be the name of the nonterminal called from outside.
The use of modules should be restricted to closed parts of grammars that are not expected to change. In particular, overriding a production in a module can have surprising results.

Productions and modules are translated separately. A nonterminal used on the right-hand side of a rule need not be defined yet.
The nonterminal's identifier (i.e., the left-hand side of the rule) becomes the identifier of all the generated entities (host language source file, object file, etc.). This means that if there are two productions with the same left-hand side, translating one of them may overwrite the implementation of the other.

There exists just one global name space for all productions. Thus, it is useful to follow a naming convention when defining new rules.
Depot4 supports prefixing, i.e., if an identifier contains a small letter or digit followed by a capital letter, the part before the first such capital is regarded as a common prefix. (E.g. Dp4 is the prefix of Dp4ExAmPlE1.) This avoids name collisions and is also applied for automatic structuring (into subsystems/packages) if the host system offers this.
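
The prefix rule can be stated compactly as a Python sketch. The function name common_prefix is hypothetical, and "small letter" is taken to mean the ASCII letters a-z, which is an assumption.

```python
import re

# A small letter or digit immediately followed by a capital letter.
_BOUNDARY = re.compile(r"[a-z0-9][A-Z]")

def common_prefix(identifier):
    """Return the part before the first such capital, or None if there is no prefix."""
    match = _BOUNDARY.search(identifier)
    return identifier[:match.start() + 1] if match else None
```

Applied to Dp4ExAmPlE1 this yields Dp4, since the digit 4 is the first small letter or digit that is followed by a capital.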

3.6. Source elements

These are all the elements that may occur in the source part of an Ml4 production. The basic structure of this part is given by EBNF.

3.6.1 Class terminals

Class terminals are entities like identifier, number, etc. that are usually called terminals although they stand for a whole class of symbols. There is no real distinction here. Just for efficiency, a set of prefabricated class terminals is supplied, which are implemented directly in the host language.
Class terminals play an important role with respect to extensibility. That is why there is no special tool for them. As for nonterminals, for every class terminal there is a module containing the corresponding acceptor procedure, called the scanning procedure. Detailed information about the terminal is needed only in this module; to the outside it is known only by its name. This enables the flexible extension of Depot4, also beyond the processing of texts.
The following terminals are supplied with every Depot4 implementation:
id, ident
Identifier of the form letter {letter|digit}. See also the notes on keywords below.
str, string
String: sequence of characters, quoted by ' or ", not containing the quoting character (similar to Oberon/Modula)
integer
Integer number in decimal or hexadecimal format:
digit{digit} | digit{hexdigit}'H'
num
Integer number in decimal format:
digit{digit}
number
Real number:
digit{hexdigit}'H'
| digit{digit}['.'{digit}[('E'|'D') ['+'|'-']digit{digit}]]
filename
Filename according to the actual operating system
line
Accepts all characters to the end-of-line (inclusive), for an example see Make linefeed ended comments
ident4root
Accepts an identifier and takes its value as the name of a nonterminal, which is called afterwards. This is useful for tests; for an example see Individual test of productions in complex environments
any
This is no real class terminal as it has no translation. Its purpose is to skip one significant character. (Example: Skip text until keyword/symbol)
The target of any class terminal is similar to its source. Usually class terminals deliver a symbol value (target no. 0), too. In most cases this symbol value is the sequence of accepted characters. However, for id/ident all letters are converted to capitals if capitalisation is insignificant, which yields a canonical representation that is useful in symbol tables etc.
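
The lexical forms given above for ident, num, integer and number can be written as regular expressions. The following Python sketch is an illustration, not the supplied scanning procedures: the names PATTERNS and is_terminal are hypothetical, and two points are assumptions because the text leaves them open, namely that letters are ASCII letters and that hexdigit means 0-9 and A-F.

```python
import re

HEXDIGIT = "[0-9A-F]"    # assumption: upper-case hexadecimal digits only

PATTERNS = {
    # letter {letter|digit}
    "ident": re.compile(r"[A-Za-z][A-Za-z0-9]*"),
    # digit{digit}
    "num": re.compile(r"[0-9]+"),
    # digit{digit} | digit{hexdigit}'H'
    "integer": re.compile(rf"[0-9]+|[0-9]{HEXDIGIT}*H"),
    # digit{hexdigit}'H' | digit{digit}['.'{digit}[('E'|'D') ['+'|'-']digit{digit}]]
    "number": re.compile(
        rf"[0-9]{HEXDIGIT}*H|[0-9]+(?:\.[0-9]*(?:[ED][+-]?[0-9]+)?)?"),
}

def is_terminal(kind, text):
    """True if the whole of `text` has the lexical form of class terminal `kind`."""
    return PATTERNS[kind].fullmatch(text) is not None
```

Note that, following the syntax as written, an exponent is only possible after a decimal point, so 3.14E-2 is a number while 3E5 is not.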
Treatment of Keywords
In general, there is no syntactical difference between ordinary identifiers and keywords. So one is likely to run into trouble with an expression like [ident] 'END', as the closing END will be accepted as an identifier. There are at least two ways to overcome this. First, one can change the grammar, e.g. into (ident 'END'|'END'), which solves the problem.
Second, Depot4 offers a more convenient solution: one can write all the words that are not identifiers into a file. By default Depot4 looks in the current directory for a file NoIdent.lst (this can be changed in module Dp4Config) and excludes all the words it contains from being recognized as identifiers.
It is also possible to change this list dynamically. This is achieved by calling procedure NoIdents from module Dp4Stdlex. The argument is the filename string. This call discards the previous list and installs a new one, which will be empty if no file was found.
There are two more procedures that can be useful in the case of nesting. Procedure pushNoIdents(filenameString) additionally saves the old settings, while popNoIdents() restores the saved status.

The syntax of an exclusion file is simple: just list the words, separated by spaces or newlines.

Example:
IMPORTS Dp4Stdlex;
lextst = Dp4Stdlex.NoIdents('PascalNoIdent.lst');
  { ident } 'END' Dp4Stdlex.pushNoIdents('CNoIdent.lst')
  { ident } 'end' Dp4Stdlex.popNoIdents(); { ident } 'UNTIL'
.
With file PascalNoIdent.lst containing at least END and file CNoIdent.lst containing end, this production will accept
alfa beta END END ELSE end end UNTIL
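
The interplay of NoIdents, pushNoIdents and popNoIdents can be modelled in Python as a current exclusion set plus a stack of saved sets. This is a sketch of the described behaviour only, not module Dp4Stdlex: the class IdentFilter and its method names are hypothetical, and for self-containment the methods take the contents of an exclusion file instead of a filename.

```python
class IdentFilter:
    """Decides whether a word counts as an identifier, given exclusion lists."""

    def __init__(self):
        self._excluded = set()
        self._saved = []

    def no_idents(self, listing):
        """Discard the previous list and install the words of `listing`."""
        # Words in an exclusion file are separated by spaces or newlines.
        self._excluded = set(listing.split())

    def push_no_idents(self, listing):
        """Like no_idents, but save the old settings first."""
        self._saved.append(self._excluded)
        self.no_idents(listing)

    def pop_no_idents(self):
        """Restore the most recently saved settings."""
        self._excluded = self._saved.pop()

    def is_ident(self, word):
        well_formed = word.isalnum() and word[0].isalpha()
        return well_formed and word not in self._excluded

# The three phases of the example above.
lex = IdentFilter()
lex.no_idents("END")
phase1 = (lex.is_ident("alfa"), lex.is_ident("END"))
lex.push_no_idents("end")
phase2 = (lex.is_ident("END"), lex.is_ident("end"))
lex.pop_no_idents()
phase3 = (lex.is_ident("end"), lex.is_ident("END"))
```

In the middle phase END is an ordinary identifier and end is not, which is why the second { ident } in the example consumes END ELSE and stops at end.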

3.6.2 Nonterminals

As mentioned earlier, there is, technically, no real distinction between class terminals and nonterminals. Thus, all features described here may also be applied to class terminals. However, while one can compile a reasonable set of basic class terminals, there is no such basic set of nonterminals.
Nonterminals are called by their names. It is possible to call them recursively, i.e., the left-hand side of a production may appear on the right-hand side. It then has to be ensured that the recursion is finite. (Due to Ml4's operational model, automatic detection is not possible in general.)
Often it is necessary to distinguish several instances of the same nonterminal within a production.

There are two possibilities to modify nonterminals in the description of the source:

  1. Renaming
    By Name:NT the nonterminal NT gets the new designation Name. Renaming is usually used if a nonterminal occurs on several positions in a production:
        Prod = F1:Fact [ Op F2:Fact]
          -> F1_ [Op_ F2_].
    But renaming can also be used in the reversed way. It is possible to give different nonterminals in different branches of an alternative the same name if they are to be treated equally:
        Stat = S:IfStat | S:AssStat | S:ForStat
          -> S_.
  2. Indexing
    By NT[index] it is possible to provide nonterminals with indices. This is usually used in connection with iterations:
        DclSeq = { Dcl[i] }
          -> { Dcl_[i] }.
    Every nonterminal can get at most two indices. To distinguish between the brackets for indices and for options, the following rule has to be obeyed: There must not be a space, newline or comment between the nonterminal and the opening index bracket. In contrast, there has to be a delimiter between a nonterminal and an opening option bracket.
Indexing and renaming may be combined:
    Seq = { D:ConstDef[i] | D:TypeDef[i] }
      -> { D_[i] }.

3.6.3 Skipping

Normally, there can be an arbitrary number of delimiters, i.e. spaces, newlines and comments between two successive terminals in the source. They are automatically skipped. But sometimes it is necessary to suppress this behaviour. Then all source elements in front of which delimiters are not allowed have to be enclosed in < and >. In this way class terminals can easily be implemented, too.
   Integer = digit < { digit } >.
By the enclosure in < ... > delimiters inside the number are prohibited. An exception is the first digit, so that delimiters in front of the number can be ignored.
This feature may also be used to parse formatted, e.g., tab separated, input.
Skipping areas may be nested. Skipping is then disabled until the outermost scope is left.
Remark: It is essential to select the correct stretch of text because < and > imply internal actions. So, e.g., <digit [digit>] will not work correctly if only one digit was accepted.
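
The effect of the < ... > enclosure can be seen by comparing two hand-written acceptors in Python. This is only a sketch of the described behaviour: the function names are hypothetical, delimiters are reduced to blanks, tabs and line ends (comments are left out), and digits are the ASCII digits.

```python
DELIMITERS = " \t\r\n"      # simplification: comments are not treated here
DIGITS = "0123456789"

def skip_delimiters(text, pos):
    while pos < len(text) and text[pos] in DELIMITERS:
        pos += 1
    return pos

def accept_integer(text, pos=0):
    """Integer = digit < { digit } >.  Returns (lexeme, next_pos) or None."""
    start = skip_delimiters(text, pos)       # allowed in front of the first digit
    if start >= len(text) or text[start] not in DIGITS:
        return None
    end = start + 1
    while end < len(text) and text[end] in DIGITS:
        end += 1                             # inside < >: no delimiter skipping
    return text[start:end], end

def accept_spaced_digits(text, pos=0):
    """Spaced = digit { digit }.  Delimiters are skipped before every digit."""
    digits = []
    while True:
        nxt = skip_delimiters(text, pos)
        if nxt >= len(text) or text[nxt] not in DIGITS:
            break
        digits.append(text[nxt])
        pos = nxt + 1
    return ("".join(digits), pos) if digits else None
```

On the input "  12 34" the first acceptor stops after 12, whereas the second, without the enclosure, runs through both groups of digits.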

3.6.4 Procedure calls

Although Ml4 aims at translation descriptions that are highly independent of the system's actual host language, it does not take a purist's view and offers an interface to those basic system features. The interface is defined by procedures (or routines or methods) encapsulated in a unit called a module, e.g. a class in Java or an Ada package. Calls to such procedures may be embedded in the source text of the parsing part. The import of modules is described in 3.14.1.
Procedure calls must contain a (possibly empty) parameter list.
Due to this generality there is no simple way of type checking, which is therefore deferred to the host language compiler. This solution is not fully satisfactory. Nevertheless, it is usually not too hard to link the error message with the appropriate Ml4 code position.
Future versions may offer additional means.

Intrinsic procedures are described in 3.7.2, regardless of whether they are proper procedures (i.e. have no return value) or not.

3.6.5 Assignments

Any variable can be assigned a value of its type. There are some automatic conversions into type SYM. Be aware that the translator does not know anything about the type of imported entities. Thus it cannot insert any conversion or check compatibility.

The result of an assignment is not reverted during back-tracking.
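
What it means for an assignment to survive back-tracking can be shown with a toy Python parser. The model of back-tracking used here (restore the input position, try the next alternative) is a simplifying assumption and not a description of Depot4's internals; the class Parser and the branch functions first and second are hypothetical.

```python
class Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0
        self.count = 0                 # stands in for an Ml4 variable

    def literal(self, word):
        if self.text.startswith(word, self.pos):
            self.pos += len(word)
            return True
        return False

    def alternative(self, *branches):
        """Try each branch in turn; on failure restore only the input position."""
        for branch in branches:
            saved = self.pos
            if branch(self):
                return True
            self.pos = saved           # the position is reverted, variables are not
        return False

def first(p):    # 'a' ; count := count + 1 ; 'b'
    if not p.literal("a"):
        return False
    p.count += 1                       # assignment inside a branch that will fail
    return p.literal("b")

def second(p):   # 'a' 'c'
    return p.literal("a") and p.literal("c")

parser = Parser("ac")
accepted = parser.alternative(first, second)
```

The input is accepted through the second branch, yet parser.count still holds the value assigned in the abandoned first branch, so assignments in the source part should not rely on being undone.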

3.7. Expressions

Expressions are built according to rules similar to those of Pascal, i.e., with three levels of priority. Unary operators (sign, NOT) are of the highest level.

3.7.1 Operators

Add operators
+, -, OR
+ also serves for concatenation (types SYM and TXT)
Mul operators
*, DIV, MOD, &
& is the logical AND operator
Comparisons
=, #, <=, >=, <, >
# stands for not equal

3.7.2 Intrinsic procedures

Functional procedures
Proper procedures

3.7.3 Predeclared variables and constants

The following variables are predeclared in every Ml4 production and, thus, must not be explicitly redeclared. They serve as default control variables (see there), but can - with some care - be applied elsewhere, too.
Integer: N, O, i, c

Variables with special function (all of type SYM):

Boolean constants: FALSE, TRUE




© J. Lampe 1997-2002   juergen_lampe@firemail.de               (18-Mar-2002)