The first job of the translator is to work out what each line of the source code refers to. The lexical analyser reads each line and decides whether it contains keywords, variable names or constants, converting the text into what are known as tokens. Anything which does not match the pattern for a keyword or a name will be flagged up as an error.

IFFY A=1 THEN PLINT "moo"

The lexical analyser will pick up "IFFY" and "PLINT" as errors as they do not match any known keyword.

IF 1=A PRINT THEN "moo"

The lexical analyser will not pick up this error, as it does not check whether things are in the correct order. That is the job of syntax analysis. However, confusingly, both are referred to as syntax errors! Remember: if it is spelt wrong, it is caught by lexical analysis!

For example, the line 2 * 3 + 4 would be converted to the following tokens.

  • NUMBER <2>
  • MULTIPLIES
  • NUMBER <3>
  • ADD
  • NUMBER <4>
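
As a rough sketch of how this conversion might work (not the method of any particular compiler), the following Python tokeniser recognises just the tokens listed above plus white space; the token names NUMBER, MULTIPLIES and ADD mirror the list.

import re

# Token patterns, tried in order; the names mirror the list above.
TOKEN_PATTERNS = [
    ("NUMBER",     r"[0-9]+"),
    ("MULTIPLIES", r"\*"),
    ("ADD",        r"\+"),
    ("SKIP",       r"\s+"),       # white space is matched but discarded
]

def tokenise(source):
    tokens = []
    position = 0
    while position < len(source):
        for name, pattern in TOKEN_PATTERNS:
            match = re.match(pattern, source[position:])
            if match:
                if name != "SKIP":
                    tokens.append((name, match.group()))
                position += match.end()
                break
        else:
            # Nothing matched: flag the character as a lexical error.
            raise SyntaxError("Unrecognised character: " + source[position])
    return tokens

print(tokenise("2 * 3 + 4"))
# [('NUMBER', '2'), ('MULTIPLIES', '*'), ('NUMBER', '3'), ('ADD', '+'), ('NUMBER', '4')]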

A token may have a value, for example for variable names, procedure names or constants. Variable names, constants and literals are stored in a lookup table known as the symbol table. That way a complicated name such as "This_is_my_variable_name" can be simplified: the compiler replaces the variable with a token whose value is an index into the lookup table. For each entry the symbol table stores (a small sketch follows the list) -

  • An index or identifier
  • The name
  • Data type
  • Scope and restrictions
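
As a minimal illustrative sketch in Python (the field names and types here are assumptions, not those of a real compiler), a symbol table entry and lookup could be represented as follows.

from dataclasses import dataclass

@dataclass
class Symbol:
    index: int       # the value carried by the token, used to look the entry up
    name: str        # e.g. "This_is_my_variable_name"
    data_type: str   # e.g. "integer"
    scope: str       # e.g. "global" or "local", plus any restrictions

symbol_table = {}

def add_symbol(name, data_type, scope):
    # Re-use the existing entry if the name has already been seen.
    if name not in symbol_table:
        symbol_table[name] = Symbol(len(symbol_table), name, data_type, scope)
    return symbol_table[name].index

index = add_symbol("This_is_my_variable_name", "integer", "global")
# The lexer can now emit a VARIABLE token carrying this index instead of the long name.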

Comments and white space are removed at this stage. Spaces, carriage returns and comments are simply there to aid the programmer. Indentation, which is vital for producing clean, readable code, is completely useless to the compiler.
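
A rough illustration of this clean-up in Python, assuming a language where // introduces a comment that runs to the end of the line:

import re

def strip_comment_and_whitespace(line):
    line = re.sub(r"//.*", "", line)   # delete a // comment up to the end of the line
    return line.strip()                # drop leading and trailing white space

print(strip_comment_and_whitespace("   A = A + 2   // add two"))
# prints "A = A + 2"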

Consider the following code snippet.

// comment
IF A>5 THEN A = A * 2
ELSE A = A * 3
A = A + 2

This code produces the following tokens.

IF VARIABLE GREATER_THAN CONSTANT THEN VARIABLE EQUALS VARIABLE MULTIPLIES CONSTANT ELSE VARIABLE EQUALS VARIABLE MULTIPLIES CONSTANT VARIABLE EQUALS VARIABLE PLUS CONSTANT
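
A sketch of how such a token stream might be produced, again in Python; the keyword list and token names here are assumptions based only on the example above, not a real language definition.

import re

KEYWORDS = {"IF", "THEN", "ELSE"}
SYMBOLS  = {">": "GREATER_THAN", "=": "EQUALS", "*": "MULTIPLIES", "+": "PLUS"}

def tokenise(snippet):
    tokens = []
    for line in snippet.splitlines():
        line = re.sub(r"//.*", "", line)                      # comments are discarded
        for word in re.findall(r"[A-Za-z_]\w*|[0-9]+|[>=*+]", line):
            if word in KEYWORDS:
                tokens.append(word)                           # keyword token
            elif word in SYMBOLS:
                tokens.append(SYMBOLS[word])                  # operator token
            elif word.isdigit():
                tokens.append("CONSTANT")
            else:
                tokens.append("VARIABLE")
    return tokens

snippet = """// comment
IF A>5 THEN A = A * 2
ELSE A = A * 3
A = A + 2"""
print(" ".join(tokenise(snippet)))
# IF VARIABLE GREATER_THAN CONSTANT THEN VARIABLE EQUALS VARIABLE MULTIPLIES CONSTANT ...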

This string of tokens provides the input to the next phase. When an error is produced, a program known as the report generator kicks in. It returns an error message to the developer, allowing them to debug the program. This process is known as translator diagnostics.

At this stage the lexical analyser may encounter errors, which will be reported to the user. For example, consider the following statement

1Var = 10

Looking at this as a human, we can probably guess that this is a variable called "1Var" which is going to be assigned the value 10. However, the compiler will not have such an easy task reading this. The first thing it sees is a digit, so straight away it will assume we are dealing with a number. The next thing it sees is the letter V. The computer now has to make a decision: is this an error, or is it a variable? We can make this precise by writing regular expressions for each token. A possible definition for a variable name would be

[0-9a-zA-Z]*

Numbers would be

[0-9]*

It is perfectly possible to write the rules this way. However, consider the following

123 = 50

According to the regular expressions we have written, "123" could be either a variable or a number. What token should we assign? We could say that, because there is an equals sign after it, it should be a variable, but then we would be doing syntax analysis, which is the next stage. Effectively we have an ambiguous set of definitions. As a result, most languages do not accept a digit as the first character of a variable name, so a more common variable definition would look like

[a-zA-Z][0-9a-zA-Z]*

Consider the statement we started with

1Var = 10

As it starts with a digit, the lexical analyser will raise an error for 1Var. This will be reported back through translator diagnostics.
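
To make this concrete, here is a small Python check using the two definitions above; the ^ and $ anchors simply force the whole string to match.

import re

old_variable = re.compile(r"^[0-9a-zA-Z]*$")          # the ambiguous definition
number       = re.compile(r"^[0-9]*$")
new_variable = re.compile(r"^[a-zA-Z][0-9a-zA-Z]*$")  # must start with a letter

print(bool(old_variable.match("123")), bool(number.match("123")))   # True True - ambiguous
print(bool(new_variable.match("123")))    # False - no longer a valid variable name
print(bool(new_variable.match("1Var")))   # False - the lexical error reported above
print(bool(new_variable.match("Var1")))   # True  - digits are fine after the first letter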

Technical notes -

For the interested reader: the lexical analyser is based on regular expressions. Each token has a regular expression assigned to it. Each of these regular expressions is converted into a finite state machine (finite automaton), and all of these automata are then combined into one single state machine by a well-defined algorithm. It is this finite state machine which handles the conversion of source code into tokens.
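
As a loose illustration of the same idea in Python (this relies on the re module's matching engine rather than a hand-built automaton), every token's regular expression can be combined into one alternation of named groups, which then drives the whole tokeniser:

import re

TOKEN_SPECIFICATION = [
    ("NUMBER",   r"[0-9]+"),
    ("VARIABLE", r"[a-zA-Z][0-9a-zA-Z]*"),
    ("EQUALS",   r"="),
    ("PLUS",     r"\+"),
    ("SKIP",     r"\s+"),
]

# Combine every pattern into a single alternation of named groups.
MASTER_PATTERN = re.compile("|".join(
    "(?P<%s>%s)" % (name, pattern) for name, pattern in TOKEN_SPECIFICATION))

def tokenise(source):
    tokens = []
    for match in MASTER_PATTERN.finditer(source):
        if match.lastgroup != "SKIP":                 # white space is dropped
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenise("Var1 = 10 + 2"))
# [('VARIABLE', 'Var1'), ('EQUALS', '='), ('NUMBER', '10'), ('PLUS', '+'), ('NUMBER', '2')]

Note that a real lexical analyser would also report any characters that none of the patterns match; finditer here simply skips over them.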