This is an excerpt from my ongoing book project on Alex and Happy, which are Haskell tools for building parsers and lexers. The code examples are available on GitHub.
In this chapter we discuss how to parse two features of programming language grammars: strings and comments.
Given the text
/*
This whole section commented out
/* A sample program */
*/
"Bo\"bcat" cat cat cat
"/*Comment*/" cat cat /*cat*/
we expect our program to tell us that the person called Bo"bcat has three cats and the person called /*Comment*/ has two cats. The names of the cat owners are given as strings delimited by double-quote characters, with the escape sequence \" used for a double quote within a name. /* and */ denote the beginning and end of comments. Comments can span multiple lines, and nested comments are allowed.
Comments and strings follow very different rules of lexical analysis than the rest of the program. Most text within strings and comments is not interpreted; escape sequences within strings, on the other hand, are. The nesting of comments poses a further problem: a regular expression cannot remember an arbitrary depth of nesting, so the depth of nested comments must be tracked elsewhere.
Token definitions
Startcodes are the solution alex provides to this problem. A startcode is just a number, with a set of these numbers attached to each rule in an alex specification. When calling alex to find a token we specify the startcode to use, and alex then considers only the rules belonging to that startcode.
A lexer and parser for the language of this chapter is in the startcode/ folder of the GitHub repository of this book. It uses the monadic interfaces of alex and happy, which are discussed in Chapter 3.
Let’s look at the token definitions in Lexer.x first. The initial definitions are ordinary ones that skip whitespace and define the cat token:
<0> $white+ ;
<0> cat {plainTok TCat}
Note the <0>. This specifies the startcode for which these rules apply. 0 is the default startcode, which we will use for the main program text. Beware of a common error: if you leave out the startcode at the beginning of a rule, that rule applies to all startcodes. This can bite you when your lexer has multiple startcodes.
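The pitfall is easy to demonstrate with a hypothetical spec fragment (the dog token and the TDog constructor are made up for illustration):

```
-- No startcode prefix: this rule is active in ALL startcodes,
-- including stringSC and commentSC, which is almost certainly a bug.
dog {plainTok TDog}

-- What was probably intended: active only in ordinary program text.
<0> dog {plainTok TDog}
```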
Now for the patterns for strings:
<0> \" {beginString}
<stringSC> \" {endString}
<stringSC> . {appendString}
<stringSC> \\[nt\"] {escapeString}
The first rule, which has startcode 0, matches the double quote in ordinary program text that starts a string. stringSC defines a new startcode. In the Haskell code generated by alex, stringSC will be defined with a numeric value which can be used in other parts of the program. The action beginString, whose implementation we will discuss below, ensures that later invocations of alexScan use the startcode stringSC.
The second rule also matches a double quote, but in the context of the startcode stringSC it recognizes the end of the string. The action endString yields the token constructed out of the string and ensures that future invocations of alexScan happen with startcode 0.
The last two rules accumulate characters in the string. The first matches ordinary characters and the second the escape sequences \n, \t and \".
The patterns for the comments are similar:
<0,commentSC> "/*" {beginComment}
<commentSC> "*/" {endComment}
<commentSC> [.\n] ;
The first rule recognizes the beginning of a comment. The prefix <0,commentSC> specifies that it applies both in the middle of ordinary program text and in the middle of a comment, when the startcode is commentSC. This handles nested comments. The second rule handles the end of a comment. Between them, beginComment and endComment must keep track of the nesting level, with endComment switching back to startcode 0 only when all levels of comments have been closed.
The last rule just ignores all characters in a comment. Remember that the catch-all character class . does not match newlines; we therefore include an explicit \n to allow comments to span lines.
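The depth counting that beginComment and endComment must do can be sketched as a pure function (illustrative only; skipComment is a hypothetical name, and the real actions keep the depth in the parser state, as shown later in this chapter):

```haskell
-- Skip to the end of a comment, honouring nesting.  Call with depth 1
-- just after an opening "/*"; returns the input after the matching "*/",
-- or Nothing for an unterminated comment.
skipComment :: Int -> String -> Maybe String
skipComment 0     rest           = Just rest                     -- all levels closed
skipComment depth ('/':'*':rest) = skipComment (depth + 1) rest  -- like beginComment
skipComment depth ('*':'/':rest) = skipComment (depth - 1) rest  -- like endComment
skipComment depth (_:rest)       = skipComment depth rest        -- ignore contents
skipComment _     []             = Nothing                       -- unterminated
```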
The state
We use the State monad as our parser/lexer monad. The state is of type ParseState, which keeps track of the nesting level of comments and accumulates characters in strings. The accompanying type AlexInput keeps track of line numbers. The implementations of these types are in the file LexParseCommon.hs.
data AlexInput
  = AlexInput {
      aiprev   :: Char,
      aibytes  :: [Word8],
      airest   :: String,
      ailineno :: Int}
    deriving Show

-- [...]

data ParseState =
  ParseState {input :: AlexInput,
              lexSC :: Int,          -- Lexer start code
              commentDepth :: Int,   -- Comment depth
              stringBuf :: String    -- Temporary storage for strings
             }
  deriving Show
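To make the bookkeeping concrete, here is a plausible initialiser for this state. The record definitions are repeated so the snippet is self-contained; the name initialState and the choice aiprev = '\n' (mimicking alex's convention that scanning starts at the beginning of a line) are assumptions, and the repository's own initialiser may differ.

```haskell
import Data.Word (Word8)

data AlexInput
  = AlexInput {
      aiprev   :: Char,
      aibytes  :: [Word8],
      airest   :: String,
      ailineno :: Int}
    deriving Show

data ParseState =
  ParseState {input :: AlexInput,
              lexSC :: Int,          -- Lexer start code
              commentDepth :: Int,   -- Comment depth
              stringBuf :: String}   -- Temporary storage for strings
  deriving Show

-- Hypothetical initialiser: start in the default startcode 0, outside
-- any comment, with an empty string buffer, on line 1 of the source.
initialState :: String -> ParseState
initialState src = ParseState
  { input = AlexInput {aiprev = '\n', aibytes = [], airest = src, ailineno = 1}
  , lexSC = 0
  , commentDepth = 0
  , stringBuf = "" }
```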
The Lexer Actions
Let’s now go back to the lexer actions in Lexer.x. The lexer actions have type
type LexAction = Int -> String -> P (Maybe Token)
with the arguments being the number of characters matched and the matching string. The result is a Maybe Token since some actions, such as those marking the beginning of a string or the accumulation of characters within a string, do not yield a token.
String lexing actions
beginString sets the startcode to stringSC. appendString and escapeString add characters to the beginning of stringBuf. endString sets the startcode back to 0 and returns the string saved in stringBuf, reversed (since the characters were added to the beginning of the string). Characters are added to the beginning rather than the end because prepending to a Haskell list is O(1), while appending to the end is O(n) in the length of the list. The actions get and put for reading and writing the state are provided by the State monad.
beginString::LexAction
beginString _ _ = do
  s <- get
  put s{lexSC = stringSC}
  return Nothing

appendString::LexAction
appendString _ (c:_) = do
  s <- get
  put s{stringBuf = c:(stringBuf s)}
  return Nothing

escapeString::LexAction
escapeString _ (_:c:_) = do
  let unesc =
        case c of   -- the pattern \\[nt\"] guarantees c is 'n', 't' or '"'
          'n' -> '\n'
          't' -> '\t'
          '"' -> '"'
  s <- get
  put s{stringBuf = unesc:(stringBuf s)}
  return Nothing

endString::LexAction
endString _ _ = do
  s <- get
  let buf = stringBuf s
  put s{lexSC = 0, stringBuf = ""}
  return $ Just $ TString (reverse buf)
Comment lexing actions
The implementation of the comment lexing actions is similar. The beginning of a comment changes the startcode to commentSC and increments commentDepth. The end of a comment decrements commentDepth, setting the startcode back to 0 when commentDepth reaches 0.
beginComment::LexAction
beginComment _ _ = do
  s <- get
  put s {lexSC = commentSC,
         commentDepth = (commentDepth s)+1}
  return Nothing

endComment::LexAction
endComment _ _ = do
  s <- get
  let cd = commentDepth s
  let sc' = if cd==1 then 0 else commentSC
  put s {lexSC=sc', commentDepth=cd-1}
  return Nothing
The driver
The monadic action readToken in Lexer.x provides the interface with alex. The following lines call alexScan with the startcode saved in the state:
s <- get
case alexScan (input s) (lexSC s) of
When alex returns AlexToken, indicating a match, we call the corresponding action, looping back by calling ourselves if the action returns Nothing, indicating that no token is generated.
-- [...]
AlexToken inp' n act -> do
  let (AlexInput{airest=buf}) = input s
  put s{input = inp'}
  res <- act n (take n buf)
  case res of
    Nothing -> readToken
    Just t  -> return t
The rest of the code is routine. lexer in Lexer.x wraps readToken in the interface expected by happy. The happy parser itself is in Parser.y. main in Main.hs parses text read from the standard input.
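The wrapper's shape is worth seeing once. The sketch below is self-contained and not the repository's code: it swaps the real ParseState for a bare list of pending tokens, so that the continuation-passing pattern happy's %lexer directive expects stands out.

```haskell
import Control.Monad.State

data Token = TCat | TString String | TEOF deriving (Eq, Show)

-- Stand-in for the book's parser monad P: a State monad whose state is
-- just the list of tokens still to be produced.
type P = State [Token]

-- Stand-in for readToken: pop the next token, or signal end of input.
readToken :: P Token
readToken = do
  ts <- get
  case ts of
    []       -> return TEOF
    (t:rest) -> put rest >> return t

-- The interface happy expects from a monadic lexer: fetch one token
-- and hand it to the parser's continuation.
lexer :: (Token -> P a) -> P a
lexer cont = readToken >>= cont
```

The real lexer has exactly this two-line shape; only the state it manipulates is the richer ParseState described above.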