Tuesday, February 12, 2008

Command Line Parsing using JFlex

What started out as a small set of commands for a tool I'm writing is slowly growing unwieldy enough that parsing the command line manually means a lot of repetitious code and wading through lines of if/else or switch statements. (Don't preach to me about the virtues of the Command design pattern: it is still unwieldy, because it does not handle the parsing of arguments, even if the hash saves you from long branching segments of code, which I don't mind anyway. In my opinion, one file navigated with folds is visually easier than fragmenting the code into one command per file.)

Rather than deal with the unwieldy mess of buggy, manual coding using an ad-hoc mixture of Regular Expressions and StringTokenizers, I decided to start using a lexical analyzer instead. The one that I'm using is JFlex, a lexer generator for Java, which is probably the most popular (or only?) one around.

Barring the initial learning curve, having a lexical analyzer certainly makes life much easier: it automatically breaks the command string down into tokens, without my having to intervene to handle white space, separators and such. A simple example of a lexer specification that breaks up commands and arguments looks something like this:


/** The lexer for scanning command tokens. */
%%

%class CommandLexer

Parameter = [:jletterdigit:]+
WhiteSpace = [ \n\t\f]

%%

[:digit:]+ { return new Yytoken(Integer.parseInt(yytext())); }
{Parameter} { return new Yytoken(yytext()); }
{WhiteSpace} { /* Ignore Whitespace */ }
"-" { return new Yytoken('-'); }
"," { return new Yytoken(','); }


The example should hopefully be simple enough not to cause a 'cringe factor' or the need to refer to the Dragon Book.

There are three sections in a JFlex definition file, separated by '%%' lines. The first section is straightforward: whatever you put there is copied verbatim to the top of the generated file (package and import statements, for instance).

The second section is a list of directives and macro definitions that tell JFlex what to do. In this case, I've told JFlex to generate a class called 'CommandLexer', which is written to 'CommandLexer.java'. The next two lines define the macros 'Parameter' and 'WhiteSpace', which I can then refer to in the rules.

The last section is where you define the lexical rules that let the generated scanner discern what is a token and, in my case, what type of token it is. In my example, rule 1, '[:digit:]+', matches one or more digits and turns them into an integer token. Rule 2 matches what I call a parameter: one or more characters that are legal in a Java identifier (letters, digits, '_', '$' and the like). Because rule 1 comes first, a run consisting only of digits becomes an integer token, so a parameter in practice contains at least one non-digit character. Rule 3 tells the scanner to ignore all WhiteSpace characters, while rules 4 and 5 define the separators, in my case the characters '-' and ','.

Note that ordering is important. The scanner always prefers the longest match, but when two rules match the same amount of text, the one listed first wins. If I swapped rules 1 and 2, any run of digits would match {Parameter} first, and the [:digit:]+ rule would never match. JFlex warns you when that is the case (see the 'Rule can never be matched' warning below):


Reading "commandlexer.jflex"
Constructing NFA : 16 states in NFA
Converting NFA to DFA :
.....

Warning in file "commandlexer.jflex" (line 13):
Rule can never be matched:
[:digit:]+ { return new Yytoken(Integer.parseInt(yytext())); }

7 states before minimization, 5 states in minimized DFA
Old file "CommandLexer.java" saved as "CommandLexer.java~"
Writing code to "CommandLexer.java"
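
The tie-breaking behaviour behind that warning can be imitated in plain Java. The sketch below is not JFlex code and not what the generated scanner does internally; it is only a rough illustration using java.util.regex, where the class name RulePriorityDemo, the rule names INT and PARAMETER, and the ASCII-only stand-in for [:jletterdigit:] are all assumptions made for the demo. It picks the longest match at the start of the input and, on a tie, keeps the rule that was listed first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo (not JFlex code): longest match wins, ties go to the earlier rule. */
public class RulePriorityDemo {

    /** Returns the name of the rule that wins at the start of input, or null if none matches. */
    static String winningRule(Map<String, Pattern> rules, String input) {
        String winner = null;
        int longest = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors the match at the start of the input.
            // The strict '>' keeps the earlier rule when two matches have equal length.
            if (m.lookingAt() && m.end() > longest) {
                longest = m.end();
                winner = rule.getKey();
            }
        }
        return winner;
    }

    /** Builds an ordered rule table; insertion order stands in for rule order in the spec. */
    static Map<String, Pattern> rules(boolean digitsFirst) {
        Pattern digits = Pattern.compile("[0-9]+");
        // Rough ASCII-only stand-in for [:jletterdigit:]+ (an assumption for this demo).
        Pattern parameter = Pattern.compile("[A-Za-z0-9_$]+");
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (digitsFirst) {
            rules.put("INT", digits);
            rules.put("PARAMETER", parameter);
        } else {
            rules.put("PARAMETER", parameter);
            rules.put("INT", digits);
        }
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(winningRule(rules(true), "42"));     // INT: tie, earlier rule wins
        System.out.println(winningRule(rules(true), "42abc"));  // PARAMETER: longer match wins
        System.out.println(winningRule(rules(false), "42"));    // PARAMETER: INT can never win
    }
}
```

With the digit rule listed second, there is no input on which it can beat the parameter rule, which is exactly the situation JFlex reports.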


The next thing to do is to create the actual token class, which is called Yytoken by default. A typical Yytoken.java file looks somewhat like this:


/** A single scanner token. */
public class Yytoken {
    public boolean is_separator = false;
    public boolean is_int = false;
    public boolean is_token = false;

    public char separator;
    public String token = null;
    public int value = 0;

    /** Default for range separator. */
    public Yytoken(char c) {
        is_separator = true;
        separator = c;
    }

    public Yytoken(int value) {
        is_int = true;
        this.value = value;
    }

    public Yytoken(String token) {
        is_token = true;
        this.token = token;
    }

    public String toString() {
        if (is_separator) return "Range Token("+separator+")";
        else if (is_int) return "Int Token("+value+")";
        else return "Token ("+token+")";
    }
}


To test it, you can write a simple harness to read from stdin:


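import java.io.InputStreamReader;

/** Test class to try out the command lexer. */
public class UseCommandLexer {
    public static void main(String[] args) throws Exception {
        // Wrapping System.in in a Reader works whether or not the
        // generated scanner also has an InputStream constructor.
        CommandLexer command_lexer = new CommandLexer(new InputStreamReader(System.in));
        Yytoken token;
        // yylex() returns null at end of input, so stop there
        // instead of printing a final "token = null".
        while ((token = command_lexer.yylex()) != null) {
            System.out.println("token = " + token);
        }
    }
}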
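
Once the tokens are coming out, something still has to interpret them. As a rough sketch of that next step, the class below expands a token sequence such as 1-3,7 into the numbers it names. It assumes that '-' denotes an inclusive range and ',' separates items, which the token class hints at but the post does not spell out. RangeExpander and its nested Tok class are hypothetical names; Tok is a cut-down stand-in for Yytoken so that the example compiles without the generated lexer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: turns a token sequence such as 1-3,7 into the numbers it names. */
public class RangeExpander {

    /** Minimal stand-in for Yytoken: either an int value or a separator character. */
    static final class Tok {
        final boolean isInt;
        final int value;
        final char separator;

        private Tok(boolean isInt, int value, char separator) {
            this.isInt = isInt;
            this.value = value;
            this.separator = separator;
        }

        static Tok num(int v) { return new Tok(true, v, '\0'); }
        static Tok sep(char c) { return new Tok(false, 0, c); }
    }

    /** Expands items of the form INT or INT '-' INT, separated by ','. */
    static List<Integer> expand(List<Tok> tokens) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            Tok start = tokens.get(i++);
            if (!start.isInt) {
                throw new IllegalArgumentException("expected a number at token " + (i - 1));
            }
            int end = start.value;
            // An optional '-' followed by a number turns the item into a range.
            if (i < tokens.size() && !tokens.get(i).isInt && tokens.get(i).separator == '-') {
                i++;
                if (i >= tokens.size() || !tokens.get(i).isInt) {
                    throw new IllegalArgumentException("expected a number after '-'");
                }
                end = tokens.get(i++).value;
            }
            for (int n = start.value; n <= end; n++) {
                out.add(n);
            }
            // Anything that follows an item must be a ',' separator.
            if (i < tokens.size()) {
                Tok t = tokens.get(i++);
                if (t.isInt || t.separator != ',') {
                    throw new IllegalArgumentException("expected ',' at token " + (i - 1));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
                Tok.num(1), Tok.sep('-'), Tok.num(3), Tok.sep(','), Tok.num(7));
        System.out.println(expand(tokens)); // prints [1, 2, 3, 7]
    }
}
```

In a real tool the list would be filled from yylex() rather than built by hand, and the checks on isInt and separator would read the is_int and is_separator fields of Yytoken instead.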


That's a really basic tutorial on using JFlex. Learning the rest of it will probably take some more RTFM, but in the meantime, have fun processing your command line!
