Custom Language: Check syntax with lexer


Hello everyone,

I have realized that I don't fully understand the purpose of states in a JFlex lexer.

I used a Grammar-Kit .bnf file to generate a parser, so the program's syntax is already described in the .bnf file.
I understand the general use of lexer states, but to what extent do I need to implement or check the program's syntax again in the lexer?
Is there any argument against having a single state (if the syntax allows it) that returns every possible token for each keyword, number, ...?

How complex should the lexer be compared to the parser?

1 comment


Quick intro:

The main requirement for a lexer is that it splits the file text into language tokens in the order they appear in the file, without validating their order or the syntax. For example:

any + ++ return void boolean "any"

in Java could be tokenized by the lexer as identifier, plus_operator, increment_operator, return_keyword, void_keyword, boolean_keyword, string_literal, and that is correct lexer output even though the sequence is not a valid program.
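To make that concrete, here is a minimal hand-written sketch in plain Java of what a single-state lexer does with the line above. It is not JFlex-generated code and not the IntelliJ Lexer API; the class name SingleStateLexer, the keyword set, and the bad_character fallback are assumptions made for illustration. It also drops whitespace for brevity, whereas a lexer used in an IntelliJ plugin has to return a token for whitespace as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical single-state lexer: it only recognizes tokens,
// it never checks whether their order makes sense.
final class SingleStateLexer {
    // Illustrative keyword set, limited to the words used in the example.
    private static final Set<String> KEYWORDS = Set.of("return", "void", "boolean");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace is dropped in this sketch
            } else if (text.startsWith("++", i)) {
                tokens.add("increment_operator"); // longest match is tried before "+"
                i += 2;
            } else if (c == '+') {
                tokens.add("plus_operator");
                i++;
            } else if (c == '"') {
                int end = text.indexOf('"', i + 1);
                // an unterminated string runs to the end of the input
                i = (end < 0) ? text.length() : end + 1;
                tokens.add("string_literal");
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < text.length() && Character.isJavaIdentifierPart(text.charAt(i))) {
                    i++;
                }
                String word = text.substring(start, i);
                tokens.add(KEYWORDS.contains(word) ? word + "_keyword" : "identifier");
            } else {
                tokens.add("bad_character"); // unknown input still becomes a token
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("any + ++ return void boolean \"any\""));
    }
}
```

The lexer happily accepts "return void boolean" in a row; rejecting that sequence is left entirely to the parser generated from the .bnf file.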

States are not used to validate the syntax in the lexer. It is the parser's responsibility to check whether the tokens appear in the correct order and make sense as program code.

States should be used in cases when:

  • the same characters can have different meanings in different contexts (states are actually required in this case). For example, a language could use braces to delimit method bodies, as C-like languages do, and also support string interpolation like "counter: {value}". You may then need different token types for braces in a method body and braces inside a string.
  • they make the lexer easier to maintain, for example when the rules become very complex and hard to understand.
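The brace example from the first point can be sketched as a tiny hand-written state machine in plain Java. This is only an illustration of the idea, not JFlex output: the class name, the three states (CODE, STRING, INTERPOLATION) and all token names are hypothetical. In a .flex file the same effect would come from declared lexical states and switching between them with yybegin.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lexer where the current state decides what '{' and '}' mean.
final class BraceStateLexer {
    private enum State { CODE, STRING, INTERPOLATION }

    // Returns the index of the first character at or after 'from' that is in
    // 'stops' (or is whitespace, if requested), or text.length() if none is found.
    private static int scan(String text, int from, String stops, boolean stopAtWhitespace) {
        int i = from;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (stops.indexOf(c) >= 0 || (stopAtWhitespace && Character.isWhitespace(c))) break;
            i++;
        }
        return i;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        State state = State.CODE;
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            switch (state) {
                case CODE:
                    if (Character.isWhitespace(c)) { i++; }
                    else if (c == '{') { tokens.add("lbrace"); i++; }
                    else if (c == '}') { tokens.add("rbrace"); i++; }
                    else if (c == '"') { tokens.add("string_start"); state = State.STRING; i++; }
                    else { tokens.add("code_text"); i = scan(text, i, "\"{}", true); }
                    break;
                case STRING:
                    if (c == '"') { tokens.add("string_end"); state = State.CODE; i++; }
                    else if (c == '{') { tokens.add("interpolation_start"); state = State.INTERPOLATION; i++; }
                    else { tokens.add("string_text"); i = scan(text, i, "\"{", false); }
                    break;
                case INTERPOLATION:
                    if (c == '}') { tokens.add("interpolation_end"); state = State.STRING; i++; }
                    else { tokens.add("interpolation_expr"); i = scan(text, i, "}", false); }
                    break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("run { \"counter: {value}\" }"));
    }
}
```

The same '{' character produces lbrace in CODE and interpolation_start in STRING. The states record only where the lexer currently is; they still do not check that the braces are balanced or that the program is valid.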

If your language doesn’t have any corner cases and is simple, a single-state lexer with simple rules that match tokens is enough.

