Grammar-Kit context-based tokens
I was wondering if there was any documentation regarding how to create context-sensitive tokens. I'm trying to write a grammar for a language that supports preprocessor directives, one of which has a regex like: #pragma deprecated( .+)?(\r|\n|\r\n)
In plain text: "#pragma deprecated" is required, and either EOL, EOF, or a token containing the remaining characters on the line is expected after.
E.g., the input
#pragma deprecated This is an example
should be tokenized as: '#' 'pragma' 'deprecated' 'This is an example'
The problem I'm running into is that while tokens are generated fine for '#', 'pragma', and 'deprecated', the optional string at the end is causing me trouble: if I define a token with the regex ".+(\r|\n|\r\n)" and make it optional, that token is generated for the entire lexer (which makes sense). I also tried prepending a positive lookbehind assertion to the 'match until EOL' regex (regex:(?<="deprecated")), but lookbehind doesn't seem to be supported.
Is there a way I can embed some logic to generate this token contextually within the BNF grammar itself? I believe this is possible within JFlex, but I wanted to know whether it can be done in the BNF grammar file.
If I can't, then I'm thinking my best alternative is to match all preprocessor directives with a single regex, "^#.*$", and then parse the resulting token manually to determine its validity. I think I could then provide syntax highlighting by making the preprocessor an embedded language.
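For illustration, something like this rough sketch is what I have in mind for the manual check; the class name and the pattern are just placeholders, not real plugin code:

import java.util.regex.Pattern;

// Rough sketch only; the class name and pattern are placeholders.
public final class PreprocessorDirectives {
  private static final Pattern PRAGMA_DEPRECATED =
      Pattern.compile("#pragma\\s+deprecated(\\s+.+)?");

  // Called on the text of a catch-all "^#.*$" directive token to decide its validity.
  public static boolean isValidDeprecatedPragma(CharSequence directiveText) {
    return PRAGMA_DEPRECATED.matcher(directiveText).matches();
  }
}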
I read in the changelog notes that there is support for empty tokens (i.e., tokens without a string):
tokens = [
empty
]
Does anyone know of an example of how to use this feature? I think this might be another answer to my problem.
The basics are the layers: Lexer -> Parser -> AST -> PSI
Tokens in BNF (Parser level) are just for constant generation (the *Types class) and for playing with Live Preview. The real lexer work (Lexer layer) happens in the *.flex file. What you want is '#pragma' detection and state management within a lexer. Lexers can be quite long and complicated, so there's no real lexer specification in Grammar-Kit *.bnf files.
Note also that PsiBuilder pre-caches all tokens before any parsing, so there's no way to communicate information between the parser and the lexer. In short: the lexer passes information up the layers via tokens only, and the parser in turn passes information up via ASTNodes only.
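For illustration, the tokens listed in the BNF end up as plain IElementType constants roughly like the ones below, and such tokens are the only channel from the lexer to the parser; the names and MyLanguage.INSTANCE are made up here:

import com.intellij.psi.tree.IElementType;

// Illustration only: the BNF tokens attribute is turned into IElementType constants
// roughly like these. MyLanguage.INSTANCE is a placeholder for your Language instance.
public interface MyLanguageTypes {
  IElementType PRAGMA_KEYWORD = new IElementType("PRAGMA_KEYWORD", MyLanguage.INSTANCE);
  IElementType PRAGMA_BODY = new IElementType("PRAGMA_BODY", MyLanguage.INSTANCE);
}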
Is there any information/tutorials regarding the Grammar Kit empty tokens feature? I haven't been able to find any implementations that I could look at to figure out how it works and what it's for.
I'm also curious about language embedding and came across ILazyParseableElementType. Is there any documentation regarding implementing this?
I already applied some statefulness to my parser to determine the validity of more generic tokens using static members (i.e., evaluating the token text to determine whether that token was what the parser was looking for; I have to do this because the escape character is dynamic). However, there are some features of the language that are odd and that no modern languages support anymore (probably for a reason), so I'm beginning to think it might be best to write my own JFlex specification or lexer. That said, the Live Preview, the generation abilities, and the general support for Grammar-Kit within IntelliJ are very appealing. I've written low-level lexers before, so I'm experimenting this time with higher-level abstractions such as JFlex and Grammar-Kit.
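Roughly, the kind of static-member check I mean looks like the sketch below (not my exact code; the rule and method names are placeholders). In the BNF it would be referenced as an external rule, e.g. external escaped_string ::= parseEscapedString, with the method living in the parserUtilClass:

import com.intellij.lang.PsiBuilder;
import com.intellij.lang.parser.GeneratedParserUtilBase;

// Sketch only, not the exact code from my plugin; names are placeholders.
// A static method like this can inspect the raw token text and decide whether the
// current token is really what the rule expects before accepting it.
public class MyParserUtil extends GeneratedParserUtilBase {
  public static boolean parseEscapedString(PsiBuilder builder, int level) {
    String text = builder.getTokenText();
    if (text == null || !looksLikeEscapedString(text)) return false;
    builder.advanceLexer(); // accept the token
    return true;
  }

  private static boolean looksLikeEscapedString(String text) {
    // Placeholder check; the real validation depends on the current (dynamic) escape character.
    return text.length() >= 2;
  }
}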
You can just try and add an empty token in a BNF, regenerate, and see what happens.
Just to repeat myself: tokens in BNF exist only to generate the IElementType constants that the generated parser expects to see in the token stream.
Zero-length tokens should not be confused with "non-returning" characters in JFlex (i.e. characters matched by actions that do not return a value), which can now also be used. See http://jflex.de/manual.html, starting with "If an action does not return a value".
Grammar-Kit generates the simplest flex spec possible, which is not sufficient for any serious language. I use that ability only to quickly start working on a lexer spec.
Language embedding via ILazyParseableElementType is *almost* not possible, because both languages have to be carefully implemented so that they share the same PSI tree, containing file, etc. without throwing ClassCastExceptions. ILazyParseableElementType is used mostly as a parser performance optimization, e.g. to quickly skip brace blocks in Java.
What you are looking for is most probably language injection.
User doc: https://www.jetbrains.com/help/idea/2016.1/using-language-injections.html.
Main plugin dev doc: http://www.jetbrains.org/intellij/sdk/docs/reference_guide/custom_language_support.html
There's some javadoc here and there regarding language injections. You can start with:
com.intellij.psi.LanguageInjector and com.intellij.psi.PsiLanguageInjectionHost
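A bare-bones injector looks roughly like the sketch below; MyPragmaBodyElement and the injected language are placeholders, and the class is registered under the com.intellij.languageInjector extension point:

import com.intellij.lang.Language;
import com.intellij.openapi.util.TextRange;
import com.intellij.psi.InjectedLanguagePlaces;
import com.intellij.psi.LanguageInjector;
import com.intellij.psi.PsiLanguageInjectionHost;
import org.jetbrains.annotations.NotNull;

// Rough sketch; MyPragmaBodyElement is a placeholder for whatever host element you choose.
public class PragmaLanguageInjector implements LanguageInjector {
  @Override
  public void getLanguagesToInject(@NotNull PsiLanguageInjectionHost host,
                                   @NotNull InjectedLanguagePlaces places) {
    if (host instanceof MyPragmaBodyElement && host.isValidHost()) {
      Language injected = Language.findLanguageByID("TEXT"); // placeholder language id
      if (injected != null) {
        // Inject into the whole text range of the host, with no prefix/suffix.
        places.addPlace(injected, TextRange.from(0, host.getTextLength()), null, null);
      }
    }
  }
}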
Or see how the PsiLanguageInjectionHost API is implemented for BNF strings, so that the IntelliLang plugin is able to temporarily inject any language there via the 'Inject language' (Alt-Enter) intention:
https://github.com/JetBrains/Grammar-Kit/blob/master/support/org/intellij/grammar/psi/impl/BnfStringImpl.java
Thanks for all of the help! I have a lot to read up on now :)
I decided to start writing a custom JFlex specification and plug that into Grammar-Kit's parser.
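(For reference, the wiring itself is just the usual FlexAdapter in the ParserDefinition, roughly as below; _MyLangLexer is a placeholder for my JFlex-generated class, and the other ParserDefinition methods are omitted.)

import com.intellij.lang.ParserDefinition;
import com.intellij.lexer.FlexAdapter;
import com.intellij.lexer.Lexer;
import com.intellij.openapi.project.Project;

// Sketch of the lexer wiring only; _MyLangLexer is a placeholder for a JFlex-generated
// class that implements FlexLexer (IntelliJ skeleton). Other methods are omitted.
public abstract class MyLangParserDefinition implements ParserDefinition {
  @Override
  public Lexer createLexer(Project project) {
    // The reader is irrelevant here; FlexAdapter resets the lexer with the real buffer.
    return new FlexAdapter(new _MyLangLexer((java.io.Reader) null));
  }
}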
Is there a clean way to get the Lexer that a Parser is using? It seems this has been abstracted away by the PsiBuilder, which only exposes the current token and its text but gives no direct access to the Lexer reference itself; that makes sense for encapsulation. However, I ran into another problem where I decided to move some logic into the lexical analyzer, and I need a reference to it in order to change its state based on what the parser is seeing.
In the language, there is a dynamic escape-sequence ctrl character. I decided the best thing to do was to handle that in the lexical analysis phase, so that the tokens it produces are complete (i.e., all string and char literals are escaped with their correct ctrl characters). This works fine; however, the lexical analyzer needs to know what that ctrl character is while it is reading from the character stream. The way it's set/unset is via:
// Sets ctrl char to char value of {number}
#pragma ctrlchar {number} $
// Unsets ctrl char (i.e., back to default value)
#pragma ctrlchar $
I tried experimenting with some really rough parsing and lookaheads within the lexical analyzer itself, but I did not like that implementation much, and thought it would be best to move it into the parsing stage instead, so that if one of the rules above is matched, the appropriate method on the Lexer is called, re-configuring its state for future tokens.
The only other alternative I could come up with was to lex each character as its own token (e.g., STRING_LITERAL_CHARACTER); however, without knowing the escape-sequence ctrl character, there does not seem to be any way of determining the correct bounds of the string.
How would you suggest going about supporting this feature in a language?
The first part is already covered in my first answer: not possible.
The rest of the question is pretty advanced. Clone the https://github.com/JetBrains/intellij-community/ repository and look for com.intellij.lexer.LookAheadLexer and how it is used. With it, '#pragma' detection and state management can be done entirely in Java, if you don't like spending too much time in the *.flex file.
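In rough outline, such a wrapper could look like the sketch below; MyFlexAdapter, MyTypes.PRAGMA_CTRLCHAR and setCtrlChar() are made-up names, and the base lexer is assumed to emit the whole "#pragma ctrlchar ... $" directive as one token:

import com.intellij.lexer.Lexer;
import com.intellij.lexer.LookAheadLexer;

// Minimal sketch, not working plugin code. MyFlexAdapter, MyTypes.PRAGMA_CTRLCHAR and
// setCtrlChar() are hypothetical names.
public class PragmaAwareLexer extends LookAheadLexer {
  private final MyFlexAdapter delegate;

  public PragmaAwareLexer(MyFlexAdapter delegate) {
    super(delegate);
    this.delegate = delegate;
  }

  @Override
  protected void lookAhead(Lexer baseLexer) {
    if (baseLexer.getTokenType() == MyTypes.PRAGMA_CTRLCHAR) {
      // Reconfigure the underlying JFlex lexer before it lexes the tokens that follow
      // the directive, so later string/char literals are escaped with the right character.
      CharSequence directive = baseLexer.getBufferSequence()
          .subSequence(baseLexer.getTokenStart(), baseLexer.getTokenEnd());
      delegate.setCtrlChar(parseCtrlChar(directive.toString()));
    }
    super.lookAhead(baseLexer); // default behaviour: emit the current token and advance
  }

  private static char parseCtrlChar(String directive) {
    // "#pragma ctrlchar {number} $" sets the character; no number resets it to a default.
    String digits = directive.replaceAll("\\D+", "");
    return digits.isEmpty() ? '\\' : (char) Integer.parseInt(digits); // '\\' assumed default
  }
}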
Alright, I'll give it a shot. I think I will try making a wrapper class based on LookAheadLexer (or using it directly), with the JFlex-generated lexer as a delegate. This approach should fit my needs without bloating my lexer unnecessarily with extra state information.
Thank you for all of your help! You're awesome!
I was able to write a lexer and figure out how to plug it into Grammar-Kit. Is there a way to plug my token constants into Grammar-Kit so that their names are auto-completed when writing the grammar? This would be an awesome feature if it's not already implemented.
I'm currently adding my custom JFlex lexer tokens via the parserImports directive, but it would be nice if there were some directive such as importTokens. It would have the same format as parserImports (adding the imports to the generated parser class), with the additional behavior that, via reflection, all public static final fields of type IElementType are imported into the tokens directive behind the scenes and referenced via their field names.
Is there a way I can affect the return value of FlexAdapter#getTokenText()? I have been able to parse the strings and store a copy of their processed values myself (in some cases, using a regex is not possible), but I have no way to pass this along to the FlexAdapter, as it appears to use its own buffer reference. Is the typical approach to generate an IElementType subclass with a 'value' field and never use FlexAdapter#getTokenText() for these tokens? I have never seen an implementation where the Lexer cannot pass the token value to its consumer(s).
I have also seen implementations where users call PsiBuilder#getOriginalText() to retrieve the token text. However, with this method, I'll need to re-parse the token to strip unnecessary characters.
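The closest workaround I can think of is to keep the raw text in the tokens and compute the processed value at the PSI level instead, something like the sketch below (just an idea, not something I have working; the un-escaping is a placeholder):

import com.intellij.psi.PsiElement;

// Just an idea, not working code: leave the raw text in the tokens (so offsets stay
// exact) and compute the processed value from the PSI element's text on demand.
public final class MyPsiImplUtil {
  public static String getValue(PsiElement stringLiteral) {
    String raw = stringLiteral.getText();
    // Strip the surrounding quotes, then resolve escapes; the replacement below is a
    // placeholder for the real logic that uses the ctrl character active at lex time.
    String body = raw.length() >= 2 ? raw.substring(1, raw.length() - 1) : raw;
    return body.replace("\\\"", "\"");
  }
}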
Collin: Did you get anywhere with affecting the return value of `FlexAdapter#getTokenText()`?
I'm struggling with the same problem you describe here. I'm working with a LookAheadLexer that looks at existing defines and uses "pre-processor logic" to expand macros into the underlying text and tokens. When I insert `IElementType IDENTIFIERS` in the `LookAheadLexer`, I need to associate these `IDENTIFIERS` with generated text. My LookAheadLexer has all the information I need.
I don't know how to pass it along to other elements.
At the moment, the upper-layer PSI elements end up with no text, which causes issues around resolve, etc.