BNF Live Preview rule ordering issue
I'm experiencing a weird rule ordering issue in the BNF Live Preview in IDEA Ultimate 2023.2.
The following BNF file is the beginnings of my attempt to model Tcl.
{
tokens = [
braced_open = '{'
braced_close = '}'
quoted_delimiter = '"'
command_substitution_open = '['
command_substitution_close = ']'
newline = 'regexp:\n'
spacing = 'regexp:[^\S\n]+'
backslash_substitution = 'regexp:\\(?:[0-7]{1,3}|x\p{XDigit}{1,2}|u\p{XDigit}{1,4}|U\p{XDigit}{1,8}|.)'
]
}
tcl ::= (statement newline?)*
statement ::= word (spacing word?)*
private word ::= braced | quoted | unquoted
braced ::= '{' (braced | 'regexp:(?:\\[{}]|[^{}])+')* '}' {pin=1}
quoted ::= '"' (substitution | 'regexp:[^\[$\\"]+' )* '"' {pin=1}
unquoted ::= (substitution | 'regexp:[^\[$\\\s]+' )+
private substitution ::= command_substitution | backslash_substitution
command_substitution ::= '[' tcl? ']' {pin=1}
For the following simple preview editor contents:
a
the Live Preview makes no matches. If I move the `braced` & `quoted` rules below the `unquoted` rule, however, the `unquoted` rule matches 'a'. Why would the order make a difference? Neither `braced` nor `quoted` should match 'a' because 'a' doesn't start with '{' or '"', respectively.
I also tried the following bnf with the regexes as tokens (and with non-capturing group markers removed, as non-capturing groups don't seem to be supported by JFlex), but the text matches `braced_text`. Things work if I remove `braced_text` & `quoted_text` from the global tokens list. Why are tokens being matched? Is there any way to have a token that doesn't get matched? If not, there's something very wrong with Live Preview / the bnf format.
It seems that in Live Preview, each regex in a .bnf file is matched on its own against the input, in the order the regexes appear in the file. This happens whether the regex is an implicit or an explicit token, and even when it is referenced only in a rule after other tokens that don't match the input text.
Is this intentional behavior or is it a bug? If intentional, why? Is there any way of working around this short of embedding all the logic from other tokens within the regex, to ensure that the regex doesn't match input by itself? If that's the only workaround, it makes regexes much less reusable, because most uses would require rewriting the same regex with additional match restrictions prepended and/or appended.
In standard IntelliJ language support, the whole text is lexed first, and then the tokens are fed into the parser, skipping whitespace and comments (as defined in the ParserDefinition implementation). There's no need to specify them in a grammar. The lexer is usually defined separately using the customized JFlex lib. JFlex token definitions are different from the plain Java regexps of the Live Preview lexer.
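To illustrate the lex-first pipeline described above, here is a minimal Python sketch. It is not IntelliJ or Grammar-Kit code: the token names (`WHITE_SPACE`, `COMMENT`, `WORD`), the patterns, and the helper names `lex` and `parser_input` are all assumptions made up for this example, and the comment pattern is not accurate Tcl.

```python
import re

# Hypothetical token patterns; every character is whitespace, part of a
# "#" comment, or part of a word, so the lexer covers the whole input.
TOKEN_RE = re.compile(
    r"(?P<WHITE_SPACE>\s+)|(?P<COMMENT>#[^\n]*)|(?P<WORD>[^\s#]+)"
)

# Token types the parser never sees (assumed here; in IntelliJ this role
# is played by the ParserDefinition implementation).
SKIPPED = {"WHITE_SPACE", "COMMENT"}


def lex(text):
    """Step 1: lex the whole text into (type, text) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def parser_input(tokens):
    """Step 2: drop whitespace and comments before parsing."""
    return [tok for tok in tokens if tok[0] not in SKIPPED]


all_tokens = lex("set x 1 # note")
print(all_tokens)
print(parser_input(all_tokens))
```

The point of the sketch is only that the grammar rules operate on the filtered token list, which is why whitespace and comment tokens normally don't need to appear in the grammar.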
Live Preview in Grammar Kit creates a toy lexer from the regexp and keyword tokens, tried in a specific order. The first token matcher wins. See LivePreviewLexer.java. The code is pretty straightforward. Common token types are detected: whitespace, strings, and numbers. It is just a toy to quickly check some ideas. In production, none of those regexps are used.
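A "first matcher wins" lexer can be sketched in a few lines of Python. This is not the code from LivePreviewLexer.java; the function name `first_match_lex`, the `BAD_CHARACTER` fallback, and the two token patterns (loosely modeled on the regexps in this thread) are assumptions for illustration only.

```python
import re


def first_match_lex(text, token_specs):
    """Tokenize text; at each offset, the first spec in list order that
    matches a non-empty string wins, regardless of any grammar rules."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in token_specs:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # No spec matched: emit one bad character and move on.
            tokens.append(("BAD_CHARACTER", text[pos]))
            pos += 1
    return tokens


# Hypothetical token lists that differ only in declaration order.
braced_first = [
    ("braced_text", r"(\\[{}]|[^{}])+"),
    ("unquoted_text", r"[^\[$\\\s]+"),
]
unquoted_first = list(reversed(braced_first))

print(first_match_lex("a", braced_first))    # [('braced_text', 'a')]
print(first_match_lex("a", unquoted_first))  # [('unquoted_text', 'a')]
```

Under these assumptions, the input `a` becomes a `braced_text` token purely because that pattern is tried first, so a rule expecting a different token type at that position never sees one.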
In rule definitions quoted symbols are matched as text, so 'regexp:[^\[$\\"]+' will match only tokens that are equal to 'regexp:[^\[$\\"]+'. So the first grammar is incorrect.
The regexp tokens you provided in the second grammar seem off to me. I've used the following patterns and got some file structure immediately.
That seems to make Live Preview practically useless. Its incorrect usage of .bnf files should be clearly documented. Did I miss that information in the docs?
Are there any types of tokens other than regexp: & literals?
Quoted explicit regexp value (instead of name) tokens seem to work in Live Preview, so it seems that quoted symbols aren't always matched as literals; it's implicit tokens that are always matched as literals.
Does the documentation say that a regexp: only works as a regex if it's an explicit token?
My `braced_text` & `quoted_text` regexps are correct (excepting one testing simplification that I accidentally left in the code that I posted) for Tcl; yours are not.
For `quoted_text`: First, only double quotes, not single quotes, denote quoted strings in Tcl. Second, your regexp doesn't handle backslash substitutions, command substitutions, etc. that my regexp along with other rules can handle (I left out my handling of variable substitutions to simplify the bnf for my post, so that's why '$' isn't allowed in `quoted_text` but isn't handled elsewhere).
For `braced_text`: Your regexp:
- doesn't handle escaped braces
- doesn't handle escaped backslashes (neither did my posted code: while simplifying things to track down the source of Live Preview's odd behavior, which you explained in your recent post, I accidentally posted a version with a simpler regexp in place of the correct one)
- doesn't understand that braces must be matched, nested pairs; e.g., your regexp matches '{{}', which is missing a closing '}' for the outer pair. My .bnf handles nested paired braces via the recursive `braced` rule.
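To make the nesting point concrete, here is a small Python sketch of a balance check. It uses a depth counter rather than the recursive `braced` rule from the .bnf, and it assumes that a backslash escapes the next character; the function name `braces_balanced` is made up for this example.

```python
def braces_balanced(text):
    """Return True if every unescaped '{' has a matching '}' in nested order."""
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # a '}' with no open '{' before it
        i += 1
    return depth == 0


print(braces_balanced("{{}"))    # False: the outer pair is never closed
print(braces_balanced("{{}}"))   # True
print(braces_balanced(r"{\}}"))  # True: the escaped '}' is not counted
```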
I know what I'm doing with regexps (except for perhaps oddities of the JFlex variant of regexp, as I've never used JFlex before) & with Tcl; I just didn't know a few things about Live Preview, which seems to be broken for most .bnf files that people might write (unless I misunderstood your explanation of its behavior in your recent post).