support for #include "file.ext" directive

Answered

I need to implement support for a #include "file.ext" directive in my custom language.

The idea behind this directive is the same as in other languages like C or C++.

I need access to the included file at the lexer and parser level, but the path of the included file is relative to the current document. The Lexer interface takes a CharSequence as input and does not understand files, so I cannot resolve the relative file name to an absolute path.

Do you know any solution to this problem?

 


Unfortunately, I'm not aware of any sample code for your setup that we could share. You might try browsing existing language plugins for a similar use case: https://plugins.jetbrains.com/search?products=idea&tags=Languages

For managing file include information at later stages, com.intellij.psi.impl.include.FileIncludeProvider could be used.
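
For illustration, here is a minimal sketch of such a provider, assuming the com.intellij.include.provider extension point; MyFileType and the include-matching regex are placeholders for your own language:

```
import com.intellij.openapi.fileTypes.FileType;
import com.intellij.openapi.vfs.VirtualFile;
import com.intellij.psi.impl.include.FileIncludeInfo;
import com.intellij.psi.impl.include.FileIncludeProvider;
import com.intellij.util.Consumer;
import com.intellij.util.indexing.FileContent;
import org.jetbrains.annotations.NotNull;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Registered in plugin.xml via the com.intellij.include.provider extension point.
public class MyIncludeProvider extends FileIncludeProvider {
  private static final Pattern INCLUDE = Pattern.compile("#include\\s+\"([^\"]+)\"");

  @Override public @NotNull String getId() { return "my-language"; }

  @Override public boolean acceptFile(@NotNull VirtualFile file) {
    return file.getFileType() == MyFileType.INSTANCE; // MyFileType is hypothetical
  }

  @Override public void registerFileTypesUsedForIndexing(@NotNull Consumer<? super FileType> sink) {
    sink.consume(MyFileType.INSTANCE);
  }

  @Override public FileIncludeInfo @NotNull [] getIncludeInfos(@NotNull FileContent content) {
    // collect the relative paths of all #include directives in the file
    List<FileIncludeInfo> result = new ArrayList<>();
    Matcher m = INCLUDE.matcher(content.getContentAsText());
    while (m.find()) result.add(new FileIncludeInfo(m.group(1)));
    return result.toArray(new FileIncludeInfo[0]);
  }
}
```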


Good news, today I can share some instructions. HTH.

 

===== Theory ===================================================================

1. Pre-processing happens "before", so it belongs to the lexer level, not the parser.
    Otherwise all those "consumeToken" checks etc. are needed (very bad performance-wise).

2. OK, the lexer level it is; the parser handles the language semantics, not the preprocessor.

3. The IntelliJ Lexer API can work with zero-length tokens quite well, so a LookAheadLexer can
    easily spot a preprocessor directive and expand it to the required token sequence
    (see the sketch after this list).

4. We also need to think a bit ahead: how are multiple very large header files to be handled when
    a tiny change in a top-level #DEFINE may change a file's AST significantly?
    So there must be a common lazy-built symbol cache known to the lexer. The lexer would then
    feed tokens from the cache to the parser.

5. "One more thing": a group of #DEFINEs inside an #IFDEF can be simplified to a number of "switches"
    packed in a zero-length token, so that there's no need to feed "real" tokens from a header file.
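
A minimal sketch of point [3], assuming hypothetical names: MyTypes.INCLUDE_KEYWORD is your #include token type, and SymbolCache packs an included file's symbols into a single element type:

```
import com.intellij.lexer.Lexer;
import com.intellij.lexer.LookAheadLexer;
import org.jetbrains.annotations.NotNull;

public class PreprocessingLexer extends LookAheadLexer {
  private final SymbolCache cache; // hypothetical lazily built symbol cache

  public PreprocessingLexer(Lexer baseLexer, SymbolCache cache) {
    super(baseLexer);
    this.cache = cache;
  }

  @Override
  protected void lookAhead(@NotNull Lexer baseLexer) {
    if (baseLexer.getTokenType() == MyTypes.INCLUDE_KEYWORD) {
      advanceLexer(baseLexer);                  // pass <#include> through for in-editor features
      String path = baseLexer.getTokenText();   // the "file.ext" string token
      advanceLexer(baseLexer);                  // pass <string> through as well
      // append the included file's packed symbols as a zero-length token:
      // it starts and ends where the next real token starts
      addToken(baseLexer.getTokenStart(), cache.getPackedSymbolsToken(path));
    }
    else {
      super.lookAhead(baseLexer);
    }
  }
}
```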


[4] is the most tricky part; everything must be aggressively optimised (like [5]) to achieve reasonable performance.

I suggested that C-like preprocessor handling code could be generalised for use in various languages.
A colleague of mine agreed but added "not in the near future".


Grammar-Kit has little to do with all of the above, as we keep the lexer and parser layers clearly separated.
The LookAheadLexer would lazily employ parser logic for header files via the shared symbol cache.

Note that there are registered & non-registered IElementTypes; the custom "packed" types shall be non-registered (see the snippet below).
Preprocessor tokens can be kept as-is for in-editor features and skipped by the parser like white-spaces.
That's it.
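
For reference, registration is controlled by the third constructor argument of IElementType; a packed type could be created like this (MyLanguage is a placeholder):

```
// "false" keeps this type out of the global element type registry: it carries
// transient, per-lex data and must never be shared or serialized into stubs.
IElementType packedDefines = new IElementType("PACKED_DEFINES", MyLanguage.INSTANCE, false);
```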



===== Practice ===================================================================


Here's a step-by-step procedure (in C), so consider header.h and source.c:

header.h
#define FOO 4

source.c
#include "header.h"
int main() {  int i = FOO; }

I. Lexer
1. The lexer slowly goes through source.c, minding all includes, and stops at #include "header.h"
2. After returning the <#include> and <string> tokens, the lexer queries all symbols from header.h,
    especially all #DEFINEs.
3. Suppose the cache is empty; so before we proceed any further, the cache finds header.h,
    parses it first from step [1], then builds the symbols map and returns to step [2].
    Use the RecursionGuard API here so as not to run into a StackOverflowError (see the sketch after this list).
4. Now the lexer can proceed, and it goes all the way to FOO, matching any symbols against all
    available #DEFINE symbols, and here's the one: FOO
5. Instead of returning FOO as <id>, it returns FOO as some <#define reference> which is marked as
    whitespace and will be skipped by the parser, but we need it to provide navigation/resolve/etc. later.
6. And then it returns a zero-length sequence of tokens that comprises the expansion of the FOO macro,
    i.e. <integer>
7. Now the lexer part is done (PsiBuilder.cacheLexems) and we are feeding tokens to the parser.
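
A sketch of the guard in step [3], assuming a hypothetical SymbolCache with a per-file symbol map; RecursionManager.doPreventingRecursion returns null when it detects a cycle:

```
import com.intellij.openapi.util.RecursionManager;
import com.intellij.openapi.vfs.VirtualFile;

import java.util.Collections;
import java.util.Map;

public class SymbolCache {
  public static final class Macro { /* name, expansion tokens, source offset */ }

  public Map<String, Macro> getSymbols(VirtualFile header) {
    Map<String, Macro> symbols = RecursionManager.doPreventingRecursion(
        header, false, () -> parseAndCollectSymbols(header));
    // null means header.h (transitively) includes itself: break the cycle
    return symbols != null ? symbols : Collections.emptyMap();
  }

  private Map<String, Macro> parseAndCollectSymbols(VirtualFile header) {
    // lex/parse the header from step [1] and collect its #DEFINEs (omitted here)
    return Collections.emptyMap();
  }
}
```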

II. Parser
8. The parser works through the tokens and builds the AST, including a node for #include as you mentioned.
9. Then main().., and in its body the FOO part is skipped as whitespace and the <integer> is processed
    as the assignment rvalue.

IIa.
10. The parser routine for header.h is exactly the same as for source.c
11. Work through all the tokens, build the AST for all parts; all characters must be covered,
     #DEFINE FOO 4 shall become OCDefineDirective(foo) etc.

The interesting part is how the symbol table/per-file cache is designed, but that is up to you.
For sure it must not be PSI-based, but one shall be able to navigate to PSI via a Symbol (a possible shape is sketched below).
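
Purely for illustration, one possible shape under those constraints; every name here is hypothetical:

```
import com.intellij.openapi.project.Project;
import com.intellij.openapi.vfs.VirtualFile;
import com.intellij.psi.PsiElement;
import com.intellij.psi.PsiFile;
import com.intellij.psi.PsiManager;

// Not PSI-based: stores only plain data, so it survives PSI invalidation.
public final class Symbol {
  public final String name;
  public final VirtualFile file;   // where the #DEFINE lives
  public final int offset;         // offset of the definition in that file

  public Symbol(String name, VirtualFile file, int offset) {
    this.name = name; this.file = file; this.offset = offset;
  }

  // Navigation back to PSI goes through file + offset, on demand only:
  public PsiElement resolve(Project project) {
    PsiFile psiFile = PsiManager.getInstance(project).findFile(file);
    return psiFile != null ? psiFile.findElementAt(offset) : null;
  }
}
```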

This way you can include any ".h" into ".c" or ".c" into ".h"; the parser and lexer are the same.

Speaking of Lexer API:

ParserDefinition.createLexer(Project) is sometimes called with "null".
But "real" files are lexed/parsed via the IFileElementType or IStubFileElementType API, which is also
ILazyParseableElementType, so there are some methods to override where you have all the project configuration at hand:
   com.intellij.psi.tree.ILazyParseableElementType#parseContents
   com.intellij.psi.tree.ILazyParseableElementType#doParseContents
So you can build a scope to look for header files in and create the lexer with it (see the sketch below).
The default lexer constructor may create an instance that operates on an empty scope and provides no preprocessing facilities.
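
A sketch of such an override; PreprocessingLexer, MyBaseLexer, IncludeScope and MyGeneratedParser are placeholders for your own classes:

```
import com.intellij.lang.ASTNode;
import com.intellij.lang.PsiBuilder;
import com.intellij.lang.PsiBuilderFactory;
import com.intellij.lexer.Lexer;
import com.intellij.openapi.project.Project;
import com.intellij.openapi.vfs.VirtualFile;
import com.intellij.psi.PsiElement;
import com.intellij.psi.tree.IFileElementType;
import org.jetbrains.annotations.NotNull;

public class MyFileElementType extends IFileElementType {
  public MyFileElementType() { super("MY_FILE", MyLanguage.INSTANCE); }

  @Override
  protected ASTNode doParseContents(@NotNull ASTNode chameleon, @NotNull PsiElement psi) {
    Project project = psi.getProject();
    // unlike in ParserDefinition.createLexer(Project), the file is known here:
    VirtualFile file = psi.getContainingFile().getVirtualFile(); // may be null for fragments
    Lexer lexer = new PreprocessingLexer(new MyBaseLexer(),
                                         IncludeScope.relativeTo(file)); // empty scope if file == null
    PsiBuilder builder = PsiBuilderFactory.getInstance()
        .createBuilder(project, chameleon, lexer, getLanguage(), chameleon.getChars());
    return new MyGeneratedParser().parse(this, builder).getFirstChildNode();
  }
}
```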

Also, to highlight the file correctly you can skip some of the preprocessing stuff, like the zero-length expansion, to save CPU.
See the com.intellij.openapi.fileTypes.SyntaxHighlighterFactory API.
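
For instance, the factory could hand the highlighter a lexer built with the bare, non-preprocessing constructor mentioned above (names hypothetical again):

```
import com.intellij.openapi.fileTypes.SyntaxHighlighter;
import com.intellij.openapi.fileTypes.SyntaxHighlighterFactory;
import com.intellij.openapi.project.Project;
import com.intellij.openapi.vfs.VirtualFile;
import org.jetbrains.annotations.NotNull;
import org.jetbrains.annotations.Nullable;

public class MySyntaxHighlighterFactory extends SyntaxHighlighterFactory {
  @Override
  public @NotNull SyntaxHighlighter getSyntaxHighlighter(@Nullable Project project,
                                                         @Nullable VirtualFile virtualFile) {
    // highlighting is token-based only, so the bare lexer without include
    // resolution and macro expansion is enough here and saves CPU
    return new MySyntaxHighlighter(new MyBaseLexer());
  }
}
```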



===== More Practice ===================================================================


Since FOO shall be a reference, and #define / #include should become ASTNodes,
they cannot all simply be skipped as whitespace; instead
they shall be "parsed" on each consumeToken()/advance()/etc. before the raw SV parser kicks in.

```
public class MyBuilder extends PsiBuilderAdapter {

  public MyBuilder(PsiBuilder delegate) {
    super(delegate);
  }

  private void parseMacros() {
    // preprocessor rules can be defined in the same Grammar-Kit grammar,
    // so we just call the "macros" rule:
    // macros ::= macro_define | macro_include | macro_call | ..
    SVGeneratedParser.macros(this, 1);
  }

  @Override public void advanceLexer() { parseMacros(); super.advanceLexer(); }
  @Override public String getTokenText() { parseMacros(); return super.getTokenText(); }
  @Override public IElementType getTokenType() { parseMacros(); return super.getTokenType(); }
  @Override public boolean eof() { parseMacros(); return super.eof(); }
  @Override public int getCurrentOffset() { parseMacros(); return super.getCurrentOffset(); }
}
```

Hi Yann,

Many thanks for your very detailed explanation! It helped me confirm that my approach is correct, as it is now almost the same. I hope I will not get stuck somewhere dramatically.

Thanks also for linking this answer to my previous question. Both questions are related to preprocessing, but this one is more specific and assumes that the general solution is already worked out, so let me follow up on it here. I will ask some additional questions in the previous thread.

--------------------

 

The general problem I raised in this thread is still not clear to me.

Let me quote relevant fragments from your answer:

After returning the <#include> and <string> tokens, the lexer queries all symbols from header.h, especially all #DEFINEs.

...

ParserDefinition.createLexer(Project) is sometimes called with "null". But "real" files are lexed/parsed via the IFileElementType or IStubFileElementType API

 

What does it mean "lexed via IFileElementType"? The Lexer interface accepts a CharSequence only, and there is no information about the containing file. Also, the Lexer instance is created using ParserDefinition.createLexer(Project), which gives us the context of the project, not of the file.

In the presented example, header.h is (or could be) relative to source.c. How can the Lexer find the header.h file if it does not know about source.c?

 


This is great stuff, Yann, thanks for such a detailed answer. I've filed it away in case I ever need to do this.

I have one question. As far as I am aware, it's a basic restriction that index information for a file should only be based on that file's contents, not on any other file's content. This is so that index invalidation works correctly when files are modified. This implementation is a clear violation of that; how should it be handled? Can the index information for source.c be invalidated somehow when header.h is modified? Or is some other approach recommended?


Colin Fleming: The suggestion would be to pass the VirtualFile in the constructor.
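
Roughly, as a sketch (a variation of the earlier hypothetical PreprocessingLexer; none of these names are platform API):

```
import com.intellij.lexer.Lexer;
import com.intellij.lexer.LookAheadLexer;
import com.intellij.openapi.vfs.VirtualFile;

public class PreprocessingLexer extends LookAheadLexer {
  private final VirtualFile contextFile; // base for resolving relative includes

  public PreprocessingLexer(Lexer baseLexer, VirtualFile contextFile) {
    super(baseLexer);
    this.contextFile = contextFile;
  }

  private VirtualFile resolveInclude(String relativePath) {
    // resolve "file.ext" against the directory of the file being lexed
    VirtualFile dir = contextFile == null ? null : contextFile.getParent();
    return dir == null ? null : dir.findFileByRelativePath(relativePath);
  }
}
```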


Thanks for sharing this information. Replacing the macro by whitespace followed by its content as zero-length tokens is really smart. Are you using a different JFlex skeleton for this in CLion (JFlex has a nested skeleton with yypushStream/yypopStream for that purpose)? I've currently decided not to change the skeleton but to wrap the "bare lexer" with one which will expand macro tokens for the language plugin I'm writing, because this way I can still easily decide not to care about macros, e.g. for my syntax highlighter, but I'm not sure if that's the way to go yet.

The problem Colin Fleming has is also going through my mind a bit, and the lazy cache you've mentioned is still very problematic. Of course the lexer knows the VirtualFile or PsiFile object now, after overriding doParseContents, but that doesn't give you the ability to get the outer scope yet. I mean, if you have the following constellation:

// source.c
#define X
#include "header.h"

// header.h
#ifdef X
...
#else
...
#endif

Then header.h has to know that X is defined (outer scope). The symbol cache depends on the include order and on the contents of other files outside its inclusion. Even worse, a single file (here header.h, for example) can be included twice in different places if there is no guard block. So it has two different "scopes". The lazy cache cannot be so "lazy" then.

That leads to another question: is the lazy cache built during the indexing step, and is the PSI tree of all open editors rebuilt after indexing has finished? Or can we tell the IDE to forget about the previously built PSI trees (with and without the stubs in the index) after a background task has finished, in case we want our own separate indexing process?

And Colin Fleming, maybe if we have a file-based index with PSI stubs where only the preprocessor statements are defined, you could essentially trace the path of inclusion back to get the beginning scope of the file you are looking at. So when starting to lex header.h, you get the different scopes and macros it sits inside by doing a ReferenceSearch on the PsiFile, to the #include "header.h" statements, and recursively gathering the scope from there until there is no more inclusion. This way you can still stay strictly file-based, and it might still be quick, depending on the language and the depth of the includes of course (but humans should also be able to know their code, so a depth of 100000 is unlikely, isn't it? Endless recursion should be prevented, though).

And I'd like to point out a funny but interesting limitation of the method shown here, as it still requires that each file in itself is syntactically "complete", although a constellation like the following would be valid and compile:

// header.h
return 0; }

// source.c
int main() {
#include "header.h"

Of course that's an absurd one and you'd probably never write this. However, the outer scope is only known to the extent that your lazy-built symbol cache can establish it.

Update:

When editing the file, the lexer is restarted in the middle of the text. The reset method gets called with the integer state, but how do you handle recursive states when traversing included files? You only have an integer.

