support for #include "file.ext" directive

Answered

I need to implement support for the #include "file.ext" directive in my custom language.

Idea behind this directive is the same as in other languages like C or C++.

I need access to the included file at both the lexer and parser level, but the path of the included file is relative to the current document. The Lexer interface takes a CharSequence as input and knows nothing about files, so I cannot resolve the relative file name to an absolute path.

Do you know any solution to this problem?

 

4 comments

Unfortunately, there is currently no sample code for your setup that I'm aware of and could share. You might try browsing existing language plugins for a similar use case: https://plugins.jetbrains.com/search?products=idea&tags=Languages

For managing file-include information at later stages, com.intellij.psi.impl.include.FileIncludeProvider can be used.


Good news, today I can share some instructions. HTH.

 

===== Theory ===================================================================

1. Pre-processing happens "before", so it belongs to the lexer level, not the parser.
    Otherwise all those "consumeToken" checks etc. are needed (very bad performance-wise).

2. OK, so the lexer level is the right place: the parser handles the language semantics, not the preprocessor.

3. The IntelliJ Lexer API works with zero-length tokens quite well, so a LookAheadLexer can
    easily spot a preprocessor directive and expand it to the required token sequence.

4. We also need to think a bit ahead: how are multiple very large header files to be handled when
    a tiny change in a top-level #DEFINE may change a file's AST significantly?
    There must be a common, lazily built symbol cache known to the lexer. The lexer would then
    feed tokens from the cache to the parser.

5. "One more thing": a group of #DEFINEs inside an #IFDEF can be reduced to a number of "switches"
    packed into a zero-length token, so that there's no need to feed "real" tokens from a header file.


[4] is the trickiest part; everything must be aggressively optimised (as in [5]) to achieve reasonable performance.
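Point [5] can be illustrated outside the platform. Below is a toy model (the class and method names are my own invention, not IntelliJ API) of packing the outcomes of a header's #IFDEF branches into a single bitmask that a zero-length token could carry:

```
import java.util.List;

/** Toy "packed switches" token payload: the outcome of each #IFDEF in a
 *  header, encoded as a bitmask instead of the header's real token stream. */
public class PackedSwitches {

    /** Bit i is set when the i-th #IFDEF condition held. */
    public static long pack(List<Boolean> branches) {
        long mask = 0;
        for (int i = 0; i < branches.size(); i++) {
            if (branches.get(i)) mask |= 1L << i;
        }
        return mask;
    }

    /** True when the i-th conditional branch was taken. */
    public static boolean taken(long mask, int i) {
        return (mask & (1L << i)) != 0;
    }
}
```

A single long per header is far cheaper to compare and cache than re-feeding the header's tokens on every edit.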

I suggested that C-like preprocessor handling code could be generalised for use in various languages.
A colleague of mine agreed, but added "not in the near future".


Grammar-Kit has little to do with all of the above, as we keep the lexer and parser layers clearly separated.
LookAheadLexer would lazily employ parser logic for header files via the shared symbol cache.

Note that there are registered and non-registered IElementTypes; the custom "packed" types shall be non-registered.
Preprocessor tokens can be kept as-is for in-editor features and skipped by the parser like whitespace.
That's it.
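The lazily built symbol cache from point [4] can also be sketched as a self-contained toy (all names here are illustrative, not platform API; real code would key the cache on the file and guard cyclic includes with RecursionGuard):

```
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of a lazily built, per-file #define symbol cache. */
public class SymbolCache {
    private static final Pattern DEFINE = Pattern.compile("#define\\s+(\\w+)\\s+(.*)");

    private final Map<String, Map<String, String>> perFile = new HashMap<>();
    private final Function<String, String> fileContents; // path -> file text

    public SymbolCache(Function<String, String> fileContents) {
        this.fileContents = fileContents;
    }

    /** Builds the symbol map for a file on first request, then reuses it. */
    public Map<String, String> symbolsFor(String path) {
        return perFile.computeIfAbsent(path, p -> {
            Map<String, String> defines = new HashMap<>();
            for (String line : fileContents.apply(p).split("\n")) {
                Matcher m = DEFINE.matcher(line.trim());
                if (m.matches()) defines.put(m.group(1), m.group(2).trim());
            }
            return defines;
        });
    }
}
```

The important property is that the file is read and scanned only once; every later lexer pass over an including file hits the cached map.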



===== Practice ===================================================================


Here's a step-by-step procedure (in C), so consider header.h and source.c:

header.h:
```
#define FOO 4
```

source.c:
```
#include "header.h"
int main() { int i = FOO; }
```
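For orientation: after preprocessing, source.c behaves as if it read `int main() { int i = 4; }`. A purely textual sketch of that substitution (illustrative only; the real implementation works on tokens, not on text):

```
import java.util.Map;

/** Toy textual macro expansion for the header.h / source.c example. */
public class TextExpander {

    /** Replaces whole-word occurrences of each defined symbol with its body. */
    public static String expand(String source, Map<String, String> defines) {
        String out = source;
        for (Map.Entry<String, String> e : defines.entrySet()) {
            out = out.replaceAll("\\b" + e.getKey() + "\\b", e.getValue());
        }
        return out;
    }
}
```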

I. Lexer
1. The lexer slowly goes through source.c, minding all includes, and stops at #include "header.h".
2. After returning all the <#include> and <string> tokens, the lexer queries all symbols from header.h,
    especially all #DEFINEs.
3. Suppose the cache is empty; before we proceed any further, the cache finds header.h,
    runs it through step [1] first, then builds the symbol map and returns to step [2].
    Use the RecursionGuard API here so as not to run into a StackOverflowError.
4. Now the lexer can proceed, and it goes all the way to FOO, matching every identifier against all
    available #DEFINE symbols, and here's the one: FOO.
5. Instead of returning FOO as <id>, it returns FOO as some <#define reference> token which is marked as
    whitespace and will be skipped by the parser, but we need it to provide navigation/resolve/etc. later.
6. It then returns the zero-length sequence of tokens that comprises the expansion of the FOO macro,
    i.e. <integer>.
7. Now the lexer part is done (PsiBuilder.cacheLexems) and we are feeding tokens to the parser.
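Steps [4]–[6] can be mimicked with a toy tokenizer (the token spellings and the class name are my own invention, not the platform Lexer API): a known #define symbol becomes a whitespace-flagged reference token followed by the tokens of its expansion.

```
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy illustration of steps 4-6: expanding #define symbols while lexing. */
public class MacroLexer {

    /** Splits a line on whitespace; a known #define symbol is emitted as a
     *  <define-ref> token (skipped by the parser like whitespace, but kept
     *  for navigation/resolve) followed by an <expansion> token for its body. */
    public static List<String> tokenize(String line, Map<String, String> defines) {
        List<String> tokens = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (defines.containsKey(word)) {
                tokens.add("<define-ref:" + word + ">");
                tokens.add("<expansion:" + defines.get(word) + ">");
            } else {
                tokens.add(word);
            }
        }
        return tokens;
    }
}
```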

II.
8. The parser works through the tokens and builds the AST for #include, as you mentioned.
9. Then comes main(); in its body the FOO part is skipped as whitespace and the <integer> is processed
    as the assignment rvalue.

IIa.
10. The parser routine for header.h is exactly the same as for source.c.
11. Work through all the tokens and build the AST for all parts; all characters must be covered.
     #DEFINE FOO 4 shall become OCDefineDirective(foo), etc.

The interesting part is how the symbol table / per-file cache is designed, but that is up to you.
It certainly must not be PSI-based, but one shall be able to navigate to PSI via a Symbol.

This way you can include any ".h" into ".c" or ".c" into ".h"; the parser and the lexer are the same.
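One possible shape for such a non-PSI Symbol (a sketch with invented fields, not platform API) is a plain value object that records where its declaration lives, so the corresponding PSI element can be located on demand:

```
/** Toy non-PSI symbol: a name plus the location of its declaration.
 *  Navigation code can later find the PSI element at (file, offset),
 *  so the cache itself never holds PSI and survives PSI invalidation. */
public class Symbol {
    public final String name;
    public final String file;   // path of the declaring file
    public final int offset;    // text offset of the declaration

    public Symbol(String name, String file, int offset) {
        this.name = name;
        this.file = file;
        this.offset = offset;
    }
}
```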

Speaking of Lexer API:

ParserDefinition.createLexer(Project) is sometimes called with "null".
But "real" files are lexed/parsed via the IFileElementType or IStubFileElementType API, which is also
ILazyParseableElementType, so there are some methods to override where you have all the project configuration at hand:
   com.intellij.psi.tree.ILazyParseableElementType#parseContents
   com.intellij.psi.tree.ILazyParseableElementType#doParseContents
So you can build a scope in which to look for header files and create the lexer with it.
The default lexer constructor may create an instance that operates on an empty scope and provides no preprocessing facilities.
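This design point can be sketched with a toy lexer (the class name and the Map-based "scope" are my own simplification): one constructor receives a resolution scope built from the project configuration, while the default constructor operates on an empty scope and performs no preprocessing.

```
import java.util.Collections;
import java.util.Map;

/** Toy lexer that may be created with an include-resolution scope.
 *  The default constructor uses an empty scope: includes are left
 *  unresolved and no preprocessing happens, mirroring the case where
 *  no project configuration is available. */
public class ScopedLexer {
    private final Map<String, String> scope; // include name -> file text

    public ScopedLexer() {
        this(Collections.emptyMap());
    }

    public ScopedLexer(Map<String, String> scope) {
        this.scope = scope;
    }

    /** Resolves an #include target inside the configured scope, or null. */
    public String resolveInclude(String name) {
        return scope.getOrDefault(name, null);
    }
}
```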

Also, to highlight the file correctly you can skip some of the preprocessing work, such as zero-length expansions, to save CPU.
See the com.intellij.openapi.fileTypes.SyntaxHighlighterFactory API.



===== More Practice ===================================================================


Since FOO shall be a reference, and #define / #include should become ASTNodes,
they cannot simply be skipped as whitespace; instead
they shall be "parsed" on each consumeToken()/advance()/etc. before the raw SV parser kicks in.

```
public class MyBuilder extends PsiBuilderAdapter {

  public MyBuilder(PsiBuilder delegate) {
    super(delegate);
  }

  /** Preprocessor rules can be defined in the same Grammar-Kit grammar,
   *  so we just call the "macros" rule before any token is observed:
   *  macros ::= macro_define | macro_include | macro_call | .. */
  private void parseMacros() {
    SVGeneratedParser.macros(this, 1);
  }

  @Override public void advanceLexer()         { parseMacros(); super.advanceLexer(); }
  @Override public String getTokenText()       { parseMacros(); return super.getTokenText(); }
  @Override public IElementType getTokenType() { parseMacros(); return super.getTokenType(); }
  @Override public boolean eof()               { parseMacros(); return super.eof(); }
  @Override public int getCurrentOffset()      { parseMacros(); return super.getCurrentOffset(); }
}
```
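The same "handle directives before every token access" idiom can be shown without the platform (a toy model with invented names; in the real adapter the directives are parsed into AST nodes rather than merely skipped):

```
import java.util.List;

/** Toy delegating token source: preprocessor directives are consumed
 *  eagerly before any lookup, so the real parser only ever observes
 *  ordinary tokens. */
public class DirectiveSkippingSource {
    private final List<String> tokens;
    private int pos = 0;

    public DirectiveSkippingSource(List<String> tokens) {
        this.tokens = tokens;
    }

    /** Equivalent of parseMacros(): handle directives before any access. */
    private void consumeDirectives() {
        while (pos < tokens.size() && tokens.get(pos).startsWith("#")) pos++;
    }

    public String currentToken() {
        consumeDirectives();
        return pos < tokens.size() ? tokens.get(pos) : null;
    }

    public void advance() {
        consumeDirectives();
        pos++;
    }

    public boolean eof() {
        consumeDirectives();
        return pos >= tokens.size();
    }
}
```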

Hi Yann,

Many thanks for your very detailed explanation! It helped me confirm that my approach is correct, as mine is now almost the same. I hope I will not get dramatically stuck somewhere.

Thanks also for linking this answer to my previous question. Both questions are related to preprocessing, but this one is more specific and assumes that the general solution is already worked out, so let me follow up on it here. I will ask some additional questions in the previous thread.

--------------------

 

The general problem I raised in this thread is still not clear to me.

Let me quote relevant fragments from your answer:

After returning all the <#include> and <string> back, the lexer queries all symbols from header.h and especially all #DEFINES.

...

ParserDefinition.createLexer(Project) is sometimes called with "null" But "real" files are lexed/parsed via IFileElementType or IStubFileElementType API

 

What does "lexed via IFileElementType" mean? The Lexer interface accepts only a CharSequence, and there is no information about the containing file. Also, a Lexer instance is created using ParserDefinition.createLexer(Project), which gives us the context of the project, not of the file.

In the presented example, header.h is (or could be) relative to source.c. How can the Lexer find the header.h file if it does not know about source.c?

 


This is great stuff Yann, thanks for such a detailed answer. I've filed it away in case I ever need to do this.

I have one question: as far as I am aware, it is a basic restriction that index information for a file should be based only on that file's contents, not on any other file's contents. This is so that index invalidation works correctly when files are modified. This implementation is a clear violation of that; how should it be handled? Can the index information for source.c be invalidated somehow when header.h is modified? Or is some other approach recommended?

