Enhance Spellchecker

Created April 07, 2021 15:04

You can provide a custom com.intellij.spellchecker.tokenizer.SpellcheckingStrategy and/or a custom bundled dictionary via com.intellij.spellchecker.BundledDictionaryProvider.

Michael Wölk

Created April 08, 2021 09:57

I've created a class:

package application;

import com.intellij.psi.PsiElement;
import com.intellij.spellchecker.tokenizer.SpellcheckingStrategy;
import com.intellij.spellchecker.tokenizer.Tokenizer;
import org.jetbrains.annotations.NotNull;

public class spellchecker extends SpellcheckingStrategy {

    @NotNull
    @Override
    public Tokenizer getTokenizer(PsiElement element) {
        System.out.println("+++");
        System.out.println(element);
        return EMPTY_TOKENIZER;

    }

}

And registered it:

<extensions defaultExtensionNs="com.intellij">
    <spellchecker.support language="HTML" implementationClass="application.spellchecker"/>
</extensions>

But I don't see any debug output. *scratch head*

I have also registered an "ApplicationStart" component, which is reached, so the plugin is working in general.

Yann Cebron

Created April 08, 2021 10:00

That's because there is already an existing strategy for HTML: com.intellij.spellchecker.xml.HtmlSpellcheckingStrategy.

Try adding order="first" to your EP declaration in plugin.xml.
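Applied to the registration shown earlier, the declaration might look like the sketch below. Only the order attribute is new; the language and implementationClass values are copied from the original post.

```xml
<extensions defaultExtensionNs="com.intellij">
    <!-- order="first" asks the platform to consult this strategy before the bundled HTML one -->
    <spellchecker.support language="HTML"
                          implementationClass="application.spellchecker"
                          order="first"/>
</extensions>
```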

Michael Wölk

Created April 09, 2021 09:36

I ended up adding a new custom inspection (it extends SpellCheckingInspection).

To solve my problem I needed some recursion and smart algorithms. Not the easiest task :-D

I first try to find as many valid words as I can and put them in a child-parent word tree (...). But a lot of words turn out to be valid that way. Example:

Word: "emailserver"

validWordsTree: {em=null, ai=em, ls=ai, er=erv, erv=emails, ail=em, server=email, ails=em, ema=null, il=ema, ilse=ema, rv=ilse, email=null, emails=null}

As you can see, many "words" with only two letters are valid. So I decided to use a minimum length of 3 characters in my plugin to reduce false positives:

validWordsTree: {ema=null, ilse=ema, email=null, server=email, emails=null, erv=emails}
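The splitting idea can also be tried outside the IDE. The sketch below is not the plugin's child-parent tree: it swaps in a standard word-break dynamic programme, and it uses a plain Set<String> as a stand-in for the SpellCheckerManager dictionary. The class name CompoundSplitter and the sample dictionary are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the IntelliJ Platform.
class CompoundSplitter {
    static final int MIN_WORD_LENGTH = 3;

    /**
     * Returns one way to split {@code word} into dictionary words of at least
     * MIN_WORD_LENGTH characters, or an empty list if no such split exists.
     */
    static List<String> split(String word, Set<String> dictionary) {
        int n = word.length();
        // reachable[i] is true when word[0..i) can be fully covered by dictionary words.
        boolean[] reachable = new boolean[n + 1];
        // prev[i] is the start index of the last part in that cover.
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        reachable[0] = true;

        for (int end = MIN_WORD_LENGTH; end <= n; end++) {
            for (int start = 0; start + MIN_WORD_LENGTH <= end; start++) {
                if (reachable[start] && dictionary.contains(word.substring(start, end))) {
                    reachable[end] = true;
                    prev[end] = start;
                    break;
                }
            }
        }

        List<String> parts = new ArrayList<>();
        if (!reachable[n]) {
            return parts;
        }
        // Walk the back-pointers from the end of the word to rebuild the parts.
        for (int end = n; end > 0; end = prev[end]) {
            parts.add(0, word.substring(prev[end], end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Sample dictionary is an assumption; a real plugin would ask the spellchecker instead.
        Set<String> dictionary = Set.of("email", "emails", "server", "erv", "ema", "ilse");
        System.out.println(split("emailserver", dictionary));
        System.out.println(split("dnsserver", dictionary));
    }
}
```

Because each position is visited once per possible start index, the check stays quadratic in the word length and cannot get stuck on dead-end prefixes such as "ema" or "emails".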

This works well so far. But I found that some words are reported as invalid when they should be valid. Example "dnsserver":

isValidWord: dnsserver: false
isValidWord: dns: false (<= Should be true)
isValidWord: dnss: false
isValidWord: dnsse: false
isValidWord: dnsser: false
isValidWord: dnsserv: false
isValidWord: dnsserve: false
isValidWord: dnsserver: false

I check these with:

private boolean isValidWord(String word) {
    return !myManager.hasProblem(word);
}

So the question is: why is "dns" or "php" not a valid word according to the check "!myManager.hasProblem(word);"?

These words are valid on their own.

Michael Wölk

Created April 11, 2021 16:10

TL;DR. The question was:

Why is "dns" or "php" not a valid word according to the check "!myManager.hasProblem(word);"?

These words are valid on their own.

Yann Cebron

Created April 13, 2021 14:40

So basically, you now have the default spellchecking inspection working and yours in addition? Or do you suppress the default's "false positives" in your plugin?

And what is "myManager"? Please always share full code.

Michael Wölk

Created April 17, 2021 10:00

Yes, mine runs in addition. To eliminate the false positives, I add the words my plugin found to the dictionary as correct (if you have a better idea for overriding the original inspection instead, please tell me).

myManager is com.intellij.spellchecker.SpellCheckerManager. As I said, I extend everything from SpellCheckingInspection. That is where myManager in MyTokenConsumer comes from (I don't know why it has the my- prefix when it comes from you :-) ).

Full code for now:

(most of it is copy-paste; see the end of the file for the relevant changes)

package application;

import com.intellij.codeInspection.ProblemDescriptor;
import com.intellij.codeInspection.ProblemDescriptorBase;
import com.intellij.codeInspection.ProblemHighlightType;
import com.intellij.codeInspection.ProblemsHolder;
import com.intellij.lang.*;
import com.intellij.lang.refactoring.NamesValidator;
import com.intellij.openapi.util.TextRange;
import com.intellij.psi.PsiElement;
import com.intellij.psi.PsiElementVisitor;
import com.intellij.psi.tree.IElementType;
import com.intellij.spellchecker.SpellCheckerManager;
import com.intellij.spellchecker.inspections.SpellCheckingInspection;
import com.intellij.spellchecker.inspections.Splitter;
import com.intellij.spellchecker.quickfixes.SpellCheckerQuickFix;
import com.intellij.spellchecker.tokenizer.LanguageSpellchecking;
import com.intellij.spellchecker.tokenizer.SpellcheckingStrategy;
import com.intellij.spellchecker.tokenizer.TokenConsumer;
import com.intellij.spellchecker.tokenizer.Tokenizer;
import com.intellij.spellchecker.util.SpellCheckerBundle;
import com.intellij.util.Consumer;
import gnu.trove.THashSet;
import org.jetbrains.annotations.Nls;
import org.jetbrains.annotations.NonNls;
import org.jetbrains.annotations.NotNull;
import org.jetbrains.annotations.Nullable;

import java.util.*;

public class MySpellCheckingInspection extends SpellCheckingInspection {

    public static final String SPELL_CHECKING_INSPECTION_TOOL_NAME = "MySpellCheckingInspection";
    public static final int MIN_WORD_LENGTH = 3;

    @Override
    @NonNls
    @NotNull
    public String getShortName() {
        return SPELL_CHECKING_INSPECTION_TOOL_NAME;
    }

    @Nls(capitalization = Nls.Capitalization.Sentence)
    @NotNull
    @Override
    public String getDisplayName() {
        return SPELL_CHECKING_INSPECTION_TOOL_NAME;
    }

    @Nls(capitalization = Nls.Capitalization.Sentence)
    @NotNull
    @Override
    public String getGroupDisplayName() {
        return SPELL_CHECKING_INSPECTION_TOOL_NAME;
    }

    private static SpellcheckingStrategy getSpellcheckingStrategy(@NotNull PsiElement element, @NotNull Language language) {
        for (SpellcheckingStrategy strategy : LanguageSpellchecking.INSTANCE.allForLanguage(language)) {
            if (strategy.isMyContext(element)) {
                return strategy;
            }
        }
        return null;
    }

    private static ProblemDescriptor createProblemDescriptor(PsiElement element, int offset, TextRange textRange,
                                                             SpellCheckerQuickFix[] fixes,
                                                             boolean onTheFly) {
        SpellcheckingStrategy strategy = getSpellcheckingStrategy(element, element.getLanguage());
        final Tokenizer tokenizer = strategy != null ? strategy.getTokenizer(element) : null;
        if (tokenizer != null) {
            textRange = tokenizer.getHighlightingRange(element, offset, textRange);
        }
        assert textRange.getStartOffset() >= 0;

        final String description = SpellCheckerBundle.message("typo.in.word.ref");
        return new ProblemDescriptorBase(element, element, description, fixes, ProblemHighlightType.GENERIC_ERROR_OR_WARNING, false, textRange, onTheFly, onTheFly);
    }

    private static void addBatchDescriptor(PsiElement element,
                                           int offset,
                                           @NotNull TextRange textRange,
                                           @NotNull ProblemsHolder holder) {
        System.out.println("addBatchDescriptor");
        SpellCheckerQuickFix[] fixes = SpellcheckingStrategy.getDefaultBatchFixes();
        ProblemDescriptor problemDescriptor = createProblemDescriptor(element, offset, textRange, fixes, false);
        holder.registerProblem(problemDescriptor);
    }

    private static void addRegularDescriptor(PsiElement element, int offset, @NotNull TextRange textRange, @NotNull ProblemsHolder holder,
                                             boolean useRename, String wordWithTypo) {
        System.out.println("addRegularDescriptor");
        SpellcheckingStrategy strategy = getSpellcheckingStrategy(element, element.getLanguage());

        SpellCheckerQuickFix[] fixes = strategy != null
                ? strategy.getRegularFixes(element, offset, textRange, useRename, wordWithTypo)
                : SpellcheckingStrategy.getDefaultRegularFixes(useRename, wordWithTypo, element);

        final ProblemDescriptor problemDescriptor = createProblemDescriptor(element, offset, textRange, fixes, true);
        holder.registerProblem(problemDescriptor);
    }

    @Override
    @NotNull
    public PsiElementVisitor buildVisitor(@NotNull final ProblemsHolder holder, final boolean isOnTheFly) {
        final SpellCheckerManager manager = SpellCheckerManager.getInstance(holder.getProject());

        return new PsiElementVisitor() {
            @Override
            public void visitElement(final PsiElement element) {
                if (holder.getResultCount()>1000) return;

                final ASTNode node = element.getNode();
                if (node == null) {
                    return;
                }

                // Extract parser definition from element
                final Language language = element.getLanguage();
                final IElementType elementType = node.getElementType();
                final ParserDefinition parserDefinition = LanguageParserDefinitions.INSTANCE.forLanguage(language);

                // Handle selected options
                if (parserDefinition != null) {
                    if (parserDefinition.getStringLiteralElements().contains(elementType)) {
                        if (!processLiterals) {
                            return;
                        }
                    }
                    else if (parserDefinition.getCommentTokens().contains(elementType)) {
                        if (!processComments) {
                            return;
                        }
                    }
                    else if (!processCode) {
                        return;
                    }
                }

                tokenize(element, language, new MySpellCheckingInspection.MyTokenConsumer(manager, holder, LanguageNamesValidation.INSTANCE.forLanguage(language)));
            }
        };
    }

    private static class MyTokenConsumer extends TokenConsumer implements Consumer<TextRange> {
        private final Set<String> myAlreadyChecked = new THashSet<>();
//        HashMap<String, String> validWords = new HashMap<>();
        Map<String, String> validWords = new LinkedHashMap<>();

        private final SpellCheckerManager myManager;
        private final ProblemsHolder myHolder;
        private final NamesValidator myNamesValidator;
        private PsiElement myElement;
        private String myText;
        private boolean myUseRename;
        private int myOffset;

        MyTokenConsumer(SpellCheckerManager manager, ProblemsHolder holder, NamesValidator namesValidator) {
            myManager = manager;
            myHolder = holder;
            myNamesValidator = namesValidator;
        }

        @Override
        public void consumeToken(final PsiElement element,
                                 final String text,
                                 final boolean useRename,
                                 final int offset,
                                 TextRange rangeToCheck,
                                 Splitter splitter) {
            myElement = element;
            myText = text;
            myUseRename = useRename;
            myOffset = offset;
            splitter.split(text, rangeToCheck, this);
        }

        @Override
        public void consume(TextRange textRange) {
            String word = textRange.substring(myText);
            if (!myHolder.isOnTheFly() && myAlreadyChecked.contains(word)) {
                return;
            }

            boolean keyword = myNamesValidator.isKeyword(word, myElement.getProject());
            if (keyword) {
                return;
            }

            System.out.println(word);

            boolean hasProblems = !isValidWord(word);

            if (hasProblems) {
                hasProblems = !multiWordCheck(word);
//                if (!hasProblems) {
//                    myManager.acceptWordAsCorrect(word, myManager.getProject());
//                }
            }
            if (hasProblems) {
                int aposIndex = word.indexOf('\'');
                if (aposIndex != -1) {
                    word = word.substring(0, aposIndex); // IdentifierSplitter.WORD leaves &apos;
                }
                hasProblems = myManager.hasProblem(word);
            }
            if (hasProblems) {
                if (myHolder.isOnTheFly()) {
                    addRegularDescriptor(myElement, myOffset, textRange, myHolder, myUseRename, word);
                }
                else {
                    myAlreadyChecked.add(word);
                    addBatchDescriptor(myElement, myOffset, textRange, myHolder);
                }
            }
        }

        private boolean multiWordCheck(String originWord) {
            System.out.println("==== multiWordCheck : " + originWord);

            validWords.clear();

            createValidWordTree(originWord, null);

            System.out.println("validWords: " + validWords);

            return canResolveWord(originWord);

        }

        private boolean canResolveWord(String originWord) {
            for (Map.Entry<String, String> entry : validWords.entrySet()) {
                String childWord = entry.getKey();
                String parentWord = entry.getValue();
                String resolvedWord = resolveWordFromTree(childWord);
//                System.out.println("resolvedWord: " + resolvedWord);

                if (originWord.equals(resolvedWord)) {
//                    System.out.println("resolvedWord: " + resolvedWord);
                    return true;
                }

            }

            return false;
        }

        private String resolveWordFromTree(String word) {
            String parentWord = getParentWordFromTree(word);
            if (parentWord != null) {
                return resolveWordFromTree(parentWord) + word;
//                return resolveWordFromTree(parentWord) + "|" + word;
            }
            return word;
        }

        private String getParentWordFromTree(String matchWord) {
            // Direct lookup; returns null for root words and for words that are not in the tree.
            return validWords.get(matchWord);
        }


        private void createValidWordTree(String word, @Nullable String parentWord) {
//            System.out.println("=>" + word + ": " + isValidWord(word));
            if (isValidWord(word) && word.length() >= MIN_WORD_LENGTH) {
                validWords.put(word, parentWord);
                return;
            }

            ArrayList<String> words = splitWords(word);

            for (String subWord : words) {
                if (isValidWord(subWord)) {
                    validWords.put(subWord, parentWord);
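                    // Strip only the matched prefix; replace() would also remove any later occurrence of subWord.
                    String leftWordPart = word.substring(subWord.length());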
                    createValidWordTree(leftWordPart, subWord);
                }
            }
        }

        private ArrayList<String> splitWords(String word) {
            int strLen = word.length();
            ArrayList<String> words = new ArrayList<>();
            for (int i=MIN_WORD_LENGTH; i <= strLen; i++) {
                String subWord = word.substring(0, i);
                words.add(subWord);
            }
            return words;
        }

        private boolean isValidWord(String word) {
            return !myManager.hasProblem(word);
//            boolean isValid = !myManager.hasProblem(word);
//            System.out.println("isValidWord: " + word + ": " + isValid);
//            return isValid;
        }
    }

}

Vladislav Tankov

Created April 20, 2021 10:24

As for the question about the spellchecker and words: the spellchecker ignores words of length <= 3, so it never actually calls `hasProblem` for `dns`. As a result, it may simply not include a lot of valid words of length <= 3.

As for the whole problem: we are planning to migrate to Lucene's pure-Java Hunspell implementation, which would solve this problem out of the box :)
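Until then, one possible workaround in a custom check is to consult a separate list of short terms before asking the manager. The sketch below is only an illustration under stated assumptions: ShortWordAwareChecker, its allow-list, and the Predicate that stands in for `myManager.hasProblem(word)` are all hypothetical and not part of the platform API.

```java
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical wrapper, not part of the IntelliJ Platform.
class ShortWordAwareChecker {
    // Words of this length or shorter are looked up in the allow-list first.
    static final int SHORT_WORD_LIMIT = 3;

    // Assumed, plugin-maintained list of short technical terms (lower case).
    private final Set<String> shortWords;
    // Stand-in for word -> myManager.hasProblem(word).
    private final Predicate<String> hasProblem;

    ShortWordAwareChecker(Set<String> shortWords, Predicate<String> hasProblem) {
        this.shortWords = shortWords;
        this.hasProblem = hasProblem;
    }

    boolean isValidWord(String word) {
        if (word.length() <= SHORT_WORD_LIMIT
                && shortWords.contains(word.toLowerCase(Locale.ROOT))) {
            return true;
        }
        // Everything else is delegated to the regular dictionary check.
        return !hasProblem.test(word);
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("server", "email");
        ShortWordAwareChecker checker = new ShortWordAwareChecker(
                Set.of("dns", "php"),
                word -> !dictionary.contains(word));
        System.out.println(checker.isValidWord("dns"));
        System.out.println(checker.isValidWord("server"));
    }
}
```

In the inspection posted above, this would correspond to checking the allow-list inside isValidWord before calling `myManager.hasProblem(word)`.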

Michael Wölk

Created April 25, 2021 08:09

If you're planning something, then I can wait. Thanks.