Format of the *.dic dictionary files of the 9.0 spell checker

Hi,

In the new IntelliJ 9.0 you can add your own dictionary files for the built-in spell checker. Two are bundled: an English one and one with special terms called jetbrains.

I would like to add a German *.dic file. So I downloaded the igerman98 sources [1] and tried the generated *.dic files for ispell/hunspell, but none of them seems to work. Then I downloaded the old German dictionary [2] from the old spell checker plugin. In the JAR, I found a german.0 file. I renamed it to german.dic and added it to IntelliJ 9.0. It works. But then I noticed that words with German umlauts and ß are not recognized.

The difference is that the german.0 file inside the old German plugin dictionary is simply a list of newline-separated words, whereas the ispell/hunspell *.dic files also contain control information for conjugation and case. I suppose that the spell checker in IntelliJ 9.0 doesn't support this.

So is there a way to convert ispell/hunspell *.dic files into the simpler format supported by IntelliJ 9.0?

[1]: http://www.j3e.de/ispell/igerman98/
[2]: http://plugins.intellij.net/plugin/?idea&id=1658
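
For illustration, the crudest possible conversion simply drops the control information and keeps the bare stems. This is only a sketch under assumptions: the function name `strip_hunspell_flags` is made up here, the input is assumed to follow the usual hunspell layout (an entry count on the first line, then `word/FLAGS` entries), and escaped slashes inside words are not handled. It does not generate any inflected forms, so the resulting list will be incomplete.

```python
def strip_hunspell_flags(lines):
    """Return the bare stems from the lines of a hunspell-style .dic file."""
    words = []
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            continue
        # The first line of a hunspell .dic file is normally an entry count.
        if i == 0 and line.isdigit():
            continue
        # Drop any fields after whitespace, then the flags after "/".
        stem = line.split()[0].split("/", 1)[0]
        if stem:
            words.append(stem)
    return words


# Hypothetical sample entries; the flags are invented for the example.
sample = ["3", "Haus/EPST", "laufen/IXY", "und"]
print("\n".join(strip_hunspell_flags(sample)))
```

Writing the returned words one per line into a *.dic file would give the plain format described above, minus every form that the flags would have produced.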

Official comment

@Pokriefke @Richard The Hunspell plugin is available now. Please install it and you will be able to add hunspell dictionaries to the spell checker.

Hi Alexander,

As regards the German dictionary - we are going to bundle it in the next
available IDEA public build (IDEA 9.0.1 will be released before the New
Year). Also, any custom dictionary with its encoding set to UTF-8 will be
supported in full. (Currently the default system encoding is used when
loading dictionaries.)
As regards dictionary format and converters - in the near future we
are not going to create a public converter for ispell/hunspell
dictionaries. A simple newline-separated word list seems to us a simple
way to let users extend the spell checker's knowledge base. The special
.dic extension was chosen just to simplify scanning of the user folder.
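
To illustrate the encoding point: if a word list is stored in a legacy single-byte encoding, it can be rewritten as UTF-8 before it is added. The sketch below is not part of IDEA. The helper name `reencode_to_utf8` is made up, and ISO-8859-1 as the source encoding is an assumption; the actual encoding of a given dictionary file has to be checked.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="iso-8859-1"):
    """Rewrite a word list as UTF-8; source_encoding is an assumption."""
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")


# Demo with a throwaway file holding Latin-1 bytes for two German words.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "german.0"
    new = Path(tmp) / "german.dic"
    old.write_bytes("Größe\nÜbung\n".encode("iso-8859-1"))
    reencode_to_utf8(old, new)
    converted = new.read_bytes().decode("utf-8")
    print(converted.split())
```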

--
Ekaterina Shliakhovetskaja
Senior Software Developer
JetBrains, Inc.
http://www.jetbrains.com/
"Develop with Pleasure!"

On 10.12.2009 18:03, Alexander Kiel wrote:

---
Original message URL: http://www.jetbrains.net/devnet/message/5252314#5252314



Hi Ekaterina,

Thanks for your reply.

The issue I see with the simple word list format is that you have to add every conjugated form, in both lower and upper case.

When I asked for a converter, I had this issue in mind. The converter could expand an ispell/myspell/hunspell dictionary into a word list. I'm not sure whether one is already available. A quick Google search did not reveal anything.

I think hunspell is widely used [1] and more advanced than a simple word list. Unfortunately, there is no pure Java implementation available [1, 2]. I think a JNI wrapper is not a good solution for IDEA.

[1]: http://en.wikipedia.org/wiki/Hunspell
[2]: http://www.mail-archive.com/dev@lingucomponent.openoffice.org/msg01519.html
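
To make the idea of such a converter concrete, here is a toy sketch of the expansion step. Everything in it is an assumption for illustration: the `RULES` table is a hypothetical, heavily simplified stand-in for a real affix file (suffixes only, single-character flags, no prefixes or compound handling), and the function names are made up.

```python
import re

# Hypothetical, simplified suffix rules keyed by flag: (strip, add, condition).
# Real affix files carry much more (prefixes, cross products, compounds).
RULES = {
    "S": [("", "s", r".*[^s]$")],
    "N": [("", "n", r".*e$"), ("", "en", r".*[^e]$")],
}


def expand(entry, rules):
    """Expand one 'stem/FLAGS' entry into the stem plus its suffixed forms."""
    stem, _, flags = entry.partition("/")
    forms = [stem]
    for flag in flags:
        for strip, add, condition in rules.get(flag, []):
            # Apply a rule only if the stem satisfies its condition.
            if not re.fullmatch(condition, stem):
                continue
            base = stem[: len(stem) - len(strip)] if strip else stem
            forms.append(base + add)
    return forms


def with_case_variants(word):
    """Return the word together with its capitalized form."""
    return sorted({word, word[:1].upper() + word[1:]})


for entry in ["Auto/S", "Blume/N", "und"]:
    print(expand(entry, RULES))
```

Running every dictionary entry through `expand` and then `with_case_variants` would yield the kind of flat word list the simple format needs, at the cost of a much larger file.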


Hi Alexander,

Thank you for your investigations. The choice of dictionary format is
based on the language-independent behavior of the spell checker. We don't
use stemming at all. This forces the dictionary to grow, but we now have
a way to compress the dictionary and prevent huge memory usage.
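
The thread does not say how the compression works. Purely as an illustration of one common technique, and not a description of IDEA's implementation, a prefix tree (trie) stores shared word beginnings only once, which helps when a list contains many inflected forms of the same stem. The names below are made up, and in Python the nested dictionaries are themselves not memory-efficient; the node count is what shows the sharing.

```python
END = ""  # end-of-word marker; can never collide with a single character


def build_trie(words):
    """Build a nested-dict prefix tree from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(trie, word):
    """Return True if the exact word was inserted into the trie."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


def count_nodes(trie):
    """Count character nodes, ignoring end-of-word markers."""
    return sum(1 + count_nodes(child) for key, child in trie.items() if key != END)


words = ["lauf", "laufe", "laufen", "laufend"]
trie = build_trie(words)
print(sum(len(w) for w in words), "characters stored in", count_nodes(trie), "nodes")
```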

--
Ekaterina Shliakhovetskaja
Senior Software Developer
JetBrains, Inc.
http://www.jetbrains.com/
"Develop with Pleasure!"

On 10.12.2009 22:24, Alexander Kiel wrote:


---
Original message URL: http://www.jetbrains.net/devnet/message/5252363#5252363



Hi Ekaterina,

Okay, thanks. I'm looking forward to the German dictionary in version 9.0.1.


Hi Ekaterina,

I have installed IDEA 9.0.1 via the patch update on Linux, but I don't see a German dictionary. Is it scheduled for a later version?

Thanks
Alexander

"As regards the German dictionary - we are going to bundle it in the next
available IDEA public build (IDEA 9.0.1 will be released before the New
Year)."

Hello Ekaterina,

I would be very interested in a German dictionary, too.
Yet IDEA 9.0.3 bundles only the English one (which is fine with me).
But any pointers on how to create a custom dictionary without starting from scratch would be welcome.

Thanks!
- Ben


"As regards dictionary format and converters - in the near future we
are not going to create a public converter for ispell/hunspell
dictionaries."

As you probably already know, some hunspell dictionaries do NOT contain all words; they contain only lemmas from which all forms of a word can be generated. A good example is dicollecte.

How can we get all the words from a hunspell dic file into IntelliJ?


Sorry for bringing this old topic back to the top, but I too would really love to see simple support for hunspell dictionaries in IntelliJ. I tried to compile my own German dictionary and ended up with a list of 1.5M words (by unmunching a German hunspell dic). Unfortunately this list is still incomplete. Even the simple word "Wörterbuch" (dictionary) is not listed, because it is a compound of "Wörter" and "Buch". Getting every possible compound into one file is nearly impossible. I'm not sure whether this problem is unique to German or how many other languages form compound words this way...

It is just a spell checker, but it would be nice if we could enable it in our company again.
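
The compound problem can be sketched in a few lines. Instead of listing every compound, a checker could test whether an unknown word splits into known parts. This is only an illustration with a made-up function name, and it is not how IntelliJ or hunspell work; it ignores German linking elements (such as the "s" in "Arbeitszimmer") and would also accept nonsensical combinations.

```python
def is_compound(word, lexicon, min_part=3):
    """True if word splits into two or more lexicon entries (case-insensitive)."""
    known = {w.lower() for w in lexicon}
    word = word.lower()

    def split_from(start, parts):
        # Reaching the end counts only if at least two parts were used.
        if start == len(word):
            return parts >= 2
        for end in range(start + min_part, len(word) + 1):
            if word[start:end] in known and split_from(end, parts + 1):
                return True
        return False

    return split_from(0, 0)


lexicon = ["Wörter", "Buch"]
print(is_compound("Wörterbuch", lexicon))
```

The `min_part` parameter is an arbitrary guard against splitting into very short fragments; its value of 3 is an assumption, not a linguistic rule.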
