Possible to make a plugin to control IntelliJ via voice?
Hey all, I just made a proof of concept of using one's voice to program in Eclipse (video: https://www.youtube.com/watch?v=Ywmc-D0SqUw). However, I'm fast discovering that developing plugins for Eclipse is a nightmare (just saving the document took three days to figure out), and I'm looking into switching my voice-programming project completely to IntelliJ.
I've followed the tutorial on making a plugin at https://www.youtube.com/watch?v=-ZmQD6Fr6KE and IntelliJ plugin development feels very promising (for example, in IntelliJ I can just use PsiElement, whereas to do the same thing in Eclipse I'd have to couple my entire plugin to the JDT, which is a whole other monster).
There's one critical feature that I need to implement, which is much too difficult to do in Eclipse and which I hope will be easier in IntelliJ. The feature would help the user express a location in the document. In my current program I say "lab 9 crafter 4 rack", which means "line 9, after the fourth right parenthesis", but I want to improve the process a bit:
1. User says "lab 9"
2. A little overlay pops up over line 9 (obscuring line 10, but that's fine). The overlay would look like this mock: http://i.imgur.com/WOktA1F.gif
It describes regions in the line that the user can refer to.
3. User can say "h" to move the cursor to just before the left parenthesis.
I've got other ambitions for the project, such as hooking up voice commands for all the menus and so on, but this is the feature I'm most worried about.
Anyway, if anyone could tell me whether it's possible to make an IntelliJ plugin like this, and give me a rough direction I can start looking into how to make this work, I would greatly appreciate it!
Hi Evan! This is awesome! You should totally do this. There are two projects you might want to check out. The first is a plugin called AceJump, for the style of navigation you're looking for; specifically, take a look at AceFinder for establishing the jump points and AceCanvas for painting the overlay. There is also an EditorHelper class from ideavim that has a number of helpful methods. It's written in Kotlin, but it looks like you're pretty familiar with Java, so that shouldn't be a problem.
The second is a proof of concept we developed for exploring speech-based controls for IntelliJ IDEA; you can find the implementation here, specifically in ASRServiceImpl. Instead of Dragon NaturallySpeaking we use an open-source speech recognition toolkit called CMUSphinx, along with a speech synthesis library called MaryTTS. We have the IDE commands working, but the editor framework currently needs some more work. Let me know if you have any questions or would like to collaborate on any of this. I'm happy to help! :)
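To give you a feel for the overlay side, here's a rough sketch of the basic approach: attach a lightweight Swing component to the editor's content component and paint hints at line coordinates. This is not AceJump's actual code, and the class and method names are purely illustrative.

```java
import com.intellij.openapi.editor.Editor;
import com.intellij.openapi.editor.LogicalPosition;

import javax.swing.JComponent;
import java.awt.Color;
import java.awt.Graphics;
import java.awt.Point;

// Rough sketch only: highlights one line and paints a label over it, in the
// spirit of AceCanvas. Class and method names are illustrative, not AceJump's.
public class LineOverlay extends JComponent {
  private final Editor editor;
  private final int line; // zero-based logical line

  public LineOverlay(Editor editor, int line) {
    this.editor = editor;
    this.line = line;
    // Cover the whole editor area; we only paint where needed.
    setBounds(0, 0, editor.getContentComponent().getWidth(),
                    editor.getContentComponent().getHeight());
  }

  @Override
  protected void paintComponent(Graphics g) {
    // Convert the logical line into pixel coordinates inside the editor.
    Point start = editor.logicalPositionToXY(new LogicalPosition(line, 0));
    int height = editor.getLineHeight();
    g.setColor(new Color(255, 255, 0, 160));               // translucent band over the line
    g.fillRect(0, start.y, getWidth(), height);
    g.setColor(Color.BLACK);
    g.drawString("h", start.x + 4, start.y + height - 4);  // one region label, as in your mock
  }

  // Attach the overlay to the editor; remove() it again once the jump is done.
  public static LineOverlay install(Editor editor, int line) {
    LineOverlay overlay = new LineOverlay(editor, line);
    editor.getContentComponent().add(overlay);
    editor.getContentComponent().repaint();
    return overlay;
  }
}
```

A real version would compute one label per region of the line (probably from its Psi structure) and repaint on scrolling and resizing, but that's the basic shape.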
Hi Breandan! Thank you so much for replying; you've given me a ton to work with and some amazing resources. I've been reading through idear's code, and I think I have a sense of how it's put together. The way it fires off actions and deals with Psi stuff will be incredibly useful.
That AceJump plugin is exactly what I needed. I like its approach much better than mine: mine was to specify the line first and then which part of the line to go to, whereas AceJump has you specify something you're looking for and then gives you labeled choices. That means I don't have to say the line number, and if the target is unambiguous, it can just jump straight to it.
In the next few weeks, I'm going to try and integrate AceJump (or more likely, my own version or fork of it) with voice.
An update on where I am right now: this week I managed to get round-trip communication between my voice and IntelliJ through Google's webkitSpeechRecognition, which I find to be much, much more accurate than Dragon. It also doesn't need a Windows VM, which is nice.
One interesting ability I gain from webkitSpeechRecognition is that it gives me the interim results before it makes a final decision on what was said. The interim results are pretty inaccurate, but they *are* a superset of the words that appear in the final result, which I believe I can take advantage of when I'm integrating AceJump with voice.
I do lose Dragon's ability to train new words, but I hope to make up for that with some clever user interface. For example, instead of training Dragon to recognize "myVar" because it's used so often in the file I'm looking at, I might display a little view with an alphabetized list of identifiers, each with a number. If myVar is twelfth on the list, I could say "identifier 12" and it would type myVar.
If I were to wish for a feature in some voice recognition library, I'd wish for a way to add made-up words to the speech recognizer. For example, if I have the line "public static void main(String[] args) {", I wish I could tell the recognizer that "args" is a word and that it should try to recognize it. I doubt it can be done without Dragon's training or Google's massive amounts of data. Perhaps Sphinx can do something like that, though? I see hints of certain configuration abilities, something about grammars and language models.
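To make the identifier-list idea concrete, here's a rough sketch of how the list could be built with the PSI. IdentifierIndex is a made-up name, not an existing class, and it assumes a Java file:

```java
import com.intellij.psi.PsiFile;
import com.intellij.psi.PsiIdentifier;
import com.intellij.psi.util.PsiTreeUtil;

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch: collect the identifiers in the current (Java) file, deduplicated and
// alphabetized, so that "identifier 12" resolves to the twelfth entry.
// IdentifierIndex is an illustrative name, not part of any existing plugin.
public class IdentifierIndex {
  public static List<String> collect(PsiFile file) {
    TreeSet<String> names = new TreeSet<>(); // sorted + deduplicated
    for (PsiIdentifier id : PsiTreeUtil.collectElementsOfType(file, PsiIdentifier.class)) {
      names.add(id.getText());
    }
    return new ArrayList<>(names);
  }

  // Resolve a spoken "identifier N" command (1-based) to the text to type.
  public static String lookup(List<String> index, int spokenNumber) {
    return index.get(spokenNumber - 1);
  }
}
```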
I've been looking a bit into Sphinx (I hadn't heard of it before your post) and it looks quite interesting, especially its feature where it pays special attention to certain keywords. That sounds incredibly useful for a project like this. Were there any other reasons you guys went with CMU's Sphinx?
Also, what are your goals for idear? I see some hints in there of features to be added (like the GrammarService), and I'm quite curious.
The nice thing about CMU Sphinx is that it is completely open source, even though the documentation is a bit inscrutable. Defining a new word like "args" is easy enough: in the dictionary file, you can just define a new word "args AA R G S", add it to your grammar, recompile, and you're all set. Theoretically, you should be able to do this at runtime as well, but in order to recognize arbitrary identifiers, you would need to convert the word to a list of phonemes. G2P tools like Phonetisaurus and FreeTTS can approximate the pronunciation of an unknown word, which you can then add to the dictionary on the fly. An easier approach would be to just use some substring to recognize valid words, or tokenize them based on naming convention, e.g. "myVariable" -> "my" "variable".
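The splitting itself is simple enough; something along these lines (illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of tokenizing identifiers by naming convention, so each
// piece is an ordinary dictionary word: "myVariable" -> [my, variable],
// "MAX_SIZE" -> [max, size].
public class IdentifierSplitter {
  public static List<String> split(String identifier) {
    String spaced = identifier
        .replaceAll("([a-z0-9])([A-Z])", "$1 $2")  // split at camelCase boundaries
        .replaceAll("[_$]+", " ");                 // split snake_case and dollar signs
    List<String> words = new ArrayList<>();
    for (String word : spaced.toLowerCase().split("\\s+")) {
      if (!word.isEmpty()) {
        words.add(word);
      }
    }
    return words;
  }
}
```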
This is where open source libraries are slightly better than Google Speech and other APIs. Like you, we also used GSAPI for some tasks, but for writing code we needed something a little more flexible. We also ran into some issues with rate limiting on the API, and setting up and managing API keys is kind of messy. There is also the issue of offline users and data security, and we wanted to make sure it was easily accessible. Using webkitSpeechRecognition in continuous mode is a neat idea. Supposedly, there are some tools like Wit.ai that allow you to define custom recognizers, so that might be something you want to explore. But for configurability and speed, it's hard to beat offline recognition.
So the main problem we're trying to solve is easier than dictation. We have a set of actions the user can accomplish and a set of contexts where those actions are applicable. The idea is that even if you're blind, you can still use the keyboard, and I feel that's still the most reliable method for inputting text, especially for a sensitive task like programming. I can see that in a case like Tavis Rudd's, where carpal tunnel is an issue, typing isn't perfect, but there are also specialized keyboards for that. So while speech isn't necessarily the best way to write 10K lines of code, it's very good at performing IDE actions like refactoring and navigation, and that we can do very reliably. Together with TTS, we can build a pretty good VUI for visually impaired users.
The GrammarService is meant to swap grammars in and out of the recognizer. The problem is that the more complex your grammar becomes, the worse recognition accuracy gets. So you need to somehow constrain the model to use only those grammars that are applicable in a particular context, in order to recognize actions more quickly and accurately. The goal is to have no perceptible latency between speaking a command and getting a result. So when the user switches to the editor, we would activate the EditorGrammar with the symbols in that particular file. If you say, "jump to method XYZ", then we would immediately jump to method XYZ. But CMU Sphinx does not support grammar swapping well, so it's not implemented. :)
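To illustrate what an EditorGrammar could look like if it were implemented, here's a sketch that builds a tiny JSGF grammar from the methods in the current file. The names and structure are purely illustrative; this is not idear's (unimplemented) GrammarService:

```java
import com.intellij.psi.PsiClass;
import com.intellij.psi.PsiJavaFile;
import com.intellij.psi.PsiMethod;

import java.util.StringJoiner;

// Sketch of the idea only: build a tiny JSGF grammar from the methods in the
// current file, so "jump to method <name>" is recognized against names that
// actually exist in that file.
public class EditorGrammarBuilder {
  public static String build(PsiJavaFile file) {
    StringJoiner methods = new StringJoiner(" | ");
    for (PsiClass psiClass : file.getClasses()) {
      for (PsiMethod method : psiClass.getMethods()) {
        // A real version would split camelCase names into dictionary words
        // (as in the splitter sketch above) before adding them to the grammar.
        methods.add(method.getName().toLowerCase());
      }
    }
    return "#JSGF V1.0;\n"
         + "grammar editor;\n"
         + "public <command> = jump to method ( " + methods + " );\n";
  }
}
```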
At last, I got a speech-driven AceJump-like thing to work: https://www.youtube.com/watch?v=X9nKT2syB5I
It's using the looping webkitSpeechRecognition in a separate window and then feeding that via a proxy to an IntelliJ thread, which does some finding and painting similar to what AceJump does. It was quite a feeling to be able to jump around the file with my voice.
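For anyone who wants to try something similar: one possible way to wire up such a proxy (simplified, and not necessarily exactly how mine works) is a tiny local HTTP endpoint inside the plugin that the browser page POSTs each transcript to:

```java
import com.sun.net.httpserver.HttpServer;

import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Sketch of one possible bridge: the plugin listens on localhost and the page
// running webkitSpeechRecognition POSTs each (interim or final) transcript to
// it. Simplified and illustrative, not the exact wiring used in the video.
public class SpeechBridge {
  public static HttpServer start(int port, Consumer<String> onTranscript) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
    server.createContext("/speech", exchange -> {
      try (InputStream body = exchange.getRequestBody()) {
        String transcript = new String(body.readAllBytes(), StandardCharsets.UTF_8);
        onTranscript.accept(transcript);      // hand the phrase off to the editor logic
      }
      exchange.sendResponseHeaders(204, -1);  // 204 No Content, no response body
      exchange.close();
    });
    server.start();
    return server;
  }
}
```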
When you used GSAPI, what was the latency like? As seen in the video, the final results are quite slow (though the interim results, which I use to draw the icons, are fast).
The next step is exploring Sphinx. I fired up a tiny program that prints out whatever Sphinx recognizes, and to my dismay, the accuracy is *terrible*. Did you guys experience this? Did you perhaps have to train it somehow, the way Dragon has new users train it? I've seen lots of mentions of training new acoustic models / dictionaries, but it's always in the context of adding new languages, not of new users just trying to use it.
I like your idea of splitting the words according to conventions and then approximating the pronunciations; that sounds way better than my existing plan, which was to manually ask the user to pronounce every identifier on the screen. I can perhaps keep that as a last resort for when the approximations go wrong, and it'll save a ton of time.
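For reference, the test program was essentially the stock Sphinx4 live-recognition example with the default US-English models, something like this:

```java
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

// Stock Sphinx4 live recognition with the default US-English models:
// prints every hypothesis it produces from the microphone.
public class SphinxDictationDemo {
  public static void main(String[] args) throws Exception {
    Configuration configuration = new Configuration();
    configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
    configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
    configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

    LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
    recognizer.startRecognition(true); // true = discard previously buffered audio

    SpeechResult result;
    while ((result = recognizer.getResult()) != null) {
      System.out.println(result.getHypothesis());
    }
    recognizer.stopRecognition();
  }
}
```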
Nice work! AceJump for voice is a great feature, you're really onto something there.
That's the tradeoff. Sphinx doing free-form dictation with the default English language model is not great, and while there's not too much you can do to reduce the latency of GSAPI, you can configure Sphinx for better accuracy. The easiest way is by using a grammar, in JSGF format. Here's a simple grammar that we used to get started. There's not too much similarity between phrases, and when it does make a mistake, it's usually a false negative. For getting acquainted with the API, this tutorial is probably the most up to date.
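If you haven't worked with JSGF before, a toy command grammar looks something like the snippet below (just an illustration of the format, not the grammar linked above); you point Sphinx at it via Configuration's setGrammarPath/setGrammarName plus setUseGrammar(true) instead of a language model:

```
#JSGF V1.0;

grammar commands;

public <command> = <navigate> | <action>;

<navigate> = (go to | jump to) line <number>;
<action> = (open | close) (file | project) | rename | undo | redo;
<number> = (zero | one | two | three | four | five | six | seven | eight | nine)+;
```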
Bad news: it looks like training is completely unsupported in Sphinx: http://stackoverflow.com/questions/33070628/how-can-i-train-a-word-at-runtime-with-sphinx?noredirect=1#comment53967635_33070628
So it seems like Sphinx can really only be useful in places with limited vocabularies, like filling in a number, typing just keywords, or navigating menus. Perfect for idear's use case, really.
Looks like I'll have to proceed with Google's API for now.
Hello, I don't know if you'll get this since the thread is ages old, but I'm wondering about the status of this project. Thanks for trying it! I actually use WebStorm, but any news of voice coding is good news in my book.