Unicode and codepage

Is IDEA's Java source editor Unicode-compliant? I created a String with
Cyrillic characters (I was able to type it without any problems on my
Win2000 with Russian fonts and keyboard layout installed), but IDEA did not
save it as Cyrillic; it just filled the string with question marks. My
default locale is English (American).

  • Do I need to set up a default codepage for source files? How do I do it?

  • Can IDEA save *.java files in Unicode?


I know that I can set codepage for javac, but I would prefer *.java files to
be saved in Unicode.

As another approach, IDEA might save *.java in default encoding, but save
strings as Unicode escapes. IDEA might assign language to every string
literal in the source code to convert them into proper Unicode escapes.

Michael J.



9 comments

Michael Jouravlev wrote:

Is IDEA java source editor Unicode-compliant? I created a String with


Since it's written in Java, I would think so.

cyrillic characters (I was able to type it without any problems on my
Win2000 with russian fonts and keyboard layout installed), but IDEA did not
save it as cyrillic, it just stuffed the string with question marks. My
default locale is English (American).


That would be the problem. If you look in IDE Settings -> General,
you'll find a "File encoding" setting which defaults to the System
Default encoding. So those Cyrillic characters are being changed into ?
because they can't be represented in the default encoding of the English
(American) locale. You could try changing the File encoding setting to
UTF-8, which encodes the Cyrillic characters as multi-byte sequences
instead of replacing them.
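
The question-mark replacement described above can be reproduced outside IDEA. The following is a minimal sketch, not IDEA's actual save logic: it assumes the system default encoding behaves like ISO-8859-1 (a stand-in chosen here because every JVM supports it), and the class and method names are made up for illustration. The Cyrillic text is written with Unicode escapes so the example itself survives any file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingLossDemo {

    // Encodes the text to bytes in the given charset, then decodes it back.
    // String.getBytes replaces characters the charset cannot represent
    // with the charset's replacement byte ('?' for ISO-8859-1).
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The Russian word "Privet" (hello), six Cyrillic letters.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // A Latin-only 8-bit encoding loses every Cyrillic letter.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1)); // ??????

        // UTF-8 preserves the text; each Cyrillic letter takes two bytes.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length); // 12
    }
}
```

Once the question marks have been written to disk the original letters cannot be recovered, which is why the encoding has to be changed before the file is saved.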

As another approach, IDEA might save *.java in default encoding, but save
strings as Unicode escapes. IDEA might assign language to every string
literal in the source code to convert them into proper Unicode escapes.


Wooo. That would be far too difficult to track. I think changing IDEA's
File encoding setting is probably better.

Ciao,
Gordon

--
Gordon Tyler (Software Developer)
Quest Software <http://java.quest.com/>
260 King Street East, Toronto, Ontario M5A 4L5, Canada
Voice: 416-643-4846 | Fax: 416-594-1919

0


"Gordon Tyler" <gordon.tyler@sitraka.com> wrote in message
news:b6fhb8$lgn$1@is.intellij.net...

Michael Jouravlev wrote:

Is IDEA java source editor Unicode-compliant? I created a String with

Since it's written in Java, I would think so.

cyrillic characters (I was able to type it without any problems on my
Win2000 with russian fonts and keyboard layout installed), but IDEA did not
save it as cyrillic, it just stuffed the string with question marks. My
default locale is English (American).

That would be the problem. If you look in IDE Settings -> General,
you'll find a "File encoding" setting which defaults to the System
Default encoding. So those Cyrillic characters are being changed into ?
because they don't exist in the English (American) locale. You could try
changing the File encoding setting to UTF-8 which should use UTF-8
escape sequences to encode the Cyrillic characters.


Thanks, I found this setting and it works with string literals, but not with
variables (not that I really need that...) I was looking for "unicode" in
online help and found nothing, so I asked here :)

One thing that bothers me, and it does not relate to IDEA, is that Win2K
Notepad can save text files in Unicode big and little endian (which is
UTF-16) and in UTF-8. In all these modes it puts the encoding type at the
beginning of the file, in accordance with the Unicode standard.

javac can compile a UTF-16 file that contains encoding type characters
without problems, but it chokes on UTF-8 if the file has encoding type
characters. Why is that? IDEA does not put encoding type characters in the
source file, so it is compiled correctly. They knew ;) But I still wonder,
is it a javac bug?

Michael J.


0

javac can compile a UTF-16 file that contains encoding type characters
without problems, but it chokes on UTF-8 if the file has encoding type
characters. Why is that? IDEA does not put encoding type characters in the
source file, so it is compiled correctly. They knew ;) But I still wonder,
is it a javac bug?


I forgot to add that I use the -encoding switch of javac. I cannot compile
a UTF-16 file without this switch. I cannot compile a UTF-8 file even
with the switch if the file starts with encoding type chars.
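
The "encoding type characters" that Notepad writes at the start of a UTF-8 file are the three bytes EF BB BF (the byte order mark). One possible workaround, sketched below, is to strip them before handing the file to javac. This is an illustration only: the class and method names are hypothetical, and nothing here is part of javac or IDEA.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomStripper {

    // Returns the input without a leading UTF-8 byte order mark (EF BB BF).
    // Input that does not start with the mark is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        // Simulated file content: the mark followed by the ASCII text "class".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'c', 'l', 'a', 's', 's'};
        byte[] clean = stripUtf8Bom(withBom);
        System.out.println(new String(clean, StandardCharsets.UTF_8)); // class
    }
}
```

In practice the bytes would come from reading the source file and the cleaned result would be written back before compiling.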


0

Maybe the best approach is to convert your java sources to plain ASCII
using the native2ascii tool from the JDK.
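
As a rough illustration of the kind of conversion native2ascii performs, the sketch below rewrites every non-ASCII character as a Unicode escape, which javac reads the same way under any ASCII-compatible encoding. It is not the tool's actual implementation, and the class and method names are invented for the example.

```java
public class UnicodeEscaper {

    // Replaces each UTF-16 code unit above 0x7F with a backslash-u escape
    // (four hex digits) and leaves ASCII characters untouched.
    static String escapeNonAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A source line whose string literal holds the Cyrillic word "da" (yes).
        String line = "String s = \"\u0434\u0430\";";
        // Prints the same line with the two letters written as escapes 0434 and 0430.
        System.out.println(escapeNonAscii(line));
    }
}
```

The trade-off is readability: the escaped source compiles everywhere, but the Russian text is no longer legible in the editor.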

Tom

0

UTF-8 or UTF-16 is not your only choice if you only need Russian in string
literals or comments.
You may choose windows-1251.
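
A short sketch of what that choice covers, assuming the JVM in use ships the windows-1251 charset (common, but not guaranteed by the Java specification). The class and method names are made up for the example. It shows that the codepage handles Cyrillic but not other scripts such as Greek, while UTF-8 handles both.

```java
import java.nio.charset.Charset;

public class CodepageCoverage {

    // Reports whether every character of the text can be encoded
    // in the named charset.
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442"; // Cyrillic letters
        String greek = "\u03b1\u03b2\u03b3";                      // Greek alpha, beta, gamma

        System.out.println(canEncode(russian, "windows-1251")); // true
        System.out.println(canEncode(greek, "windows-1251"));   // false
        System.out.println(canEncode(greek, "UTF-8"));          // true
    }
}
```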

--

Best regards,
Maxim Shafirov
JetBrains, Inc / IntelliJ Software
http://www.intellij.com
"Develop with pleasure!"


0


"Maxim Shafirov" <max@intellij.net> wrote in message
news:b6h5mg$vja$1@is.intellij.net...

UTF-8 or UTF-16 is not your only choice if you only need russian in string
literals or comments.
You may choose windows-1251


Yeah, I know... I would prefer my source files to be multilingual. So, IDEA
does not support UTF-16 and it does not support
"out-of-current-8-bit-code-table" characters outside string literals, right?
Or does it support UTF-16, but it has some other name like ISO-something?

Thanks.

Michael Jouravlev.


0

Michael Jouravlev wrote:

Thanks, I found this setting and it works with string literals, but not with
variables (not that I really need that...) I was looking for "unicode" in
online help and found nothing, so I asked here :)


According to the Java Language Spec section 3.1 "Unicode":

http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#95413

"Except for comments (§3.7), identifiers, and the contents of character
and string literals (§3.10.4, §3.10.5), all input elements (§3.5) in a
program are formed only from ASCII characters (or Unicode escapes (§3.3)
which result in ASCII characters). ASCII (ANSI X3.4) is the American
Standard Code for Information Interchange. The first 128 characters of
the Unicode character encoding are the ASCII characters."
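
The quoted passage lists identifiers among the places where non-ASCII characters are allowed, so a Cyrillic variable name is legal Java as long as the file is saved and compiled with a matching encoding. A small sketch using the standard Character methods (the class and method names are made up for the example, and the check deliberately ignores keywords):

```java
public class IdentifierCheck {

    // Checks the character rules for a Java identifier. It does not reject
    // keywords such as "class", so it is only a partial check.
    static boolean hasIdentifierChars(String name) {
        if (name.isEmpty()) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // Cyrillic word "imya" (name), written with escapes.
        System.out.println(hasIdentifierChars("\u0438\u043c\u044f")); // true
        System.out.println(hasIdentifierChars("1abc"));               // false
    }
}
```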

javac can compile a UTF-16 file that contains encoding type characters
without problems, but it chokes on UTF-8 if the file has encoding type
characters. Why is that? IDEA does not put encoding type characters in the
source file, so it is compiled correctly. They knew ;) But I still wonder,
is it a javac bug?


http://developer.java.sun.com/developer/bugParade/bugs/4508058.html
http://developer.java.sun.com/developer/bugParade/bugs/4211090.html

(BOM = Byte Order Mark)

Ciao,
Gordon

--
Gordon Tyler (Software Developer)
Quest Software <http://java.quest.com/>
260 King Street East, Toronto, Ontario M5A 4L5, Canada
Voice: 416-643-4846 | Fax: 416-594-1919

0


"Gordon Tyler" <gordon.tyler@sitraka.com> wrote in message
news:b6hoh9$gd9$1@is.intellij.net...
> Michael Jouravlev wrote:
>> Thanks, I found this setting and it works with string literals, but not with
>> variables (not that I really need that...) I was looking for "unicode" in
>> online help and found nothing, so I asked here :)

According to the Java Language Spec section 3.1 "Unicode":
"Except for comments (§3.7), identifiers, and the contents of character
and string literals (§3.10.4, §3.10.5)..."


Yes, that's what I meant -- I cannot save a Cyrillic identifier even if I set
the encoding to UTF-8. The string is saved normally. But I do not care much
about Cyrillic variables :)

http://developer.java.sun.com/developer/bugParade/bugs/4508058.html
http://developer.java.sun.com/developer/bugParade/bugs/4211090.html


Woa, thanks for the links, Gordon! So it IS a bug. An old spot on the Sun :)

Michael J.


0

UTF-16 is not allowed as the default encoding, though files in UTF-16 should
be read seamlessly if "File detected as........" is checked.
On the other hand, there is no limitation on where national characters can
be used. We do not treat string literals or anything else specially.
If you select UTF-8 as your default encoding, everything should be written in
UTF-8 (after a restart. Did you do that?).

--

Best regards,
Maxim Shafirov
JetBrains, Inc / IntelliJ Software
http://www.intellij.com
"Develop with pleasure!"


0
