Unicode filenames within a unit test on the mac ?

I have this really puzzling project that runs both in maven and intellij on linux, but once I move to the mac the unit test simply fails when I run it with the intellij test runner.

The project can be seen at https://github.com/krosenvold/utf8repository.git, and contains simply one unit test and two different files with unicode characters in the filename, the unicode files are checked into git.

When I run this on the CLI I get

00000000 63 72 65 61 74 65 64 4F 6E 4D 61 63 2F 61 62 63 createdOnMac/abc
00000010 C3 B8                                           ..
00000000 63 72 65 61 74 65 64 4F 6E 4C 69 6E 75 78 2F 74 createdOnLinux/t
00000010 65 73 74 C3 B8                                  est..

which is totally correct according to my understanding. Inside the JUnit test runner in Idea I get:

00000000 63 72 65 61 74 65 64 4F 6E 4D 61 63 2F 61 62 63 createdOnMac/abc
00000010 EF BF BD EF BF BD                               ......
00000000 63 72 65 61 74 65 64 4F 6E 4C 69 6E 75 78 2F 74 createdOnLinux/t
00000010 65 73 74 EF BF BD EF BF BD                      est......

The project contains a test that fails inside idea but not outside. This is simply the biggest WTF I have seen in quite some time.
I understand there's something about unicode normalization, but this does not really seem to be the case here....

It seems to happen on 11,12 and 13.

What is going on here ?

Kristian

0
8 comments

IDEA starts unit tests with "-Dfile.encoding=UTF-8" VM option added (you can see it in the run console) - most probably that is the reason.

0

I wish it were that simple :)  UTF-8 0xC3B8 is actually the expected correct result here, the letter ø. I have tried setting different file.encoding values and it does not appear to affect the outcome of the test case. I'm not sure it makes sense that file.encoding affects this behaviour either; the encoding of the file system should be an absolute value which doesn't make sense to configure in any way in the JVM? And why does Idea change any of this ??

Kristian

0

Added some diagnostic to the test:

    @BeforeClass
    public static void setUp() throws Exception {
        System.out.println("** JVM=" + System.getProperty("java.home"));
        System.out.println("** encoding=" + System.getProperty("file.encoding"));
    }


From the console:

** JVM=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
** encoding=MacRoman

00000000 63 72 65 61 74 65 64 4F 6E 4D 61 63 2F 61 62 63 createdOnMac/abc
00000010 BF                                              .
00000000 63 72 65 61 74 65 64 4F 6E 4C 69 6E 75 78 2F 74 createdOnLinux/t
00000010 65 73 74 BF                                     est.


From the IDE:

** JVM=/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/jre
** encoding=UTF-8

00000000 63 72 65 61 74 65 64 4F 6E 4D 61 63 2F 61 62 63 createdOnMac/abc
00000010 C3 B8                                           ..
00000000 63 72 65 61 74 65 64 4F 6E 4C 69 6E 75 78 2F 74 createdOnLinux/t
00000010 65 73 74 C3 B8                                  est..


Setting JAVA_HOME to 1.7 for "mvn test" makes it produce the same result as in IDE.

0

My machine only has java7 installed (it's 2 weeks old and came with mavericks + java7)  - I never installed java6 (unless you guys bring it as part of the setup). I get the *same* java installation on disk in the sout's you added, both inside Idea and in maven.

(do a git pull on the repo; I just added your soutv and removed one of the unnecessary tests that could fail for other reasons...)

Can you reproduce the failing test inside intellij ? It always works in maven, but fails on the mac at line 38.

(And; the reason I'm linking to the SO article in the source code is that it claims that unicode normalization issues *should* happen on the mac,
whereas my test asserts that no such normalization problem exists....? I suspect this might be some kind of OSX/java7 change that
causes stuff to break inside idea...?)

Kristian

0

AFAIK the launcher binary IDEA ships with still requires Java 6 (if it's missing, OS X usually suggests installing it).

Nope, I can't reproduce the failure, on my Mac tests run just fine and produce expected result (I posted it in a previous comment).

Two comments on tests:
1. You're testing that paths are normalized to the form 'D', but from what I can see in OpenJDK 7 source they use form C (io_util_md.c, line 46).
2. You should be careful when judging a Java string by its byte dump: the getBytes() method uses the default encoding (the very -Dfile.encoding) to convert the internal representation to a byte stream.
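
To illustrate point 2, here is a minimal sketch (the class name `EncodingDump` and the `hex` helper are made up for this example, not part of the test project). It dumps the same string with the default charset and with explicit charsets; only the first line depends on -Dfile.encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class EncodingDump {
    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "abc" followed by U+00F8, written as an escape so the source file encoding does not matter.
        String name = "abc\u00F8";
        // Depends on the JVM default charset (file.encoding).
        System.out.println("default (" + Charset.defaultCharset() + "): " + hex(name.getBytes()));
        // Independent of the JVM default charset.
        System.out.println("UTF-8:      " + hex(name.getBytes(StandardCharsets.UTF_8)));
        System.out.println("ISO-8859-1: " + hex(name.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Passing an explicit charset to getBytes() in the test would make the dump comparable between the IDE and "mvn test" regardless of how each one sets file.encoding.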

Still have no explanation for the failure. As a wild guess - will it make any difference if you start IDEA from the console instead of Dock/Spotlight?

0

Starting from the cli "solves" the problem.... WTF ???

As for the normalization stuff, that was just due to the SO article. I tried all normalization forms and they were "true" for all forms; so it appears
not to be an issue... (which might mean this mac is different from the ones they used ...?)
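
One possible reason every form reported "true" (an assumption, not something confirmed in this thread): ø (U+00F8) has no canonical decomposition, so a name containing it is already normalized in NFC, NFD, NFKC and NFKD alike. A letter such as é behaves differently. A minimal sketch using java.text.Normalizer; the class and method names are made up for illustration:

```java
import java.text.Normalizer;

// Hypothetical helper class, for illustration only.
class NormalizationCheck {
    // Returns true only if the string is already normalized in every Unicode normalization form.
    static boolean normalizedInAllForms(String s) {
        for (Normalizer.Form form : Normalizer.Form.values()) {
            if (!Normalizer.isNormalized(s, form)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String stroke = "test\u00F8"; // ends in o-with-stroke, which does not decompose
        String acute = "caf\u00E9";   // ends in e-acute, which decomposes to e + U+0301 in NFD

        System.out.println("stroke normalized in all forms: " + normalizedInAllForms(stroke));
        System.out.println("acute normalized in all forms:  " + normalizedInAllForms(acute));
        // Length grows by one when the acute accent is split off into a combining mark.
        System.out.println("acute NFD length: "
                + Normalizer.normalize(acute, Normalizer.Form.NFD).length());
    }
}
```

If that is what is happening, a file name with a decomposable character would be a better probe for the normalization behaviour described in the SO article.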

Kristian

0

You probably know that apps started by launchd get a different environment than ones started from a terminal. I suspect that there is a var in the shell environment - maybe something from the LANG* or LC_* family - which affects Core Foundation or Java behavior and which is not set via launchd.

To check this out you may knock off a simple app which dumps sorted System.getenv(), start it from a terminal and from IDEA, and compare output.
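
Such an app could look roughly like this (a sketch; the class name `EnvDump` is made up). Besides the environment it also prints file.encoding and sun.jnu.encoding; the latter is an internal, undocumented property on Sun/Oracle/OpenJDK JVMs and is included here only as an assumption about where a difference might show up.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical diagnostic class, for illustration only.
class EnvDump {
    // Renders the given environment as NAME=value lines, sorted by variable name.
    static String dump(Map<String, String> env) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<String, String>(env).entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump(System.getenv()));
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        // Internal property; may be null on other JVM implementations.
        System.out.println("sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding"));
    }
}
```

Run it once from a terminal and once as a run configuration in IDEA started from the Dock, save both outputs, and diff them.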

0

OK, this appears to be

http://stackoverflow.com/questions/12987252/file-list-retrieves-file-names-with-non-ascii-characters-incorrectly-on-mac-os

It would appear to be the LC_CTYPE=UTF-8 environment variable that saves my bacon in shell.

It turns out this problem was fixed in the .40 release of java 7, and upgrading solves the problem also for launcher-based starting.
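
For the record, the EF BF BD pairs in the failing dump are the UTF-8 encoding of U+FFFD, the replacement character. One plausible reading (an assumption on my part, not verified against the JDK source) is that without LC_CTYPE the JVM decoded the file name bytes C3 B8 with an ASCII-only charset, turning each byte into U+FFFD. A small sketch that reproduces the byte pattern; the class and method names are made up:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class ReplacementDemo {
    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes raw bytes with the given charset, then re-encodes the resulting string as UTF-8.
    static byte[] roundTrip(byte[] raw, Charset decodeWith) {
        return new String(raw, decodeWith).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of the letter o-with-stroke, as they appear in the file name.
        byte[] onDisk = {(byte) 0xC3, (byte) 0xB8};
        System.out.println("decoded as UTF-8:    " + hex(roundTrip(onDisk, StandardCharsets.UTF_8)));
        System.out.println("decoded as US-ASCII: " + hex(roundTrip(onDisk, StandardCharsets.US_ASCII)));
    }
}
```

The second line comes out as EF BF BD EF BF BD, the same six bytes as in the dump from the Idea test runner.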

I tell our users in issue trackers all the time those small numbers at the end of versions tend to /mean/ things; I only deserve
the pain when I ignore them myself!

Thanks a lot !

Kristian
