UnicodeEncodeError in PyCharm console, but not in Terminal console

*NB: This post concerns Python 2.7.3*

I've an HTML file that I'm fairly certain is encoded as it states itself to be (i.e. in UTF-8). After opening a Python console in PyCharm 2.5, I open the file and read a line that contains a character outside of the ASCII range, then attempt to print it:

    >>> unifile = open("/tmp/bsHtml5.html","r")
    >>> line = unifile.next()
    >>> line
    '       66laps/django-4store \xc2\xb7 GitHub\n'
    >>> print(line)
    6laps/django-4store · GitHub
    >>> print(line.decode("utf-8"))
    Traceback (most recent call last):
      File "<console>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 28: ordinal not in range(128)

As the example above shows, the line is displayed without error the first two times (as an escaped byte string, then printed), but not the third time, when I call decode('utf-8') on the string before printing. When I repeat the exact same process in OS X's Terminal, no error is produced; the output of print(line) and print(line.decode('utf-8')) is the same.
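
The failing step can be reproduced without any console at all, which may help narrow things down. The sketch below is only an illustration built from the bytes shown above: `raw`, `text`, `failure` and `utf8_bytes` are names made up for this example. It shows that decoding gives a unicode object, and that encoding that object as ASCII fails at the same position the traceback reports, while encoding it as UTF-8 gives back the original bytes.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The bytes the console displayed for `line` (seven leading spaces).
raw = b" " * 7 + b"66laps/django-4store \xc2\xb7 GitHub\n"

# Decoding turns the two UTF-8 bytes \xc2\xb7 into one character, u'\xb7'.
text = raw.decode("utf-8")

# Encoding the decoded text as ASCII fails, as in the traceback.
try:
    text.encode("ascii")
    failure = None
except UnicodeEncodeError as err:
    failure = err

# Encoding the same text as UTF-8 round-trips to the original bytes.
utf8_bytes = text.encode("utf-8")

print("ASCII encode failed at index:", failure.start if failure else None)
print("UTF-8 round trip matches:", utf8_bytes == raw)
```

So print(line) passes the bytes through untouched, whereas print(line.decode('utf-8')) has to encode a unicode object first, and the codec used for that step is what differs between the two consoles.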

If you've gotten this far in reading this post, you might be saying, "well then, just don't do that." And I agree, this is a contrived example. But it's a simplified version of a problem that also occurs when I use an XML/HTML parser's prettify() method, which "pretty prints" a document to stdout. In the PyCharm console, this produces the same UnicodeEncodeError; in the Terminal, it does not.

Yeah, I realize it's not a show stopper that will prevent me from getting things done, but it is an annoyance and I think it can be solved via configuration of the PyCharm environment. To that end, here are some potentially relevant values of various environment variables from both applications (note that both the Terminal console and the PyCharm console are using the same virtualenv and the same interpreter--Python 2.7.3).

PyCharm Console:

    >>> os.environ
    {'LANG': 'en_US.UTF-8', ... 'TERM': 'emacs', ...  '__CF_USER_TEXT_ENCODING': '0x1F5:0:0', ... 'COMMAND_MODE': 'unix2003'}
    >>> sys.getdefaultencoding()
    'ascii'

Console in Terminal:

    >>> os.environ
    {... 'LANG': 'en_US.UTF-8', 'TERM': 'screen', ... 'TERMCAP': 'SC|screen|VT 100/ANSI X3.64 virtual ... ' ... '__CF_USER_TEXT_ENCODING': '0x1F5:0:0',... 'COMMAND_MODE': 'unix2003'}
    >>> sys.getdefaultencoding()
    'ascii'

I'm pretty much grasping at straws, but could the TERM setting have something to do with it? I've tried changing it in the PyCharm console using Preferences->Console->Python Console->Environment Variables, but it does not affect the value of the variable when I start a new console (that is, TERM is still 'emacs').

Any constructive suggestions would be warmly welcomed.

Thanks,

Casey

3 comments

Neither the TERM nor TERMCAP variables seem to have anything to do with the issue.


The problem was in sys.stdout.encoding, which was unset in PyCharm, while the Terminal console detected it as UTF-8.
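
To compare the two consoles, something like the following could be run in each. It is a minimal sketch, not anything PyCharm ships: `describe_output_encoding` is a helper name invented here, and it only collects values that Python already exposes. In Python 2, when a stream's encoding is unset, print falls back to the default encoding ('ascii') for unicode objects, which matches the traceback above.

```python
from __future__ import print_function
import os
import sys


def describe_output_encoding(stream=None):
    """Collect the settings that decide how unicode text gets printed."""
    if stream is None:
        stream = sys.stdout
    return {
        # None (unset) here means print falls back to the default encoding.
        "stream_encoding": getattr(stream, "encoding", None),
        "default_encoding": sys.getdefaultencoding(),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        "LANG": os.environ.get("LANG"),
    }


report = describe_output_encoding()
for key in sorted(report):
    print("%s = %r" % (key, report[key]))
```

If stream_encoding differs between the two consoles while LANG and the default encoding are the same, that points at the console's stdout object rather than at TERM or the locale.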

Setting PYTHONIOENCODING solved the problem, so try the next PyCharm build.
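
Until that build is available, the effect of the variable can be checked by hand. PYTHONIOENCODING is read when the interpreter starts, so it has to be in the environment before the console is launched; changing os.environ in an already running console will not help. The sketch below assumes nothing about how PyCharm applies the fix; `child_stdout_encoding` is a made-up helper that starts a child interpreter with the variable set and reports what that child sees as sys.stdout.encoding, even though its output goes to a pipe rather than a terminal.

```python
from __future__ import print_function
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Start a child interpreter with PYTHONIOENCODING set and return
    the value it reports for sys.stdout.encoding."""
    # Copy the current environment and add the variable for the child only.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    return out.decode("ascii").strip()


print(child_stdout_encoding("utf-8"))
```

The same variable could be added under the console's environment variable settings mentioned in the question, provided the console is restarted afterwards.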
