Unable to work in console when using large Pandas Dataframe

Answered

Whenever I create a large DataFrame (~1.5x10^6 rows, with a large text column, about 5 GB in total as measured with deep=True), PyCharm's IPython console becomes totally unresponsive. Even the simplest operations, such as doing analytics on a small subset, e.g. working with df1 defined as

df1 = df[0:100].copy()

takes forever and hangs the console. Ctrl-C does not work, and the message "REPL Communication" is shown in Background Tasks.

If the same operations are run in an IPython console outside PyCharm, no slowdown is observed.
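For anyone trying to reproduce this, here is a minimal sketch of how the size figure above can be measured and how the small subset is taken. The synthetic frame, its column values, and the helper name deep_size_gb are illustrative assumptions, not the original data.

```python
import pandas as pd


def deep_size_gb(frame: pd.DataFrame) -> float:
    """Return the frame's memory footprint in GiB, counting string contents."""
    return frame.memory_usage(deep=True).sum() / 1024 ** 3


# Small synthetic stand-in for the real data (hypothetical values).
df = pd.DataFrame({
    "clean_text": ["some long text " * 50] * 1000,
    "id": range(1000),
})

# deep=False counts only the column buffers; deep=True also inspects
# the Python string objects held by object-dtype columns.
shallow_bytes = df.memory_usage(deep=False).sum()
deep_bytes = df.memory_usage(deep=True).sum()

# The 100-row subset from the post, copied so it is independent of df.
df1 = df[0:100].copy()

print(f"shallow: {shallow_bytes} bytes, deep: {deep_bytes} bytes")
print(f"deep size: {deep_size_gb(df):.6f} GiB, subset shape: {df1.shape}")
```

On a frame with a large text column, the deep figure is typically much larger than the shallow one, which is why the two should not be confused when reporting sizes.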

7
8 comments

Hello! Could you please disable the variables view (uncheck the "Show Variables" button in the Python Console tool window) and try to reproduce the problem again?

0

The problem persists if Show Variables is unchecked.

In both situations (Show Variables checked/unchecked), a simple df.shape might take more than 30 seconds the first time it is called. 

Let me go through an example (the file has ~1.5 million records, with a large text field: clean_text). Everything that follows is run in an IPython console inside PyCharm 2018.1.4 (Python 3.5.2, with Show Variables unchecked from the start):

import pandas as pd

CC_FILE = '?.._webs_enriched.jsonl'
df = pd.read_json(CC_FILE, orient='records', lines=True)

If I call "df.shape" after this, it takes anywhere from no noticeable time to around 5 seconds (I don't know what this variation depends on). The result is (1493318, 11).

Then, after some processing (long but simple):

df.clean_text.fillna('', inplace=True)
df['no_tokens'] = df.clean_text.str.findall(r'\w+').str.len()

df.shape now takes more than 30 seconds to show the obvious result (1493318, 12). df.iloc[0], for example, takes about the same time.

This happens with the IPython console inside PyCharm (with a standalone IPython console, I don't notice any delay).
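To put numbers on the delay, the same steps can be timed in both consoles. The following is a rough sketch that uses a small synthetic frame instead of the real file; the helper name time_call and the sample data are assumptions made for illustration. It measures only the time spent inside Python, so any delay added by the console itself would show up as a gap between these figures and the wall-clock wait.

```python
import time

import pandas as pd


def time_call(fn, repeats=3):
    """Call fn several times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# Synthetic stand-in for the real data (hypothetical values).
df = pd.DataFrame({"clean_text": ["one two three", None] * 5000})

# Same processing as in the post, written without inplace=True.
df["clean_text"] = df["clean_text"].fillna("")
df["no_tokens"] = df["clean_text"].str.findall(r"\w+").str.len()

shape_times = time_call(lambda: df.shape)
row_times = time_call(lambda: df.iloc[0])

print("df.shape  :", ["%.6f s" % t for t in shape_times])
print("df.iloc[0]:", ["%.6f s" % t for t in row_times])
```

If these timings are tiny in both consoles while the prompt still takes tens of seconds to return inside PyCharm, that would point at the console's communication layer rather than at pandas.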

 

0

Thank you for the update.

I've created https://youtrack.jetbrains.com/issue/PY-30650 in the PyCharm issue tracker; please follow it for updates. See https://intellij-support.jetbrains.com/hc/en-us/articles/207241135-How-to-follow-YouTrack-issues-and-receive-notifications if you are not familiar with YouTrack. Please attach your zipped log folder (https://intellij-support.jetbrains.com/hc/en-us/articles/207241085-Locating-IDE-log-files) to the issue.

0

I am experiencing the same problem, sometimes even with very basic requests like `df.columns`.

Currently using PyCharm Professional 2018.1.2 on macOS 10.11.6

thanks

0

 Hi Walter,

Please vote for the issue above and feel free to leave a comment.

1

I am working on image recognition, and I have to admit that REPL Communication is causing a lot of delay. Even simple commands take a while to execute. The delay gets worse the more variables and image files I load. Could you please make REPL Communication work faster? Thank you.

0

REPL Communication slows down all commands in PyCharm. Please fix this issue ASAP.

2

Please update to the latest version (2019.2) and try changing the variable loading policy according to https://youtrack.jetbrains.com/issue/PY-30222#focus=streamItem-27-2904652.0-0

0
