PyCharm run time oddity

Completed

I have a strange problem (at least until I find out what silly mistake I am making). A short summary follows; the full Stack Overflow question is at https://stackoverflow.com/questions/51129600/python-pycharm-runtimes.
Essentially, I generate a large dataframe (temp_df1, 10 million rows x 18 columns) upstream in my code and pass it to a for loop that groups it on a single column (about 2.8 million groups) and appends each group to a list. This groupby step takes 25 minutes, which is far too long. To test it, I saved temp_df1 to a CSV and read that file back in. Run on the dataframe loaded from the CSV, the same groupby step takes only 7 minutes.
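
For reference, the two code paths being compared look roughly like the sketch below. This is not the real code: the column names (group_col, value), the helper collect_groups, and the four-row frame are placeholders, and an in-memory buffer stands in for the CSV file.

```python
import io

import pandas as pd

# Placeholder for temp_df1; the real frame is 10 million rows x 18 columns.
temp_df1 = pd.DataFrame(
    {"group_col": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]}
)


def collect_groups(df, column):
    """Group df on one column and append each group's sub-frame to a list."""
    groups = []
    for _, group in df.groupby(column):
        groups.append(group)
    return groups


# Path 1: group the dataframe exactly as it comes out of the upstream code.
groups_in_memory = collect_groups(temp_df1, "group_col")

# Path 2: write the dataframe to CSV, read it back, then group it.
buffer = io.StringIO()
temp_df1.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
groups_from_csv = collect_groups(reloaded, "group_col")
```

Both paths produce the same groups. The point of the comparison is that a CSV round trip does not necessarily preserve column dtypes, so comparing temp_df1.dtypes with reloaded.dtypes is a cheap first check when the two paths differ in speed.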

So what am I doing that causes such a drastic difference in run times?

I am using the latest version of PyCharm and get no SettingWithCopyWarning. The machine has 256 GB of RAM and 24 cores, so memory and CPU should not be the limit. Thanks.

2 comments

I am answering my own question: I stumbled on the answer while running a batch of tests, and when I googled the solution I found that someone else had hit the same issue. The explanation of why categorical columns are a bad idea in groupby operations can be found at the above link, so I am not going to repeat it here. Thanks.
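
A minimal sketch of one commonly cited way a categorical key changes groupby behaviour, which may or may not be the exact mechanism described at the link. The frame and the column names (key, value) are made up for illustration. With observed=False, pandas builds a result row for every declared category, including unused ones. A CSV round trip drops the categorical dtype, which would fit the faster timing on the reloaded data.

```python
import io

import pandas as pd

# Hypothetical data: "key" declares four categories but only three occur.
df = pd.DataFrame(
    {
        "key": pd.Categorical(
            ["a", "b", "a", "c"], categories=["a", "b", "c", "d"]
        ),
        "value": [1, 2, 3, 4],
    }
)

# observed=False: one result row per declared category, including unused "d".
all_categories = df.groupby("key", observed=False)["value"].sum()

# observed=True: only the categories that actually appear in the data.
observed_only = df.groupby("key", observed=True)["value"].sum()

# Casting the key to plain strings removes the categorical dtype entirely.
as_str = df.assign(key=df["key"].astype(str)).groupby("key")["value"].sum()

# A CSV round trip has the same effect: the key is no longer categorical.
reloaded = pd.read_csv(io.StringIO(df.to_csv(index=False)))
key_is_categorical_before = isinstance(df["key"].dtype, pd.CategoricalDtype)
key_is_categorical_after = isinstance(reloaded["key"].dtype, pd.CategoricalDtype)
```

If the categorical dtype is worth keeping for memory reasons, passing observed=True to groupby or casting only the grouping column with astype(str) are two options to try. Whether either one recovers the full speedup on a frame of this size is something to measure, not assume.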


Are you running the script within PyCharm? If so, what is the time difference when you run the same script outside PyCharm? Are you using 'Run' or 'Debug'?
