PyCharm run times oddity
I have a strange problem (at least until I find what silly mistake I am making). A short summary is presented below, with a link to the Stack Overflow question (https://stackoverflow.com/
Essentially, I generate a large dataframe upstream in my code and then pass it to a for loop that groups it on a SINGLE column (2.8 million groups) and appends each group to a list. The large dataframe (temp_df1) is 10 million rows x 18 columns. Run this way, the groupby operation takes 25 minutes to create the groups and append them to the list, which is far too long. So I tested the code by saving the large dataframe (temp_df1) to a CSV and then reading that CSV back in. The same groupby operation, run on the dataframe loaded from the pre-saved CSV, takes only 7 minutes.
So what is it that I am doing that is causing such a drastic difference in run times?
I am using the latest version of PyCharm and get no SettingWithCopyWarning messages. The machine has 256 GB of RAM and 24 cores, so there are no memory or CPU limitations either. Thanks.
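For anyone who wants to reproduce the pattern, here is a minimal sketch of the loop described above, with a timer around it. The column name `key`, the helper `collect_groups`, and the small synthetic dataframe are assumptions for illustration only; the real temp_df1 is far larger and has 18 columns.

```python
import time

import pandas as pd


def collect_groups(df, key):
    """Group df on a single column and append each group's sub-frame to a list."""
    groups = []
    for _, group in df.groupby(key, sort=False):
        groups.append(group)
    return groups


# Small synthetic stand-in for temp_df1 (hypothetical columns and sizes).
temp_df1 = pd.DataFrame({
    "key": [i % 1000 for i in range(10_000)],
    "value": range(10_000),
})

start = time.perf_counter()
groups = collect_groups(temp_df1, "key")
elapsed = time.perf_counter() - start
print(f"{len(groups)} groups in {elapsed:.3f} s")
```

Running the same timed call on the in-memory dataframe and on the one loaded from CSV, and comparing `df.dtypes` for both, is a quick way to see whether the two inputs are really identical.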
I am answering my own question, as I stumbled upon the answer while running a bunch of tests, and thankfully, when I googled the solution, I found that someone else had had the same issue. The explanation for why categorical columns are a bad idea when doing groupby operations can be found at the above link, so I am not going to repeat it here. Thanks.
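A minimal sketch of what this looks like in practice, assuming the upstream dataframe carries columns of `category` dtype. A CSV round trip does not preserve that dtype, which would be consistent with the CSV-loaded copy grouping faster. The column names and the helper `drop_categoricals` are hypothetical, and casting categoricals back to plain dtypes is just one possible workaround, not necessarily the one described at the link.

```python
import io

import pandas as pd

# Hypothetical frame with categorical columns, standing in for temp_df1.
df = pd.DataFrame({
    "key": pd.Categorical(["a", "b", "a", "c"]),
    "label": pd.Categorical(["x", "y", "x", "z"]),
    "value": [1, 2, 3, 4],
})

# Round-trip through CSV (in memory here): the category dtype is not kept.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

print(df.dtypes)        # key and label are 'category'
print(from_csv.dtypes)  # key and label are plain (non-categorical) columns


def drop_categoricals(frame):
    """Return a copy with every categorical column cast to its categories' dtype."""
    out = frame.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].astype(out[col].cat.categories.dtype)
    return out


plain = drop_categoricals(df)
groups = [group for _, group in plain.groupby("key", sort=False)]
```

Calling `drop_categoricals` on the in-memory dataframe before the groupby loop avoids the detour through a CSV file while giving the loop the same non-categorical columns.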
Are you running the script within PyCharm? If so, what is the time difference when you run the same script outside PyCharm? And are you using 'Run' or 'Debug'?