Auto-imports and project structure
A problem arises from the interaction between PyCharm's auto-import logic and the project structure described below. This seems like a bug to me, as it is at odds with how a lot of existing Python projects are laid out.
A Python project is usually structured in the following manner (using scikit-learn as an example, but this also applies to scipy, numpy, gensim etc.):
scikit-learn <– top-level directory created during git clone (project root)
./benchmarks <– benchmarks (obviously)
./continuous_integration
./doc
./examples
./sklearn <– all the sources for the project
./sklearn/cluster <– sklearn cluster package containing stuff related to clustering
To get PyCharm to recognise what is what, one tends to mark some directories as Source directories in Settings - Project Structure. Given that the `scikit-learn/sklearn` directory is the root of the entire source tree, one would be inclined to mark it as `Source`, especially since the other directories don't really contain any sources, just auxiliary things. Doing this, however, breaks the auto-imports, because PyCharm doesn't take the name of the Source directory itself into account when generating imports. So the `sklearn.cluster` package is imported as just `import cluster`, which fails when you run the code. The import should be `from sklearn import cluster`, but PyCharm drops the top of the Source tree.
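To make the effect concrete, here is a minimal sketch of how a dotted import name can be derived from a file path and a source root. It is a simplified model for illustration only, not PyCharm's actual algorithm, and `dotted_name` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def dotted_name(module_path, source_root):
    """Return the dotted import name of module_path as seen from source_root.

    Hypothetical helper: a simplified model, not PyCharm's real logic.
    """
    rel = PurePosixPath(module_path).relative_to(PurePosixPath(source_root))
    parts = list(rel.with_suffix("").parts)
    if parts and parts[-1] == "__init__":
        parts.pop()  # a package is named after its directory
    return ".".join(parts)


package_init = "scikit-learn/sklearn/cluster/__init__.py"

# Source root = the package directory itself: the 'sklearn' prefix is lost.
print(dotted_name(package_init, "scikit-learn/sklearn"))  # -> cluster

# Source root = the project root: the full package name survives.
print(dotted_name(package_init, "scikit-learn"))  # -> sklearn.cluster
```

Under this model, any name computed relative to `scikit-learn/sklearn` can never contain `sklearn`, which matches the imports PyCharm generates.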
There are two solutions to this:
1) Create a separate 'src' directory under the project root and put the top-level package in there - this doesn't fit well with a lot of existing projects (sklearn, scipy, numpy, gensim, ...)
2) Mark the entire project directory as a Source directory and then Exclude all the directories that don't actually contain any sources - this is just stupid
Is there a better way to get the auto-imports to work properly?
1) Clone the scikit-learn repo with Git.
2) Create a virtualenv (venv) for this project and activate it.
3) pip install numpy scipy
4) python setup.py develop
5) Open in PyCharm
6) Set the 'sklearn' folder as the only source folder in Project Structure. (I also exclude the venv but that's a personal preference.)
7) Open a Python prompt from within PyCharm.
8) Both of these statements work:
>>> import sklearn
>>> import sklearn.cluster
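A successful import does not say which copy of the package was found. As a rough check, the sketch below reports the file a top-level module would be loaded from, using the standard library's `importlib.util.find_spec`. `module_origin` is a hypothetical helper name, and the sketch assumes it is run with the same interpreter PyCharm uses for the project.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None.

    Hypothetical helper; pass top-level names only, e.g. "sklearn".
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Prints a path inside the clone if the develop install is active,
# or None if sklearn cannot be found at all.
print(module_origin("sklearn"))
```

If the path points into the cloned repository, the prompt is resolving `sklearn` through the develop install or the project paths rather than through a separately installed copy.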
Scenario 1:
Let's say you want to refactor sklearn.utils.extmath: you realise that the pinvh function shouldn't be there but should instead live under sklearn.utils.fixes. You of course want all references to that function to be updated as well, for instance the one in sklearn.mixture.dpgmm (line 20).
So you look up sklearn.utils.extmath.pinvh, hit F6 to move the function, and move it to sklearn.utils.fixes.
The import in sklearn.mixture.dpgmm is changed to
from utils.fixes import pinvh
which will fail when you run the code, as that function actually lives under sklearn.utils.fixes.
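The failure can be reproduced outside PyCharm. The sketch below builds a throwaway package tree and puts only its parent directory on `sys.path`, the way the project root is when the code runs. The names `demo_sk` and `demo_utils` are hypothetical stand-ins for `sklearn` and `utils`, chosen so they are unlikely to clash with anything installed.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def try_imports(names):
    """Create demo_sk/demo_utils/fixes.py in a temp dir, put the temp dir
    on sys.path, and report which dotted names can be imported."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        sub = root / "demo_sk" / "demo_utils"
        sub.mkdir(parents=True)
        (root / "demo_sk" / "__init__.py").write_text("")
        (sub / "__init__.py").write_text("")
        (sub / "fixes.py").write_text("def pinvh():\n    return 'ok'\n")

        sys.path.insert(0, str(root))
        importlib.invalidate_caches()
        try:
            for name in names:
                try:
                    importlib.import_module(name)
                    results[name] = True
                except ModuleNotFoundError:
                    results[name] = False
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(str(root))
            for mod in list(sys.modules):
                if mod.split(".")[0] in ("demo_sk", "demo_utils"):
                    del sys.modules[mod]
    return results


print(try_imports(["demo_sk.demo_utils.fixes", "demo_utils.fixes"]))
```

In a clean environment the fully qualified name should import and the shortened one should not, which is the same situation as `from utils.fixes import pinvh` in the refactored file.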
Scenario 2:
First, undo all the previous changes (the refactor) and remove sklearn from the venv (undo python setup.py develop).
Now suppose you make some changes to sklearn.mixture.dpgmm, say you write an awesome new function that uses pinvh. Open sklearn.mixture.dpgmm and delete the import statement on line 20:
from ..utils.extmath import logsumexp, pinvh, squared_norm
This simulates writing the new code first and adding the needed imports afterwards. It creates several errors, one of them on line 215 because the pinvh function isn't found. Put your caret on the pinvh reference and hit Alt+Enter (on OS X). The import statement that PyCharm offers references utils.extmath.pinvh, which is wrong: it should be sklearn.utils.extmath.pinvh. The top-level root directory is ignored.
This issue is somewhat alleviated if you have the project code installed under the venv, as PyCharm will also look for the missing import under site-packages. It doesn't fix the problem, however.
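For reference, the absolute name that a relative import stands for can be checked with the standard library's `importlib.util.resolve_name`. This small sketch only assumes that the module sklearn.mixture.dpgmm belongs to the package sklearn.mixture. It needs no sklearn install, since it works on names alone.

```python
from importlib.util import resolve_name

# The module sklearn.mixture.dpgmm lives in the package sklearn.mixture.
package = "sklearn.mixture"

# The relative import that was deleted from dpgmm resolves to the full name.
print(resolve_name("..utils.extmath", package))  # -> sklearn.utils.extmath

# A name without leading dots is treated as absolute and returned unchanged.
print(resolve_name("utils.extmath", package))  # -> utils.extmath
```

So an auto-import has to produce either the relative form `..utils.extmath` or the absolute form `sklearn.utils.extmath`; the bare `utils.extmath` matches neither.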
Something I thought was interesting is that you get different behavior if you don't designate a source folder at all. (Make sure there is no blue 'Source folder' listed in the right side panel.) You may also need to 'invalidate caches and restart'.
Once I do that and then move the pinvh function with F6, it at least changes the import statement in dpgmm.py to "from sklearn.utils.fixes import pinvh", which should work, although it doesn't follow the existing relative import usage. I get the same result when designating the root folder as the only source folder and excluding everything else. It's not clear to me whether you're normally expected to set your main package folder as a 'source folder' or not.
However, the curious thing was that the refactoring seemed to miss some occurrences in files I didn't have open. Once I opened these files and repeated the refactoring, they were included. Actually, files being missed during refactoring seems to happen no matter how the source folders are set up.
This might be due to using relative package imports. PEP 8 recommends absolute imports, although it accepts explicit relative imports as an alternative, and they are widespread. There are also PyCharm bug reports which suggest relative imports aren't handled correctly, such as this one. I think mixing absolute and relative imports might also throw off the refactoring.
Bottom line, I agree that PyCharm does not seem to handle refactoring relative imports in an optimal manner. For now I'm not sure what you can do other than manually fix the import lines after using refactorings. It might be worth opening a new bug report.
Do note that this isn't just a problem with relative imports. I am working on a project that uses solely absolute imports and the same problem occurs. It has to do with which directory you mark as source: if the top-level package is marked as a source directory, PyCharm ignores it when resolving the auto-imports, relative or not.
Given that quite a number of Python projects (scipy, numpy, gensim, sklearn) are structured like that, it would seem that this needs to be fixed.
You might still need to mark a bunch of non-source folders as excluded if you don't want them showing up in things like search results. A bit tedious, I agree, but it works. Refactoring won't preserve relative import usage but will at least produce usable code.