Auto-imports and project structure

PyCharm's auto-import logic interacts badly with the project structure described below. This looks like a bug to me, as it is at odds with how a lot of existing Python projects are laid out.

A Python project is usually structured in the following manner (using scikit-learn as an example, but this also applies to scipy, numpy, gensim etc.):

scikit-learn <– top-level directory created during git clone (project root)
./benchmarks  <– benchmarks (obviously)
./sklearn  <– all the sources for the project
./sklearn/cluster  <– sklearn cluster package containing stuff related to clustering

To get PyCharm to recognise what is what, one tends to mark some directories as Source directories in Settings - Project Structure. Given that the `scikit-learn/sklearn` directory is the root of the entire source tree, one would be inclined to mark it as `Source`, especially as the other directories don't really contain any sources, just auxiliary things. Doing this, however, causes all the auto-imports to fail, because PyCharm doesn't take the Source directory itself into account when auto-importing. So the `sklearn.cluster` package gets imported as just `import cluster`, which of course fails when you try to run your code. The import should be `from sklearn import cluster`, but PyCharm leaves out the top of the Source tree.
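To make the difference concrete, here is a minimal sketch of how the dotted name of a package depends on which directory is treated as the import root. This is only an illustration of the naming problem, not PyCharm's actual algorithm; `dotted_name` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def dotted_name(package_dir, import_root):
    """Return the dotted module name of package_dir as seen from import_root."""
    relative = PurePosixPath(package_dir).relative_to(PurePosixPath(import_root))
    return ".".join(relative.parts)


cluster_dir = "scikit-learn/sklearn/cluster"

# Project root as the import root: the name includes the top-level package.
name_from_project_root = dotted_name(cluster_dir, "scikit-learn")

# sklearn itself as the import root: the top-level package is dropped.
name_from_source_dir = dotted_name(cluster_dir, "scikit-learn/sklearn")

print(name_from_project_root)  # sklearn.cluster
print(name_from_source_dir)    # cluster
```

The second name is the one that ends up in the broken import.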

There are two solutions to this:

1) Create a separate 'src' directory under the project root and put the top-level package in there - this doesn't fit well with a lot of existing projects (sklearn, scipy, numpy, gensim, ...)
2) Mark the entire project directory as a Source directory and then Exclude all the directories that don't actually contain any sources - this is just stupid

Is there a better way to get the auto-imports to work properly?
In your example, are both ./sklearn and ./sklearn/cluster Python packages? That is, do both directories have an __init__.py file?
They are. I copied the structure from the scikit-learn GitHub page.
Here's what I did:

1) Clone the scikit-learn repo with Git.
2) Create a virtualenv (venv) for this project and activate it.
3) pip install numpy scipy
4) python setup.py develop
5) Open in PyCharm
6) Set the 'sklearn' folder as the only source folder in Project Structure.  (I also exclude the venv but that's a personal preference.)
7) Open a Python prompt from within PyCharm.
8) Both of these statements work:
   >>> import sklearn
   >>> import sklearn.cluster
That's not the issue.

Scenario 1:

Let's say you wanted to refactor sklearn.utils.extmath - you realise that the pinvh function shouldn't be there but should instead be under sklearn.utils.fixes. You of course want all the references to that function to change as well, for instance the one in sklearn.mixture.dpgmm (line 20).

So you look for sklearn.utils.extmath.pinvh, hit F6 to move the function, and move it to sklearn.utils.fixes.

The import in sklearn.mixture.dpgmm is changed to

from utils.fixes import pinvh

which will fail when you run the code, as that function actually lives under sklearn.utils.fixes.

Scenario 2:

First, undo all the previous changes (the refactor) and remove sklearn from the venv (undo python setup.py develop).

Now suppose you make some changes to sklearn.mixture.dpgmm, say you write an awesome new function that uses pinvh. Open sklearn.mixture.dpgmm and delete the import statement from line 20

from ..utils.extmath import logsumexp, pinvh, squared_norm

to simulate writing the new code first and adding the needed imports later. This will create several errors, one of them on line 215 because the pinvh function isn't found. Put your caret on the pinvh reference and hit Alt+Enter (on OS X). The import statement that PyCharm offers references utils.extmath.pinvh, which is wrong. It should be sklearn.utils.extmath.pinvh - the top-level root directory is ignored.

This issue is somewhat alleviated if you have the project code installed in the venv, as PyCharm will then also look for the missing import under site-packages. It doesn't fix the problem, however.
I see what you mean now and have reproduced it. I have also run into this behavior in my projects. I usually just manually fix the import statements, but yeah, it's annoying.

Something I thought was interesting is that you get different behavior if you don't designate a source folder at all. (Make sure there is no blue 'Source folder' listed in the right side panel.) You may also need to 'invalidate caches and restart'.

Once I do that and then move the pinvh function with F6, it at least changes the import statement to "from sklearn.utils.fixes import pinvh", which should work, although it doesn't follow the existing relative import usage. I get the same result when designating the root folder as the only source folder and excluding everything else. It's not clear to me whether you're normally expected to set your main package folder as a 'source folder' or not.

However, the curious thing was that the refactoring seemed to miss some occurrences in files I didn't have open. Once I opened these files and repeated the refactoring, they were included. Actually, this missing of files during refactoring seems to happen no matter what the source folder situation is.
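As a quick check that the absolute form and the existing relative form point at the same module, here is a small sketch using the standard library's `importlib.util.resolve_name`. It assumes that the `__package__` of sklearn.mixture.dpgmm is "sklearn.mixture"; nothing needs to be installed, since only the names are resolved.

```python
from importlib.util import resolve_name

# Assumption: dpgmm lives in the sklearn.mixture package.
package_of_dpgmm = "sklearn.mixture"

# The relative form, as dpgmm would write it after the move.
relative_form = resolve_name("..utils.fixes", package_of_dpgmm)

# The form PyCharm generates when sklearn is marked as the source folder.
# It has no leading dots, so it is treated as a top-level module name.
broken_form = resolve_name("utils.fixes", package_of_dpgmm)

print(relative_form)  # sklearn.utils.fixes
print(broken_form)    # utils.fixes
```

So "from ..utils.fixes import pinvh" and "from sklearn.utils.fixes import pinvh" name the same module, while "from utils.fixes import pinvh" looks for a separate top-level package called utils.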

This might be due to using relative package imports. PEP 8 recommends absolute imports over relative ones, although relative imports are widespread. There are also PyCharm bug reports which suggest relative imports aren't handled correctly, such as this one. I think mixing absolute and relative imports might also throw off the refactoring.

Bottom line, I agree that PyCharm does not seem to handle refactoring relative imports in an optimal manner. For now I'm not sure what you can do other than manually fix the import lines after using refactorings. It might be worth opening a new bug report.
I've been fixing the errors manually as well, but that is clearly silly, as the refactoring is supposed to take care of that. I could just as well move the bit of code by hand and then go and fix the import errors.

Do note that this isn't just a problem with relative imports. I am working on a project that uses solely absolute imports and the same problem occurs. It has to do with which directory you mark as source: if the top-level package is marked as a source directory, PyCharm ignores it when resolving the auto-imports, relative or not.

Given that quite a number of Python projects (scipy, numpy, gensim, sklearn) are structured like that, it would seem that this needs to be fixed.
The PyCharm help doesn't make this very clear at all, but from looking at the help for PhpStorm and WebStorm, it seems that the 'source folder' is intended only to mark a concept like 'namespace roots' or, as you said, a separate src directory, especially for multiple packages. So in your example, marking 'sklearn' or anything under it as a source folder is incorrect, because that would be like adding the package folder directly to sys.path. I think you don't need to mark one at all if your project root folder is the source namespace root.

You might still need to mark a bunch of non-source folders as excluded if you don't want them showing up in things like search results. A bit tedious, I agree, but it works. Refactoring won't preserve relative import usage but will at least produce usable code.
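The sys.path comparison can be tried outside PyCharm. The sketch below builds a throwaway layout with hypothetical names (demo_pkg standing in for sklearn, demo_cluster for cluster) and checks which import names work depending on which directory is put on sys.path. It assumes that a source folder behaves like a sys.path entry, which is my reading of the help and not something I have confirmed.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def can_import(module_name, root):
    """Return True if module_name imports with root added to sys.path."""
    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False
    finally:
        sys.path.remove(str(root))
        # Forget the demo modules so that each check starts clean.
        for name in list(sys.modules):
            if name.split(".")[0] in ("demo_pkg", "demo_cluster"):
                del sys.modules[name]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        project_root = Path(tmp)
        package_dir = project_root / "demo_pkg"
        (package_dir / "demo_cluster").mkdir(parents=True)
        (package_dir / "__init__.py").write_text("")
        (package_dir / "demo_cluster" / "__init__.py").write_text("")
        return {
            "root on path, demo_pkg.demo_cluster": can_import("demo_pkg.demo_cluster", project_root),
            "root on path, demo_cluster": can_import("demo_cluster", project_root),
            "package on path, demo_pkg.demo_cluster": can_import("demo_pkg.demo_cluster", package_dir),
            "package on path, demo_cluster": can_import("demo_cluster", package_dir),
        }


results = demo()
for case, ok in results.items():
    print(case, "->", ok)
```

With the project root on the path only the fully qualified name imports; with the package folder on the path only the short name does, which matches the imports PyCharm generates when 'sklearn' is the source folder.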
