getting sys.path to include the proper root directory on Run


Hi - I'm a novice at PyCharm, just trying out EAP 1.1, after using the NetBeans Python support for quite a while. How does PyCharm determine what it should add to sys.path when it runs a file? It's not doing what I would expect.

My project directory hierarchy looks like this:

(in /home/me)

MyProject/
  src/
    mymodule/
      __init__.py
      something.py
      ...
      submodule/
        somethingelse.py

I have marked MyProject/src as the "Source Root".

If I run something.py PyCharm prepends /home/me/MyProject/src/mymodule to sys.path (twice).

If I run somethingelse.py it prepends /home/me/MyProject/src/mymodule/submodule to sys.path (twice).

But I would expect it to add /home/me/MyProject/src to sys.path. (This is what NetBeans does when you designate a source folder.)


something.py has

  from mymodule import Blah

in it, but because of the wrong sys.path, Python can't find mymodule.

What's the best way to get the correct sys.path (PYTHONPATH)?

Thanks very much,

Dan


Can someone please comment on why the directory containing the running script is prepended to sys.path? Is this configurable?


Hi tpetricek! I believe this is standard behavior in Python: when you call `python some_script.py`, the current working directory is prepended to sys.path. Do you have a more complex case in mind?


Hi Pavel, I wasn't talking about the current working directory but about the directory that contains the running script. These are two different things. PyCharm seems to add the containing directory to sys.path, which is not standard behavior. Is this behavior configurable?
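For comparison, the two directories can be told apart with a quick experiment outside the IDE. The following is only a minimal sketch: the helper name `first_path_entry` and the file name `show_path.py` are made up for illustration. It writes a throwaway script into a subdirectory, runs it with the plain interpreter from a different working directory, and reports what ended up in `sys.path[0]`. The Python documentation describes `sys.path[0]` as the directory containing the script used to invoke the interpreter, so the sketch expects the script's directory rather than the working directory.

```python
import os
import subprocess
import sys
import tempfile


def first_path_entry():
    """Run a throwaway script from a different cwd.

    Returns (sys.path[0] seen by the script, script directory, working directory),
    all with symlinks resolved so they can be compared directly.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script_dir = os.path.join(tmp, "pkgdir")  # hypothetical directory name
        os.mkdir(script_dir)
        script = os.path.join(script_dir, "show_path.py")  # hypothetical file name
        with open(script, "w") as fh:
            fh.write("import sys\nprint(sys.path[0])\n")

        # PYTHONSAFEPATH (Python 3.11+) disables the prepending, so drop it here.
        env = {k: v for k, v in os.environ.items() if k != "PYTHONSAFEPATH"}

        # The working directory is tmp, the script lives in tmp/pkgdir.
        result = subprocess.run(
            [sys.executable, script],
            cwd=tmp,
            env=env,
            capture_output=True,
            text=True,
            check=True,
        )
        real = os.path.realpath
        return real(result.stdout.strip()), real(script_dir), real(tmp)


if __name__ == "__main__":
    first, script_dir, cwd = first_path_entry()
    print("sys.path[0]      :", first)
    print("script directory :", script_dir)
    print("working directory:", cwd)
```

If the first two printed values match and the third differs, the interpreter itself is prepending the script's directory, independently of anything PyCharm does.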


Hello Pavel Karateev, I noticed the same issue. I'm working on an airflow project structured as follows:

airflow/
    dags/
        data_lake/
            zendesk.py
    plugins/
        zendesk/
            __init__.py
            operators/
                zendesk_to_gcs_operator.py

In my plugins/zendesk/__init__.py, there's an import statement:

from zendesk.operators import ZendeskToGcsOperator


But this didn't work because "the directory containing the running script is prepended to sys.path", i.e., the first entry in sys.path is `.../airflow/dags/data_lake`, which means that the zendesk package is resolved as the zendesk.py in .../airflow/dags/data_lake. This is unexpected behavior, and I have traced it to the following code in pydevd.py:

if not is_module:
    # now, the local directory has to be added to the pythonpath
    # sys.path.insert(0, os.getcwd())
    # Changed: it's not the local directory, but the directory of the file launched
    # The file being run must be in the pythonpath (even if it was not before)
    sys.path.insert(0, os.path.split(rPath(file))[0])
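The shadowing itself can be reproduced without airflow or PyCharm. The sketch below is an illustration under stated assumptions: it recreates only the two relevant paths from the layout above in a temporary directory (with empty files) and asks the standard `importlib.machinery.PathFinder` which file the name `zendesk` resolves to for a given search order. The helper name `resolve_zendesk` is invented for this example.

```python
import importlib.machinery
import os
import tempfile


def resolve_zendesk(script_dir_first=True):
    """Return the file (relative to the temp root) that 'zendesk' resolves to."""
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.realpath(tmp)
        dags = os.path.join(root, "dags", "data_lake")
        plugins = os.path.join(root, "plugins")
        plugin_pkg = os.path.join(plugins, "zendesk")
        os.makedirs(dags)
        os.makedirs(plugin_pkg)

        # Empty stand-ins for the DAG script and the plugin package.
        open(os.path.join(dags, "zendesk.py"), "w").close()
        open(os.path.join(plugin_pkg, "__init__.py"), "w").close()

        # Mimic the order of sys.path entries: the first match wins.
        search = [dags, plugins] if script_dir_first else [plugins, dags]
        spec = importlib.machinery.PathFinder.find_spec("zendesk", search)
        return os.path.relpath(spec.origin, root).replace(os.sep, "/")


if __name__ == "__main__":
    print(resolve_zendesk(script_dir_first=True))
    print(resolve_zendesk(script_dir_first=False))
```

With the script directory first, the lookup stops at the DAG file `dags/data_lake/zendesk.py`; with the plugins directory first, it finds the package's `__init__.py` instead. That matches the failure described above, where the DAG script is picked up in place of the plugin package.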


Could we get an answer on why this logic is needed? Is there a way to get around it? If not, is it OK if I take the initiative to fix this issue by submitting a PR to the open source intellij-community project?


Hi Hartantoteddy,

"In my plugins/zendesk/__init__.py, there's an import statement:"

Is there an "__init__.py" with a "ZendeskToGcsOperator" class import in the "operators" directory?

How does the "plugins" directory end up in PYTHONPATH for the import to be resolved when you run "zendesk.py" from the terminal?


Hi Pavel Karateev, sorry for the late reply. I didn't get any notification.

Yes, there's "__init__.py" in "operators", and it imports "ZendeskToGcsOperator". Sorry that I omitted that detail.

The plugins directory is added to PYTHONPATH at runtime by airflow, and airflow also registers each plugin in its airflow namespace. So, in "airflow/dags/data_lake/zendesk.py", I have the import statement "from airflow.operators.zendesk_plugin import ZendeskToGcsOperator". Then, when Python tries to resolve that dependency in "airflow/plugins/zendesk/__init__.py", it runs into the error that I described.

I'm sorry if this sounds a bit convoluted. It's just the nature of how airflow works.

I think we can take a look at the original scenario described by dhalbert, which is easier to grasp.


Hartantoteddy Thank you for the clarification!

The OP's example raises some questions, in my opinion:

- Why can't Python find `mymodule` if `src` is still added to PYTHONPATH (not as the first entry, but still)? Is there some name collision?

- How is it supposed to work outside of PyCharm, where you can't mark the `src` folder as a source root?

- It is true that PyCharm doesn't do the best job of managing the ordering of source roots in PYTHONPATH (the dedicated ticket is https://youtrack.jetbrains.com/issue/PY-28321), but I would argue that if your code depends to such an extent on the ordering of source roots in PYTHONPATH, its architecture may be questionable in the first place (maybe I am wrong here, no offense).

Back to your example. I am not familiar with airflow, but I guess you don't run `python dags/data_lake/zendesk.py` in the terminal, is that correct? Otherwise the airflow magic won't work, and `dags/data_lake` will get prepended to PYTHONPATH following the default Python behavior, breaking the `zendesk` import just as it happens in the IDE. If my reasoning is correct, then PyCharm more or less follows the standard Python PYTHONPATH manipulation logic. That leaves two options: either change the way `zendesk.py` is executed in PyCharm to match how airflow does it, or file a feature request for the IDE to provide a way to override the prepending of the script's parent directory to PYTHONPATH.

Please correct me if I misunderstood the details again and sorry for the wall of text.


Hi Pavel Karateev

Ah ok I see it now. My problem is different from the one raised by the OP.

Your explanation is completely correct. My bad. I understand the problem now. It has to do with how airflow works specifically... I guess if there are any changes to be made, they should be on the Airflow side.

So to give you the full story, when I ran it in PyCharm, in my zendesk.py, I have this as the main function:

if __name__ == "__main__":
    dag.cli()

`dag.cli()` exposes the airflow CLI magic. So I've been able to run other similar code, where the filename in `dags/data_lake/` is different from the one in `plugins/`, by passing parameters such as `test load_zendesk_data_sg 2020-03-15`, which results in `python dags/data_lake/zendesk.py test load_zendesk_data_sg 2020-03-15`.

On the terminal, I can run "airflow test datalake_zendesk load_zendesk_data_sg 2020-03-15", which achieves the same thing except that "dags/data_lake" is not added to PYTHONPATH. You are right that when "python dags/data_lake/zendesk.py" is run, "dags/data_lake" is added to PYTHONPATH. So I guess the PyCharm developers were following the standard Python PYTHONPATH manipulation logic, as you have pointed out; it's just that in my case this doesn't work well with how I set up our airflow DAGs to be run in PyCharm. I think I know how to fix this: I will change how the program is triggered from PyCharm. Hopefully I can still keep the debugging feature.


Thank you for the extra details 👍

Actually PyCharm has a special feature to debug programs which can't be triggered within the IDE, it's called Remote Debug Server (ignore the "Remote" part, it can be used locally).

The debugger consists of two parts: the Frontend (Java+Kotlin) and the Backend (Python). The Frontend is basically the buttons you click in the IDE and all that high-level stuff, while the Backend is responsible for actually controlling your Python code by pausing execution, evaluating chunks of code and so on. The Frontend and Backend communicate with each other over a socket, so when you click e.g. "Continue" while debugging, the Frontend sends the corresponding protocol command to the Backend.

When you execute a script in PyCharm under the debugger both Frontend and Backend are started (actually your script is passed to Backend entry point to be executed). Unfortunately, you can't do it when the program can't be triggered from PyCharm (e.g. a complex case with airflow).

But you can use the Remote Debug Server run configuration to start the Frontend alone, in a loop waiting for a Backend connection. Next, you can insert the Backend initialization call (pydevd_pycharm.settrace) into the script of interest. Once the script execution (triggered by whatever program outside of the IDE) reaches the Backend initialization call, it will pause execution and establish the Frontend-Backend connection, allowing you to debug as usual. Please see the screencast I recorded a while ago: https://youtu.be/uYt3G9WEYvo
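As a rough illustration of the Backend initialization call, the sketch below wraps `pydevd_pycharm.settrace` in a small helper. Several things here are assumptions rather than part of the description above: the helper name `attach_pycharm_debugger` and the `PYCHARM_DEBUG_PORT` environment variable are invented for this example, the port has to match whatever is configured in the Remote Debug Server run configuration, and the `pydevd-pycharm` package has to be installed in the interpreter that runs the script.

```python
import os


def attach_pycharm_debugger(host="localhost"):
    """Connect to a waiting PyCharm debug server if a port is configured.

    PYCHARM_DEBUG_PORT is a hypothetical variable used only in this sketch.
    Returns True if the settrace call was made, False if it was skipped.
    """
    port = os.environ.get("PYCHARM_DEBUG_PORT")
    if not port:
        # No port configured: run normally without touching the debugger.
        return False

    # Imported lazily so the script still runs where pydevd-pycharm is absent.
    import pydevd_pycharm

    # Connects to the Frontend started by the Remote Debug Server configuration.
    pydevd_pycharm.settrace(
        host,
        port=int(port),
        stdoutToServer=True,
        stderrToServer=True,
    )
    return True


if __name__ == "__main__":
    attached = attach_pycharm_debugger()
    print("debugger attached" if attached else "running without debugger")
```

Guarding the call behind an environment variable is just one possible design choice; it keeps the script usable when it is triggered by airflow with no debug server listening.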

This is not ideal as it requires some extra clicks, but it helps with such cases. Hope my description makes sense.


Pavel Karateev I see. Thank you for the thorough explanation! I am aware of Remote Debug Server being used for executables running in docker containers and the like. I only vaguely understood how it works, but now, with your explanation, it's much clearer :) I didn't know it was something I could look into in this case. I will check it out. Thanks again!
