Problems using PySpark from PyCharm
I'm using PyCharm 2018.1 with my installed Python 3.6.5 interpreter, into which I've installed pyspark via pip3, specifically pyspark (2.1.1+hadoop2.7). Spark and Hadoop are both installed, and the environment variables on OS X seem to be set correctly.
When I run a simple pyspark job from ipython, it works correctly, like this:
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 05:52:31)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from pyspark import SparkContext
In [2]: SparkContext.getOrCreate().parallelize(range(0, 10000)).filter(lambda x: x%3 == 0).count()
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/05/05 18:53:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Out[2]: 3334
This also works from Python 3.6.5 directly. However, when I open a Python console in PyCharm and run the same thing, it fails with a lengthy stack trace containing a number of repetitions of this:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 163, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
    command = serializer._read_with_length(file)
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 451, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 783, in _make_skel_func
    closure = _reconstruct_closure(closures) if closures else None
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 775, in _reconstruct_closure
    return tuple([_make_cell(v) for v in values])
TypeError: 'int' object is not iterable
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
18/05/05 19:05:18 ERROR Executor: Exception in task 7.0 in stage 0.0 (TID 7)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 163, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 54, in read_command
    command = serializer._read_with_length(file)
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 451, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 783, in _make_skel_func
    closure = _reconstruct_closure(closures) if closures else None
  File "/Users/XXX/code/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 775, in _reconstruct_closure
    return tuple([_make_cell(v) for v in values])
TypeError: 'int' object is not iterable
Any ideas what's wrong? Thanks in advance.
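In case it helps with diagnosis: since the same job works from ipython and plain Python but not from the PyCharm console, one thing worth comparing is what each environment actually resolves. The sketch below is only a diagnostic aid, not a fix. It assumes, without having confirmed it, that the difference lies in which interpreter, which pyspark package, or which environment variables each session sees. The helper name describe_environment is made up for illustration; the environment variable names are the standard Spark ones.

```python
import os
import sys


def describe_environment():
    """Collect the settings that decide which Python and which pyspark are in use."""
    info = {
        "python_executable": sys.executable,
        "python_version": sys.version.split()[0],
    }
    # Environment variables that commonly influence how PySpark starts its workers.
    for name in ("SPARK_HOME", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "PYTHONPATH"):
        info[name] = os.environ.get(name, "<not set>")
    try:
        import pyspark

        # Where the driver-side pyspark package was loaded from.
        info["pyspark_file"] = pyspark.__file__
        # Older releases may not expose __version__, so fall back gracefully.
        info["pyspark_version"] = getattr(pyspark, "__version__", "<unknown>")
    except ImportError:
        info["pyspark_file"] = "<pyspark not importable>"
        info["pyspark_version"] = "<pyspark not importable>"
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this once in ipython and once in the PyCharm Python console, then comparing the two outputs line by line, should show whether the sessions differ in interpreter, pyspark location, or Spark-related variables.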