Hello, I recently set up a small three-node Spark cluster on an existing Hadoop installation. I'm running into an error message when attempting to use the pyspark shell. I can reproduce the error in the pyspark shell with the following example:
from operator import add

text = sc.textFile("shakespeare.txt")

def tokenize(text):
    return text.split()

words = text.flatMap(tokenize)
print(words)

Once the above code is executed, I receive the following error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/rdd.py", line 199, in __repr__
    return self._jrdd.toString()
  File "/opt/spark/python/pyspark/rdd.py", line 2439, in _jrdd
    self._jrdd_deserializer, profiler)
  File "/opt/spark/python/pyspark/rdd.py", line 2374, in _wrap_function
    sc.pythonVer, broadcast_vars, sc._javaAccumulator)
  File "/usr/local/lib/python3.5/site-packages/py4j/java_gateway.py", line 1414, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.5/site-packages/py4j/protocol.py", line 324, in get_return_value
    format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonFunction.
Trace:
py4j.Py4JException: Constructor org.apache.spark.api.python.PythonFunction([class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.String, class java.lang.String, class java.util.ArrayList, class org.apache.spark.api.python.PythonAccumulatorV2]) does not exist
    at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
    at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
    at py4j.Gateway.invoke(Gateway.java:235)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

I'm running the following versions:

Spark version: 2.0.2
Hadoop version: 2.7.3
Python version: 3.5.2
Scala version: 2.12
OS: RHEL 7.2

It appears that the py4j gateway on the Python side is having trouble constructing org.apache.spark.api.python.PythonFunction. Is there any debugging I can do on my end to figure out what may be causing that? For reference, the spark-shell with Scala works as expected.

Thanks for your time.
-Tegan
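P.S. In case it helps with diagnosis: since the traceback shows py4j being loaded from /usr/local/lib/python3.5/site-packages rather than from Spark's own python/lib directory, I put together a small check to see which py4j copies exist on my machine. This is just a sketch against my own layout (/opt/spark is where I installed Spark; adjust SPARK_HOME as needed), and my guess that a bundled-vs-pip py4j mismatch could explain the "Constructor ... does not exist" error is only a hypothesis:

```python
# Compare the py4j Spark ships with against whatever `import py4j` resolves to.
# "/opt/spark" is just my install location; override via SPARK_HOME if different.
import glob
import os

spark_home = os.environ.get("SPARK_HOME", "/opt/spark")

# The py4j source zip bundled inside the Spark distribution
bundled = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
print("py4j bundled with Spark:", bundled if bundled else "none found")

# The py4j that a plain `import py4j` picks up (e.g. a pip-installed copy)
try:
    import py4j
    print("py4j on sys.path:", py4j.__file__)
except ImportError:
    print("no separately installed py4j on sys.path")
```

Running this on my cluster shows both a bundled zip and a pip-installed copy, so I suspect pyspark may be mixing the two.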