Hello,

I recently set up a small three-node Spark cluster on top of an existing Hadoop 
installation. I'm running into an error when attempting to use the pyspark 
shell, and I can reproduce it with the following example:


text = sc.textFile("shakespeare.txt")

def tokenize(text):
    # split a line into whitespace-separated tokens
    return text.split()

words = text.flatMap(tokenize)   # lazy transformation; nothing runs yet
print(words)                     # this is the line that triggers the error



Once the above code is executed, I receive the following error message:


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/rdd.py", line 199, in __repr__
    return self._jrdd.toString()
  File "/opt/spark/python/pyspark/rdd.py", line 2439, in _jrdd
    self._jrdd_deserializer, profiler)
  File "/opt/spark/python/pyspark/rdd.py", line 2374, in _wrap_function
    sc.pythonVer, broadcast_vars, sc._javaAccumulator)
  File "/usr/local/lib/python3.5/site-packages/py4j/java_gateway.py", line 1414, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.5/site-packages/py4j/protocol.py", line 324, in get_return_value
    format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.api.python.PythonFunction. Trace:
py4j.Py4JException: Constructor org.apache.spark.api.python.PythonFunction([class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.String, class java.lang.String, class java.util.ArrayList, class org.apache.spark.api.python.PythonAccumulatorV2]) does not exist
                at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
                at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
                at py4j.Gateway.invoke(Gateway.java:235)
                at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
                at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
                at py4j.GatewayConnection.run(GatewayConnection.java:214)
                at java.lang.Thread.run(Thread.java:745)
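
As far as I can tell from the traceback, the flatMap itself is fine 
(transformations are lazy); the error is only raised once print(words) invokes 
the RDD's __repr__, which is the first point where the Java-side 
PythonFunction gets built. So I'd expect the repro to trim down to this 
(untested beyond what's shown above):

words = text.flatMap(tokenize)   # no error here; nothing has run yet
repr(words)                      # should raise the same Py4JError print() hits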


I’m running the following versions:

Spark Version: 2.0.2
Hadoop Version: 2.7.3
Python Version: 3.5.2
Scala: 2.12
OS Version: RHEL 7.2

It appears that the py4j gateway is failing to construct 
"org.apache.spark.api.python.PythonFunction": judging by the exception, the 
constructor signature the Python side is calling does not exist on the JVM 
side. Is there any debugging I can do on my end to figure out what might be 
causing this?
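
For what it's worth, the only check I have thought of so far is comparing the 
versions the shell itself reports, on the (unverified) assumption that a 
mismatch between the Python-side pyspark files and the Spark jars would show 
up there:

import sys
import pyspark

print(pyspark.__file__)   # where the Python-side Spark code is loaded from
print(sc.version)         # Spark version reported by the JVM side
print(sc.pythonVer)       # Python version PySpark recorded at startup
print(sys.version)        # interpreter actually running the driver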

For reference, running the equivalent code in the Scala spark-shell works as 
expected.

Thanks for your time.

-Tegan
