I've been trying to get the cassandra_inputformat.py and cassandra_outputformat.py examples running for the past half day. I am running DataStax Community Cassandra 2.1 on a single node (my dev environment) with spark-1.1.0-bin-hadoop2.4.
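In case it matters, I'm launching the example roughly as the Spark docs describe for the Python Hadoop-format examples. The exact jar name and the example's arguments below are from memory, so treat them as approximate rather than a verbatim copy of my command:

```shell
# Run from the Spark install dir; the examples assembly jar must be on the
# driver classpath so the Cassandra converters/output format classes resolve.
# (jar filename and the host/keyspace/column-family/user args are assumed)
cd /opt/spark-1.1.0-bin-hadoop2.4
./bin/spark-submit \
  --driver-class-path lib/spark-examples-1.1.0-hadoop2.4.0.jar \
  examples/src/main/python/cassandra_outputformat.py \
  localhost test_ks users 1745 john smith
```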
I can connect and use Cassandra via cqlsh, and I can run the PySpark pi computation job. Unfortunately, I cannot run the cassandra_inputformat and cassandra_outputformat examples successfully. This is the output I am getting now:

14/09/30 18:15:41 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@dev:40208/user/HeartbeatReceiver
14/09/30 18:15:42 INFO deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
14/09/30 18:15:43 INFO Converter: Loaded converter: org.apache.spark.examples.pythonconverters.ToCassandraCQLKeyConverter
14/09/30 18:15:43 INFO Converter: Loaded converter: org.apache.spark.examples.pythonconverters.ToCassandraCQLValueConverter
Traceback (most recent call last):
  File "/opt/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_outputformat.py", line 83, in <module>
    valueConverter="org.apache.spark.examples.pythonconverters.ToCassandraCQLValueConverter")
  File "/opt/spark-1.1.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1184, in saveAsNewAPIHadoopDataset
    keyConverter, valueConverter, True)
  File "/opt/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/opt/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
	at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.checkOutputSpecs(AbstractColumnFamilyOutputFormat.java:75)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:900)
	at org.apache.spark.api.python.PythonRDD$.saveAsHadoopDataset(PythonRDD.scala:687)
	at org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:745)

Should I have built a custom Spark assembly? Am I missing a Cassandra driver? I have browsed through the documentation and found nothing specifically relevant to Cassandra; does such documentation exist?

Thank you,
- David