I've been trying to get the cassandra_inputformat.py and cassandra_outputformat.py examples running for the past half day. I am running DataStax Community Cassandra 2.1 on a single node (my dev environment) with spark-1.1.0-bin-hadoop2.4.
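In case it matters, I'm launching the example roughly as the Spark docs describe for the Python Hadoop-format examples. The exact jar name and the example's arguments below are from memory, so treat them as approximate rather than a verbatim copy of my command:

```shell
# Run from the Spark install dir; the examples assembly jar must be on the
# driver classpath so the Cassandra converters/output format classes resolve.
# (jar filename and the host/keyspace/column-family/user args are assumed)
cd /opt/spark-1.1.0-bin-hadoop2.4
./bin/spark-submit \
  --driver-class-path lib/spark-examples-1.1.0-hadoop2.4.0.jar \
  examples/src/main/python/cassandra_outputformat.py \
  localhost test_ks users 1745 john smith
```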
I can connect and use Cassandra via cqlsh, and I can run the PySpark pi computation job. Unfortunately, I cannot run the cassandra_inputformat and cassandra_outputformat examples successfully. This is the output I am getting now:

14/09/30 18:15:41 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@dev:40208/user/HeartbeatReceiver
14/09/30 18:15:42 INFO deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
14/09/30 18:15:43 INFO Converter: Loaded converter: org.apache.spark.examples.pythonconverters.ToCassandraCQLKeyConverter
14/09/30 18:15:43 INFO Converter: Loaded converter: org.apache.spark.examples.pythonconverters.ToCassandraCQLValueConverter
Traceback (most recent call last):
  File "/opt/spark-1.1.0-bin-hadoop2.4/examples/src/main/python/cassandra_outputformat.py", line 83, in <module>
    valueConverter="org.apache.spark.examples.pythonconverters.ToCassandraCQLValueConverter")
  File "/opt/spark-1.1.0-bin-hadoop2.4/python/pyspark/rdd.py", line 1184, in saveAsNewAPIHadoopDataset
    keyConverter, valueConverter, True)
  File "/opt/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/opt/spark-1.1.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset.
: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
	at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.checkOutputSpecs(AbstractColumnFamilyOutputFormat.java:75)
	at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:900)
	at org.apache.spark.api.python.PythonRDD$.saveAsHadoopDataset(PythonRDD.scala:687)
	at org.apache.spark.api.python.PythonRDD.saveAsHadoopDataset(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:745)

Should I have built a custom Spark assembly? Am I missing a Cassandra driver? I have browsed through the documentation and found nothing specifically relevant to Cassandra; does such documentation exist?

Thank you,
- David