I have no problems when submitting the job with spark-submit: supplying the list of required jars via the --jars option works.
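For reference, the invocation is along these lines (the script name here is a stand-in for my actual entry point):

    spark-submit \
      --jars /usr/lib/spark/extras/lib/spark-streaming-kafka.jar,/opt/kafka/libs/scala-library-2.10.1.jar,/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,/opt/kafka/libs/metrics-core-2.2.0.jar,/usr/share/java/mysql.jar \
      my_job.py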
In the output I can see the jars being added:

    16/02/10 11:14:24 INFO spark.SparkContext: Added JAR file:/usr/lib/spark/extras/lib/spark-streaming-kafka.jar at http://192.168.10.4:33820/jars/spark-streaming-kafka.jar with timestamp 1455102864058
    16/02/10 11:14:24 INFO spark.SparkContext: Added JAR file:/opt/kafka/libs/scala-library-2.10.1.jar at http://192.168.10.4:33820/jars/scala-library-2.10.1.jar with timestamp 1455102864077
    16/02/10 11:14:24 INFO spark.SparkContext: Added JAR file:/opt/kafka/libs/kafka_2.10-0.8.1.1.jar at http://192.168.10.4:33820/jars/kafka_2.10-0.8.1.1.jar with timestamp 1455102864085
    16/02/10 11:14:24 INFO spark.SparkContext: Added JAR file:/opt/kafka/libs/metrics-core-2.2.0.jar at http://192.168.10.4:33820/jars/metrics-core-2.2.0.jar with timestamp 1455102864086
    16/02/10 11:14:24 INFO spark.SparkContext: Added JAR file:/usr/share/java/mysql.jar at http://192.168.10.4:33820/jars/mysql.jar with timestamp 1455102864090

But when I try to create a context programmatically in Python (I want to set up some tests), I see none of this and I end up with class-not-found errors. Trying to cover all the bases, even though I suspect most of these are redundant when running local, I've tried:

    from pyspark import SparkConf, SparkContext

    # the same comma-separated list is used for every option below
    jars = ','.join([
        '/usr/lib/spark/extras/lib/spark-streaming-kafka.jar',
        '/opt/kafka/libs/scala-library-2.10.1.jar',
        '/opt/kafka/libs/kafka_2.10-0.8.1.1.jar',
        '/opt/kafka/libs/metrics-core-2.2.0.jar',
        '/usr/share/java/mysql.jar',
    ])

    spark_conf = SparkConf()
    spark_conf.setMaster('local[4]')
    spark_conf.set('spark.executor.extraLibraryPath', jars)
    spark_conf.set('spark.executor.extraClassPath', jars)
    spark_conf.set('spark.driver.extraClassPath', jars)
    spark_conf.set('spark.driver.extraLibraryPath', jars)
    self.spark_context = SparkContext(conf=spark_conf)

But I still get the same failure to find the same class:

    Py4JJavaError: An error occurred while calling o30.loadClass.
    : java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper

The class is certainly in spark-streaming-kafka.jar, and that jar is present in the filesystem at the location given above. I'm under the impression that if I were using Java I'd be able to use the setJars method on the conf to take care of this, but there doesn't appear to be anything corresponding for Python.

Hacking about, I found that adding:

    spark_conf.set('spark.jars', jars)

got the logging to admit to adding the jars to the HTTP server (just as in the spark-submit output above), but whether I leave the other config options in place or remove them, the class is still not found.

Is this not possible in Python? Incidentally, I have also tried SPARK_CLASSPATH (which only gets me a message that it's deprecated and ignored anyway), and I cannot find anything else to try.

Can anybody help?

David K.
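P.S. In case it's useful, here is a minimal sketch of the kind of test I'm running, pared down to the call that blows up. The ZooKeeper address, group id and topic name are placeholders for my real values:

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    jars = ','.join([
        '/usr/lib/spark/extras/lib/spark-streaming-kafka.jar',
        '/opt/kafka/libs/scala-library-2.10.1.jar',
        '/opt/kafka/libs/kafka_2.10-0.8.1.1.jar',
        '/opt/kafka/libs/metrics-core-2.2.0.jar',
        '/usr/share/java/mysql.jar',
    ])

    conf = SparkConf().setMaster('local[4]').setAppName('kafka-jar-test')
    conf.set('spark.jars', jars)
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 1)  # 1-second batches

    # This is the call that fails: as far as I can tell from the traceback,
    # KafkaUtils loads org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper
    # on the JVM side here, and the ClassNotFoundException above is the result.
    stream = KafkaUtils.createStream(ssc, 'localhost:2181', 'test-group',
                                     {'test-topic': 1})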