Hello, I’m using Spark Streaming to aggregate data from a Kafka topic in sliding windows. We usually want to persist the aggregated data to a MongoDB cluster, or republish it to a different Kafka topic. When I use these third-party drivers in my jobs, I usually get a NotSerializableException, because Spark serializes the closures (including any captured driver/connection objects) in order to ship them to the parallel workers. To sidestep this, I’ve held the drivers in static class variables, which seems to work, i.e., my jobs run.
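For what it’s worth, here is a minimal, Spark-free sketch of why the static-variable trick works. `FakeConnection` is a hypothetical stand-in for a non-serializable driver (e.g. a MongoDB client); the demo round-trips a lambda through plain Java serialization, the same mechanism Spark uses when shipping a closure to executors. A lambda that captures the connection directly fails; one that reaches it through a static field serializes fine, because only the reference to the holder class travels and each worker JVM initializes its own copy.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Consumer;

public class SerializationDemo {
    // Hypothetical stand-in for a non-serializable third-party driver.
    static class FakeConnection {
        void send(String record) { System.out.println("sent: " + record); }
    }

    // The "static class variable" approach: closures reference the holder
    // class by name, so the connection object itself is never serialized.
    static class ConnectionHolder {
        static final FakeConnection CONN = new FakeConnection();
    }

    // A lambda must implement a Serializable interface to be serialized at all.
    interface SerializableConsumer<T> extends Consumer<T>, Serializable {}

    static void serialize(Object o) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
        }
    }

    // Capturing the connection as a local variable fails: the lambda drags
    // the non-serializable FakeConnection into its serialized state.
    static boolean directCaptureFails() throws IOException {
        FakeConnection conn = new FakeConnection();
        SerializableConsumer<String> bad = record -> conn.send(record);
        try {
            serialize(bad);
            return false;
        } catch (NotSerializableException e) {
            return true;
        }
    }

    // Going through the static field succeeds: nothing non-serializable is
    // captured, and each JVM resolves ConnectionHolder.CONN locally.
    static boolean staticReferenceSerializes() throws IOException {
        SerializableConsumer<String> good = record -> ConnectionHolder.CONN.send(record);
        serialize(good);
        return true;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("direct capture throws: " + directCaptureFails());
        System.out.println("static reference serializes: " + staticReferenceSerializes());
    }
}
```

Note this is only an illustration of the serialization mechanics, not a recommendation; the static connection is per-JVM, so each executor opens its own.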
Is this the proper way to provide third-party libs to Spark jobs? Does declaring these drivers as static prevent my job from being parallelized? Is this even a reasonable way to design jobs? An alternative (I assume) would be to aggregate my data into HDFS and have another process (perhaps non-Spark?) consume it and republish/persist it?

Thanks,
Matt