Spark Streaming + Kafka + Hive: delayed
Hello.

I have a process (Python) that reads a Kafka queue and, for each record, checks whether its id exists in a table.

    # Load table in memory
    table = sqlContext.sql("select id from table")
    table.cache()

    def processForeach(time, rdd):
        print(time)
        # One filter + count against the cached table per Kafka record
        for k in rdd.collect():
            if table.filter("id = '%s'" % k["id"]).count() > 0:
                print(k)

    kafkaTopic.foreachRDD(processForeach)

The problem is that Spark gradually falls behind; I can see it in the "print(time)" output. The Kafka topic carries at most 3 messages per second.
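One thing that may contribute to the lag: every table.filter(...).count() is a separate Spark action, so each Kafka record launches its own job. Below is a minimal sketch of an alternative, assuming the id column is small enough to fit in driver memory and that the record ids and the table ids have the same type. The ids are collected once into a Python set and membership is tested locally. The names make_batch_handler and on_match are hypothetical, not part of the original code, and the Spark wiring is left as comments so the block runs without a cluster.

```python
def make_batch_handler(known_ids, on_match=print):
    """Build a foreachRDD callback that checks record ids against a local set.

    known_ids: iterable of ids loaded once from the table (assumption: fits in memory).
    on_match:  called with each record whose id is in known_ids.
    """
    known = frozenset(known_ids)

    def process_batch(time, rdd):
        # A single collect() per batch; the lookups are plain set membership
        # tests on the driver, so no extra Spark job is started per record.
        for record in rdd.collect():
            if record["id"] in known:
                on_match(record)
        # Deliberately returns None, like the original callback.

    return process_batch


# Hypothetical wiring, reusing the names from the original snippet:
# known_ids = [row.id for row in sqlContext.sql("select id from table").collect()]
# kafkaTopic.foreachRDD(make_batch_handler(known_ids))
```

If the table changes while the stream is running, the set would have to be reloaded periodically; this sketch does not cover that.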
Re[4]: Trying to connect Spark 1.6 to Hive
Yes, I know, but the cluster is not administered by me.

On Wed, Aug 9, 2017 at 13:46, Gourav Sengupta wrote:

Hi,

Just out of sheer curiosity: why are you using Spark 1.6? Spark has made significant advances and improvements since then, so why not take advantage of them?

Regards,
Gourav
Re[2]: Trying to connect Spark 1.6 to Hive
Thanks, Matteo. I fixed it.

Regards,
JCS

On Wed, Aug 9, 2017 at 11:22, Matteo Cossu wrote:

Hello,

try these options when starting Spark:

    --conf "spark.driver.userClassPathFirst=true" --conf "spark.executor.userClassPathFirst=true"

This way the Spark driver and executors should give the classpath you define precedence over Spark's own jars.

Best Regards,
Matteo Cossu
Trying to connect Spark 1.6 to Hive
Hi everybody,

I'm trying to connect Spark to Hive. Hive uses a Derby server for metastore_db.

$SPARK_HOME/conf/hive-site.xml:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby://derby:1527/metastore_db;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.apache.derby.jdbc.ClientDriver</value>
      <description>Driver class name for a JDBC metastore</description>
    </property>

I have copied derby.jar, derbyclient.jar and derbytools.jar to $SPARK_HOME/lib, and added the three jars to CLASSPATH as well:

    $SPARK_HOME/lib/derby.jar:$SPARK_HOME/lib/derbytools.jar:$SPARK_HOME/lib/derbyclient.jar

But spark-sql says:

    org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("org.apache.derby.jdbc.ClientDriver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.

java does find the class (it only complains that the class has no main method):

    java org.apache.derby.jdbc.ClientDriver
    Error: Main method not found in class org.apache.derby.jdbc.ClientDriver, please define the main method as:
       public static void main(String[] args)
    or a JavaFX application class must extend javafx.application.Application

It seems Spark can't find the driver.
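The classpath check described above can also be done by looking inside the jars directly. Below is a minimal sketch that reports which entries of a classpath string contain a given class. The helper name jars_containing is hypothetical. It only inspects jar files that exist on disk, so it says nothing about which class loader Spark actually uses at runtime.

```python
import os
import zipfile


def jars_containing(class_name, classpath, sep=os.pathsep):
    """Return the jar files on `classpath` that contain `class_name`."""
    # org.apache.derby.jdbc.ClientDriver -> org/apache/derby/jdbc/ClientDriver.class
    entry_name = class_name.replace(".", "/") + ".class"
    hits = []
    for entry in classpath.split(sep):
        # Skip empty entries, missing files, directories and non-zip files.
        if not entry or not os.path.isfile(entry) or not zipfile.is_zipfile(entry):
            continue
        with zipfile.ZipFile(entry) as jar:
            if entry_name in jar.namelist():
                hits.append(entry)
    return hits


# Example call against the CLASSPATH environment variable:
# jars_containing("org.apache.derby.jdbc.ClientDriver", os.environ.get("CLASSPATH", ""))
```

An empty result would point to a wrong or missing path in CLASSPATH. A non-empty result, as in this thread, suggests the jars are fine and the problem lies in how Spark loads them.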