RE: tableau spark sql cassandra
Hi! Sure, I'll post the info I grabbed once the Cassandra tables' values appear in Tableau.
Best, Jerome
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282p19480.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: tableau spark sql cassandra
Well, after many attempts I can now successfully run the Thrift server using:

root@cdb-01:~/spark# ./sbin/start-thriftserver.sh --master spark://10.194.30.2:7077 --hiveconf hive.server2.thrift.bind.host 0.0.0.0 --hiveconf hive.server2.thrift.port 1

(The command was failing because of the --driver-class-path $CLASSPATH parameter, which I guess was setting spark.driver.extraClassPath.) And I can get the Cassandra data using beeline! However, the table's values are null in Tableau, but that is another problem ;)
Best, Jerome
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282p19392.html
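For reference, the conflict that made the earlier command fail can be sketched as the following guard. This is a minimal shell reproduction of the check Spark performs at startup, not Spark's actual code; the variable names are stand-ins:

```shell
# Sketch of the Spark 1.x startup check: launching aborts when the
# classpath is supplied both via the deprecated SPARK_CLASSPATH env var
# and via --driver-class-path (which sets spark.driver.extraClassPath).
SPARK_CLASSPATH="/opt/spark/lib/a.jar"
DRIVER_EXTRA_CLASSPATH="/opt/spark/lib/a.jar"   # what --driver-class-path sets

if [ -n "$SPARK_CLASSPATH" ] && [ -n "$DRIVER_EXTRA_CLASSPATH" ]; then
  MSG="Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former."
else
  MSG="classpath configuration OK"
fi
echo "$MSG"
```

Dropping --driver-class-path (as above) or unsetting SPARK_CLASSPATH before launching avoids tripping this check; setting both is always rejected.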
Re: tableau spark sql cassandra
I finally solved this problem. org.apache.hadoop.mapreduce.JobContext is a class in Hadoop < 2.0 and an interface in Hadoop >= 2.0, so I have to use a Spark build for Hadoop v1. spark-sql now seems fine, but the Thrift server does not work with my config! Here is my spark-env.sh:

#!/usr/bin/env bash
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export SPARK_HOME=/home/jererc/spark
export SPARK_MASTER_IP=10.194.30.2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=4
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=4g
export MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}
export CLASSPATH=$(echo ${SPARK_HOME}/lib/*.jar | sed 's/ /:/g'):$CLASSPATH
export SPARK_CLASSPATH=$CLASSPATH

Here is the output:

root@cdb-01:~/spark# ./sbin/start-thriftserver.sh --master spark://10.194.30.2:7077 --driver-class-path $CLASSPATH --hiveconf hive.server2.thrift.bind.host 0.0.0.0 --hiveconf hive.server2.thrift.port 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jererc/spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jererc/spark/lib/spark-examples-1.1.0-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/11/20 14:55:35 INFO thriftserver.HiveThriftServer2: Starting SparkContext
14/11/20 14:55:35 WARN spark.SparkConf: SPARK_CLASSPATH was detected (set to '/home/jererc/spark/lib/cassandra-all-1.2.9.jar:/home/jererc/spark/lib/cassandra-thrift-1.2.9.jar:/home/jererc/spark/lib/datanucleus-api-jdo-3.2.1.jar:/home/jererc/spark/lib/datanucleus-core-3.2.2.jar:/home/jererc/spark/lib/datanucleus-rdbms-3.2.1.jar:/home/jererc/spark/lib/hadoop-core-0.20.205.0.jar:/home/jererc/spark/lib/hive-cassandra-1.2.9.jar:/home/jererc/spark/lib/mysql-connector-java.jar:/home/jererc/spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar:/home/jererc/spark/lib/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar:/home/jererc/spark/lib/spark-examples-1.1.0-hadoop1.0.4.jar:'). This is deprecated in Spark 1.0+. Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath
14/11/20 14:55:35 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to '/home/jererc/spark/lib/cassandra-all-1.2.9.jar:/home/jererc/spark/lib/cassandra-thrift-1.2.9.jar:/home/jererc/spark/lib/datanucleus-api-jdo-3.2.1.jar:/home/jererc/spark/lib/datanucleus-core-3.2.2.jar:/home/jererc/spark/lib/datanucleus-rdbms-3.2.1.jar:/home/jererc/spark/lib/hadoop-core-0.20.205.0.jar:/home/jererc/spark/lib/hive-cassandra-1.2.9.jar:/home/jererc/spark/lib/mysql-connector-java.jar:/home/jererc/spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar:/home/jererc/spark/lib/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar:/home/jererc/spark/lib/spark-examples-1.1.0-hadoop1.0.4.jar:' as a work-around.
Exception in thread "main" org.apache.spark.SparkException: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.
	at org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:300)
	at org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:298)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:298)
	at org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:286)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:286)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:158)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:36)
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:57)
	at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And if I don't use SPARK_CLASSPATH, then spark-sql does not work. I tried ADD_JARS without much success. What's the best way to set the CLASSPATH and the jars?
--
View this message in context: http://apache-spark-user-li
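As a side note, the jar-glob line in the spark-env.sh above can be sanity-checked in isolation. The scratch directory below is a throwaway stand-in for ${SPARK_HOME}/lib:

```shell
# Reproduce the CLASSPATH construction from spark-env.sh on a scratch
# directory: glob every jar and join the resulting paths with ':'.
DEMO_LIB=$(mktemp -d)
touch "$DEMO_LIB/aaa.jar" "$DEMO_LIB/bbb.jar"
CP=$(echo "$DEMO_LIB"/*.jar | sed 's/ /:/g')
echo "$CP"
```

Note that the sed trick relies on jar paths containing no spaces, which holds for a standard Spark layout but is worth keeping in mind.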
Re: tableau spark sql cassandra
Hi! The Hive table is an external table, which I created like this:

CREATE EXTERNAL TABLE MyHiveTable ( id int, data string )
STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
TBLPROPERTIES (
  "cassandra.host" = "10.194.30.2",
  "cassandra.ks.name" = "test",
  "cassandra.cf.name" = "mytable",
  "cassandra.ks.repfactor" = "1",
  "cassandra.ks.strategy" = "org.apache.cassandra.locator.SimpleStrategy"
);

Here is the output from spark-sql for different commands:

spark-sql> show tables;
14/11/20 09:50:32 INFO parse.ParseDriver: Parsing command: show tables
14/11/20 09:50:32 INFO parse.ParseDriver: Parse Completed
14/11/20 09:50:32 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/11/20 09:50:32 INFO parse.ParseDriver: Parsing command: show tables
14/11/20 09:50:32 INFO parse.ParseDriver: Parse Completed
14/11/20 09:50:32 INFO ql.Driver: Semantic Analysis Completed
14/11/20 09:50:32 INFO exec.ListSinkOperator: Initializing Self 0 OP
14/11/20 09:50:32 INFO exec.ListSinkOperator: Operator 0 OP initialized
14/11/20 09:50:32 INFO exec.ListSinkOperator: Initialization Done 0 OP
14/11/20 09:50:32 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
14/11/20 09:50:32 INFO ql.Driver: Starting command: show tables
14/11/20 09:50:32 INFO ql.Driver: OK
14/11/20 09:50:32 INFO mapred.FileInputFormat: Total input paths to process : 1
14/11/20 09:50:32 INFO ql.Driver: myhivetable
Time taken: 0.088 seconds
14/11/20 09:50:32 INFO CliDriver: Time taken: 0.088 seconds

spark-sql> describe myhivetable;
14/11/20 09:50:35 INFO parse.ParseDriver: Parsing command: describe myhivetable
14/11/20 09:50:35 INFO parse.ParseDriver: Parse Completed
id	int	from deserializer
data	string	from deserializer
Time taken: 0.226 seconds
14/11/20 09:50:35 INFO CliDriver: Time taken: 0.226 seconds

spark-sql> select * from myhivetable;
14/11/20 09:50:39 INFO parse.ParseDriver: Parsing command: select * from myhivetable
14/11/20 09:50:39 INFO parse.ParseDriver: Parse Completed
14/11/20 09:50:39 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/11/20 09:50:39 INFO storage.MemoryStore: ensureFreeSpace(420085) called with curMem=0, maxMem=278302556
14/11/20 09:50:39 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 410.2 KB, free 265.0 MB)
14/11/20 09:50:39 INFO storage.MemoryStore: ensureFreeSpace(30564) called with curMem=420085, maxMem=278302556
14/11/20 09:50:39 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 29.8 KB, free 265.0 MB)
14/11/20 09:50:39 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.194.30.2:57707 (size: 29.8 KB, free: 265.4 MB)
14/11/20 09:50:39 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
14/11/20 09:50:39 ERROR thriftserver.SparkSQLDriver: Failed in [select * from myhivetable]
java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext
	at org.apache.hadoop.hive.cassandra.input.cql.HiveCqlInputFormat.getSplits(HiveCqlInputFormat.java:166)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.sca
tableau spark sql cassandra
Hello! I'm working on a POC with Spark SQL, where I'm trying to get data from Cassandra into Tableau using Spark SQL. Here is the stack:
- cassandra (v2.1)
- spark SQL (prebuilt v1.1 for hadoop v2.4)
- cassandra / spark sql connector (https://github.com/datastax/spark-cassandra-connector)
- hive
- mysql
- hive / mysql connector
- hive / cassandra handler (https://github.com/tuplejump/cash/tree/master/cassandra-handler)
- tableau
- tableau / spark sql connector

I get an exception in spark-sql (bin/spark-sql) when trying to query the cassandra table (java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext); it looks like a missing hadoop dependency. Showing tables and describing them work fine. Do you know how to solve this without a full Hadoop installation? Is Hive a dependency of Spark SQL?
Best, Jerome
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282.html
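Since the stack above mixes a Hadoop-2.4 Spark build with Hadoop-1-era Hive/Cassandra pieces, one quick thing to check is which Hadoop version the Spark assembly was actually built against: prebuilt assemblies encode it in the jar's filename. A small sketch, where the scratch directory and jar name stand in for ${SPARK_HOME}/lib and the real assembly:

```shell
# Prebuilt Spark assembly jars encode their target Hadoop version in
# the filename, e.g. spark-assembly-1.1.0-hadoop2.4.0.jar. Extract that
# version so a mismatch with the rest of the stack is easy to spot.
LIB=$(mktemp -d)
touch "$LIB/spark-assembly-1.1.0-hadoop2.4.0.jar"
HADOOP_VER=$(ls "$LIB"/spark-assembly-*-hadoop*.jar | sed 's/.*-hadoop\(.*\)\.jar$/\1/')
echo "Spark assembly built for Hadoop $HADOOP_VER"
```

If the extracted version starts with 2 while a component (like the Hive Cassandra handler here) was compiled against Hadoop 1.x, the JobContext class-vs-interface mismatch above is the expected failure mode.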