Re: tableau spark sql cassandra

2014-11-21 Thread jererc
Hi!

Sure, I'll post the info I grabbed once the Cassandra tables' values appear
in Tableau.

Best,
Jerome






Re: tableau spark sql cassandra

2014-11-20 Thread jererc
Well, after many attempts, I can now successfully run the Thrift server using:

root@cdb-01:~/spark# ./sbin/start-thriftserver.sh --master spark://10.194.30.2:7077 \
  --hiveconf hive.server2.thrift.bind.host 0.0.0.0 --hiveconf hive.server2.thrift.port 1

(The command was previously failing because of the --driver-class-path $CLASSPATH
parameter, which I guess was setting spark.driver.extraClassPath.) And I can now
get the Cassandra data using beeline!
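
For the record, here's a minimal beeline session to verify it (a sketch; adjust
the port to whatever hive.server2.thrift.port the server was bound to, 10000
being the HiveServer2 default):

root@cdb-01:~/spark# ./bin/beeline -u jdbc:hive2://10.194.30.2:10000 -n root
0: jdbc:hive2://10.194.30.2:10000> show tables;
0: jdbc:hive2://10.194.30.2:10000> select * from myhivetable limit 10;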

However, the table's values all show up as null in Tableau, but that's another
problem ;)

Best,
Jerome






Re: tableau spark sql cassandra

2014-11-20 Thread jererc
I finally solved this problem: org.apache.hadoop.mapreduce.JobContext is a class
in Hadoop < 2.0 but an interface in Hadoop >= 2.0, so I have to use a Spark
build for Hadoop v1.

So spark-sql now seems fine, but the Thrift server does not work with my config!

Here is my spark-env.sh:

#!/usr/bin/env bash
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export SPARK_HOME=/home/jererc/spark
# Standalone cluster: master on 10.194.30.2:7077, 4 workers x 2 cores / 4g each
export SPARK_MASTER_IP=10.194.30.2
export SPARK_WORKER_CORES=2
export SPARK_WORKER_INSTANCES=4
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_MEMORY=4g
export MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}
# Colon-separated classpath built from every jar in $SPARK_HOME/lib
export CLASSPATH=$(echo ${SPARK_HOME}/lib/*.jar | sed 's/ /:/g'):$CLASSPATH
# Deprecated since Spark 1.0 (this is what triggers the SparkConf warning below)
export SPARK_CLASSPATH=$CLASSPATH

Here is the output:

root@cdb-01:~/spark# ./sbin/start-thriftserver.sh --master spark://10.194.30.2:7077 \
  --driver-class-path $CLASSPATH \
  --hiveconf hive.server2.thrift.bind.host 0.0.0.0 --hiveconf hive.server2.thrift.port 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/jererc/spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/jererc/spark/lib/spark-examples-1.1.0-hadoop1.0.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/11/20 14:55:35 INFO thriftserver.HiveThriftServer2: Starting SparkContext
14/11/20 14:55:35 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to '/home/jererc/spark/lib/cassandra-all-1.2.9.jar:/home/jererc/spark/lib/cassandra-thrift-1.2.9.jar:/home/jererc/spark/lib/datanucleus-api-jdo-3.2.1.jar:/home/jererc/spark/lib/datanucleus-core-3.2.2.jar:/home/jererc/spark/lib/datanucleus-rdbms-3.2.1.jar:/home/jererc/spark/lib/hadoop-core-0.20.205.0.jar:/home/jererc/spark/lib/hive-cassandra-1.2.9.jar:/home/jererc/spark/lib/mysql-connector-java.jar:/home/jererc/spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar:/home/jererc/spark/lib/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar:/home/jererc/spark/lib/spark-examples-1.1.0-hadoop1.0.4.jar:').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath

14/11/20 14:55:35 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to '/home/jererc/spark/lib/cassandra-all-1.2.9.jar:/home/jererc/spark/lib/cassandra-thrift-1.2.9.jar:/home/jererc/spark/lib/datanucleus-api-jdo-3.2.1.jar:/home/jererc/spark/lib/datanucleus-core-3.2.2.jar:/home/jererc/spark/lib/datanucleus-rdbms-3.2.1.jar:/home/jererc/spark/lib/hadoop-core-0.20.205.0.jar:/home/jererc/spark/lib/hive-cassandra-1.2.9.jar:/home/jererc/spark/lib/mysql-connector-java.jar:/home/jererc/spark/lib/spark-assembly-1.1.0-hadoop1.0.4.jar:/home/jererc/spark/lib/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar:/home/jererc/spark/lib/spark-examples-1.1.0-hadoop1.0.4.jar:' as a work-around.
Exception in thread "main" org.apache.spark.SparkException: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.
        at org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:300)
        at org.apache.spark.SparkConf$$anonfun$validateSettings$5$$anonfun$apply$6.apply(SparkConf.scala:298)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:298)
        at org.apache.spark.SparkConf$$anonfun$validateSettings$5.apply(SparkConf.scala:286)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:286)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:158)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:36)
        at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:57)
        at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


And if I don't use SPARK_CLASSPATH, then spark-sql does not work. I tried
ADD_JARS without much success.

What's the best way to set the CLASSPATH and the jars?
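
(The deprecation warning above actually names the replacement: drop
SPARK_CLASSPATH / CLASSPATH from spark-env.sh and pass the jars explicitly on
the submit line. A sketch of what I understand that to mean, untested with this
exact setup:

root@cdb-01:~/spark# ./sbin/start-thriftserver.sh --master spark://10.194.30.2:7077 \
  --driver-class-path "$(echo /home/jererc/spark/lib/*.jar | sed 's/ /:/g')" \
  --conf spark.executor.extraClassPath="$(echo /home/jererc/spark/lib/*.jar | sed 's/ /:/g')"

--driver-class-path covers the driver JVM and spark.executor.extraClassPath
covers the executors — which assumes the jars sit at the same absolute paths on
every node.)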




--
View this message in context: 
http://apache-spark-user-li

Re: tableau spark sql cassandra

2014-11-20 Thread jererc
Hi!

The Hive table is an external table, which I created like this:

CREATE EXTERNAL TABLE MyHiveTable
  ( id int, data string )
STORED BY 'org.apache.hadoop.hive.cassandra.cql.CqlStorageHandler'
TBLPROPERTIES ( "cassandra.host" = "10.194.30.2",
                "cassandra.ks.name" = "test",
                "cassandra.cf.name" = "mytable",
                "cassandra.ks.repfactor" = "1",
                "cassandra.ks.strategy" = "org.apache.cassandra.locator.SimpleStrategy" );


Here is the output from spark-sql for different commands:

spark-sql> show tables;
14/11/20 09:50:32 INFO parse.ParseDriver: Parsing command: show tables
14/11/20 09:50:32 INFO parse.ParseDriver: Parse Completed
14/11/20 09:50:32 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/11/20 09:50:32 INFO parse.ParseDriver: Parsing command: show tables
14/11/20 09:50:32 INFO parse.ParseDriver: Parse Completed
14/11/20 09:50:32 INFO ql.Driver: Semantic Analysis Completed
14/11/20 09:50:32 INFO exec.ListSinkOperator: Initializing Self 0 OP
14/11/20 09:50:32 INFO exec.ListSinkOperator: Operator 0 OP initialized
14/11/20 09:50:32 INFO exec.ListSinkOperator: Initialization Done 0 OP
14/11/20 09:50:32 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:tab_name, type:string, comment:from deserializer)], properties:null)
14/11/20 09:50:32 INFO ql.Driver: Starting command: show tables
OK
14/11/20 09:50:32 INFO ql.Driver: OK
14/11/20 09:50:32 INFO mapred.FileInputFormat: Total input paths to process : 1
myhivetable
Time taken: 0.088 seconds
14/11/20 09:50:32 INFO CliDriver: Time taken: 0.088 seconds
spark-sql> describe myhivetable;
14/11/20 09:50:35 INFO parse.ParseDriver: Parsing command: describe myhivetable
14/11/20 09:50:35 INFO parse.ParseDriver: Parse Completed
id      int     from deserializer
data    string  from deserializer
Time taken: 0.226 seconds
14/11/20 09:50:35 INFO CliDriver: Time taken: 0.226 seconds
spark-sql> select * from myhivetable;
14/11/20 09:50:39 INFO parse.ParseDriver: Parsing command: select * from myhivetable
14/11/20 09:50:39 INFO parse.ParseDriver: Parse Completed
14/11/20 09:50:39 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/11/20 09:50:39 INFO storage.MemoryStore: ensureFreeSpace(420085) called with curMem=0, maxMem=278302556
14/11/20 09:50:39 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 410.2 KB, free 265.0 MB)
14/11/20 09:50:39 INFO storage.MemoryStore: ensureFreeSpace(30564) called with curMem=420085, maxMem=278302556
14/11/20 09:50:39 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 29.8 KB, free 265.0 MB)
14/11/20 09:50:39 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.194.30.2:57707 (size: 29.8 KB, free: 265.4 MB)
14/11/20 09:50:39 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
14/11/20 09:50:39 ERROR thriftserver.SparkSQLDriver: Failed in [select * from myhivetable]
java.lang.InstantiationError: org.apache.hadoop.mapreduce.JobContext
        at org.apache.hadoop.hive.cassandra.input.cql.HiveCqlInputFormat.getSplits(HiveCqlInputFormat.java:166)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.sca

tableau spark sql cassandra

2014-11-19 Thread jererc
Hello!

I'm working on a POC where I'm trying to get data from Cassandra into Tableau
through Spark SQL.

Here is the stack:
- Cassandra (v2.1)
- Spark SQL (prebuilt v1.1 for Hadoop v2.4)
- Cassandra / Spark SQL connector (https://github.com/datastax/spark-cassandra-connector)
- Hive
- MySQL
- Hive / MySQL connector
- Hive / Cassandra handler (https://github.com/tuplejump/cash/tree/master/cassandra-handler)
- Tableau
- Tableau / Spark SQL connector

I get an exception in spark-sql (bin/spark-sql) when trying to query the
Cassandra table (java.lang.InstantiationError:
org.apache.hadoop.mapreduce.JobContext); it looks like a missing Hadoop
dependency. Showing or describing the tables works fine.
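
In case it helps with the diagnosis: for the prebuilt distributions, the
targeted Hadoop version shows up in the assembly jar name (a quick check; the
exact file name depends on the build):

root@cdb-01:~/spark# ls lib/spark-assembly-*
lib/spark-assembly-1.1.0-hadoop2.4.0.jar

so this build targets the Hadoop 2 API.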

Do you know how to solve this without Hadoop?
Is Hive a dependency of Spark SQL?

Best,
Jerome



