Great, glad it worked out!

From: Todd Nist
Date: Thursday, February 19, 2015 at 9:19 AM
To: Silvio Fiorito
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: SparkSQL + Tableau Connector

Hi Silvio,

I got this working today using your suggestion with the "Initial SQL" and a 
"Custom Query".  See here for details:

http://stackoverflow.com/questions/28403664/connect-to-existing-hive-in-intellij-using-sbt-as-build/28608608#28608608

It is not ideal since I need to write a custom query, but it does work for now.  I 
also have it working by doing a saveAsTable on the ingested data, which stores 
the reference in the metastore for access via the thrift server; a rough sketch is below.
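
For anyone following along, the saveAsTable route looks roughly like this (a 
minimal sketch against the Spark 1.2 API; the path and table name are just 
placeholders from my setup):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Ingest the JSON data, then persist the table definition into the Hive
// metastore so the thrift server (and therefore Tableau) can discover it.
val data = hiveContext.jsonFile("/data/json/*")
data.saveAsTable("test")

Once the table is in the metastore, Tableau's schema query should find it 
directly, with no Custom Query needed.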

Thanks for the help.

-Todd

On Wed, Feb 11, 2015 at 8:41 PM, Silvio Fiorito 
<silvio.fior...@granturing.com> wrote:
Hey Todd,

I don’t have an app to test against the thrift server. Are you able to define 
custom SQL without using Tableau’s schema query? I guess it’s not possible to 
just use SparkSQL temp tables; you may have to use permanent Hive tables that 
are actually in the metastore so Tableau can discover them in the schema. In 
that case you will either have to generate the Hive tables externally from 
Spark or use Spark to process the data and save them using a HiveContext.
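
For instance, something like this (an untested sketch; the table names and 
path are just placeholders) is what I mean by processing the data in Spark and 
saving it with a HiveContext:

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
val staged = hc.jsonFile("/data/json/*")  // any processing would happen here
staged.registerTempTable("staging")       // session-only, not visible to Tableau
// CTAS into a permanent Hive table that lands in the metastore:
hc.sql("CREATE TABLE test_permanent AS SELECT * FROM staging")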


From: Todd Nist
Date: Wednesday, February 11, 2015 at 7:53 PM
To: Andrew Lee
Cc: Arush Kharbanda, "user@spark.apache.org"
Subject: Re: SparkSQL + Tableau Connector

First, sorry for the long post.  So, back to Tableau and Spark SQL: I'm still 
missing something.

TL;DR

To get the Spark SQL temp table associated with the metastore, are there 
additional steps required beyond the below?

Initial SQL on connection:

create temporary table test
using org.apache.spark.sql.json
options (path '/data/json/*');

cache table test;

I feel like I'm missing a step of associating the Spark SQL table with the 
metastore. Do I need to actually save it in some fashion?  I'm trying to avoid 
saving to Hive if possible.

Details:

I configured hive-site.xml and placed it in $SPARK_HOME/conf.  It looks 
like the following; thanks to Andrew and Arush for the assistance:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>hive.semantic.analyzer.factory.impl</name>
    <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
  </property>

  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>

  <property>
    <name>hive.server2.enable.doAs</name>
    <value>true</value>
  </property>

  <!--
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the 
metastore host</description>
  </property>
  -->

  <property>
    <name>hive.warehouse.subdir.inherit.perms</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hiveuser</value>
  </property>

</configuration>

When I start the server it looks fine:

$ ./sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 \
    --hiveconf hive.server2.thrift.bind.host=radtech.io \
    --master spark://radtech.io:7077 \
    --driver-class-path /usr/local/spark/lib/mysql-connector-java-5.1.34-bin.jar
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /usr/local/spark-1.2.1-bin-hadoop2.4/logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out
radtech:spark tnist$ tail -f logs/spark-tnist-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-radtech.io.out
15/02/11 19:15:24 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20150211191524-0008/1 on hostPort 
192.168.1.2:50851 with 2 cores, 512.0 MB RAM
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: 
app-20150211191524-0008/0 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: 
app-20150211191524-0008/1 is now LOADING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: 
app-20150211191524-0008/0 is now RUNNING
15/02/11 19:15:24 INFO AppClient$ClientActor: Executor updated: 
app-20150211191524-0008/1 is now RUNNING
15/02/11 19:15:24 INFO NettyBlockTransferService: Server created on 50938
15/02/11 19:15:24 INFO BlockManagerMaster: Trying to register BlockManager
15/02/11 19:15:24 INFO BlockManagerMasterActor: Registering block manager 
192.168.1.2:50938 with 265.1 MB RAM, 
BlockManagerId(<driver>, 192.168.1.2, 50938)
15/02/11 19:15:24 INFO BlockManagerMaster: Registered BlockManager
15/02/11 19:15:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready 
for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
15/02/11 19:15:25 INFO HiveMetaStore: 0: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:25 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:26 INFO Persistence: Property 
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/02/11 19:15:26 INFO Persistence: Property datanucleus.cache.level2 unknown - 
will be ignored
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/02/11 19:15:26 WARN Connection: BoneCP specified but not present in 
CLASSPATH (or one of dependencies)
15/02/11 19:15:27 INFO ObjectStore: Setting MetaStore object pin classes with 
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: 
Actor[akka.tcp://sparkExecutor@192.168.1.2:50944/user/Executor#1008909571] with ID 0
15/02/11 19:15:28 INFO Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
15/02/11 19:15:28 INFO SparkDeploySchedulerBackend: Registered executor: 
Actor[akka.tcp://sparkExecutor@192.168.1.2:50948/user/Executor#-688434541] with ID 1
15/02/11 19:15:28 INFO Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
"embedded-only" so does not have its own datastore table.
15/02/11 19:15:28 INFO Datastore: The class 
"org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so 
does not have its own datastore table.
15/02/11 19:15:28 INFO Query: Reading in results for query 
"org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
closing
15/02/11 19:15:28 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 
192.168.1.2:50951 with 265.1 MB RAM, 
BlockManagerId(0, 192.168.1.2, 50951)
15/02/11 19:15:28 INFO BlockManagerMasterActor: Registering block manager 
192.168.1.2:50952 with 265.1 MB RAM, 
BlockManagerId(1, 192.168.1.2, 50952)
15/02/11 19:15:28 INFO HiveMetaStore: Added admin role in metastore
15/02/11 19:15:28 INFO HiveMetaStore: Added public role in metastore
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since 
config is empty
15/02/11 19:15:29 INFO SessionState: No Tez session required at this point. 
hive.execution.engine=mr.
15/02/11 19:15:29 INFO AbstractService: HiveServer2: Async execution pool size 
100
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service: SessionManager is inited.
15/02/11 19:15:29 INFO AbstractService: Service: CLIService is inited.
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is 
inited.
15/02/11 19:15:29 INFO AbstractService: Service: HiveServer2 is inited.
15/02/11 19:15:29 INFO AbstractService: Service:OperationManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:SessionManager is started.
15/02/11 19:15:29 INFO AbstractService: Service:CLIService is started.
15/02/11 19:15:29 INFO HiveMetaStore: No user is added in admin role, since 
config is empty
15/02/11 19:15:29 INFO HiveMetaStore: 0: get_databases: default
15/02/11 19:15:29 INFO audit: ugi=tnist ip=unknown-ip-addr cmd=get_databases: default
15/02/11 19:15:29 INFO HiveMetaStore: 0: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:15:29 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:15:29 INFO Query: Reading in results for query 
"org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
closing
15/02/11 19:15:29 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:15:29 INFO AbstractService: Service:ThriftBinaryCLIService is 
started.
15/02/11 19:15:29 INFO AbstractService: Service:HiveServer2 is started.
15/02/11 19:15:29 INFO HiveThriftServer2: HiveThriftServer2 started
15/02/11 19:15:29 INFO ThriftCLIService: ThriftBinaryCLIService listening on 
radtech.io/192.168.1.2:10001

When I start Tableau and use the SparkSQL (Beta) connector I see that the 
"initial SQL" is being executed:

15/02/11 19:25:35 INFO HiveMetaStore: 2: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:35 INFO Driver: OK
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'set -v'
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#18)
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Result Schema: List(#21)
15/02/11 19:25:35 INFO ThriftCLIService: Client protocol version: 
HIVE_CLI_SERVICE_PROTOCOL_V6
15/02/11 19:25:35 INFO HiveMetaStore: No user is added in admin role, since 
config is empty
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. 
hive.execution.engine=mr.
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. 
hive.execution.engine=mr.
15/02/11 19:25:35 INFO SparkExecuteStatementOperation: Running query 'use 
`default`'
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO SessionState: No Tez session required at this point. 
hive.execution.engine=mr.
15/02/11 19:25:35 INFO Driver: Concurrency mode is disabled, not creating a 
lock manager
15/02/11 19:25:35 INFO ParseDriver: Parsing command: use `default`
15/02/11 19:25:35 INFO ParseDriver: Parse Completed
15/02/11 19:25:35 INFO Driver: Semantic Analysis Completed
15/02/11 19:25:35 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, 
properties:null)
15/02/11 19:25:35 INFO Driver: Starting command: use `default`
15/02/11 19:25:35 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:35 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:35 INFO HiveMetaStore: 3: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:35 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:36 INFO Query: Reading in results for query 
"org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
closing
15/02/11 19:25:36 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:36 INFO HiveMetaStore: 3: get_database: default
15/02/11 19:25:36 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:36 INFO Driver: OK
15/02/11 19:25:36 INFO SparkExecuteStatementOperation: Running query 'create 
temporary table test
using org.apache.spark.sql.json
options (path '/data/json/*')'

....

15/02/11 19:25:38 INFO Driver: Starting command: use `default`
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:38 INFO HiveMetaStore: 4: Opening raw store with implemenation 
class:org.apache.hadoop.hive.metastore.ObjectStore
15/02/11 19:25:38 INFO ObjectStore: ObjectStore, initialize called
15/02/11 19:25:38 INFO Query: Reading in results for query 
"org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is 
closing
15/02/11 19:25:38 INFO ObjectStore: Initialized ObjectStore
15/02/11 19:25:38 INFO HiveMetaStore: 4: get_database: default
15/02/11 19:25:38 INFO audit: ugi=anonymous ip=unknown-ip-addr 
cmd=get_database: default
15/02/11 19:25:38 INFO Driver: OK
15/02/11 19:25:38 INFO SparkExecuteStatementOperation: Running query '
cache table test '
15/02/11 19:25:38 INFO MemoryStore: ensureFreeSpace(211383) called with 
curMem=101514, maxMem=278019440
15/02/11 19:25:38 INFO MemoryStore: Block broadcast_2 stored as values in 
memory (estimated size 206.4 KB, free 264.8 MB)

I see no way in Tableau to see the cached table "test".  I think I am missing a 
step of associating the generated temp table from Spark SQL with the metastore. 
 Any guidance or insights on what I'm missing here?

Thanks for the assistance.

-Todd

On Wed, Feb 11, 2015 at 3:20 PM, Andrew Lee 
<alee...@hotmail.com> wrote:
Sorry folks, it is executing Spark jobs instead of Hive jobs. I misread the 
logs since other activities were running on the cluster.

________________________________
From: alee...@hotmail.com
To: ar...@sigmoidanalytics.com; tsind...@gmail.com
CC: user@spark.apache.org
Subject: RE: SparkSQL + Tableau Connector
Date: Wed, 11 Feb 2015 11:56:44 -0800


I'm using MySQL as the metastore DB with Spark 1.2.
I simply copied hive-site.xml to /etc/spark/ and added the MySQL JDBC JAR to 
spark-env.sh in /etc/spark/; everything works fine now.

My setup looks like this.

Tableau => Spark ThriftServer2 => HiveServer2

It's talking to Tableau Desktop 8.3. Interestingly, when I query a Hive table, 
it still invokes Hive queries against HiveServer2, which runs the MR or Tez 
engine.  Is this expected?

I thought it should at least use the Catalyst engine and talk to the underlying 
HDFS, like the HiveContext API does, to pull the data into an RDD.  Did I 
misunderstand the purpose of Spark ThriftServer2?



________________________________
Date: Wed, 11 Feb 2015 16:07:40 +0530
Subject: Re: SparkSQL + Tableau Connector
From: ar...@sigmoidanalytics.com
To: tsind...@gmail.com
CC: user@spark.apache.org

Hi

I used this, though it's using an embedded driver, which is not a good 
approach. It works, though. You can configure some other metastore type as 
well; I have not tried the metastore URIs.

<configuration>


<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/opt/bigdata/spark-1.2.0/metastore_db;create=true</value>
  <description>URL for the DB</description>
</property>


<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>


<!-- <property>
  <name>hive.metastore.uris</name>
  <value>thrift://x.x.x.x:10000</value>
  <description>IP address (or fully-qualified domain name) and port of the 
metastore host</description>
</property> -->
</configuration>

On Wed, Feb 11, 2015 at 3:59 PM, Todd Nist 
<tsind...@gmail.com> wrote:
Hi Arush,

So yes, I want to create the tables through Spark SQL.  I have placed the 
hive-site.xml file inside the $SPARK_HOME/conf directory; I thought that was 
all I should need to do to have the thriftserver use it.  Perhaps my 
hive-site.xml is wrong; it currently looks like this:

<configuration>
<property>
  <name>hive.metastore.uris</name>
  <!-- Ensure that the following statement points to the Hive Metastore URI in your cluster -->
  <value>thrift://sandbox.hortonworks.com:9083</value>
  <description>URI for client to contact metastore server</description>
</property>
</configuration>

Which leads me to believe it is going to pull from the thriftserver from 
Hortonworks?  I will go look at the docs to see if this is right; it is what 
Hortonworks says to do.  Do you have an example hive-site.xml, by chance, that 
works with Spark SQL?

I am using Tableau 8.3 with the SparkSQL connector.

Thanks for the assistance.

-Todd

On Wed, Feb 11, 2015 at 2:34 AM, Arush Kharbanda 
<ar...@sigmoidanalytics.com> wrote:
BTW what tableau connector are you using?

On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda 
<ar...@sigmoidanalytics.com> wrote:
I am a little confused here: why do you want to create the tables in Hive? You 
want to create the tables in Spark SQL, right?

If you are not able to find the same tables through Tableau, then thrift is 
connecting to a different metastore than your spark-shell.

One way to specify a metastore to thrift is to provide the path to hive-site.xml 
while starting thrift, using --files hive-site.xml.

Similarly, you can specify the same metastore to your spark-submit or 
spark-shell using the same option; for example:
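
(Illustrative only; substitute your own paths and master URL.)

./sbin/start-thriftserver.sh --files /path/to/hive-site.xml --master spark://yourhost:7077
./bin/spark-shell --files /path/to/hive-site.xml --master spark://yourhost:7077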



On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist 
<tsind...@gmail.com> wrote:
Arush,

As for #2, do you mean something like this from the docs:


// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

Or did you have something else in mind?

-Todd


On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist 
<tsind...@gmail.com> wrote:
Arush,

Thank you, I will take a look at that approach in the morning.  I sort of 
figured the answer to #1 was NO and that I would need to do 2 and 3; thanks for 
clarifying it for me.

-Todd

On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda 
<ar...@sigmoidanalytics.com> wrote:

1.  Can the connector fetch or query schemaRDDs saved to Parquet or JSON 
files? NO.
2.  Do I need to do something to expose these via hive / metastore other than 
creating a table in hive? Create a table in Spark SQL to expose it via Spark SQL.
3.  Does the thriftserver need to be configured to expose these in some 
fashion (related to question 2)? You would need to configure thrift to read 
from the metastore you expect it to read from; by default it reads from the 
metastore_db directory present in the directory used to launch the thrift 
server.

On 11 Feb 2015 01:35, "Todd Nist" 
<tsind...@gmail.com> wrote:
Hi,

I'm trying to understand how, and what, the Tableau connector to SparkSQL is 
able to access.  My understanding is that it needs to connect to the 
thriftserver, but I am not sure whether it exposes parquet, json, and 
schemaRDDs, or only schemas defined in the metastore / Hive.

For example, I do the following from the spark-shell, which generates a 
schemaRDD from a CSV file and saves it as a JSON file as well as a parquet file.


import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._

val sqlContext = new SQLContext(sc)
val test = sqlContext.csvFile("/data/test.csv")
test.toJSON.saveAsTextFile("/data/out/json")
test.saveAsParquetFile("/data/out/parquet")

When I connect from Tableau, the only thing I see is the "default" schema and 
nothing in the tables section.

So my questions are:

1.  Can the connector fetch or query schemaRDDs saved to Parquet or JSON files?
2.  Do I need to do something to expose these via hive / metastore other than 
creating a table in hive?
3.  Does the thriftserver need to be configured to expose these in some 
fashion (sort of related to question 2)?

TIA for the assistance.
-Todd





--


Arush Kharbanda || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com








