I'll add that there is an experimental method that allows you to start the JDBC server with an existing HiveContext (which might have registered temporary tables):
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L42
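A rough sketch of how that entry point could be used from an application (the data source, path, app name, and table name are just illustrative; it assumes a Spark build with Hive support):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Build a HiveContext and register a temporary table in this process.
val sc = new SparkContext(new SparkConf().setAppName("jdbc-with-temp-tables"))
val hiveContext = new HiveContext(sc)
hiveContext.jsonFile("/tmp/countries.json").registerTempTable("countries")

// Experimental entry point (linked above): starts the Thrift/JDBC server
// inside this same process, sharing this HiveContext - and therefore its
// temporary tables - with beeline/JDBC clients.
HiveThriftServer2.startWithContext(hiveContext)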
On Thu, Dec 11, 2014 at 6:52 AM, Denny Lee <denny.g....@gmail.com> wrote:

> Yes, that is correct. A quick reference on this is the post
> https://www.linkedin.com/pulse/20141007143323-732459-an-absolutely-unofficial-way-to-connect-tableau-to-sparksql-spark-1-1?_mSplash=1
> with the pertinent section being:
>
> It is important to note that when you create Spark tables (for example,
> via .registerTempTable), these operate within the Spark environment, which
> resides in a separate process from the Hive Metastore. This means that,
> currently, tables created within the Spark context are not available
> through the Thrift server. To work around this, save your temporary table
> into Hive from within the Spark context - then the Spark Thrift Server
> will be able to see the table.
>
> HTH!
>
> On Thu, Dec 11, 2014 at 04:09 Anas Mosaad <anas.mos...@incorta.com> wrote:
>
>> Actually, I came to the conclusion that RDDs have to be persisted in Hive
>> in order to be accessible through Thrift.
>> Hope I didn't end up with an incorrect conclusion.
>> Please, someone correct me if I am wrong.
>> On Dec 11, 2014 8:53 AM, "Judy Nash" <judyn...@exchange.microsoft.com>
>> wrote:
>>
>>> Looks like you are wondering why you cannot see the RDD table you have
>>> created via Thrift?
>>>
>>> Based on my own experience with Spark 1.1, an RDD created directly via
>>> Spark SQL (i.e. the Spark shell or spark-sql.sh) is not visible to Thrift,
>>> since Thrift has its own session containing its own RDDs.
>>>
>>> Spark SQL experts on the forum can confirm this, though.
>>>
>>> *From:* Cheng Lian [mailto:lian.cs....@gmail.com]
>>> *Sent:* Tuesday, December 9, 2014 6:42 AM
>>> *To:* Anas Mosaad
>>> *Cc:* Judy Nash; user@spark.apache.org
>>> *Subject:* Re: Spark-SQL JDBC driver
>>>
>>> According to the stack trace, you were still using SQLContext rather than
>>> HiveContext. To interact with Hive, HiveContext *must* be used.
>>>
>>> Please refer to this page:
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
>>>
>>> On 12/9/14 6:26 PM, Anas Mosaad wrote:
>>>
>>> Back to the first question: does this mandate that Hive is up and
>>> running?
>>>
>>> When I try it, I get the following exception. The documentation says
>>> that this method works only on a SchemaRDD. I thought that was the reason
>>> countries.saveAsTable did not work, so I created a tmp RDD that contains
>>> the results from the registered temp table, which I could validate is a
>>> SchemaRDD, as shown below.
>>>
>>> *@Judy,* I really do appreciate your kind support and I want to
>>> understand, and of course don't want to waste your time. If you can
>>> direct me to the documentation describing these details, that would be
>>> great.
>>>
>>> scala> val tmp = sqlContext.sql("select * from countries")
>>> tmp: org.apache.spark.sql.SchemaRDD =
>>> SchemaRDD[12] at RDD at SchemaRDD.scala:108
>>> == Query Plan ==
>>> == Physical Plan ==
>>> PhysicalRDD [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36
>>>
>>> scala> tmp.saveAsTable("Countries")
>>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved plan found, tree:
>>> 'CreateTableAsSelect None, Countries, false, None
>>>  Project [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29]
>>>   Subquery countries
>>>    LogicalRDD [COUNTRY_ID#20,COUNTRY_ISO_CODE#21,COUNTRY_NAME#22,COUNTRY_SUBREGION#23,COUNTRY_SUBREGION_ID#24,COUNTRY_REGION#25,COUNTRY_REGION_ID#26,COUNTRY_TOTAL#27,COUNTRY_TOTAL_ID#28,COUNTRY_NAME_HIST#29], MapPartitionsRDD[9] at mapPartitions at ExistingRDD.scala:36
>>>
>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83)
>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:78)
>>> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
>>> at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:78)
>>> at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:76)
>>> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
>>> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
>>> at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>>> at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>>> at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
>>> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
>>> at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
>>> at scala.collection.immutable.List.foreach(List.scala:318)
>>> at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
>>> at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
>>> at org.apache.spark.sql.SchemaRDDLike$class.saveAsTable(SchemaRDDLike.scala:126)
>>> at org.apache.spark.sql.SchemaRDD.saveAsTable(SchemaRDD.scala:108)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
>>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
>>> at $iwC$$iwC$$iwC.<init>(<console>:29)
>>> at $iwC$$iwC.<init>(<console>:31)
>>> at $iwC.<init>(<console>:33)
>>> at <init>(<console>:35)
>>> at .<init>(<console>:39)
>>> at .<clinit>(<console>)
>>> at .<init>(<console>:7)
>>> at .<clinit>(<console>)
>>> at $print(<console>)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
>>> at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
>>> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
>>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
>>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
>>> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
>>> at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
>>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
>>> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
>>> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
>>> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
>>> at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
>>> at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
>>> at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
>>> at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
>>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
>>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>>> at org.apache.spark.repl.Main.main(Main.scala)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:365)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
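A minimal sketch of the HiveContext route described in the replies below (the CSV path, parsing, and case class are illustrative, and it assumes spark-shell was built with Hive support so that a metastore-backed HiveContext is available):

import org.apache.spark.sql.hive.HiveContext

// A tiny illustrative schema; the real countries table has more columns.
case class Country(countryId: Int, countryIsoCode: String, countryName: String)

val hiveContext = new HiveContext(sc)
import hiveContext._

val countriesRdd = sc.textFile("/tmp/countries.csv")
  .map(_.split(","))
  .map(r => Country(r(0).toInt, r(1), r(2)))

// registerTempTable keeps the table inside this spark-shell process only ...
countriesRdd.registerTempTable("countries")

// ... while saveAsTable on a HiveContext-backed SchemaRDD persists it through
// the Hive metastore, so the Thrift/JDBC server and beeline can see it.
hiveContext.sql("SELECT * FROM countries").saveAsTable("countries_persisted")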
>>>
>>> On Tue, Dec 9, 2014 at 11:44 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>>
>>> How did you register the table under spark-shell? Two things to notice:
>>>
>>> 1. To interact with Hive, HiveContext instead of SQLContext must be used.
>>> 2. `registerTempTable` doesn't persist the table into the Hive metastore,
>>> and the table is lost after quitting spark-shell. Instead, you must use
>>> `saveAsTable`.
>>>
>>> On 12/9/14 5:27 PM, Anas Mosaad wrote:
>>>
>>> Thanks Cheng,
>>>
>>> I thought spark-sql was using the same exact metastore, right? However,
>>> it didn't work as expected. Here's what I did:
>>>
>>> In spark-shell, I loaded a CSV file and registered the table, say
>>> countries.
>>> Started the Thrift server.
>>> Connected using beeline. When I run show tables or !tables, I get an
>>> empty list of tables, as follows:
>>>
>>> *0: jdbc:hive2://localhost:10000> !tables*
>>> *+------------+--------------+-------------+-------------+----------+*
>>> *| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME  | TABLE_TYPE  | REMARKS  |*
>>> *+------------+--------------+-------------+-------------+----------+*
>>> *+------------+--------------+-------------+-------------+----------+*
>>> *0: jdbc:hive2://localhost:10000> show tables ;*
>>> *+---------+*
>>> *| result  |*
>>> *+---------+*
>>> *+---------+*
>>> *No rows selected (0.106 seconds)*
>>> *0: jdbc:hive2://localhost:10000>*
>>>
>>> Kindly advise: what am I missing? I want to read the RDD using SQL from
>>> outside spark-shell (i.e. like any other relational database).
>>>
>>> On Tue, Dec 9, 2014 at 11:05 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>>
>>> Essentially, the Spark SQL JDBC Thrift server is just a Spark port of
>>> HiveServer2. You don't need to run Hive, but you do need a working
>>> metastore.
>>>
>>> On 12/9/14 3:59 PM, Anas Mosaad wrote:
>>>
>>> Thanks Judy, this is exactly what I'm looking for. However, and please
>>> forgive me if it's a dumb question: it seems to me that Thrift is the
>>> same as the hive2 JDBC driver, so does this mean that starting Thrift
>>> will start Hive as well on the server?
>>>
>>> On Mon, Dec 8, 2014 at 9:11 PM, Judy Nash <judyn...@exchange.microsoft.com> wrote:
>>>
>>> You can use the Thrift server for this purpose, then test it with beeline.
>>>
>>> See doc:
>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
>>>
>>> *From:* Anas Mosaad [mailto:anas.mos...@incorta.com]
>>> *Sent:* Monday, December 8, 2014 11:01 AM
>>> *To:* user@spark.apache.org
>>> *Subject:* Spark-SQL JDBC driver
>>>
>>> Hello Everyone,
>>>
>>> I'm brand new to Spark and was wondering if there's a JDBC driver to
>>> access Spark SQL directly. I'm running Spark in standalone mode and don't
>>> have Hadoop in this environment.
>>>
>>> --
>>>
>>> *Best Regards/أطيب المنى,*
>>>
>>> *Anas Mosaad*
>>> *Incorta Inc.*
>>> *+20-100-743-4510*
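For completeness, once a table has been saved through the metastore and the Thrift server is running, a plain JDBC client can query it. A minimal sketch (host, port, empty credentials, and the countries table are illustrative; it assumes the Hive JDBC driver and its dependencies are on the classpath):

import java.sql.DriverManager

// The Spark Thrift server speaks the HiveServer2 protocol, so the standard
// Hive JDBC driver is used to connect to it.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
try {
  val rs = conn.createStatement().executeQuery("SELECT COUNTRY_NAME FROM countries LIMIT 10")
  while (rs.next()) {
    println(rs.getString(1))
  }
} finally {
  conn.close()
}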