Hi All,

Currently, if you run the Spark HiveContext API against Hive 0.12, it won't work, because the following two libraries are not consistent between Spark, Hive 0.12, and Hadoop. (The Hive libs align with the Hadoop libs, and as a common practice they should be kept consistent to stay interoperable.) This is under discussion in these two JIRA tickets:

https://issues.apache.org/jira/browse/HIVE-7387
https://issues.apache.org/jira/browse/SPARK-2420

When I ran the commands after tweaking the classpath and the build for Spark 1.0.1-rc3, I was able to create a table through HiveContext; however, when I fetch the data it breaks, due to incompatible API calls in Guava. This is critical, since Spark needs Guava here to map the columns to the RDD schema. Hive and Hadoop use an older version of the Guava libraries (11.0.1), whereas Spark's Hive support uses Guava 14.0.1+. The Spark community isn't willing to downgrade to 11.0.1, which is the current version for Hadoop 2.2 and Hive 0.12. Be aware of the protobuf version as well with Hive 0.12 (it uses protobuf 2.4). A quick way to confirm which Guava jar actually wins on the classpath is sketched after the stack trace below.

scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@34bee01a

scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
res0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:104
== Query Plan ==
<Native command: executed by Hive>

scala> hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:104
== Query Plan ==
<Native command: executed by Hive>

scala> // Queries are expressed in HiveQL
scala> hiveContext.hql("FROM src SELECT key, value").collect().foreach(println)
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
        at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
        at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
        at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
        at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
        at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
        at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
        at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
        at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
        at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:75)
        at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:92)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:661)
        at org.apache.spark.storage.BlockManager.put(BlockManager.scala:546)
        at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:812)
        at org.apache.spark.broadcast.HttpBroadcast.<init>(HttpBroadcast.scala:52)
        at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:35)
        at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:29)
        at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
        at org.apache.spark.SparkContext.broadcast(SparkContext.scala:776)
        at org.apache.spark.sql.hive.HadoopTableReader.<init>(TableReader.scala:60)
        at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:70)
        at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
        at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
        at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:280)
        at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:69)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
        at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:316)
        at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:316)
        at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:319)
        at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:319)
        at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:420)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
        at $iwC$$iwC$$iwC.<init>(<console>:28)
        at $iwC$$iwC.<init>(<console>:30)
        at $iwC.<init>(<console>:32)
        at <init>(<console>:34)
        at .<init>(<console>:38)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
        at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
        at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
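In case it helps anyone debugging the same NoSuchMethodError: below is a minimal, illustrative check (not part of my original setup) that you can paste into the same spark-shell session to see which jar actually provides com.google.common.hash.HashFunction and whether it has the hashInt(int) overload that Spark calls. The class and method names come straight from the stack trace above; the rest is plain Java reflection.

// Locate the jar the driver classloader resolves HashFunction from.
val hashFnClass = Class.forName("com.google.common.hash.HashFunction")
println(hashFnClass.getProtectionDomain.getCodeSource.getLocation)

// Guava 11.0.x does not have hashInt(int); Guava 12+ does, and the stack
// trace shows Spark's OpenHashSet relying on it. This prints which case applies.
try {
  hashFnClass.getMethod("hashInt", classOf[Int])
  println("hashInt(int) found -- the Guava on the classpath is new enough for Spark")
} catch {
  case _: NoSuchMethodException =>
    println("hashInt(int) missing -- an older Guava (e.g. 11.0.1 from Hive/Hadoop) is shadowing Spark's")
}

If the printed location points into the Hive or Hadoop lib directory, the older Guava is being picked up first, which matches the failure above.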
> From: hao.ch...@intel.com
> To: user@spark.apache.org; u...@spark.incubator.apache.org
> Subject: RE: Hive From Spark
> Date: Mon, 21 Jul 2014 01:14:19 +0000
>
> JiaJia, I've checked out the latest 1.0 branch and then done the following steps:
> SPARK_HIVE=true sbt/sbt clean assembly
> cd examples
> ../bin/run-example sql.hive.HiveFromSpark
>
> It works well on my local machine.
>
> Your log output shows "Invalid method name: 'get_table'", which suggests an
> incompatible jar version or something wrong between the Hive metastore
> service and client. Can you double check the jar versions of your Hive
> metastore service and Thrift?
>
>
> -----Original Message-----
> From: JiajiaJing [mailto:jj.jing0...@gmail.com]
> Sent: Saturday, July 19, 2014 7:29 AM
> To: u...@spark.incubator.apache.org
> Subject: RE: Hive From Spark
>
> Hi Cheng Hao,
>
> Thank you very much for your reply.
>
> Basically, the program runs on Spark 1.0.0 and Hive 0.12.0.
>
> Some setup of the environment was done by running "SPARK_HIVE=true sbt/sbt
> assembly/assembly", including the jar on all the workers, and copying
> hive-site.xml to Spark's conf dir.
>
> I then run the program as: "./bin/run-example
> org.apache.spark.examples.sql.hive.HiveFromSpark"
>
> It's good to know that this example runs well on your machine; could you
> please give me some insight into what you have done as well?
>
> Thank you very much!
>
> Jiajia
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110p10215.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
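On Cheng Hao's suggestion to double-check the metastore/Thrift jar versions: a similar classpath probe can at least show where the client side is loading its Thrift and protobuf classes from. This is only a sketch; the two class names below are just representative entry points from those libraries, not anything specific to this setup, and a client/server mismatch still has to be confirmed against the jars the metastore service itself runs with.

// Print the jar each of these well-known classes is loaded from; a mismatch
// between what the Hive client loads and what the metastore service runs with
// can surface as errors like "Invalid method name: 'get_table'".
for (name <- Seq("org.apache.thrift.TBase", "com.google.protobuf.Message")) {
  val c = Class.forName(name)
  val loc = Option(c.getProtectionDomain.getCodeSource).map(_.getLocation)
  println(s"$name -> ${loc.getOrElse("(unknown location)")}")
}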