Hi,

This Guava dependency conflict problem should have been fixed as of yesterday, according to https://issues.apache.org/jira/browse/SPARK-2420
However, I just got java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; from the following code snippet when running "mvn3 test" on a Mac. I built the latest version of Spark (1.1.0-SNAPSHOT) and installed the jars into my local Maven repo. In my pom file I explicitly excluded Guava from almost every dependency that could pull it in, such as spark-hive_2.10 (1.1.0-SNAPSHOT) and hadoop-client. This snippet is extracted from a larger project, so the pom.xml includes many dependencies, not all of which are required by the snippet. The pom.xml is attached. Does anybody know how to fix this?

Thanks,
Du

-------

package com.myself.test

import org.scalatest._
import org.apache.hadoop.io.{NullWritable, BytesWritable}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

class MyRecord(name: String) extends Serializable {
  def getWritable(): BytesWritable = {
    new BytesWritable(Option(name).getOrElse("\\N").toString.getBytes("UTF-8"))
  }

  final override def equals(that: Any): Boolean = {
    if (!that.isInstanceOf[MyRecord]) false
    else {
      val other = that.asInstanceOf[MyRecord]
      this.getWritable == other.getWritable
    }
  }
}

class MyRecordTestSuite extends FunSuite {
  // construct a MyRecord by Consumer.schema
  val rec: MyRecord = new MyRecord("James Bond")

  test("generated SequenceFile should be readable from spark") {
    val path = "./testdata/"
    val conf = new SparkConf(false).setMaster("local").setAppName("test data exchange with Hive")
    conf.set("spark.driver.host", "localhost")
    val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(Seq(rec))
    rdd.map((x: MyRecord) => (NullWritable.get(), x.getWritable()))
      .saveAsSequenceFile(path)
    val bytes = sc.sequenceFile(path, classOf[NullWritable], classOf[BytesWritable]).first._2
    assert(rec.getWritable() == bytes)
    sc.stop()
    System.clearProperty("spark.driver.port")
  }
}

From: Andrew Lee <alee...@hotmail.com>
Reply-To: "user@spark.apache.org" <user@spark.apache.org>
Date: Monday, July 21, 2014 at 10:27 AM
To: "user@spark.apache.org" <user@spark.apache.org>, "u...@spark.incubator.apache.org" <u...@spark.incubator.apache.org>
Subject: RE: Hive From Spark

Hi All,

Currently, if you run the Spark HiveContext API with Hive 0.12, it won't work because of the following two libraries, which are not consistent with Hive 0.12 and Hadoop (Hive libs align with Hadoop libs and, as a common practice, they need to be consistent to interoperate). These are under discussion in two JIRA tickets:

https://issues.apache.org/jira/browse/HIVE-7387
https://issues.apache.org/jira/browse/SPARK-2420

When I ran the command after tweaking the classpath and the build for Spark 1.0.1-rc3, I was able to create a table through HiveContext; however, when I fetched the data it broke due to incompatible API calls in Guava. This is critical, since it is needed to map the columns to the RDD schema. Hive and Hadoop use an older version of the Guava libraries (11.0.1), while Spark Hive uses Guava 14.0.1+. The community isn't willing to downgrade to 11.0.1, which is the current version for Hadoop 2.2 and Hive 0.12. Be aware of the protobuf version in Hive 0.12 as well (it uses protobuf 2.4).
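For reference, a minimal standalone sketch of the API involved: HashFunction.hashInt(int) resolves against Guava 14.0.1 (which Spark Hive uses) but not against the Guava 11.0.1 that Hadoop 2.2 and Hive 0.12 pull in, so whichever Guava jar ends up first on the classpath decides whether the call works. The object name, the choice of murmur3, and the literal value below are illustrative only, not taken from Spark's code.

import com.google.common.hash.Hashing

object GuavaHashIntCheck {
  def main(args: Array[String]): Unit = {
    // Compiles against Guava 14.x; if an older Guava (e.g. 11.0.1) is first on the
    // runtime classpath, this line fails at runtime with the same error seen in this thread:
    // java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
    val code = Hashing.murmur3_32().hashInt(42)
    println(code.asInt())
  }
}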
scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive._

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@34bee01a

scala> hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
res0: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:104
== Query Plan ==
<Native command: executed by Hive>

scala> hiveContext.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
res1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[3] at RDD at SchemaRDD.scala:104
== Query Plan ==
<Native command: executed by Hive>

scala> // Queries are expressed in HiveQL
scala> hiveContext.hql("FROM src SELECT key, value").collect().foreach(println)
java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
    at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
    at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:169)
    at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:161)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:155)
    at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:75)
    at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:92)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:661)
    at org.apache.spark.storage.BlockManager.put(BlockManager.scala:546)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:812)
    at org.apache.spark.broadcast.HttpBroadcast.<init>(HttpBroadcast.scala:52)
    at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:35)
    at org.apache.spark.broadcast.HttpBroadcastFactory.newBroadcast(HttpBroadcastFactory.scala:29)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:776)
    at org.apache.spark.sql.hive.HadoopTableReader.<init>(TableReader.scala:60)
    at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:70)
    at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
    at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$4.apply(HiveStrategies.scala:73)
    at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:280)
    at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:69)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:316)
    at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:316)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:319)
    at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:319)
    at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:420)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
    at $iwC$$iwC$$iwC.<init>(<console>:28)
    at $iwC$$iwC.<init>(<console>:30)
    at $iwC.<init>(<console>:32)
    at <init>(<console>:34)
    at .<init>(<console>:38)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
    at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

> From: hao.ch...@intel.com
> To: user@spark.apache.org; u...@spark.incubator.apache.org
> Subject: RE: Hive From Spark
> Date: Mon, 21 Jul 2014 01:14:19 +0000
>
> JiaJia, I've checked out the latest 1.0 branch and then done the following steps:
> SPARK_HIVE=true sbt/sbt clean assembly
> cd examples
> ../bin/run-example sql.hive.HiveFromSpark
>
> It works well on my local machine.
>
> Your log output shows "Invalid method name: 'get_table'", which suggests an
> incompatible jar version or something wrong between the Hive metastore
> service and the client. Can you double-check the jar versions of the Hive
> metastore service or thrift?
>
>
> -----Original Message-----
> From: JiajiaJing [mailto:jj.jing0...@gmail.com]
> Sent: Saturday, July 19, 2014 7:29 AM
> To: u...@spark.incubator.apache.org
> Subject: RE: Hive From Spark
>
> Hi Cheng Hao,
>
> Thank you very much for your reply.
>
> Basically, the program runs on Spark 1.0.0 and Hive 0.12.0.
>
> Some of the environment setup was done by running "SPARK_HIVE=true sbt/sbt
> assembly/assembly", including the jar on all the workers, and copying
> hive-site.xml to Spark's conf dir.
>
> The program is then run as: "./bin/run-example
> org.apache.spark.examples.sql.hive.HiveFromSpark"
>
> It's good to know that this example runs well on your machine. Could you
> please give me some insight into what you have done as well?
>
> Thank you very much!
>
> Jiajia
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Hive-From-Spark-tp10110p10215.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
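Both replies above come down to checking which jar versions actually end up on the runtime classpath: Guava for the hashInt error, and the Hive metastore client/thrift jars for the 'get_table' error. Below is a small diagnostic sketch for that kind of check; the helper name and the example class names are illustrative only, and getCodeSource can return null for boot-classpath classes.

object ClasspathCheck {
  // Print which jar (or directory) a class was actually loaded from, to verify
  // that the expected Guava / Hive versions win on the runtime classpath.
  def whereIs(className: String): Unit = {
    val cls = Class.forName(className)
    val loc = Option(cls.getProtectionDomain.getCodeSource).map(_.getLocation.toString)
    println(s"$className -> ${loc.getOrElse("<boot classpath>")}")
  }

  def main(args: Array[String]): Unit = {
    whereIs("com.google.common.hash.HashFunction")                   // which Guava wins
    whereIs("org.apache.hadoop.hive.metastore.HiveMetaStoreClient")  // which Hive metastore client
  }
}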
[Attachment: pom.xml]