Hi!

I created an EMR cluster with Spark and HBase according to
http://aws.amazon.com/articles/4926593393724923, using the --hbase flag to
include HBase. While Spark and Shark both work nicely with the provided S3
examples, there is a problem with external tables pointing to the HBase
instance.
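
For reference, the cluster was launched with the elastic-mapreduce CLI along
the lines of the article. Roughly like this, where the instance type and count
are placeholders rather than our exact values, and the bootstrap action is the
Spark/Shark install script the article points to:

./elastic-mapreduce --create --alive --name "spark-hbase" \
  --hbase \
  --instance-type m1.xlarge --num-instances 3 \
  --bootstrap-action <install-spark-shark bootstrap action from the article>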

We create the following external table with Shark:

CREATE EXTERNAL TABLE oh (id STRING, name STRING, title STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.zookeeper.quorum" = "172.31.13.161",
  "hbase.zookeeper.property.clientPort" = "2181",
  "hbase.columns.mapping" = ":key,o:OH_Name,o:OH_Title")
TBLPROPERTIES ("hbase.table.name" = "objects");

The objects table exists in HBase and contains the columns mapped in the DDL,
and ZooKeeper for HBase is running on the specified host and port.
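
For illustration, this is the kind of check that confirms it, from the hbase
shell on the master:

  $ hbase shell
  hbase(main):001:0> describe 'objects'
  hbase(main):002:0> scan 'objects', {LIMIT => 1}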

CREATE TABLE oh_cached AS SELECT * FROM oh fails with the following error:

org.apache.spark.SparkException: Job aborted: Task 11.0:0 failed more than 4 times
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)

The log files of the Spark workers are almost empty; however, the stage
information in the Spark web console reveals additional hints:

 0  4  FAILED  NODE_LOCAL  ip-172-31-10-246.ec2.internal  2014/03/05 13:38:20
java.lang.IllegalStateException (java.lang.IllegalStateException: unread block data)
        java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2420)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1380)
        java.io.ObjectInputStream.skipCustomData(ObjectInputStream.java:1954)
        java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1848)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
        org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:61)
        org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:199)
        org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:50)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:724)
