Hi! I created an EMR cluster with Spark and HBase according to http://aws.amazon.com/articles/4926593393724923, using the --hbase flag to include HBase. Spark and Shark both work nicely with the provided S3 examples, but there is a problem with external tables pointing to the HBase instance.
We create the following external table with Shark:

    CREATE EXTERNAL TABLE oh (id STRING, name STRING, title STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES (
      "hbase.zookeeper.quorum" = "172.31.13.161",
      "hbase.zookeeper.property.clientPort" = "2181",
      "hbase.columns.mapping" = ":key,o:OH_Name,o:OH_Title"
    )
    TBLPROPERTIES ("hbase.table.name" = "objects")

The objects table exists and has all columns as defined in the DDL. The ZooKeeper for HBase is running on the specified host and port.

CREATE TABLE oh_cached AS SELECT * FROM oh fails with the following error:

    org.apache.spark.SparkException: Job aborted: Task 11.0:0 failed more than 4 times
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:827)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:825)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:825)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:440)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:502)
        at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:157)

The log files of the Spark workers are almost empty; the stages view in the Spark web console, however, reveals an additional hint:

    0  4  FAILED  NODE_LOCAL  ip-172-31-10-246.ec2.internal  2014/03/05 13:38:20
    java.lang.IllegalStateException (java.lang.IllegalStateException: unread block data)
        java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2420)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1380)
        java.io.ObjectInputStream.skipCustomData(ObjectInputStream.java:1954)
        java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1848)
        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1794)
        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
        java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
        org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:61)
        org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:199)
        org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:50)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:724)
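For reference, the kind of probe used to confirm that the table and ZooKeeper are reachable is sketched below. It is a minimal Scala snippet against the plain HBase client API, run from a Scala shell on the master node; only the quorum address, client port, table name, column family o and the qualifiers OH_Name/OH_Title are taken from the DDL above, everything else is just an illustrative sketch.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Scan}
    import org.apache.hadoop.hbase.util.Bytes

    // Point the client at the same ZooKeeper quorum the Shark DDL uses.
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "172.31.13.161")
    conf.set("hbase.zookeeper.property.clientPort", "2181")

    // Scan the first row of the mapped table and print the two mapped qualifiers.
    val table = new HTable(conf, "objects")
    val scanner = table.getScanner(new Scan())
    val first = scanner.next()
    if (first != null) {
      println("row key:  " + Bytes.toString(first.getRow))
      println("OH_Name:  " + Bytes.toString(first.getValue(Bytes.toBytes("o"), Bytes.toBytes("OH_Name"))))
      println("OH_Title: " + Bytes.toString(first.getValue(Bytes.toBytes("o"), Bytes.toBytes("OH_Title"))))
    }
    scanner.close()
    table.close()

A direct scan like this works, so the problem only shows up once the HBaseStorageHandler table is read through Shark/Spark tasks.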