You can register your data as a table using this library and then query it using HiveQL:
CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "src/test/resources/episodes.avro")

On Fri, Aug 7, 2015 at 11:42 AM, java8964 <java8...@hotmail.com> wrote:

> Hi, Michael:
>
> I am not sure how spark-avro can help in this case.
>
> My understanding is that to use spark-avro, I would have to translate all
> the logic of this big Hive query into Spark code, right?
>
> If I already have this big Hive query, how can I use spark-avro to run it?
>
> Thanks
>
> Yong
>
> ------------------------------
> From: mich...@databricks.com
> Date: Fri, 7 Aug 2015 11:32:21 -0700
> Subject: Re: Spark SQL query AVRO file
> To: java8...@hotmail.com
> CC: user@spark.apache.org
>
> Have you considered trying Spark SQL's native support for Avro data?
>
> https://github.com/databricks/spark-avro
>
> On Fri, Aug 7, 2015 at 11:30 AM, java8964 <java8...@hotmail.com> wrote:
>
> Hi, Spark users:
>
> We are currently running Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our
> production cluster, which has 42 data/task nodes.
>
> There is one dataset of about 3 TB stored as Avro files. Our business runs
> a complex query over this dataset, which is stored in a nested structure
> (an array of structs) in both Avro and Hive.
>
> We can query it using Hive without any problem, but we like Spark SQL's
> performance, so we ran the same query in Spark SQL and found that it is in
> fact much faster than Hive.
>
> However, when we run it, we randomly get the following error from the
> Spark executors, sometimes seriously enough to fail the whole Spark job.
>
> Below is the stack trace. I think this is a bug in Spark, because:
>
> 1) The error appears inconsistently; sometimes we do not see it at all for
> this job (we run it daily).
> 2) Sometimes it does not fail our job, because the tasks recover after a
> retry.
> 3) Sometimes it does fail our job, as shown below.
> 4) Could this be caused by multithreading in Spark?
> The NullPointerException indicates that Hive got a null ObjectInspector
> for a child of a StructObjectInspector (as I read the Hive source code),
> but I know there is no null ObjectInspector among the children of our
> StructObjectInspector. Googling this error didn't give me any hints. Does
> anyone know anything about this?
>
> Project [HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L,
> StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL
> tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L,
> StringType),CAST(active_contact_count_a#16L,
> StringType),CAST(other_api_contact_count_a#6L,
> StringType),CAST(fb_api_contact_count_a#7L,
> StringType),CAST(evm_contact_count_a#8L,
> StringType),CAST(loyalty_contact_count_a#9L,
> StringType),CAST(mobile_jmml_contact_count_a#10L,
> StringType),CAST(savelocal_contact_count_a#11L,
> StringType),CAST(siteowner_contact_count_a#12L,
> StringType),CAST(socialcamp_service_contact_count_a#13L,
> S...
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 58
> in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in stage
> 1.0 (TID 257, 10.20.95.146): java.lang.NullPointerException
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.<init>(AvroObjectInspectorGenerator.java:55)
>     at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
>     at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
>     at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:198)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
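[Editor's note] The approach Michael describes above, registering the Avro data with spark-avro and then running the existing HiveQL unchanged, can be put together in one driver program. The following is a minimal Scala sketch, not a definitive implementation: it assumes Spark 1.2-era APIs (a HiveContext), a hypothetical table name (accounts), a hypothetical HDFS path, and the spark-avro jar on the driver and executor classpaths.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object AvroHiveQlExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AvroHiveQlExample"))
    val sqlContext = new HiveContext(sc)

    // Register the Avro files as a temporary table through the spark-avro
    // data source, which reads Avro natively and so avoids Hive's AvroSerDe
    // (the AvroObjectInspectorGenerator code path that throws the NPE above).
    // Table name and path are placeholders for illustration only.
    sqlContext.sql(
      """CREATE TEMPORARY TABLE accounts
        |USING com.databricks.spark.avro
        |OPTIONS (path "hdfs:///data/accounts.avro")""".stripMargin)

    // The existing HiveQL can then run unchanged against the registered
    // table; none of the query logic needs to be rewritten as Spark code.
    val result = sqlContext.sql(
      "SELECT account_id, COUNT(*) FROM accounts GROUP BY account_id")
    result.collect().foreach(println)

    sc.stop()
  }
}
```

The spark-avro jar would be supplied at submit time (e.g. via spark-submit's --jars option); the exact artifact version depends on the Spark and Scala build in use.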