You can register your data as a table using this library and then query it using HiveQL:
CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "src/test/resources/episodes.avro")

On Fri, Aug 7, 2015 at 11:42 AM, java8964 <java8...@hotmail.com> wrote:

> Hi, Michael:
>
> I am not sure how spark-avro can help in this case.
>
> My understanding is that to use spark-avro, I would have to translate all
> the logic of this big Hive query into Spark code, right?
>
> If I already have this big Hive query, how can I use spark-avro to run it?
>
> Thanks
>
> Yong
>
> ------------------------------
> From: mich...@databricks.com
> Date: Fri, 7 Aug 2015 11:32:21 -0700
> Subject: Re: Spark SQL query AVRO file
> To: java8...@hotmail.com
> CC: user@spark.apache.org
>
> Have you considered trying Spark SQL's native support for Avro data?
>
> https://github.com/databricks/spark-avro
>
> On Fri, Aug 7, 2015 at 11:30 AM, java8964 <java8...@hotmail.com> wrote:
>
> Hi, Spark users:
>
> We are currently running Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our
> production cluster, which has 42 data/task nodes.
>
> There is one dataset of about 3 TB stored as Avro files. Our business runs
> a complex query over this dataset, which is stored in a nested structure
> (an array of structs) in both Avro and Hive.
>
> We can query it using Hive without any problem, but we like Spark SQL's
> performance, so we ran the same query in Spark SQL and found that it is in
> fact much faster than Hive.
>
> However, when we run it, we randomly get the following error from the
> Spark executors, sometimes seriously enough to fail the whole Spark job.
>
> Below is the stack trace. I think this is a bug in Spark, because:
>
> 1) The error appears inconsistently; sometimes we do not see it at all for
> this job (we run it daily).
> 2) Sometimes it does not fail our job, because the tasks recover after a
> retry.
> 3) Sometimes it does fail our job, as shown below.
> 4) Could this be caused by multithreading in Spark?
> The NullPointerException indicates that Hive got a null ObjectInspector
> for a child of a StructObjectInspector (as I read the Hive source code),
> but I know there is no null ObjectInspector among the children of our
> StructObjectInspector. Googling this error didn't give me any hints. Does
> anyone know anything about this?
>
> Project [HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L,
> StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL
> tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L,
> StringType),CAST(active_contact_count_a#16L,
> StringType),CAST(other_api_contact_count_a#6L,
> StringType),CAST(fb_api_contact_count_a#7L,
> StringType),CAST(evm_contact_count_a#8L,
> StringType),CAST(loyalty_contact_count_a#9L,
> StringType),CAST(mobile_jmml_contact_count_a#10L,
> StringType),CAST(savelocal_contact_count_a#11L,
> StringType),CAST(siteowner_contact_count_a#12L,
> StringType),CAST(socialcamp_service_contact_count_a#13L,
> S...
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 58
> in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in stage
> 1.0 (TID 257, 10.20.95.146): java.lang.NullPointerException
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
>     at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.<init>(AvroObjectInspectorGenerator.java:55)
>     at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
>     at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
>     at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:56)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:198)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
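[Editor's note] The approach Michael describes above, registering the Avro data with spark-avro and then running the existing HiveQL unchanged, can be put together in one driver program. The following is a minimal Scala sketch, not a definitive implementation: it assumes Spark 1.2-era APIs (a HiveContext), a hypothetical table name (accounts), a hypothetical HDFS path, and the spark-avro jar on the driver and executor classpaths.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object AvroHiveQlExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AvroHiveQlExample"))
    val sqlContext = new HiveContext(sc)

    // Register the Avro files as a temporary table through the spark-avro
    // data source, which reads Avro natively and so avoids Hive's AvroSerDe
    // (the AvroObjectInspectorGenerator code path that throws the NPE above).
    // Table name and path are placeholders for illustration only.
    sqlContext.sql(
      """CREATE TEMPORARY TABLE accounts
        |USING com.databricks.spark.avro
        |OPTIONS (path "hdfs:///data/accounts.avro")""".stripMargin)

    // The existing HiveQL can then run unchanged against the registered
    // table; none of the query logic needs to be rewritten as Spark code.
    val result = sqlContext.sql(
      "SELECT account_id, COUNT(*) FROM accounts GROUP BY account_id")
    result.collect().foreach(println)

    sc.stop()
  }
}
```

The spark-avro jar would be supplied at submit time (e.g. via spark-submit's --jars option); the exact artifact version depends on the Spark and Scala build in use.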