RE: Spark SQL query AVRO file

Good to know. Let me research it and give it a try.

Thanks,
Yong

From: mich...@databricks.com
Date: Fri, 7 Aug 2015 11:44:48 -0700
Subject: Re: Spark SQL query AVRO file
To: java8...@hotmail.com
CC: user@spark.apache.org

You can register your data as a table using this library and then query it using HiveQL:

CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "src/test/resources/episodes.avro")

On Fri, Aug 7, 2015 at 11:42 AM, java8964 wrote:

Hi, Michael:

I am not sure how spark-avro helps in this case. My understanding is that to use spark-avro, I would have to translate all the logic of this big Hive query into Spark code, right? If I already have this big Hive query, how can I use spark-avro to run it?

Thanks,
Yong

From: mich...@databricks.com
Date: Fri, 7 Aug 2015 11:32:21 -0700
Subject: Re: Spark SQL query AVRO file
To: java8...@hotmail.com
CC: user@spark.apache.org

Have you considered trying Spark SQL's native support for Avro data?

https://github.com/databricks/spark-avro

On Fri, Aug 7, 2015 at 11:30 AM, java8964 wrote:

Hi, Spark users:

We are currently running Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production cluster, which has 42 data/task nodes. One dataset of about 3 TB is stored as Avro files. Our business runs a complex query against this dataset, which uses a nested structure (an Array of Struct) in both Avro and Hive.

We can query it with Hive without any problem, but we like Spark SQL's performance, so we run the same query in Spark SQL, and it is in fact much faster than Hive. However, the query randomly hits the error below on the Spark executors, sometimes seriously enough to fail the whole Spark job. The stack trace follows; I suspect this is a bug in Spark because:

1) The error appears inconsistently; some days we do not see it at all. (We run the job daily.)
2) Sometimes it does not fail the job, because the task recovers after a retry.
3) Sometimes it does fail the job, as listed below.
4) Could this be due to multithreading in Spark?

The NullPointerException indicates that Hive got a null ObjectInspector among the children of a StructObjectInspector (as far as I can tell from the Hive source code), but I know none of the ObjectInspector children of our StructObjectInspector are null. Googling this error gave me no hints. Has anyone seen anything like this?

Project [HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L, StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L, StringType),CAST(active_contact_count_a#16L, StringType),CAST(other_api_contact_count_a#6L, StringType),CAST(fb_api_contact_count_a#7L, StringType),CAST(evm_contact_count_a#8L, StringType),CAST(loyalty_contact_count_a#9L, StringType),CAST(mobile_jmml_contact_count_a#10L, StringType),CAST(savelocal_contact_count_a#11L, StringType),CAST(siteowner_contact_count_a#12L, StringType),CAST(socialcamp_service_contact_count_a#13L, S...

org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in stage 1.0 (TID 257, 10.20.95.146): java.lang.NullPointerException
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139)
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89)
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101)
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117)
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81)
    at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.<init>(AvroObjectInspectorGenerator.java:55)
    at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
RE: Spark SQL query AVRO file
Hi, Michael: I am not sure how spark-avro can help in this case. My understanding is that to use Spark-avro, I have to translate all the logic from this big Hive query into Spark code, right? If I have this big Hive query, how I can use spark-avro to run the query? Thanks Yong From: mich...@databricks.com Date: Fri, 7 Aug 2015 11:32:21 -0700 Subject: Re: Spark SQL query AVRO file To: java8...@hotmail.com CC: user@spark.apache.org Have you considered trying Spark SQL's native support for avro data? https://github.com/databricks/spark-avro On Fri, Aug 7, 2015 at 11:30 AM, java8964 wrote: Hi, Spark users: We currently are using Spark 1.2.2 + Hive 0.12 + Hadoop 2.2.0 on our production cluster, which has 42 data/task nodes. There is one dataset stored as Avro files about 3T. Our business has a complex query running for the dataset, which is stored in nest structure with Array of Struct in Avro and Hive. We can query it using Hive without any problem, but we like the SparkSQL's performance, so we in fact run the same query in the Spark SQL, and found out it is in fact much faster than Hive. But when we run it, we got the following error randomly from Spark executors, sometime seriously enough to fail the whole spark job. Below the stack trace, and I think it is a bug related to Spark due to: 1) The error jumps out inconsistent, as sometimes we won't see it for this job. (We run it daily)2) Sometime it won't fail our job, as it recover after retry.3) Sometime it will fail our job, as I listed below.4) Is this due to the multithreading in Spark? The NullPointException indicates Hive got a Null ObjectInspector of the children of StructObjectInspector, as I read the Hive source code, but I know there is no null of ObjectInsepector as children of StructObjectInspector. Google this error didn't give me any hint. Does any one know anything like this? 
Project [HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcatWS(,,CAST(account_id#23L, StringType),CAST(gross_contact_count_a#4L, StringType),CASE WHEN IS NULL tag_cnt#21 THEN 0 ELSE CAST(tag_cnt#21, StringType),CAST(list_cnt_a#5L, StringType),CAST(active_contact_count_a#16L, StringType),CAST(other_api_contact_count_a#6L, StringType),CAST(fb_api_contact_count_a#7L, StringType),CAST(evm_contact_count_a#8L, StringType),CAST(loyalty_contact_count_a#9L, StringType),CAST(mobile_jmml_contact_count_a#10L, StringType),CAST(savelocal_contact_count_a#11L, StringType),CAST(siteowner_contact_count_a#12L, StringType),CAST(socialcamp_service_contact_count_a#13L, S...org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 in stage 1.0 failed 4 times, most recent failure: Lost task 58.3 in stage 1.0 (TID 257, 10.20.95.146): java.lang.NullPointerExceptionat org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:139) at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:89) at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:101) at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:117) at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:81) at org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.(AvroObjectInspectorGenerator.java:55) at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:69) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:112) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$2.apply(TableReader.scala:109) at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)at 
org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:618)at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartition
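[Editor's note] Michael's suggestion avoids rewriting the query: once the Avro files are registered through spark-avro as a temporary table, the existing HiveQL can run unchanged against the new table name, and Hive's AvroSerDe (where the NullPointerException above originates) is taken off the read path. A minimal sketch of the pattern, with a hypothetical table name and path, to be run from a HiveContext:

```sql
-- Register the Avro dataset through the spark-avro data source.
-- Table name and path are placeholders for illustration.
CREATE TEMPORARY TABLE contact_counts
USING com.databricks.spark.avro
OPTIONS (path "hdfs:///data/warehouse/contact_counts");

-- The existing complex HiveQL query is then submitted as-is,
-- with only the table reference changed, e.g.:
SELECT account_id, gross_contact_count_a, tag_cnt
FROM contact_counts;
```

This assumes the spark-avro package is on the classpath and that its schema conversion handles the dataset's nested Array of Struct columns; both are worth verifying against the Spark 1.2.2 setup described above before relying on it.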