Your enum *type* is named *enum1_values*, but your data has { "foo1": "test123", *"enum1"*: "BLUE" }.
That is, since you defined the field as a plain enum and not union(null, enum), the reader tries to
find a value for enum1_values and doesn't find one...
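If that's what is going on, one thing to try (an untested sketch on my side, so treat it as a starting
point rather than a definitive fix) is declaring the field as a nullable union in test2.avsc:

********
{
  "namespace": "com.cmiller",
  "name": "test1",
  "type": "record",
  "fields": [
    { "name": "foo1", "type": "string" },
    { "name": "enum1",
      "type": [ "null",
                { "type": "enum", "name": "enum1_values",
                  "symbols": ["BLUE", "RED", "GREEN"] } ],
      "default": null }
  ]
}
********

Note that, if I remember correctly, Avro's JSON encoding of a union wraps the value with the branch
name, so when you re-encode test2.json against a union field it would need to look something like
{ "foo1": "test123", "enum1": { "com.cmiller.enum1_values": "BLUE" } }.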
On 3 March 2016 at 11:30, Chris Miller <cmiller11...@gmail.com> wrote:

> I've been digging into this a little deeper. Here's what I've found:
>
> test1.avsc:
> ********************
> {
>   "namespace": "com.cmiller",
>   "name": "test1",
>   "type": "record",
>   "fields": [
>     { "name": "foo1", "type": "string" }
>   ]
> }
> ********************
>
> test2.avsc:
> ********************
> {
>   "namespace": "com.cmiller",
>   "name": "test1",
>   "type": "record",
>   "fields": [
>     { "name": "foo1", "type": "string" },
>     { "name": "enum1", "type": { "type": "enum", "name": "enum1_values",
>       "symbols": ["BLUE", "RED", "GREEN"] } }
>   ]
> }
> ********************
>
> test1.json (encoded and saved to test/test1.avro):
> ********************
> { "foo1": "test123" }
> ********************
>
> test2.json (encoded and saved to test/test2.avro):
> ********************
> { "foo1": "test123", "enum1": "BLUE" }
> ********************
>
> Here is how I create the tables and add the data:
>
> ********************
> CREATE TABLE test1
> PARTITIONED BY (ds STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schemas/test1.avsc');
>
> ALTER TABLE test1 ADD PARTITION (ds='1') LOCATION 's3://spark-data/dev/test1';
>
> CREATE TABLE test2
> PARTITIONED BY (ds STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
> TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schemas/test2.avsc');
>
> ALTER TABLE test2 ADD PARTITION (ds='1') LOCATION 's3://spark-data/dev/test2';
> ********************
>
> And here's what I get:
>
> ********************
> SELECT * FROM test1;
> -- works fine, shows data
>
> SELECT * FROM test2;
>
> org.apache.avro.AvroTypeException: Found com.cmiller.enum1_values, expecting union
>     at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
>     at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>     at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
>     at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
>     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
>     at org.apache.hadoop.hive.serde2.avro.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:111)
>     at org.apache.hadoop.hive.serde2.avro.AvroDeserializer.deserialize(AvroDeserializer.java:175)
>     at org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:201)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:409)
>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:408)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> ********************
>
> In addition to the above, I also tried putting the test Avro files on HDFS
> instead of S3 -- the error is the same. I also tried querying from Scala
> instead of using Zeppelin, and I get the same error.
>
> Where should I begin with troubleshooting this problem? This same query
> runs fine on Hive. Based on the error, it appears to be something in the
> deserializer though... but if it were a bug in the Avro deserializer, why
> does it only appear with Spark? I imagine Hive queries would be using the
> same deserializer, no?
>
> Thanks!
>
> --
> Chris Miller
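As for where to start troubleshooting: since the exception comes out of Avro's schema resolution,
one quick check (just a sketch, and it assumes you can copy one of the .avro files to a local path
and have avro on the classpath) is to print the writer schema that is actually embedded in the file
and compare it against the .avsc the table points at:

********
import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

object DumpWriterSchema {
  def main(args: Array[String]): Unit = {
    // args(0): local copy of one of the data files, e.g. test2.avro (path is illustrative)
    val reader = new DataFileReader[GenericRecord](new File(args(0)), new GenericDatumReader[GenericRecord]())
    try {
      // The Avro container file header carries the writer schema; pretty-print it.
      println(reader.getSchema.toString(true))
    } finally {
      reader.close()
    }
  }
}
********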
> On Thu, Mar 3, 2016 at 4:33 AM, Chris Miller <cmiller11...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a strange issue occurring when I use manual partitions.
>>
>> If I create a table as follows, I am able to query the data with no problem:
>>
>> ********
>> CREATE TABLE test1
>> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>> LOCATION 's3://analytics-bucket/prod/logs/avro/2016/03/02/'
>> TBLPROPERTIES ('avro.schema.url'='hdfs:///data/schemas/schema.avsc');
>> ********
>>
>> If I create the table like this, however, and then add a partition with a
>> LOCATION specified, I am unable to query:
>>
>> ********
>> CREATE TABLE test2
>> PARTITIONED BY (ds STRING)
>> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
>> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
>> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
>> TBLPROPERTIES ('avro.schema.url'='hdfs:///data/schemas/schema.avsc');
>>
>> ALTER TABLE test2 ADD PARTITION (ds='1') LOCATION
>> 's3://analytics-bucket/prod/logs/avro/2016/03/02/';
>> ********
>>
>> This is what happens:
>>
>> ********
>> SELECT * FROM test2 LIMIT 1;
>>
>> org.apache.avro.AvroTypeException: Found ActionEnum, expecting union
>>     at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
>>     at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>>     at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
>>     at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
>>     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
>>     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
>>     at org.apache.hadoop.hive.serde2.avro.AvroDeserializer$SchemaReEncoder.reencode(AvroDeserializer.java:111)
>>     at org.apache.hadoop.hive.serde2.avro.AvroDeserializer.deserialize(AvroDeserializer.java:175)
>>     at org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:201)
>>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:409)
>>     at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:408)
>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>     at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>     at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>>     at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
>>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
>>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>> ********
>>
>> The data is exactly the same, and I can still go back and query the test1
>> table without issue. I don't have control over the directory structure, so
>> I need to add the partitions manually so that I can specify a location.
>>
>> For what it's worth, "ActionEnum" is the first field in my schema. This
>> same table and query structure works fine with Hive. When I try to run this
>> with SparkSQL, however, I get the above error.
>>
>> Anyone have any idea what the problem is here? Thanks!
>>
>> --
>> Chris Miller
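For completeness, the resolution step that fails here can also be exercised without Spark or Hive at
all. Below is a minimal sketch (the file names are placeholders for a local copy of one data file and
the matching .avsc) that forces plain Avro to resolve the writer schema embedded in the file against
the .avsc as the reader schema:

********
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

object ResolveAgainstAvsc {
  def main(args: Array[String]): Unit = {
    // Reader schema taken from the .avsc the table uses; the writer schema comes from the file header.
    val readerSchema = new Schema.Parser().parse(new File("test2.avsc"))
    val datumReader  = new GenericDatumReader[GenericRecord](readerSchema)
    val fileReader   = new DataFileReader[GenericRecord](new File("test2.avro"), datumReader)
    try {
      while (fileReader.hasNext) {
        // If plain Avro resolution also fails, the AvroTypeException should surface here.
        println(fileReader.next())
      }
    } finally {
      fileReader.close()
    }
  }
}
********

If this reads the records cleanly, that would suggest the problem lies in the reader schema Spark
ends up using for the partitioned table (the SchemaReEncoder path visible in the stack trace) rather
than in the data files themselves.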