Hello,

First of all, thanks for your help. I've corrected my schema according to your advice, but I still have the same kind of issue:

------------------------------------------------------------------------

With this schema:

(...)
{"name": "in_reply_to", "type": ["null", "long"], "default": null},
(...)
{"name": "urls", "type": ["null", {"type": "array", "items": (record)}]}
(...)

Using this schema, the following data (non-null in_reply_to):

{"created_at": "Mon, 28 May 2012 00:01:25 +0000", "emitter": 405427230, "emitter_name": "CallmeOceane_", "geo": null, "hashtags": null, "in_reply_to": 206897508021055489, "lang": "fr", "msg": "@Chloe_OneD Aaaah puuuuutain j'ai toujours pas finis Wild Souls machin truc", "uid": 206897932501385217, "urls": null, "usermentions": [{"id": 288136906, "indices": [0, 11], "name": "Happiness \u10e6", "screen_name": "Chloe_OneD"}]}

ends with this error:

2012-06-07 10:16:07,831 WARN org.apache.hadoop.streaming.PipeMapRed: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
    at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:460)
    at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:418)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
    at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:102)
    at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:88)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:446)
    at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)

------------------------------------------------------------------------

While using this data (null in_reply_to, non-null urls):

{"created_at": "Mon, 28 May 2012 00:00:10 +0000", "emitter": 59809965, "emitter_name": "Droolius", "geo": null, "hashtags": null, "in_reply_to": null, "lang": "en", "msg": "RT @davidchang: Thank you again Amy Rowat & team UCLA @scienceandfood : Umami Reverse Engineering + The Joy of MSG http://t.co/nk1QBGbg", "uid": 206897616326377472, "urls": [{"display_url": "bit.ly/KvD0QZ", "expanded_url": "http://bit.ly/KvD0QZ", "indices": [119, 139], "url": "http://t.co/nk1QBGbg"}], "usermentions": [{"id": 221185711, "indices": [3, 14], "name": "Dave Chang", "screen_name": "davidchang"}, {"id": 526175293, "indices": [58, 73], "name": "UCLA Science & Food", "screen_name": "scienceandfood"}]}

it ends with:

2012-06-07 10:38:19,530 WARN org.apache.hadoop.streaming.PipeMapRed: org.apache.avro.AvroTypeException: Expected start-union. Got START_ARRAY
    at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:460)
    at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:418)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
    at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:102)
    at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:88)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:446)
    at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)

------------------------------------------------------------------------

Judging from these stack traces, I guess that my problem has something to do with the custom output format, which relies on org.apache.avro.generic (and consequently on the strict Java implementation). Am I right?

All the best, and thanks again for reading :)

Regards,
François.

> According to the spec, the default value for a union is assumed to have
> the type of the first element of the union.
>
> http://avro.apache.org/docs/current/spec.html#schema_record
>
> So some valid fields would be:
>
> {"name":"x", "type":["long", "null"], "default": 0}
> {"name":"y", "type":["null", "long"], "default": null}
>
> The following are invalid fields, since the type of the default value
> does not match that of the first union element.
>
> {"name":"x", "type":["long", "null"], "default": null}
> {"name":"y", "type":["null", "long"], "default": 0}
>
> Python may not implement this strictly, but Java does.
>
> This is a common point of confusion. We should probably document it
> better. I'm not sure whether it's causing the problem you're seeing,
> but perhaps it is.
>
> Cheers,
>
> Doug
>
> On 06/06/2012 04:15 AM, François Kawala wrote:
> > Dear all,
> >
> > Despite my desperate efforts to get a working schema, I cannot manage to
> > specify that a field of a given record can be either "a given type" or
> > "null". I've tried with unions, but the back-end that I have to use seems
> > to be unhappy with them. More precisely: I'm trying to output the result
> > of a Streaming MR job within an AVRO container.
> > This job is written in
> > Python and executed through dumbo (http://www.dumbotics.com), and a
> > custom OutputFormat is used
> > (https://github.com/tomslabs/avro-utils/tree/master/src/main/java/com/tomslabs/grid/avro)
> >
> > However, since this custom OutputFormat relies on org.apache.avro
> > sources, I thought this list could be a good spot to call for help.
> >
> > Thanks for reading,
> > François.
> >
> > ------------------------------------------------------------------------
> >
> > Here are some complementary elements:
> >
> > Fragment of the schema that I think is responsible for my troubles:
> >
> > {"name": "in_reply_to", "type": [{"type": "long"},"null"], "default":"null"}
> >
> > I've also unsuccessfully tried:
> >
> > {"name": "in_reply_to", "type": [{"type": "long"},"null"]}
> > {"name": "in_reply_to", "type": ["null",{"type": "long"}]}
> >
> > Each ending with the same error message:
> >
> > org.apache.avro.AvroTypeException: Expected start-union. Got
> > VALUE_NUMBER_INT
> >
> > Error stack:
> >
> >     at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:460)
> >     at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:418)
> >     at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
> >     at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> >     at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
> >     at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
> >     at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
> >     at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:102)
> >     at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:88)
> >     at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:446)
> >     at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)
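
A note on the "Expected start-union" errors in this thread: every trace goes through org.apache.avro.io.JsonDecoder, and Avro's JSON encoding writes a non-null union value as a single-key object naming the branch, for example {"long": 206897508021055489} rather than a bare number, and {"array": [...]} rather than a bare list. That may be what the decoder is complaining about, although nothing in the thread confirms it. The following is a minimal Python sketch of wrapping such values before the streaming job emits them. The names wrap_unions and UNION_FIELDS are hypothetical, and only the two union fields shown in the schema fragment are handled; other nullable fields (geo and hashtags, presumably) and unions nested inside records would need the same treatment.

```python
import json

# Hypothetical mapping: union field name -> name of its non-null branch type.
# Only the two fields quoted in the schema fragment are listed here.
UNION_FIELDS = {"in_reply_to": "long", "urls": "array"}


def wrap_unions(record, union_fields=UNION_FIELDS):
    """Return a copy of `record` in which each non-null union value is
    wrapped as {"<branch type name>": value}; nulls are left as null."""
    out = dict(record)
    for name, branch in union_fields.items():
        value = out.get(name)
        if value is not None:
            out[name] = {branch: value}
    return out


if __name__ == "__main__":
    tweet = {"in_reply_to": 206897508021055489, "urls": None, "lang": "fr"}
    print(json.dumps(wrap_unions(tweet), sort_keys=True))
    # {"in_reply_to": {"long": 206897508021055489}, "lang": "fr", "urls": null}
```

Whether this belongs in the Python job or in the custom OutputFormat depends on how the latter parses its input, which would need to be checked in its source.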