Hello,
First of all, thanks for your help. I've corrected my schema according to your
advice, but I still have the same kind of issue:
------------------------------------------------------------------------
With this schema:
(...)
{"name": "in_reply_to", "type": ["null", "long" ], "default": null },
(...)
{"name":"urls","type":["null",{"type":"array","items": (record) }]}
(...)
Using this schema, the following data:
{"created_at": "Mon, 28 May 2012 00:01:25 +0000", "emitter": 405427230,
"emitter_name": "CallmeOceane_", "geo": null, "hashtags": null,
*"in_reply_to": 206897508021055489*,
"lang": "fr", "msg": "@Chloe_OneD Aaaah puuuuutain j'ai toujours pas finis Wild
Souls machin truc", "uid": 206897932501385217, "urls": null, "usermentions":
[{"id": 288136906, "indices": [0, 11], "name": "Happiness \u10e6",
"screen_name": "Chloe_OneD"}]}
Ends with this error:
2012-06-07 10:16:07,831 WARN org.apache.hadoop.streaming.PipeMapRed:
org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
    at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:460)
    at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:418)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
    at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:102)
    at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:88)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:446)
    at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)
------------------------------------------------------------------------
With this data, on the other hand:
{"created_at": "Mon, 28 May 2012 00:00:10 +0000", "emitter": 59809965,
"emitter_name": "Droolius", "geo": null, "hashtags": null, *"in_reply_to":
null*, "lang": "en", "msg":
"RT @davidchang: Thank you again Amy Rowat & team UCLA @scienceandfood :
Umami Reverse Engineering + The Joy of MSG http://t.co/nk1QBGbg", "uid":
206897616326377472,
*"urls": [{"display_url": "bit.ly/KvD0QZ", "expanded_url":
"http://bit.ly/KvD0QZ", "indices": [119, 139], "url":
"http://t.co/nk1QBGbg"}]*,
"usermentions": [{"id": 221185711, "indices": [3, 14], "name": "Dave Chang",
"screen_name": "davidchang"},
{"id": 526175293, "indices": [58, 73], "name": "UCLA Science & Food",
"screen_name": "scienceandfood"}]}
It ends with:
2012-06-07 10:38:19,530 WARN org.apache.hadoop.streaming.PipeMapRed:
org.apache.avro.AvroTypeException: Expected start-union. Got START_ARRAY
    at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:460)
    at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:418)
    at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
    at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
    at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
    at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
    at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
    at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:102)
    at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:88)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:446)
    at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)
------------------------------------------------------------------------
Judging from these stack traces, I guess that my problem has something
to do with the custom output format, which relies on
org.apache.avro.generic (and consequently on the strict Java
implementation). Am I right?
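
A note on the two error messages above: both are raised in
JsonDecoder.readIndex, which suggests that the output format parses each
record as Avro JSON rather than as plain JSON. In Avro's JSON encoding, a
non-null union value is written as a single-key object naming the branch,
e.g. {"long": 206897508021055489} or {"array": [...]}, while null stays a
bare null. Below is a minimal Python sketch of wrapping plain values that
way before the streaming job emits them. It is an assumption, not a
confirmed fix, that TextTypedBytesToAvroOutputFormat expects this encoding.
The helper names (branch_name, matches, to_avro_json), the record name
"Tweet" and the "string" item type for urls are hypothetical, since the real
schema is elided above.

```python
import json


def branch_name(branch):
    """Key used by Avro's JSON encoding for a union branch.

    Simplification: namespaces of named types are ignored in this sketch.
    """
    if isinstance(branch, str):
        return branch                       # primitive, e.g. "long"
    if branch["type"] in ("record", "enum", "fixed"):
        return branch["name"]               # named types use their name
    return branch["type"]                   # "array", "map" or {"type": "long"}


def matches(value, branch):
    """Rough test of whether a plain JSON value fits a union branch."""
    kind = branch if isinstance(branch, str) else branch["type"]
    if kind == "null":
        return value is None
    if kind == "boolean":
        return isinstance(value, bool)
    if kind in ("int", "long"):
        return isinstance(value, int) and not isinstance(value, bool)
    if kind in ("float", "double"):
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if kind in ("string", "bytes"):
        return isinstance(value, str)
    if kind == "array":
        return isinstance(value, list)
    if kind in ("record", "map"):
        return isinstance(value, dict)
    return False                            # named references are not handled


def to_avro_json(value, schema):
    """Rewrite a plain JSON value so that union values carry their branch name."""
    if isinstance(schema, list):            # a union: pick the first fitting branch
        for branch in schema:
            if matches(value, branch):
                if value is None:
                    return None             # null is never wrapped
                return {branch_name(branch): to_avro_json(value, branch)}
        raise ValueError("no union branch fits %r" % (value,))
    if isinstance(schema, dict):
        kind = schema["type"]
        if kind == "record":
            return {f["name"]: to_avro_json(value[f["name"]], f["type"])
                    for f in schema["fields"]}
        if kind == "array":
            return [to_avro_json(v, schema["items"]) for v in value]
        if kind == "map":
            return {k: to_avro_json(v, schema["values"]) for k, v in value.items()}
    return value                            # primitives pass through unchanged


# Hypothetical, cut-down schema: only the two union fields discussed above.
SCHEMA = {
    "type": "record",
    "name": "Tweet",
    "fields": [
        {"name": "in_reply_to", "type": ["null", "long"], "default": None},
        {"name": "urls", "type": ["null", {"type": "array", "items": "string"}]},
    ],
}

if __name__ == "__main__":
    plain = {"in_reply_to": 206897508021055489, "urls": None}
    print(json.dumps(to_avro_json(plain, SCHEMA)))
```

With this sketch the first record's in_reply_to becomes
{"long": 206897508021055489}, and a non-null urls list becomes
{"array": [...]}, which is the form that opens with the "start-union" token
the decoder reports as missing.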
All the best, and thanks again for reading :)
Regards,
François.
> According to the spec, the default value for a union is assumed to have
> the type of the first element of the union.
>
> http://avro.apache.org/docs/current/spec.html#schema_record
>
> So some valid fields would be:
>
> {"name":"x", "type":["long", "null"], "default": 0}
> {"name":"y", "type":["null", "long"], "default": null}
>
> The following are invalid fields, since the type of the default value
> does not match that of the first union element.
>
> {"name":"x", "type":["long", "null"], "default": null}
> {"name":"y", "type":["null", "long"], "default": 0}
>
> Python may not implement this strictly, but Java does.
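
The rule quoted above can be checked mechanically. Here is a minimal Python
sketch, assuming the rule exactly as stated (the default of a union field
must have the type of the first branch). The function names are
hypothetical, and only primitive, array, map and record branches are
inspected; anything else is let through unchecked.

```python
import json


def first_branch_kind(union):
    """Type name of the first branch of a union schema."""
    first = union[0]
    return first if isinstance(first, str) else first["type"]


def default_is_valid(field):
    """True if a union field's default has the type of the first branch."""
    ftype = field["type"]
    if not isinstance(ftype, list) or "default" not in field:
        return True                 # the rule only concerns unions with a default
    kind = first_branch_kind(ftype)
    default = field["default"]
    if kind == "null":
        return default is None
    if kind == "boolean":
        return isinstance(default, bool)
    if kind in ("int", "long"):
        return isinstance(default, int) and not isinstance(default, bool)
    if kind in ("float", "double"):
        return isinstance(default, (int, float)) and not isinstance(default, bool)
    if kind in ("string", "bytes"):
        return isinstance(default, str)
    if kind == "array":
        return isinstance(default, list)
    if kind in ("record", "map"):
        return isinstance(default, dict)
    return True                     # other kinds are not checked in this sketch


# The four fields from the message above, plus the fragment from the
# original post whose default is the string "null".
FIELDS = [
    '{"name":"x", "type":["long", "null"], "default": 0}',
    '{"name":"y", "type":["null", "long"], "default": null}',
    '{"name":"x", "type":["long", "null"], "default": null}',
    '{"name":"y", "type":["null", "long"], "default": 0}',
    '{"name": "in_reply_to", "type": [{"type": "long"},"null"], "default":"null"}',
]

if __name__ == "__main__":
    for text in FIELDS:
        print(default_is_valid(json.loads(text)), text)
```

The first two fields pass and the last three fail, the last one because
"null" in quotes is a string, not the JSON null.
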
>
> This is a common point of confusion. We should probably document it
> better. I'm not sure whether it's causing the problem you're seeing,
> but perhaps it is.
>
> Cheers,
>
> Doug
>
> On 06/06/2012 04:15 AM, François Kawala wrote:
> > Dear all,
> >
> > Despite my best efforts to get a working schema, I cannot manage to
> > specify that a field of a given record can be either "a given type" or
> > "null". I've tried unions, but the back end that I have to use seems
> > to be unhappy with them. More precisely: I'm trying to write the output
> > of a Streaming MR job to an Avro container. This job is written in
> > Python and executed through dumbo (http://www.dumbotics.com), and a
> > custom OutputFormat is used
> > (https://github.com/tomslabs/avro-utils/tree/master/src/main/java/com/tomslabs/grid/avro)
> >
> >
> > However, since this custom OutputFormat relies on the org.apache.avro
> > sources, I thought this list could be a good place to ask for help.
> >
> > Thanks for reading,
> > François.
> >
> > ------------------------------------------------------------------------
> >
> > Here are some additional details:
> >
> > Fragment of the schema that I think is responsible for my troubles:
> >
> > {"name": "in_reply_to", "type": [{"type": "long"},"null"], "default":"null"}
> >
> > I've also tried, unsuccessfully:
> >
> > {"name": "in_reply_to", "type": [{"type": "long"},"null"]}
> > {"name": "in_reply_to", "type": ["null",{"type": "long"}]}
> >
> > Each ends with the same error message:
> >
> > org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
> >
> > Error stack:
> >
> > at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:460)
> > at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:418)
> > at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
> > at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> > at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
> > at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
> > at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:166)
> > at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:138)
> > at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:129)
> > at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:102)
> > at com.tomslabs.grid.avro.TextTypedBytesToAvroOutputFormat$AvroRecordWriter.write(TextTypedBytesToAvroOutputFormat.java:88)
> > at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:446)
> > at org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:421)
> >
>