Thanks, that is very helpful. It actually makes complete sense (note the other email where I was wondering exactly how Avro dealt with unions of similar types); I guess what threw me off is that the Python implementation worked fine.
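The union labeling Scott explains below can be sketched in a few lines of Python. This is only an illustration of the rule from the Avro spec's JSON encoding, not the Avro library's own API: a null branch is written bare, and every other branch is wrapped in a single-key object whose key names the branch.

```python
import json

def encode_union(value, branch):
    """Sketch of the Avro JSON encoding rule for unions: null is
    written bare; any other branch is wrapped in a one-key object
    keyed by the branch's type name (or fullname for named types)."""
    if branch == "null":
        return None
    return {branch: value}

# For a union of ["null", "string", "bytes"], the label is what lets
# a reader tell the string "hello" apart from the byte literal "hello".
print(json.dumps(encode_union(None, "null")))        # null
print(json.dumps(encode_union("hello", "string")))   # {"string": "hello"}
print(json.dumps(encode_union("hello", "bytes")))    # {"bytes": "hello"}
```

Without the label, both of the last two values would serialize to the bare JSON string `"hello"`, which is exactly the ambiguity Scott describes.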
Thanks again,
Jon

2013/4/7 Scott Carey <[email protected]>

> It is well documented in the specification:
> http://avro.apache.org/docs/current/spec.html#json_encoding
>
> I know others have overridden this behavior by extending GenericData
> and/or the JsonDecoder/Encoder. It wouldn't conform to the Avro
> specification's JSON, but you can extend Avro to do what you need it to.
>
> The reason for this encoding is to make sure that round-tripping data from
> binary to JSON and back results in the same data. Additionally, unions can
> be more complicated and contain multiple records, each with different names.
> Disambiguating the value requires more information, since several Avro data
> types map to the same JSON data type. If the schema is a union of bytes
> and string, is "hello" a string or a byte literal? If it is a union of a
> map and a record, is {"state":"CA", "city":"Pittsburgh"} a record with two
> string fields, or a map? There are other approaches, and for some users
> perfect transmission of types is not critical. Generally speaking, if you
> want to output Avro data as JSON and consume it as JSON, the extra data is not
> helpful. If you want to read it back in as Avro, you're going to need the
> info to know which branch of the union to take.
>
> On 4/6/13 6:49 PM, "Jonathan Coveney" <[email protected]> wrote:
>
> Err, it's the output format that deserializes the JSON and then writes it
> in the binary format, not the input format. But either way the general flow
> is the same.
>
> As a general aside, is it the case that the Java behavior is correct, in
> that when writing a union it should be {"string": "hello"} or whatnot?
> Seems like we should probably add that to the documentation if it is a
> requirement.
>
>
> 2013/4/7 Jonathan Coveney <[email protected]>
>
>> Scott,
>>
>> Thanks for the input. The use case is that a number of our batch
>> processes are built on Python streaming.
>> Currently, the reducer will output
>> a JSON string as a value, and then the input format will deserialize the
>> JSON and write it in the binary format.
>>
>> Given that our use of Python streaming isn't going away, any suggestions
>> on how to make this better? Is there a better way to go from a JSON
>> string to writing binary Avro data?
>>
>> Thanks again
>> Jon
>>
>>
>> 2013/4/6 Scott Carey <[email protected]>
>>
>>> This is due to using the JSON encoding for Avro and not the binary
>>> encoding. It would appear that the Python version is a little bit lax on
>>> the spec. Some have built variations of the JSON encoding that do not
>>> label the union, but there are drawbacks to this too, as the type can be
>>> ambiguous in a very large number of cases without a label.
>>>
>>> Why are you using the JSON encoding for Avro? The primary purpose of
>>> the JSON serialization form as it is now is for transforming the binary
>>> into a human-readable form.
>>> Instead of building your GenericRecord from a JSON string, try using
>>> GenericRecordBuilder.
>>>
>>> -Scott
>>>
>>> On 4/5/13 4:59 AM, "Jonathan Coveney" <[email protected]> wrote:
>>>
>>> Ok, I figured out the issue:
>>>
>>> If you make string c the following:
>>> String c = "{\"name\": \"Alyssa\", \"favorite_number\": {\"int\": 256},
>>> \"favorite_color\": {\"string\": \"blue\"}}";
>>>
>>> Then this works.
>>>
>>> This represents a divergence between the Python and the Java
>>> implementations... the above does not work in Python, but it does work
>>> in Java. And of course, vice versa.
>>>
>>> I think I know how to fix this (and can file a bug with my reproduction
>>> and the fix), but I'm not sure which one is the expected case? Which
>>> implementation is wrong?
>>>
>>> Thanks
>>>
>>>
>>> 2013/4/5 Jonathan Coveney <[email protected]>
>>>
>>>> Correction: the issue is when reading the string according to the Avro
>>>> schema, not on writing.
>>>> It fails before I get a chance to write :)
>>>>
>>>>
>>>> 2013/4/5 Jonathan Coveney <[email protected]>
>>>>
>>>>> I implemented essentially the Java Avro example, but using the
>>>>> GenericDatumWriter and GenericDatumReader, and hit an issue.
>>>>>
>>>>> https://gist.github.com/jcoveney/5317904
>>>>>
>>>>> This is the error:
>>>>> Exception in thread "main" java.lang.RuntimeException:
>>>>> org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
>>>>>   at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:45)
>>>>> Caused by: org.apache.avro.AvroTypeException: Expected start-union. Got VALUE_NUMBER_INT
>>>>>   at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
>>>>>   at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
>>>>>   at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
>>>>>   at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>>>>>   at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
>>>>>   at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:152)
>>>>>   at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
>>>>>   at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
>>>>>   at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
>>>>>   at com.spotify.hadoop.mapred.Hrm.main(Hrm.java:38)
>>>>>
>>>>> Am I doing something wrong? Is this a bug? I'm digging in now, but am
>>>>> curious if anyone has seen this before?
>>>>>
>>>>> I get the feeling I am working with Avro in a way that most people do
>>>>> not :)
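The fix Jonathan found (wrapping each non-null union value with its branch label, turning 256 into {"int": 256}) can be sketched as a small pre-processing step for JSON produced by Python streaming. This is only an illustration: the field-to-branch mapping is hard-coded here as an assumption, whereas a real implementation would derive it from the Avro schema.

```python
import json

# Hypothetical mapping for a record whose optional fields use
# ["null", <type>] unions, as in the users.avsc example in the thread.
# In practice this would be read from the schema, not hard-coded.
OPTIONAL_BRANCHES = {"favorite_number": "int", "favorite_color": "string"}

def label_unions(record):
    """Wrap each non-null optional value in the one-key object the
    Java JsonDecoder expects, e.g. 256 -> {"int": 256}."""
    out = {}
    for field, value in record.items():
        branch = OPTIONAL_BRANCHES.get(field)
        if branch is not None and value is not None:
            out[field] = {branch: value}
        else:
            out[field] = value  # non-union fields and nulls pass through
    return out

plain = {"name": "Alyssa", "favorite_number": 256, "favorite_color": "blue"}
print(json.dumps(label_unions(plain)))
# {"name": "Alyssa", "favorite_number": {"int": 256}, "favorite_color": {"string": "blue"}}
```

The output matches the corrected string `c` from the thread, which the Java JsonDecoder accepts; the bare form raises the "Expected start-union. Got VALUE_NUMBER_INT" error shown above.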
