FWIW, I do this in web apps all the time using the Python avro library and json.dumps.
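A minimal sketch of that approach, assuming each record is already a plain dict (the shape the Python avro library's DataFileReader yields with a generic DatumReader). The helper names and the sample record are hypothetical, and rendering bytes as Latin-1 text is just one possible choice, picked here because Avro's own JSON encoding maps each byte to the code point of the same value.

```python
import json


def _fallback(value):
    # json.dumps calls this hook for any type it cannot serialize itself.
    if isinstance(value, (bytes, bytearray)):
        # Assumption: render bytes as a string, one code point (0-255) per byte.
        return bytes(value).decode("latin-1")
    raise TypeError(f"not JSON serializable: {type(value).__name__}")


def record_to_json(record):
    """Serialize one dict-shaped record to a JSON string."""
    return json.dumps(record, default=_fallback, sort_keys=True)


# Hypothetical record, shaped like what a generic Avro reader hands back.
example = {
    "name": "widget",
    "tags": ["a", "b"],
    "payload": b"\x00\xff",
    "price": None,
}

print(record_to_json(example))
```

In a web app the returned string could go straight into the response body; strings, numbers, lists, nested dicts and None need no special handling.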

Russell Jurney twitter.com/rjurney [email protected] datasyndrome.com

On Sep 18, 2012, at 12:38 PM, Doug Cutting <[email protected]> wrote:

> On Tue, Sep 18, 2012 at 11:34 AM, Markus Strickler <[email protected]>
> wrote:
>> Json.Writer is indeed what I had in mind and I have successfully managed to
>> convert my existing JSON to avro using it.
>> However using GenericDatumReader on this feels pretty unnatural, as I seem
>> to be unable to access fields directly. It seems I have to access the
>> "value" field on each record which returns a Map which uses Utf8 Objects as
>> keys for the actual fields. Or am I doing something wrong here?
>
> Hmm. We could re-factor Json.SCHEMA so the union is the top-level
> element. That would get rid of the wrapper around every value. It's
> a more redundant way to write the schema, but the binary encoding is
> identical (since a record wrapper adds no bytes). It would hence
> require no changes to Json.Reader or Json.Writer.
>
> [ "long",
>   "double",
>   "string",
>   "boolean",
>   "null",
>   {"type" : "array",
>    "items" : {
>      "type" : "record",
>      "name" : "org.apache.avro.data.Json",
>      "fields" : [ {
>        "name" : "value",
>        "type" : [ "long", "double", "string", "boolean", "null",
>                   {"type" : "array", "items" : "Json"},
>                   {"type" : "map", "values" : "Json"}
>                 ]
>      } ]
>    }
>   },
>   {"type" : "map", "values" : "Json"}
> ]
>
> You can try this by placing this schema in
> share/schemas/org/apache/avro/data/Json.avsc and re-building the avro
> jar.
>
> Would such a change be useful to you? If so, please file an issue in Jira.
>
> Or we could even refactor this schema so that a Json object is the
> top-level structure:
>
> {"type" : "map",
>  "values" : [ "long",
>               "double",
>               "string",
>               "boolean",
>               "null",
>               {"type" : "array",
>                "items" : {
>                  "type" : "record",
>                  "name" : "org.apache.avro.data.Json",
>                  "fields" : [ {
>                    "name" : "value",
>                    "type" : [ "long", "double", "string", "boolean",
>                               "null",
>                               {"type" : "array", "items" : "Json"},
>                               {"type" : "map", "values" : "Json"}
>                             ]
>                  } ]
>                }
>               },
>               {"type" : "map", "values" : "Json"}
>             ]
> }
>
> This would change the binary format but would not change the
> representation that GenericDatumReader would hand you from my first
> example above (since the generic representation unwraps unions).
> Using this schema would require changes to Json.Writer and
> Json.Reader. It would better conform to the definition of Json, which
> only permits objects as the top-level type.
>
>> Concerning the more specific schema, you are of course completely right.
>> Unfortunately more or less all the fields in the JSON data format are
>> optional and many have substructures, so, at least in my understanding, I
>> have to use unions of null and the actual type throughout the schema. I
>> tried using JsonDecoder first (or rather the fromjson option of the avro
>> tool, which, I think, uses JsonDecoder) but given the current JSON
>> structures, this didn't work.
>
>> So I'll probably have to look into implementing my own converter. However
>> given the rather complex structure of the original JSON I'm wondering if
>> trying to represent the data in avro is such a good idea in the first place.
>
> It would be interesting to see whether, with the appropriate schema,
> the dataset is smaller and faster to process as Avro than as
> Json. If you have 1000 fields in your data but the typical record
> only has one or two non-null, then an Avro record is perhaps not a
> good representation. An Avro map might be better, but if the values
> are similarly variable then Json might be competitive.
>
> Cheers,
>
> Doug
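
Doug's point about sparse records can be put into rough numbers. The sketch below is a back-of-the-envelope size estimate, not a benchmark. It assumes the Avro binary encoding rules as I understand them from the spec: longs are zigzag varints, a union writes its branch index as a long, null takes zero bytes, a string is a length plus UTF-8 bytes, and a map is a counted block followed by a zero count. The field names, the 1000-field layout and the long-only values are all hypothetical.

```python
import json


def varint_len(n):
    """Bytes Avro needs for an int/long: zigzag, then 7 bits per byte."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small positives
    length = 1
    while z > 0x7F:
        z >>= 7
        length += 1
    return length


def string_len(s):
    data = s.encode("utf-8")
    return varint_len(len(data)) + len(data)


def record_size(field_names, values):
    """Record in which every field has type ["null", "long"]."""
    total = 0
    for name in field_names:
        if name in values:
            total += varint_len(1) + varint_len(values[name])  # branch 1 + long
        else:
            total += varint_len(0)  # branch 0 (null), no payload bytes
    return total


def map_size(values):
    """Map with long values, written as a single block."""
    if not values:
        return varint_len(0)
    total = varint_len(len(values))  # block count
    for key, value in values.items():
        total += string_len(key) + varint_len(value)
    return total + varint_len(0)  # zero count terminates the map


def json_size(values):
    return len(json.dumps(values, separators=(",", ":")).encode("utf-8"))


# Hypothetical data: 1000 declared fields, only two of them set.
field_names = [f"f{i}" for i in range(1000)]
sparse = {"f17": 5, "f512": 300}

print("record:", record_size(field_names, sparse))
print("map:   ", map_size(sparse))
print("json:  ", json_size(sparse))
```

Under these assumptions the record form spends at least one byte on every declared field whether or not it is set, while the map and the JSON text only pay for the fields that are present. Processing speed is a separate question that this does not measure.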
