Re: Converting arbitrary JSON to avro

Russell Jurney Tue, 18 Sep 2012 16:19:00 -0700

FWIW, I do this in web apps all the time via the Python avro lib and json.dumps.


Russell Jurney
twitter.com/rjurney
[email protected]
datasyndrome.com

On Sep 18, 2012, at 12:38 PM, Doug Cutting <[email protected]> wrote:

> On Tue, Sep 18, 2012 at 11:34 AM, Markus Strickler <[email protected]> wrote:
>> Json.Writer is indeed what I had in mind and I have successfully managed to 
>> convert my existing JSON to avro using it.
>> However using GenericDatumReader on this feels pretty unnatural, as I seem 
>> to be unable to access fields directly. It seems I have to access the 
>> "value" field on each record which returns a Map which uses Utf8 Objects as 
>> keys for the actual fields. Or am I doing something wrong here?
>
> Hmm.  We could re-factor Json.SCHEMA so the union is the top-level
> element.  That would get rid of the wrapper around every value.  It's
> a more redundant way to write the schema, but the binary encoding is
> identical (since a record wrapper adds no bytes).  It would hence
> require no changes to Json.Reader or Json.Writer.
>
> [ "long",
>  "double",
>  "string",
>  "boolean",
>  "null",
>  {"type" : "array",
>   "items" : {
>       "type" : "record",
>       "name" : "org.apache.avro.data.Json",
>       "fields" : [ {
>           "name" : "value",
>           "type" : [ "long", "double", "string", "boolean", "null",
>                      {"type" : "array", "items" : "Json"},
>                      {"type" : "map", "values" : "Json"}
>                    ]
>       } ]
>   }
>  },
>  {"type" : "map", "values" : "Json"}
> ]
>
> You can try this by placing this schema in
> share/schemas/org/apache/avro/data/Json.avsc and re-building the avro
> jar.
>
> Would such a change be useful to you?  If so, please file an issue in Jira.
>
> Or we could even refactor this schema so that a Json object is the
> top-level structure:
>
> {"type" : "map",
> "values" : [ "long",
>              "double",
>              "string",
>              "boolean",
>              "null",
>              {"type" : "array",
>               "items" : {
>                   "type" : "record",
>                   "name" : "org.apache.avro.data.Json",
>                   "fields" : [ {
>                       "name" : "value",
>                       "type" : [ "long", "double", "string", "boolean", "null",
>                                  {"type" : "array", "items" : "Json"},
>                                  {"type" : "map", "values" : "Json"}
>                                ]
>                   } ]
>               }
>              },
>              {"type" : "map", "values" : "Json"}
>            ]
> }
>
> This would change the binary format but would not change the
> representation that GenericDatumReader would hand you from my first
> example above (since the generic representation unwraps unions).
> Using this schema would require changes to Json.Writer and
> Json.Reader.  It would better conform to the definition of Json, which
> only permits objects as the top-level type.
>
>> Concerning the more specific schema, you are of course completely right. 
>> Unfortunately more or less all the fields in the JSON data format are 
>> optional and many have substructures, so, at least in my understanding, I 
>> have to use unions of null and the actual type throughout the schema. I 
>> tried using JsonDecoder first (or rather the fromjson option of the avro 
>> tool, which, I think, uses JsonDecoder) but given the current JSON 
>> structures, this didn't work.
>
>> So I'll probably have to look into implementing my own converter.  However 
>> given the rather complex structure of the original JSON I'm wondering if 
>> trying to represent the data in avro is such a good idea in the first place.
>
> It would be interesting to see whether, with the appropriate schema,
> the dataset is smaller and faster to process as Avro than as
> Json.  If you have 1000 fields in your data but the typical record
> only has one or two non-null, then an Avro record is perhaps not a
> good representation.  An Avro map might be better, but if the values
> are similarly variable then Json might be competitive.
>
> Cheers,
>
> Doug
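
Two of the claims quoted above can be checked with a little arithmetic on the Avro binary encoding: that a single-field record wrapper adds no bytes, and that a record with many optional fields, almost all null, is larger than a map holding only the populated entries. The following is a minimal sketch using only the Python standard library, not the avro package. It hand-encodes longs (zigzag varint), strings, ["null", "long"] unions, records and maps as the Avro specification describes them. The field names, the 1000-field width and the two sample values are hypothetical, chosen only for illustration.

```python
import json


def encode_long(n: int) -> bytes:
    """Avro int/long: zigzag-encode, then emit 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_optional_long(value) -> bytes:
    """A ["null", "long"] union: branch index, then the value (null is zero bytes)."""
    if value is None:
        return encode_long(0)
    return encode_long(1) + encode_long(value)


def encode_record(field_names, values: dict) -> bytes:
    """A record is just its fields concatenated in schema order, with no framing."""
    return b"".join(encode_optional_long(values.get(name)) for name in field_names)


def encode_map(values: dict) -> bytes:
    """A map is written here as one block of (key, value) pairs, then a zero count."""
    if not values:
        return encode_long(0)
    body = b"".join(
        encode_string(key) + encode_optional_long(val) for key, val in values.items()
    )
    return encode_long(len(values)) + body + encode_long(0)


# Hypothetical data: 1000 optional fields, only two of them populated.
FIELD_NAMES = [f"field{i:04d}" for i in range(1000)]
SPARSE = {"field0003": 42, "field0500": 7}

# A single-field record wrapper around a union value encodes to the same bytes.
bare = encode_optional_long(42)
wrapped = encode_record(["value"], {"value": 42})

record_size = len(encode_record(FIELD_NAMES, SPARSE))
map_size = len(encode_map(SPARSE))
json_size = len(json.dumps(SPARSE, separators=(",", ":")))

if __name__ == "__main__":
    print("wrapper adds no bytes:", bare == wrapped)
    print("record:", record_size, "map:", map_size, "json:", json_size)
```

Under these assumptions the record costs 1002 bytes, because every null field still needs a one-byte union index, while the map costs 26 bytes and compact JSON 30. Real data with longer keys, nested values or denser records will shift these numbers, so treat the comparison as an illustration of the trade-off rather than a benchmark.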
