Hi Doug,

Thanks for the suggestion; I wasn't aware that one could specify
anything other than a record as the top-level element of a schema.
I tried this, and it works well for flat data, but with nested
structures you still have to go through the "value" indirection, or so
it seems.
I also think this change might break existing code that relies on the
current structure.
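
To illustrate what I mean by the indirection, this is roughly what my
access code looks like at the moment (just a sketch; "user" and "name"
are made-up field names standing in for my data):

    import java.util.Map;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.util.Utf8;

    // datum is what GenericDatumReader handed back for one Json record
    GenericRecord wrapper = (GenericRecord) datum;
    Map<?, ?> fields = (Map<?, ?>) wrapper.get("value");

    // nested objects are wrapped records again, so every level needs
    // another trip through "value", always with Utf8 keys
    GenericRecord user = (GenericRecord) fields.get(new Utf8("user"));
    Map<?, ?> userFields = (Map<?, ?>) user.get("value");
    Object name = userFields.get(new Utf8("name"));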

As far as size and performance go, I'll probably have to run some tests
on real data once I've come up with an appropriate schema that actually
matches it.
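
For the size comparison I'll probably start with something as simple as
this (a rough sketch; jsonNode is the Jackson tree parsed from one of
my documents, jsonText the original string):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import org.apache.avro.data.Json;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new Json.Writer().write(jsonNode, encoder);  // Avro binary encoding
    encoder.flush();
    System.out.println("avro: " + out.size() + " bytes, json: "
        + jsonText.getBytes(StandardCharsets.UTF_8).length + " bytes");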

Again, thanks a lot for your help.

Best,
-markus

On 18.09.2012 at 21:38, Doug Cutting wrote:

> On Tue, Sep 18, 2012 at 11:34 AM, Markus Strickler <[email protected]> 
> wrote:
>> Json.Writer is indeed what I had in mind, and I have successfully managed
>> to convert my existing JSON to Avro using it.
>> However, using GenericDatumReader on this feels pretty unnatural, as I seem
>> to be unable to access fields directly. It seems I have to access the
>> "value" field on each record, which returns a Map that uses Utf8 objects as
>> keys for the actual fields. Or am I doing something wrong here?
> 
> Hmm.  We could re-factor Json.SCHEMA so the union is the top-level
> element.  That would get rid of the wrapper around every value.  It's
> a more redundant way to write the schema, but the binary encoding is
> identical (since a record wrapper adds no bytes).  It would hence
> require no changes to Json.Reader or Json.Writer.
> 
> [ "long",
>  "double",
>  "string",
>  "boolean",
>  "null",
>  {"type" : "array",
>   "items" : {
>       "type" : "record",
>       "name" : "org.apache.avro.data.Json",
>       "fields" : [ {
>           "name" : "value",
>           "type" : [ "long", "double", "string", "boolean", "null",
>                      {"type" : "array", "items" : "Json"},
>                      {"type" : "map", "values" : "Json"}
>                    ]
>       } ]
>   }
>  },
>  {"type" : "map", "values" : "Json"}
> ]
> 
> You can try this by placing this schema in
> share/schemas/org/apache/avro/data/Json.avsc and re-building the avro
> jar.
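> 
> Reading would then hand you the value directly, without the wrapper
> record -- roughly (an untested sketch using the generic API; "in" is
> your InputStream):
> 
>   import org.apache.avro.data.Json;
>   import org.apache.avro.generic.GenericDatumReader;
>   import org.apache.avro.io.*;
> 
>   DatumReader<Object> reader = new GenericDatumReader<Object>(Json.SCHEMA);
>   Decoder decoder = DecoderFactory.get().binaryDecoder(in, null);
>   Object datum = reader.read(null, decoder);  // a Map, Long, Utf8, etc.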
> 
> Would such a change be useful to you?  If so, please file an issue in Jira.
> 
> Or we could even refactor this schema so that a Json object is the
> top-level structure:
> 
> {"type" : "map",
> "values" : [ "long",
>              "double",
>              "string",
>              "boolean",
>              "null",
>              {"type" : "array",
>               "items" : {
>                   "type" : "record",
>                   "name" : "org.apache.avro.data.Json",
>                   "fields" : [ {
>                       "name" : "value",
>                       "type" : [ "long", "double", "string", "boolean", 
> "null",
>                                  {"type" : "array", "items" : "Json"},
>                                  {"type" : "map", "values" : "Json"}
>                                ]
>                   } ]
>               }
>              },
>              {"type" : "map", "values" : "Json"}
>            ]
> }
> 
> This would change the binary format, but the representation that
> GenericDatumReader hands you would stay the same as in my first
> example above (since the generic representation unwraps unions).
> Using this schema would require changes to Json.Writer and
> Json.Reader.  It would better conform to the definition of Json, which
> only permits objects as the top-level type.
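> 
> The top-level datum would then simply be a map.  Continuing the sketch
> above (with the changed schema in place, and "someField" a made-up
> key):
> 
>   import java.util.Map;
>   import org.apache.avro.util.Utf8;
> 
>   Map<?, ?> json = (Map<?, ?>) reader.read(null, decoder);
>   Object field = json.get(new Utf8("someField"));  // unwrapped value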
> 
>> Concerning the more specific schema, you are of course completely right. 
>> Unfortunately more or less all the fields in the JSON data format are 
>> optional and many have substructures, so, at least in my understanding, I 
>> have to use unions of null and the actual type throughout the schema. I 
>> tried using JsonDecoder first (or rather the fromjson option of the avro 
>> tool, which, I think, uses JsonDecoder) but given the current JSON 
>> structures, this didn't work.
> 
>> So I'll probably have to look into implementing my own converter.  However,
>> given the rather complex structure of the original JSON, I'm wondering
>> whether trying to represent the data in Avro is such a good idea in the
>> first place.
> 
> It would be interesting to see whether, with the appropriate schema,
> the dataset is smaller and faster to process as Avro than as
> Json.  If you have 1000 fields in your data but the typical record
> only has one or two non-null, then an Avro record is perhaps not a
> good representation.  An Avro map might be better, but if the values
> are similarly variable then Json might be competitive.
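> 
> (And for the optional fields, per-field unions with null are indeed
> the usual approach, e.g. a field like
> {"name" : "age", "type" : [ "null", "long" ], "default" : null},
> where "age" is just a made-up name.)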
> 
> Cheers,
> 
> Doug
