I had no hand in the design, but it is very elegant and I'll throw in my two 
cents.

Avro is an interchange format. The in-memory representation is entirely up to 
you and your implementation language of choice.

The provided Java implementation allows for seamless mixing of Generic 
(everything is an Object, with some conventions: e.g. strings must be some sort 
of CharSequence but are generally read as String or Utf8, and arrays are handled 
as java.util.List), Specific (which allows generated Java classes for record 
schemas and real Java enums for enum schemas), and Reflect (which lets you 
serialize/deserialize regular Java objects via reflection, and incidentally 
DOES support (de)serialization of native arrays).
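The Reflect model's support for native arrays comes down to plain Java 
reflection. Here is a stdlib-only sketch (the class and field names are my own 
for illustration, not Avro API) of the kind of introspection involved: 
discovering a field that happens to be a native array and walking its elements 
without knowing the component type at compile time.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Field;

public class ReflectArrayDemo {
    // A hypothetical user bean of the kind the Reflect model handles.
    public static class Prices {
        public double[] prc = {9.97, 5.56, 21.48};
    }

    // Sum a numeric array field found by name, using only reflection.
    static double sum(Object bean, String fieldName) throws Exception {
        Field f = bean.getClass().getField(fieldName);
        Object array = f.get(bean); // the double[] seen as an opaque Object
        double total = 0;
        for (int i = 0; i < Array.getLength(array); i++) {
            // Array.get boxes each primitive element for us.
            total += ((Number) Array.get(array, i)).doubleValue();
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sum(new Prices(), "prc"));
    }
}
```

The real ReflectDatumWriter does considerably more (schema induction, caching, 
and so on), but this is the underlying mechanism that makes native arrays 
workable there.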

Since at the Generic level the in-memory representation of any schema is an 
Object (including the primitive types, which must be boxed, and null, which we 
could argue about semantically), it would be hard to deal with unboxed 
primitive array elements anyway. At that point I don't think there is any real 
benefit to using native arrays, and as mentioned, java.util.List provides a 
more flexible interface. (Note that when serializing, though not deserializing, 
any java.util.Collection will do, though it is to your benefit to use one with 
a defined ordering.) Note also that Avro supports object re-use during 
deserialization, which is more likely to be effective with a List 
implementation, since you can't change the size of an array.
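Concretely, the ClassCastException in the original message comes from handing a 
double[] to a field whose Generic representation must be a Collection. A 
minimal stdlib sketch of the fix (the helper name is mine, not Avro's): box 
the native array into the List<Double> that record.put("prc", ...) expects.

```java
import java.util.ArrayList;
import java.util.List;

public class BoxedPrices {
    // Box a native double[] into the List<Double> that the Generic data
    // model expects for a field whose schema is {"type":"array","items":"double"}.
    static List<Double> toList(double[] values) {
        List<Double> boxed = new ArrayList<>(values.length);
        for (double v : values) {
            boxed.add(v); // each primitive element is boxed to Double
        }
        return boxed;
    }

    public static void main(String[] args) {
        double[] nums = {9.97, 5.56, 21.48};
        // record.put("prc", toList(nums)) satisfies the java.util.Collection
        // check that a raw double[] fails.
        System.out.println(toList(nums));
    }
}
```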

Were you really to care (as per my "elegant" point above), you can implement 
your own in-memory representations (though you'd want a pretty good reason, 
and I'm not suggesting this is one of them). Indeed, this is a feature we use 
ourselves: for a certain application data type, the most natural in-memory 
representation is quite different from the most efficient serialized schema. 
Avro makes it easy for us to do this without "hacking" anything, though at the 
cost of implementing a relatively small amount of code, and in our case we 
only care about it in Java.

On Sep 24, 2013, at 2:20 PM, Mika Ristimaki <[email protected]> wrote:

> 
> On Sep 24, 2013, at 9:46 PM, Raihan Jamal <[email protected]> wrote:
> 
>> Thanks a lot Mika. Yeah, it works now but my second question is- Does the 
>> avro schema that I have made looks good as compared to JSON value that we 
>> were using previously?
>> I thought we can use an array for that so designed like that using an Apache 
>> Avro..
>> 
> 
> This is an application design question, and not related to Avro. If you have 
> a list of prices,  array is a good place to store them.
> 
>> And also why Avro Array uses java.util.List datatype? Just curious to know 
>> on that as well.
> 
> Someone who has actually designed Avro can answer this better, but I assume 
> that List was chosen because it is much more convenient to use than java 
> arrays. You don't need to know the size before hand, etc.
> 
> -Mika
> 
>> 
>> Thanks for the help.
>> 
>> Raihan Jamal
>> 
>> 
>> On Tue, Sep 24, 2013 at 11:40 AM, Mika Ristimaki <[email protected]> 
>> wrote:
>> Hi,
>> 
>> Avro array uses java.util.List datatype. So you must do something like
>> 
>> List<Double> nums = new ArrayList<Double>();
>> nums.add(new Double(9.97));
>> .
>> .
>> 
>> On Sep 24, 2013, at 9:02 PM, Raihan Jamal <[email protected]> wrote:
>> 
>>> Earlier, I was using JSON in our project so one of our attribute data looks 
>>> like below in JSON format. Below is the attribute `e3` data in JSON format.
>>>     
>>>     {"lv":[{"v":{"prc":9.97}},{"v":{"prc":5.56}},{"v":{"prc":21.48}}]}
>>>     
>>> Now, I am planning to use Apache Avro for our Data Serialization format. So 
>>> I decided to design the Avro schema for the above attributes data. And I 
>>> came up with the below design.
>>>   
>>>     {
>>>      "namespace": "com.avro.test.AvroExperiment",
>>>      "type": "record",
>>>      "name": "AVG_PRICE",
>>>      "doc": "AVG_PRICE data",
>>>      "fields": [
>>>          {"name": "prc", "type": {"type": "array", "items": "double"}}
>>>      ]
>>>     }
>>> 
>>> Now, I am not sure whether the above schema looks right or not 
>>> corresponding to the values I have in JSON? Can anyone help me on that? 
>>> Assuming the above schema looks correct, if I try to serialize the data 
>>> using the above avro schema, I always get the below error-
>>>   
>>>     double[] nums = new double[] { 9.97, 5.56, 21.48 };
>>>     
>>>     Schema schema = new Parser().parse(AvroExperiment.class.getResourceAsStream("/aspmc.avsc"));
>>>     GenericRecord record = new GenericData.Record(schema);
>>>     record.put("prc", nums);
>>>     
>>>     GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
>>>     ByteArrayOutputStream os = new ByteArrayOutputStream(); 
>>> 
>>>     Encoder e = EncoderFactory.get().binaryEncoder(os, null);
>>>     
>>>     // this line gives me exception..
>>>     writer.write(record, e); 
>>>     
>>> Below is the exception, I always get-
>>> 
>>>     Exception in thread "main" java.lang.ClassCastException: [D 
>>> incompatible with java.util.Collection
>>>     
>>> Any idea what wrong I am doing here?
>> 
>> 
> 
