Jim Pivarski created AVRO-1922:
----------------------------------

             Summary: Fixed dimension for array
                 Key: AVRO-1922
                 URL: https://issues.apache.org/jira/browse/AVRO-1922
             Project: Avro
          Issue Type: New Feature
          Components: spec
    Affects Versions: 1.8.1
            Reporter: Jim Pivarski
             Fix For: 1.9.0


This is a feature request for future versions of the Avro specification.

We have found one kind of data structure that is hard to express in Avro: 
tensors. Although we can (and do) build matrices as {"type": "array", "items": 
{"type": "array", "items": "double"}}, this type does not specify that the grid 
of numbers is rectangular. We believe that rectangular arrays of numbers (or 
other nested types) would be a strong addition to Avro, both as a type system 
and as a serialization format. With the size of each dimension fixed in the 
schema, those sizes would not need to be repeated in each serialized datum.

For instance, suppose there was an extension of type "array" to specify 
dimensions:

    {"type": "array", "dimensions": [3, 3, 3, 3], "items": "double"}

This 3-by-3-by-3-by-3 tensor (representing, for instance, the Riemann curvature 
tensor in 3-space) specifies that 81 double-precision numbers (3*3*3*3) are 
expected for each datum. With nested arrays, the size "3" would have to be 
separately encoded 40 times (1 + 3*(1 + 3*(1 + 3))) for each datum, even though 
it would never change in a dataset of Riemann tensors. With a "dimensions" 
attribute in the schema, only the content needs to be serialized. Moreover, 
this extension can clearly be used with any other "items" type, to make dense 
tables of strings, for instance.
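
The arithmetic above can be checked with a short sketch. The function names 
are hypothetical, and the count assumes one length value per nested array, as 
in the figures quoted above; it does not count the zero-length block that 
Avro writes to terminate each array.

```python
from math import prod


def nested_length_prefixes(dimensions):
    """Count the array lengths written when a tensor is stored as nested arrays.

    Assumes one length value per array (hypothetical helper; the end-of-array
    marker that Avro also writes is deliberately not counted).
    """
    # Depth 0 holds 1 array, depth 1 holds dimensions[0] arrays, and so on.
    return sum(prod(dimensions[:depth]) for depth in range(len(dimensions)))


def element_count(dimensions):
    """Number of items serialized for one datum with the given dimensions."""
    return prod(dimensions)


riemann = [3, 3, 3, 3]
print(nested_length_prefixes(riemann))  # lengths repeated in every datum
print(element_count(riemann))           # items that carry actual content
```

With the proposed "dimensions" attribute, only the second number would 
contribute to the size of each datum.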

Avro has been extended in a similar way in the past. The "fixed" type is 
essentially "bytes" whose length is declared once in the schema instead of 
being encoded with each datum. Our proposal provides a similar saving for 
structured objects, one that can be significant for arrays with many 
dimensions, as shown above. The advantage to 
consumers of Avro data is that we can write functions that do not need to check 
all array sizes at runtime (for operations like tensor contractions and 
products).
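
To illustrate the runtime check that consumers need today, here is a minimal 
sketch, assuming the nested arrays have been decoded into Python lists. The 
helper name has_shape is hypothetical and not part of any Avro library.

```python
def has_shape(value, dimensions):
    """Return True if `value` is a rectangular nest of lists with these sizes.

    Hypothetical helper: this is the per-datum check that a schema-level
    "dimensions" attribute would make unnecessary.
    """
    if not dimensions:
        # Past the innermost dimension we expect a plain item, not a list.
        return not isinstance(value, list)
    if not isinstance(value, list) or len(value) != dimensions[0]:
        return False
    return all(has_shape(item, dimensions[1:]) for item in value)


matrix = [[1.0, 2.0], [3.0, 4.0]]
ragged = [[1.0, 2.0], [3.0]]
print(has_shape(matrix, [2, 2]))
print(has_shape(ragged, [2, 2]))
```

A function such as a tensor contraction would have to run a check like this 
on every datum, whereas a fixed-dimension type would let it rely on the 
schema alone.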

We have searched the web and the Avro JIRA site for similar proposals and found 
none, so we're adding this proposal to JIRA in addition to this e-mail. Please 
let us know if you have any comments, or if we can provide any more information.

Thank you!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
