On 06/08/2010 11:10 AM, Markus Weimer wrote:
Is there a way to "stream" the doubles into the output without holding a
copy in memory? Or is there another way to encode a double[] in a schema?

Avro arrays and maps are written in a blocked representation, so the binary encoding does support arbitrarily large arrays. But Java's specific API does not currently take advantage of this.

The BlockingBinaryEncoder will break large arrays into blocks on write. Note that this is non-trivial: arrays may contain nested objects and arrays, yet BlockingBinaryEncoder only starts a new block when the specified buffer size is exceeded. The only assumption is that no primitive leaf value exceeds the buffer size.

BlockingBinaryEncoder can be used with ValidatingEncoder and ValidatingDecoder to safely write code that streams instances of a schema. For example, for your schema:
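For context, here is a sketch of how that encoder stack might be assembled. The factory method names below (`blockingBinaryEncoder`, `validatingEncoder`) reflect the `EncoderFactory` API in current Avro releases and may differ in older versions:

```java
import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class LinearModelEncoders {
  // Build a ValidatingEncoder over a BlockingBinaryEncoder, so writes
  // are both blocked (streamed to the OutputStream as the buffer fills)
  // and checked against the schema.
  public static Encoder newEncoder(Schema schema, OutputStream out)
      throws IOException {
    EncoderFactory factory = EncoderFactory.get();
    Encoder blocking = factory.blockingBinaryEncoder(out, null);
    return factory.validatingEncoder(schema, blocking);
  }
}
```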

{"type": "record", "name": "LinearModel", "fields": [
   {"name": "weights", "type": {"type":"array", "items":"double"}}
]}

You could write instances with something like:

public void writeLinearModel(Encoder out,
                             Iterable<List<Double>> buffers)
    throws IOException {
  out.writeArrayStart();
  for (List<Double> buffer : buffers) {
    out.setItemCount(buffer.size());
    for (double d : buffer)
      out.writeDouble(d);
  }
  out.writeArrayEnd();
}

This would re-buffer, writing a block of doubles only when the BlockingBinaryEncoder's buffer is filled. A ValidatingEncoder could ensure that the sequence of calls conforms to the declared schema. One could alternatively structure the control flow differently, if Iterable<List<Double>> is not the natural way in which doubles are produced. For example, one could instead do something like:

public void writeLinearModel(Encoder out, Iterable<Double> dubs)
    throws IOException {
  out.writeArrayStart();
  for (double d : dubs) {
    out.setItemCount(1);
    out.writeDouble(d);
  }
  out.writeArrayEnd();
}

And still rely on BlockingBinaryEncoder to generate blocks only when the buffer is filled, e.g., every 64kB. Note that there are no per-record encoder/decoder calls, only per-value calls. The validator infers the record structure from the other calls.

Similarly, one could write a reader something like:

public Iterable<Double> readLinearModel(final Decoder in);

See http://avro.apache.org/docs/current/api/java/org/apache/avro/io/Decoder.html#readArrayStart() for the calls this should make. ValidatingDecoder could ensure that calls conform to the schema written.
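A sketch of such a reader, eagerly materializing into a List rather than returning a lazy Iterable for brevity (`readArrayStart`/`arrayNext` are the Decoder calls documented at the link above; they return the item count of the next block, or 0 when the array is exhausted):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.io.Decoder;

public class LinearModelReader {
  // Read the "weights" array of a LinearModel, block by block.
  public static List<Double> readLinearModel(Decoder in) throws IOException {
    List<Double> weights = new ArrayList<>();
    // readArrayStart()/arrayNext() give the size of each block; 0 ends the array.
    for (long n = in.readArrayStart(); n != 0; n = in.arrayNext()) {
      for (long i = 0; i < n; i++) {
        weights.add(in.readDouble());
      }
    }
    return weights;
  }
}
```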

These classes were implemented precisely to support this use case.

Doug
