On 06/08/2010 11:10 AM, Markus Weimer wrote:
Is there a way to "stream" the doubles into the output without holding a
copy in memory? Or is there another way to encode a double[] in a schema?

Avro arrays and maps are written in a blocked representation, so the binary encoding does support arbitrarily large arrays. But Java's specific API does not currently take advantage of this.

The BlockingBinaryEncoder will break large arrays into blocks on write. Note that this is non-trivial: arrays may contain nested objects and arrays, yet BlockingBinaryEncoder only starts a new block when the specified buffer size is exceeded. The only assumption is that no primitive leaf value exceeds the buffer size.

BlockingBinaryEncoder can be used with ValidatingEncoder and ValidatingDecoder to safely write code that streams instances of a schema. For example, for your schema:
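For context, here is a sketch of how that encoder stack might be assembled. The factory method names below (`blockingBinaryEncoder`, `validatingEncoder`) reflect the `EncoderFactory` API in current Avro releases and may differ in older versions:

```java
import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

public class LinearModelEncoders {
  // Build a ValidatingEncoder over a BlockingBinaryEncoder, so writes
  // are both blocked (streamed to the OutputStream as the buffer fills)
  // and checked against the schema.
  public static Encoder newEncoder(Schema schema, OutputStream out)
      throws IOException {
    EncoderFactory factory = EncoderFactory.get();
    Encoder blocking = factory.blockingBinaryEncoder(out, null);
    return factory.validatingEncoder(schema, blocking);
  }
}
```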

{"type": "record", "name": "LinearModel", "fields": [
   {"name": "weights", "type": {"type":"array", "items":"double"}}
]}

You could write instances with something like:

public void writeLinearModel(Encoder out,
                             Iterable<List<Double>> buffers)
    throws IOException {
  out.writeArrayStart();
  for (List<Double> buffer : buffers) {
    out.setItemCount(buffer.size());
    for (double d : buffer)
      out.writeDouble(d);
  }
  out.writeArrayEnd();
}

This would re-buffer, writing a block of doubles only when the BlockingBinaryEncoder's buffer is filled. A ValidatingEncoder could ensure that the sequence of calls conforms to the declared schema. One could alternatively structure the control flow differently, if Iterable<List<Double>> is not the natural way in which doubles are produced. For example, one could instead do something like:

public void writeLinearModel(Encoder out, Iterable<Double> dubs)
    throws IOException {
  out.writeArrayStart();
  for (double d : dubs) {
    out.setItemCount(1);
    out.writeDouble(d);
  }
  out.writeArrayEnd();
}

And still rely on BlockingBinaryEncoder to generate blocks only when the buffer is filled, e.g., every 64kB. Note that there are no per-record encoder/decoder calls, only per-value calls. The validator infers the record structure from the other calls.

Similarly, one could write a reader something like:

public Iterable<Double> readLinearModel(final Decoder in);

See http://avro.apache.org/docs/current/api/java/org/apache/avro/io/Decoder.html#readArrayStart() for the calls this should make. ValidatingDecoder could ensure that calls conform to the schema written.
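A sketch of such a reader, eagerly materializing into a List rather than returning a lazy Iterable for brevity (`readArrayStart`/`arrayNext` are the Decoder calls documented at the link above; they return the item count of the next block, or 0 when the array is exhausted):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.io.Decoder;

public class LinearModelReader {
  // Read the "weights" array of a LinearModel, block by block.
  public static List<Double> readLinearModel(Decoder in) throws IOException {
    List<Double> weights = new ArrayList<>();
    // readArrayStart()/arrayNext() give the size of each block; 0 ends the array.
    for (long n = in.readArrayStart(); n != 0; n = in.arrayNext()) {
      for (long i = 0; i < n; i++) {
        weights.add(in.readDouble());
      }
    }
    return weights;
  }
}
```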

These classes were implemented precisely to support this use case.

Doug
