On 06/08/2010 11:10 AM, Markus Weimer wrote:
> Is there a way to "stream" the doubles into the output without holding a
> copy in memory? Or is there another way to encode a double[] in a schema?
Avro arrays and maps are written in a blocked representation, so the
binary encoding does support arbitrarily large arrays. But Java's
specific API does not currently take advantage of this.
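To illustrate the blocked representation, here is a minimal sketch of
the wire format as the Avro specification describes it: each block is a
count followed by that many items, and a count of zero ends the array.
The class and method names (BlockedDoubles, writeBlock, readAll, ...)
are hypothetical and not part of Avro, and byte-array streams are used
only to keep the sketch self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of Avro's blocked array encoding for doubles.
final class BlockedDoubles {
  // Avro long: zig-zag encode, then base-128 varint, low group first.
  static void writeLong(ByteArrayOutputStream out, long n) {
    long z = (n << 1) ^ (n >> 63);
    while ((z & ~0x7FL) != 0) {
      out.write((int) ((z & 0x7F) | 0x80));
      z >>>= 7;
    }
    out.write((int) z);
  }

  // Avro double: the 8 bytes of the IEEE 754 bits, little-endian.
  static void writeDouble(ByteArrayOutputStream out, double d) {
    long bits = Double.doubleToLongBits(d);
    for (int i = 0; i < 8; i++) {
      out.write((int) ((bits >>> (8 * i)) & 0xFF));
    }
  }

  // One block: item count, then the items. Empty blocks are skipped
  // because a count of 0 is reserved as the end-of-array marker.
  static void writeBlock(ByteArrayOutputStream out, double[] block) {
    if (block.length == 0) return;
    writeLong(out, block.length);
    for (double d : block) writeDouble(out, d);
  }

  static void writeEnd(ByteArrayOutputStream out) {
    writeLong(out, 0);
  }

  static long readLong(ByteArrayInputStream in) {
    long z = 0;
    int shift = 0;
    int b;
    do {
      b = in.read();
      if (b < 0) throw new IllegalStateException("unexpected end of data");
      z |= (long) (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return (z >>> 1) ^ -(z & 1); // undo zig-zag
  }

  static double readDouble(ByteArrayInputStream in) {
    long bits = 0;
    for (int i = 0; i < 8; i++) {
      int b = in.read();
      if (b < 0) throw new IllegalStateException("unexpected end of data");
      bits |= (long) b << (8 * i);
    }
    return Double.longBitsToDouble(bits);
  }

  // Reads blocks until the zero count. A negative count means the
  // block's size in bytes follows; it is read and ignored here.
  static List<Double> readAll(ByteArrayInputStream in) {
    List<Double> values = new ArrayList<>();
    for (long n = readLong(in); n != 0; n = readLong(in)) {
      if (n < 0) {
        readLong(in);
        n = -n;
      }
      for (long i = 0; i < n; i++) values.add(readDouble(in));
    }
    return values;
  }
}
```

Because every block carries its own count, a writer never needs to know
the total array length up front, which is what makes streaming possible.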
The BlockingBinaryEncoder will break large arrays into blocks on write.
It does this cleverly: arrays may contain nested objects and arrays,
yet BlockingBinaryEncoder only starts a new block when the specified
buffer size is exceeded. The only assumption is that no single
primitive leaf value exceeds the buffer size.
BlockingBinaryEncoder can be used with ValidatingEncoder and
ValidatingDecoder to safely write code that streams instances of a
schema. For example, for your schema:
{"type": "record", "name": "LinearModel", "fields": [
{"name": "weights", "type": {"type":"array", "items":"double"}}
]}
You could write instances with something like:
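// needs java.io.IOException, java.util.List, org.apache.avro.io.Encoder
public void writeLinearModel(Encoder out,
                             Iterable<List<Double>> buffers)
    throws IOException {
  out.writeArrayStart();
  for (List<Double> buffer : buffers) {
    out.setItemCount(buffer.size());  // items in this batch
    for (double d : buffer) {
      out.startItem();                // must precede each item
      out.writeDouble(d);
    }
  }
  out.writeArrayEnd();
}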
This would re-buffer, writing a block of doubles only when the
BlockingBinaryEncoder's buffer is filled. Wrapping it in a
ValidatingEncoder would ensure that the sequence of calls conforms to
the declared schema. One could structure the control flow differently
if Iterable<List<Double>> is not the natural way in which doubles are
produced. For example, one could instead do something like:
public void writeLinearModel(Encoder out, Iterable<Double> dubs)
    throws IOException {
  out.writeArrayStart();
  for (double d : dubs) {
    out.setItemCount(1);  // announce one more item
    out.startItem();      // must precede each item
    out.writeDouble(d);
  }
  out.writeArrayEnd();
}
This still relies on BlockingBinaryEncoder to generate blocks only when
the buffer is filled, e.g., every 64kB. Note that there are no
per-record encoder/decoder calls, only per-value calls. The validator
infers the record from the other calls.
Similarly, one could write a reader something like:
public Iterable<Double> readLinearModel(final Decoder in);
See
http://avro.apache.org/docs/current/api/java/org/apache/avro/io/Decoder.html#readArrayStart()
for the calls this should make. ValidatingDecoder could ensure that
calls conform to the schema written.
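The reader above is only declared, so here is a rough sketch of what
its body might look like, following the loop pattern documented for
Decoder.readArrayStart() and Decoder.arrayNext(). It is a simplified
variant that pushes each value to a callback instead of returning an
Iterable<Double>; the class name LinearModelReader and the sink
parameter are hypothetical.

```java
import java.io.IOException;
import java.util.function.DoubleConsumer;
import org.apache.avro.io.Decoder;

// Hypothetical helper, not part of Avro.
final class LinearModelReader {
  /** Streams every weight to sink without materializing the array. */
  static void readLinearModel(Decoder in, DoubleConsumer sink)
      throws IOException {
    // readArrayStart() returns the size of the first block and
    // arrayNext() the size of each later one; 0 means end of array.
    for (long n = in.readArrayStart(); n != 0; n = in.arrayNext()) {
      for (long i = 0; i < n; i++) {
        sink.accept(in.readDouble());
      }
    }
  }
}
```

Only one value is held at a time, so memory use does not grow with the
array. Assuming the factory API of later Avro releases, a validated
decoder to pass in could be obtained with
DecoderFactory.get().validatingDecoder(schema,
DecoderFactory.get().binaryDecoder(inputStream, null)).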
These classes were implemented precisely to support this use case.
Doug