Hello,
We are starting a project that uses MapReduce to produce Avro files. In short,
our job produces Avro records that can contain very large arrays, and we
cannot practically predict how large some of them will get.
When we hit one of these "very large" records, the BufferedBinaryEncoder seems
to blow out the heap when calling
org.apache.avro.mapred.AvroMultipleOutputs$1.collect() from a reducer (see
stack trace below).
Browsing through the Avro code and the JIRA issues, it seems that AVRO-105
could be part of the solution here, as I believe we would want to be able to
use the BlockingBinaryEncoder (or perhaps even the DirectBinaryEncoder?) to
write these large arrays in a memory-efficient manner.
Am I on the right track here? If so, it also seems that we would need an
additional feature to configure/enable this from mapred via the JobConf.
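To make concrete what I mean by "memory-efficient": as I understand the Avro
spec, an array is encoded as a series of blocks, each a long item count
followed by that many items, and the array ends with a block whose count is
zero, so a writer never needs the whole array in memory at once. Below is a
rough, stdlib-only sketch of that idea for an array of longs. The class and
method names (BlockedArraySketch, writeLongArray, flushBlock) are made up for
illustration and are not Avro API, and the sketch ignores the optional
negative-count form that also carries a byte size.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;
import java.util.stream.LongStream;

// Hypothetical illustration only; none of these names exist in Avro.
class BlockedArraySketch {

    // Avro-style long: zig-zag encode, then variable-length base-128 bytes.
    static void writeLong(OutputStream out, long n) throws IOException {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Emit one block: the item count, then the items themselves.
    static void flushBlock(OutputStream out, long[] block, int n) throws IOException {
        writeLong(out, n);
        for (int i = 0; i < n; i++) {
            writeLong(out, block[i]);
        }
    }

    // Stream an arbitrarily long sequence while holding at most blockSize items.
    static void writeLongArray(OutputStream out, Iterator<Long> items, int blockSize)
            throws IOException {
        long[] block = new long[blockSize];
        int n = 0;
        while (items.hasNext()) {
            block[n++] = items.next();
            if (n == blockSize) {
                flushBlock(out, block, n);
                n = 0;
            }
        }
        if (n > 0) {
            flushBlock(out, block, n);
        }
        writeLong(out, 0); // a zero count terminates the array
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLongArray(out, LongStream.of(1, 2, 3).iterator(), 2);
        System.out.println(out.size() + " bytes written");
    }
}
```

In our real job the OutputStream would be the output file rather than an
in-memory buffer; the point is only that memory use is bounded by the block
size rather than by the array length.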
Since I'm not yet that familiar with the internals of Avro, I would
appreciate it if anyone could give me a sanity check and/or offer other
suggestions as to how we might work around this problem.
Thanks in advance for your help,
-Mike
Error running child : java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2786)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
    at org.apache.avro.io.BufferedBinaryEncoder$OutputStreamSink.innerWrite(BufferedBinaryEncoder.java:216)
    at org.apache.avro.io.BufferedBinaryEncoder.flushBuffer(BufferedBinaryEncoder.java:93)
    at org.apache.avro.io.BufferedBinaryEncoder.ensureBounds(BufferedBinaryEncoder.java:108)
    at org.apache.avro.io.BufferedBinaryEncoder.writeFixed(BufferedBinaryEncoder.java:153)
    at org.apache.avro.io.Encoder.writeFixed(Encoder.java:174)
    at org.apache.avro.io.BufferedBinaryEncoder.writeFixed(BufferedBinaryEncoder.java:164)
    at org.apache.avro.io.BinaryEncoder.writeBytes(BinaryEncoder.java:65)
    at org.apache.avro.generic.GenericDatumWriter.writeBytes(GenericDatumWriter.java:212)
    at org.apache.avro.reflect.ReflectDatumWriter.writeBytes(ReflectDatumWriter.java:93)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:77)
    at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
    at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
    at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:106)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
    at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
    at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:131)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68)
    at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
    at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:106)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
    at org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:104)
    at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
    at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:257)
    at org.apache.avro.mapred.AvroOutputFormat$1.write(AvroOutputFormat.java:160)
    at org.apache.avro.mapred.AvroOutputFormat$1.write(AvroOutputFormat.java:157)
    at org.apache.avro.mapred.AvroMultipleOutputs$RecordWriterWithCounter.write(AvroMultipleOutputs.java:436)
    at org.apache.avro.mapred.AvroMultipleOutputs$1.collect(AvroMultipleOutputs.java:499)