Hi,

I'm trying to produce Avro output that contains a bytes field which can be
fairly large (it should mostly be 1-10KB, but might occasionally be greater
than 1MB). The input files are stored on disk, and I'd like to convert them
to Avro without reading entire files into memory.

My first attempt at this was to override GenericDatumWriter as follows:

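import java.io.BufferedInputStream;
import java.io.IOException;

import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.Encoder;

class GenericByteStreamDatumWriter<D> extends GenericDatumWriter<D> {
    @Override
    protected void writeBytes(Object datum, Encoder out) throws IOException {
        if (datum instanceof BufferedInputStream) {
            // Stream the field contents to the encoder in 4 KB chunks.
            BufferedInputStream in = (BufferedInputStream) datum;
            byte[] buf = new byte[4096];
            int bytesRead = 0;
            while (bytesRead != -1) {
                bytesRead = in.read(buf, 0, buf.length);
                if (bytesRead > 0) {
                    // Pass on exactly the number of bytes that were read.
                    out.writeBytes(buf, 0, bytesRead);
                }
            }
        } else {
            // Any other datum type goes through the default handling.
            super.writeBytes(datum, out);
        }
    }
}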



I then set the bytes field of a GenericRecord to the appropriate
BufferedInputStream. This works, but it seems quite slow and uses a lot of
memory.
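
For comparison, here is a minimal sketch of one possible direction, using
only the JDK. It assumes the Avro binary encoding of a bytes value (a
zig-zag varint length followed by the raw data) and assumes the length is
known up front, for example from the file size on disk. As far as I can
tell, each Encoder.writeBytes(buf, start, len) call writes its own length
prefix, so the sketch writes the prefix once and then copies the stream in
chunks. StreamingBytesSketch, writeLong and writeBytesValue are hypothetical
names, not part of the Avro API, and the sketch does not plug into
GenericDatumWriter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration only; not part of the Avro API.
final class StreamingBytesSketch {

    // Writes n as an Avro long: zig-zag encoded, then 7 bits per byte,
    // least significant group first, high bit set on all but the last byte.
    static void writeLong(long n, OutputStream out) throws IOException {
        long v = (n << 1) ^ (n >> 63);
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Writes one bytes value: a single length prefix, then exactly
    // `length` bytes copied from the stream through a 4 KB buffer.
    static void writeBytesValue(InputStream in, long length, OutputStream out)
            throws IOException {
        writeLong(length, out);
        byte[] buf = new byte[4096];
        long remaining = length;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) {
                throw new EOFException("stream ended " + remaining + " bytes early");
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a file on disk.
        byte[] payload = new byte[10_000];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeBytesValue(new ByteArrayInputStream(payload), payload.length, out);
        System.out.println("encoded size: " + out.size());
    }
}
```

With a real file, the InputStream would be a FileInputStream and the length
would come from the file size, so only the 4 KB buffer is held in memory at
any time.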

Any ideas on how to do this properly? This seems like a fairly common task,
so I imagine it's come up before.
