You might instead try using the blob's reader() method? That gives you a channel that streams the object, so DataFileStream only has to read the file header to get the schema.

Something like:

InputStream input = Channels.newInputStream(blob.reader());
try (DataFileStream<GenericRecord> stream =
    new DataFileStream<>(input, new GenericDatumReader<>())) {
  // The constructor reads just the file header, which contains the schema.
  return stream.getSchema();
}
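Untested, but a fuller, self-contained version might look like the sketch below. The readSchema helper and the Storage client setup (including the bucket and object names) are mine, just for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class AvroSchemaPeek {

  // Wraps the blob's ReadChannel in an InputStream and lets
  // DataFileStream parse the Avro header, so only the beginning of
  // the object is downloaded rather than the whole file.
  static Schema readSchema(Blob blob) throws IOException {
    InputStream input = Channels.newInputStream(blob.reader());
    try (DataFileStream<GenericRecord> stream =
        new DataFileStream<>(input, new GenericDatumReader<>())) {
      return stream.getSchema();
    }
  }

  public static void main(String[] args) throws IOException {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    // Placeholder bucket/object names.
    Blob blob = storage.get(BlobId.of("my-bucket", "path/to/file.avro"));
    System.out.println(readSchema(blob).toString(true));
  }
}
```

The resulting Schema can then be passed to
AvroIO.readGenericRecords(schema).from(filePattern) as in your pipeline.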

Doug

On Wed, May 2, 2018 at 4:30 PM Rodrigo Ipince <rodr...@leanplum.com> wrote:

> Hi,
>
>
> (Disclaimer: I'm new to Avro and Beam)
>
>
> Question: *is there a way to read the schema from an Avro file in GCS
> without having to read the entire file?*
>
>
> Context:
>
> I have a bunch of large files in GCS.
>
> I want to process them by doing
> AvroIO.readGenericRecords(theSchema).from(filePattern) (this is from the
> Apache Beam SDK). However, I don’t know the schema up front.
>
>
> Now, I can read one of the files and extract the schema from it up front,
> sort of like this:
>
> ```
> Blob avroFile = … // get Blob from GCS
>
> SeekableInput seekableInput =
>     new SeekableByteArrayInput(avroFile.getContent());
> DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
>
> try (DataFileReader<GenericRecord> dataFileReader =
>     new DataFileReader<>(seekableInput, datumReader)) {
>   String schema = dataFileReader.getSchema().toString();
> }
> ```
>
>
> but... the file is really large, and my nodes are really tiny, so they run
> out of memory. Is there a way to extract the schema without having to read
> the entire file?
>
> Thanks!
>
>
