You might instead try the blob's reader() method, which streams the content rather than loading it all into memory. Something like:

    InputStream input = Channels.newInputStream(blob.reader());
    try {
      // DataFileStream reads only the file header here, which holds the schema.
      return new DataFileStream<GenericRecord>(input, new GenericDatumReader<GenericRecord>()).getSchema();
    } finally {
      input.close();
    }

Doug

On Wed, May 2, 2018 at 4:30 PM Rodrigo Ipince <rodr...@leanplum.com> wrote:

> Hi,
>
> (Disclaimer: I'm new to Avro and Beam)
>
> Question: *is there a way to read the schema from an Avro file in GCS
> without having to read the entire file?*
>
> Context:
>
> I have a bunch of large files in GCS
>
> I want to process them by doing
> AvroIO.readGenericRecords(theSchema).from(filePattern) (this is from the
> Apache Beam SDK). However, I don’t know the schema up front.
>
> Now, I can read one of the files and extract the schema from it up front,
> sort of like this:
>
> ```
> Blob avroFile = … // get Blob from GCS
>
> SeekableInput seekableInput = new
> SeekableByteArrayInput(avroFile.getContent());
>
> DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
>
> try (DataFileReader<GenericRecord> dataFileReader = new
> DataFileReader<>(seekableInput, datumReader)) {
>
> String schema = dataFileReader.getSchema().toString();
>
> }
> ```
>
> but.. the file is really large, and my nodes are really tiny, so they run
> out of memory. Is there a way to not have to read the entire file in order
> to extract the schema?
>
> Thanks!
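
For what it's worth, streaming works here because an Avro container file keeps its schema in the header: the magic bytes "Obj" plus 1, then a metadata map with an "avro.schema" entry, then a 16-byte sync marker. Below is a rough, stdlib-only sketch of reading just that header from any InputStream. The class AvroHeaderSchema and its method names are made up for illustration and are not part of Avro. In real code DataFileStream already does this for you.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Avro: pulls the schema JSON out of the
// container-file header without reading any data blocks.
final class AvroHeaderSchema {
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    static String readSchemaJson(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] magic = new byte[MAGIC.length];
        data.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IOException("Not an Avro container file");
        }
        String schema = null;
        // The metadata map is a series of blocks, terminated by a zero count.
        for (long count = readLong(data); count != 0; count = readLong(data)) {
            if (count < 0) {
                count = -count;
                readLong(data); // block size in bytes, not needed here
            }
            for (long i = 0; i < count; i++) {
                String key = new String(readBytes(data), StandardCharsets.UTF_8);
                byte[] value = readBytes(data);
                if (key.equals("avro.schema")) {
                    schema = new String(value, StandardCharsets.UTF_8);
                }
            }
        }
        if (schema == null) {
            throw new IOException("Header has no avro.schema entry");
        }
        return schema;
    }

    // Avro encodes longs as zigzag variable-length integers.
    private static long readLong(DataInputStream in) throws IOException {
        long raw = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            int b = in.readUnsignedByte();
            raw |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return (raw >>> 1) ^ -(raw & 1);
            }
        }
        throw new IOException("Malformed variable-length long");
    }

    // Strings and bytes are both a long length followed by that many bytes.
    private static byte[] readBytes(DataInputStream in) throws IOException {
        long length = readLong(in);
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IOException("Invalid length: " + length);
        }
        byte[] buffer = new byte[(int) length];
        in.readFully(buffer);
        return buffer;
    }
}
```

With the blob from this thread, the call would be AvroHeaderSchema.readSchemaJson(Channels.newInputStream(blob.reader())). It should only pull as many bytes as the header needs, so memory use does not depend on the file size.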