[Python SDK] Avro read/write & Indexing

Shannon Duncan Tue, 09 Jul 2019 14:22:03 -0700

Being pretty new to Beam and big data, I have been working on
standardizing the input/output formats for our different
Hadoop/Beam/Spark/BigQuery jobs and processes.


What I'm working on is having them all read/write Avro files, which is
actually pretty straightforward. I have basic read/write working.

What I'm looking for, and hoping someone on this list knows, is how to index
an Avro file and search that index quickly, so that only part of the Avro
file has to be opened in Beam.

For example, our pipeline can currently do this with Hadoop and
Sequence Files, since they store <K,V> pairs with byte offsets.

So given a key, I'd like to pull only that key's record from the Avro file,
reducing I/O and network costs.
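
A minimal sketch of the key-to-byte-offset lookup described above, using
plain Python file I/O rather than Avro or Beam. The length-prefixed record
layout and the names write_records and read_record are hypothetical, chosen
only to illustrate the idea; this is not how Avro or Sequence Files lay out
their data.

```python
import io
import struct

# Hypothetical record layout: 4-byte key length, 4-byte value length,
# then the key bytes and the value bytes.
HEADER = struct.Struct(">II")


def write_records(stream, records):
    """Write (key, value) byte pairs; return a {key: byte offset} index."""
    index = {}
    for key, value in records:
        index[key] = stream.tell()  # offset where this record starts
        stream.write(HEADER.pack(len(key), len(value)))
        stream.write(key)
        stream.write(value)
    return index


def read_record(stream, index, key):
    """Seek straight to one record instead of scanning the whole stream."""
    stream.seek(index[key])
    key_len, value_len = HEADER.unpack(stream.read(HEADER.size))
    stored_key = stream.read(key_len)
    value = stream.read(value_len)
    return stored_key, value


# An in-memory buffer stands in for a file on disk or in a bucket.
buf = io.BytesIO()
index = write_records(
    buf, [(b"a", b"alpha"), (b"b", b"bravo"), (b"c", b"charlie")]
)
record = read_record(buf, index, b"b")
```

With an Avro container file, the offsets would presumably have to point at
block boundaries rather than at individual records, since records are grouped
into blocks separated by sync markers.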

Any ideas, thoughts, suggestions?

Thanks!
Shannon
