Being pretty new to Beam and big data, I have been working on standardizing input/output for our different Hadoop/Beam/Spark/BigQuery jobs and processes.
What I'm working on is having them all read and write Avro files, which is actually pretty straightforward, and I have the basic read/write working. What I'm looking for, and hoping someone on this list knows, is how to index an Avro file and search that index quickly, so that Beam only has to open part of the file.

For example, our current pipeline can do this with Hadoop and SequenceFiles, since they store <K,V> pairs along with byte offsets. Given a key, I'd like to pull only that key's record from the Avro file, reducing IO / network costs.

Any ideas, thoughts, suggestions?

Thanks!
Shannon
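
P.S. To make the goal concrete, here is a rough sketch of the kind of lookup I mean. It is a plain-Python toy, not Avro or Beam code: the length-prefixed record layout and the names write_records, read_at and lookup are all made up for illustration. It only shows the idea of keeping a key -> byte offset index next to the data and seeking straight to one record instead of scanning the whole file.

```python
import io
import struct

# Hypothetical record layout (NOT the Avro container format):
# 4-byte key length, 4-byte value length, then the key bytes and value bytes.
HEADER = struct.Struct(">II")


def write_records(stream, records):
    """Write (key, value) byte pairs and return {key: byte offset of its record}."""
    index = {}
    for key, value in records:
        index[key] = stream.tell()  # remember where this record starts
        stream.write(HEADER.pack(len(key), len(value)))
        stream.write(key)
        stream.write(value)
    return index


def read_at(stream, offset):
    """Seek to a known offset and read exactly one record."""
    stream.seek(offset)
    key_len, value_len = HEADER.unpack(stream.read(HEADER.size))
    key = stream.read(key_len)
    value = stream.read(value_len)
    return key, value


def lookup(stream, index, key):
    """Return the value for key using the index, or None if the key is absent."""
    offset = index.get(key)
    if offset is None:
        return None
    found_key, value = read_at(stream, offset)
    # Guard against a stale index pointing at the wrong record.
    return value if found_key == key else None


# An in-memory buffer stands in for the data file here.
buf = io.BytesIO()
index = write_records(
    buf,
    [(b"a", b"apple"), (b"b", b"banana"), (b"c", b"cherry")],
)
result = lookup(buf, index, b"b")
```

Here lookup touches only the bytes of the one record it needs. What I am after is the equivalent for Avro files read from a Beam pipeline, however the index itself ends up being stored.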
