I'm not finding a reference to `DataFileReader` in the Python SDK. Do you have a link to the source code or API documentation for it?

On Wed, Jul 10, 2019 at 5:46 AM Reza Rokni <[email protected]> wrote:

> Hi,
>
> I have not tried this (and don't have a chance to test it at the moment),
> so apologies if it's incorrect, but could you use something like
> the DataFileReader within a DoFn to get access to your key? It looks like
> it has seek / sync methods that might work for this. Assuming of course
> that the data for the key is small enough to not need to be parallelized on
> the read.
>
> Cheers
> Reza
>
> On Tue, 9 Jul 2019 at 23:52, Lukasz Cwik <[email protected]> wrote:
>
>> Typically this would be done by reading in the contents of the entire
>> file into a map side input and then consuming that side input within a DoFn.
>>
>> Unfortunately, only Dataflow supports really large side inputs with an
>> efficient access pattern and only when using Beam Java for bounded
>> pipelines. Support for really large side inputs for Beam Python bounded
>> pipelines on Dataflow is coming but not yet available.
>>
>> Otherwise, you could still read the Avro files and still create a map and
>> store the index as a side input and as long as the index fits in memory,
>> this would work well across all runners.
>>
>> The programming guide[1] has a basic example on how to get started using
>> side inputs.
>>
>> 1: https://beam.apache.org/documentation/programming-guide/#side-inputs
>>
>> On Tue, Jul 9, 2019 at 2:21 PM Shannon Duncan <[email protected]>
>> wrote:
>>
>>> So being pretty new to beam and big data I have been working on
>>> standardizing some input output items for different
>>> hadoop/beam/spark/bigquery jobs and processes.
>>>
>>> So what I'm working on is having them all read/write Avro files which is
>>> actually pretty straightforward. So basic read/write I have down.
>>>
>>> What I'm looking for and hoping someone on this list knows, is how to
>>> index an Avro file and be able to search quickly through that index to only
>>> open part of an Avro file in beam.
>>>
>>> For example currently our pipeline is able to do this with Hadoop and
>>> Sequence Files since they store <K,V> with byte offsets.
>>>
>>> So given a key I'd like to only pull that key from the Avro file
>>> reducing IO / Network costs.
>>>
>>> Any ideas, thoughts, suggestions?
>>>
>>> Thanks!
>>> Shannon
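
To make the index idea from the thread concrete, here is a rough sketch of a key-to-byte-offset lookup in plain Python. It is only an illustration under loud assumptions: the length-prefixed record layout and the helper names (`write_records`, `read_record_at`, `lookup`) are made up for this example and are not Avro's container format or any Beam or Avro API. With real Avro files the index would presumably need to point at block boundaries rather than at individual records.

```python
import io
import struct


def write_records(stream, records):
    """Write (key, value) string pairs and return a {key: byte offset} index.

    Hypothetical layout (NOT Avro): each string is stored as a 4-byte
    big-endian length followed by its UTF-8 bytes.
    """
    index = {}
    for key, value in records:
        # Remember where this record starts before writing it.
        index[key] = stream.tell()
        for blob in (key.encode("utf-8"), value.encode("utf-8")):
            stream.write(struct.pack(">I", len(blob)))
            stream.write(blob)
    return index


def _read_blob(stream):
    """Read one length-prefixed blob from the current position."""
    (length,) = struct.unpack(">I", stream.read(4))
    return stream.read(length)


def read_record_at(stream, offset):
    """Seek to a known offset and decode only the record stored there."""
    stream.seek(offset)
    key = _read_blob(stream).decode("utf-8")
    value = _read_blob(stream).decode("utf-8")
    return key, value


def lookup(stream, index, key):
    """Return the value for key using the index, or None if the key is absent."""
    offset = index.get(key)
    if offset is None:
        return None
    _, value = read_record_at(stream, offset)
    return value


if __name__ == "__main__":
    buf = io.BytesIO()
    idx = write_records(buf, [("a", "alpha"), ("b", "beta"), ("c", "gamma")])
    print(lookup(buf, idx, "b"))
```

The `index` dict here plays the role of the in-memory map Lukasz describes: as long as it fits in memory, it could be handed to a DoFn as a side input, and the DoFn would then seek to the stored offset instead of scanning the whole file.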
