Re: [Python SDK] Avro read/write & Indexing

Shannon Duncan Wed, 10 Jul 2019 06:58:39 -0700

I'm not finding the reference to `DataFileReader` in the Python SDK. Do you
have a link to the source code or api documentation for that?
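
For anyone reading along, here is a minimal sketch of the lookup pattern discussed further down the thread: keep a key-to-byte-offset index in memory, then seek straight to a single record instead of reading the whole file. It deliberately uses a made-up length-prefixed record layout and only the Python standard library. It is not Avro and does not use `DataFileReader` or any Beam API, and the helper names (`write_records`, `read_record_at`, `lookup`) are hypothetical. Avro object container files are organized into blocks separated by sync markers, so an index for real Avro data would presumably point at block positions rather than at individual records.

```python
import io
import struct


def write_records(stream, records):
    """Write (key, value) byte pairs as length-prefixed records.

    Returns a dict mapping each key to the byte offset where its record starts.
    The record layout here is invented for illustration; it is not Avro.
    """
    index = {}
    for key, value in records:
        index[key] = stream.tell()
        # Two unsigned 32-bit big-endian lengths, then the raw key and value bytes.
        stream.write(struct.pack(">II", len(key), len(value)))
        stream.write(key)
        stream.write(value)
    return index


def read_record_at(stream, offset):
    """Seek to one record and return its (key, value) without scanning the rest."""
    stream.seek(offset)
    key_len, value_len = struct.unpack(">II", stream.read(8))
    key = stream.read(key_len)
    value = stream.read(value_len)
    return key, value


def lookup(stream, index, key):
    """Return the value stored for `key`, or None if the key is not indexed."""
    offset = index.get(key)
    if offset is None:
        return None
    _, value = read_record_at(stream, offset)
    return value


# An in-memory buffer stands in for a file opened in binary mode.
buffer = io.BytesIO()
index = write_records(
    buffer, [(b"alice", b"1"), (b"bob", b"22"), (b"carol", b"333")]
)
print(lookup(buffer, index, b"bob"))  # prints b'22'
```

The index is a plain dict, so this only works while the index itself fits in memory, which is the same constraint mentioned for the side-input approach below.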


On Wed, Jul 10, 2019 at 5:46 AM Reza Rokni <[email protected]> wrote:

> Hi,
>
> I have not tried this (and don't have a chance to test it at the moment),
> so apologies if it's incorrect, but could you use something like
> the DataFileReader within a DoFn to get access to your key? It looks like
> it has seek / sync methods that might work for this. Assuming of course
> that the data for the key is small enough to not need to be parallelized on
> the read.
>
> Cheers
> Reza
>
>
>
> On Tue, 9 Jul 2019 at 23:52, Lukasz Cwik <[email protected]> wrote:
>
>> Typically this would be done by reading in the contents of the entire
>> file into a map side input and then consuming that side input within a DoFn.
>>
>> Unfortunately, only Dataflow supports really large side inputs with an
>> efficient access pattern and only when using Beam Java for bounded
>> pipelines. Support for really large side inputs for Beam Python bounded
>> pipelines on Dataflow is coming but not yet available.
>>
>> Otherwise, you could still read the Avro files, build a map from them,
>> and pass that index as a side input. As long as the index fits in memory,
>> this would work well across all runners.
>>
>> The programming guide[1] has a basic example on how to get started using
>> side inputs.
>>
>> 1: https://beam.apache.org/documentation/programming-guide/#side-inputs
>>
>>
>> On Tue, Jul 9, 2019 at 2:21 PM Shannon Duncan <[email protected]>
>> wrote:
>>
>>> So being pretty new to beam and big data I have been working on
>>> standardizing some input output items for different
>>> hadoop/beam/spark/bigquery jobs and processes.
>>>
>>> So what I'm working on is having them all read/write Avro files, which is
>>> actually pretty straightforward. So basic read/write I have down.
>>>
>>> What I'm looking for, and hoping someone on this list knows, is how to
>>> index an Avro file and search quickly through that index so that I only
>>> open part of an Avro file in Beam.
>>>
>>> For example, currently our pipeline is able to do this with Hadoop and
>>> Sequence Files since they store <K,V> with byte offset.
>>>
>>> So given a key I'd like to only pull that key from the Avro file
>>> reducing IO / Network costs.
>>>
>>> Any ideas, thoughts, suggestions?
>>>
>>> Thanks!
>>> Shannon
>>>
>>
>
