(Adding dev@ and Solomon Duskis to the discussion)

I was not aware of these, thanks for sharing, David. It would definitely
be a great addition if we could have them donated as an extension on the
Beam side. We could even evolve them in the future to be more FileIO-like.
Any chance this could happen? Maybe Solomon and his team?



On Tue, Jul 2, 2019 at 9:39 AM David Morávek <d...@apache.org> wrote:
>
> Hi, you can use SequenceFileSink and SequenceFileSource from the Cloud
> Bigtable client. Those work nicely with FileIO.
>
> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>
> It would be really cool to move these into Beam, but it's up to the
> Googlers to decide whether they want to donate them.
>
> D.
>
> On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan <joseph.dun...@liveramp.com> 
> wrote:
>>
>> It's not outside the realm of possibility. For now I've created an
>> intermediate step: a Hadoop job that converts the sequence files to text files.
>>
>> Looking into better options.
>>
>> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <chamik...@google.com> wrote:
>>>
>>> The Java SDK has HadoopFormatIO, which you should be able to use to read
>>> sequence files:
>>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>> I don't think there's a direct alternative for this in Python.
>>>
>>> Is it possible to write to a well-known format such as Avro instead of a
>>> Hadoop-specific format? That would allow you to read the data from both
>>> Dataproc/Hadoop and the Beam Python SDK.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <joseph.dun...@liveramp.com> 
>>> wrote:
>>>>
>>>> That's a pretty big gap to be missing a source/sink when transitioning
>>>> from Dataproc to Dataflow using GCS as a storage buffer instead of
>>>> traditional HDFS.
>>>>
>>>> From what I've been able to tell from the source code and documentation,
>>>> Java can do this but Python can't?
>>>>
>>>> Thanks,
>>>> Shannon
>>>>
>>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <chamik...@google.com> 
>>>> wrote:
>>>>>
>>>>> I don't think we have a source/sink for reading Hadoop sequence files.
>>>>> Your best bet currently will probably be to use the FileSystems
>>>>> abstraction to open the files from a ParDo and read them directly using
>>>>> a library that can read sequence files.
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan 
>>>>> <joseph.dun...@liveramp.com> wrote:
>>>>>>
>>>>>> I want to read a Hadoop Sequence/Map file stored on Google Cloud
>>>>>> Storage at "gs://bucket/link/SequenceFile-*" using the Python SDK.
>>>>>>
>>>>>> I cannot locate any good adapters for this, and the one Hadoop
>>>>>> filesystem reader seems to read only from "hdfs://" URLs.
>>>>>>
>>>>>> I want to use Dataflow and GCS exclusively so we can start mixing Beam
>>>>>> pipelines in with our current Hadoop pipelines.
>>>>>>
>>>>>> Is this supported, or will it be supported in the future?
>>>>>> Does anyone have a good, performant suggestion in the meantime?
>>>>>>
>>>>>> I'd also like to be able to write back out to a SequenceFile if possible.
>>>>>>
>>>>>> Thanks!
>>>>>>