That's great. I can help whenever you need. We just need to choose its
destination. Both the `hadoop-format` and `hadoop-file-system` modules
are good candidates, though I would be inclined to put it in its own
module, `sdks/java/extensions/sequencefile`, to make it easier for end
users to discover.

One thing to consider is the SeekableByteChannel adapters; we can move
those into hadoop-common if needed and refactor the modules to share
code. It's worth taking a look at
org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel
to see if some of it could be useful.
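
For illustration, here is a minimal sketch of what such a read-only
adapter could look like, assuming we wrap Hadoop's FSDataInputStream
and already know the file length (the class name is mine, not the
existing Beam class referenced above):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.NonWritableChannelException;
    import java.nio.channels.SeekableByteChannel;
    import org.apache.hadoop.fs.FSDataInputStream;

    /** Read-only SeekableByteChannel over a Hadoop input stream (sketch). */
    class HadoopSeekableByteChannelSketch implements SeekableByteChannel {
      private final FSDataInputStream in;
      private final long size;
      private boolean open = true;

      HadoopSeekableByteChannelSketch(FSDataInputStream in, long size) {
        this.in = in;
        this.size = size;
      }

      @Override
      public int read(ByteBuffer dst) throws IOException {
        // FSDataInputStream supports ByteBuffer reads on most filesystems
        // (it implements ByteBufferReadable).
        return in.read(dst);
      }

      @Override
      public long position() throws IOException {
        return in.getPos();
      }

      @Override
      public SeekableByteChannel position(long newPosition) throws IOException {
        in.seek(newPosition);
        return this;
      }

      @Override
      public long size() {
        return size;
      }

      @Override
      public int write(ByteBuffer src) {
        throw new NonWritableChannelException(); // read-only channel
      }

      @Override
      public SeekableByteChannel truncate(long newSize) {
        throw new NonWritableChannelException(); // read-only channel
      }

      @Override
      public boolean isOpen() {
        return open;
      }

      @Override
      public void close() throws IOException {
        open = false;
        in.close();
      }
    }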

On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein <igorbernst...@google.com> wrote:
>
> Hi all,
>
> I wrote those classes with the intention of upstreaming them to Beam. I can 
> try to make some time this quarter to clean them up. I would need a bit of 
> guidance from a Beam expert on how to make them coexist with HadoopFormatIO, 
> though.
>
>
> On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis <sdus...@google.com> wrote:
>>
>> +Igor Bernstein who wrote the Cloud Bigtable Sequence File classes.
>>
>> Solomon Duskis | Google Cloud clients | sdus...@google.com | 914-462-0531
>>
>>
>> On Tue, Jul 2, 2019 at 4:57 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>
>>> (Adding dev@ and Solomon Duskis to the discussion)
>>>
>>> I was not aware of these, thanks for sharing David. It would
>>> definitely be a great addition if we could have those donated as an
>>> extension on the Beam side. We could even evolve them in the future to
>>> be more FileIO-like. Any chance this can happen? Maybe Solomon and his team?
>>>
>>>
>>>
>>> On Tue, Jul 2, 2019 at 9:39 AM David Morávek <d...@apache.org> wrote:
>>> >
>>> > Hi, you can use SequenceFileSink and SequenceFileSource from the Cloud 
>>> > Bigtable client. Those work nicely with FileIO.
>>> >
>>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
>>> > https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
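>>> >
>>> > For example, something along these lines should work with the Java SDK,
>>> > given a Pipeline p (the SequenceFileSource constructor arguments below
>>> > are my guess from the class names, so please check the linked source
>>> > for the real signature):
>>> >
>>> >     // Read Text/Text sequence files into a PCollection (sketch).
>>> >     SequenceFileSource<Text, Text> source =
>>> >         new SequenceFileSource<>(
>>> >             StaticValueProvider.of("gs://bucket/link/SequenceFile-*"),
>>> >             Text.class, WritableCoder.of(Text.class),
>>> >             Text.class, WritableCoder.of(Text.class),
>>> >             SequenceFile.SYNC_INTERVAL);
>>> >     PCollection<KV<Text, Text>> records = p.apply(Read.from(source));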
>>> >
>>> > It would be really cool to move these into Beam, but it's up to the 
>>> > Googlers to decide whether they want to donate this.
>>> >
>>> > D.
>>> >
>>> > On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan 
>>> > <joseph.dun...@liveramp.com> wrote:
>>> >>
>>> >> It's not outside the realm of possibility. For now I've created an 
>>> >> intermediate step: a Hadoop job that converts the sequence files to 
>>> >> text files.
>>> >>
>>> >> Looking into better options.
>>> >>
>>> >> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <chamik...@google.com> 
>>> >> wrote:
>>> >>>
>>> >>> The Java SDK has HadoopFormatIO, which you should be able to use to 
>>> >>> read sequence files: 
>>> >>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>> >>> I don't think there's a direct alternative for this in Python.
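>>> >>>
>>> >>> For example, a sketch given a Pipeline p (reading gs:// paths also
>>> >>> needs the GCS Hadoop connector on the classpath):
>>> >>>
>>> >>>     // Configure Hadoop's SequenceFileInputFormat and read KVs.
>>> >>>     Configuration conf = new Configuration();
>>> >>>     conf.set("mapreduce.job.inputformat.class",
>>> >>>         SequenceFileInputFormat.class.getName());
>>> >>>     conf.set("key.class", Text.class.getName());
>>> >>>     conf.set("value.class", Text.class.getName());
>>> >>>     conf.set(FileInputFormat.INPUT_DIR,
>>> >>>         "gs://bucket/link/SequenceFile-*");
>>> >>>     PCollection<KV<Text, Text>> records =
>>> >>>         p.apply(HadoopFormatIO.<Text, Text>read().withConfiguration(conf));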
>>> >>>
>>> >>> Is it possible to write to a well-known format such as Avro instead of 
>>> >>> a Hadoop-specific format? That would allow you to read the data from 
>>> >>> both Dataproc/Hadoop and the Beam Python SDK.
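>>> >>>
>>> >>> For instance, on the Java side a sketch of writing Avro that the
>>> >>> Python SDK can then read back with apache_beam.io.ReadFromAvro
>>> >>> (records, schema, and the output path are placeholders):
>>> >>>
>>> >>>     // Write GenericRecords as Avro files on GCS.
>>> >>>     records.apply(AvroIO.writeGenericRecords(schema)
>>> >>>         .to("gs://bucket/output/part")
>>> >>>         .withSuffix(".avro"));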
>>> >>>
>>> >>> Thanks,
>>> >>> Cham
>>> >>>
>>> >>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan 
>>> >>> <joseph.dun...@liveramp.com> wrote:
>>> >>>>
>>> >>>> That's a pretty big hole for a missing source/sink when looking at 
>>> >>>> transitioning from Dataproc to Dataflow, using GCS as a storage 
>>> >>>> buffer instead of a traditional HDFS.
>>> >>>>
>>> >>>> From what I've been able to tell from the source code and 
>>> >>>> documentation, Java is able to read them but Python is not?
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Shannon
>>> >>>>
>>> >>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath 
>>> >>>> <chamik...@google.com> wrote:
>>> >>>>>
>>> >>>>> I don't think we have a source/sink for reading Hadoop sequence 
>>> >>>>> files. Your best bet currently will probably be to use the FileSystem 
>>> >>>>> abstraction to open the file from a ParDo and read it directly 
>>> >>>>> using a library that can read sequence files.
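>>> >>>>>
>>> >>>>> In Java terms, just to illustrate the shape of it (the part that
>>> >>>>> actually decodes the sequence file is left out, and a Python DoFn
>>> >>>>> would follow the same pattern):
>>> >>>>>
>>> >>>>>     p.apply(FileIO.match().filepattern("gs://bucket/link/SequenceFile-*"))
>>> >>>>>      .apply(FileIO.readMatches())
>>> >>>>>      .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
>>> >>>>>        @ProcessElement
>>> >>>>>        public void process(@Element FileIO.ReadableFile file,
>>> >>>>>            OutputReceiver<KV<String, String>> out) throws IOException {
>>> >>>>>          // Open the matched file through Beam's FileSystems layer.
>>> >>>>>          try (SeekableByteChannel channel = file.openSeekable()) {
>>> >>>>>            // Hand the channel's bytes to a sequence-file reader
>>> >>>>>            // library here and emit decoded records via out.
>>> >>>>>          }
>>> >>>>>        }
>>> >>>>>      }));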
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Cham
>>> >>>>>
>>> >>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan 
>>> >>>>> <joseph.dun...@liveramp.com> wrote:
>>> >>>>>>
>>> >>>>>> I want to read Hadoop Sequence/Map files stored on Google Cloud 
>>> >>>>>> Storage, matching "gs://bucket/link/SequenceFile-*", via the 
>>> >>>>>> Python SDK.
>>> >>>>>>
>>> >>>>>> I cannot locate any good adapters for this, and the one Hadoop 
>>> >>>>>> filesystem reader seems to only read from an "hdfs://" URL.
>>> >>>>>>
>>> >>>>>> I want to use Dataflow and GCS exclusively, to start mixing Beam 
>>> >>>>>> pipelines in with our current Hadoop pipelines.
>>> >>>>>>
>>> >>>>>> Is this a feature that is supported, or that will be supported in 
>>> >>>>>> the future?
>>> >>>>>> Does anyone have suggestions for a performant way to do this?
>>> >>>>>>
>>> >>>>>> I'd also like to be able to write back out to a SequenceFile if 
>>> >>>>>> possible.
>>> >>>>>>
>>> >>>>>> Thanks!
>>> >>>>>>
