Re: [Python] Read Hadoop Sequence File?

2019-07-16 Thread Shannon Duncan
I am still having the problem that the local file system (DirectRunner) will
not allow a local glob string to be passed as a file source. I have tried
both relative and fully qualified paths.

I can confirm that the same inputFile glob returns data with a simple cat
command, so I know the glob is good.

Error: "java.io.FileNotFoundException: No files matched spec:
/Users//github//io/sequenceFile/part-*/data"

Any assistance would be greatly appreciated. This is on the Java SDK.

I also tested this with TextIO.read().from(ValueProvider<String>); the result
is still the same.
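
One way to check the glob independently of Beam is to run it through the JDK's own glob matcher. The sketch below is only an illustration under stated assumptions: it uses java.nio's PathMatcher, not Beam's FileSystems.match(), so it shows how standard glob semantics treat this directory layout and says nothing definitive about what the DirectRunner does. The class name GlobCheck and the helper matchGlob are hypothetical.

```java
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class GlobCheck {
    // Returns the regular files under root whose root-relative path matches
    // the glob, as sorted strings with forward slashes.
    static List<String> matchGlob(Path root, String glob) throws IOException {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                    .map(root::relativize)
                    .filter(matcher::matches)
                    .map(p -> p.toString().replace('\\', '/'))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Recreate the layout from this thread in a temporary directory.
        Path root = Files.createTempDirectory("sequenceFile");
        for (String part : new String[] {"part-0001", "part-0002"}) {
            Path dir = Files.createDirectories(root.resolve(part));
            Files.createFile(dir.resolve("index"));
            Files.createFile(dir.resolve("data"));
        }
        // The single star does not cross directory boundaries, so only the
        // second pattern below can reach the data files.
        System.out.println(matchGlob(root, "part-*"));
        System.out.println(matchGlob(root, "part-*/data"));
    }
}
```

With plain glob semantics, part-* can only match the part directories themselves, while part-*/data reaches the files inside them. If the second pattern matches here but not in the pipeline, the difference is more likely in how the runner's local file system expands the pattern than in the pattern itself.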

Thanks,
Shannon

On Fri, Jul 12, 2019 at 2:14 PM Igor Bernstein 
wrote:

> I'm not sure to be honest. The pattern expansion happens in
> FileBasedSource via FileSystems.match(), so it should follow the same
> expansion rules other file based sinks like TextIO. Maybe someone with more
> beam experience can help?
>
> On Fri, Jul 12, 2019 at 2:55 PM Shannon Duncan 
> wrote:
>
>> Clarification on previous message. Only happens on local file system
>> where it is unable to match a pattern string. Via a `gs://` link it
>> is able to do multiple file matching.
>>
>> On Fri, Jul 12, 2019 at 1:36 PM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> Awesome. I got it working for a single file, but for a structure of:
>>>
>>> /part-0001/index
>>> /part-0001/data
>>> /part-0002/index
>>> /part-0002/data
>>>
>>> I tried to do /part-*  and /part-*/data
>>>
>>> It does not find the multipart files. However if I just do
>>> /part-0001/data it will find it and read it.
>>>
>>> Any ideas why?
>>>
>>> I am using this to generate the source:
>>>
>>> static SequenceFileSource<Text, Text> createSource(
>>>     ValueProvider<String> sourcePattern) {
>>>   return new SequenceFileSource<>(
>>>       sourcePattern,
>>>       Text.class,
>>>       WritableSerialization.class,
>>>       Text.class,
>>>       WritableSerialization.class,
>>>       SequenceFile.SYNC_INTERVAL);
>>> }

Re: [Python] Read Hadoop Sequence File?

2019-07-12 Thread Shannon Duncan
Clarification on my previous message: this only happens on the local file
system, where it is unable to match a pattern string. With a `gs://` path it
is able to match multiple files.

On Fri, Jul 12, 2019 at 1:36 PM Shannon Duncan 
wrote:

> Awesome. I got it working for a single file, but for a structure of:
>
> /part-0001/index
> /part-0001/data
> /part-0002/index
> /part-0002/data
>
> I tried to do /part-*  and /part-*/data
>
> It does not find the multipart files. However if I just do /part-0001/data
> it will find it and read it.
>
> Any ideas why?
>
> I am using this to generate the source:
>
> static SequenceFileSource<Text, Text> createSource(
>     ValueProvider<String> sourcePattern) {
>   return new SequenceFileSource<>(
>       sourcePattern,
>       Text.class,
>       WritableSerialization.class,
>       Text.class,
>       WritableSerialization.class,
>       SequenceFile.SYNC_INTERVAL);
> }

Re: [Python] Read Hadoop Sequence File?

2019-07-12 Thread Shannon Duncan
Awesome. I got it working for a single file, but for a structure of:

/part-0001/index
/part-0001/data
/part-0002/index
/part-0002/data

I tried both /part-* and /part-*/data.

Neither finds the multipart files. However, if I just use /part-0001/data
it will find it and read it.

Any ideas why?

I am using this to generate the source:

// Keys and values are both Hadoop Text, handled by WritableSerialization.
static SequenceFileSource<Text, Text> createSource(
    ValueProvider<String> sourcePattern) {
  return new SequenceFileSource<>(
      sourcePattern,
      Text.class,
      WritableSerialization.class,
      Text.class,
      WritableSerialization.class,
      SequenceFile.SYNC_INTERVAL);
}
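
Since a single concrete path is reported to work, one possible stopgap on the local file system is to expand the pattern by hand and build one source per file. The sketch below is a minimal illustration using only the JDK; the class name PartFiles and the helper expand are hypothetical, and it assumes the part-NNNN/data layout shown above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PartFiles {
    // Returns root/<dir>/<leaf> for every entry of root matching dirGlob that
    // contains a regular file named leaf, in sorted order.
    static List<Path> expand(Path root, String dirGlob, String leaf) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(root, dirGlob)) {
            for (Path dir : dirs) {
                Path candidate = dir.resolve(leaf);
                if (Files.isRegularFile(candidate)) {
                    result.add(candidate);
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PartFiles <directory containing the part-* folders>
        for (Path file : expand(Paths.get(args[0]), "part-*", "data")) {
            System.out.println(file);
        }
    }
}
```

Each returned path could then be passed to createSource as a fixed value, for example wrapped in StaticValueProvider.of(path.toString()), and the resulting collections flattened. That trades pattern matching inside the source for a listing done at pipeline construction time, so it only suits inputs that already exist when the pipeline is built.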

On Wed, Jul 10, 2019 at 10:52 AM Igor Bernstein 
wrote:

> It should be fairly straightforward:
> 1. Copy SequenceFileSource.java to your project
> 2. Add the source to your pipeline, configuring it with appropriate
> serializers. See here for an example for HBase Results

Re: [Python] Read Hadoop Sequence File?

2019-07-10 Thread Shannon Duncan
If I wanted to go ahead and include this within a new Java Pipeline, what
would I be looking at for level of work to integrate?

On Wed, Jul 3, 2019 at 3:54 AM Ismaël Mejía  wrote:

> That's great. I can help whenever you need. We just need to choose its
> destination. Both the `hadoop-format` and `hadoop-file-system` modules
> are good candidates, I would even feel inclined to put it in its own
> module `sdks/java/extensions/sequencefile` to make it more easy to
> discover by the final users.
>
> A thing to consider is the SeekableByteChannel adapters, we can move
> that into hadoop-common if needed and refactor the modules to share
> code. Worth to take a look at
>
> org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel
> to see if some of it could be useful.

Re: [Python] Read Hadoop Sequence File?

2019-07-03 Thread Ismaël Mejía
That's great. I can help whenever you need. We just need to choose its
destination. Both the `hadoop-format` and `hadoop-file-system` modules
are good candidates, but I would even be inclined to put it in its own
module, `sdks/java/extensions/sequencefile`, to make it easier for end
users to discover.

One thing to consider is the SeekableByteChannel adapters; we can move
them into hadoop-common if needed and refactor the modules to share
code. It is worth taking a look at
org.apache.beam.sdk.io.hdfs.HadoopFileSystem.HadoopSeekableByteChannel#HadoopSeekableByteChannel
to see if some of it could be useful.

On Tue, Jul 2, 2019 at 11:46 PM Igor Bernstein  wrote:
>
> Hi all,
>
> I wrote those classes with the intention of upstreaming them to Beam. I can 
> try to make some time this quarter to clean them up. I would need a bit of 
> guidance from a beam expert in how to make them coexist with HadoopFormatIO 
> though.
>
>
> On Tue, Jul 2, 2019 at 10:55 AM Solomon Duskis  wrote:
>>
>> +Igor Bernstein who wrote the Cloud Bigtable Sequence File classes.
>>
>> Solomon Duskis | Google Cloud clients | sdus...@google.com | 914-462-0531

Re: [Python] Read Hadoop Sequence File?

2019-07-02 Thread Ismaël Mejía
(Adding dev@ and Solomon Duskis to the discussion)

I was not aware of these; thanks for sharing, David. It would definitely
be a great addition if we could have those donated as an extension on
the Beam side. We can even evolve them in the future to be more
FileIO-like. Any chance this can happen? Maybe Solomon and his team?



On Tue, Jul 2, 2019 at 9:39 AM David Morávek  wrote:
>
> Hi, you can use SequenceFileSink and Source, from a BigTable client. Those 
> works nice with FileIO.
>
> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
> https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java
>
> It would be really cool to move these into Beam, but that's up to Googlers to 
> decide, whether they want to donate this.
>
> D.
>
> On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan  
> wrote:
>>
>> It's not outside the realm of possibilities. For now I've created an 
>> intermediary step of a hadoop job that converts from sequence to text file.
>>
>> Looking into better options.
>>
>> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath  wrote:
>>>
>>> Java SDK has a HadoopInputFormatIO using which you should be able to read 
>>> Sequence files: 
>>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>>> I don't think there's a direct alternative for this for Python.
>>>
>>> Is it possible to write to a well-known format such as Avro instead of a 
>>> Hadoop specific format which will allow you to read from both 
>>> Dataproc/Hadoop and Beam Python SDK ?
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan  
>>> wrote:

>>>> That's a pretty big hole for a missing source/sink when looking at
>>>> transitioning from Dataproc to Dataflow using GCS as storage buffer
>>>> instead of a traditional hdfs.
>>>>
>>>> From what I've been able to tell from source code and documentation, Java
>>>> is able to but not Python?
>>>>
>>>> Thanks,
>>>> Shannon
>>>>
>>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath
>>>> wrote:
>>>>>
>>>>> I don't think we have a source/sink for reading Hadoop sequence files.
>>>>> Your best bet currently will probably be to use FileSystem abstraction to
>>>>> create a file from a ParDo and read directly from there using a library
>>>>> that can read sequence files.
>>>>>
>>>>> Thanks,
>>>>> Cham
>>>>>
>>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan
>>>>> wrote:
>>>>>>
>>>>>> I'm wanting to read a Sequence/Map file from Hadoop stored on Google
>>>>>> Cloud Storage via a " gs://bucket/link/SequenceFile-* " via the Python
>>>>>> SDK.
>>>>>>
>>>>>> I cannot locate any good adapters for this, and the one Hadoop
>>>>>> Filesystem reader seems to only read from a "hdfs://" url.
>>>>>>
>>>>>> I'm wanting to use Dataflow and GCS exclusively to start mixing in Beam
>>>>>> pipelines with our current Hadoop Pipelines.
>>>>>>
>>>>>> Is this a feature that is supported or will be supported in the future?
>>>>>> Does anyone have any good suggestions for this that is performant?
>>>>>>
>>>>>> I'd also like to be able to write back out to a SequenceFile if possible.
>>>>>>
>>>>>> Thanks!
>>>>>>


Re: [Python] Read Hadoop Sequence File?

2019-07-02 Thread David Morávek
Hi, you can use SequenceFileSink and SequenceFileSource from the Bigtable
client. Those work nicely with FileIO.

https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSink.java
https://github.com/googleapis/cloud-bigtable-client/blob/master/bigtable-dataflow-parent/bigtable-beam-import/src/main/java/com/google/cloud/bigtable/beam/sequencefiles/SequenceFileSource.java

It would be really cool to move these into Beam, but it's up to the Googlers
to decide whether they want to donate them.

D.

On Tue, Jul 2, 2019 at 2:07 AM Shannon Duncan 
wrote:

> It's not outside the realm of possibilities. For now I've created an
> intermediary step of a hadoop job that converts from sequence to text file.
>
> Looking into better options.
>
> On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath 
> wrote:
>
>> Java SDK has a HadoopInputFormatIO using which you should be able to read
>> Sequence files:
>> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
>> I don't think there's a direct alternative for this for Python.
>>
>> Is it possible to write to a well-known format such as Avro instead of a
>> Hadoop specific format which will allow you to read from both
>> Dataproc/Hadoop and Beam Python SDK ?
>>
>> Thanks,
>> Cham
>>
>> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan 
>> wrote:
>>
>>> That's a pretty big hole for a missing source/sink when looking at
>>> transitioning from Dataproc to Dataflow using GCS as storage buffer instead
>>> of a traditional hdfs.
>>>
>>> From what I've been able to tell from source code and documentation,
>>> Java is able to but not Python?
>>>
>>> Thanks,
>>> Shannon
>>>
>>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath 
>>> wrote:
>>>
>>>> I don't think we have a source/sink for reading Hadoop sequence files.
>>>> Your best bet currently will probably be to use FileSystem abstraction to
>>>> create a file from a ParDo and read directly from there using a library
>>>> that can read sequence files.
>>>>
>>>> Thanks,
>>>> Cham
>>>>
>>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <
>>>> joseph.dun...@liveramp.com> wrote:
>>>>
>>>>> I'm wanting to read a Sequence/Map file from Hadoop stored on Google
>>>>> Cloud Storage via a " gs://bucket/link/SequenceFile-* " via the Python
>>>>> SDK.
>>>>>
>>>>> I cannot locate any good adapters for this, and the one Hadoop
>>>>> Filesystem reader seems to only read from a "hdfs://" url.
>>>>>
>>>>> I'm wanting to use Dataflow and GCS exclusively to start mixing in
>>>>> Beam pipelines with our current Hadoop Pipelines.
>>>>>
>>>>> Is this a feature that is supported or will be supported in the future?
>>>>> Does anyone have any good suggestions for this that is performant?
>>>>>
>>>>> I'd also like to be able to write back out to a SequenceFile if
>>>>> possible.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>


Re: [Python] Read Hadoop Sequence File?

2019-07-01 Thread Shannon Duncan
It's not outside the realm of possibility. For now I've created an
intermediary step: a Hadoop job that converts from sequence files to text
files.

Looking into better options.
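
For a stopgap conversion step like this, a cheap sanity check is to confirm that an input really is a SequenceFile before handing it to a reader. The sketch below assumes only the documented start of the SequenceFile header, the three bytes "SEQ" followed by one version byte, and deliberately reads nothing beyond that. The class name SequenceFileHeader and the method readVersion are hypothetical, and this does not replace a real reader such as Hadoop's SequenceFile.Reader.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class SequenceFileHeader {
    // Returns the format version byte if the stream starts with the "SEQ"
    // magic bytes, or -1 if it is too short or starts with anything else.
    static int readVersion(InputStream in) throws IOException {
        byte[] header = new byte[4];
        try {
            new DataInputStream(in).readFully(header);
        } catch (EOFException e) {
            return -1; // fewer than four bytes available
        }
        if (header[0] != 'S' || header[1] != 'E' || header[2] != 'Q') {
            return -1;
        }
        return header[3] & 0xFF;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java SequenceFileHeader <path to a local file>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int version = readVersion(in);
            System.out.println(version < 0
                    ? "not a SequenceFile"
                    : "SequenceFile, version " + version);
        }
    }
}
```

The same check takes any InputStream, so it could equally be pointed at a stream opened from a remote file system rather than a local path.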

On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath 
wrote:

> Java SDK has a HadoopInputFormatIO using which you should be able to read
> Sequence files:
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
> I don't think there's a direct alternative for this for Python.
>
> Is it possible to write to a well-known format such as Avro instead of a
> Hadoop specific format which will allow you to read from both
> Dataproc/Hadoop and Beam Python SDK ?
>
> Thanks,
> Cham
>
> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan 
> wrote:
>
>> That's a pretty big hole for a missing source/sink when looking at
>> transitioning from Dataproc to Dataflow using GCS as storage buffer instead
>> of a traditional hdfs.
>>
>> From what I've been able to tell from source code and documentation, Java
>> is able to but not Python?
>>
>> Thanks,
>> Shannon
>>
>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath 
>> wrote:
>>
>>> I don't think we have a source/sink for reading Hadoop sequence files.
>>> Your best bet currently will probably be to use FileSystem abstraction to
>>> create a file from a ParDo and read directly from there using a library
>>> that can read sequence files.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <
>>> joseph.dun...@liveramp.com> wrote:
>>>
>>>> I'm wanting to read a Sequence/Map file from Hadoop stored on Google
>>>> Cloud Storage via a " gs://bucket/link/SequenceFile-* " via the Python SDK.
>>>>
>>>> I cannot locate any good adapters for this, and the one Hadoop
>>>> Filesystem reader seems to only read from a "hdfs://" url.
>>>>
>>>> I'm wanting to use Dataflow and GCS exclusively to start mixing in Beam
>>>> pipelines with our current Hadoop Pipelines.
>>>>
>>>> Is this a feature that is supported or will be supported in the future?
>>>> Does anyone have any good suggestions for this that is performant?
>>>>
>>>> I'd also like to be able to write back out to a SequenceFile if
>>>> possible.
>>>>
>>>> Thanks!




Re: [Python] Read Hadoop Sequence File?

2019-07-01 Thread Chamikara Jayalath
The Java SDK has a HadoopInputFormatIO, which you should be able to use to
read sequence files:
https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
I don't think there's a direct alternative for this for Python.

Is it possible to write to a well-known format such as Avro instead of a
Hadoop-specific format? That would allow you to read from both
Dataproc/Hadoop and the Beam Python SDK.

Thanks,
Cham

On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan 
wrote:

> That's a pretty big hole for a missing source/sink when looking at
> transitioning from Dataproc to Dataflow using GCS as storage buffer instead
> of a traditional hdfs.
>
> From what I've been able to tell from source code and documentation, Java
> is able to but not Python?
>
> Thanks,
> Shannon
>
> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath 
> wrote:
>
>> I don't think we have a source/sink for reading Hadoop sequence files.
>> Your best bet currently will probably be to use FileSystem abstraction to
>> create a file from a ParDo and read directly from there using a library
>> that can read sequence files.
>>
>> Thanks,
>> Cham
>>
>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan 
>> wrote:
>>
>>> I'm wanting to read a Sequence/Map file from Hadoop stored on Google
>>> Cloud Storage via a " gs://bucket/link/SequenceFile-* " via the Python SDK.
>>>
>>> I cannot locate any good adapters for this, and the one Hadoop
>>> Filesystem reader seems to only read from a "hdfs://" url.
>>>
>>> I'm wanting to use Dataflow and GCS exclusively to start mixing in Beam
>>> pipelines with our current Hadoop Pipelines.
>>>
>>> Is this a feature that is supported or will be supported in the future?
>>> Does anyone have any good suggestions for this that is performant?
>>>
>>> I'd also like to be able to write back out to a SequenceFile if possible.
>>>
>>> Thanks!
>>>
>>>


Re: [Python] Read Hadoop Sequence File?

2019-07-01 Thread Shannon Duncan
That's a pretty big hole for a missing source/sink when looking at
transitioning from Dataproc to Dataflow using GCS as storage buffer instead
of a traditional hdfs.

From what I've been able to tell from source code and documentation, Java
is able to but not Python?

Thanks,
Shannon

On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath 
wrote:

> I don't think we have a source/sink for reading Hadoop sequence files.
> Your best bet currently will probably be to use FileSystem abstraction to
> create a file from a ParDo and read directly from there using a library
> that can read sequence files.
>
> Thanks,
> Cham
>
> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan 
> wrote:
>
>> I'm wanting to read a Sequence/Map file from Hadoop stored on Google
>> Cloud Storage via a " gs://bucket/link/SequenceFile-* " via the Python SDK.
>>
>> I cannot locate any good adapters for this, and the one Hadoop Filesystem
>> reader seems to only read from a "hdfs://" url.
>>
>> I'm wanting to use Dataflow and GCS exclusively to start mixing in Beam
>> pipelines with our current Hadoop Pipelines.
>>
>> Is this a feature that is supported or will be supported in the future?
>> Does anyone have any good suggestions for this that is performant?
>>
>> I'd also like to be able to write back out to a SequenceFile if possible.
>>
>> Thanks!
>>
>>


Re: [Python] Read Hadoop Sequence File?

2019-07-01 Thread Chamikara Jayalath
I don't think we have a source/sink for reading Hadoop sequence files. Your
best bet currently will probably be to use the FileSystem abstraction to open
the file from within a ParDo and read directly from it using a library that
can read sequence files.

Thanks,
Cham

On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan 
wrote:

> I'm wanting to read a Sequence/Map file from Hadoop stored on Google Cloud
> Storage via a " gs://bucket/link/SequenceFile-* " via the Python SDK.
>
> I cannot locate any good adapters for this, and the one Hadoop Filesystem
> reader seems to only read from a "hdfs://" url.
>
> I'm wanting to use Dataflow and GCS exclusively to start mixing in Beam
> pipelines with our current Hadoop Pipelines.
>
> Is this a feature that is supported or will be supported in the future?
> Does anyone have any good suggestions for this that is performant?
>
> I'd also like to be able to write back out to a SequenceFile if possible.
>
> Thanks!
>
>


[Python] Read Hadoop Sequence File?

2019-07-01 Thread Shannon Duncan
I want to read a Hadoop Sequence/Map file stored on Google Cloud Storage,
matched by a pattern such as "gs://bucket/link/SequenceFile-*", using the
Python SDK.

I cannot locate any good adapters for this, and the one Hadoop Filesystem
reader seems to only read from a "hdfs://" url.

I'm wanting to use Dataflow and GCS exclusively to start mixing in Beam
pipelines with our current Hadoop Pipelines.

Is this a feature that is supported or will be supported in the future?
Does anyone have any good suggestions for this that are performant?

I'd also like to be able to write back out to a SequenceFile if possible.

Thanks!