It's not outside the realm of possibility. For now I've added an
intermediate Hadoop job that converts the sequence files to text files.

Looking into better options.
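
One option, following the FileSystems-in-a-ParDo suggestion quoted further
down, might look roughly like the sketch below. It is only a sketch under
assumptions: read_seqfile_version and build_pipeline are hypothetical names,
and the header check (a SequenceFile starts with the bytes "SEQ" followed by
a version byte) merely stands in for a real sequence-file parsing library,
which would take the same file-like object. FileSystems.match and
FileSystems.open are the Beam Python calls that resolve "gs://" paths.

```python
import io

SEQ_MAGIC = b"SEQ"  # every Hadoop SequenceFile starts with these three bytes


def read_seqfile_version(stream):
    """Return the SequenceFile version byte from a binary file-like object.

    Hypothetical helper: a real reader library would parse the records too.
    """
    header = stream.read(4)
    if len(header) < 4 or header[:3] != SEQ_MAGIC:
        raise ValueError("not a Hadoop SequenceFile")
    return header[3]


def build_pipeline(pattern, pipeline_args=None):
    """Build a pipeline that opens each file matching `pattern` in a worker.

    Beam is imported lazily so the helper above works without it installed.
    """
    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    def expand(glob):
        # Resolve the glob (e.g. a gs:// pattern) into concrete file paths.
        for match in FileSystems.match([glob]):
            for metadata in match.metadata_list:
                yield metadata.path

    def read_header(path):
        # Open the file through Beam's filesystem layer and inspect it.
        stream = FileSystems.open(path)
        try:
            return path, read_seqfile_version(stream)
        finally:
            stream.close()

    pipeline = beam.Pipeline(argv=pipeline_args)
    _ = (
        pipeline
        | "Pattern" >> beam.Create([pattern])
        | "Expand" >> beam.FlatMap(expand)
        | "ReadHeader" >> beam.Map(read_header)
        | "Print" >> beam.Map(print)
    )
    return pipeline
```

Something like build_pipeline("gs://bucket/link/SequenceFile-*").run() would
then exercise it. Each file is read by a single worker, so this does not
split large files the way a real source would.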

On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <[email protected]>
wrote:

> The Java SDK has a HadoopFormatIO that you should be able to use to read
> sequence files:
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
> I don't think there's a direct equivalent of this for Python.
>
> Is it possible to write to a well-known format such as Avro instead of a
> Hadoop-specific format? That would allow you to read from both
> Dataproc/Hadoop and the Beam Python SDK.
>
> Thanks,
> Cham
>
> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <[email protected]>
> wrote:
>
>> A missing source/sink is a pretty big gap when looking at
>> transitioning from Dataproc to Dataflow using GCS as a storage buffer instead
>> of traditional HDFS.
>>
>> From what I've been able to tell from the source code and documentation, the
>> Java SDK can do this but the Python SDK cannot?
>>
>> Thanks,
>> Shannon
>>
>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <[email protected]>
>> wrote:
>>
>>> I don't think we have a source/sink for reading Hadoop sequence files.
>>> Your best bet currently is probably to use the FileSystems abstraction to
>>> open a file from a ParDo and read directly from it using a library
>>> that can read sequence files.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <
>>> [email protected]> wrote:
>>>
>>>> I want to read a Sequence/Map file from Hadoop, stored on Google
>>>> Cloud Storage at a path like "gs://bucket/link/SequenceFile-*", via the
>>>> Python SDK.
>>>>
>>>> I cannot locate any good adapters for this, and the one Hadoop
>>>> filesystem reader seems to read only from an "hdfs://" URL.
>>>>
>>>> I want to use Dataflow and GCS exclusively to start mixing Beam
>>>> pipelines in with our current Hadoop pipelines.
>>>>
>>>> Is this a feature that is supported or will be supported in the future?
>>>> Does anyone have any good, performant suggestions for this?
>>>>
>>>> I'd also like to be able to write back out to a SequenceFile if
>>>> possible.
>>>>
>>>> Thanks!
>>>>
>>>>
