That's a pretty big hole to have a missing source/sink for when looking at transitioning from Dataproc to Dataflow, using GCS as a storage buffer instead of a traditional HDFS.

From what I've been able to tell from the source code and documentation, the Java SDK can do this but the Python SDK cannot?

Thanks,
Shannon

On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <[email protected]> wrote:

> I don't think we have a source/sink for reading Hadoop sequence files.
> Your best bet currently will probably be to use FileSystem abstraction to
> create a file from a ParDo and read directly from there using a library
> that can read sequence files.
>
> Thanks,
> Cham
>
> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <[email protected]>
> wrote:
>
>> I'm wanting to read a Sequence/Map file from Hadoop stored on Google
>> Cloud Storage via a " gs://bucket/link/SequenceFile-* " via the Python SDK.
>>
>> I cannot locate any good adapters for this, and the one Hadoop Filesystem
>> reader seems to only read from a "hdfs://" url.
>>
>> I'm wanting to use Dataflow and GCS exclusively to start mixing in Beam
>> pipelines with our current Hadoop Pipelines.
>>
>> Is this a feature that is supported or will be supported in the future?
>> Does anyone have any good suggestions for this that is performant?
>>
>> I'd also like to be able to write back out to a SequenceFile if possible.
>>
>> Thanks!
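
To make Cham's suggestion a bit more concrete, here is a rough sketch of the "read it yourself from a file handle" part. It is not an existing Beam API: the names `read_sequence_file_header` and `_read_text` are made up for illustration, and the code only parses the fixed start of a version 6 SequenceFile header (magic bytes, version, key/value class names, the two compression flags). It does not read the codec name, metadata, sync marker, or any records, and it assumes class names shorter than 128 bytes so the length fits in a single vint byte.

```python
import io
import struct


def _read_text(stream):
    """Read a Hadoop Text string: a vint byte length, then UTF-8 bytes.

    Simplification for this sketch: only single-byte vints (0..127) are
    handled, which is enough for typical Writable class names.
    """
    (length,) = struct.unpack("b", stream.read(1))
    if length < 0:
        raise ValueError("multi-byte vint lengths are not handled in this sketch")
    return stream.read(length).decode("utf-8")


def read_sequence_file_header(stream):
    """Parse the leading fields of a SequenceFile header from a binary stream.

    `stream` can be any binary file-like object with a read() method.
    """
    if stream.read(3) != b"SEQ":
        raise ValueError("not a SequenceFile (missing SEQ magic)")
    version = stream.read(1)[0]
    if version != 6:
        # Older header layouts differ; this sketch does not attempt them.
        raise ValueError("only version 6 headers are handled, got %d" % version)
    key_class = _read_text(stream)
    value_class = _read_text(stream)
    # Both flags are written as single-byte booleans.
    compressed = stream.read(1) != b"\x00"
    block_compressed = stream.read(1) != b"\x00"
    return {
        "version": version,
        "key_class": key_class,
        "value_class": value_class,
        "compressed": compressed,
        "block_compressed": block_compressed,
    }
```

Inside a ParDo, the stream would presumably come from `apache_beam.io.filesystems.FileSystems.open(path)` on a `gs://` path, which returns a file-like object. That wiring is left out here, and record parsing (including compressed files) would still need a real SequenceFile library or considerably more code.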
