The Java SDK has a HadoopFormatIO that should be able to read sequence files: https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java I don't think there's a direct alternative to this for Python.
Is it possible to write to a well-known format such as Avro instead of a Hadoop-specific format, which would allow you to read from both Dataproc/Hadoop and the Beam Python SDK?

Thanks,
Cham

On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <[email protected]> wrote:

> That's a pretty big hole for a missing source/sink when looking at
> transitioning from Dataproc to Dataflow using GCS as a storage buffer
> instead of a traditional HDFS.
>
> From what I've been able to tell from the source code and documentation,
> Java is able to but Python is not?
>
> Thanks,
> Shannon
>
> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <[email protected]>
> wrote:
>
>> I don't think we have a source/sink for reading Hadoop sequence files.
>> Your best bet currently will probably be to use the FileSystem abstraction
>> to create a file from a ParDo and read directly from there using a library
>> that can read sequence files.
>>
>> Thanks,
>> Cham
>>
>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <[email protected]>
>> wrote:
>>
>>> I want to read a Sequence/Map file from Hadoop stored on Google Cloud
>>> Storage via a "gs://bucket/link/SequenceFile-*" path with the Python SDK.
>>>
>>> I cannot locate any good adapters for this, and the one Hadoop
>>> filesystem reader seems to only read from an "hdfs://" URL.
>>>
>>> I want to use Dataflow and GCS exclusively to start mixing Beam
>>> pipelines in with our current Hadoop pipelines.
>>>
>>> Is this a feature that is supported or will be supported in the future?
>>> Does anyone have good suggestions for this that are performant?
>>>
>>> I'd also like to be able to write back out to a SequenceFile if possible.
>>>
>>> Thanks!
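The workaround suggested in the thread (open the file from a ParDo via Beam's FileSystem abstraction and parse it yourself) needs a helper that understands the SequenceFile layout. As a rough illustration of what such a helper does, here is a minimal pure-Python sketch of the uncompressed SequenceFile wire format with BytesWritable keys and values. This is not a Beam API: real files may be compressed, carry header metadata, and use a random sync marker, so treat it as a starting point rather than a drop-in reader. In a Beam Python pipeline, the reading side would typically be wrapped in a DoFn that opens the `gs://` path with `apache_beam.io.filesystems.FileSystems.open`.

```python
import struct

SYNC_SIZE = 16
BYTES_WRITABLE = b"org.apache.hadoop.io.BytesWritable"

def _vint_string(s: bytes) -> bytes:
    # Class names are short, so a single-byte vint length suffices here.
    assert len(s) <= 127
    return bytes([len(s)]) + s

def write_sequence_file(records, sync=b"\x00" * SYNC_SIZE) -> bytes:
    """Serialize (key, value) byte pairs as an uncompressed SequenceFile.

    Note: real writers use a random 16-byte sync marker; zeros are used
    here only to keep the sketch deterministic.
    """
    out = bytearray()
    out += b"SEQ\x06"                   # magic bytes + format version 6
    out += _vint_string(BYTES_WRITABLE) # key class name
    out += _vint_string(BYTES_WRITABLE) # value class name
    out += b"\x00\x00"                  # no value compression, no block compression
    out += struct.pack(">i", 0)         # empty metadata
    out += sync                         # 16-byte sync marker
    for key, value in records:
        # BytesWritable serializes as a 4-byte big-endian length + raw bytes.
        k = struct.pack(">i", len(key)) + key
        v = struct.pack(">i", len(value)) + value
        out += struct.pack(">i", len(k) + len(v))  # record length
        out += struct.pack(">i", len(k))           # key length
        out += k + v
    return bytes(out)

def read_sequence_file(buf: bytes):
    """Yield (key, value) byte pairs from an uncompressed SequenceFile."""
    assert buf[:4] == b"SEQ\x06"
    pos = 4
    pos += 1 + buf[pos]                 # skip key class name
    pos += 1 + buf[pos]                 # skip value class name
    pos += 2                            # compression flags (assumed off here)
    (meta_count,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    assert meta_count == 0              # metadata parsing omitted in this sketch
    pos += SYNC_SIZE                    # skip sync marker
    while pos < len(buf):
        (rec_len,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        if rec_len == -1:               # sync escape: skip the 16-byte marker
            pos += SYNC_SIZE
            continue
        (key_len,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        key = buf[pos + 4 : pos + key_len]              # strip length prefix
        value = buf[pos + key_len + 4 : pos + rec_len]  # strip length prefix
        pos += rec_len
        yield key, value
```

A library such as PySpark or a dedicated sequence-file reader would normally handle this parsing; the sketch just shows the record structure such a library deals with.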
