It's not outside the realm of possibility. For now I've added an intermediate Hadoop job that converts the sequence files to text files.
Looking into better options.

On Mon, Jul 1, 2019, 5:50 PM Chamikara Jayalath <[email protected]> wrote:

> Java SDK has a HadoopInputFormatIO using which you should be able to read
> Sequence files:
> https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-format/src/main/java/org/apache/beam/sdk/io/hadoop/format/HadoopFormatIO.java
> I don't think there's a direct alternative for this for Python.
>
> Is it possible to write to a well-known format such as Avro instead of a
> Hadoop specific format which will allow you to read from both
> Dataproc/Hadoop and Beam Python SDK ?
>
> Thanks,
> Cham
>
> On Mon, Jul 1, 2019 at 3:37 PM Shannon Duncan <[email protected]>
> wrote:
>
>> That's a pretty big hole for a missing source/sink when looking at
>> transitioning from Dataproc to Dataflow using GCS as storage buffer instead
>> of a traditional hdfs.
>>
>> From what I've been able to tell from source code and documentation, Java
>> is able to but not Python?
>>
>> Thanks,
>> Shannon
>>
>> On Mon, Jul 1, 2019 at 5:29 PM Chamikara Jayalath <[email protected]>
>> wrote:
>>
>>> I don't think we have a source/sink for reading Hadoop sequence files.
>>> Your best bet currently will probably be to use FileSystem abstraction to
>>> create a file from a ParDo and read directly from there using a library
>>> that can read sequence files.
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Mon, Jul 1, 2019 at 8:42 AM Shannon Duncan <
>>> [email protected]> wrote:
>>>
>>>> I'm wanting to read a Sequence/Map file from Hadoop stored on Google
>>>> Cloud Storage via a " gs://bucket/link/SequenceFile-* " via the Python SDK.
>>>>
>>>> I cannot locate any good adapters for this, and the one Hadoop
>>>> Filesystem reader seems to only read from a "hdfs://" url.
>>>>
>>>> I'm wanting to use Dataflow and GCS exclusively to start mixing in Beam
>>>> pipelines with our current Hadoop Pipelines.
>>>>
>>>> Is this a feature that is supported or will be supported in the future?
>>>> Does anyone have any good suggestions for this that is performant?
>>>>
>>>> I'd also like to be able to write back out to a SequenceFile if
>>>> possible.
>>>>
>>>> Thanks!
