HDFS support in Beam was recently[1] improved to support more than one
cluster.

1:
https://github.com/apache/beam/commit/f1dc92f8ec2d4d78b9b60440f821df43dc374e21
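If I read that commit right, HadoopFileSystemOptions accepts a list of Hadoop configurations and the registrar resolves a path against the matching cluster by the authority (host:port) in the hdfs:// URI. A minimal, untested sketch along those lines — the namenode hostnames and paths are placeholders:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

public class TwoClusterCopy {
  public static void main(String[] args) {
    // One Hadoop Configuration per cluster; fs.defaultFS identifies each one.
    Configuration clusterA = new Configuration();
    clusterA.set("fs.defaultFS", "hdfs://namenode-a:8020");

    Configuration clusterB = new Configuration();
    clusterB.set("fs.defaultFS", "hdfs://namenode-b:8020");

    HadoopFileSystemOptions options =
        PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);
    // Register both configurations; full hdfs:// URIs in the pipeline below
    // should then be routed to the cluster whose authority they name.
    options.setHdfsConfiguration(Arrays.asList(clusterA, clusterB));

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.read().from("hdfs://namenode-a:8020/input/*"))
     .apply(TextIO.write().to("hdfs://namenode-b:8020/output/part"));
    p.run().waitUntilFinish();
  }
}
```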

On Tue, Aug 20, 2019 at 7:56 AM Alexey Romanenko <[email protected]>
wrote:

> Hi all,
>
> I’m looking for a working solution for cases where it’s needed (or even
> required) to use different file system configurations (HDFS, S3, GCS) in
> the same pipeline, where the IO is based on Beam FileSystems (FileIO, TextIO, etc.).
> For example:
> - reading data from one HDFS cluster and writing the results into another
> one, which requires a different configuration;
> - reading objects from one S3 bucket and writing into another one, where we
> need different credentials and/or regions;
> - or even a heterogeneous case, where we need to read data from HDFS and
> write the results into S3, or vice versa.
>
> Usually, in other IOs, we can do this easily by having specific methods,
> like “withConfiguration()”, “withCredentialsProvider()”, etc. for Read and
> Write, but FileSystems based IO could be configured only
> with PipelineOptions afaik. There was a thread about that a while ago [1]
> where Lukasz Cwik said that it’s feasible by using different schemes but,
> unfortunately, I haven’t managed to make it work on my side (neither for
> HDFS nor for S3).
>
> So, any additional input or working solutions would be very welcome if
> someone has any. In the long term, I’d like to document this in detail
> since, I guess, this use case is likely to be in high demand.
>
> [1]
> https://lists.apache.org/thread.html/bb5f98c4154cc72d097ce5b404ff0b3bcb52b7360b0834af7116883b@%3Cdev.beam.apache.org%3E
>
