Beam partitioned file reading and writing

Newport, Billy Thu, 23 Mar 2017 09:47:32 -0700

Is there built-in support for writing partitioned collections? For example:

PCollection<KV<Enum,GenericRecord>> data;


We want to write the GenericRecords in data into different files based on the 
enum. We did this in Flink by making a proxy Hadoop OutputFormat that holds 
the N real output formats; its write method checks the enum and forwards the 
write call for the GenericRecord to the correct output format.
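
To make the proxy idea concrete, here is a minimal sketch with no Flink, Hadoop
or Beam types. RecordSink, DemuxWriter, ListSink and the Partition enum are
hypothetical names used only for illustration; they are not our actual code or
any library's API, and a real version would wrap real output formats instead of
in-memory lists.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical partition key; stands in for the Enum in KV<Enum, GenericRecord>.
enum Partition { TRADES, QUOTES }

// Hypothetical stand-in for a real output format: anything that accepts records.
interface RecordSink<T> {
    void write(T record);
}

// In-memory sink, used here only so the sketch can run without any file system.
final class ListSink<T> implements RecordSink<T> {
    final List<T> records = new ArrayList<>();

    @Override
    public void write(T record) {
        records.add(record);
    }
}

// Proxy writer: owns one sink per enum constant and forwards each write
// to the sink that matches the record's key.
final class DemuxWriter<K extends Enum<K>, T> {
    private final Map<K, RecordSink<T>> sinks;

    DemuxWriter(Class<K> keyType, Function<K, RecordSink<T>> sinkFactory) {
        sinks = new EnumMap<>(keyType);
        // Create the N real sinks up front, one per enum constant.
        for (K key : keyType.getEnumConstants()) {
            sinks.put(key, sinkFactory.apply(key));
        }
    }

    void write(K key, T record) {
        // A single map lookup replaces N filter passes over the data.
        sinks.get(key).write(record);
    }
}
```

Each record is touched once and routed by a map lookup, which is why this
performed better for us than filtering the collection once per enum value.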

Given the lack of Beam Parquet support, the final writer we want to use with 
Beam is Avro.

We used the proxy output format trick in Flink because performance was very 
poor when using a filter to split the data and then a map to convert from 
Enum,GenericRecord to just GenericRecord.

I'm nervous about using side outputs in Beam because I think they will be 
implemented as described here, which performs poorly.

So basically, has anyone implemented a demuxing AvroIO.Writer?

Thanks
