We had to implement a custom sink in order to get partitioned destinations.
This is the least hacky solution, but the downsides of this approach are that
it is a lot of code and that custom sinks only work in batch mode. If you need
to do this in streaming mode you may be able to get away with writing as a
side effect of a ParDo, but you have to account for concurrent writes, and
since chunks might be processed more than once you also have to account for
duplicate values in your output.
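
To make the ParDo idea concrete, below is a rough, untested sketch. It assumes
a hypothetical MyEnum partition key, an output directory OUTPUT_DIR, and that
the record schema is available as SCHEMA; none of these come from the original
pipeline. The idea is to group by key per window and derive the file name from
the key and the window, so a re-processed chunk overwrites its earlier output
instead of adding duplicates, and different keys never contend for the same
file:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class PartitionedAvroSideEffectWrite {

  enum MyEnum { A, B }                                  // hypothetical partition key
  static final String OUTPUT_DIR = "/tmp/partitioned";  // hypothetical output location
  static final Schema SCHEMA = null;                    // replace with the GenericRecord schema

  // Invoked once per (key, window) after the GroupByKey, so there is a single
  // writer per output file and a retried bundle simply rewrites the same file.
  static class WriteKeyedAvroFn extends DoFn<KV<MyEnum, Iterable<GenericRecord>>, Void> {
    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) throws IOException {
      MyEnum key = c.element().getKey();
      new File(OUTPUT_DIR).mkdirs();
      // Deterministic name per (key, window): re-processing overwrites, never appends.
      File out = new File(OUTPUT_DIR,
          key.name() + "-" + window.maxTimestamp().getMillis() + ".avro");
      try (DataFileWriter<GenericRecord> writer =
               new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
        writer.create(SCHEMA, out);
        for (GenericRecord record : c.element().getValue()) {
          writer.append(record);
        }
      }
    }
  }

  static void write(PCollection<KV<MyEnum, GenericRecord>> data) {
    data
        // Windowing is what bounds each GroupByKey output in streaming mode.
        .apply(Window.<KV<MyEnum, GenericRecord>>into(
            FixedWindows.of(Duration.standardMinutes(5))))
        // One element per key per window; a coder for MyEnum may need to be registered.
        .apply(GroupByKey.<MyEnum, GenericRecord>create())
        .apply(ParDo.of(new WriteKeyedAvroFn()));
  }
}

The GroupByKey costs a shuffle, but it gives you one writer per (key, window),
which largely sidesteps the concurrent-write problem; writing to a distributed
filesystem instead of a local path would need the same deterministic-naming
discipline plus a write-to-temp-then-rename step.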

On Thu, Mar 23, 2017 at 9:46 AM Newport, Billy <[email protected]> wrote:

> Is there builtin support for writing partitioned Collections. For example:
>
>
>
> PCollection<KV<Enum,GenericRecord>> data;
>
>
>
> We want to write the GenericRecords in data into different files based on
> the enum. We did this in Flink by making a proxy HadoopOutputFormat which
> holds the N real OutputFormats; its write method checks the enum and
> forwards the write call for the GenericRecord to the correct OutputFormat.
>
>
>
> Given the lack of Beam Parquet support, the final writer we want to use
> with Beam is Avro.
>
>
>
> We used the proxy OutputFormat trick in Flink because performance was very
> poor when using a filter to split the collection and then a map to convert
> from KV<Enum, GenericRecord> to just GenericRecord.
>
>
>
> I’m nervous about using side outputs in Beam given that I think they will be
> implemented as described here, which performs poorly.
>
>
>
> So basically, has anyone implemented a demuxing AvroIO.Writer?
>
>
>
> Thanks
>
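
For reference, the demuxing proxy described in the quoted question looks
roughly like the following at the RecordWriter level. This is an illustrative
sketch only, not the original code; the names (DemuxRecordWriter, MyEnum) and
the Void key type for the delegate writers are assumptions.

import java.io.IOException;
import java.util.Map;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

enum MyEnum { A, B }   // hypothetical partition key

class DemuxRecordWriter extends RecordWriter<MyEnum, GenericRecord> {

  // One real writer per enum value, opened up front by the proxy OutputFormat.
  private final Map<MyEnum, RecordWriter<Void, GenericRecord>> delegates;

  DemuxRecordWriter(Map<MyEnum, RecordWriter<Void, GenericRecord>> delegates) {
    this.delegates = delegates;
  }

  @Override
  public void write(MyEnum key, GenericRecord value) throws IOException, InterruptedException {
    // Forward the record to the writer that owns this partition.
    delegates.get(key).write(null, value);
  }

  @Override
  public void close(TaskAttemptContext context) throws IOException, InterruptedException {
    for (RecordWriter<Void, GenericRecord> writer : delegates.values()) {
      writer.close(context);
    }
  }
}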
