I would simply use a Partition transform to convert the PCollection<KV<Enum,GenericRecord>> data to a list of PCollection<GenericRecord> and then write each one as desired. Either that or a DoFn with multiple side outputs (one for each enum value). There's no reason this should perform poorly on the runners I am aware of, and it is the most straightforward (and least error-prone) solution. (I don't know what your original solution looked like, but if you used N filters rather than one DoFn with N outputs, that would perform much worse.)
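To make the single-pass idea concrete, here is a minimal plain-Java sketch with no Beam dependency. It is not Beam code: the enum RecordType, the class PartitionSketch, and the use of String values in place of GenericRecord are all hypothetical stand-ins. The partitionFor method follows the same contract as the function you hand to a Partition transform (return an index in [0, numPartitions)), and the loop visits each element once, whereas N separate filters would each examine every element.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PartitionSketch {
    // Hypothetical enum standing in for the key type in the question.
    enum RecordType { ORDERS, CUSTOMERS, PAYMENTS }

    // Maps a key to an output index in [0, numPartitions).
    // Assumes numPartitions == RecordType.values().length.
    static int partitionFor(RecordType key, int numPartitions) {
        return key.ordinal();
    }

    // One pass over the input: each element is routed to exactly one output,
    // and the key is dropped so each output holds only the values.
    static <V> List<List<V>> partition(List<Map.Entry<RecordType, V>> input) {
        int n = RecordType.values().length;
        List<List<V>> outputs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            outputs.add(new ArrayList<>());
        }
        for (Map.Entry<RecordType, V> element : input) {
            int index = partitionFor(element.getKey(), n);
            outputs.get(index).add(element.getValue());
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Strings stand in for GenericRecord values here.
        List<Map.Entry<RecordType, String>> data = new ArrayList<>();
        data.add(new AbstractMap.SimpleEntry<>(RecordType.ORDERS, "o1"));
        data.add(new AbstractMap.SimpleEntry<>(RecordType.CUSTOMERS, "c1"));
        data.add(new AbstractMap.SimpleEntry<>(RecordType.ORDERS, "o2"));

        List<List<String>> outputs = partition(data);
        for (RecordType type : RecordType.values()) {
            // Each output would be handed to its own writer.
            System.out.println(type + ": " + outputs.get(type.ordinal()));
        }
    }
}
```

In a real pipeline each of the N outputs would be a PCollection<GenericRecord> with its own Avro write attached, rather than an in-memory list.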
https://beam.apache.org/documentation/programming-guide/#transforms-flatten-partition

On Tue, Mar 28, 2017 at 1:04 AM, Narek Amirbekian <[email protected]> wrote:

> We had to implement a custom sink in order to get a partitioned destination.
> This is the least hacky solution, but the downside of this approach is that
> it is a lot of code and that custom sinks only work in batch mode. If you need
> to do this in streaming mode you may be able to get away with writing as a
> side effect of a ParDo, but you have to account for concurrent writes, and
> chunks might be processed more than once, so you have to account for
> duplicate values in your output.
>
> On Thu, Mar 23, 2017 at 9:46 AM Newport, Billy <[email protected]>
> wrote:
>
>> Is there built-in support for writing partitioned collections? For example:
>>
>> PCollection<KV<Enum,GenericRecord>> data;
>>
>> We want to write the GenericRecords in data into different files based
>> on the enum. We did this in Flink by making a proxy HadoopOutputFormat
>> which has the N real output formats; the write method checks the enum and
>> forwards the write call for the GenericRecord to the correct output format.
>>
>> Given the lack of Beam Parquet support, the final writer we want to use
>> with Beam is Avro.
>>
>> We used the proxy output format trick in Flink because performance was
>> very poor using a filter to split it and then a map to convert from
>> Enum,GenericRecord to just GenericRecord.
>>
>> I'm nervous to use side outputs in Beam given I think they will be
>> implemented as described here which performs poorly.
>>
>> So basically, has anyone implemented a demuxing AvroIO.Writer?
>>
>> Thanks
