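Since you're using a small number of shards, add a Partition transform that
uses a deterministic hash of the key to choose one of 4 partitions, then
write each partition with a single shard. Elements with the same key always
land in the same partition, so within each window they end up in the same
file.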

(Fixed width diagram below)
Pipeline -> AvroIO(numShards = 4)
Becomes:
Pipeline -> Partition --> AvroIO(numShards = 1)
                      |-> AvroIO(numShards = 1)
                      |-> AvroIO(numShards = 1)
                      \-> AvroIO(numShards = 1)
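
For the hashing step, here is a minimal sketch of what I mean, assuming the
key can be rendered as a String. KeyPartitioner is a made-up helper name, not
part of Beam, and CRC32 is just one reasonable choice of stable hash. In a
pipeline this logic would sit inside the PartitionFn you pass to
Partition.of(4, ...).

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical helper (not a Beam class): maps a key to a partition index. */
final class KeyPartitioner {
    private KeyPartitioner() {}

    /**
     * Returns an index in [0, numPartitions) for the given key.
     * CRC32 over the UTF-8 bytes gives the same answer on every worker,
     * which hashCode() does not guarantee for arbitrary key types.
     */
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // getValue() is an unsigned 32-bit value held in a long, so the
        // remainder is never negative.
        return (int) (crc.getValue() % numPartitions);
    }
}
```

Calling KeyPartitioner.partitionFor(key, 4) on two elements with equal keys
returns the same index, so both are routed to the same single-shard AvroIO
write.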

On Wed, May 24, 2017 at 1:05 AM, Josh <jof...@gmail.com> wrote:

> Hi,
>
> I am using a FileBasedSink (AvroIO.write) on an unbounded stream
> (withWindowedWrites, hourly windows, numShards=4).
>
> I would like to partition the stream by some key in the element, so that
> all elements with the same key will get processed by the same shard writer,
> and therefore written to the same file. Is there a way to do this? Note
> that in my stream the number of keys is very large (most elements have a
> unique key, while a few elements share a key).
>
> Thanks,
> Josh
>
