Hi Lukasz, Thanks for the example. That sounds like a nice solution - I am running on Dataflow though, which dynamically sets numShards - so if I set numShards to 1 on each of those AvroIO writers, I can't be sure that Dataflow isn't going to override my setting right? I guess this should work fine as long as I partition my stream into a large enough number of partitions so that Dataflow won't override numShards.
Josh On Wed, May 24, 2017 at 4:10 PM, Lukasz Cwik <lc...@google.com> wrote: > Since your using a small number of shards, add a Partition transform which > uses a deterministic hash of the key to choose one of 4 partitions. Write > each partition with a single shard. > > (Fixed width diagram below) > Pipeline -> AvroIO(numShards = 4) > Becomes: > Pipeline -> Partition --> AvroIO(numShards = 1) > |-> AvroIO(numShards = 1) > |-> AvroIO(numShards = 1) > \-> AvroIO(numShards = 1) > > On Wed, May 24, 2017 at 1:05 AM, Josh <jof...@gmail.com> wrote: > >> Hi, >> >> I am using a FileBasedSink (AvroIO.write) on an unbounded stream >> (withWindowedWrites, hourly windows, numShards=4). >> >> I would like to partition the stream by some key in the element, so that >> all elements with the same key will get processed by the same shard writer, >> and therefore written to the same file. Is there a way to do this? Note >> that in my stream the number of keys is very large (most elements have a >> unique key, while a few elements share a key). >> >> Thanks, >> Josh >> > >