Hi Lukasz,

Thanks for the example. That sounds like a nice solution -
I am running on Dataflow though, which dynamically sets numShards - so if I
set numShards to 1 on each of those AvroIO writers, I can't be sure that
Dataflow isn't going to override my setting right? I guess this should work
fine as long as I partition my stream into a large enough number of
partitions so that Dataflow won't override numShards.

Josh

On Wed, May 24, 2017 at 4:10 PM, Lukasz Cwik <lc...@google.com> wrote:

> Since your using a small number of shards, add a Partition transform which
> uses a deterministic hash of the key to choose one of 4 partitions. Write
> each partition with a single shard.
>
> (Fixed width diagram below)
> Pipeline -> AvroIO(numShards = 4)
> Becomes:
> Pipeline -> Partition --> AvroIO(numShards = 1)
>                       |-> AvroIO(numShards = 1)
>                       |-> AvroIO(numShards = 1)
>                       \-> AvroIO(numShards = 1)
>
> On Wed, May 24, 2017 at 1:05 AM, Josh <jof...@gmail.com> wrote:
>
>> Hi,
>>
>> I am using a FileBasedSink (AvroIO.write) on an unbounded stream
>> (withWindowedWrites, hourly windows, numShards=4).
>>
>> I would like to partition the stream by some key in the element, so that
>> all elements with the same key will get processed by the same shard writer,
>> and therefore written to the same file. Is there a way to do this? Note
>> that in my stream the number of keys is very large (most elements have a
>> unique key, while a few elements share a key).
>>
>> Thanks,
>> Josh
>>
>
>

Reply via email to