Re: balanced of Stream Codec

2016-10-17 Thread Vlad Rozov
Using different hash function will help only in case data is equally distributed across categories. In many cases data is skewed and some categories occur more frequently than others. In such case generic hash function will not help. Can you try to sample data and see if the data is equally

Re: balanced of Stream Codec

2016-10-16 Thread Pramod Immaneni
Hi Sunil, Have you tried an alternate hashing function other than java hashcode that might provide a more uniform distribution of your data? The google guava library provides a set of hashing strategies, like murmur hash, that is reported to have lesser hash collisions in different cases. Below

Re: balanced of Stream Codec

2016-10-15 Thread Thomas Weise
Without knowing the operations following the indeterministic partitioning, assume that you cannot have exactly-once results because processing won't be idempotent. If there are only stateless operations, then it should be OK. If there are stateful operations (windowing with any form of aggregation

Re: balanced of Stream Codec

2016-10-15 Thread Amol Kekre
Sunil, Round robin in an internal operator could be used in exactly once writes to external system for certain operations. I do know what your business logic is, but in case it can be split into partitions and then unified (for example aggregates), you have a situation where you can use round

Re: balanced of Stream Codec

2016-10-15 Thread Munagala Ramanath
If you want round-robin distribution which will give you uniform load across all partitions you can use a StreamCodec like this (provided the number of partitions is known and static): *public class CatagoryStreamCodec extends KryoSerializableStreamCodec {* * private int n = 0;* * @Override* *

Re: balanced of Stream Codec

2016-10-14 Thread Ashwin Chandra Putta
Sunil, For key based partitioning, the getPartition method is supposed to return a consistent integer representing the key for partitioning. Typically the java hashCode of the key. The tuples are then routed based on the integer and looking at its lower bits on the mask (number of lower bits)