Using a different hash function will help only if the data is evenly
distributed across categories. In many cases the data is skewed and some
categories occur more frequently than others. In that case a generic hash
function will not help. Can you sample the data and check whether it
is evenly distributed across categories?
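
One way to run that check is sketched below. It counts how often each category appears in a sample and reports the share taken by the most frequent one. The class name `CategorySkewCheck`, its methods, and the sample values are all made up for illustration and are not part of Apex; the sample would normally come from real tuples.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Apex: counts how often each category
// appears in a sample and reports the share taken by the largest one.
class CategorySkewCheck {

    // Count occurrences of each category name in the sample.
    static Map<String, Integer> countByCategory(List<String> sample) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : sample) {
            counts.merge(name, 1, Integer::sum);
        }
        return counts;
    }

    // Fraction of the sample that belongs to the most frequent category.
    static double largestShare(List<String> sample) {
        if (sample.isEmpty()) {
            return 0.0;
        }
        int max = 0;
        for (int c : countByCategory(sample).values()) {
            max = Math.max(max, c);
        }
        return (double) max / sample.size();
    }

    public static void main(String[] args) {
        // Made-up sample: "login" is 6 of 10 entries, so whichever
        // partition receives "login" handles at least 60% of the tuples.
        List<String> sample = Arrays.asList(
            "login", "login", "login", "login", "login", "login",
            "payment", "payment", "signup", "search");
        System.out.println(countByCategory(sample));
        System.out.println("largest share = " + largestShare(sample));
    }
}
```

If the largest share is already well above 1/N for N partitions, the partition that receives that category will be overloaded whatever hash function is used, because all tuples of one category must go to the same partition.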
Vlad
On 10/16/16 10:40, Pramod Immaneni wrote:
Hi Sunil,
Have you tried an alternate hashing function, other than the Java hashCode,
that might provide a more uniform distribution of your data? The
Google Guava library provides a set of hashing strategies, like murmur
hash, that are reported to have fewer hash collisions in different
cases. Below is a link explaining these from their website
https://github.com/google/guava/wiki/HashingExplained
Here is a link where someone has done a comparative study of different
hashing functions
http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
If you end up choosing a hashing function from the Google Guava library,
make sure you use the documentation for Guava version 11.0, as this
version of Guava is already included in the Hadoop classpath.
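
To illustrate what a better-mixing hash does, here is a minimal sketch that does not use Guava. It passes String.hashCode() through the 32-bit finalizer step of MurmurHash3, which is designed so that a change in any input bit affects all output bits, including the lowest ones. The class `MixedHash` and its methods are hypothetical names, and the sketch assumes that only the low bits of the returned value decide the partition.

```java
// Hypothetical helper, not part of Apex or Guava.
class MixedHash {

    // The 32-bit finalizer ("fmix32") from MurmurHash3.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Partition value for a category name; null names map to 0.
    static int partitionValue(String name) {
        return name == null ? 0 : mix(name.hashCode());
    }

    // Count how many names fall into each bucket selected by the low
    // bits of the partition value. The mask must be of the form 2^k - 1.
    static int[] bucketCounts(String[] names, int mask) {
        int[] counts = new int[mask + 1];
        for (String n : names) {
            counts[partitionValue(n) & mask]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up category names; replace with real ones to see how
        // they spread over 16 low-bit buckets.
        String[] names = new String[100];
        for (int i = 0; i < names.length; i++) {
            names[i] = "category-" + i;
        }
        System.out.println(java.util.Arrays.toString(bucketCounts(names, 0xF)));
    }
}
```

Running bucketCounts on the real category names, with and without the mix step, shows whether the hash itself is the cause of the imbalance. Mixing only affects how category names spread across buckets; it does not change how many tuples each category carries.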
Thanks
On Fri, Oct 14, 2016 at 1:17 PM, Sunil Parmar
<spar...@threatmetrix.com> wrote:
We’re using a stream codec to process the data consistently and in
parallel across the operator partitions. Our requirement is to
serialize processing of the data based on a particular tuple
attribute; let’s call it ‘catagory_name’. To process different
category names in parallel, we’ve written our stream codec as
follows.
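public class CatagoryStreamCodec extends
        KryoSerializableStreamCodec<Object> {
    private static final long serialVersionUID = -687991492884005033L;

    @Override
    public int getPartition(Object in) {
        // Partition on the tuple's name so that all tuples of one
        // category reach the same operator partition.
        InputTuple tuple = (InputTuple) in;
        String partitionKey = tuple.getName();
        if (partitionKey != null) {
            return partitionKey.hashCode();
        }
        // Fallback for a null name; assumed, as the original post
        // omits this path.
        return 0;
    }
}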
It’s working as expected, *but* we observed inconsistent partitioning
when we ran this in the production environment with 20 partitions of
the operator that follows the codec in the DAG.
* Some operator instances didn’t process any data
* Some operator instances processed as many tuples as all the
others combined
Questions:
* Is the getPartition method supposed to return the actual partition,
or just a value whose lower bits are used for deciding the partition?
* The number of partitions is known from application properties and
can vary between deployments or environments. Is it best
practice to use that property in the stream codec?
* Is there a recommended hash function for getting consistent
variation in the lower bits when the data has little variety? We have
~100+ categories and I’m thinking of having 10+ operator
partitions.
Thanks,
Sunil