Re: balanced of Stream Codec

Vlad Rozov Mon, 17 Oct 2016 17:02:39 -0700

Using a different hash function will help only if the data is equally distributed across categories. In many cases the data is skewed, and some categories occur more frequently than others. In such a case a generic hash function will not help. Can you sample the data and see whether it is equally distributed across categories?
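One way to run that check is sketched below. It is only an illustration: `CategorySkewCheck` and its methods are hypothetical names, not part of Apex, and the sample is assumed to be a plain list of category names pulled from the input.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Apex: inspects a sample of category names.
class CategorySkewCheck {

    // Counts how many sampled tuples fall into each category.
    static Map<String, Long> countByCategory(List<String> sample) {
        Map<String, Long> counts = new HashMap<>();
        for (String name : sample) {
            Long old = counts.get(name);
            counts.put(name, old == null ? 1L : old + 1L);
        }
        return counts;
    }

    // Fraction of the sample taken by the single most frequent category.
    static double largestShare(List<String> sample) {
        if (sample.isEmpty()) {
            return 0.0;
        }
        long max = 0;
        for (long count : countByCategory(sample).values()) {
            max = Math.max(max, count);
        }
        return (double) max / sample.size();
    }

    public static void main(String[] args) {
        // Made-up sample purely for demonstration.
        List<String> sample = Arrays.asList(
                "login", "login", "login", "payment", "signup", "login");
        System.out.println(countByCategory(sample));
        System.out.println("largest share = " + largestShare(sample));
    }
}
```

If a single category accounts for a large share of the sample, the partition that receives it carries at least that share of the load, whichever hash function is used.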


Vlad



On 10/16/16 10:40, Pramod Immaneni wrote:

Hi Sunil,

Have you tried a hashing function other than the Java hashCode that might provide a more uniform distribution of your data? The Google Guava library provides a set of hashing strategies, such as murmur hash, that are reported to produce fewer hash collisions in various cases. Below is a link explaining these from their website:


https://github.com/google/guava/wiki/HashingExplained

Here is a link where someone has done a comparative study of different hashing functions:

http://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

If you end up choosing a hashing function from the Google Guava library, make sure you use the documentation for Guava version 11.0, as this version of Guava is already included in the Hadoop classpath.
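To get a feel for what a murmur-style hash changes, here is a minimal sketch in plain Java. It is not the Guava API: it applies the 32-bit finalizer (bit-mixing) step of MurmurHash3 on top of String.hashCode(). The names `MixedHash`, `mix` and `partitionHash` are hypothetical, and returning 0 for a null key is an assumption.

```java
// Hypothetical helper: scrambles String.hashCode() with the MurmurHash3
// 32-bit finalizer so that the lower bits depend on all of the input bits.
class MixedHash {

    // MurmurHash3 fmix32 step.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Value that a getPartition implementation could return for a key.
    // Assumption: a null key is routed to hash 0.
    static int partitionHash(String key) {
        return key == null ? 0 : mix(key.hashCode());
    }

    public static void main(String[] args) {
        // Made-up category names purely for demonstration.
        for (String name : new String[] {"login", "payment", "signup"}) {
            System.out.println(name + " -> " + partitionHash(name));
        }
    }
}
```

Mixing only spreads distinct keys more evenly; it cannot split one very frequent category across partitions.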


Thanks

On Fri, Oct 14, 2016 at 1:17 PM, Sunil Parmar <spar...@threatmetrix.com> wrote:


    We’re using a stream codec to process data consistently and in
    parallel across the operator partitions. Our requirement is to
    serialize processing of the data based on a particular tuple
    attribute; let’s call it ‘catagory_name’. To process different
    category names in parallel, we’ve written our stream codec as
    follows.

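    public class CatagoryStreamCodec extends KryoSerializableStreamCodec<Object> {

        private static final long serialVersionUID = -687991492884005033L;

        @Override
        public int getPartition(Object in) {
            try {
                InputTuple tuple = (InputTuple) in;
                // Partition on the category name so that tuples of one
                // category always reach the same operator instance.
                String partitionKey = tuple.getName();
                if (partitionKey != null) {
                    return partitionKey.hashCode();
                }
            } catch (ClassCastException e) {
                // Assumed handling: the original post omits the catch block.
            }
            // Assumed default: the original post omits the fallback return.
            return 0;
        }
    }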

    It’s working as expected, *but* we observed inconsistent partitioning
    when we ran this in the production environment with 20 partitions of
    the operator that follows the codec in the DAG.

      * Some operator instances didn’t process any data
      * Some operator instances processed as many tuples as all the
        others combined


    Questions:

      * Is the getPartition method supposed to return the actual
        partition, or just some lower bits used for deciding the
        partition?
      * The number of partitions is set in the application properties
        and can vary between deployments or environments. Is it best
        practice to use that property in the stream codec?
      * Is there a recommended hash function for getting consistent
        variation in the lower bits when the data has little variety?
        We have ~100+ categories, and I’m thinking of having 10+
        operator partitions.


    Thanks,
    Sunil
