We're using Stream codec to consistently / parallel processing of the data 
across the operator partitions. Our requirement is to serialize processing of 
the data based on particular tuple attribute let's call it 'catagory_name' . In 
order to achieve the parallel processing of different category names we're 
written our stream codec as following.


   public class CatagoryStreamCodec extends KryoSerializableStreamCodec<Object> 
{

private static final long serialVersionUID = -687991492884005033L;



    @Override

    public int getPartition(Object in) {

        try {

        InputTuple tuple = (InputTuple) in;

String partitionKehy = tuple.getName();

        if(partitionKehy != null) {

return partitionKehy.hashCode();

        }

    }

   }

It's working as expected but we observed inconsistent partitions when we run 
this in production env with 20 partitioner of the operator following the codec 
in the dag.

  *   Some operator instance didn't process any data
  *   Some operator instance process as many tuples as combined everybody else

Questions :

  *   getPartition method supposed to return the actual partition or just some 
lower bit used for deciding partition ?
  *   Number of partitions is known to application properties and can vary 
between deployments or environments. Is it best practice to use that property 
in the stream codec ?
  *   Any recommended hash function for getting consistent variations in the 
lower bit with less variety of data. we've ~100+ categories and I'm thinking to 
have 10+ operator partitions.

Thanks,
Sunil

Reply via email to