Hi there,

We're running MapReduce jobs in Java, and the Reducers write to Kudu.
In Java we use the hashCode() function to route results from Mappers to
Reducers, e.g.
    // Partition by the hash of caseId; masking with Integer.MAX_VALUE
    // clears the sign bit so the result is always non-negative.
    public int getPartition(ArchiveKey key, Object value, int numReduceTasks) {
        int hash = key.getCaseId().hashCode();
        return (hash & Integer.MAX_VALUE) % numReduceTasks;
    }
There is also a partitioning hash function in Kudu tables.

This raises two questions:
1. We have multiple Kudu clients (Reducers). Would it be better for each one 
to have a single session to a single tablet writing a large number of records, 
or multiple sessions writing to different tablets (the total number of records 
being the same)?
2. Assuming a 1-to-1 relationship is preferable, i.e. one Reducer should write 
to one tablet: what would be the proper implementation to reduce the number of 
connections from Reducers to different tablets? That is, if there are 128 
Reducers (each gathering its own set of unique hashes) and 128 tablets, then 
ideally each Reducer should write to one tablet, not to each of the 128 
tablets.

Why these questions arose: there is one hash implementation in Java and 
another in Kudu. Is there any way to ensure the Java Reducers use the same 
hash function as Kudu's partitioning hash?
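For illustration, here is a rough sketch of what we have in mind. The 
kuduBucket() helper below is a hypothetical placeholder (it just reuses 
hashCode()), NOT Kudu's real partitioning hash; the point is only to show 
where a Kudu-compatible bucket function would plug into the partitioner so 
that each Reducer's keys land in exactly one tablet:

```java
// Sketch only: if the MapReduce partitioner computed the same bucket as the
// Kudu table's hash partitioning, every key a given Reducer receives would
// belong to exactly one tablet, so each Reducer would open one session.
public class PartitionSketch {

    // PLACEHOLDER for Kudu's partitioning hash -- a hypothetical helper.
    // The real implementation would have to reproduce Kudu's bucket
    // computation exactly for the alignment to hold.
    static int kuduBucket(String caseId, int numBuckets) {
        return Math.floorMod(caseId.hashCode(), numBuckets);
    }

    // What getPartition() would return if it were aligned with Kudu's
    // buckets: reducer index == tablet bucket index.
    static int getPartition(String caseId, int numReduceTasks) {
        return kuduBucket(caseId, numReduceTasks);
    }

    public static void main(String[] args) {
        // All records sharing a caseId go to the same reducer, which, under
        // the alignment assumption, writes to exactly one tablet.
        int bucket = getPartition("case-42", 128);
        System.out.println("case-42 -> reducer/tablet " + bucket);
    }
}
```

This assumes the table has as many hash buckets as there are reduce tasks 
(128 in our case).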


Best regards,
Sergejs
