Hi there,
We're running MapReduce jobs in Java, and the Reducers write to Kudu.
On the Java side we use the hashCode() function to route results from
Mappers to Reducers, e.g.:
// Hadoop Partitioner method: choose a reducer from the caseId hash.
@Override
public int getPartition(ArchiveKey key, Object value, int numReduceTasks) {
  // Mask off the sign bit so the modulo result is never negative.
  int hash = key.getCaseId().hashCode();
  return (hash & Integer.MAX_VALUE) % numReduceTasks;
}
Kudu tables, for their part, are partitioned with Kudu's own hash function.
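For context, our tables are hash-partitioned roughly like this (a minimal
sketch of table creation with the Kudu Java client; the master address,
table name, column names, and bucket count are placeholders, not our real
schema):

import java.util.Arrays;

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class CreateHashedTable {
  public static void main(String[] args) throws Exception {
    KuduClient client =
        new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      Schema schema = new Schema(Arrays.asList(
          new ColumnSchema.ColumnSchemaBuilder("case_id", Type.STRING)
              .key(true).build(),
          new ColumnSchema.ColumnSchemaBuilder("payload", Type.STRING)
              .nullable(true).build()));
      // Kudu buckets rows by its own internal hash of case_id,
      // independently of Java's hashCode().
      CreateTableOptions options = new CreateTableOptions()
          .addHashPartitions(Arrays.asList("case_id"), 128);
      client.createTable("archive", schema, options);
    } finally {
      client.close();
    }
  }
}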
This raises two questions:
1. We have multiple Kudu clients (one per Reducer). Is it better for each
client to keep a single session writing a large number of records to a
single tablet, or to keep multiple sessions writing to different tablets
(the total number of records being the same)? A sketch of the
single-session variant follows below.
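To make question 1 concrete, here is a minimal sketch of the
single-session variant as we picture it (table and column names are
placeholders, and AUTO_FLUSH_BACKGROUND is just one plausible flush mode):

import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.SessionConfiguration;

public class ReducerWriter {
  // Writes one reducer's records through a single long-lived session;
  // the Kudu client routes each row to the tablet that owns its key.
  public static void writeAll(String masterAddresses,
                              Iterable<String> caseIds) throws Exception {
    KuduClient client =
        new KuduClient.KuduClientBuilder(masterAddresses).build();
    try {
      KuduTable table = client.openTable("archive");
      KuduSession session = client.newSession();
      session.setFlushMode(
          SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
      for (String caseId : caseIds) {
        Insert insert = table.newInsert();
        insert.getRow().addString("case_id", caseId);
        session.apply(insert);
      }
      session.flush();  // drain buffered writes before closing
      session.close();
    } finally {
      client.close();
    }
  }
}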
2. Assuming a 1-to-1 relationship is preferable, i.e. one Reducer writing
to one tablet: what would be the proper implementation to minimize the
number of connections from reducers to tablets? For example, with 128
reducers (each gathering its own set of unique hashes) and 128 tablets,
ideally each reducer would write to exactly one tablet rather than to all
128 of them.
Why these questions arose: there is one hash implementation in Java and
another in Kudu. Is there any way to make the Java Reducers use the same
hash function as Kudu's partitioning hash? A sketch of what we have in
mind follows below.
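To illustrate what we mean, here is a hypothetical sketch of a Hadoop
Partitioner that delegates bucketing to the Kudu client itself. It relies
on org.apache.kudu.client.KuduPartitioner, which as far as we can tell
only exists in newer kudu-client versions (worth verifying), and it omits
how the partitioner would actually be instantiated and configured by the
MapReduce framework; table and column names are placeholders:

import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduPartitioner;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;

public class KuduAlignedPartitioner extends Partitioner<ArchiveKey, Object> {
  private final KuduTable table;
  private final KuduPartitioner kuduPartitioner;

  public KuduAlignedPartitioner(KuduClient client) throws Exception {
    this.table = client.openTable("archive");  // placeholder table name
    this.kuduPartitioner =
        new KuduPartitioner.KuduPartitionerBuilder(table).build();
  }

  @Override
  public int getPartition(ArchiveKey key, Object value, int numReduceTasks) {
    PartialRow row = table.getSchema().newPartialRow();
    row.addString("case_id", key.getCaseId());
    try {
      // The same partition index Kudu itself would pick for this row.
      int kuduPartition = kuduPartitioner.partitionRow(row);
      return kuduPartition % numReduceTasks;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}

With 128 reducers and 128 tablets this would, if it works as we hope, give
each reducer exactly one tablet's worth of rows.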
Best regards,
Sergejs