> So if we want the same result from Hive and Spark or some other
> framework, how could we try this?

There's a special backwards-compatibility slow codepath that gets triggered
if you do

set mapred.reduce.tasks=199; (or any number)

This will produce exactly the same hash code as the Java hashCode() for
Strings & Integers.

The bucket-id is determined by

(hashCode & Integer.MAX_VALUE) % numberOfBuckets

but this also triggers a non-stable sort on an entirely empty key, which
will shuffle the data so the output file's order bears no resemblance to
the input file's order.
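
If you want to check the bucket-id formula above from another framework,
here is a minimal sketch in plain Java. The class and method names
(BucketIdSketch, bucketId) are made up for illustration and this is not
Hive's actual code; it only assumes the key is a String or an Integer, where
the standard Java hashCode() applies as described above.

```java
public class BucketIdSketch {

    // Hypothetical helper mirroring the formula in this mail:
    // (hashCode & Integer.MAX_VALUE) % numberOfBuckets
    static int bucketId(Object key, int numberOfBuckets) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so a
        // negative hashCode still maps to a non-negative bucket.
        return (key.hashCode() & Integer.MAX_VALUE) % numberOfBuckets;
    }

    public static void main(String[] args) {
        int buckets = 199; // same number as the reducer count above

        System.out.println(bucketId("a", buckets));                 // String key
        System.out.println(bucketId(Integer.valueOf(42), buckets)); // Integer key
        System.out.println(bucketId(Integer.valueOf(-1), buckets)); // negative hash
    }
}
```

The mask is the part that matters: a plain % on a negative hashCode would
give a negative bucket number in Java.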


Even with that setting, the only consistent layout produced by Hive is the
one from CLUSTER BY, which sorts on the same key used for distribution &
uses the Java hashCode if auto-parallelism is turned off by setting a fixed
reducer count.

Cheers,
Gopal

