> so do you think if we want the same result from Hive and Spark or the
> other framework, how could we try this one ?
There's a special backwards-compat slow codepath that gets triggered if you do

set mapred.reduce.tasks=199; (or any number)

This will produce the exact same hash code as the Java hashCode for Strings & Integers. The bucket-id is determined by

(hashCode & Integer.MAX_VALUE) % numberOfBuckets

but this also triggers a non-stable sort on an entirely empty key, which will shuffle the data, so the order of the output file bears no resemblance to the order of the input file.

Even with that setting, the only consistent layout produced by Hive is CLUSTER BY, which sorts on the same key used for distribution and uses the Java hashCode if auto-parallelism is turned off by setting a fixed reducer count.

Cheers,
Gopal
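
For anyone who wants to check bucket assignments outside Hive, the formula above can be sketched in plain Java. This is only an illustration of the arithmetic and not Hive's actual code: the class name BucketIdSketch and the helper bucketId are made up for this example, and it assumes the key is a String or Integer so that Object.hashCode() is the Java hashCode mentioned above.

```java
public class BucketIdSketch {

    // Hypothetical helper (not part of Hive): applies the formula
    // (hashCode & Integer.MAX_VALUE) % numberOfBuckets to a key.
    // Masking with Integer.MAX_VALUE clears the sign bit, so the
    // result is never negative even when hashCode() is negative.
    static int bucketId(Object key, int numberOfBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numberOfBuckets;
    }

    public static void main(String[] args) {
        int numberOfBuckets = 199; // same fixed count as in the example setting

        // "a".hashCode() is 97, so this prints 97
        System.out.println(bucketId("a", numberOfBuckets));

        // Integer.hashCode() is the int value itself, so this prints 42
        System.out.println(bucketId(Integer.valueOf(42), numberOfBuckets));

        // A negative hash code still maps to a valid, non-negative bucket
        System.out.println(bucketId(Integer.valueOf(-1), numberOfBuckets));
    }
}
```

Another framework would have to apply the same hash and the same masking to land rows in the same buckets. This says nothing about row order within a bucket, which is the separate sort issue described above.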
