Thank you very much for the explanation, Alexander.

On Sun, Mar 8, 2015 at 1:14 PM, Alexander Pivovarov <[email protected]> wrote:
> 1. sort by - keys are distributed according to the MR partitioner
> (controlled by distribute by in hive)
>
> Let's assume the hash partitioner uses the same column as sort by and uses
> the x mod 16 formula to get the reducer id
>
> reducer 0 will have keys
> 0
> 16
> 32
>
> reducer 1 will have keys
> 1
> 17
> 33
>
> if you merge reducer 0 and reducer 1 output you will have
> 0
> 16
> 32
> 1
> 17
> 33
>
> 2. "order by" will use 1 reducer and hive will send all keys to reducer 0
>
> So "order by" in hive works differently from terasort. In the case of
> terasort you can merge the output files and get one file with globally
> sorted data.
>
> On Sun, Mar 8, 2015 at 7:55 AM, max scalf <[email protected]> wrote:
>
>> Thank you Alexander. So is it fair to assume that when sort by is used and
>> multiple files are produced, one per reducer, at the end all of them are
>> put together/merged to get the results back?
>>
>> And can sort by be used without distribute by and be expected to give the
>> same result as order by?
>>
>> On Sat, Mar 7, 2015 at 7:05 PM, Alexander Pivovarov <[email protected]> wrote:
>>
>>> a sort by query produces multiple independent files.
>>>
>>> order by - just one file
>>>
>>> usually sort by is used with distribute by.
>>> In older hive versions (0.7) they might be used to implement a local sort
>>> within a partition,
>>> similar to RANK() OVER (PARTITION BY A ORDER BY B)
>>>
>>> On Sat, Mar 7, 2015 at 3:02 PM, max scalf <[email protected]> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am new to hadoop and hive in general and I am reading "Hadoop: The
>>>> Definitive Guide" by Tom White. On page 504, in the hive chapter, Tom
>>>> says the following with regard to sorting:
>>>>
>>>> *Sorting and Aggregating*
>>>> *Sorting data in Hive can be achieved by using a standard ORDER BY
>>>> clause. ORDER BY performs a parallel total sort of the input (like that
>>>> described in “Total Sort” on page 261). When a globally sorted result is
>>>> not required—and in many cases it isn’t—you can use Hive’s nonstandard
>>>> extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*
>>>>
>>>> My question is, what exactly does he mean by "globally sorted
>>>> result"? If the sort by operation produces a sorted file per reducer, does
>>>> that mean that at the end of the sort all the reducer outputs are put back
>>>> together to give the correct results?
>>>>
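The three cases discussed in this thread can be sketched as a toy simulation. This is a minimal Python sketch, not Hive or MapReduce code: plain lists stand in for reducer output files, the function names (sort_by, order_by, range_partitioned_sort, concat) are hypothetical, the hash partitioner is assumed to be the x mod 16 formula from Alexander's example, and the range boundaries [10, 20] are hand-picked for illustration (terasort derives its split points by sampling the input).

```python
import bisect

NUM_REDUCERS = 16  # assumption: the "x mod 16" partitioner from the thread


def sort_by(keys, num_reducers=NUM_REDUCERS):
    """SORT BY with hash partitioning: each reducer sorts only its own keys."""
    reducers = [[] for _ in range(num_reducers)]
    for k in keys:
        reducers[k % num_reducers].append(k)
    return [sorted(r) for r in reducers]


def order_by(keys):
    """ORDER BY as described in the thread: one reducer receives every key."""
    return [sorted(keys)]


def range_partitioned_sort(keys, boundaries):
    """Terasort-style range partitioning: reducer i holds a contiguous key range.

    `boundaries` is a sorted list of hypothetical split points.
    """
    reducers = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        reducers[bisect.bisect_right(boundaries, k)].append(k)
    return [sorted(r) for r in reducers]


def concat(files):
    """Concatenate reducer outputs in reducer order, like merging part files."""
    return [k for f in files for k in f]


keys = [33, 0, 17, 32, 1, 16]

sort_by_files = sort_by(keys)
print(sort_by_files[0])        # [0, 16, 32]
print(sort_by_files[1])        # [1, 17, 33]
print(concat(sort_by_files))   # [0, 16, 32, 1, 17, 33]  (not globally sorted)

print(concat(order_by(keys)))  # [0, 1, 16, 17, 32, 33]

print(concat(range_partitioned_sort(keys, [10, 20])))  # [0, 1, 16, 17, 32, 33]
```

Each SORT BY file is sorted on its own, but concatenating the files does not give a globally sorted result because the hash partitioner interleaves key ranges across reducers. With range partitioning every key in reducer i is smaller than every key in reducer i+1, so simple concatenation is enough.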
