Re: sorting in hive -- general

max scalf Sun, 08 Mar 2015 12:03:58 -0700

Thank you very much for the explanation Alexander.

On Sun, Mar 8, 2015 at 1:14 PM, Alexander Pivovarov <[email protected]>
wrote:


> 1. sort by -
> keys are distributed according to the MR partitioner (controlled by
> distribute by in hive)
>
> Let's assume the hash partitioner uses the same column as sort by and uses
> the formula x mod 16 to get the reducer id
>
> reducer 0 will have keys
> 0
> 16
> 32
>
> reducer 1 will have keys
> 1
> 17
> 33
>
>
> if you merge reducer 0 and reducer 1 output you will have
> 0
> 16
> 32
> 1
> 17
> 33
>
>
> 2. "order by" will use 1 reducer and hive will send all keys to reducer 0
>
> So "order by" in hive works differently from terasort. In the case of terasort
> you can merge the output files (in reducer order) and get one file with
> globally sorted data.
>
>
>
>
> On Sun, Mar 8, 2015 at 7:55 AM, max scalf <[email protected]> wrote:
>
>> Thank you Alexander.  So is it fair to assume when sort by is used and
>> multiple files are produced per reducer at the end of it all of them are
>> put together/merged to get the results back?
>>
>> And can sort by be used without distribute by and expect the same result as
>> order by ?
>>
>> On Sat, Mar 7, 2015 at 7:05 PM, Alexander Pivovarov <[email protected]
>> > wrote:
>>
>>> sort by query produces multiple independent files.
>>>
>>> order by - just one file
>>>
>>> usually sort by is used with distribute by.
>>> In older hive versions (0.7) they might be used to implement a local sort
>>> within a partition
>>> similar to RANK() OVER (PARTITION BY A ORDER BY B)
>>>
>>>
>>> On Sat, Mar 7, 2015 at 3:02 PM, max scalf <[email protected]>
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am new to hadoop and hive in general and i am reading "hadoop the
>>>> definitive guide" by Tom White and on page 504 for the hive chapter, Tom
>>>> says below with regards to sorting
>>>>
>>>> *Sorting and Aggregating*
>>>> *Sorting data in Hive can be achieved by using a standard ORDER BY
>>>> clause. ORDER BY performs a parallel total sort of the input (like that
>>>> described in “Total Sort” on page 261). When a globally sorted result is
>>>> not required—and in many cases it isn’t—you can use Hive’s nonstandard
>>>> extension, SORT BY, instead. SORT BY produces a sorted file per reducer.*
>>>>
>>>>
>>>> My question is, what exactly does he mean by "globally sorted
>>>> result"? If the sort by operation produces a sorted file per reducer, does
>>>> that mean that at the end of the sort all the reducer outputs are put back
>>>> together to give the correct results?
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
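
For anyone reading this thread later, the difference between the three cases
can be sketched in a few lines of Python. This is only a toy model, not Hive
or Hadoop code: the function names are made up for illustration, the
partitioner is the "x mod 16" rule from Alexander's example, and the range
boundary of 17 is an arbitrary assumption (terasort picks its boundaries by
sampling the input).

```python
import bisect
from collections import defaultdict

NUM_REDUCERS = 16  # assumption: matches the "x mod 16" example in the thread


def sort_by(keys, num_reducers=NUM_REDUCERS):
    """Toy model of SORT BY with a hash partitioner: one sorted list per reducer."""
    buckets = defaultdict(list)
    for k in keys:
        buckets[k % num_reducers].append(k)  # reducer id = x mod 16
    # Each reducer sorts only the keys it received.
    return {r: sorted(ks) for r, ks in sorted(buckets.items())}


def order_by(keys):
    """Toy model of ORDER BY: all keys go to a single reducer, which sorts them."""
    return sorted(keys)


def range_partitioned_sort(keys, boundaries):
    """Toy model of a terasort-style range partitioner.

    Reducer i receives keys below boundaries[i]; the last reducer gets the rest.
    """
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        buckets[bisect.bisect_right(boundaries, k)].append(k)
    return [sorted(b) for b in buckets]


def concatenate(outputs):
    """Append the reducer outputs one after another, in reducer order."""
    return [k for out in outputs for k in out]


keys = [33, 0, 17, 32, 1, 16]

per_reducer = sort_by(keys)
print(per_reducer)                         # {0: [0, 16, 32], 1: [1, 17, 33]}
print(concatenate(per_reducer.values()))   # [0, 16, 32, 1, 17, 33]
print(order_by(keys))                      # [0, 1, 16, 17, 32, 33]
print(concatenate(range_partitioned_sort(keys, [17])))  # [0, 1, 16, 17, 32, 33]
```

With hash partitioning each file is sorted, but concatenating the files does
not give a globally sorted result. With range partitioning every key in one
reducer is smaller than every key in the next, so plain concatenation is
already globally sorted.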
