Hi Julian,

      Yes, the model used for memory spilling is the “hybrid” one: the 
memory space is pre-divided into “partitions” (based on the hash values of 
the key columns), and as needed only some of the partitions end up spilled 
to secondary storage (e.g., HDD). So after all the input is read, some of 
the partitions may still be fully in memory (the term chosen for those was 
“pristine”), and the data in those partitions incurs no extra overhead. The 
remaining partitions incur the needed IO overhead.
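   For illustration, here is a minimal Java sketch of that partitioning 
step (all the names here, like HybridHashAgg and NUM_PARTITIONS, are made 
up for the example and are not taken from the actual implementation):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    class HybridHashAgg {
        static final int NUM_PARTITIONS = 32;          // a power of two

        static class Partition {
            final List<Object[]> rows = new ArrayList<>();
            long bytesInMemory = 0;                    // memory currently held
            boolean spilled = false;                   // false => "pristine"
        }

        final Partition[] partitions = new Partition[NUM_PARTITIONS];

        HybridHashAgg() {
            for (int i = 0; i < NUM_PARTITIONS; i++) {
                partitions[i] = new Partition();
            }
        }

        void insert(Object[] keyColumns, Object[] row, long rowBytes) {
            int hash = Arrays.hashCode(keyColumns);    // hash of the key columns
            int idx = hash & (NUM_PARTITIONS - 1);     // low bits pick a partition
            Partition p = partitions[idx];
            p.rows.add(row);                           // kept in memory for now
            p.bytesInMemory += rowBytes;
            // ... when the memory limit is hit, a victim partition is spilled ...
        }
    }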
   In the worst case all the partitions end up spilled; in a typical case 
only some of them do. Whenever the memory limit is reached, the decision on 
which partition to spill next is a delicate one: spilling a “pristine” 
partition increases the total overhead, while spilling a previously spilled 
partition may yield little memory (and increases disk seeks later).
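   And, for illustration, one possible heuristic for that choice, 
continuing the sketch above (the minUsefulBytes threshold is invented; the 
actual policy is described in the document):

    // Pick the partition to spill next once the memory limit is reached:
    // prefer re-spilling an already spilled partition, but only if it holds
    // enough memory to be worth the extra disk seeks; otherwise give up the
    // largest "pristine" partition.
    static Partition chooseVictim(Partition[] partitions, long minUsefulBytes) {
        Partition largestSpilled = null, largestPristine = null;
        for (Partition p : partitions) {
            if (p.spilled) {
                if (largestSpilled == null
                        || p.bytesInMemory > largestSpilled.bytesInMemory) {
                    largestSpilled = p;
                }
            } else {
                if (largestPristine == null
                        || p.bytesInMemory > largestPristine.bytesInMemory) {
                    largestPristine = p;
                }
            }
        }
        if (largestSpilled != null
                && largestSpilled.bytesInMemory >= minUsefulBytes) {
            return largestSpilled;     // frees memory, keeps pristine ones intact
        }
        return largestPristine;        // this partition now incurs IO overhead
    }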
   All this is explained in the document; I hope it is accessible now.

       And the same model should be used for the Hash Join later on.

               — Boaz

> On Jan 13, 2017, at 11:00 PM, Julian Hyde <jh...@apache.org> wrote:
> 
> The attachment didn’t come through. I’m hoping that you settled on a “hybrid” 
> hash algorithm that can write to disk or to memory, and that the cost of 
> discovering that the choice was wrong is not too great. With Goetz Graefe’s 
> hybrid hash join (which can easily be adapted to a hybrid hash aggregate), if 
> the input ALMOST fits in memory you could process most of it in memory, then 
> revisit the stuff you spilled to disk.
> 
>> On Jan 13, 2017, at 7:46 PM, Boaz Ben-Zvi <bben-...@mapr.com> wrote:
>> 
>> Hi Drill developers,
>> 
>>     Attached is a document describing the design for memory spilling 
>> implementation for the Hash Aggregate operator.
>> 
>>     Please send me any comments or questions,
>> 
>>        -- Boaz
> 
