Re: Tuple Flow Question and Answer

Eldon Carman Wed, 04 Dec 2013 16:36:25 -0800

What's the reason for Hyracks selecting a single frame for each operator?
Many of the rewrite rules focus on minimizing the data we store in this
single frame.



On Wed, Dec 4, 2013 at 1:34 PM, Eldon Carman <[email protected]> wrote:

> I posted a question about tuple flow to the Hyracks group. Here is a copy
> of the dialogue.
>
>
> On Tue, Dec 3, 2013 at 11:48 PM, Vinayak Borkar wrote:
>
>> The standard strategy used by every operator to send data to the next
>> operator in the pipeline is to use one pre-allocated memory buffer that is
>> reused.
>>
>> Say OP0 feeds data to OP1 (unnest) feeds data to OP2. OP0 and OP1 create
>> a "frame" each, say F0 and F1 respectively, at the beginning of query
>> execution.
>>
>> OP0 would then pack tuples (whose format is the sequential
>> juxtaposition of their field values) into F0 until no more tuples can fit. At
>> this time OP0 invokes the nextFrame() method on OP1 (through connectors, if
>> applicable) to pass the data to OP1. OP1 iterates over F0 and processes
>> each tuple creating the result tuples in F1. One of two things can happen
>> now; either F0 is exhausted and F1 still has room, or F1 is full and F0
>> still contains tuples to be processed. In the first case, OP1 would return
>> from the nextFrame() call back to OP0 which would refill F0 with the next
>> set of tuples. In the second case, OP1 would invoke OP2.nextFrame(F1).
>>
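
To check my understanding of the frame hand-off described above, here is a
toy Python model. It is only a sketch: Frame, Unnest, Sink and run_source
are names I made up, capacity is counted in tuples rather than bytes, and
the close() step that flushes the last partial frame is my assumption
about how a pipeline has to finish.

```python
class Frame:
    """Fixed-capacity, reusable buffer. Capacity is counted in tuples here
    (an assumption; a real frame is a fixed-size byte buffer)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tuples = []

    def is_full(self):
        return len(self.tuples) >= self.capacity

    def reset(self):
        self.tuples.clear()


class Sink:
    """Stands in for OP2: records every tuple and counts frames received."""

    def __init__(self):
        self.received = []
        self.frames_seen = 0

    def next_frame(self, frame):
        self.frames_seen += 1
        self.received.extend(frame.tuples)  # consume before the frame is reused

    def close(self):
        pass


class Unnest:
    """Stands in for OP1: emits one output tuple per item of field 0."""

    def __init__(self, downstream, capacity):
        self.downstream = downstream
        self.out = Frame(capacity)  # F1, allocated once and reused

    def next_frame(self, frame):
        for tup in frame.tuples:
            for item in tup[0]:
                if self.out.is_full():
                    self._flush()  # F1 is full while F0 still has input
                self.out.tuples.append(tup + (item,))
        # F0 is exhausted: return to the caller, which refills F0

    def _flush(self):
        self.downstream.next_frame(self.out)
        self.out.reset()

    def close(self):
        if self.out.tuples:
            self._flush()
        self.downstream.close()


def run_source(tuples, downstream, capacity):
    """Stands in for OP0: packs tuples into F0 and pushes it when full."""
    f0 = Frame(capacity)  # F0, allocated once and reused
    for tup in tuples:
        if f0.is_full():
            downstream.next_frame(f0)
            f0.reset()
        f0.tuples.append(tup)
    if f0.tuples:
        downstream.next_frame(f0)
    downstream.close()


sink = Sink()
run_source([((1, 2, 3),), ((4, 5),)], Unnest(sink, capacity=2), capacity=2)
```

In this model only two buffers (F0 and F1) ever exist, no matter how many
output tuples the unnest produces.
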
>> In your specific example, the unnest operator would end up copying $$1
>> three times, once for each output tuple. However, in terms of memory
>> consumption, this does not lead to more space usage when the operators
>> pipeline the frames as described above. It is however inefficient to make
>> the copies.
>>
>> Two broad strategies are possible to improve the performance of the
>> system.
>>
>> 1. In VXQuery, we use a sequence of unnest operators, one for each step of
>> a path expression. So a/b/c will become three unnest operators. This is not
>> necessary. VXQuery could have a rewrite rule that converts
>>
>> unnest iterate($$1, "b") -> $$2
>>   unnest iterate($$0, "a") -> $$1
>>
>> into unnest iterate($$0, "a/b") -> $$2 when $$1 is not required.
>>
>> This concept can be further used to rewrite
>>
>> unnest iterate(...)
>>   data-scan(...)
>>
>> into data-scan with the path that is needed pushed into the source when
>> the binding of the data-scan itself is not needed for anything other than
>> the unnest. We could extend the parser to only produce XML trees for the
>> given path steps. This will eliminate a whole bunch of copies.
>>
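
A rough Python sketch of the unnest-merging rule from (1), to make sure I
follow it. UnnestOp, merge_unnests and live_vars are hypothetical names
for illustration, not the Algebricks rewrite-rule API. The plan is modeled
as a chain where each operator points at the child that feeds it, and
live_vars is the set of variables still needed by the rest of the plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnnestOp:
    in_var: str                          # variable the path is applied to
    path: str                            # path step(s) to iterate over
    out_var: str                         # variable bound to each item
    child: Optional["UnnestOp"] = None   # operator that feeds this one


def merge_unnests(op, live_vars):
    """Fuse op with its child while the child's output variable is used
    only as op's input (that is, it is not in live_vars)."""
    child = op.child
    while (child is not None
           and child.out_var == op.in_var
           and child.out_var not in live_vars):
        op = UnnestOp(child.in_var, child.path + "/" + op.path,
                      op.out_var, child.child)
        child = op.child
    return op


plan = UnnestOp("$$1", "b", "$$2", UnnestOp("$$0", "a", "$$1"))
merged = merge_unnests(plan, live_vars={"$$2"})
untouched = merge_unnests(plan, live_vars={"$$1", "$$2"})
```

Here merged is a single unnest of "a/b" over $$0 producing $$2, while
untouched keeps both operators because $$1 is still required.
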
>>
>> 2. The second strategy is at the Algebricks/Hyracks level. Every operator
>> could accept a "projection" list (a list of fields that are not needed
>> upstream). So the unnest could then not copy the input field into the
>> output when it's no longer needed. This will also help with fixing the
>> extra copying.
>>
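
And a small sketch of how I picture (2). The unnest function and its drop
parameter are hypothetical, not an existing Algebricks/Hyracks interface.
Tuples are modeled as dicts from variable name to value, so this only
shows which fields end up in each output tuple, not the byte-level copy.

```python
def unnest(tuples, seq_var, out_var, drop=()):
    """Emit one output tuple per item of seq_var, leaving out any input
    field listed in drop (fields that no later operator needs)."""
    for tup in tuples:
        kept = {k: v for k, v in tup.items() if k not in drop}
        for item in tup[seq_var]:
            yield {**kept, out_var: item}


source = [{"$$1": (1, 2, 3)}]

# Without a projection list, $$1 is carried into all three output tuples.
with_copy = list(unnest(source, "$$1", "$$2"))

# With $$1 listed as not needed, only $$2 is written.
projected = list(unnest(source, "$$1", "$$2", drop={"$$1"}))
```
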
>> In VXQuery, (1) will show a huge improvement in terms of performance.
>>
>>
>> Vinayak
>>
>>
>>
>> On 12/3/13, 12:18 PM, prestonc wrote:
>>
>>> How does the tuple information flow between operators? I want to
>>> understand better the dynamics of adding or removing fields from the
>>> tuple stream. As I understand it, the operator adds tuples to a frame
>>> until it is full, and then the frame is passed on to the next operator.
>>> Does the next operator start working on that frame as soon as it gets
>>> the frame?
>>>
>>> The frame is passed on to the next operator. In the situation where more
>>> information is added to the tuple, does the operator start a new frame
>>> and put the new tuple with the additional information in this new frame,
>>> possibly creating many frames from a single input frame? What happens to
>>> the old frame of data?
>>>
>>> Consider an UNNEST operator. The operator reads a sequence field ($$1)
>>> and creates individual items in a new field ($$2).
>>> {{$$1-->(1, 2, 3)}} becomes
>>>   {{$$1-->(1, 2, 3), $$2-->1}}
>>>   {{$$1-->(1, 2, 3), $$2-->2}}
>>>   {{$$1-->(1, 2, 3), $$2-->3}}
>>> Does this mean that $$1 is now copied into each tuple and has just
>>> tripled the amount of space it's taking up?
>>>
>>>
