I posted a question about tuple flow to the Hyracks group. Here is a copy
of the dialogue.

On Tue, Dec 3, 2013 at 11:48 PM, Vinayak Borkar wrote:
> The standard strategy used by every operator to send data to the next
> operator in the pipeline is to use one pre-allocated memory buffer that is
> reused.
>
> Say OP0 feeds data to OP1 (unnest), which feeds data to OP2. OP0 and OP1
> each create a "frame", say F0 and F1 respectively, at the beginning of
> query execution.
>
> OP0 would then pack tuples (each formatted as the sequential
> juxtaposition of its field values) into F0 until no more tuples can fit. At
> that point OP0 invokes the nextFrame() method on OP1 (through connectors, if
> applicable) to pass the data to OP1. OP1 iterates over F0 and processes
> each tuple, creating the result tuples in F1. One of two things can happen
> now: either F0 is exhausted and F1 still has room, or F1 is full and F0
> still contains tuples to be processed. In the first case, OP1 would return
> from the nextFrame() call back to OP0, which would refill F0 with the next
> set of tuples. In the second case, OP1 would invoke OP2.nextFrame(F1).
>
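
[Note from me, not part of Vinayak's reply.] To check my understanding of
the frame hand-off described above, here is a minimal Python sketch. It is
a toy model under my own assumptions, not Hyracks code: the class names
(Frame, Source, Unnest, Sink), the next_frame()/close() methods, and
counting frame capacity in tuples rather than bytes are all made up for
illustration. Each operator allocates its output frame once and reuses it.

```python
class Frame:
    """A reusable buffer. Capacity is counted in tuples here (an assumption);
    a real frame would be a fixed number of bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tuples = []

    def try_append(self, tup):
        if len(self.tuples) >= self.capacity:
            return False  # frame is full
        self.tuples.append(tup)
        return True

    def reset(self):
        self.tuples.clear()


class Sink:
    """Stands in for OP2: records every tuple and counts incoming frames."""

    def __init__(self):
        self.received = []
        self.frames_seen = 0

    def next_frame(self, frame):
        self.frames_seen += 1
        self.received.extend(frame.tuples)

    def close(self):
        pass


class Unnest:
    """Stands in for OP1: one output tuple per item of the sequence field."""

    def __init__(self, downstream, capacity):
        self.downstream = downstream
        self.out = Frame(capacity)  # F1, allocated once and reused

    def next_frame(self, frame):
        for (seq,) in frame.tuples:
            for item in seq:
                # The sequence field is carried into every output tuple.
                # (Python stores a reference; a real frame would copy bytes.)
                result = (seq, item)
                if not self.out.try_append(result):
                    self._flush()  # F1 is full: push it downstream
                    self.out.try_append(result)
        # F0 is exhausted: return to the caller, which refills F0.

    def _flush(self):
        if self.out.tuples:
            self.downstream.next_frame(self.out)
            self.out.reset()

    def close(self):
        self._flush()  # push the last, partially filled frame
        self.downstream.close()


class Source:
    """Stands in for OP0: packs input tuples into F0 and pushes full frames."""

    def __init__(self, downstream, capacity):
        self.downstream = downstream
        self.out = Frame(capacity)  # F0, allocated once and reused

    def run(self, tuples):
        for tup in tuples:
            if not self.out.try_append(tup):
                self.downstream.next_frame(self.out)
                self.out.reset()
                self.out.try_append(tup)
        if self.out.tuples:
            self.downstream.next_frame(self.out)
            self.out.reset()
        self.downstream.close()
```

Running Source(Unnest(sink, 2), 2).run([((1, 2, 3),), ((4, 5),)]) pushes
one F0 into the unnest and three F1 frames into the sink, while only two
frame buffers exist at any time.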
> In your specific example, the unnest operator would end up copying $$1
> three times, once for each output tuple. However, in terms of memory
> consumption, this does not lead to more space usage when the operators
> pipeline the frames as described above. It is, however, inefficient to make
> the copies.
>
> Two broad strategies are possible to improve the performance of the system.
>
> 1. In VXQuery, we use a sequence of unnest operators, one for each path
> expression. So a/b/c will become three unnest operators. This is not
> necessary. VXQuery could have a rewrite rule that converts
>
> unnest iterate($$1, "b") -> $$2
> unnest iterate($$0, "a") -> $$1
>
> into unnest iterate($$0, "a/b") -> $$2 when $$1 is not required.
>
> This concept can be further used to rewrite
>
> unnest iterate(...)
> data-scan(...)
>
> into data-scan with the path that is needed pushed into the source when
> the binding of the data-scan itself is not needed for anything other than
> the unnest. We could extend the parser to only produce XML trees for the
> given path steps. This will eliminate a whole bunch of copies.
>
>
> 2. The second strategy is at the Algebricks/Hyracks level. Every operator
> could accept a "projection" list (a list of the fields that are still
> needed downstream). The unnest could then skip copying the input field
> into the output when it is no longer needed. This will also help with
> fixing the extra copying.
>
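
[Note from me.] A small Python sketch of how I read strategy (2). This is
not the Algebricks API: the unnest() function and its parameters are
hypothetical, and I am assuming the projection list names the output
field positions to keep (the new item is appended as the last field).

```python
def unnest(tuples, seq_field, projection):
    """Yield one output tuple per item of the sequence in field `seq_field`.

    The item is appended to the input tuple as a new last field, then only
    the field positions listed in `projection` are written to the output.
    """
    for tup in tuples:
        for item in tup[seq_field]:
            extended = tup + (item,)
            yield tuple(extended[i] for i in projection)


rows = [((1, 2, 3),)]

# Keep both fields: $$1 is repeated in every output tuple.
with_copy = list(unnest(rows, 0, [0, 1]))

# Keep only the new field: $$1 is never written to the output.
without_copy = list(unnest(rows, 0, [1]))
```

With projection [0, 1] every output tuple carries (1, 2, 3) again; with
projection [1] the output tuples hold only the unnested item.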
> In VXQuery, (1) will show a huge improvement in terms of performance.
>
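
[Note from me.] The rewrite in (1) that merges adjacent unnest operators
can be sketched as below. This is a toy model in Python, not a VXQuery or
Algebricks rewrite rule: UnnestIterate, merge_unnests, and the
still_needed set are names I made up. The plan is listed consumer first,
in the same order as the plan in the email, and the caller is assumed to
know which variables other operators still read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnnestIterate:
    source: str  # input variable, e.g. "$$0"
    path: str    # path expression, e.g. "a"
    target: str  # output variable, e.g. "$$1"


def merge_unnests(plan, still_needed):
    """Merge consumer/producer unnest pairs whose middle variable is unused.

    `plan` lists operators consumer first. `still_needed` is the set of
    variables that some other operator reads; those are never merged away.
    """
    merged = list(plan)
    i = 0
    while i + 1 < len(merged):
        consumer, producer = merged[i], merged[i + 1]
        chained = consumer.source == producer.target
        if chained and producer.target not in still_needed:
            merged[i:i + 2] = [UnnestIterate(
                producer.source,
                producer.path + "/" + consumer.path,
                consumer.target,
            )]
            # Stay at i so the merged operator can absorb the next one too.
        else:
            i += 1
    return merged


plan = [
    UnnestIterate("$$1", "b", "$$2"),
    UnnestIterate("$$0", "a", "$$1"),
]
rewritten = merge_unnests(plan, still_needed=set())
```

Here rewritten holds a single unnest from $$0 over "a/b" to $$2. If $$1
is in still_needed, the plan comes back unchanged.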
>
> Vinayak
>
>
>
> On 12/3/13, 12:18 PM, prestonc wrote:
>
>> How does the tuple information flow between operators? I want to
>> better understand the dynamics of adding or removing fields from the
>> tuple stream. As I understand it, the operator adds tuples to a frame
>> until it is full, and the frame is then passed on to the next operator.
>> Does the next operator start working on that frame as soon as it gets
>> the frame?
>>
>> The frame is passed on to the next operator. In the situation where
>> more information is added to the tuple, does the operator start a new
>> frame and put the new tuple with the additional information in this new
>> frame, possibly creating many frames from a single input frame? What
>> happens to the old frame of data?
>>
>> Consider an UNNEST operator. The operator reads a sequence field ($$1)
>> and creates individual items in a new field ($$2).
>> {{$$1-->(1, 2, 3)}} becomes
>> {{$$1-->(1, 2, 3), $$2-->1}}
>> {{$$1-->(1, 2, 3), $$2-->2}}
>> {{$$1-->(1, 2, 3), $$2-->3}}
>> Does this mean that $$1 is now copied into each output tuple and has
>> just tripled the amount of space it's taking up?
>>
>>