What's the reason for Hyracks using a single frame for each operator? Many of the rewrite rules focus on minimizing the data we store in this single frame.
On Wed, Dec 4, 2013 at 1:34 PM, Eldon Carman <[email protected]> wrote:

> I posted a question about tuple flow to the Hyracks group. Here is a copy
> of the dialogue.
>
>
> On Tue, Dec 3, 2013 at 11:48 PM, Vinayak Borkar wrote:
>
>> The standard strategy used by every operator to send data to the next
>> operator in the pipeline is to use one pre-allocated memory buffer that is
>> reused.
>>
>> Say OP0 feeds data to OP1 (unnest) feeds data to OP2. OP0 and OP1 create
>> a "frame" each, say F0 and F1 respectively, at the beginning of query
>> execution.
>>
>> OP0 would then pack as many tuples (whose format is the sequential
>> juxtaposition of its field values) into F0 until no more tuples can fit. At
>> this time OP0 invokes the nextFrame() method on OP1 (through connectors, if
>> applicable) to pass the data to OP1. OP1 iterates over F0 and processes
>> each tuple creating the result tuples in F1. One of two things can happen
>> now; either F0 is exhausted and F1 still has room, or F1 is full and F0
>> still contains tuples to be processed. In the first case, OP1 would return
>> from the next frame call back to OP0 which would refill F0 with the next
>> set of tuples. In the second case, OP1 would invoke OP2.nextFrame(F1).
>>
>> In your specific example, the unnest operator would end up copying $$1
>> three times, once for each output tuple. However, in terms of memory
>> consumption, this does not lead to more space usage when the operators
>> pipeline the frames as described above. It is however inefficient to make
>> the copies.
>>
>> Two broad strategies are possible to improve the performance of the
>> system.
>>
>> 1. In VXQuery, we use a sequence of unnest operators, one for each path
>> expression. So a/b/c will become three unnest operators. This is not
>> necessary. VXQuery could have a rewrite rule that converts
>>
>> unnest iterate($$1, "b") -> $$2
>> unnest iterate($$0, "a") -> $$1
>>
>> into unnest iterate($$0, "a/b") -> $$2 when $$1 is not required.
>>
>> This concept can be further used to rewrite
>>
>> unnest iterate(...)
>> data-scan(...)
>>
>> into data-scan with the path that is needed pushed into the source when
>> the binding of the data-scan itself is not needed for anything other than
>> the unnest. We could extend the parser to only produce XML trees for the
>> given path steps. This will eliminate a whole bunch of copies.
>>
>>
>> 2. The second strategy is at the Algebricks/Hyracks level. Every operator
>> could accept a "projection" list (a list of fields that are not needed
>> upstream). So the unnest could then not copy the input field into the
>> output when it's not needed anymore. This will also help with fixing the
>> extra copying.
>>
>> In VXQuery, (1) will show a huge improvement in terms of performance.
>>
>>
>> Vinayak
>>
>>
>>
>> On 12/3/13, 12:18 PM, prestonc wrote:
>>
>>> How does the tuple information flow between operators? I want to
>>> understand better the dynamics of adding or removing fields from the
>>> tuple stream. As I understand it, tuples are added to a frame until
>>> it is full, and the frame is then passed on to the next operator.
>>> Does the next operator start working on that frame as soon as it gets
>>> the frame?
>>>
>>> Suppose the frame is passed on to the next operator and that operator
>>> adds more information to each tuple. Does the operator start a new frame
>>> and put the new tuple with the additional information in this new frame,
>>> possibly creating many frames from a single input frame? What happens to
>>> the old frame of data?
>>>
>>> Consider an UNNEST operator. The operator reads a sequence field ($$1)
>>> and creates individual items in a new field ($$2).
>>> {{$$1-->(1, 2, 3)}} becomes
>>> {{$$1-->(1, 2, 3), $$2-->1}}
>>> {{$$1-->(1, 2, 3), $$2-->2}}
>>> {{$$1-->(1, 2, 3), $$2-->3}}
>>> Does this mean that $$1 is now copied into each tuple and has just
>>> tripled the amount of space it's taking up?
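
For anyone following the thread, here is a rough Python sketch of the frame hand-off Vinayak describes (OP0 fills F0, calls nextFrame on the unnest, which fills F1 and pushes it on when full). It is only an illustration, not Hyracks code: the class names, `run_pipeline`, the frame capacity counted in tuples rather than bytes, and the `keep_input` flag (a stand-in for the "projection list" idea in strategy 2) are all assumptions made for this example.

```python
class Frame:
    """One pre-allocated, reusable buffer. Capacity is counted in tuples
    here for simplicity (an assumption; real frames are byte buffers)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tuples = []

    def is_full(self):
        return len(self.tuples) >= self.capacity

    def reset(self):
        self.tuples.clear()


class Sink:
    """Stands in for OP2: records what it receives."""
    def __init__(self):
        self.received = []
        self.frames_seen = 0

    def next_frame(self, frame):
        self.frames_seen += 1
        self.received.extend(frame.tuples)  # copy out before the frame is reused

    def close(self):
        pass


class Unnest:
    """Stands in for OP1: reads a sequence field and emits one tuple per item."""
    def __init__(self, downstream, capacity, keep_input=True):
        self.out = Frame(capacity)          # F1, created once and reused
        self.downstream = downstream
        self.keep_input = keep_input        # False = hypothetical projection

    def _flush(self):
        if self.out.tuples:
            self.downstream.next_frame(self.out)
            self.out.reset()

    def next_frame(self, frame):
        for (seq,) in frame.tuples:
            for item in seq:
                if self.out.is_full():      # F1 full, F0 not exhausted yet
                    self._flush()
                # With keep_input, the sequence is carried into every output tuple.
                self.out.tuples.append((seq, item) if self.keep_input else (item,))
        # F0 exhausted: return to the caller, which refills F0.

    def close(self):
        self._flush()
        self.downstream.close()


def run_pipeline(tuples, capacity, keep_input=True):
    """Stands in for OP0: packs tuples into F0 and pushes it downstream."""
    sink = Sink()
    unnest = Unnest(sink, capacity, keep_input)
    f0 = Frame(capacity)                    # F0, created once and reused
    for t in tuples:
        if f0.is_full():
            unnest.next_frame(f0)
            f0.reset()
        f0.tuples.append(t)
    if f0.tuples:
        unnest.next_frame(f0)
        f0.reset()
    unnest.close()
    return sink


if __name__ == "__main__":
    sink = run_pipeline([((1, 2, 3),)], capacity=2)
    print(sink.received)
    print(sink.frames_seen)
```

Only F0 and F1 ever exist, however many tuples flow through, which is why the repeated $$1 costs copying time rather than extra memory. Passing `keep_input=False` shows what the output would look like if the unnest were allowed to drop $$1.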
