Tim Dudgeon wrote:
See comments below.
Tim
Ken Tanaka wrote:
Tim Dudgeon wrote:
Ken Tanaka wrote:
Hi Tim,
...
At present, the structure for storing stages is a linked list, and
branches are implemented as additional pipelines accessed by a name
through a HashMap. To generally handle branching and merging, a
directed acyclic graph (DAG) would better serve, but that would
require the pipeline code to be rewritten at this level. Arguments
could also be made for allowing cycles, as in directed graphs, but
that would be harder to debug, and with a GUI might be a step
toward a visual programming language--so I don't think this should
be pursued yet unless there are volunteers...
I agree, a DAG would be better, but cycles could be needed too, so a full
directed graph would be better still.
But, yes, I am ideally wanting visual designer too.
I'd like a visual designer too at some point, but that's a ways off
into the future.
Taken together I can see a generalisation here using named ports
(input and output), which is similar, but not identical, to your
current concept of branches.
So you have:
BaseStage.emit(String branch, Object obj);
whereas I would conceptually see this as:
emit(String port, Object obj);
and you have:
Stage.process(Object obj);
whereas I would conceptually see this as:
Stage.process(String port, Object obj);
And when a pipeline is being assembled, a downstream stage is
attached to a particular port of a stage, not to the stage itself. It
then receives only the data sent to that particular port, not the
other ports.
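The port idea above could be sketched roughly like this in Java. All names here (PortStage, PortBaseStage, connect) are illustrative, not the existing commons-pipeline API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical port-aware stage contract (a sketch, not the real API).
interface PortStage {
    void process(String inputPort, Object obj);
}

abstract class PortBaseStage implements PortStage {
    // One output port -> list of (downstream stage, its input port) connections.
    private static final class Connection {
        final PortStage stage;
        final String inputPort;
        Connection(PortStage stage, String inputPort) {
            this.stage = stage;
            this.inputPort = inputPort;
        }
    }

    private final Map<String, List<Connection>> outputs = new HashMap<>();

    // Wire a downstream stage to one of this stage's output ports.
    void connect(String outputPort, PortStage downstream, String inputPort) {
        outputs.computeIfAbsent(outputPort, k -> new ArrayList<>())
               .add(new Connection(downstream, inputPort));
    }

    // Data sent to a port reaches only the stages attached to that port.
    protected void emit(String outputPort, Object obj) {
        for (Connection c : outputs.getOrDefault(outputPort, List.of())) {
            c.stage.process(c.inputPort, obj);
        }
    }
}
```

In this sketch a stage attached to one port sees only values emitted on that port; anything emitted to a port with no connections is simply dropped.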
I could see that this would work, but it would need either modifying a
number of stages already written, or maybe creating a compatibility
stage driver that takes older-style stages, so that the input object
comes from a configured port name, usually "input", and sends the
output to configured output ports named "output" and whatever the
previous branch name(s) were, if any. Stages that used to look for
events for input should be rewritten to read multiple inputs
(Stage.process(String port, Object obj), as you suggested). Events
would then be reserved for truly out-of-band signals between stages
rather than carrying data for processing.
Agreed, I think this would be good. I think existing stages could be
made compatible by having a default input and output port, and using
those if no specific port is specified.
A default in/out port would probably be necessary to allow simple
auto-wiring.
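A minimal sketch of that compatibility idea, assuming a legacy stage shape of process(Object) returning one result. LegacyStage, LegacyStageAdapter, and the BiConsumer emitter are all hypothetical names for illustration:

```java
import java.util.function.BiConsumer;

// Hypothetical port-aware contract, repeated here to keep the sketch self-contained.
interface PortStage {
    void process(String inputPort, Object obj);
}

// Assumed shape of an older-style stage: one implicit input, one result.
interface LegacyStage {
    Object process(Object obj);
}

// Exposes a legacy stage through default "input"/"output" port names,
// so it can sit unmodified in a port-based pipeline (sketch only).
class LegacyStageAdapter implements PortStage {
    static final String DEFAULT_IN = "input";
    static final String DEFAULT_OUT = "output";

    private final LegacyStage legacy;
    private final BiConsumer<String, Object> emitter; // stands in for BaseStage.emit

    LegacyStageAdapter(LegacyStage legacy, BiConsumer<String, Object> emitter) {
        this.legacy = legacy;
        this.emitter = emitter;
    }

    @Override
    public void process(String inputPort, Object obj) {
        if (!DEFAULT_IN.equals(inputPort)) {
            throw new IllegalArgumentException(
                "legacy stage only reads the default 'input' port");
        }
        // Whatever the legacy stage produces goes out on the default port.
        emitter.accept(DEFAULT_OUT, legacy.process(obj));
    }
}
```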
I'd love to hear how compatible the current system is with this
way of seeing things. Are we just talking about a new type of
Stage implementation, or a more fundamental incompatibility at the
API level?
I think you have some good ideas. This is changing the Stage
implementation, which affects on the order of 60 stages for us that
override the process method, unless the compatibility stage driver
works out. The top level pipeline would also be restructured. The
amount of work required puts this out of the near term for me to
work on it, but there may be other developers/contributors to take
this on.
I need to investigate more fully here, and consider the other options.
But potentially this is certainly of interest.
So is creating a new Stage implementation, with new emit( ... ) and
process( ... ) methods, all that's necessary to prototype this?
I'm thinking it's more involved than that. To really deal well with
an arbitrary number of downstream stages rather than just one means
changing the digester rules
<http://commons.apache.org/sandbox/pipeline/xref/org/apache/commons/pipeline/config/PipelineRuleSet.html>
for specifying what follows a stage. Normally a stage is connected to
the preceding stage if it is listed in that order in the configuration
file. This should remain the default behavior, but if stage2 and stage3
both follow stage1, then some notation for naming the previous stage
is needed.
stage1----stage2
  |
  +-------stage3
might be set up as conf_pipe.xml:
<pipeline>
  ...
  <stage className="com.demo.pipeline.stages.Stage1"
         driverFactoryId="df1" stageId="stage1"/>
  <stage className="com.demo.pipeline.stages.Stage2"
         driverFactoryId="df1"/>
  <stage className="com.demo.pipeline.stages.Stage3"
         driverFactoryId="df1" follows="stage1"/>
</pipeline>
I propose the 'follows="stage1"' attribute to connect stage3 to
stage1 instead of to the immediately preceding stage2. This seems
cleaner than setting up a branch and matching up branch key names
between the branching stage and the secondary pipeline(s). Can you
think of a cleaner way to configure this?
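A rough sketch of how a digester-style builder could resolve that 'follows' attribute, defaulting to the preceding stage in file order. StageConfig and resolveParents are made-up names for illustration, not part of PipelineRuleSet:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal view of one <stage> element: an optional stageId and an
// optional follows reference (hypothetical model, for illustration).
final class StageConfig {
    final String stageId;   // may be null if the stage is never referenced
    final String follows;   // null means "follow the previous stage in the file"

    StageConfig(String stageId, String follows) {
        this.stageId = stageId;
        this.follows = follows;
    }
}

class PipelineWiring {
    // Returns the parent index for each stage; -1 marks the source stage.
    // follows may only point backwards, which also keeps the graph acyclic.
    static int[] resolveParents(List<StageConfig> stages) {
        Map<String, Integer> byId = new HashMap<>();
        int[] parents = new int[stages.size()];
        for (int i = 0; i < stages.size(); i++) {
            StageConfig c = stages.get(i);
            if (c.stageId != null) {
                byId.put(c.stageId, i);
            }
            if (c.follows != null) {
                Integer p = byId.get(c.follows);
                if (p == null) {
                    throw new IllegalArgumentException(
                        "follows references unknown stageId: " + c.follows);
                }
                parents[i] = p;
            } else {
                parents[i] = i - 1; // default: previous stage in file order
            }
        }
        return parents;
    }
}
```

With the three-stage example above, stage2 follows stage1 by position and stage3 is redirected to stage1 by the attribute.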
I think we're in danger of looking at this the wrong way. The XML
should reflect the underlying data model, not drive it. But to stick
with this paradigm I would think it might be best to explicitly define
the connections in the model definition. Maybe something more like this:
<pipeline>
  ...
  <stage className="com.demo.pipeline.stages.Stage1"
         driverFactoryId="df1" stageId="stage1">
  </stage>
  <stage className="com.demo.pipeline.stages.Stage2"
         driverFactoryId="df1">
    <input stageId="stage1" outputPort="pass"/>
  </stage>
Just to clarify: for Stage2, when you specify '<input stageId="stage1"
outputPort="pass"/>', 'outputPort="pass"' refers to an output port of
stage1 named "pass", and is not specifying that the stage2 output
port is named "pass", right? So Stage1 has two output ports, named
"pass" and "fail", and this would be documented somewhere so you knew
what to connect to when you wrote the configuration XML?
  <stage className="com.demo.pipeline.stages.Stage3"
         driverFactoryId="df1">
    <input stageId="stage1" outputPort="pass"/>
  </stage>
  <stage className="com.demo.pipeline.stages.Stage4"
         driverFactoryId="df1">
    <input stageId="stage1" outputPort="fail" inputPort="aPort"/>
  </stage>
So here Stage4 has an input port named "aPort" and it is loaded from the
stage1 output port named "fail"?
</pipeline>
I think this would allow more flexibility, as:
1. a stage could define multiple inputs if it needed to.
If I understand you correctly, suppose there is a stage5 that has input
ports "aPort" and "bPort" that we would like to receive data from stage2
and stage3 ("pass" output port from both). Then it would be specified as
follows:
<stage className="com.demo.pipeline.stages.Stage5"
       driverFactoryId="df1">
  <input stageId="stage2" outputPort="pass" inputPort="aPort"/>
  <input stageId="stage3" outputPort="pass" inputPort="bPort"/>
</stage>
I also assume that Stage2 and Stage3 are given stageIds of "stage2" and
"stage3" respectively.
[stage1]------------>[stage2]------------>[stage5]
   |      pass->(in)          pass->aPort    ^
   |                                         |
   +---------------->[stage3]----------------+
   |      pass->(in)          pass->bPort
   |
   +---------------->[stage4]
          fail->aPort
2. each connection is explicitly defined and could have extra
attributes added in future (e.g. a disable attribute to disable
execution of that part of the pipeline).
3. The concept of input can probably be generalised to include the
"feed", allowing multiple feeds to be used (as discussed earlier in
this thread). e.g. stage1 would also have an input that would be the
feed.
Do you envision a stage with two inputs (aPort and bPort) waiting until
there are inputs on both before its stageDriver invokes the process
method? If stage5 needs two inputs, and stage2 provides 3 values while
stage3 provides 2 values, there are just 2 complete pairs of values. The
third value from stage2 could wait indefinitely for a matching input
from stage3. Currently stages run until their queue is empty, but with
multiple, possibly imbalanced inputs, it might be better to set the
quit condition to: any one queue is empty and all upstream stages claim
to be complete. Any non-empty queues on exit could trigger a warning.
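That quit condition might look roughly like this. MultiInputDriver is a hypothetical sketch, not the real stage driver:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

// Sketch of a multi-input driver loop: pull one complete tuple (one value
// per input queue) at a time, stop as soon as any queue is empty, and warn
// about stranded values once all upstream stages report completion.
class MultiInputDriver {
    private final List<Queue<Object>> inputQueues; // one queue per input port
    private boolean upstreamsDone;

    MultiInputDriver(List<Queue<Object>> inputQueues) {
        this.inputQueues = inputQueues;
    }

    // Called when every upstream stage claims to be complete.
    void upstreamsFinished() {
        upstreamsDone = true;
    }

    // Drain complete tuples; return warnings about leftovers on exit.
    List<String> run(Consumer<List<Object>> processTuple) {
        while (inputQueues.stream().noneMatch(Queue::isEmpty)) {
            List<Object> tuple = new ArrayList<>();
            for (Queue<Object> q : inputQueues) {
                tuple.add(q.poll());
            }
            processTuple.accept(tuple);
        }
        List<String> warnings = new ArrayList<>();
        if (upstreamsDone) {
            for (int i = 0; i < inputQueues.size(); i++) {
                if (!inputQueues.get(i).isEmpty()) {
                    warnings.add("input port " + i + " exits with "
                            + inputQueues.get(i).size() + " unconsumed value(s)");
                }
            }
        }
        return warnings;
    }
}
```

In the 3-values-vs-2-values example above, this loop would process two complete pairs and warn about the one value stranded on the first queue.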
The Pipeline.java class will need to be modified to build and
maintain a DAG structure rather than a linked list. The incoming data
are managed by a queue in the stage driver, which would change to a
group of queues, allowing multiple inputs (ports). I'm assuming there
is an open source directed acyclic graph library out there that can
replace the linked list.
If defined as I propose, I'm not sure a specific graph library is
necessary. The model just comprises a set of stages that know how they
are connected, i.e. the connections are already implicit in the model.
But this probably needs more thought.
Currently the linked list of stages is used for lazy initialization, to
find the next stage's feeder the first time it is used. To allow general
connections, the downstream feeder link could become an array of
subsequent stage drivers, with the connections set up as the pipeline is
built. In that case a DAG library would not be needed, and we could
keep the linked list as is.
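A sketch of that idea, with the single downstream feeder replaced by a list of feeders populated at build time. Feeder and StageDriverSketch are illustrative names, not the real commons-pipeline classes:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the feeder interface: upstream stages push data here.
interface Feeder {
    void feed(Object obj);
}

class StageDriverSketch implements Feeder {
    // Was: a single lazily-resolved downstream feeder.
    // Now: a list of feeders, wired up as the pipeline is built,
    // so fan-out falls out naturally with no graph library needed.
    private final List<Feeder> downstream = new ArrayList<>();
    private final List<Object> queue = new ArrayList<>(); // stands in for the input queue

    // Called once per connection while the pipeline is assembled.
    void addDownstream(Feeder f) {
        downstream.add(f);
    }

    @Override
    public void feed(Object obj) {
        queue.add(obj);
    }

    // After processing, a result goes to every connected successor.
    void emitToDownstream(Object result) {
        for (Feeder f : downstream) {
            f.feed(result);
        }
    }

    List<Object> queued() {
        return queue;
    }
}
```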
-Ken