Re: [PIPELINE] Questions about pipeline

Tim Dudgeon Sun, 02 Nov 2008 03:29:19 -0800

See comments below.

Tim


Ken Tanaka wrote:

Tim Dudgeon wrote:
Ken Tanaka wrote:
Hi Tim,
...
At present, the structure for storing stages is a linked list, and branches are implemented as additional pipelines accessed by a name through a HashMap. To generally handle branching and merging, a directed acyclic graph (DAG) would better serve, but that would require the pipeline code to be rewritten at this level. Arguments could also be made for allowing cycles, as in directed graphs, but that would be harder to debug, and with a GUI might be a step toward a visual programming language--so I don't think this should be pursued yet unless there are volunteers...
I agree, a DAG would be better, but cycles could be needed too, so a DG would be better still.
But, yes, I am ideally wanting visual designer too.
I'd like a visual designer too at some point, but that's a ways off into the future.
Taken together I can see a generalisation here using named ports (input and output), which is similar, but not identical, to your current concept of branches.
So you have:
BaseStage.emit(String branch, Object obj);
whereas I would conceptually see this as:
emit(String port, Object obj);
and you have:
Stage.process(Object obj);
whereas I would conceptually see this as:
Stage.process(String port, Object obj);
And when a pipeline is being assembled a downstream stage is attached to a particular port of a stage, not the stage itself. It then just receives data sent to that particular port, but not the other ports.
I could see that this would work, but would need either modifying a number of stages already written, or maybe creating a compatibility stage driver that takes older style stages so that the input object comes from a configured port name, usually "input", and sends the output to configured output ports named "output" and whatever the previous branch name(s) were, if any. Stages that used to look for events for input should be rewritten to read multiple inputs (Stage.process(String port, Object obj) as you suggested). Events would then be reserved for truly out-of-band signals between stages rather than carrying data for processing.
Agreed, I think this would be good. I think existing stages could be made compatible by having a default input and output port, and using those if no specific port was specified. A default in/out port would probably be necessary to allow simple auto-wiring.
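
The default-port idea discussed above could be sketched roughly as follows. This is only an illustration under stated assumptions: PortStage, LegacyStageAdapter and the port names "input" and "output" are hypothetical and are not part of the existing Commons Pipeline API; the Function and BiConsumer arguments merely stand in for an old-style process(Object) and for emit(String port, Object obj).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Hypothetical port-aware stage; NOT the existing Commons Pipeline Stage interface. */
interface PortStage {
    String DEFAULT_INPUT = "input";   // assumed default port names
    String DEFAULT_OUTPUT = "output";

    /** Receives an object on a named input port. */
    void process(String port, Object obj);
}

/** Wraps an old-style single-input, single-output step so it only uses the default ports. */
final class LegacyStageAdapter implements PortStage {
    private final Function<Object, Object> legacyProcess; // stands in for the old process(Object)
    private final BiConsumer<String, Object> emitter;     // stands in for emit(String port, Object obj)

    LegacyStageAdapter(Function<Object, Object> legacyProcess,
                       BiConsumer<String, Object> emitter) {
        this.legacyProcess = legacyProcess;
        this.emitter = emitter;
    }

    @Override
    public void process(String port, Object obj) {
        // An unspecified (null) port falls back to the default input port.
        String effective = (port == null) ? DEFAULT_INPUT : port;
        if (!DEFAULT_INPUT.equals(effective)) {
            throw new IllegalArgumentException(
                    "Legacy stage has no input port named " + effective);
        }
        // Whatever the legacy step produces goes to the default output port.
        emitter.accept(DEFAULT_OUTPUT, legacyProcess.apply(obj));
    }
}

final class LegacyAdapterDemo {
    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        PortStage upper = new LegacyStageAdapter(
                o -> o.toString().toUpperCase(),
                (port, obj) -> seen.add(port + ":" + obj));
        upper.process(null, "abc");     // default input port
        upper.process("input", "def");  // explicit port name
        System.out.println(seen);       // [output:ABC, output:DEF]
    }
}
```

With an adapter along these lines, a stage written against the single-argument style would not need to change, and simple auto-wiring could connect "output" of one stage to "input" of the next.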
I'd love to hear how compatible the current system is with this way of seeing things. Are we just talking about a new type of Stage implementation, or a more fundamental incompatibility at the API level?
I think you have some good ideas. This is changing the Stage implementation, which affects on the order of 60 stages for us that override the process method, unless the compatibility stage driver works out. The top level pipeline would also be restructured. The amount of work required puts this out of the near term for me to work on it, but there may be other developers/contributors to take this on.
I need to investigate more fully here, and consider the other options.
But potentially this is certainly of interest.
So is all that's necessary to prototype this to create a new Stage implementation, with new emit( ... ) and process( ... ) methods?
I'm thinking it's more involved than that. To really deal well with the arbitrary number of downstream stages rather than just one means changing the digester rules <http://commons.apache.org/sandbox/pipeline/xref/org/apache/commons/pipeline/config/PipelineRuleSet.html> on specifying what follows. Normally a stage is connected to the preceding stage if it is listed in that order in the configuration file. This should be a default behavior, but if stage2 and stage3 both follow stage1 then some notation of which is the previous stage is needed.
stage1----stage2
   |
   |-----stage3

might be set up as conf_pipe.xml:
<pipeline>
  ...
  <stage className="com.demo.pipeline.stages.Stage1" driverFactoryId="df1" stageId="stage1"/>
  <stage className="com.demo.pipeline.stages.Stage2" driverFactoryId="df1"/>
  <stage className="com.demo.pipeline.stages.Stage3" driverFactoryId="df1" follows="stage1"/>
</pipeline>
I propose the 'follows="stage1"' attribute to connect stage3 to stage1 instead of stage2 immediately preceding. This seems cleaner than setting up a branch and matching up branch key names between the branching stage and the secondary pipeline(s). Can you think of a cleaner way to configure this?
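
A minimal sketch of how the proposed 'follows' attribute might be resolved when the configuration is read, assuming the default of attaching each stage to the one listed immediately before it. StageDecl and FollowsResolver are hypothetical names, not existing digester rules, and the sketch assumes every stage ends up with an id (for example a generated one when stageId is omitted).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One <stage> element as read from the configuration; all names here are hypothetical. */
final class StageDecl {
    final String id;       // stageId attribute, or an assumed generated id if absent
    final String follows;  // follows attribute, or null if absent

    StageDecl(String id, String follows) {
        this.id = id;
        this.follows = follows;
    }
}

final class FollowsResolver {
    /** Maps each stage id to the id of the stage it is attached to (null for the first stage). */
    static Map<String, String> resolve(List<StageDecl> decls) {
        Map<String, String> upstream = new LinkedHashMap<>();
        String previous = null;
        for (StageDecl d : decls) {
            // Only stages declared earlier may be followed, which also rules out cycles.
            if (d.follows != null && !upstream.containsKey(d.follows)) {
                throw new IllegalArgumentException(
                        "Stage " + d.id + " follows unknown or later stage " + d.follows);
            }
            // Default: attach to the stage listed immediately before this one.
            upstream.put(d.id, d.follows != null ? d.follows : previous);
            previous = d.id;
        }
        return upstream;
    }

    public static void main(String[] args) {
        List<StageDecl> decls = new ArrayList<>();
        decls.add(new StageDecl("stage1", null));
        decls.add(new StageDecl("stage2", null));
        decls.add(new StageDecl("stage3", "stage1"));
        System.out.println(resolve(decls)); // {stage1=null, stage2=stage1, stage3=stage1}
    }
}
```

Requiring 'follows' to name an earlier stage keeps the result a DAG; allowing forward references would be one way to admit cycles, at the cost of the extra checking mentioned earlier in the thread.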

I think we're in danger of looking at this the wrong way. The XML should reflect the underlying data model, not drive it. But to stick with this paradigm I would think it might be best to explicitly define the connections in the model definition. Maybe something more like this:


<pipeline>
  ...
  <stage className="com.demo.pipeline.stages.Stage1"       
        driverFactoryId="df1" stageId="stage1">
  </stage>
  <stage className="com.demo.pipeline.stages.Stage2"
        driverFactoryId="df1">
        <input stageId="stage1" outputPort="pass"/>
  </stage>
  <stage className="com.demo.pipeline.stages.Stage3"
        driverFactoryId="df1">
        <input stageId="stage1" outputPort="pass"/>
  </stage>
  <stage className="com.demo.pipeline.stages.Stage4"
        driverFactoryId="df1">
        <input stageId="stage1" outputPort="fail" inputPort="aPort"/>
  </stage>
</pipeline>

I think this would allow more flexibility, as:
1. a stage could define multiple inputs if it needed to.
2. each connection is explicitly defined and could have extra attributes added in future (e.g. a disable attribute to disable execution of that part of the pipeline).
3. The concept of input can probably be generalised to include the "feed", allowing multiple feeds to be used (as discussed earlier in this thread). e.g. stage1 would also have an input that would be the feed.

The Pipeline.java class will need to be modified to build and maintain a DAG structure rather than a linked list. The incoming data are managed by a queue in the stage driver, which would change to a group of queues, allowing multiple inputs (ports). I'm assuming there is an open source directed acyclic graph library out there that can replace the linked list.

If defined as I propose I'm not sure a specific graph library is necessary. The model just comprises a set of stages that know how they are connected, i.e. the connections are already implicit in the model.
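
As a rough illustration of stages that know how they are connected, here is a minimal sketch without any graph library. StageNode and its methods are hypothetical, not existing Commons Pipeline classes; the demo wiring mirrors the pass/fail example above and assumes "input" as the input port name when none is given.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stage node that holds its own outgoing connections, keyed by output port. */
final class StageNode {
    /** A connection target: a downstream node plus the input port it receives on. */
    private static final class Target {
        final StageNode node;
        final String inputPort;

        Target(StageNode node, String inputPort) {
            this.node = node;
            this.inputPort = inputPort;
        }
    }

    final String id;
    final List<String> received = new ArrayList<>(); // records "port:object" for the demo
    private final Map<String, List<Target>> outputs = new HashMap<>();

    StageNode(String id) {
        this.id = id;
    }

    /** Corresponds to an <input stageId=... outputPort=... inputPort=.../> on the downstream stage. */
    void connect(String outputPort, StageNode downstream, String inputPort) {
        outputs.computeIfAbsent(outputPort, k -> new ArrayList<>())
               .add(new Target(downstream, inputPort));
    }

    /** Sends obj to every stage attached to the given output port, and to no others. */
    void emit(String outputPort, Object obj) {
        List<Target> targets = outputs.getOrDefault(outputPort, Collections.<Target>emptyList());
        for (Target t : targets) {
            t.node.process(t.inputPort, obj);
        }
    }

    /** A real stage would do work here; the sketch only records what arrived. */
    void process(String inputPort, Object obj) {
        received.add(inputPort + ":" + obj);
    }
}

final class StageNodeDemo {
    public static void main(String[] args) {
        StageNode s1 = new StageNode("stage1");
        StageNode s2 = new StageNode("stage2");
        StageNode s3 = new StageNode("stage3");
        StageNode s4 = new StageNode("stage4");
        s1.connect("pass", s2, "input");
        s1.connect("pass", s3, "input");
        s1.connect("fail", s4, "aPort");

        s1.emit("pass", "good");
        s1.emit("fail", "bad");
        System.out.println(s2.received); // [input:good]
        System.out.println(s3.received); // [input:good]
        System.out.println(s4.received); // [aPort:bad]
    }
}
```

The connections live in each node's own map, so traversal needs no separate graph structure. A real driver would presumably place a queue per input port between emit and process rather than calling downstream stages directly.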

But this probably needs more thought.


-Ken


