Re: [PIPELINE] Questions about pipeline

Ken Tanaka Mon, 27 Oct 2008 12:49:19 -0700

Hi Tim,

Tim Dudgeon wrote:

Hi Ken,
Thanks for the rapid response.
First, let me explain some background here.
I am looking for Java-based pipelining solutions to incorporate into an existing application. The use of pipelining is well established in the sector, with applications like Pipeline Pilot and Knime, and so many of the common needs have been well established over several years by these applications.

Have you also looked at Pentaho?

Key issues that my initial investigations of Jakarta Pipeline seem to identify are:
1. Branching is very common. This typically takes 2 forms:
1.1. Splitting data. A stage could (for instance) have 2 output ports, "pass" and "fail". Data is processed by the stage and sent to whichever port is appropriate. Different stages would be attached to each port, resulting in the pipeline being branched by this pass/fail decision.
1.2. Attaching multiple stages to a particular output port.
The stage just sends its output onwards. It has no interest in what happens once the data is sent, and is not concerned whether zero, one or 100 stages receive the output. This is the stage1,2,3,4 scenario I outlined previously.
2. Merging is also common (though less common than branching).
By analogy with branching, I would see this conceptually as a stage having multiple input ports (A and B in the merging example).

At present, the structure for storing stages is a linked list, and branches are implemented as additional pipelines accessed by name through a HashMap. To handle branching and merging in general, a directed acyclic graph (DAG) would serve better, but that would require the pipeline code to be rewritten at this level. Arguments could also be made for allowing cycles, as in general directed graphs, but that would be harder to debug, and with a GUI might be a step toward a visual programming language, so I don't think this should be pursued yet unless there are volunteers...
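
To make the DAG idea a bit more concrete, here is a minimal sketch of how stage connections might be stored as a graph instead of a linked list. Everything in it is hypothetical: the class StageGraph, its methods, and the use of plain stage names as nodes are illustrative assumptions, not part of the existing pipeline code. It only shows one way to reject an edge that would introduce a cycle.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical DAG of stage names; not part of the existing pipeline API. */
class StageGraph {
    // Adjacency list: stage name -> names of the stages that consume its output.
    private final Map<String, List<String>> downstream = new HashMap<>();

    void addStage(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
    }

    /** Adds an edge from -> to, rejecting any edge that would close a cycle. */
    void connect(String from, String to) {
        addStage(from);
        addStage(to);
        if (from.equals(to) || reaches(to, from)) {
            throw new IllegalArgumentException(
                "edge " + from + " -> " + to + " would create a cycle");
        }
        downstream.get(from).add(to);
    }

    /** Depth-first search: is target reachable from start? */
    private boolean reaches(String start, String target) {
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String current = stack.pop();
            if (current.equals(target)) {
                return true;
            }
            if (seen.add(current)) {
                for (String next : downstream.get(current)) {
                    stack.push(next);
                }
            }
        }
        return false;
    }

    /** Returns a copy of the stages attached directly after the given stage. */
    List<String> consumersOf(String stage) {
        return new ArrayList<>(downstream.getOrDefault(stage, new ArrayList<>()));
    }
}
```

With this shape, a branch is a node with more than one outgoing edge and a merge is a node with more than one incoming edge. Allowing cycles would amount to dropping the check in connect.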

Taken together I can see a generalisation here using named ports (input and output), which is similar, but not identical, to your current concept of branches.
So you have:
BaseStage.emit(String branch, Object obj);
whereas I would conceptually see this as:
emit(String port, Object obj);
and you have:
Stage.process(Object obj);
whereas I would conceptually see this as:
Stage.process(String port, Object obj);
And when a pipeline is being assembled a downstream stage is attached to a particular port of a stage, not the stage itself. It then just receives data sent to that particular port, but not the other ports.
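
The named-port proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: PortStage, PortBaseStage, PassFailStage and CollectingStage are hypothetical names, and delivery here is a direct synchronous call, whereas a real pipeline would presumably hand objects to stage drivers and queues.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical port-aware stage; the existing Stage has process(Object obj). */
interface PortStage {
    void process(String port, Object obj);
}

/** Hypothetical base class that routes emitted objects by output port name. */
abstract class PortBaseStage implements PortStage {
    /** One downstream attachment: a target stage and the input port it listens on. */
    private static final class Connection {
        final PortStage target;
        final String inputPort;

        Connection(PortStage target, String inputPort) {
            this.target = target;
            this.inputPort = inputPort;
        }
    }

    // Output port name -> every downstream connection attached to that port.
    private final Map<String, List<Connection>> connections = new HashMap<>();

    /** Attaches a downstream stage to one output port of this stage. */
    void connect(String outputPort, PortStage target, String inputPort) {
        connections.computeIfAbsent(outputPort, k -> new ArrayList<>())
                   .add(new Connection(target, inputPort));
    }

    /** Sends obj to zero, one or many stages attached to the named port. */
    protected void emit(String port, Object obj) {
        for (Connection c : connections.getOrDefault(port, Collections.emptyList())) {
            c.target.process(c.inputPort, obj);
        }
    }
}

/** Example splitter: non-negative integers go to "pass", everything else to "fail". */
class PassFailStage extends PortBaseStage {
    @Override
    public void process(String port, Object obj) {
        boolean ok = obj instanceof Integer && (Integer) obj >= 0;
        emit(ok ? "pass" : "fail", obj);
    }
}

/** Example sink that records which input port each object arrived on. */
class CollectingStage extends PortBaseStage {
    final List<String> received = new ArrayList<>();

    @Override
    public void process(String port, Object obj) {
        received.add(port + ":" + obj);
    }
}
```

In this sketch the emitting stage never learns how many stages are attached to a port, which matches case 1.2, and a merging stage would simply be connected to under two different input port names.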

I could see that this would work, but it would need either modifying a number of stages already written, or maybe creating a compatibility stage driver that takes older style stages, so that the input object comes from a configured port name, usually "input", and sends the output to configured output ports named "output" and whatever the previous branch name(s) were, if any. Stages that used to look for events for input should be rewritten to read multiple inputs (Stage.process(String port, Object obj) as you suggested). Events would then be reserved for truly out-of-band signals between stages rather than carrying data for processing.
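
A compatibility wrapper along these lines might look something like the sketch below. All of the types are stand-ins invented for illustration (LegacyStage, LegacyEmitter, PortSink, CompatibilityAdapter); in particular, the old-style stage here receives its emitter as a parameter, which is an assumption made to keep the example self-contained rather than how existing stages are written. The choice to reject data arriving on an unexpected port is likewise just one possible policy.

```java
/** Stand-in for the output calls an old-style stage makes. */
interface LegacyEmitter {
    void emit(Object obj);                 // main output
    void emit(String branch, Object obj);  // named branch output
}

/** Stand-in for an old-style stage with a single, unnamed input. */
interface LegacyStage {
    void process(Object obj, LegacyEmitter out);
}

/** Stand-in for whatever receives objects on named ports downstream. */
interface PortSink {
    void accept(String port, Object obj);
}

/** Hypothetical adapter that exposes a legacy stage through named ports. */
class CompatibilityAdapter {
    private final LegacyStage legacy;
    private final String inputPort;   // configured input port, usually "input"
    private final String outputPort;  // configured main output port, usually "output"
    private final PortSink sink;

    CompatibilityAdapter(LegacyStage legacy, String inputPort,
                         String outputPort, PortSink sink) {
        this.legacy = legacy;
        this.inputPort = inputPort;
        this.outputPort = outputPort;
        this.sink = sink;
    }

    /** Port-style entry point: forwards data from the configured input port only. */
    void process(String port, Object obj) {
        if (!inputPort.equals(port)) {
            throw new IllegalArgumentException("unexpected input port: " + port);
        }
        legacy.process(obj, new LegacyEmitter() {
            @Override
            public void emit(Object o) {
                // The main output goes to the configured output port.
                sink.accept(outputPort, o);
            }

            @Override
            public void emit(String branch, Object o) {
                // Each old branch name is reused as an output port name.
                sink.accept(branch, o);
            }
        });
    }
}
```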

I'd love to hear how compatible the current system is with this way of seeing things. Are we just talking about a new type of Stage implementation, or a more fundamental incompatibility at the API level?

I think you have some good ideas. This is changing the Stage implementation, which affects on the order of 60 stages for us that override the process method, unless the compatibility stage driver works out. The top level pipeline would also be restructured. The amount of work required puts this out of the near term for me to work on, but there may be other developers/contributors to take this on.


-Ken

