Hi Tim,

Tim Dudgeon wrote:
Hi Ken,

Thanks for the rapid response.
First, let me explain some background here.
I am looking for Java-based pipelining solutions to incorporate into an existing application. Pipelining is well established in the sector, with applications like Pipeline Pilot and Knime, so many of the common needs have been worked out over several years by these applications.
Have you also looked at Pentaho?

Key issues that my initial investigations of Jakarta Pipeline seem to identify are:

1. Branching is very common. This typically takes 2 forms:
1.1. Splitting data. A stage could (for instance) have 2 output ports, "pass" and "fail". Data is processed by the stage and sent to whichever port is appropriate. Different stages would be attached to each port, resulting in the pipeline being branched by this pass/fail decision.
1.2. Attaching multiple stages to a particular output port.
The stage just sends its output onwards. It has no interest in what happens once the data is sent, and is not concerned whether zero, one or 100 stages receive the output. This is the stage1,2,3,4 scenario I outlined previously.

2. Merging is also common (though less common than branching).
By analogy with branching, I would see this conceptually as a stage having multiple input ports (A and B in the merging example).

At present, the structure for storing stages is a linked list, and branches are implemented as additional pipelines accessed by a name through a HashMap. To generally handle branching and merging, a directed acyclic graph (DAG) would better serve, but that would require the pipeline code to be rewritten at this level. Arguments could also be made for allowing cycles, as in directed graphs, but that would be harder to debug, and with a GUI might be a step toward a visual programming language--so I don't think this should be pursued yet unless there are volunteers...
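To make the DAG idea concrete, here is a minimal sketch in Java. It is not the existing pipeline code: StageGraph, connect and topologicalOrder are hypothetical names, and stages are represented only by their names. It keeps the graph acyclic by refusing any connection that would close a loop, and it yields a topological order in which every stage appears after all of its upstream stages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stage graph: nodes are stage names, edges are connections. */
class StageGraph {
    // stage name -> names of the stages directly downstream of it
    private final Map<String, Set<String>> downstream = new LinkedHashMap<>();

    void addStage(String name) {
        downstream.putIfAbsent(name, new LinkedHashSet<>());
    }

    /** Connects from -> to, rejecting any edge that would create a cycle. */
    void connect(String from, String to) {
        addStage(from);
        addStage(to);
        if (from.equals(to) || reaches(to, from)) {
            throw new IllegalArgumentException("cycle: " + from + " -> " + to);
        }
        downstream.get(from).add(to);
    }

    /** Depth-first search: is goal reachable from start? */
    private boolean reaches(String start, String goal) {
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String current = stack.pop();
            if (current.equals(goal)) {
                return true;
            }
            if (seen.add(current)) {
                for (String next : downstream.get(current)) {
                    stack.push(next);
                }
            }
        }
        return false;
    }

    /** Kahn's algorithm: each stage is listed after all of its upstream stages. */
    List<String> topologicalOrder() {
        Map<String, Integer> indegree = new LinkedHashMap<>();
        for (String stage : downstream.keySet()) {
            indegree.put(stage, 0);
        }
        for (Set<String> targets : downstream.values()) {
            for (String target : targets) {
                indegree.merge(target, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : indegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String stage = ready.remove();
            order.add(stage);
            for (String target : downstream.get(stage)) {
                if (indegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        return order;
    }
}
```

A branch is a stage with more than one downstream entry, and a merge is a stage that appears downstream of more than one stage, so both cases fall out of the same structure. Allowing cycles would amount to dropping the check in connect.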


Taken together, I can see a generalisation here using named ports (input and output), which is similar, but not identical, to your current concept of branches.

So you have:
BaseStage.emit(String branch, Object obj);
whereas I would conceptually see this as:
emit(String port, Object obj);
and you have:
Stage.process(Object obj);
whereas I would conceptually see this as:
Stage.process(String port, Object obj);

And when a pipeline is being assembled, a downstream stage is attached to a particular port of a stage, not to the stage itself. It then receives only the data sent to that particular port, and nothing from the other ports.
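The named-port wiring described above could look roughly like the following Java sketch. Everything in it is hypothetical (PortStage, PortBaseStage, connect, and the two example stages); it only borrows the emit(String port, Object obj) and process(String port, Object obj) shapes from the discussion and is not the current pipeline API. It delivers objects synchronously for simplicity, which a real stage driver may not do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical port-aware stage contract. */
interface PortStage {
    void process(String port, Object obj);
}

/** Hypothetical base class that routes emitted objects by output port name. */
abstract class PortBaseStage implements PortStage {
    /** One connection: a downstream stage plus the input port it listens on. */
    private static final class Link {
        final PortStage target;
        final String inputPort;

        Link(PortStage target, String inputPort) {
            this.target = target;
            this.inputPort = inputPort;
        }
    }

    // output port name -> every connection attached to that port
    private final Map<String, List<Link>> links = new HashMap<>();

    /** Attaches a downstream stage to one output port of this stage. */
    void connect(String outputPort, PortStage target, String inputPort) {
        links.computeIfAbsent(outputPort, k -> new ArrayList<>())
             .add(new Link(target, inputPort));
    }

    /** Sends obj to zero, one or many stages attached to the given port. */
    protected void emit(String port, Object obj) {
        for (Link link : links.getOrDefault(port, Collections.<Link>emptyList())) {
            link.target.process(link.inputPort, obj);
        }
    }
}

/** Example splitter: non-negative integers go to "pass", the rest to "fail". */
class PassFailStage extends PortBaseStage {
    @Override
    public void process(String port, Object obj) {
        boolean ok = obj instanceof Integer && (Integer) obj >= 0;
        emit(ok ? "pass" : "fail", obj);
    }
}

/** Example sink that records which input port each object arrived on. */
class CollectingStage extends PortBaseStage {
    final List<String> received = new ArrayList<>();

    @Override
    public void process(String port, Object obj) {
        received.add(port + ":" + obj);
    }
}
```

Attaching two stages to "pass" and one to "fail" covers both branching forms: the splitter decides which port to use, and the port fans out to however many stages happen to be connected. A merging stage would simply be the target of several connect calls with different input port names.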
I can see that this would work, but it would require either modifying a number of stages already written, or creating a compatibility stage driver that wraps older-style stages so that the input object comes from a configured port name, usually "input", and the output is sent to configured output ports named "output" plus whatever the previous branch name(s) were, if any. Stages that used to look for events for input should be rewritten to read multiple inputs (Stage.process(String port, Object obj), as you suggested). Events would then be reserved for truly out-of-band signals between stages rather than carrying data for processing.
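One way the compatibility idea might be sketched in Java is as an adapter around an older-style stage. This is only an illustration under assumptions: LegacyStage is a simplified stand-in for an existing stage (it is not the real BaseStage), and PortStage, LegacyStageAdapter and UpperCaseStage are hypothetical names. The adapter feeds the legacy stage from one configured input port, maps its default output to a configured output port, and keeps branch names as port names.

```java
import java.util.function.BiConsumer;

/** Hypothetical port-aware stage contract (not the current API). */
interface PortStage {
    void process(String port, Object obj);
}

/** Simplified stand-in for an older stage: one input, default output plus named branches. */
abstract class LegacyStage {
    // Receives (branchName, object); branchName is null for the default output.
    private BiConsumer<String, Object> sink = (branch, obj) -> { };

    void setSink(BiConsumer<String, Object> sink) {
        this.sink = sink;
    }

    abstract void process(Object obj);

    protected void emit(Object obj) {
        sink.accept(null, obj);
    }

    protected void emit(String branch, Object obj) {
        sink.accept(branch, obj);
    }
}

/** Hypothetical adapter that presents a LegacyStage as a PortStage. */
class LegacyStageAdapter implements PortStage {
    private final LegacyStage legacy;
    private final String inputPort;

    LegacyStageAdapter(LegacyStage legacy, String inputPort, String outputPort,
                       BiConsumer<String, Object> downstream) {
        this.legacy = legacy;
        this.inputPort = inputPort;
        // Default output goes to outputPort; branch names are reused as port names.
        legacy.setSink((branch, obj) ->
                downstream.accept(branch == null ? outputPort : branch, obj));
    }

    @Override
    public void process(String port, Object obj) {
        // The legacy stage has a single input, so other ports are ignored.
        if (inputPort.equals(port)) {
            legacy.process(obj);
        }
    }
}

/** Example legacy stage: upper-cases strings, sends anything else to "rejects". */
class UpperCaseStage extends LegacyStage {
    @Override
    void process(Object obj) {
        if (obj instanceof String) {
            emit(((String) obj).toUpperCase());
        } else {
            emit("rejects", obj);
        }
    }
}
```

With something along these lines, stages that override process(Object obj) would not need to change; only stages that want more than one input would have to be rewritten against the port-based signature.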

I'd love to hear how compatible the current system is with this way of seeing things. Are we just talking about a new type of Stage implementation, or a more fundamental incompatibility at the API level?

I think you have some good ideas. This changes the Stage implementation, which for us affects on the order of 60 stages that override the process method, unless the compatibility stage driver works out. The top-level pipeline would also have to be restructured. The amount of work required puts this beyond what I can take on in the near term, but there may be other developers/contributors who could.

-Ken

