Hi Tim,
Tim Dudgeon wrote:
Hi Ken,
Thanks for the rapid response.
First, let me explain some background here.
I am looking for Java based pipelining solutions to incorporate into
an existing application. The use of pipelining is well established in
the sector, with applications like Pipeline Pilot and Knime, and so
many of the common needs have been well established over several years
by these applications.
Have you also looked at Pentaho?
Key issues that my initial investigations of Jakarta Pipeline seem to
identify are:
1. Branching is very common. This typically takes 2 forms:
1.1. Splitting data. A stage could (for instance) have 2 output ports,
"pass" and "fail". Data is processed by the stage and sent to
whichever port is appropriate. Different stages would be attached to
each port, resulting in the pipeline being branched by this pass/fail
decision.
1.2. Attaching multiple stages to a particular output port.
The stage just sends its output onwards. It has no interest in what
happens once the data is sent, and is not concerned whether zero, one
or 100 stages receive the output. This is the stage1,2,3,4 scenario I
outlined previously.
2. Merging is also common (though less common than branching).
By analogy with branching, I would see this conceptually as a stage
having multiple input ports (A and B in the merging example).
At present, the structure for storing stages is a linked list, and
branches are implemented as additional pipelines accessed by a name
through a HashMap. To generally handle branching and merging, a directed
acyclic graph (DAG) would better serve, but that would require the
pipeline code to be rewritten at this level. Arguments could also be
made for allowing cycles, as in directed graphs, but that would be
harder to debug, and with a GUI might be a step toward a visual
programming language--so I don't think this should be pursued yet unless
there are volunteers...
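To make the DAG idea concrete, here is a rough sketch of how stage connections might be stored as a graph that refuses cycles. It is not existing pipeline code: the class name StageGraph, its methods, and the use of plain stage names as nodes are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical replacement for the linked list of stages:
// nodes are stage names, edges are "output feeds input" connections.
final class StageGraph {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();

    void addStage(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
    }

    // Adds the edge from -> to, refusing any edge that would close a cycle.
    void connect(String from, String to) {
        addStage(from);
        addStage(to);
        if (reaches(to, from)) {
            throw new IllegalArgumentException(
                "cycle: " + to + " already reaches " + from);
        }
        downstream.get(from).add(to);
    }

    // Depth-first search: is 'target' reachable from 'start'?
    private boolean reaches(String start, String target) {
        Deque<String> stack = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            String current = stack.pop();
            if (current.equals(target)) {
                return true;
            }
            if (seen.add(current)) {
                for (String next : downstream.get(current)) {
                    stack.push(next);
                }
            }
        }
        return false;
    }

    // Kahn's algorithm: an order in which every stage appears
    // after all of the stages that feed it.
    List<String> topologicalOrder() {
        Map<String, Integer> indegree = new LinkedHashMap<>();
        for (String stage : downstream.keySet()) {
            indegree.put(stage, 0);
        }
        for (List<String> targets : downstream.values()) {
            for (String target : targets) {
                indegree.merge(target, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : indegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String stage = ready.remove();
            order.add(stage);
            for (String target : downstream.get(stage)) {
                if (indegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        return order;
    }
}
```

Branching is a node with several outgoing edges and merging is a node with several incoming ones, so both fall out of the same structure. Checking for cycles at connect time keeps the graph acyclic by construction, which is the simpler-to-debug option discussed above.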
Taken together I can see a generalisation here using named ports
(input and output), which is similar, but not identical, to your
current concept of branches.
So you have:
BaseStage.emit(String branch, Object obj);
whereas I would conceptually see this as:
emit(String port, Object obj);
and you have:
Stage.process(Object obj);
whereas I would conceptually see this as:
Stage.process(String port, Object obj);
And when a pipeline is being assembled a downstream stage is attached
to a particular port of a stage, not the stage itself. It then just
receives data sent to that particular port, but not the other ports.
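A rough sketch of the named-port idea follows. None of these types exist in the current code: PortStage, Emitter, and PortPipeline are hypothetical names, and delivery here is a simple synchronous call rather than anything driven by a stage driver or queue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: a stage receives each object together with the input port name.
interface PortStage {
    void process(String port, Object obj, Emitter out);
}

// Hypothetical: what a stage calls to send an object to one of its output ports.
interface Emitter {
    void emit(String port, Object obj);
}

// Hypothetical wiring: downstream stages attach to a (stage, output port) pair,
// not to the stage itself.
final class PortPipeline {
    private static final class Link {
        final PortStage target;
        final String targetPort;

        Link(PortStage target, String targetPort) {
            this.target = target;
            this.targetPort = targetPort;
        }
    }

    // Keyed by stage identity, then by output port name.
    private final Map<PortStage, Map<String, List<Link>>> links = new IdentityHashMap<>();

    void connect(PortStage from, String outPort, PortStage to, String inPort) {
        links.computeIfAbsent(from, k -> new HashMap<>())
             .computeIfAbsent(outPort, k -> new ArrayList<>())
             .add(new Link(to, inPort));
    }

    // Hands obj to the stage; whatever it emits goes to every stage attached
    // to that output port. A port with nothing attached silently drops data.
    void feed(PortStage stage, String port, Object obj) {
        stage.process(port, obj, (outPort, emitted) -> {
            Map<String, List<Link>> byPort =
                links.getOrDefault(stage, Collections.emptyMap());
            for (Link link : byPort.getOrDefault(outPort, Collections.emptyList())) {
                feed(link.target, link.targetPort, emitted);
            }
        });
    }
}

final class PortDemo {
    public static void main(String[] args) {
        PortPipeline pipeline = new PortPipeline();
        PortStage filter = (port, obj, out) ->
            out.emit(((Integer) obj) >= 0 ? "pass" : "fail", obj);
        PortStage logPass = (port, obj, out) -> System.out.println("pass: " + obj);
        PortStage logFail = (port, obj, out) -> System.out.println("fail: " + obj);
        pipeline.connect(filter, "pass", logPass, "input");
        pipeline.connect(filter, "fail", logFail, "input");
        pipeline.feed(filter, "input", 5);   // prints "pass: 5"
        pipeline.feed(filter, "input", -2);  // prints "fail: -2"
    }
}
```

The emitting stage never sees how many stages are attached to a port, so zero, one, or many listeners all look the same to it, and a merging stage is just one with links arriving on two different input port names.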
I could see that this would work, but would need either modifying a
number of stages already written, or maybe creating a compatibility
stage driver that takes older style stages so that the input object
comes from a configured port name, usually "input", and sends the
output to configured output ports named "output" and whatever the
previous branch name(s) were, if any. Stages that used to look for
events for input should be rewritten to read multiple inputs (
Stage.process(String port, Object obj) as you suggested). Events would
then be reserved for truly out-of-band signals between stages rather
than carrying data for processing.
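As a rough illustration of the compatibility idea, the sketch below wraps an older-style stage so it can sit in a port-based pipeline. Every type here is a stand-in: LegacyStage and LegacyOutput only mimic the shape of Stage.process(Object obj) and BaseStage.emit(String branch, Object obj), with the output passed in as a parameter for simplicity, and PortedStage, PortOutput, and LegacyStageAdapter are hypothetical names.

```java
// Stand-in for an older-style stage: one unnamed input, a main output,
// and optional named branches. Not the real Stage/BaseStage API.
interface LegacyStage {
    void process(Object obj, LegacyOutput out);
}

interface LegacyOutput {
    void emit(Object obj);                 // main output
    void emit(String branch, Object obj);  // named branch
}

// Hypothetical port-based view of a stage.
interface PortedStage {
    void process(String port, Object obj, PortOutput out);
}

interface PortOutput {
    void emit(String port, Object obj);
}

// Adapter: input arrives on one configured port, the main output goes to
// one configured port, and each old branch name becomes a port of that name.
final class LegacyStageAdapter implements PortedStage {
    private final LegacyStage legacy;
    private final String inputPort;
    private final String outputPort;

    LegacyStageAdapter(LegacyStage legacy, String inputPort, String outputPort) {
        this.legacy = legacy;
        this.inputPort = inputPort;
        this.outputPort = outputPort;
    }

    // Defaults to the conventional port names "input" and "output".
    LegacyStageAdapter(LegacyStage legacy) {
        this(legacy, "input", "output");
    }

    @Override
    public void process(String port, Object obj, PortOutput out) {
        if (!inputPort.equals(port)) {
            throw new IllegalArgumentException("unexpected input port: " + port);
        }
        legacy.process(obj, new LegacyOutput() {
            @Override
            public void emit(Object emitted) {
                out.emit(outputPort, emitted);
            }

            @Override
            public void emit(String branch, Object emitted) {
                out.emit(branch, emitted);
            }
        });
    }
}
```

With something along these lines, existing stages would not need to change their process logic; only stages that currently read input from events would need to be rewritten against the multi-port interface.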
I'd love to hear how compatible the current system is with this way of
seeing things. Are we just talking about a new type of Stage
implementation, or a more fundamental incompatibility at the API level?
I think you have some good ideas. This is changing the Stage
implementation, which affects on the order of 60 stages for us that
override the process method, unless the compatibility stage driver works
out. The top level pipeline would also be restructured. The amount of
work required puts this out of the near term for me to work on it, but
there may be other developers/contributors to take this on.
-Ken