Re: [PIPELINE] Questions about pipeline

Ken Tanaka Fri, 24 Oct 2008 11:38:40 -0700


Tim Dudgeon wrote:

Ken Tanaka wrote:
The Pipeline Basics tutorial has now been incorporated into the project page. Thanks to some help and cleanup from Rahul Akolkar, the documentation submitted was installed quickly. See
http://commons.apache.org/sandbox/pipeline/pipeline_basics.html

-Ken
That documentation is really useful. Thanks!

Wow, someone is actually looking at this. I'll work on cleaning up the documentation some. I hope people realize that some of the color-coded examples got some inadvertent newlines added--but this isn't relevant to your questions.

Could I follow up one of the earlier questions in this thread on branching and merging?
From those docs it looks to me like the way data is sent to a branch is a bit strange. There appears to be a FileReaderStage class that has a Java bean property called htmlPipelineKey:
<stage className="com.demo.pipeline.stages.FileReaderStage"
driverFactoryId="df1" htmlPipelineKey="sales2html"/>
and later in the pipeline a branch is defined that names the pipeline according to that name:
<pipeline key="sales2html">
This seems pretty inflexible to me. Any branches have to be hardcoded into the stage definition. I was expecting a situation where multiple stages could be the recipients of the output of any stage, and these can be "wired up" dynamically, e.g. something like this:
         |--stage2
         |
stage1---+--stage3
         |
         |--stage4
so that all you needed to do was to define a stage5 as one more downstream stage for stage1 and it would transparently receive the data.
Is this possible, or does the branching have to be hard-coded into the stage definition?

I wouldn't call the way branches are specified "hard coding", since the XML file here is a configuration file. For our current use, branches are pretty rare, so the pipeline framework deals best with simple cases that are fairly linear. Also, if stage1 is a branching stage, then that stage was written with branching in mind, and the "htmlPipelineKey" is a hard-coded property name in the stage source code, so it can direct output when it passes data out to the framework. To simplify matters, all your branching stages could follow a convention of using "branchKey" (or some other generic name), then you wouldn't have to remember what variable holds the branch name for which stage.
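
As a rough sketch of that naming convention only: the classes below are made up for illustration and do not use the Commons Pipeline API. BranchRecorder is a hypothetical stand-in for whatever the framework does with data sent to a named branch, and ConventionalBranchingStage is a hypothetical stage whose only branch-related bean property is the generic "branchKey".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the framework: remembers which objects
// were emitted onto which named branch.
class BranchRecorder {
    final Map<String, List<Object>> branches = new HashMap<>();

    void emit(String branchKey, Object obj) {
        branches.computeIfAbsent(branchKey, k -> new ArrayList<>()).add(obj);
    }
}

// Hypothetical branching stage that follows the "branchKey" convention.
// The property name is fixed in the source; its value comes from config.
class ConventionalBranchingStage {
    private String branchKey;
    private final BranchRecorder out;

    ConventionalBranchingStage(BranchRecorder out) {
        this.out = out;
    }

    // Standard bean accessors, so a config reader could set the value.
    public String getBranchKey() {
        return branchKey;
    }

    public void setBranchKey(String branchKey) {
        this.branchKey = branchKey;
    }

    // Send each processed object down the configured branch.
    void process(Object obj) {
        out.emit(branchKey, obj);
    }
}
```

With this shape, every branching stage would be configured the same way (a branchKey value naming the target pipeline), whatever the stage itself does.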

A stage could be written to take an arbitrary number of branch names, and thus send output down multiple branches, although it can get complicated configuring rules on what goes where if the same thing isn't going to all the branches. So rather than making stage1 a branching stage, it could be followed by "stageMulti", which would send copies of its input to a number of outputs:


                 |-----stage2
                 |
stage1----stageMulti----stage3
                 |
                 |-----stage4

stageMulti could then be used to add branching to any stage it follows.
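
To make the stageMulti idea concrete, here is a minimal sketch under some assumptions: SimpleStage is a hypothetical one-method contract, not the real Commons Pipeline Stage interface, and the downstream stages are wired up directly in code rather than through branch names in a config file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical minimal stage contract; not the Commons Pipeline interface.
interface SimpleStage {
    void process(Object obj);
}

// Stand-in for stage2, stage3, stage4: just records what it receives.
class CollectingStage implements SimpleStage {
    final List<Object> received = new ArrayList<>();

    public void process(Object obj) {
        received.add(obj);
    }
}

// Fan-out stage: forwards each input to every registered downstream stage.
class StageMulti implements SimpleStage {
    private final List<SimpleStage> downstream = new ArrayList<>();

    void addDownstream(SimpleStage stage) {
        downstream.add(stage);
    }

    public void process(Object obj) {
        for (SimpleStage stage : downstream) {
            // The same reference is passed to each branch; make a real
            // copy here if downstream stages modify their input.
            stage.process(obj);
        }
    }
}

class StageMultiDemo {
    public static void main(String[] args) {
        CollectingStage stage2 = new CollectingStage();
        CollectingStage stage3 = new CollectingStage();
        CollectingStage stage4 = new CollectingStage();

        StageMulti multi = new StageMulti();
        multi.addDownstream(stage2);
        multi.addDownstream(stage3);
        multi.addDownstream(stage4);

        // Pretend these are the outputs of stage1.
        for (String item : Arrays.asList("a", "b")) {
            multi.process(item);
        }
        System.out.println(stage2.received + " " + stage3.received + " " + stage4.received);
    }
}
```

Adding a stage5 would then be one more addDownstream call (or, in a real pipeline, one more branch name in the configuration), with no change to stage1.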

I can imagine making configuration files a little simpler with regard to setting up branching, but the more intelligent configuration file reader to handle that hasn't been written.

Similarly for merging. To follow up the previous question, let's say I had stageA that output some A's and stageB that output some B's (let's assume both A's and B's are simple numbers). Now I wanted to have a stageC that takes all A's and all B's and generates some output with them (let's assume the output is A * B, so that every combination of A * B is output). So this would look like this:
stageA--+
        |
        |----stageC
        |
stageB--+
Is it possible to do this, so that stageA and stageB are both writing to stageC, but stageC can distinguish the two different streams of data?

First off, the current design expects all pipelines to start with one stage, to accept feed values out of the config file (or place command line arguments into the first stage queue if the main pipeline application has been written to do that). So maybe you have a stageInit which takes a single number like "3":


feed "3" --> stageInit----stageA
               |
               ----------stageB

stageInit can then pass "3" on to stageA and stageB, possibly causing stageA to create three 2-digit numbers and stageB to create three 3-digit numbers.

For merging, stageC will accept normal input from a stage as well as watch for events carrying additional data. stageC may well have to accumulate input and then produce output as events are received. Stages normally accept one input, which is either a feed or the output of the stage immediately preceding them. Input from elsewhere or from more than one source is currently handled as events raised by the source and received by a "notify" method in the receiving stage.

feed "3" --> stageInit----stageA-------------stageC --> 10*111, 10*222, 10*333, 20*111, 20*222, 20*333, 30*111....
               |      3          10, 20, 30    :
               ----------stageB................:
                      3          111, 222, 333
---- normal data flow
.... event passed data
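
The accumulation that stageC has to do can be sketched as follows. This is only an illustration under assumptions: MergingStageC is a hypothetical class, process stands in for the normal input from stageA, and notifyB stands in for the "notify" method that receives stageB's events. None of it uses the real Commons Pipeline API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical merging stage: A's arrive as normal input, B's as events.
class MergingStageC {
    // Everything seen so far is kept, so memory grows with the input.
    private final List<Integer> seenA = new ArrayList<>();
    private final List<Integer> seenB = new ArrayList<>();
    final List<String> output = new ArrayList<>();

    // Normal data flow: a value from stageA. Pair it with every B seen so far.
    synchronized void process(int a) {
        seenA.add(a);
        for (int b : seenB) {
            output.add(a + "*" + b);
        }
    }

    // Event-passed data: a value raised by stageB. Pair it with every A seen so far.
    synchronized void notifyB(int b) {
        seenB.add(b);
        for (int a : seenA) {
            output.add(a + "*" + b);
        }
    }
}

class MergeDemo {
    public static void main(String[] args) {
        MergingStageC stageC = new MergingStageC();
        // Arrival order between the two sources is arbitrary.
        stageC.process(10);
        stageC.notifyB(111);
        stageC.process(20);
        stageC.notifyB(222);
        stageC.notifyB(333);
        stageC.process(30);
        System.out.println(stageC.output);
    }
}
```

Because each new value is paired only with values already seen from the other source, every A*B combination is produced exactly once whatever the arrival order. Both lists have to be retained for as long as either source may still produce data.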

Like branching, for our uses merging is rare. Also beware of running out of memory if you are doing any accumulation of data to merge input from more than one stage.


-Ken
