More specifically, the InProcessPipelineRunner (soon to be renamed to the
DirectRunner) will run on a single machine, with a number of threads based
on the number of processors available to the JVM, fanning out work to these
threads as appropriate. It will not perform any cross-process (including
cross-machine) communication. No configuration is required to get this
threading behavior, but the number of threads is also not currently
configurable.
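
As a rough illustration of that threading model (this is not the runner's
actual code), the sketch below sizes an in-process thread pool from
Runtime.getRuntime().availableProcessors() and fans tasks out to it. The
class and method names are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; this is NOT the InProcessPipelineRunner implementation.
class ThreadFanOutSketch {

  // Number of worker threads, derived from the processors visible to the JVM.
  static int defaultParallelism() {
    return Math.max(1, Runtime.getRuntime().availableProcessors());
  }

  // Runs every task on a thread pool inside this JVM and returns the
  // results in the same order as the submitted tasks.
  static <T> List<T> runAll(List<Callable<T>> tasks) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(defaultParallelism());
    try {
      List<T> results = new ArrayList<>();
      for (Future<T> future : pool.invokeAll(tasks)) {
        results.add(future.get());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<Callable<Integer>> tasks = new ArrayList<>();
    for (int i = 0; i < 8; i++) {
      final int n = i;
      tasks.add(() -> n * n);
    }
    System.out.println("threads: " + defaultParallelism());
    System.out.println("results: " + runAll(tasks));
  }
}
```

Everything here stays inside one JVM, which mirrors the "no cross-process
communication" point above.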

Can you say more about what you require to be parallel? In the current
implementation, Read transforms (and the Source that underlies them) are
exercised by only one thread, as are the PTransforms downstream of them
prior to a GroupByKey, because of how work is scheduled. However, all
transforms after a GroupByKey execute in parallel, with the degree of
parallelism depending on the number of available keys.
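
To make the "parallel per key" idea concrete, here is a minimal plain-Java
sketch (no Beam APIs; the class name, method name, and "key:value" record
format are assumptions made up for this example). Records are grouped by
key on a single thread, and the per-key work is then handed to a thread
pool, so the useful parallelism is bounded by the number of distinct keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of per-key parallelism; this is NOT Beam code.
class PerKeyParallelismSketch {

  // Takes records of the form "key:intValue" and returns the sum per key.
  static Map<String, Integer> sumPerKey(List<String> records) throws Exception {
    // Step 1: group by key on the calling thread (analogous to the
    // single-threaded stage before the grouping).
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (String record : records) {
      String[] kv = record.split(":");
      grouped.computeIfAbsent(kv[0], k -> new ArrayList<>())
          .add(Integer.parseInt(kv[1]));
    }

    // Step 2: process each key's values as an independent task on the pool.
    ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    try {
      Map<String, Future<Integer>> pending = new TreeMap<>();
      for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
        List<Integer> values = entry.getValue();
        Callable<Integer> task =
            () -> values.stream().mapToInt(Integer::intValue).sum();
        pending.put(entry.getKey(), pool.submit(task));
      }
      Map<String, Integer> sums = new TreeMap<>();
      for (Map.Entry<String, Future<Integer>> entry : pending.entrySet()) {
        sums.put(entry.getKey(), entry.getValue().get());
      }
      return sums;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(sumPerKey(Arrays.asList("a:1", "b:2", "a:3")));
  }
}
```

With only one distinct key, step 2 would submit a single task, which is why
the key count matters for how much work can run at once.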

On Tue, May 24, 2016 at 7:43 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi David,
>
> if you use the InProcessPipelineRunner (the "new" DirectPipelineRunner),
> then it can create several threads.
>
> Regards
> JB
>
>
> On 05/24/2016 04:38 PM, David Olsen wrote:
>
>> A naive question about DirectPipelineRunner: is it possible to
>> execute DirectPipelineRunner with multiple threads/instances (across
>> machines), or is parallelism only supported by runners such as
>> SparkPipelineRunner?
>>
>> My requirement is to run a pipeline in parallel, using either threads or
>> multiple machines. I have only just started investigating Apache Beam.
>>
>> In the Google Dataflow documentation [1], the option settings mention that
>> numWorkers can be configured to set the number of instances to use (I
>> understand this is still different from Apache Beam). However, searching
>> the Apache Beam source on GitHub for the keyword 'numWorkers' doesn't turn
>> up any related code. So I am wondering whether the only way to execute a
>> pipeline in parallel is to use SparkPipelineRunner/FlinkPipelineRunner
>> (meaning I have to use Apache Beam + Spark/Flink) or to make use of
>> Google Cloud Platform.
>>
>> Thanks
>>
>> [1] https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
