More specifically, the InProcessPipelineRunner (soon to be renamed to the DirectRunner) runs on a single machine, with a number of threads based on the number of processors available to the JVM, fanning out work to these threads as appropriate. It does not perform any cross-process (including cross-machine) communication. No configuration is required to get this threading behavior; the number of threads is not currently configurable either.
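To make the threading model concrete, here is a minimal plain-Java sketch of the general idea: a fixed pool sized from the processors available to the JVM, with independent pieces of work fanned out to it. This is not the runner's actual code. The class name `ProcessorSizedPool`, the helper `sumBundles`, and the notion of a "bundle" as a list of integers are all hypothetical and used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: NOT the InProcessPipelineRunner implementation.
public class ProcessorSizedPool {

    // Hypothetical helper: sums each bundle on a pool sized to the processor count.
    static List<Integer> sumBundles(List<List<Integer>> bundles) throws Exception {
        // Thread count comes from the JVM, with no user configuration.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per bundle; tasks are independent and stay in-process.
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (List<Integer> bundle : bundles) {
                tasks.add(() -> {
                    int sum = 0;
                    for (int v : bundle) {
                        sum += v;
                    }
                    return sum;
                });
            }
            // invokeAll returns futures in the same order as the submitted tasks.
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<List<Integer>> bundles = Arrays.asList(
            Arrays.asList(1, 2, 3),
            Arrays.asList(10, 20),
            Arrays.asList(100));
        System.out.println("threads = " + Runtime.getRuntime().availableProcessors());
        System.out.println(sumBundles(bundles)); // prints [6, 30, 100]
    }
}
```

Everything here stays inside one JVM, which mirrors the single-machine, no-cross-process behavior described above.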
Can you say more about what you require to be parallel? In the current implementation, Read transforms (and the Source that underlies them) are exercised by only one thread, as are PTransforms downstream of them prior to a GroupByKey, because of how work is scheduled. However, all transforms after a GroupByKey execute in parallel based on the number of available keys.

On Tue, May 24, 2016 at 7:43 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi David,
>
> if you use the InProcessPipelineRunner (the "new" DirectPipelineRunner),
> than it can creates several threads.
>
> Regards
> JB
>
>
> On 05/24/2016 04:38 PM, David Olsen wrote:
>
>> A naive question about DirectPipelineRunner: Is it possible to
>> execute DirectPipelineRunner with multiple threads/ instances (across
>> machines) or the parallelism is only supported by runner such as
>> SparkPipelineRunner?
>>
>> My requirement is to run pipeline in parallel, either threading or
>> multiple machines. And I just start to investigating Apache Beam.
>>
>> When reading google dataflow doc, the options setting mention that
>> numWorkers can be configured for the instances to use (I understand it's
>> still different from Apache Beam). However, searching Apache Beam source
>> on github with the keyword 'numWorkers' doesn't come up related source
>> snippet. So I am wondering if the only way to execute pipeline process
>> in parallel is to use SparkPipelineRunner/ FlinkPipelineRunner (meaning
>> I have to use Apache Beam + Spark/ Flink) or make use of Google Cloud
>> Platform?
>>
>> Thanks
>>
>> [1].
>>
>> https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>