At the moment I need to read split data locally, perform external calls, and then aggregate the results by a particular key (e.g. per user) if needed.
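
To make that requirement concrete, here is a minimal sketch in plain Java. It is not Beam code and uses no Beam API, only `java.util.concurrent`. The class name `SplitAggregator`, the method `process`, and the stub `externalCall` are all hypothetical names made up for illustration, and each "split" is assumed to be just a list of user keys. It reads the splits on a fixed thread pool, makes one (stubbed) external call per record, and sums the results per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SplitAggregator {

    // Hypothetical stand-in for the external call made for each record.
    static int externalCall(String user) {
        return 1;
    }

    // Processes every split on its own pool task and sums the results per key.
    static Map<String, Integer> process(List<List<String>> splits, int threads)
            throws InterruptedException, ExecutionException {
        Map<String, Integer> totals = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<String> split : splits) {
                pending.add(pool.submit(() -> {
                    for (String user : split) {
                        // merge() on a ConcurrentHashMap is atomic per key.
                        totals.merge(user, externalCall(user), Integer::sum);
                    }
                }));
            }
            // Wait for all splits; get() rethrows any failure from a task.
            for (Future<?> task : pending) {
                task.get();
            }
        } finally {
            pool.shutdown();
        }
        return totals;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: one thread per available processor.
        int threads = Runtime.getRuntime().availableProcessors();
        List<List<String>> splits = List.of(
                List.of("alice", "bob"),
                List.of("alice", "carol", "alice"));
        System.out.println(process(splits, threads));
    }
}
```

What I am after is this shape of work (parallel reads, then a per-key aggregation), expressed as a pipeline rather than hand-written threading.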

It seems to me that the current InProcessPipelineRunner fulfills my requirement, but I am not sure whether 'Read transforms with one thread' means that reading the underlying data (in my case, reading split data locally) uses a single thread to go through all the splits. If that is the case, is there any way to read splits with more than one thread, or any workaround? (Eventually I will move to a pipeline runner that works across machines; for now I don't have enough resources, so I have to start on a single machine.)

Thanks for all the input. It's very useful!

On 25 May 2016 at 02:41, Jean-Baptiste Onofré <[email protected]> wrote:

> I second Thomas: thanks for the detailed explanation (I forgot to mention
> the "unique" JVM ;)).
>
> Regards
> JB
>
> On 05/24/2016 07:28 PM, Thomas Groh wrote:
>
>> More specifically, the InProcessPipelineRunner (soon to be renamed to
>> the DirectRunner) will run on a single machine, with a number of threads
>> based on the number of available processors in the JVM, fanning out work
>> to these threads as appropriate; it will not perform any cross-process
>> (including cross-machine) communication. No configuration is required to
>> get this threading behavior, but the number of threads is also not
>> currently configurable.
>>
>> Can you say more about what you require to be parallel? In the current
>> implementation, Read transforms (and the Source that underlies them) are
>> currently exercised by only one thread, as are PTransforms downstream of
>> them prior to a GroupByKey, based on how work is scheduled. However, all
>> transforms after a GroupByKey execute in parallel based on the number of
>> available keys.
>>
>> On Tue, May 24, 2016 at 7:43 AM, Jean-Baptiste Onofré <[email protected]> wrote:
>>
>>     Hi David,
>>
>>     if you use the InProcessPipelineRunner (the "new"
>>     DirectPipelineRunner), then it can create several threads.
>>
>>     Regards
>>     JB
>>
>>
>>     On 05/24/2016 04:38 PM, David Olsen wrote:
>>
>>         A naive question about DirectPipelineRunner: Is it possible to
>>         execute DirectPipelineRunner with multiple threads/instances
>>         (across machines), or is parallelism only supported by runners
>>         such as SparkPipelineRunner?
>>
>>         My requirement is to run a pipeline in parallel, either with
>>         threads or on multiple machines. I have just started
>>         investigating Apache Beam.
>>
>>         When reading the Google Dataflow doc [1], the option settings
>>         mention that numWorkers can be configured for the instances to
>>         use (I understand it's still different from Apache Beam).
>>         However, searching the Apache Beam source on GitHub with the
>>         keyword 'numWorkers' doesn't turn up any related source snippet.
>>         So I am wondering if the only way to execute a pipeline in
>>         parallel is to use SparkPipelineRunner/FlinkPipelineRunner
>>         (meaning I have to use Apache Beam + Spark/Flink) or to make use
>>         of Google Cloud Platform?
>>
>>         Thanks
>>
>>         [1] https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
>>
>>
>>     --
>>     Jean-Baptiste Onofré
>>     [email protected]
>>     http://blog.nanthrax.net
>>     Talend - http://www.talend.com
>>
>>
>>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
