I can do that. Thanks for the suggestion and the tracking. I appreciate it!
On 26 May 2016 at 02:36, Thomas Groh <[email protected]> wrote: > Each source (e.g. TextIO.Read.from("/foo/bar")) will currently be invoked > by only a single thread at a time; Multiple sources (e.g. > TextIO.Read.from("/foo/bar"); ... TextIO.Read.from("/foo/baz"); ...) will > be read from independently and in different threads. If you have a known > distribution of sources, splitting the read into multiple sources is a > workaround to produce additional parallelism in the InProcessPipelineRunner > (Flattening the produced PCollections together). Additionally, downstream > transforms prior to a GroupByKey will also be executed by a single thread > at a time, but independently; so if multiple transforms are applied to the > same PCollection, they will be executed in parallel). > > I've added https://issues.apache.org/jira/browse/BEAM-310 to track the > feature to split input sources. > > On Wed, May 25, 2016 at 8:41 AM, David Olsen <[email protected]> > wrote: > >> At the moment I would need to read split data locally, to perform >> external calls, and then to aggregate results based on a particular key >> (e.g. per user) if needed. >> >> This seems to me the current InProcessPipelineRunner fulfill my >> requirement, but not sure if 'Read transforms with one thread' means >> reading underlying data (in my case i.e. to read split data locally) will >> use one single thread to go through all split data? If this is the case, >> anyway I can read splits with more than one thread or any workaround? >> (Eventually I will go with across machines pipeline runner; now I don't >> have enough resources so have to start from single machine.) >> >> Thanks for all the input. It's very useful! >> >> On 25 May 2016 at 02:41, Jean-Baptiste Onofré <[email protected]> wrote: >> >>> I second Thomas: thanks for the details explanation (I forgot the >>> mention the "unique" JVM ;)). >>> >>> Regards >>> JB >>> >>> On 05/24/2016 07:28 PM, Thomas Groh wrote: >>> >>>> More specifically, the InProcessPipelineRunner (soon to be renamed to >>>> the DirectRunner) will run on a single machine, with a number of threads >>>> based on the number of available processors in the JVM, fanning out work >>>> to these threads as appropriate; It will not perform any cross-process >>>> (including cross-machine) communication. No configuration is required to >>>> get this threading behavior, but the number of threads is also not >>>> currently configurable. >>>> >>>> Can you say more about what you require to be parallel? In the current >>>> implementation, Read transforms (and the Source that underlies them) are >>>> currently exercised by only one thread, as are PTransforms downstream of >>>> them prior to a GroupByKey, based on how work is scheduled. However, all >>>> transforms after a GroupByKey execute in parallel based on the number of >>>> available keys. >>>> >>>> On Tue, May 24, 2016 at 7:43 AM, Jean-Baptiste Onofré <[email protected] >>>> <mailto:[email protected]>> wrote: >>>> >>>> Hi David, >>>> >>>> if you use the InProcessPipelineRunner (the "new" >>>> DirectPipelineRunner), than it can creates several threads. >>>> >>>> Regards >>>> JB >>>> >>>> >>>> On 05/24/2016 04:38 PM, David Olsen wrote: >>>> >>>> A naive question about DirectPipelineRunner: Is it possible to >>>> execute DirectPipelineRunner with multiple threads/ instances >>>> (across >>>> machines) or the parallelism is only supported by runner such as >>>> SparkPipelineRunner? >>>> >>>> My requirement is to run pipeline in parallel, either threading >>>> or >>>> multiple machines. And I just start to investigating Apache >>>> Beam. >>>> >>>> When reading google dataflow doc, the options setting mention >>>> that >>>> numWorkers can be configured for the instances to use (I >>>> understand it's >>>> still different from Apache Beam). However, searching Apache >>>> Beam source >>>> on github with the keyword 'numWorkers' doesn't come up related >>>> source >>>> snippet. So I am wondering if the only way to execute pipeline >>>> process >>>> in parallel is to use SparkPipelineRunner/ FlinkPipelineRunner >>>> (meaning >>>> I have to use Apache Beam + Spark/ Flink) or make use of Google >>>> Cloud >>>> Platform? >>>> >>>> Thanks >>>> >>>> [1]. >>>> >>>> https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options >>>> >>>> >>>> -- >>>> Jean-Baptiste Onofré >>>> [email protected] <mailto:[email protected]> >>>> http://blog.nanthrax.net >>>> Talend - http://www.talend.com >>>> >>>> >>>> >>> -- >>> Jean-Baptiste Onofré >>> [email protected] >>> http://blog.nanthrax.net >>> Talend - http://www.talend.com >>> >> >> >
