Re: Parallelism

David Olsen Thu, 26 May 2016 08:38:13 -0700

I can do that. Thanks for the suggestion and the tracking. I appreciate it!


On 26 May 2016 at 02:36, Thomas Groh <[email protected]> wrote:

> Each source (e.g. TextIO.Read.from("/foo/bar")) will currently be invoked
> by only a single thread at a time; Multiple sources (e.g.
> TextIO.Read.from("/foo/bar"); ... TextIO.Read.from("/foo/baz"); ...) will
> be read from independently and in different threads. If you have a known
> distribution of sources, splitting the read into multiple sources is a
> workaround to produce additional parallelism in the InProcessPipelineRunner
> (Flattening the produced PCollections together). Additionally, downstream
> transforms prior to a GroupByKey will also be executed by a single thread
> at a time, but independently; so if multiple transforms are applied to the
> same PCollection, they will be executed in parallel).
>
> I've added https://issues.apache.org/jira/browse/BEAM-310 to track the
> feature to split input sources.
>
> On Wed, May 25, 2016 at 8:41 AM, David Olsen <[email protected]>
> wrote:
>
>> At the moment I would need to read split data locally, to perform
>> external calls, and then to aggregate results based on a particular key
>> (e.g. per user) if needed.
>>
>> This seems to me the current InProcessPipelineRunner fulfill my
>> requirement, but not sure if 'Read transforms with one thread' means
>> reading underlying data (in my case i.e. to read split data locally) will
>> use one single thread to go through all split data? If this is the case,
>> anyway I can read splits with more than one thread or any workaround?
>> (Eventually I will go with across machines pipeline runner; now I don't
>> have enough resources so have to start from single machine.)
>>
>> Thanks for all the input. It's very useful!
>>
>> On 25 May 2016 at 02:41, Jean-Baptiste Onofré <[email protected]> wrote:
>>
>>> I second Thomas: thanks for the details explanation (I forgot the
>>> mention the "unique" JVM ;)).
>>>
>>> Regards
>>> JB
>>>
>>> On 05/24/2016 07:28 PM, Thomas Groh wrote:
>>>
>>>> More specifically, the InProcessPipelineRunner (soon to be renamed to
>>>> the DirectRunner) will run on a single machine, with a number of threads
>>>> based on the number of available processors in the JVM, fanning out work
>>>> to these threads as appropriate; It will not perform any cross-process
>>>> (including cross-machine) communication. No configuration is required to
>>>> get this threading behavior, but the number of threads is also not
>>>> currently configurable.
>>>>
>>>> Can you say more about what you require to be parallel? In the current
>>>> implementation, Read transforms (and the Source that underlies them) are
>>>> currently exercised by only one thread, as are PTransforms downstream of
>>>> them prior to a GroupByKey, based on how work is scheduled. However, all
>>>> transforms after a GroupByKey execute in parallel based on the number of
>>>> available keys.
>>>>
>>>> On Tue, May 24, 2016 at 7:43 AM, Jean-Baptiste Onofré <[email protected]
>>>> <mailto:[email protected]>> wrote:
>>>>
>>>>     Hi David,
>>>>
>>>>     if you use the InProcessPipelineRunner (the "new"
>>>>     DirectPipelineRunner), than it can creates several threads.
>>>>
>>>>     Regards
>>>>     JB
>>>>
>>>>
>>>>     On 05/24/2016 04:38 PM, David Olsen wrote:
>>>>
>>>>         A naive question about DirectPipelineRunner: Is it possible to
>>>>         execute DirectPipelineRunner with multiple threads/ instances
>>>>         (across
>>>>         machines) or the parallelism is only supported by runner such as
>>>>         SparkPipelineRunner?
>>>>
>>>>         My requirement is to run pipeline in parallel, either threading
>>>> or
>>>>         multiple machines. And I just start to investigating Apache
>>>> Beam.
>>>>
>>>>         When reading google dataflow doc, the options setting mention
>>>> that
>>>>         numWorkers can be configured for the instances to use (I
>>>>         understand it's
>>>>         still different from Apache Beam). However, searching Apache
>>>>         Beam source
>>>>         on github with the keyword 'numWorkers' doesn't come up related
>>>>         source
>>>>         snippet. So I am wondering if the only way to execute pipeline
>>>>         process
>>>>         in parallel is to use SparkPipelineRunner/ FlinkPipelineRunner
>>>>         (meaning
>>>>         I have to use Apache Beam + Spark/ Flink) or make use of Google
>>>>         Cloud
>>>>         Platform?
>>>>
>>>>         Thanks
>>>>
>>>>         [1].
>>>>
>>>> https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
>>>>
>>>>
>>>>     --
>>>>     Jean-Baptiste Onofré
>>>>     [email protected] <mailto:[email protected]>
>>>>     http://blog.nanthrax.net
>>>>     Talend - http://www.talend.com
>>>>
>>>>
>>>>
>>> --
>>> Jean-Baptiste Onofré
>>> [email protected]
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>
>

Re: Parallelism

Reply via email to