Re: Multi-threading implementation equivalence in Beam

amir bahmanyari Tue, 21 Jun 2016 12:41:48 -0700

Thanks Lukas.I am executing my fat jar Beam app in a Flink Cluster (2-nodes for 
now).I assume the Job Manager<--->Task manager(s) provide visibility to the 
in-memory db contents to all (ParDo) processes running on both nodes executing 
separate DoFn at the same time.Therefore, the "shared data" are 
synchronized/locked while one node process is making changes to it.I use one 
instance of Redis for one set of data (accessed by both nodes DoFn processes) & 
a concurrentHashMap for another set of dataI assume FlinkCluster maintains the 
thread safety of Redis & concurrentHashMap objects.Is this the right 
assumption? .Thanks again.Amir-

      From: Lukasz Cwik <[email protected]>
 To: [email protected]; amir bahmanyari <[email protected]> 
 Sent: Monday, June 20, 2016 4:10 PM
 Subject: Re: Multi-threading implementation equivalence in Beam

You are correct, an in memory database is outside the scope/knowledge of the 
runner and it will not be able to move any side effects over. For example, lets 
say your processing some elements X, Y, Z in one bundle on one machine and 
processing Y fails. The bundle may be retried on another machine where your 
changes to X may not exist. Or the bundle may be split such that X and Z is 
processed on one machine and Y on yet another machine.
If the reason for using an in-memory database is just caching and you can 
reload/recreate the cached entries than this should be fine, you'll just suffer 
cache misses elsewhere. 
If the reason is for caching previously seen elements which you will output 
later or write like side effects than this can disappear if the bundle 
processing is moved to another machine.
Its not that in memory databases can't be used, they just can't be relied on 
persistent state.

On Mon, Jun 20, 2016 at 3:36 PM, amir bahmanyari <[email protected]> wrote:

Wonderful. Thanks Lukaz.Have one question. The statement in that page "In 
addition, your DoFn should not rely on any persistent state from invocation to 
invocation.".I am using in-memory db such as Redis or Aerospike for 
intermediate look ups etc.Is this what the above statement referring to: dont 
use in-memory dbs?Thanks again.

      From: Lukasz Cwik <[email protected]>
 To: [email protected]; amir bahmanyari <[email protected]> 
 Sent: Monday, June 20, 2016 1:08 PM
 Subject: Re: Multi-threading implementation equivalence in Beam

Threading/parallelism is up to the runner and does not map 1-1 to the java 
memory model since most runners will execute in a distributed manner. In 
general, runners will attempt to break up the work as evenly as possible and 
schedule work across multiple machines / cpu cores at all times to maximize the 
throughput / minimize time for execution of the job This is abstracted away 
much by getting users to write DoFns that apply with ParDo. Please take a look 
at this explanation about ParDo 
(https://cloud.google.com/dataflow/model/par-do) to get a better understanding 
of its usage and as a place to look at some examples.
On Mon, Jun 20, 2016 at 12:44 PM, amir bahmanyari <[email protected]> wrote:

Thanks JB.I am executing FlinkPipelineRunner...& later will experirnt the same 
with SparkRunner....any examples pls?Cheers

      From: Jean-Baptiste Onofré <[email protected]>
 To: [email protected] 
 Sent: Monday, June 20, 2016 12:35 PM
 Subject: Re: Multi-threading implementation equivalence in Beam

Hi Amir,

the DirectPipelineRunner uses multi-thread to achieve ParDo execution 
for instance.

You mean example of Beam pipeline code ? Or example of runner ?

Regards
JB

On 06/20/2016 09:25 PM, amir bahmanyari wrote:
> Hi Colleagues,
> Hope you all had a great weekend. Another novice question :-)
> Is there a pipeline parallelism/threading model provide by Beam that
> equates the multi-threading model in Java for instance?
> Any examples if so?
> Thanks again,
> Amir

-- 
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Multi-threading implementation equivalence in Beam

Reply via email to