Re: A couple questions from someone new to Beam

Alexey Romanenko Wed, 25 Sep 2019 09:19:41 -0700

Hi Steve,

1) In general, managing the client connections is a responsibility of IO 
transform. Usually, one client instance is used per input split (bounded or 
unbounded source) and it opens a connection in the beginning of reading this 
split and closes in the end. The is in theory. Practically, every IO can 
implement it a bit differently and provide some additional options for 
connection configuration for users, like setting maximum connection time, 
accepting custom data source provider and connection pool and so on. Every 
input split will be processed in parallel but this is already a responsibility 
of backend data processing engine how to schedule and manage it on cluster.


2) Again, this is responsibility of backend data processing engine how to run 
it in parallel. Beam is a SDK that allows user to create a pipeline once and 
run it on different engines. To make it possible, Bean incorporates different 
runners, that are responsible to translate unified Beam pipeline DAG into 
specific backend engine one. Also, as Jeff mentioned before, Beam SDK is not 
thread-safe, so your should assume this while writing your pipeline code. In 
the same time, your backend engine can decide to fusion some of your pipeline 
stages, so, you can add break a fusion by adding GBK transform, but number of 
parallel threads/workers again will be depended on your backend engine. 


> On 25 Sep 2019, at 12:38, Steve973 <[email protected]> wrote:
> 
> Hi, all.  I am still ramping up on my learning of how to use Beam, and I have 
> a couple of questions for the experts.  And, while I have read the 
> documentation, I have either looked at the wrong parts, or my particular 
> questions were not specifically answered.  If I have missed something, then 
> please point me in the right direction.
> When using the MongoDB, for reading and writing from an execution node, does 
> it need to take the time, each time an executor runs, to set up the 
> connection to Mongo?  Or does Beam cache the connections and reuse them to 
> mitigate the performance hit of setting up the connection each time?  If so, 
> I am curious how it handles that for multiple nodes, unless Beam is "smart" 
> enough to pre-cache connections in a pool on execution nodes in advance.
> When something is executed in parallel (ParDo), do the parallel jobs run in 
> one thread on an execution node?  Or, will Beam utilize more 
> resources/threads, as available, on a node?  I would like to utilize as many 
> threads as possible on available cluster nodes.  My thought is that, if a job 
> is stateless, it seems reasonable to be able to utilize multiple threads on a 
> node to further parallelize and maximize performance.  Although, it also 
> occurs to me that this would probably be implementation-dependent on the 
> runner.  The other approach that I can see is to simply use 
> CompletableFutures in my jobs, which is what I am already doing in my code 
> that does not (yet) use Beam. But it would be preferable to allow Beam to 
> manage all of the parallelization.
> I am sure that I will have some more questions as time goes on, but this 
> would be great info to have for now.
> 
> Thanks,
> Steve

Re: A couple questions from someone new to Beam

Reply via email to