A couple questions from someone new to Beam

Steve973 Wed, 25 Sep 2019 03:39:09 -0700

Hi, all.  I am still ramping up on Beam, and I have a couple of questions
for the experts.  I have read the documentation, but either I looked at the
wrong parts or my particular questions are not specifically answered there.
If I have missed something, please point me in the right direction.


   1. When using the MongoDB connector to read and write from an execution
   node, does each executor run have to take the time to set up a connection
   to Mongo?  Or does Beam cache the connections and reuse them to mitigate
   the performance hit of setting up a connection each time?  If so, I am
   curious how it handles that across multiple nodes, unless Beam is "smart"
   enough to pre-cache connections in a pool on the execution nodes in
   advance.
   2. When something is executed in parallel (ParDo), do the parallel jobs
   run in one thread on an execution node?  Or will Beam utilize more
   resources/threads, as available, on a node?  I would like to use as many
   threads as possible on the available cluster nodes.  My thought is that,
   if a job is stateless, it should be reasonable to use multiple threads on
   a node to further parallelize the work and maximize performance.  It also
   occurs to me that this probably depends on the runner implementation.
   The other approach that I can see is to simply use CompletableFutures in
   my jobs, which is what I am already doing in my code that does not (yet)
   use Beam.  But it would be preferable to let Beam manage all of the
   parallelization.
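
To make the CompletableFuture approach in question 2 concrete, here is a
minimal sketch of the pattern.  It is not my actual code, and the class and
method names (ParallelSketch, processAll) are made up for illustration.  It
uses only the JDK, not Beam, and assumes the per-element function is
stateless.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only; names are invented and nothing here is a Beam API.
class ParallelSketch {

    // Applies a stateless function to every element on a fixed-size thread pool.
    static <T, R> List<R> processAll(List<T> inputs, Function<T, R> fn, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit all work first so the tasks can run concurrently.
            List<CompletableFuture<R>> futures = inputs.stream()
                .map(in -> CompletableFuture.supplyAsync(() -> fn.apply(in), pool))
                .collect(Collectors.toList());
            // join() blocks until each result is ready; results keep input order.
            return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        List<Integer> doubled = processAll(List.of(1, 2, 3, 4), x -> x * 2, threads);
        System.out.println(doubled); // prints [2, 4, 6, 8]
    }
}
```

What I would like to know is whether Beam (or the runner) can take over the
role of the thread pool here, so that I do not have to manage it myself.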

I am sure that I will have some more questions as time goes on, but this
would be great info to have for now.

Thanks,
Steve
