Hi Steve, 1) In general, managing the client connections is a responsibility of IO transform. Usually, one client instance is used per input split (bounded or unbounded source) and it opens a connection in the beginning of reading this split and closes in the end. The is in theory. Practically, every IO can implement it a bit differently and provide some additional options for connection configuration for users, like setting maximum connection time, accepting custom data source provider and connection pool and so on. Every input split will be processed in parallel but this is already a responsibility of backend data processing engine how to schedule and manage it on cluster.
2) Again, this is responsibility of backend data processing engine how to run it in parallel. Beam is a SDK that allows user to create a pipeline once and run it on different engines. To make it possible, Bean incorporates different runners, that are responsible to translate unified Beam pipeline DAG into specific backend engine one. Also, as Jeff mentioned before, Beam SDK is not thread-safe, so your should assume this while writing your pipeline code. In the same time, your backend engine can decide to fusion some of your pipeline stages, so, you can add break a fusion by adding GBK transform, but number of parallel threads/workers again will be depended on your backend engine. > On 25 Sep 2019, at 12:38, Steve973 <[email protected]> wrote: > > Hi, all. I am still ramping up on my learning of how to use Beam, and I have > a couple of questions for the experts. And, while I have read the > documentation, I have either looked at the wrong parts, or my particular > questions were not specifically answered. If I have missed something, then > please point me in the right direction. > When using the MongoDB, for reading and writing from an execution node, does > it need to take the time, each time an executor runs, to set up the > connection to Mongo? Or does Beam cache the connections and reuse them to > mitigate the performance hit of setting up the connection each time? If so, > I am curious how it handles that for multiple nodes, unless Beam is "smart" > enough to pre-cache connections in a pool on execution nodes in advance. > When something is executed in parallel (ParDo), do the parallel jobs run in > one thread on an execution node? Or, will Beam utilize more > resources/threads, as available, on a node? I would like to utilize as many > threads as possible on available cluster nodes. My thought is that, if a job > is stateless, it seems reasonable to be able to utilize multiple threads on a > node to further parallelize and maximize performance. Although, it also > occurs to me that this would probably be implementation-dependent on the > runner. The other approach that I can see is to simply use > CompletableFutures in my jobs, which is what I am already doing in my code > that does not (yet) use Beam. But it would be preferable to allow Beam to > manage all of the parallelization. > I am sure that I will have some more questions as time goes on, but this > would be great info to have for now. > > Thanks, > Steve
