Hi, all. I am still ramping up on Beam, and I have a couple of questions for the experts. I have read the documentation, but either I looked at the wrong parts or my particular questions are not specifically answered there. If I have missed something, please point me in the right direction.

1. When using MongoDB for reading and writing from an execution node, does the connection to Mongo have to be set up each time an executor runs? Or does Beam cache the connections and reuse them to mitigate the performance hit of setting up a connection each time? If so, I am curious how it handles that across multiple nodes, unless Beam is "smart" enough to pre-cache connections in a pool on the execution nodes in advance.

2. When something is executed in parallel (ParDo), do the parallel jobs run in a single thread on an execution node? Or will Beam use more resources/threads on a node as they are available? I would like to use as many threads as possible on the available cluster nodes. My thought is that, if a job is stateless, it seems reasonable to run it on multiple threads per node to parallelize further and maximize performance. It also occurs to me, though, that this is probably dependent on the runner implementation. The other approach I can see is to simply use CompletableFutures in my jobs, which is what I am already doing in my code that does not (yet) use Beam. But I would prefer to let Beam manage all of the parallelization.

I am sure that I will have more questions as time goes on, but this would be great info to have for now.

Thanks,
Steve
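
P.S. For context on question 2, here is a minimal sketch of the kind of CompletableFuture fan-out I mean. It is not my actual code and does not use Beam at all: the class name ParallelSketch, the methods process and processAll, and the pool size are made up for illustration, and process just stands in for some stateless per-record work.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain Java, no Beam involved.
public class ParallelSketch {

    // Stand-in for stateless per-record work (assumption: no shared mutable state).
    static String process(String record) {
        return record.toUpperCase();
    }

    // Submits one task per record to a fixed-size pool and waits for all of them.
    static List<String> processAll(List<String> records, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Start all tasks first so they can run concurrently.
            List<CompletableFuture<String>> futures = records.stream()
                    .map(r -> CompletableFuture.supplyAsync(() -> process(r), pool))
                    .collect(Collectors.toList());
            // Join in submission order, so results line up with the input order.
            return futures.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(List.of("a", "b", "c"), 4)); // prints [A, B, C]
    }
}
```

What I would like to know is whether I can drop this kind of hand-rolled thread management and leave it to Beam and the runner instead.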
