(1) I don't have direct experience connecting to MongoDB from Beam, but my
general expectation is that yes, Beam IOs do reasonable things for
connection setup and teardown. For MongoDbIO in the Java SDK, the
connection setup appears to happen in BoundedMongoDbReader [0]. In
general, the IOs in Beam implement classes like BoundedSource, which
hook into Beam's execution model and provide places for things like
connection setup.

(2) As to threads, there is some relevant documentation in the Beam
Programming Guide about requirements for user transforms [1]. In
particular, it states that each instance of the function you provide in
your ParDo will only be accessed by one thread at a time, and "note that static
members in your function object are not passed to worker instances and that
multiple instances of your function may be accessed from different
threads". So the details of how many instances of a function run
concurrently are left up to the runner to decide, but running several
instances concurrently is certainly allowed by the Beam model.
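
To make the threading point concrete, here is a minimal plain-Java sketch.
It is not Beam code and not how any particular runner is implemented; it
only models the contract described above. Every name in it
(PerInstanceLifecycleSketch, UppercaseFn, the round-robin sharding) is a
hypothetical stand-in. Each simulated worker thread gets its own function
instance, calls setup() once, reuses that instance for many elements, and
then calls teardown(). In Beam itself the corresponding hooks on a DoFn
are the methods annotated with @Setup and @Teardown, and the runner, not
your code, decides how many instances exist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class PerInstanceLifecycleSketch {

    // Counts simulated "connections" opened across all function instances.
    static final AtomicInteger connectionsOpened = new AtomicInteger();

    // Hypothetical function object with setup/process/teardown hooks.
    static class UppercaseFn {
        // Per-instance state: only one thread touches a given instance.
        private boolean connected;

        void setup() {
            connected = true; // stands in for opening a client connection
            connectionsOpened.incrementAndGet();
        }

        String process(String element) {
            if (!connected) {
                throw new IllegalStateException("setup() was not called");
            }
            return element.toUpperCase();
        }

        void teardown() {
            connected = false; // stands in for closing the connection
        }
    }

    // Simulated runner (an assumption for illustration): one instance per
    // worker thread, elements assigned to workers round-robin.
    static List<String> run(List<String> input, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int shard = w;
                futures.add(pool.submit(() -> {
                    UppercaseFn fn = new UppercaseFn();
                    fn.setup(); // once per instance, not once per element
                    try {
                        List<String> out = new ArrayList<>();
                        for (int i = shard; i < input.size(); i += workers) {
                            out.add(fn.process(input.get(i)));
                        }
                        return out;
                    } finally {
                        fn.teardown();
                    }
                }));
            }
            // Collect results shard by shard so the output order is stable.
            List<String> all = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                all.addAll(f.get());
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> out = run(Arrays.asList("a", "b", "c", "d", "e", "f"), 2);
        System.out.println(out);
        System.out.println("connections opened: " + connectionsOpened.get());
    }
}
```

With two workers and six elements, the sketch opens two simulated
connections rather than six, and no instance is ever shared between
threads, so the per-instance field needs no synchronization.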

[0]
https://github.com/apache/beam/blob/master/sdks/java/io/mongodb/src/main/java/org/apache/beam/sdk/io/mongodb/MongoDbIO.java#L642
[1]
https://beam.apache.org/documentation/programming-guide/#requirements-for-writing-user-code-for-beam-transforms

On Wed, Sep 25, 2019 at 6:38 AM Steve973 <[email protected]> wrote:

> Hi, all.  I am still ramping up on my learning of how to use Beam, and I
> have a couple of questions for the experts.  And, while I have read the
> documentation, I have either looked at the wrong parts, or my particular
> questions were not specifically answered.  If I have missed something, then
> please point me in the right direction.
>
>    1. When using the MongoDB, for reading and writing from an execution
>    node, does it need to take the time, each time an executor runs, to set up
>    the connection to Mongo?  Or does Beam cache the connections and reuse them
>    to mitigate the performance hit of setting up the connection each time?  If
>    so, I am curious how it handles that for multiple nodes, unless Beam is
>    "smart" enough to pre-cache connections in a pool on execution nodes in
>    advance.
>    2. When something is executed in parallel (ParDo), do the parallel
>    jobs run in one thread on an execution node?  Or, will Beam utilize more
>    resources/threads, as available, on a node?  I would like to utilize as
>    many threads as possible on available cluster nodes.  My thought is that,
>    if a job is stateless, it seems reasonable to be able to utilize multiple
>    threads on a node to further parallelize and maximize performance.
>    Although, it also occurs to me that this would probably be
>    implementation-dependent on the runner.  The other approach that I can see
>    is to simply use CompletableFutures in my jobs, which is what I am already
>    doing in my code that does not (yet) use Beam. But it would be preferable
>    to allow Beam to manage all of the parallelization.
>
> I am sure that I will have some more questions as time goes on, but this
> would be great info to have for now.
>
> Thanks,
> Steve
>
