Hello, how many times will @Setup be called per node? Is it possible to limit the number of ParDo instances in Google Dataflow?
Aleksandr.

On 14 Mar 2018 9:22 PM, "Eugene Kirpichov" <[email protected]> wrote:

"JdbcIO creates a new connection for each prepared statement" - this is not
the case: the connection is created in @Setup and deleted in @Teardown.

https://github.com/apache/beam/blob/v2.3.0/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L503
https://github.com/apache/beam/blob/v2.3.0/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L631

Something else must be going wrong.

On Wed, Mar 14, 2018 at 12:11 PM Aleksandr <[email protected]> wrote:

> Hello, we had a similar problem. The current JdbcIO will cause a lot of
> connection errors.
>
> Typically you have more than one prepared statement. JdbcIO creates a new
> connection for each prepared statement (and closes it only in @Teardown),
> so a connection can time out, or, in case of autoscaling, you end up with
> too many connections to the SQL server.
> Our solution was to create a connection pool in @Setup, and to borrow a
> connection from the pool and return it in @ProcessElement.
>
> Best Regards,
> Aleksandr Gortujev.
>
> On 14 Mar 2018 8:52 PM, "Jean-Baptiste Onofré" <[email protected]> wrote:
>
> Agreed, especially since the current JdbcIO implementation creates the
> connection in @Setup. Or does it mean that @Teardown is never called?
>
> Regards
> JB
>
> On 14 Mar 2018 at 11:40, Eugene Kirpichov <[email protected]> wrote:
>>
>> Hi Derek - could you explain where the "3000 connections" number comes
>> from, i.e. how did you measure it? It's weird that 5-6 workers would use
>> 3000 connections.
>>
>> On Wed, Mar 14, 2018 at 3:50 AM Derek Chan <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We are new to Beam and need some help.
>>>
>>> We are working on a flow that ingests events and writes the aggregated
>>> counts to a database. The input rate is rather low (~2000 messages per
>>> sec), but the processing is relatively heavy, so we need to scale out
>>> to 5-6 nodes. The output (via JDBC) is aggregated, so the volume is
>>> also low. But because of the number of workers, the job keeps 3000
>>> connections to the database and keeps hitting the database connection
>>> limit.
>>>
>>> Is there a way to reduce the concurrency at the output stage only? (In
>>> Spark we would have done a repartition/coalesce.)
>>>
>>> And, if it matters, we are using Apache Beam 2.2 via Scio, on Google
>>> Dataflow.
>>>
>>> Thank you in advance!
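For reference, here is a minimal sketch of the pooled-connection pattern
Aleksandr describes: build the pool once in @Setup, borrow and return a
connection per element in @ProcessElement, and close the pool in @Teardown.
HikariCP as the pool, the class name PooledJdbcWriteFn, the JDBC URL,
credentials, pool size, and the counts table are all illustrative
assumptions, not part of JdbcIO or anything stated in this thread.

import java.sql.Connection;
import java.sql.PreparedStatement;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

/** Writes aggregated counts through a per-worker connection pool. */
class PooledJdbcWriteFn extends DoFn<KV<String, Long>, Void> {
  // transient: the pool is created on the worker, not serialized with the DoFn.
  private transient HikariDataSource pool;

  @Setup
  public void setup() {
    HikariConfig config = new HikariConfig();
    config.setJdbcUrl("jdbc:postgresql://dbhost/mydb"); // hypothetical URL
    config.setUsername("user");                         // hypothetical credentials
    config.setPassword("secret");
    config.setMaximumPoolSize(2); // caps connections per DoFn instance
    pool = new HikariDataSource(config);
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // Borrow a connection per element; try-with-resources returns it to the
    // pool (Hikari's Connection.close() releases rather than disconnects).
    try (Connection conn = pool.getConnection();
        PreparedStatement stmt = conn.prepareStatement(
            "INSERT INTO counts (key, value) VALUES (?, ?)")) { // hypothetical table
      stmt.setString(1, c.element().getKey());
      stmt.setLong(2, c.element().getValue());
      stmt.executeUpdate();
    }
  }

  @Teardown
  public void teardown() {
    if (pool != null) {
      pool.close(); // closes all physical connections in the pool
    }
  }
}

With this shape, the total connection count is bounded by the pool size
times the number of live DoFn instances per worker, rather than by the
number of prepared statements, which is what the thread identifies as the
source of the connection-limit errors.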
