Out of curiosity I wanted to see what JBoss supports in terms of clustering and database connection pooling, since its implementation should suffice for your use case. I found:

*Note:* JBoss does not recommend using this feature in a production environment. It requires accessing a connection pool remotely, which is an anti-pattern because connections are not serializable. Furthermore, transaction propagation is not supported, and it can lead to connection leaks if the remote clients are unreliable (e.g., crashes, network failures). If you do need to access a datasource remotely, JBoss recommends accessing it via a remote session bean facade.[1]

You probably aren't worried about transactions; I gather from your use case that you are just pulling this data in a read-only fashion. That being said, JBoss does appear to have something.

The other thing to look for is whether or not a solution exists in Hadoop; I can't find anything for JDBC connection pools shared across a cluster (just pools local to a mapper, which is similar to what Cody suggested earlier for Spark and partitions).

If you were talking about a high-volume web application, then I'd believe the extra effort for connection pooling [over the cluster] would be worth it. Unless you're planning on executing several hundred parallel jobs, does the small amount of overhead really outweigh the time necessary to develop a solution? (I'm guessing a solution doesn't exist because the pattern where it would be an issue just isn't a common use case for Spark. I went down this path - connection pooling - myself originally and found that a single connection per executor was fine for my needs. Local connection pools per partition, as Cody said previously, would also work for my use case.)

A local connection pool shared amongst all executors on a node isn't a solution either, since different jobs execute under different JVMs even when they run on the same worker node.[2]

1. https://developer.jboss.org/wiki/ConfigDataSources
2. http://spark.apache.org/docs/latest/cluster-overview.html

On Fri, Apr 3, 2015 at 1:39 AM Sateesh Kavuri <[email protected]> wrote:

> Each executor runs for about 5 secs, until which time the db connection can
> potentially be open. Each executor will have 1 connection open.
> Connection pooling surely has its advantages of performance and of not
> hitting the db server for every open/close. The database in question is not
> just used by the spark jobs, but is shared by other systems, and so the
> spark jobs have to be better at managing the resources.
>
> I am not really looking for a db connections counter (will let the db
> handle that part), but rather to have a pool of connections on the spark
> end so that the connections can be reused across jobs.
>
> On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke <[email protected]>
> wrote:
>
>> How long does each executor keep the connection open for? How many
>> connections does each executor open?
>>
>> Are you certain that connection pooling is a performant and suitable
>> solution? Are you running out of resources on the database server and
>> cannot tolerate each executor having a single connection?
>>
>> If you need a solution that limits the number of open connections
>> [resource starvation on the DB server], I think you'd have to fake it
>> with a centralized counter of active connections, and logic within each
>> executor that blocks when the counter is at a given threshold. If the
>> counter is not at threshold, then an active connection can be created
>> (after incrementing the shared counter). You could use something like
>> ZooKeeper to store the counter value. This would have the overall effect
>> of decreasing performance if your required number of connections
>> outstrips the database's resources.
>>
>> On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri <[email protected]>
>> wrote:
>>
>>> But this basically means that the pool is confined to the job (of a
>>> single app) in question, and is not sharable across multiple apps?
>>> The setup we have is a job server (the spark-jobserver) that creates
>>> jobs. Currently, we have each job opening and closing a connection to
>>> the database. What we would like to achieve is for each of the jobs to
>>> obtain a connection from a db pool.
>>>
>>> Any directions on how this can be achieved?
>>>
>>> --
>>> Sateesh
>>>
>>> On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger <[email protected]>
>>> wrote:
>>>
>>>> Connection pools aren't serializable, so you generally need to set
>>>> them up inside of a closure. Doing that for every item is wasteful,
>>>> so you typically want to use mapPartitions or foreachPartition:
>>>>
>>>> rdd.mapPartitions { part =>
>>>>   setupPool
>>>>   part.map { ...
>>>>
>>>> See "Design Patterns for using foreachRDD" in
>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
>>>>
>>>> On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri <
>>>> [email protected]> wrote:
>>>>
>>>>> Right, I am aware of how to use connection pooling with Oracle, but
>>>>> the specific question is how to use it in the context of spark job
>>>>> execution.
>>>>>
>>>>> On 2 Apr 2015 17:41, "Ted Yu" <[email protected]> wrote:
>>>>>
>>>>>> http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
>>>>>>
>>>>>> The question doesn't seem to be Spark specific, btw
>>>>>>
>>>>>> > On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri <
>>>>>> [email protected]> wrote:
>>>>>> >
>>>>>> > Hi,
>>>>>> >
>>>>>> > We have a case where we will have to run concurrent jobs (for the
>>>>>> same algorithm) on different data sets. These jobs can run in
>>>>>> parallel, and each one of them would be fetching the data from the
>>>>>> database.
>>>>>> > We would like to optimize the database connections by making use
>>>>>> of connection pooling. Any suggestions / best known ways on how to
>>>>>> achieve this? The database in question is Oracle.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Sateesh
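The pattern discussed in this thread - a connection set up once per executor JVM and reused inside mapPartitions, rather than a cluster-wide pool - might be sketched roughly as follows. This is a sketch, not code from the thread: `ConnectionHolder` is a made-up name, the JDBC URL and credentials are placeholders, and `rdd` stands for whatever RDD the job processes.

```scala
import java.sql.{Connection, DriverManager}

// One connection per executor JVM: a Scala `object` is initialized at most
// once per JVM, and each Spark executor runs in its own JVM, so every task
// scheduled on that executor reuses the same lazily-opened connection.
object ConnectionHolder {
  lazy val conn: Connection = DriverManager.getConnection(
    "jdbc:oracle:thin:@//dbhost:1521/SERVICE", // placeholder URL
    "user", "password")                        // placeholder credentials
}

// Inside a job: the setup cost is paid per executor, not per record.
rdd.mapPartitions { part =>
  val conn = ConnectionHolder.conn
  part.map { row =>
    // ... query or write using conn ...
    row
  }
}
```

Since the connection lives for the life of the executor JVM, reuse across jobs works only when the jobs share executors - which they do under a long-running spark-jobserver context, where a single SparkContext (and its executors) is shared by all jobs submitted to it. A pooling library could be substituted for the single lazy connection without changing the shape of this code.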
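Charles's centralized-counter idea can be illustrated in miniature with a JVM-local semaphore. To be clear about the assumptions: `ConnectionGate` and the limit of 10 are invented for illustration, and a `Semaphore` only caps connections within one JVM - a genuinely cluster-wide cap would need the counter held externally, e.g. in ZooKeeper, exactly as he says.

```scala
import java.sql.Connection
import java.util.concurrent.Semaphore

// JVM-local stand-in for a shared counter of active connections. acquire()
// blocks while the count is at the threshold; release() decrements it.
object ConnectionGate {
  private val permits = new Semaphore(10) // assumed cap on open connections

  def withConnection[A](open: () => Connection)(f: Connection => A): A = {
    permits.acquire()          // block until below the threshold
    val conn = open()
    try f(conn)
    finally {
      try conn.close()
      finally permits.release() // give the permit back even if close() throws
    }
  }
}
```

As noted in the thread, this trades throughput for a bounded connection count: when demand exceeds the cap, executors simply wait.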
