Out of curiosity I wanted to see what JBoss supports in terms of clustering and database connection pooling, since its implementation should suffice for your use case. I found:

*Note:* JBoss does not recommend using this feature in a production environment. It requires accessing a connection pool remotely, which is an anti-pattern because connections are not serializable. Furthermore, transaction propagation is not supported, and it can lead to connection leaks if the remote clients are unreliable (e.g., crashes, network failures). If you do need to access a datasource remotely, JBoss recommends accessing it via a remote session bean facade.[1]

You probably aren't worried about transactions; I gather from your use case that you are just pulling this data in a read-only fashion. That being said, JBoss does appear to have something.

The other thing to look for is whether or not a solution exists in Hadoop; I can't find anything for JDBC connection pools shared across a cluster (just pools local to a mapper, which is similar to what Cody suggested earlier for Spark and partitions).

If you were talking about a high-volume web application, then I'd believe the extra effort for connection pooling [over the cluster] would be worth it. Unless you're planning on executing several hundred parallel jobs, does the small amount of overhead really outweigh the time necessary to develop a solution? (I'm guessing a solution doesn't exist because the pattern where it would be an issue just isn't a common use case for Spark. I went down this path - connection pooling - myself originally and found that a single connection per executor was fine for my needs. Local connection pools per partition, as Cody said previously, would also work for my use case.)

A local connection pool shared amongst all executors on a node isn't a solution either, since different jobs execute under different JVMs even when they run on the same worker node.[2]

1. https://developer.jboss.org/wiki/ConfigDataSources
2. http://spark.apache.org/docs/latest/cluster-overview.html

On Fri, Apr 3, 2015 at 1:39 AM Sateesh Kavuri <[email protected]> wrote:

> Each executor runs for about 5 secs, until which time the db connection can
> potentially be open. Each executor will have 1 connection open.
> Connection pooling surely has its advantages of performance and of not
> hitting the db server for every open/close. The database in question is not
> just used by the spark jobs, but is shared by other systems, and so the
> spark jobs have to be better at managing the resources.
>
> I am not really looking for a db connections counter (will let the db
> handle that part), but rather to have a pool of connections on the spark
> end so that the connections can be reused across jobs.
>
> On Fri, Apr 3, 2015 at 10:21 AM, Charles Feduke <[email protected]>
> wrote:
>
>> How long does each executor keep the connection open for? How many
>> connections does each executor open?
>>
>> Are you certain that connection pooling is a performant and suitable
>> solution? Are you running out of resources on the database server and
>> cannot tolerate each executor having a single connection?
>>
>> If you need a solution that limits the number of open connections
>> [resource starvation on the DB server], I think you'd have to fake it
>> with a centralized counter of active connections, and logic within each
>> executor that blocks when the counter is at a given threshold. If the
>> counter is not at threshold, then an active connection can be created
>> (after incrementing the shared counter). You could use something like
>> ZooKeeper to store the counter value. This would have the overall effect
>> of decreasing performance if your required number of connections
>> outstrips the database's resources.
>>
>> On Fri, Apr 3, 2015 at 12:22 AM Sateesh Kavuri <[email protected]>
>> wrote:
>>
>>> But this basically means that the pool is confined to the job (of a
>>> single app) in question, and is not sharable across multiple apps?
>>> The setup we have is a job server (the spark-jobserver) that creates
>>> jobs. Currently, we have each job opening and closing a connection to
>>> the database. What we would like to achieve is for each of the jobs to
>>> obtain a connection from a db pool.
>>>
>>> Any directions on how this can be achieved?
>>>
>>> --
>>> Sateesh
>>>
>>> On Thu, Apr 2, 2015 at 7:00 PM, Cody Koeninger <[email protected]>
>>> wrote:
>>>
>>>> Connection pools aren't serializable, so you generally need to set
>>>> them up inside of a closure. Doing that for every item is wasteful,
>>>> so you typically want to use mapPartitions or foreachPartition:
>>>>
>>>> rdd.mapPartitions { part =>
>>>>   setupPool
>>>>   part.map { ...
>>>>
>>>> See "Design Patterns for using foreachRDD" in
>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
>>>>
>>>> On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri <
>>>> [email protected]> wrote:
>>>>
>>>>> Right, I am aware of how to use connection pooling with Oracle, but
>>>>> the specific question is how to use it in the context of spark job
>>>>> execution.
>>>>>
>>>>> On 2 Apr 2015 17:41, "Ted Yu" <[email protected]> wrote:
>>>>>
>>>>>> http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
>>>>>>
>>>>>> The question doesn't seem to be Spark specific, btw
>>>>>>
>>>>>> > On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri <
>>>>>> [email protected]> wrote:
>>>>>> >
>>>>>> > Hi,
>>>>>> >
>>>>>> > We have a case where we will have to run concurrent jobs (for the
>>>>>> same algorithm) on different data sets. These jobs can run in
>>>>>> parallel, and each one of them would be fetching the data from the
>>>>>> database.
>>>>>> > We would like to optimize the database connections by making use
>>>>>> of connection pooling. Any suggestions / best known ways on how to
>>>>>> achieve this? The database in question is Oracle.
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Sateesh
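The pattern discussed in this thread - a connection set up once per executor JVM and reused inside mapPartitions, rather than a cluster-wide pool - might be sketched roughly as follows. This is a sketch, not code from the thread: `ConnectionHolder` is a made-up name, the JDBC URL and credentials are placeholders, and `rdd` stands for whatever RDD the job processes.

```scala
import java.sql.{Connection, DriverManager}

// One connection per executor JVM: a Scala `object` is initialized at most
// once per JVM, and each Spark executor runs in its own JVM, so every task
// scheduled on that executor reuses the same lazily-opened connection.
object ConnectionHolder {
  lazy val conn: Connection = DriverManager.getConnection(
    "jdbc:oracle:thin:@//dbhost:1521/SERVICE", // placeholder URL
    "user", "password")                        // placeholder credentials
}

// Inside a job: the setup cost is paid per executor, not per record.
rdd.mapPartitions { part =>
  val conn = ConnectionHolder.conn
  part.map { row =>
    // ... query or write using conn ...
    row
  }
}
```

Since the connection lives for the life of the executor JVM, reuse across jobs works only when the jobs share executors - which they do under a long-running spark-jobserver context, where a single SparkContext (and its executors) is shared by all jobs submitted to it. A pooling library could be substituted for the single lazy connection without changing the shape of this code.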
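Charles's centralized-counter idea can be illustrated in miniature with a JVM-local semaphore. To be clear about the assumptions: `ConnectionGate` and the limit of 10 are invented for illustration, and a `Semaphore` only caps connections within one JVM - a genuinely cluster-wide cap would need the counter held externally, e.g. in ZooKeeper, exactly as he says.

```scala
import java.sql.Connection
import java.util.concurrent.Semaphore

// JVM-local stand-in for a shared counter of active connections. acquire()
// blocks while the count is at the threshold; release() decrements it.
object ConnectionGate {
  private val permits = new Semaphore(10) // assumed cap on open connections

  def withConnection[A](open: () => Connection)(f: Connection => A): A = {
    permits.acquire()          // block until below the threshold
    val conn = open()
    try f(conn)
    finally {
      try conn.close()
      finally permits.release() // give the permit back even if close() throws
    }
  }
}
```

As noted in the thread, this trades throughput for a bounded connection count: when demand exceeds the cap, executors simply wait.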
