Connection pools aren't serializable, so you generally need to set them up
inside a closure that runs on the executors. Doing that for every item is
wasteful, so you typically want to use mapPartitions or foreachPartition and
set the pool up once per partition:
rdd.mapPartitions { part =>
  val pool = setupPool()  // once per partition, not per record
  part.map { item =>
    // ... use a connection from the pool ...
  }
}
See "Design Patterns for using foreachRDD" in
http://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
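
As a more complete sketch of that pattern (assuming HikariCP as the pool
implementation purely for illustration; Oracle's UCP or any other JDBC pool
works the same way, and the JDBC URL, credentials, and INSERT statement below
are placeholders):

import java.sql.Connection
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
import org.apache.spark.rdd.RDD

// A lazy val in an object is initialized at most once per executor JVM,
// so all tasks on that executor share one pool instead of opening fresh
// connections per record.
object ConnectionPool {
  lazy val dataSource: HikariDataSource = {
    val config = new HikariConfig()
    config.setJdbcUrl("jdbc:oracle:thin:@//dbhost:1521/ORCL") // placeholder
    config.setUsername("app_user")                            // placeholder
    config.setPassword("app_password")                        // placeholder
    config.setMaximumPoolSize(10)
    new HikariDataSource(config)
  }
}

def saveToOracle(rdd: RDD[String]): Unit = {
  rdd.foreachPartition { records =>
    // Borrow one connection per partition, not per record.
    val conn: Connection = ConnectionPool.dataSource.getConnection()
    try {
      val stmt = conn.prepareStatement(
        "INSERT INTO results (value) VALUES (?)") // placeholder table
      records.foreach { r =>
        stmt.setString(1, r)
        stmt.executeUpdate()
      }
      stmt.close()
    } finally {
      conn.close() // hands the connection back to the pool
    }
  }
}

The same object-with-lazy-val trick works inside foreachRDD for streaming
jobs, which is what the linked section walks through.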
On Thu, Apr 2, 2015 at 7:52 AM, Sateesh Kavuri <[email protected]>
wrote:
> Right, I am aware of how to use connection pooling with Oracle, but the
> specific question is how to use it in the context of Spark job execution.
> On 2 Apr 2015 17:41, "Ted Yu" <[email protected]> wrote:
>
>> http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm
>>
>> The question doesn't seem to be Spark-specific, btw.
>>
>> > On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri <[email protected]>
>> > wrote:
>> >
>> > Hi,
>> >
>> > We have a case where we will have to run concurrent jobs (for the same
>> > algorithm) on different data sets. These jobs can run in parallel, and
>> > each of them will fetch its data from the database.
>> > We would like to optimize the database connections by making use of
>> > connection pooling. Any suggestions / best-known ways to achieve
>> > this? The database in question is Oracle.
>> >
>> > Thanks,
>> > Sateesh
>>
>