I get TD's recommendation of sharing a connection among tasks. Now, is there a 
good way to determine when to close connections? 
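Concretely, is closing at the end of each partition the right granularity, as in
this rough sketch? (I'm not sure this is exactly the intended pattern; the JDBC
URL is made up, and I'm assuming lines is a JavaDStream<String> with the usual
java.sql, java.util, and Spark Java function imports.)

    lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
      @Override
      public Void call(JavaRDD<String> rdd) {
        rdd.foreachPartition(new VoidFunction<Iterator<String>>() {
          @Override
          public void call(Iterator<String> records) throws Exception {
            // One connection per partition, created on the executor itself,
            // so nothing connection-like ever needs to be serialized.
            Connection conn = DriverManager.getConnection("jdbc:..."); // made-up URL
            try {
              while (records.hasNext()) {
                // insert records.next() here, e.g. via a PreparedStatement
              }
            } finally {
              conn.close(); // closed as soon as this partition's work is done
            }
          }
        });
        return null;
      }
    });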

Gino B.

> On Jul 17, 2014, at 7:05 PM, Yan Fang <yanfang...@gmail.com> wrote:
> 
> Hi Sean,
> 
> Thank you. I see your point. What I was thinking was to do the computation in a
> distributed fashion and do the storing from a single place. But you are
> right, having multiple DB connections is actually fine.
> 
> Thanks for answering my questions. That helps me understand the system.
> 
> Cheers,
> 
> Fang, Yan
> yanfang...@gmail.com
> +1 (206) 849-4108
> 
> 
>> On Thu, Jul 17, 2014 at 2:53 PM, Sean Owen <so...@cloudera.com> wrote:
>> On Thu, Jul 17, 2014 at 10:39 PM, Yan Fang <yanfang...@gmail.com> wrote:
>> > Thank you for the help. If I use TD's approach, it works and there is no
>> > exception. The only drawback is that it will create many connections to the DB,
>> > which I was trying to avoid.
>> 
>> Connection-like objects aren't data that can be serialized. What would
>> it mean to share one connection with N workers? That they all connect
>> back to the driver and go through one DB connection there? That defeats
>> the purpose of distributed computing. You want multiple DB
>> connections. You can limit the number of partitions if needed.
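>> 
>> For example, something along these lines (just a sketch; the number is
>> arbitrary) keeps it to roughly 4 connections per batch, since each
>> partition opens one connection:
>> 
>>     rdd.coalesce(4).foreachPartition(...); // fewer partitions => fewer simultaneous connections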
>> 
>> 
>> > Here is a snapshot of my code. The important code is marked in red. What I
>> > was thinking is that, if I call the collect() method, Spark Streaming will
>> > bring the data to the driver and then the db object does not need to be
>> > sent
>> 
>> The Function you pass to foreachRDD() has a reference to db though.
>> That's what is causing it to be serialized.
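>> 
>> I.e., the code is roughly this shape (names made up):
>> 
>>     final DbClient db = new DbClient(); // hypothetical, non-serializable client
>>     lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>       public Void call(JavaRDD<String> rdd) {
>>         // collect() does bring the data back to the driver, but this
>>         // function object itself holds a reference to db, so Spark's
>>         // attempt to serialize the function pulls db in with it.
>>         for (String record : rdd.collect()) {
>>           db.insert(record);
>>         }
>>         return null;
>>       }
>>     });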
>> 
>> > to executors. My observation is that, though exceptions are thrown, the
>> > insert function still works. Any thought about that? I've also pasted the log
>> > in case it helps: http://pastebin.com/T1bYvLWB
>> 
>> Any executors that run locally might skip the serialization and
>> succeed (?) but I don't think the remote executors can be succeeding.
> 
