I get TD's recommendation of sharing a connection among tasks. Now, is there a good way to determine when to close connections?
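For reference, my reading of that recommendation is roughly the sketch below (Scala). ConnectionPool here is a hypothetical, hand-rolled per-executor pool, and the JDBC URL and credentials are placeholders, not anything from real code. With this setup the per-batch question goes away: each partition borrows a connection and returns it, and the real close only happens when the executor JVM shuts down.

import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical per-executor pool: the object is initialized lazily inside each
// executor JVM, so it is never serialized and shipped from the driver.
object ConnectionPool {
  private val pool = new ConcurrentLinkedQueue[Connection]()

  // Connections are really closed only when the executor JVM exits.
  sys.addShutdownHook {
    var c = pool.poll()
    while (c != null) {
      if (!c.isClosed) c.close()
      c = pool.poll()
    }
  }

  def borrow(): Connection =
    Option(pool.poll()).getOrElse(
      DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "secret"))

  def giveBack(c: Connection): Unit = pool.offer(c)
}

object StreamToDb {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("StreamToDb"), Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // Runs on an executor: one connection per partition, reused for all
        // records, then returned to the pool instead of being closed.
        val conn = ConnectionPool.borrow()
        try {
          val stmt = conn.prepareStatement("INSERT INTO events(line) VALUES (?)")
          records.foreach { line => stmt.setString(1, line); stmt.executeUpdate() }
          stmt.close()
        } finally {
          ConnectionPool.giveBack(conn)
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Does that match what was intended, or is closing per partition (open in foreachPartition, close in a finally block) the expected lifecycle?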
Gino B.

> On Jul 17, 2014, at 7:05 PM, Yan Fang <yanfang...@gmail.com> wrote:
>
> Hi Sean,
>
> Thank you. I see your point. What I was thinking was to do the computation in
> a distributed fashion and do the storing from a single place. But you are
> right, having multiple DB connections actually is fine.
>
> Thanks for answering my questions. That helps me understand the system.
>
> Cheers,
>
> Fang, Yan
> yanfang...@gmail.com
> +1 (206) 849-4108
>
>
>> On Thu, Jul 17, 2014 at 2:53 PM, Sean Owen <so...@cloudera.com> wrote:
>> On Thu, Jul 17, 2014 at 10:39 PM, Yan Fang <yanfang...@gmail.com> wrote:
>> > Thank you for the help. If I use TD's approach, it works and there is no
>> > exception. The only drawback is that it creates many connections to the
>> > DB, which I was trying to avoid.
>>
>> Connection-like objects aren't data that can be serialized. What would it
>> mean to share one connection with N workers? That they all connect back to
>> the driver, and go through one DB connection there? This defeats the purpose
>> of distributed computing. You want multiple DB connections. You can limit
>> the number of partitions if needed.
>>
>>
>> > Here is a snapshot of my code; the important parts are marked in red. What
>> > I was thinking is that if I call the collect() method, Spark Streaming
>> > will bring the data to the driver, and then the db object does not need to
>> > be sent
>>
>> The Function you pass to foreachRDD() has a reference to db, though. That's
>> what is making it be serialized.
>>
>> > to executors. My observation is that, though exceptions are thrown, the
>> > insert function still works. Any thoughts about that? I've also pasted the
>> > log in case it helps: http://pastebin.com/T1bYvLWB
>>
>> Any executors that run locally might skip the serialization and succeed (?),
>> but I don't think the remote executors can be succeeding.
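To make Sean's point concrete, here is an illustrative Scala sketch only (DbWriter, its methods, and the JDBC URL are hypothetical stand-ins, not the actual code from the pastebin): the closure handed to the RDD action captures whatever it references, so a driver-side client object gets dragged into task serialization, while a connection created inside foreachPartition never leaves the executor.

import org.apache.spark.streaming.dstream.DStream

// Hypothetical database client; note that it is NOT Serializable.
class DbWriter(url: String) {
  def insert(record: String): Unit = println(s"INSERT $record")
  def close(): Unit = println("closed " + url)
}

object WritePatterns {

  // Problematic: `db` lives on the driver, but the function passed to the RDD
  // action references it, so Spark must serialize it to ship the task --
  // which is where the NotSerializableException comes from.
  def broken(stream: DStream[String], db: DbWriter): Unit =
    stream.foreachRDD { rdd =>
      rdd.foreach(record => db.insert(record)) // closure captures `db`
    }

  // Recommended: nothing DB-related is captured from the driver; each
  // executor opens its own connection per partition and closes it when done.
  def working(stream: DStream[String]): Unit =
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val db = new DbWriter("jdbc:postgresql://dbhost/mydb") // created on the executor
        try records.foreach(db.insert)
        finally db.close()
      }
    }
}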