Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy "DatabaseConnection"

2015-06-12 Thread algermissen1971
On 12 Jun 2015, at 23:19, Cody Koeninger wrote:
> A. No, it's called once per partition. Usually you have more partitions than
> executors, so it will end up getting called multiple times per executor. But
> you can use a lazy val, singleton, etc. to make sure the setup only takes
> place o

Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy "DatabaseConnection"

2015-06-12 Thread Cody Koeninger
A. No, it's called once per partition. Usually you have more partitions than executors, so it will end up getting called multiple times per executor. But you can use a lazy val, singleton, etc. to make sure the setup only takes place once per JVM.
B. I can't speak to the specifics there ... but a
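The "once per JVM" point can be illustrated without Spark. The following is only a minimal sketch: `Pool`, `SomeDb`, `initCount` and `processPartition` are hypothetical names made up for illustration (only `SomeDb.conn.init` appears in the thread), and the per-partition calls are simulated with plain iterators rather than a real RDD.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a real connection pool; not an actual database API.
final class Pool {
  def init(): Unit = () // no-op: merely touching SomeDb.conn forces the setup
}

object SomeDb {
  // Counts how often the expensive setup really runs.
  val initCount = new AtomicInteger(0)

  // A lazy val inside an object is initialized at most once per JVM,
  // and Scala makes that initialization thread-safe.
  lazy val conn: Pool = {
    initCount.incrementAndGet()
    new Pool
  }
}

object LazyInitDemo {
  // Plays the role of the function passed to mapPartitions:
  // it is invoked once per partition and returns the iterator untouched.
  def processPartition(iter: Iterator[Int]): Iterator[Int] = {
    SomeDb.conn.init()
    iter
  }

  def main(args: Array[String]): Unit = {
    // Three simulated partitions, as if one executor handled all of them.
    val partitions = Seq(Iterator(1, 2), Iterator(3, 4), Iterator(5))
    val elements = partitions.flatMap(p => processPartition(p).toList)
    println(s"elements seen: ${elements.size}, setups run: ${SomeDb.initCount.get}")
  }
}
```

Although `processPartition` runs three times here, the body of the lazy val runs only on the first access, so the setup counter stays at 1.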

Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy "DatabaseConnection"

2015-06-12 Thread algermissen1971
On 12 Jun 2015, at 22:59, Cody Koeninger wrote:
> Close. The mapPartitions call doesn't need to do anything at all to the iter.
>
> mapPartitions { iter =>
>   SomeDb.conn.init
>   iter
> }

Yes, thanks! Maybe you can confirm two more things, and then you will have helped me make a giant leap today: a

Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy "DatabaseConnection"

2015-06-12 Thread Cody Koeninger
Close. The mapPartitions call doesn't need to do anything at all to the iter.

mapPartitions { iter =>
  SomeDb.conn.init
  iter
}

On Fri, Jun 12, 2015 at 3:55 PM, algermissen1971 wrote:
> Cody,
>
> On 12 Jun 2015, at 17:26, Cody Koeninger wrote:
>
> > There are several database APIs that use

Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy "DatabaseConnection"

2015-06-12 Thread algermissen1971
Cody,

On 12 Jun 2015, at 17:26, Cody Koeninger wrote:
> There are several database APIs that use a thread-local or singleton
> reference to a connection pool (we use ScalikeJDBC currently, but there are
> others).
>
> You can use mapPartitions earlier in the chain to make sure the connection

Re: Spark Streaming, updateStateByKey and mapPartitions() - and lazy "DatabaseConnection"

2015-06-12 Thread Cody Koeninger
There are several database APIs that use a thread-local or singleton reference to a connection pool (we use ScalikeJDBC currently, but there are others). You can use mapPartitions earlier in the chain to make sure the connection pool is set up on that executor, then use it inside updateStateByKey
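A rough sketch of how this could look in a Spark Streaming job, under several assumptions: `Pool`, `SomeDb` and `save` are hypothetical placeholders rather than ScalikeJDBC's API, the socket source on `localhost:9999`, the `local[2]` master and the checkpoint path are arbitrary choices, and the Spark core and streaming artifacts are assumed to be on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical stand-in for a pooled connection; swap in a real client.
final class Pool {
  def init(): Unit = ()
  def save(total: Int): Unit = println(s"saving running total $total")
}

object SomeDb {
  // Built lazily, at most once per executor JVM, on first access.
  lazy val conn: Pool = new Pool
}

object StatefulWriteSketch {
  // Update function for updateStateByKey: adds this batch's values to the
  // previous state and writes the new total through the shared pool.
  def updateCount(values: Seq[Int], state: Option[Int]): Option[Int] = {
    val total = state.getOrElse(0) + values.sum
    SomeDb.conn.save(total)
    Some(total)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stateful-write-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/spark-checkpoint") // updateStateByKey needs a checkpoint directory

    val words = ssc.socketTextStream("localhost", 9999)

    // Touch the pool once per partition early in the chain; the iterator
    // itself is passed through unchanged.
    val pairs = words
      .mapPartitions { iter =>
        SomeDb.conn.init()
        iter
      }
      .map(word => (word, 1))

    val totals = pairs.updateStateByKey[Int](updateCount _)
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because `SomeDb` is an object, it is not shipped with the closure; each executor JVM resolves it locally. If the update function happens to run on an executor where the earlier `mapPartitions` step did not, the first access to `SomeDb.conn` there triggers the same lazy setup.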

Spark Streaming, updateStateByKey and mapPartitions() - and lazy "DatabaseConnection"

2015-06-12 Thread algermissen1971
Hi, I have a scenario with Spark Streaming where I need to write to a database from within updateStateByKey[1]. That means that inside my update function I need a connection. So far I have understood that I should create a new (lazy) connection for every partition. But since I am not working