Re: How to use data from Database and reload every hour

2015-11-05 Thread Sabarish Sasidharan
Theoretically the executor is a long-lived container, so you could use a
simple caching library or a simple singleton to cache the data in your
executors once they load it from MySQL. But note that with lots of
executors you might choke your MySQL server.
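
A minimal sketch of that idea in Java (the connection URL, credentials,
and the "rules" table layout are hypothetical placeholders; it assumes
the MySQL JDBC driver is on the executors' classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

// One copy per executor JVM; reloads from MySQL at most once an hour.
public final class ValidationCache {
    private static final long TTL_MS = 60L * 60 * 1000; // one hour
    private static volatile Map<String, String> rules;
    private static volatile long loadedAt;

    private ValidationCache() {}

    public static Map<String, String> get() throws SQLException {
        long now = System.currentTimeMillis();
        if (rules == null || now - loadedAt > TTL_MS) {
            synchronized (ValidationCache.class) {
                // Re-check inside the lock so only one task reloads.
                if (rules == null || now - loadedAt > TTL_MS) {
                    rules = loadFromMySql();
                    loadedAt = now;
                }
            }
        }
        return rules;
    }

    // Hypothetical table rules(field_a, target_process).
    private static Map<String, String> loadFromMySql() throws SQLException {
        Map<String, String> m = new HashMap<>();
        try (Connection c = DriverManager.getConnection(
                 "jdbc:mysql://dbhost:3306/validation", "user", "secret");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                 "SELECT field_a, target_process FROM rules")) {
            while (rs.next()) {
                m.put(rs.getString(1), rs.getString(2));
            }
        }
        return m;
    }
}

With this, each executor JVM pays at most one MySQL round trip per hour,
no matter how many tasks it runs, which keeps the load on MySQL bounded
by the number of executors rather than the number of tasks.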

Regards
Sab


Re: How to use data from Database and reload every hour

2015-11-05 Thread Adrian Tanase
You should look at .transform: it's a powerful transformation (sic) that
allows you to dynamically load resources, and it gets executed in every
micro-batch.

Re-broadcasting something should be possible from inside transform, as that
code is executed on the driver, but it's still a controversial topic: you
probably need to create a NEW broadcast variable instead of updating the
existing one.
http://search-hadoop.com/?q=transform+update+broadcast&page=2&fc_project=Spark
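
A rough sketch of that pattern in Java (loadRulesFromMySql() and matches()
are hypothetical helpers standing in for the actual MySQL query and the
JSON field check):

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.api.java.JavaDStream;

public class RuleRouter {
    private static final long TTL_MS = 60L * 60 * 1000; // one hour
    // Driver-side state: transform() runs this function on the driver.
    private static Broadcast<Map<String, String>> rulesBc;
    private static long loadedAt;

    public static JavaDStream<String> routeToA(JavaSparkContext jsc,
                                               JavaDStream<String> messages) {
        return messages.transform(rdd -> {
            long now = System.currentTimeMillis();
            if (rulesBc == null || now - loadedAt > TTL_MS) {
                if (rulesBc != null) {
                    rulesBc.unpersist(); // drop the stale broadcast
                }
                // Create a NEW broadcast variable each hour.
                rulesBc = jsc.broadcast(loadRulesFromMySql());
                loadedAt = now;
            }
            final Broadcast<Map<String, String>> bc = rulesBc;
            return rdd.filter(json -> matches(bc.value(), json));
        });
    }

    private static Map<String, String> loadRulesFromMySql() {
        return new HashMap<>(); // hypothetical: JDBC query against MySQL
    }

    private static boolean matches(Map<String, String> rules, String json) {
        return rules.containsKey(json); // hypothetical: parse JSON, look up fields
    }
}

Keep in mind the mutable state lives on the driver, so this pattern does not
survive recovery from a checkpoint.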

An alternative is to load the filters from MySQL and apply them implicitly
inside the .transform via rdd.filter instead of broadcasting them to the
executors. See this thread:
http://search-hadoop.com/m/q3RTt2UD6KyBO5M1&subj=Re+Streaming+checkpoints+and+logic+change
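
A corresponding sketch, reusing the hypothetical helpers from the example
above. The captured rule map is serialized into the filter closure and
shipped with the tasks, which stays cheap while the rule set is only a few
thousand entries:

JavaDStream<String> processA = messages.transform(rdd -> {
    // Runs on the driver every micro-batch; in practice, cache the
    // result with an hourly TTL instead of querying MySQL each batch.
    final Map<String, String> rules = loadRulesFromMySql();
    return rdd.filter(json -> matches(rules, json));
});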

Hope this helps,
-adrian




How to use data from Database and reload every hour

2015-11-05 Thread Kay-Uwe Moosheimer
I have the following problem.
We have MySQL and a Spark cluster.
We need to load five different sets of validation instructions (several
thousand entries each) and use this information on the executors to decide
whether content from Kafka Streaming belongs to process A or process B.
The streaming data from Kafka are JSON messages, and the validation info
from MySQL says "if field a is this and field b is that, then process A"
and so on.

The tables in MySQL change over time, and we have to reload the data
every hour.
I tried broadcasting, where I load the data and store it in HashSets
and HashMaps (Java code), but it's not possible to redistribute the data.

What would be the best way to solve my problem?
Use native JDBC in the executor task and load the data there? Can the
executor store this data in HashSets etc. for the next call, so that I only
load the data every hour?
Or are there other possibilities?