Thanks Cody. Are you suggesting putting the cache in global context in each
executor JVM (in a Scala object, for example), and then having a scheduled
task refresh the cache, or letting the refresh be triggered by expiry if
using Guava?
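Something like this is what I have in mind (a rough sketch only;
loadReferenceTable() is a made-up placeholder for the actual JDBC read):

    import java.util.concurrent.TimeUnit

    import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

    // A Scala object is a singleton per JVM, so each executor gets one cache.
    object ReferenceCache {
      private val cache: LoadingCache[String, Map[String, String]] =
        CacheBuilder.newBuilder()
          .expireAfterWrite(6, TimeUnit.HOURS) // entries reload lazily after expiry
          .build(new CacheLoader[String, Map[String, String]]() {
            override def load(table: String): Map[String, String] =
              loadReferenceTable(table)
          })

      def get(table: String): Map[String, String] = cache.get(table)

      // made-up helper: issues one JDBC query and builds the lookup map
      private def loadReferenceTable(table: String): Map[String, String] = ???
    }

Workers would then call ReferenceCache.get("my_ref_table") inside
mapPartitions, so each executor hits MySQL at most once per expiry window.
Is that what you meant?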
Chen

On Wed, Aug 26, 2015 at 10:51 AM, Cody Koeninger <c...@koeninger.org> wrote:

> If your data only changes every few days, why not restart the job every
> few days, and just broadcast the data?
>
> Or you can keep a local per-JVM cache with an expiry (e.g. a Guava cache)
> to avoid many MySQL reads.
>
> On Wed, Aug 26, 2015 at 9:46 AM, Chen Song <chen.song...@gmail.com> wrote:
>
>> Piggybacking on this question.
>>
>> I have a similar use case, but a bit different. My job consumes a stream
>> from Kafka, and I need to join the Kafka stream with a reference table
>> from MySQL (a kind of data validation and enrichment). I need to process
>> this stream every 1 min. The data in MySQL does not change very often,
>> maybe once every few days.
>>
>> So my requirements are:
>>
>> * I cannot easily use a broadcast variable, because the data does change,
>> although not very often.
>> * I am not sure it is good practice to read data from MySQL in every
>> batch (in my case, every 1 min).
>>
>> Has anyone done this before? Any suggestions and feedback are appreciated.
>>
>> Chen
>>
>> On Sun, Jul 5, 2015 at 11:50 AM, Ashic Mahtab <as...@live.com> wrote:
>>
>>> If it is indeed a reactive use case, then Spark Streaming would be a
>>> good choice.
>>>
>>> One approach worth considering: is it possible to receive a message via
>>> Kafka (or some other queue)? That would not need any polling, and you
>>> could use standard consumers. If polling isn't an issue, then writing a
>>> custom receiver will work fine. The way a receiver works is this:
>>>
>>> * Your receiver has a receive() function, where you'd typically start a
>>> loop. In your loop, you'd fetch items and call store(entry).
>>> * You control everything in the receiver. If you're listening on a
>>> queue, you receive messages, store(), and ack your queue. If you're
>>> polling, it's up to you to ensure delays between db calls.
>>> * The things you store() go on to make up the RDDs in your DStream. So
>>> intervals, windowing, etc. apply to those. The receiver is the boundary
>>> between your data source and the DStream RDDs. In other words, if your
>>> interval is 15 seconds with no windowing, then the things that went to
>>> store() every 15 seconds are bunched up into an RDD of your DStream.
>>> That's kind of a simplification, but it should give you the idea that
>>> your "db polling" interval and streaming interval are not tied together.
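>>>
>>> To make that concrete, a skeleton receiver might look like this
>>> (illustrative only; fetchNewRows() below is a made-up stand-in for
>>> your own polling code):
>>>
>>>     import org.apache.spark.storage.StorageLevel
>>>     import org.apache.spark.streaming.receiver.Receiver
>>>
>>>     class PollingReceiver(pollIntervalMs: Long)
>>>       extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
>>>
>>>       def onStart(): Unit = {
>>>         // Spark calls onStart() once; run the polling loop on its own thread.
>>>         new Thread("poller") {
>>>           override def run(): Unit = receive()
>>>         }.start()
>>>       }
>>>
>>>       def onStop(): Unit = {} // the loop checks isStopped() and exits on its own
>>>
>>>       private def receive(): Unit = {
>>>         while (!isStopped()) {
>>>           fetchNewRows().foreach(row => store(row)) // store() feeds the DStream
>>>           Thread.sleep(pollIntervalMs) // the delay between db calls is up to you
>>>         }
>>>       }
>>>
>>>       // made-up helper: one poll against the source, returning the new items
>>>       private def fetchNewRows(): Seq[String] = ???
>>>     }
>>>
>>> You'd then wire it up with ssc.receiverStream(new PollingReceiver(60000))
>>> (ssc being your StreamingContext) to get the DStream.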
>>>
>>> -Ashic.
>>>
>>> ------------------------------
>>> Date: Mon, 6 Jul 2015 01:12:34 +1000
>>> Subject: Re: JDBC Streams
>>> From: guha.a...@gmail.com
>>> To: as...@live.com
>>> CC: ak...@sigmoidanalytics.com; user@spark.apache.org
>>>
>>> Hi
>>>
>>> Thanks for the reply. Here is my situation: I have a DB which enables
>>> synchronous CDC; think of it as a DB trigger which writes the "changed"
>>> values to a table as soon as something changes in the production table.
>>> My job will need to pick up the data "as soon as it arrives", which can
>>> be at every 1 min interval. Ideally it will pick up the changes,
>>> transform them into JSON, and put them to Kinesis. In short, I am
>>> emulating a Kinesis producer with a DB source (don't even ask why;
>>> let's say these are the constraints :) ).
>>>
>>> Please advise: (a) is Spark a good choice here, and (b) what's your
>>> suggestion either way?
>>>
>>> I understand I can easily do it using a simple Java/Python app, but I
>>> am a little worried about managing scaling/fault tolerance, and that's
>>> where my concern is.
>>>
>>> TIA
>>> Ayan
>>>
>>> On Mon, Jul 6, 2015 at 12:51 AM, Ashic Mahtab <as...@live.com> wrote:
>>>
>>> Hi Ayan,
>>> How "continuous" is your workload? As Akhil points out, with streaming
>>> you'll give up at least one core for receiving, and will need at least
>>> one more core for processing. Unless you're running on something like
>>> Mesos, this means that those cores are dedicated to your app and can't
>>> be leveraged by other apps/jobs.
>>>
>>> If it's something periodic (once an hour, once every 15 minutes, etc.),
>>> then I'd simply write a "normal" Spark application and trigger it
>>> periodically. There are many things that can take care of that;
>>> sometimes a simple cron job is enough!
>>>
>>> ------------------------------
>>> Date: Sun, 5 Jul 2015 22:48:37 +1000
>>> Subject: Re: JDBC Streams
>>> From: guha.a...@gmail.com
>>> To: ak...@sigmoidanalytics.com
>>> CC: user@spark.apache.org
>>>
>>> Thanks Akhil. In case I go with Spark Streaming, I guess I have to
>>> implement a custom receiver, and Spark Streaming will call this
>>> receiver every batch interval, is that correct? Any gotchas you see in
>>> this plan? TIA... Best, Ayan
>>>
>>> On Sun, Jul 5, 2015 at 5:40 PM, Akhil Das <ak...@sigmoidanalytics.com>
>>> wrote:
>>>
>>> If you want a long-running application, then go with Spark Streaming
>>> (which kind of blocks your resources). On the other hand, if you use a
>>> job server, then you can actually use the resources (CPUs) for other
>>> jobs as well when your DB job is not using them.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Sun, Jul 5, 2015 at 5:28 AM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>> Hi All
>>>
>>> I have a requirement to connect to a DB every few minutes and bring
>>> data to HBase. Can anyone suggest whether Spark Streaming would be
>>> appropriate for this scenario, or should I look into a job server?
>>>
>>> Thanks in advance
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>>
>> --
>> Chen Song
>>
>
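P.S. For completeness, the periodic "normal Spark application" route Ashic
suggests downthread would look roughly like this (a sketch only; every
connection detail and name below is invented), with cron or any other
scheduler launching it through spark-submit every few minutes:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.{SparkConf, SparkContext}

    object DbSyncJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("db-sync"))
        val sqlContext = new SQLContext(sc)

        // Pull the changed rows over JDBC; all options here are placeholders.
        val changes = sqlContext.read
          .format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/mydb")
          .option("dbtable", "cdc_changes")
          .option("user", "app")
          .option("password", "secret")
          .load()

        // Transform to JSON and hand off; swap in whatever HBase/Kinesis
        // write the job actually needs.
        changes.toJSON.saveAsTextFile(
          "hdfs:///staging/changes/" + System.currentTimeMillis())

        sc.stop()
      }
    }

--
Chen Song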