what's your data format? ORC or CSV or others?

val keys = sqlContext.read.orc("your previous batch data path")
  .select($"uniq_key").collect.map(_.getString(0)).toSet
val broadCast = sc.broadcast(keys)
val rdd = your_current_batch_data
rdd.filter(line => !broadCast.value.contains(line.key))

> On Dec 8, 2015, at 4:44 PM, Ramkumar V <ramkumar.c...@gmail.com> wrote:
>
> I'm running a Spark batch job in cluster mode every hour, and it runs for 15
> minutes. I have certain unique keys in the dataset. I don't want to process
> those keys during my next hour's batch.
>
> Thanks,
>
> <https://in.linkedin.com/in/ramkumarcs31>
>
>
> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu <fengdo...@everstring.com
> <mailto:fengdo...@everstring.com>> wrote:
> Can you detail your question? What do your previous batch and the
> current batch look like?
>
>
>> On Dec 8, 2015, at 3:52 PM, Ramkumar V <ramkumar.c...@gmail.com
>> <mailto:ramkumar.c...@gmail.com>> wrote:
>>
>> Hi,
>>
>> I'm running Java over Spark in cluster mode. I want to apply a filter on a
>> JavaRDD based on some previous batch values. If I store those values in
>> MapDB, is it possible to apply the filter during the current batch?
>>
>> Thanks,
>>
>> <https://in.linkedin.com/in/ramkumarcs31>
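Since Ramkumar is running Java over Spark, the same pattern — collect the previous batch's keys, share them with the workers, and filter the current batch against them — can be sketched in plain Java without a cluster. Here `previousKeys` stands in for the broadcast variable's value and `currentBatch` for the hourly data; both names (and the sample keys) are hypothetical placeholders, not part of the original code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class BatchKeyFilter {
    public static void main(String[] args) {
        // Unique keys already processed in the previous hourly batch
        // (in Spark this would be the broadcast variable's value).
        Set<String> previousKeys = new HashSet<>(Arrays.asList("k1", "k2"));

        // Records arriving in the current batch, identified by their unique key.
        List<String> currentBatch = Arrays.asList("k1", "k3", "k4", "k2");

        // Keep only records whose key was NOT seen in the previous batch --
        // the same predicate as the Scala filter above, written as a Java lambda.
        List<String> toProcess = currentBatch.stream()
                .filter(key -> !previousKeys.contains(key))
                .collect(Collectors.toList());

        System.out.println(toProcess); // [k3, k4]
    }
}
```

A `Set` matters here: membership tests inside the filter run once per record, and `HashSet.contains` is O(1), whereas scanning a collected `Array[Row]` per record would be O(n) per lookup.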