What's your data format? ORC, CSV, or something else?

// Collect the previous batch's unique keys on the driver
// (assuming uniq_key is a string column).
val keys = sqlContext.read.orc("your previous batch data path")
  .select($"uniq_key").collect().map(_.getString(0)).toSet

// Broadcast the key set so each executor gets one read-only copy.
val broadcastKeys = sc.broadcast(keys)

// Keep only records whose key was not seen in the previous batch.
val rdd = your_current_batch_data
val filtered = rdd.filter(line => !broadcastKeys.value.contains(line.key))
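Since you mentioned you're on the Java API: here is a minimal sketch of the same broadcast-and-filter approach in Java (Spark 1.x style). MyRecord, getKey(), currentBatch, and the paths are placeholders for illustration; sqlContext and sc are assumed to be your SQLContext and JavaSparkContext.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// Collect the previous batch's unique keys on the driver.
DataFrame previous = sqlContext.read().orc("your previous batch data path");
List<Row> rows = previous.select("uniq_key").collectAsList();
Set<String> keys = new HashSet<String>();
for (Row row : rows) {
    keys.add(row.getString(0));
}

// Broadcast the key set so each executor holds one read-only copy.
final Broadcast<Set<String>> broadcastKeys = sc.broadcast(keys);

// Keep only records whose key was not seen in the previous batch.
JavaRDD<MyRecord> filtered = currentBatch.filter(
    record -> !broadcastKeys.value().contains(record.getKey()));

This works as long as the key set fits comfortably in driver and executor memory; for a very large key set, an anti-join against the previous batch (or a Bloom filter) may be a better fit.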

> On Dec 8, 2015, at 4:44 PM, Ramkumar V <ramkumar.c...@gmail.com> wrote:
> 
> I'm running a Spark batch job in cluster mode every hour, and it runs for 15 
> minutes. I have certain unique keys in the dataset, and I don't want to 
> process those keys during the next hour's batch.
> 
> Thanks,
> 
> 
> 
> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu <fengdo...@everstring.com> wrote:
> Could you give more detail on your question? What do your previous batch and 
> your current batch look like?
> 
> 
> 
> 
> 
>> On Dec 8, 2015, at 3:52 PM, Ramkumar V <ramkumar.c...@gmail.com> wrote:
>> 
>> Hi,
>> 
>> I'm running Java over Spark in cluster mode. I want to filter a JavaRDD 
>> based on some values from a previous batch. If I store those values in 
>> MapDB, is it possible to apply the filter during the current batch?
>> 
>> Thanks,
>> 
>> 
> 
> 
