Re: spark as a lookup engine for dedup

2015-07-27 Thread Shushant Arora
It's for one day of events, on the order of 1 billion, and processing runs in a streaming
application with a ~10-15 second batch interval, so the lookup has to be fast. The RDD
would need to be updated with new events, and events older than the current time minus
24 hours would need to be removed on each batch.

So is a Spark RDD not a fit for this requirement?
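
The closest thing I can see in the streaming API is updateStateByKey. A rough, untested
sketch of what I mean (Spark 1.x Java API with Java 8 lambdas; "events" is an assumed
JavaPairDStream&lt;String, Long&gt; of (eventid, timestamp in ms), and checkpointing has to be
enabled):

import java.util.List;
import com.google.common.base.Optional;  // Spark 1.x Java API uses Guava's Optional
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Keep every eventid seen in the last 24 hours; returning absent()
// drops the key from the state, which is how old events expire.
static JavaPairDStream<String, Long> seenLast24h(JavaPairDStream<String, Long> events) {
  return events.updateStateByKey((List<Long> newTimestamps, Optional<Long> state) -> {
    long cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000;
    long latest = state.or(0L);
    for (long ts : newTimestamps) {
      latest = Math.max(latest, ts);
    }
    return latest < cutoff ? Optional.<Long>absent() : Optional.of(latest);
  });
}

The catch is that the whole state (every eventid from the last 24 hours) is rewritten on
every 10-15 second batch, which at ~1 billion keys may be exactly the inefficiency I'm
worried about.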



Re: spark as a lookup engine for dedup

2015-07-27 Thread Romi Kuntsman
An RDD is immutable; it cannot be changed, you can only create a new one from
data or from a transformation. It sounds inefficient to create a new one every 15
seconds covering the last 24 hours.
I think a key-value store would be a much better fit for this purpose.
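
For example, a rough, untested sketch of that approach with Redis via the Jedis client
(an arbitrary choice here; any store with per-key TTLs would do, and the Jedis 2.x-style
set(key, value, nxxx, expx, time) call is an assumption). SET with NX and EX gives an
atomic "insert if absent, expire after 24 hours", so the reply tells you whether an
eventid is new:

import redis.clients.jedis.Jedis;

public class DedupCheck {
  public static void main(String[] args) {
    try (Jedis jedis = new Jedis("localhost", 6379)) {
      String eventId = "event-12345";                  // placeholder id
      // NX = set only if the key is absent; EX = expire after 86400 s (24 h).
      String reply = jedis.set("dedup:" + eventId, "1", "NX", "EX", 86400L);
      if ("OK".equals(reply)) {
        System.out.println("new event - process it");  // key was created
      } else {
        System.out.println("seen in last 24 h - skip"); // reply is null
      }
    }
  }
}

One round trip per event is cheap, and Redis expires the keys for you, so there is no
24-hour cleanup job to run.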



Re: spark as a lookup engine for dedup

2015-07-27 Thread Romi Kuntsman
What is the throughput of processing, and for how long do you need to remember
duplicates?

You can take all the events, put them in an RDD, group them by key, and then
process each key only once.
But if you have a long-running application where you want to check, for every value,
that you didn't see the same value before, you probably need a key-value store, not
an RDD.
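
For the batch case, a minimal untested sketch (Java 8 lambdas; names and sample data
are placeholders). reduceByKey plays the role of "group by key and keep one", but it
combines map-side, so it is cheaper than groupByKey:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class BatchDedup {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("batch-dedup").setMaster("local[2]"));
    JavaPairRDD<String, Long> events = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>("e1", 100L),
        new Tuple2<>("e2", 120L),
        new Tuple2<>("e1", 300L)));                // "e1" appears twice
    // One record per eventid, keeping the earliest timestamp.
    JavaPairRDD<String, Long> distinct =
        events.reduceByKey((a, b) -> Math.min(a, b));
    System.out.println(distinct.collect());        // [(e1,100), (e2,120)], order may vary
    sc.stop();
  }
}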



spark as a lookup engine for dedup

2015-07-26 Thread Shushant Arora
Hi

I have a requirement to process a large volume of events while ignoring duplicates at
the same time.

Events are consumed from Kafka, and each event has an eventid. It may happen that an
event has already been processed and then arrives again at some other offset.

1. Can I use a Spark RDD to persist processed events and then look up against that RDD
while processing new events? (How do I do a lookup inside an RDD? I have a
JavaPairRDD<eventid, timestamp>.) If an event is present in the persisted RDD, ignore
it; otherwise, process the event. Will rdd.lookup(key) be efficient on a billion
events?

2. How do I update the RDD (since an RDD is immutable)?
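
To make both questions concrete, here is a rough, untested sketch of what I have in
mind (Java 8 lambdas; names, sizes, and the cutoff are placeholders):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RddDedupSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("rdd-dedup").setMaster("local[2]"));

    // (1) lookup: without a known partitioner, rdd.lookup(key) scans every
    // partition; with one, only the partition that owns the key is read.
    JavaPairRDD<String, Long> processed = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>("e1", 100L), new Tuple2<>("e2", 120L)))
        .partitionBy(new HashPartitioner(8))
        .cache();
    List<Long> hit = processed.lookup("e1");
    System.out.println(hit.isEmpty() ? "new event" : "duplicate");

    // (2) "update": build a new RDD each batch - union in the new ids and
    // filter out everything older than now minus 24 hours. Note the union
    // result loses the partitioner, so it would need re-partitioning.
    final long cutoff = 110L;                      // placeholder for now - 24 h
    JavaPairRDD<String, Long> updated = processed
        .union(sc.parallelizePairs(Arrays.asList(new Tuple2<>("e3", 130L))))
        .filter(t -> t._2() >= cutoff);
    System.out.println(updated.collect());
    sc.stop();
  }
}

Rebuilding and re-partitioning a billion-row RDD every 10-15 seconds is the part I'm
not sure about.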

Thanks