Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread andy petrella
Heya, dunno if these ideas are still in the air or fell into the warp ^^. However, there is a paper on avocado http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project8_report.pdf that mentions a way of working with their data (sequence reads) in a windowed manner without

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread Mayur Rustagi
Interesting, clickstream data would have its own window concept based on user sessions. I can imagine windows would change across streams, but wouldn't they largely be domain-specific in nature?

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread andy petrella
Actually, for clickstream, the user space wouldn't be a continuum, unless the order of users is important or the fact that they arrive in some kind of order can be used by the algorithm. The purpose of the break or binning function is to package things in a cluster for which we know the properties,

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread andy petrella
Heya TD, Thanks for the detailed answer! Much appreciated. Regarding order among elements within an RDD, you're definitely right: it would kill the parallelism and would require synchronization, which is completely avoided in a distributed environment. That's why I won't push this constraint to the RDDs

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread Tathagata Das
I think it makes sense, though without a concrete implementation it's hard to be sure. Applying sorting on the RDD according to the RDDs makes sense, but I can think of two kinds of fundamental problems. 1. How do you deal with ordering across RDD boundaries? Say two consecutive RDDs in the

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread andy petrella
Indeed, these two cases are tightly coupled (the first one is a special case of the second). Actually, these outliers could be handled by a dedicated function that I named outliersManager -- I was not very inspired ^^, but we could name these outliers "outlaws", and thus the function would be

[brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-15 Thread andy petrella
Dear Sparkers, [sorry for the lengthy email... head to the gist https://gist.github.com/andypetrella/12228eb24eea6b3e1389 for a preview :-p] I would like to share some thinking I had due to a use case I faced. Basically, as the subject announces, it's a generalization of the DStream

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-15 Thread Tathagata Das
Very interesting ideas, Andy! Conceptually I think it makes sense. In fact, it is true that dealing with time series data, windowing over application time, and windowing over number of events are things that DStream does not natively support. The real challenge is actually mapping the conceptual