Heya,
Dunno if these ideas are still in the air or felt in the warp ^^.
However, there is a paper on avocado
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project8_report.pdf
that mentions a way of working with their data (sequence reads) in a windowed
manner without
Interesting. Clickstream data would have its own window concept based on
the user's session. I can imagine windows would change across streams, but
wouldn't they largely be domain-specific in nature?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
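
To make the session-based window idea a bit more concrete, here is a minimal plain-Python sketch (no Spark involved). The event shape, the `sessionize` name, and the 30-minute inactivity gap are all assumptions chosen for illustration; none of them come from the thread or from Spark's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (user_id, timestamp_in_seconds, url)
Event = Tuple[str, int, str]


def sessionize(events: Iterable[Event], gap: int = 1800) -> Dict[str, List[List[Event]]]:
    """Group events per user, then open a new window whenever two consecutive
    events of the same user are more than `gap` seconds apart."""
    per_user: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        per_user[event[0]].append(event)

    sessions: Dict[str, List[List[Event]]] = {}
    for user, user_events in per_user.items():
        user_events.sort(key=lambda e: e[1])  # order by timestamp within one user
        windows: List[List[Event]] = []
        for event in user_events:
            if windows and event[1] - windows[-1][-1][1] <= gap:
                windows[-1].append(event)  # still inside the current session
            else:
                windows.append([event])    # inactivity gap exceeded: new window
        sessions[user] = windows
    return sessions


clicks = [
    ("alice", 0, "/home"),
    ("bob", 10, "/home"),
    ("alice", 60, "/cart"),
    ("alice", 4000, "/home"),
]
windows = sessionize(clicks, gap=1800)
```

Note that the window boundary here depends on the data itself (the gap between a user's events) rather than on a fixed batch interval, which is why such a window would be domain-specific.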
Actually, for clickstream, the user space wouldn't be a continuum, unless
the order of users is important or the fact that they arrive in some kind
of order can be used by the algo.
The purpose of the break or binning function is to package things in a
cluster for which we know the properties,
Heya TD,
Thanks for the detailed answer! Much appreciated.
Regarding order among elements within an RDD, you're definitely right:
it'd kill the parallelism and would require synchronization, which is completely
avoided in a distributed env.
That's why I won't push this constraint to the RDDs
I think it makes sense, though without a concrete implementation it's hard
to be sure. Applying sorting on the RDD according to the RDDs makes sense,
but I can think of two kinds of fundamental problems.
1. How do you deal with ordering across RDD boundaries? Say two consecutive
RDDs in the
Indeed, these two cases are tightly coupled (the first one is a special
case of the second).
Actually, these outliers could be handled by a dedicated function that I
named outliersManager -- I was not very inspired ^^ -- but we could call
these outliers outlaws, and thus the function would be
Dear Sparkers,
*[sorry for the lengthy email... => head to the gist
https://gist.github.com/andypetrella/12228eb24eea6b3e1389 for a preview
:-p]*
I would like to share some thoughts I had due to a use case I faced.
Basically, as the subject announces, it's a generalization of the
DStream
Very interesting ideas Andy!
Conceptually, I think it makes sense. In fact, it is true that dealing with
time series data, windowing over application time, and windowing over number of
events are things that DStream does not natively support. The real
challenge is actually mapping the conceptual