I would like to use Spark (and Spark Streaming) to process time series data.
I have text files with many lines, where each line contains a timestamp and
the values associated with it. Each timestamp is unique, and I use the
timestamps as keys. The lines in my text files are already sorted by
timestamp.

I am looking for a neat way to leverage this ordering in my Spark programs,
and all of my questions below are about this.

I am reading the files with "sc.textFile(..)" and applying transformations
such as .map(), .join(), etc. I am able to split my data (e.g. per day) with
a custom partitioner. However, at some point the initial ordering is
invariably lost. Currently, this forces me to call ".sortByKey()", but I have
the impression that this approach is far from optimal. I would prefer to
preserve the ordering information whenever possible, instead of losing it and
recomputing it later.
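
To illustrate what I observe, here is a minimal plain-Python sketch (not
Spark code; hash_partition, range_partition and concatenate are names
invented for this example). It assumes that a shuffle assigns keys to
partitions by hash, which as far as I understand is what .join() does by
default, and compares that with a range-based assignment.

```python
def hash_partition(keys, num_partitions):
    """Assign each key to partition (key mod num_partitions), in arrival order."""
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[key % num_partitions].append(key)
    return partitions


def range_partition(keys, boundaries):
    """Assign each key to the first range whose upper bound exceeds it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        # The partition index is the number of boundaries at or below the key.
        index = sum(1 for bound in boundaries if key >= bound)
        partitions[index].append(key)
    return partitions


def concatenate(partitions):
    """Read the partitions back one after another."""
    return [key for part in partitions for key in part]


timestamps = list(range(10))  # already sorted and unique, like my input

hashed = concatenate(hash_partition(timestamps, 3))
ranged = concatenate(range_partition(timestamps, [4, 7]))

print(hashed == sorted(hashed))  # False: consecutive timestamps are scattered
print(ranged == sorted(ranged))  # True: each partition holds a contiguous range
```

With the hash-style assignment the global order is gone after reading the
partitions back, while the range-style assignment keeps it. That is the
property I would like to be able to rely on.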

- Is there a description of which functions lose the order and which
preserve it? (As far as I understand, map() should preserve the order, for
instance.) I would like to understand when (and why) the order cannot be
preserved in Spark.

- I think that many distributed algorithms (e.g. joins) could be much faster
when taking advantage of the fact that keys are ordered. Is there a way to
specify this in Spark?

- I would like to implement algorithms that traverse my time series data in
order, with a sliding window over time, just as with "reduceByWindow()" in
Spark Streaming, but taking order into account. I need to compute
non-associative functions over these rolling windows. This seems difficult
with the current versions of Spark/Spark Streaming, without a notion of
order (therefore limiting computable functions to associative ones). Am I
missing something here?

- Are there recommended ways to deal with ordered data (and keys) such as
time series data in Spark/Spark Streaming? 
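
To make the sliding-window question concrete, here is a minimal
single-machine sketch in plain Python (not Spark API; rolling and smooth are
hypothetical names, and the window width and sample data are made up). It
shows the kind of order-dependent computation I have in mind: a left fold
over each time window whose step function is not associative.

```python
from collections import deque


def rolling(records, width, func):
    """Yield (timestamp, func(window_values)) for each record.

    `records` must be sorted by timestamp. The window at time t holds the
    values of all records with timestamp in (t - width, t], in order.
    """
    window = deque()
    for timestamp, value in records:
        window.append((timestamp, value))
        # Drop records that have fallen out of the window on the left.
        while window[0][0] <= timestamp - width:
            window.popleft()
        yield timestamp, func([v for _, v in window])


def smooth(values, alpha=0.5):
    """Exponential smoothing: the result depends on the order of the values."""
    acc = values[0]
    for value in values[1:]:
        acc = alpha * value + (1 - alpha) * acc
    return acc


series = [(1, 10.0), (2, 12.0), (3, 11.0), (5, 15.0), (6, 14.0)]
result = list(rolling(series, 3, smooth))
print(result)
# [(1, 10.0), (2, 11.0), (3, 11.0), (5, 13.0), (6, 14.5)]
```

This only works because the records are traversed in timestamp order, which
is exactly the guarantee I do not know how to obtain (or keep) across
partitions in Spark.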

Thank you for any hints.
Best regards

Pierre



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Time-series-in-Spark-Spark-Streaming-tp11775.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
