I want to use ARIMA for a predictive model so that I can take time series
data (metrics) and perform light anomaly detection. The time series data
is going to be bucketed into different time units (several minutes within
several hours, several hours within several days, several days within
several years).
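
To make the bucketing concrete, here's a rough sketch assuming
epoch-millisecond timestamps and a raw stream of (component, timestamp,
value) events (events and bucketOf are just illustrative names, not
anything from the Spark API):

// floor a timestamp to the start of its bucket
def bucketOf(timestampMs: Long, bucketMillis: Long): Long =
  (timestampMs / bucketMillis) * bucketMillis

// one summed metric per (component, minute bucket); dropping the
// bucket from the key afterwards yields the (key, value) tuples below
val bucketSums = events
  .map { case (component, ts, value) =>
    ((component, bucketOf(ts, 60 * 1000L)), value) }
  .reduceByKey(_ + _)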

I want to implement the algorithm in Spark Streaming. I'm used to "tuple
at a time" streaming and I'm having a bit of trouble gaining insight into
how exactly windows are managed inside of DStreams.
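
From what I can tell, a windowed DStream just re-emits, every slide
interval, one RDD covering the last windowDuration worth of batches,
something like:

import org.apache.spark.streaming.Seconds

// every hour, emit an RDD holding the last 24 hours of tuples
val last24h = dstream.window(Seconds(60 * 60 * 24), Seconds(60 * 60))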

Let's say I have a simple dataset marked by key/value tuples, where the
key is the name of the component whose metrics I want to run the
algorithm against and the value is a metric (a value representing a sum
for the time bucket). I want to create histograms of the time series data
for each key in the windows in which they reside, so that I can use each
histogram vector to generate my ARIMA prediction (actually, it seems like
this doesn't just apply to ARIMA but could apply to any sliding average).

I *think* my prediction code may look something like this:

val predictionAverages = dstream
  .groupByKeyAndWindow(Seconds(60 * 60 * 24), Seconds(60 * 60 * 24))
  .mapValues(applyARIMAFunction)  // grouped values are the histogram vector

That is, keep 24 hours' worth of metrics in each window (window duration
and slide duration both 24 hours, so the windows tumble) and use that for
the ARIMA prediction. The part I'm struggling with is how to join
together the actual values so that I can do my comparison against the
prediction model.
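
For the sake of discussion, applyARIMAFunction could be stubbed out as
any sliding average (a placeholder assuming Double-valued metrics; a
fitted ARIMA one-step forecast would slot into the same shape, history
in, forecast out):

def applyARIMAFunction(history: Iterable[Double]): Double = {
  // placeholder: plain mean over the window's histogram vector
  val values = history.toArray
  if (values.isEmpty) 0.0 else values.sum / values.length
}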

Let's say dstream contains the actual values. For any time window, I
should be able to take a previous set of windows and use the model to
compare against the current values.
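
The closest I've come up with is making the prediction stream slide
every batch so it can be joined per key with the fresh actuals. A
sketch, not a definitive answer: it assumes a 60-second batch interval,
and note the trailing window still includes the current batch:

val windowLength  = Seconds(60 * 60 * 24)  // 24 hours of history
val slideInterval = Seconds(60)            // re-fit once per batch

val predictions = dstream
  .groupByKeyAndWindow(windowLength, slideInterval)
  .mapValues(applyARIMAFunction)

// both streams now emit an RDD every batch, so they can be joined:
// each component's current value gets paired with the forecast from
// its trailing window, and the residual is what I'd flag on
val scored = dstream.join(predictions)
  .mapValues { case (actual, predicted) =>
    (actual, predicted, math.abs(actual - predicted)) }

If the forecast really has to come from the previous window only, I
suppose the values would need timestamps attached so applyARIMAFunction
can drop the newest bucket itself.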
