Surprised I haven't gotten any responses about this. Has anyone tried using rJava or FastR w/ Spark? I've seen the SparkR project but thta goes the other way- what I'd like to do is use R for model calculation and Spark to distribute the load across the cluster.
Also, has anyone used Scalation for ARIMA models? On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet <cjno...@gmail.com> wrote: > Taking out the complexity of the ARIMA models to simplify things- I can't > seem to find a good way to represent even standard moving averages in spark > streaming. Perhaps it's my ignorance with the micro-batched style of the > DStreams API. > > On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet <cjno...@gmail.com> wrote: > >> I want to use ARIMA for a predictive model so that I can take time series >> data (metrics) and perform a light anomaly detection. The time series data >> is going to be bucketed to different time units (several minutes within >> several hours, several hours within several days, several days within >> several years. >> >> I want to do the algorithm in Spark Streaming. I'm used to "tuple at a >> time" streaming and I'm having a tad bit of trouble gaining insight into >> how exactly the windows are managed inside of DStreams. >> >> Let's say I have a simple dataset that is marked by a key/value tuple >> where the key is the name of the component who's metrics I want to run the >> algorithm against and the value is a metric (a value representing a sum for >> the time bucket. I want to create histograms of the time series data for >> each key in the windows in which they reside so I can use that histogram >> vector to generate my ARIMA prediction (actually, it seems like this >> doesn't just apply to ARIMA but could apply to any sliding average). >> >> I *think* my prediction code may look something like this: >> >> val predictionAverages = dstream >> .groupByKeyAndWindow(60*60*24, 60*60*24) >> .mapValues(applyARIMAFunction) >> >> That is, keep 24 hours worth of metrics in each window and use that for >> the ARIMA prediction. The part I'm struggling with is how to join together >> the actual values so that i can do my comparison against the prediction >> model. >> >> Let's say dstream contains the actual values. For any time window, I >> should be able to take a previous set of windows and use model to compare >> against the current values. >> >> >> >