This "inside out" parallelization has been a way people have used R
with MapReduce for a long time. Run N copies of an R script on the
cluster, on different subsets of the data, babysat by Mappers. You
just need R installed on the cluster. Hadoop Streaming makes this easy
and things like RDD.pipe in Spark make it easier.

So it may be just that simple, and there's not much more to say about it.
I haven't tried this with Spark Streaming, but I imagine it would also
work. Have you tried it?

Within a window you would probably take the first x% as training and
the rest as test. I don't think there's a question of looking across
windows.
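
Something like this, roughly (nothing here is tested; the 80/20 split, the
window length, and the assumption that each value in your dstream is a
(timestamp, metric) pair keyed by component name are all placeholders):

import org.apache.spark.streaming.Minutes

// dstream: DStream[(String, (Long, Double))] -- (component, (timestamp, metric))
val trainFraction = 0.8

val trainAndTest = dstream
  .groupByKeyAndWindow(Minutes(60), Minutes(60))
  .mapValues { values =>
    val series = values.toSeq.sortBy(_._1)           // order by timestamp
    val cut = (series.length * trainFraction).toInt
    (series.take(cut), series.drop(cut))             // (training, test)
  }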

On Thu, Apr 2, 2015 at 12:31 AM, Corey Nolet <cjno...@gmail.com> wrote:
> Surprised I haven't gotten any responses about this. Has anyone tried using
> rJava or FastR w/ Spark? I've seen the SparkR project, but that goes the
> other way -- what I'd like to do is use R for model calculation and Spark to
> distribute the load across the cluster.
>
> Also, has anyone used Scalation for ARIMA models?
>
> On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet <cjno...@gmail.com> wrote:
>>
>> Taking out the complexity of the ARIMA models to simplify things -- I can't
>> seem to find a good way to represent even standard moving averages in Spark
>> Streaming. Perhaps it's my ignorance with the micro-batched style of the
>> DStreams API.
>>
>> On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet <cjno...@gmail.com> wrote:
>>>
>>> I want to use ARIMA for a predictive model so that I can take time series
>>> data (metrics) and perform a light anomaly detection. The time series data
>>> is going to be bucketed to different time units (several minutes within
>>> several hours, several hours within several days, several days within
>>> several years).
>>>
>>> I want to do the algorithm in Spark Streaming. I'm used to "tuple at a
>>> time" streaming and I'm having a tad bit of trouble gaining insight into how
>>> exactly the windows are managed inside of DStreams.
>>>
>>> Let's say I have a simple dataset that is marked by a key/value tuple,
>>> where the key is the name of the component whose metrics I want to run the
>>> algorithm against and the value is a metric (a value representing a sum for
>>> the time bucket). I want to create histograms of the time series data for
>>> each key in the windows in which they reside so I can use that histogram
>>> vector to generate my ARIMA prediction (actually, it seems like this doesn't
>>> just apply to ARIMA but could apply to any sliding average).
>>>
>>> I *think* my prediction code may look something like this:
>>>
>>> val predictionAverages = dstream
>>>   .groupByKeyAndWindow(Seconds(60*60*24), Seconds(60*60*24))
>>>   .mapValues(applyARIMAFunction)
>>>
>>> That is, keep 24 hours' worth of metrics in each window and use that for
>>> the ARIMA prediction. The part I'm struggling with is how to join together
>>> the actual values so that I can do my comparison against the prediction
>>> model.
>>>
>>> Let's say dstream contains the actual values. For any time window, I
>>> should be able to take a previous set of windows and use the model to
>>> compare against the current values.
>>>
>>>
>>
>
