MLlib supports streaming linear models:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
and k-means:
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
With an iteration parameter of 1, this amounts to mini-batch SGD where the
mini-batch is the Spark Streaming batch.
On Mon, Mar 16, 2015 at 2:57 PM, Alex Minnaar aminn...@verticalscope.com
wrote:
I wanted to ask a basic question about the types of algorithms that are
possible to apply to a DStream with Spark streaming. With Spark it is
possible to perform iterative computations on RDDs like in the gradient
descent example
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i - 1 to ITERATIONS) {
val gradient = points.map(p =
(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
w -= gradient
}
which has a global state w that is updated after each iteration and the
updated value is then used in the next iteration. My question is whether
this type of algorithm is possible if the points variable was a DStream
instead of an RDD? It seems like you could perform the same map as above
which would create a gradient DStream and also use updateStateByKey to
create a DStream for the w variable. But the problem is that there doesn't
seem to be a way to reuse the w DStream inside the map. I don't think that
it is possible for DStreams to communicate this way. Am I correct that
this is not possible with DStreams or am I missing something?
Note: The reason I ask this question is that many machine learning
algorithms are trained by stochastic gradient descent. sgd is similar to
the above gradient descent algorithm except each iteration is on a new
minibatch of data points rather than the same data points for every
iteration. It seems like Spark streaming provides a natural way to stream
in these minibatches (as RDDs) but if it is not able to keep track of an
updating global state variable then I don't think it Spark streaming can be
used for sgd.
Thanks,
Alex