Re: Iterative Algorithms with Spark Streaming

2015-03-16 Thread Nick Pentreath
MLlib supports streaming linear models:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
and streaming k-means:
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means

With an iteration parameter of 1, this amounts to mini-batch SGD where the
mini-batch is the Spark Streaming batch.
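
For reference, a minimal sketch of the streaming linear regression API based
on those docs (the input paths, batch interval, and feature count below are
placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("StreamingLinearRegressionExample")
  val ssc = new StreamingContext(conf, Seconds(1))

  // Placeholder directories; each line must be in LabeledPoint.parse format,
  // e.g. "(y,[x1,x2,x3])"
  val trainingData = ssc.textFileStream("/path/to/training").map(LabeledPoint.parse)
  val testData = ssc.textFileStream("/path/to/test").map(LabeledPoint.parse)

  val numFeatures = 3 // placeholder dimensionality
  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(numFeatures))
    .setNumIterations(1) // one iteration per streaming batch => mini-batch SGD

  model.trainOn(trainingData)
  model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

  ssc.start()
  ssc.awaitTermination()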

On Mon, Mar 16, 2015 at 2:57 PM, Alex Minnaar aminn...@verticalscope.com
wrote:

  I wanted to ask a basic question about the types of algorithms that
 can be applied to a DStream with Spark Streaming.  With Spark it is
 possible to perform iterative computations on RDDs, as in the gradient
 descent example


 val points = spark.textFile(...).map(parsePoint).cache()
 var w = Vector.random(D) // current separating plane
 for (i <- 1 to ITERATIONS) {
   val gradient = points.map(p =>
     (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
   ).reduce(_ + _)
   w -= gradient
 }


  which has a global state w that is updated after each iteration, with the
 updated value used in the next iteration.  My question is whether this type
 of algorithm is possible if the points variable were a DStream instead of
 an RDD.  It seems like you could perform the same map as above, which would
 create a gradient DStream, and also use updateStateByKey to create a DStream
 for the w variable.  But the problem is that there doesn't seem to be a way
 to reuse the w DStream inside the map.  I don't think it is possible for
 DStreams to communicate this way.  Am I correct that this is not possible
 with DStreams, or am I missing something?


  Note:  The reason I ask is that many machine learning algorithms are
 trained by stochastic gradient descent.  SGD is similar to the gradient
 descent algorithm above, except that each iteration runs on a new mini-batch
 of data points rather than on the same data points every iteration.  It
 seems like Spark Streaming provides a natural way to stream in these
 mini-batches (as RDDs), but if it cannot keep track of an updating global
 state variable then I don't think Spark Streaming can be used for SGD.


  Thanks,


  Alex
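
On the specific question of reusing w across batches: w doesn't need to be a
DStream at all. It can be held as a mutable variable on the driver and updated
inside foreachRDD; the closure shipped to the executors captures the current
value of w for each batch, which is essentially how the streaming MLlib models
work internally. A rough sketch (assuming parsePoint yields a Double label y
and an Array[Double] of features x; D, stepSize, the input path, and the line
format are all placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  case class Point(x: Array[Double], y: Double) // assumed output shape of parsePoint

  // Placeholder parser; assumed line format: "y x1 x2 ... xD"
  def parsePoint(line: String): Point = {
    val parts = line.split(' ').map(_.toDouble)
    Point(parts.tail, parts.head)
  }

  val D = 10         // placeholder feature dimension
  val stepSize = 0.1 // placeholder learning rate

  val conf = new SparkConf().setAppName("StreamingSGDSketch")
  val ssc = new StreamingContext(conf, Seconds(1))
  val points = ssc.textFileStream("/path/to/stream").map(parsePoint)

  var w = Array.fill(D)(scala.util.Random.nextDouble) // global state, kept on the driver

  points.foreachRDD { rdd =>
    if (rdd.take(1).nonEmpty) { // skip empty batches (reduce would fail)
      val wLocal = w // capture the current weights for this batch's closures
      val gradient = rdd.map { p =>
        val margin = wLocal.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val coeff = (1.0 / (1.0 + math.exp(-p.y * margin)) - 1.0) * p.y
        p.x.map(_ * coeff) // this point's gradient contribution
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - stepSize * gi } // driver-side update
    }
  }

  ssc.start()
  ssc.awaitTermination()

One caveat with this approach: w lives only in the driver process, so it is
lost on driver failure unless you save it externally.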


