Hello, I want to be able to update an existing model with new data without having to run batch GD again over all the data. I would rather use the native MLlib functions, without the streaming module.
The way I thought of doing this is to use the *initialWeights* input argument to load my previously found weights and train on a new batch with a new RDD (a sketch of this warm-start call is at the end of this message).

1) What I struggle to understand is that if *intercept = False* then the initial weights vector has exactly the length of my input vectors. However, if *intercept = True* then it seems to me that I would need to *increase* the weights vector by *+1* for the algorithm to update the intercept term. There is, however, no such option. This seems strange, given that the intercept should be considered a regular weight (at least mathematically).

2) *StreamingLogisticRegressionWithSGD* seems to do exactly that: its *update* function calls *LogisticRegressionWithSGD.train* with the already-found weights, but without the intercept weight. This, to me, seems to give erroneous results.

3) The way I want to implement it, then, is a pre-processing step that extends each LabeledPoint feature vector by one extra feature set to 1 for all samples. This in effect forces the intercept to be treated as a regular weight, so I can then call *LogisticRegressionWithSGD.train* with *intercept = False* (also sketched at the end of this message).

Is there anything wrong in my logic regarding points 1 and 2? And is point 3 a correct approach?

Thanks
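P.S. Concretely, the warm-start I have in mind would look something like the minimal sketch below. This assumes the RDD-based pyspark.mllib API; the file paths are made up:

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.util import MLUtils

    sc = SparkContext(appName="mini-batch-gd")

    # First mini-batch: train from scratch.
    batch1 = MLUtils.loadLibSVMFile(sc, "data/batch1")  # hypothetical path
    model = LogisticRegressionWithSGD.train(batch1, iterations=100,
                                            intercept=False)

    # Next mini-batch: warm-start from the weights found so far instead
    # of re-running batch GD over all of the data.
    batch2 = MLUtils.loadLibSVMFile(sc, "data/batch2")  # hypothetical path
    model = LogisticRegressionWithSGD.train(batch2, iterations=100,
                                            initialWeights=model.weights,
                                            intercept=False)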
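And the pre-processing step from point 3 would be roughly the following sketch, where *training_rdd* is a placeholder name for an existing RDD of LabeledPoints (note that toArray() densifies sparse feature vectors; for large sparse data the constant 1 would have to be appended to the SparseVector instead):

    import numpy as np
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    def append_bias(lp):
        # Append a constant 1.0 feature so the intercept is learned as a
        # regular weight.
        return LabeledPoint(lp.label, np.append(lp.features.toArray(), 1.0))

    augmented = training_rdd.map(append_bias)  # training_rdd: RDD[LabeledPoint]

    model = LogisticRegressionWithSGD.train(augmented, iterations=100,
                                            intercept=False)

    w = model.weights.toArray()
    intercept = w[-1]         # the recovered "intercept"
    feature_weights = w[:-1]  # the ordinary feature weights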
