Hello, I want to be able to update an existing model with new data without having to run batch GD again over all the data. I would rather use the native MLlib functions, without the streaming module.
The way I thought of doing this is to use the *initialWeights* input argument to load my previously found weights and train on a new batch with a new RDD (a sketch of this warm-start call is at the end of this message).

1) What I struggle to understand is that if *intercept = False* then the initial weights vector has exactly the length of my input vectors. However, if *intercept = True* then it seems to me that I would need to *increase* the weights vector by *+1* for the algorithm to update the intercept term. There is, however, no such option. This seems strange, given that the intercept should be considered a regular weight (at least mathematically).

2) *StreamingLogisticRegressionWithSGD* seems to do exactly that: its *update* function calls *LogisticRegressionWithSGD.train* with the already-found weights, but without the intercept weight. This, to me, seems to give erroneous results.

3) The way I want to implement it, then, is a pre-processing step that extends each LabeledPoint feature vector by one extra feature set to 1 for all samples. This in effect forces the intercept to be treated as a regular weight, so I can then call *LogisticRegressionWithSGD.train* with *intercept = False* (also sketched at the end of this message).

Is there anything wrong in my logic regarding points 1 and 2? And is point 3 a correct approach?

Thanks
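P.S. Concretely, the warm-start I have in mind would look something like the minimal sketch below. This assumes the RDD-based pyspark.mllib API; the file paths are made up:

    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.util import MLUtils

    sc = SparkContext(appName="mini-batch-gd")

    # First mini-batch: train from scratch.
    batch1 = MLUtils.loadLibSVMFile(sc, "data/batch1")  # hypothetical path
    model = LogisticRegressionWithSGD.train(batch1, iterations=100,
                                            intercept=False)

    # Next mini-batch: warm-start from the weights found so far instead
    # of re-running batch GD over all of the data.
    batch2 = MLUtils.loadLibSVMFile(sc, "data/batch2")  # hypothetical path
    model = LogisticRegressionWithSGD.train(batch2, iterations=100,
                                            initialWeights=model.weights,
                                            intercept=False)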
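And the pre-processing step from point 3 would be roughly the following sketch, where *training_rdd* is a placeholder name for an existing RDD of LabeledPoints (note that toArray() densifies sparse feature vectors; for large sparse data the constant 1 would have to be appended to the SparseVector instead):

    import numpy as np
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    def append_bias(lp):
        # Append a constant 1.0 feature so the intercept is learned as a
        # regular weight.
        return LabeledPoint(lp.label, np.append(lp.features.toArray(), 1.0))

    augmented = training_rdd.map(append_bias)  # training_rdd: RDD[LabeledPoint]

    model = LogisticRegressionWithSGD.train(augmented, iterations=100,
                                            intercept=False)

    w = model.weights.toArray()
    intercept = w[-1]         # the recovered "intercept"
    feature_weights = w[:-1]  # the ordinary feature weights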
