Aureliano, you're correct that this is not "validation error", which is computed as the residuals on data held out from training, and which helps guard against overfitting.
However, in this example, the errors are correctly referred to as "training error", which is what you might compute on a per-iteration basis in a gradient-descent optimizer in order to see how you're doing with respect to minimizing the in-sample residuals. There's nothing special about Spark ML algorithms that claims to escape these bias-variance considerations.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen

On Sat, Mar 29, 2014 at 10:25 PM, Aureliano Buendia <buendia...@gmail.com> wrote:

> Hi,
>
> I noticed the Spark machine learning examples use training data to
> validate regression models. For instance, in the linear regression
> <http://spark.apache.org/docs/0.9.0/mllib-guide.html> example:
>
>     // Evaluate model on training examples and compute training error
>     val valuesAndPreds = parsedData.map { point =>
>       val prediction = model.predict(point.features)
>       (point.label, prediction)
>     }
>     ...
>
> Here training data was used to validate a model which was created from
> the very same training data. This is just a biased estimate, and cross
> validation <http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29>
> is missing here. In order to cross validate, we need to partition the data
> into an in-sample set for training and an out-of-sample set for validation.
>
> Please correct me if this does not apply to ML algorithms implemented in
> Spark.
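For what it's worth, a simple holdout split gives the out-of-sample error Aureliano is after. A minimal sketch in the same style as the MLlib docs' example, assuming an `RDD.randomSplit`-style API is available in your Spark version (if not, a manual split keyed on a random number works the same way):

```scala
// Sketch: hold out a validation set instead of reusing the training data.
// Assumes parsedData: RDD[LabeledPoint], as in the MLlib guide's example.
val Array(training, validation) =
  parsedData.randomSplit(Array(0.8, 0.2), seed = 42L)

val model = LinearRegressionWithSGD.train(training, 100)

// In-sample (training) MSE -- what the docs' example computes
val trainMSE = training.map { p =>
  val err = p.label - model.predict(p.features)
  err * err
}.mean()

// Out-of-sample (validation) MSE -- the honest generalization estimate
val validMSE = validation.map { p =>
  val err = p.label - model.predict(p.features)
  err * err
}.mean()
```

Comparing `trainMSE` against `validMSE` then tells you whether the model is overfitting; repeating the split over k folds gives full cross validation.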