Aureliano, you're correct that this is not "validation error", which is
computed as the residuals on out-of-training-sample data and helps guard
against overfitting (variance).

However, in this example the errors are correctly referred to as "training
error", which is what you might compute on a per-iteration basis in a
gradient-descent optimizer to track how well you're minimizing the
in-sample residuals.

There's nothing special about Spark ML algorithms that lets them escape
these bias-variance considerations.
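
That said, if you do want an out-of-sample validation error, here is a
rough sketch of one way to compute it, assuming RDD.randomSplit (available
in newer Spark builds; older ones can approximate it with sample/subtract)
and the parsedData RDD from the MLlib guide:

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Hold out 20% of the data as the out-of-sample partition.
val Array(trainingData, validationData) =
  parsedData.randomSplit(Array(0.8, 0.2), seed = 42L)

// Train only on the in-sample partition.
val model = LinearRegressionWithSGD.train(trainingData, 100)

// Mean squared error on the held-out data: this is the validation error.
val validationMSE = validationData.map { point =>
  val err = point.label - model.predict(point.features)
  err * err
}.mean()

The same valuesAndPreds computation from the guide, run over validationData
instead of the training set, gives you the corresponding per-point
residuals.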

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Sat, Mar 29, 2014 at 10:25 PM, Aureliano Buendia <buendia...@gmail.com> wrote:

> Hi,
>
> I noticed Spark machine learning examples use training data to validate
> regression models. For instance, in the linear regression example
> <http://spark.apache.org/docs/0.9.0/mllib-guide.html>:
>
> // Evaluate model on training examples and compute training error
> val valuesAndPreds = parsedData.map { point =>
>   val prediction = model.predict(point.features)
>   (point.label, prediction)
> }
> ...
>
>
> Here training data was used to validate a model which was created from
> the very same training data. This only estimates the bias; cross
> validation <http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29>
> is missing here. In order to cross validate, we need to partition the data
> into an in-sample set for training and an out-of-sample set for validation.
>
> Please correct me if this does not apply to ML algorithms implemented in
> Spark.
>
