ML Pipeline question about caching

Cesar Flores Tue, 17 Mar 2015 15:29:30 -0700

Hello all:

I am using the ML Pipeline, which I consider very powerful. I have the
following use case:


   - I have three transformers, which I will call A, B, and C, that basically
   extract features from text files and take no parameters.
   - I have a final stage D, which is the logistic regression estimator.
   - I am creating a pipeline with the sequence A, B, C, D.
   - Finally, I am using this pipeline as the estimator parameter of the
   CrossValidator class.

I have some concerns about how data persistence works inside the cross
validator. For example, if only D has multiple parameters to tune, is the
transformation A->B->C performed multiple times? Or is Spark smart enough to
realize that it can persist the output of C? Would it be better to leave A,
B, and C outside the cross-validated pipeline?
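
To make the concern concrete, here is a toy model in plain Python. It is not
Spark code and says nothing about what CrossValidator actually does; the
function names, the 3 folds, and the 3 candidate values for D are all made up
for illustration, and it ignores the evaluation pass on the held-out fold. It
only counts how often each stage would run if the whole pipeline were naively
re-run for every fold and parameter setting, versus running A->B->C once up
front.

```python
from collections import Counter

# Records how many times each stage is invoked.
calls = Counter()

def make_transformer(name):
    """Return a parameter-free feature stage that records each invocation."""
    def transform(rows):
        calls[name] += 1
        return [f"{name}({r})" for r in rows]
    return transform

A, B, C = (make_transformer(n) for n in "ABC")

def fit_d(features, reg_param):
    """Stand-in for fitting the logistic regression stage D."""
    calls["D"] += 1
    return (reg_param, len(features))

def cross_validate_whole_pipeline(rows, reg_params, num_folds):
    """Naive scheme: A -> B -> C is re-run for every fold and every setting of D."""
    for fold in range(num_folds):
        train = [r for i, r in enumerate(rows) if i % num_folds != fold]
        for reg_param in reg_params:
            fit_d(C(B(A(train))), reg_param)

def cross_validate_d_only(rows, reg_params, num_folds):
    """Alternative: run A -> B -> C once up front, then cross-validate only D."""
    features = C(B(A(rows)))
    for fold in range(num_folds):
        train = [f for i, f in enumerate(features) if i % num_folds != fold]
        for reg_param in reg_params:
            fit_d(train, reg_param)

rows = [f"doc{i}" for i in range(6)]
reg_params = [0.01, 0.1, 1.0]  # hypothetical candidate values for D

cross_validate_whole_pipeline(rows, reg_params, num_folds=3)
naive = dict(calls)    # A, B, C and D each run 3 folds x 3 settings = 9 times

calls.clear()
cross_validate_d_only(rows, reg_params, num_folds=3)
hoisted = dict(calls)  # A, B and C run once each; D still runs 9 times
```

If the naive count is what happens in practice, the feature extraction cost
grows with (number of folds) x (number of parameter combinations), which is
why I am asking whether the output of C gets persisted.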

Thanks a lot
-- 
Cesar Flores
