Thanks buddy, I'll try it out in my project.

Best,
Lewis
2015-09-16 13:29 GMT+08:00 Feynman Liang <fli...@databricks.com>:

> If you're doing hyperparameter grid search, consider using ml.tuning.CrossValidator, which does cache the dataset
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L85>.
>
> Otherwise, perhaps you can elaborate more on your particular use case for caching intermediate results, and if the current API doesn't support it we can create a JIRA for it.
>
> On Tue, Sep 15, 2015 at 10:26 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>
>> Yeah, I understand that at the low level we should do as you said.
>>
>> But since the ML pipeline is a high-level API, it is pretty natural to expect it to recognize overlapping parameters between successive runs. (Actually, this happens A LOT when we have many hyper-parameters to search over.)
>>
>> I can also imagine an implementation that appends parameter information to the cached results. Say we implemented an "equals" method for param1; by comparing param1 with its value in the previous run, the program would know that data1 is reusable, and the time spent generating data1 could be saved.
>>
>> Best,
>> Lewis
>>
>> 2015-09-15 23:05 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>
>>> Nope, and that's intentional. There is no guarantee that rawData did not change between intermediate calls to searchRun, so reusing a cached data1 would be incorrect.
>>>
>>> If you want data1 to be cached between multiple runs, you have a few options:
>>> * cache it first and pass it in as an argument to searchRun
>>> * use a creational pattern like a singleton to ensure only one instantiation
>>>
>>> On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com> wrote:
>>>
>>>> Hey Feynman,
>>>>
>>>> I doubt DataFrame persistence will work in my case. Consider the following code:
>>>> ==============================================
>>>> def searchRun(params = [param1, param2]):
>>>>     data1 = hashing1.transform(rawData, param1)
>>>>     data1.cache()
>>>>     data2 = hashing2.transform(data1, param2)
>>>>     data2.someAction()
>>>> ==============================================
>>>> Say we run "searchRun()" twice with the same "param1" but a different "param2". Will Spark recognize that the local variable "data1" has the same content in both runs?
>>>>
>>>> Best,
>>>> Lewis
>>>>
>>>> 2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>>
>>>>> You can persist the transformed DataFrames, for example:
>>>>>
>>>>> val data: DataFrame = ...
>>>>> val hashedData = hashingTF.transform(data)
>>>>> hashedData.cache() // cache the DataFrame in memory
>>>>>
>>>>> Future uses of hashedData will now read from the in-memory cache.
>>>>>
>>>>> You can also persist to disk, e.g.:
>>>>>
>>>>> hashedData.write.parquet(FilePath) // write the DataFrame to disk in Parquet format
>>>>> ...
>>>>> val savedHashedData = sqlContext.read.parquet(FilePath)
>>>>>
>>>>> Future uses of savedHashedData will then read from the Parquet file on disk.
>>>>>
>>>>> As in my earlier response, this still requires you to call each PipelineStage's `transform` method individually (i.e. to NOT use the overall Pipeline.setStages API).
>>>>>
>>>>> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>>>>>
>>>>>> Hey Feynman,
>>>>>>
>>>>>> Thanks for your response, but I'm afraid "model save/load" is not exactly the feature I'm looking for.
>>>>>>
>>>>>> What I need to cache and reuse are the intermediate outputs of the transformations, not the transformers themselves. Do you know of any related development activities or plans?
>>>>>>
>>>>>> Best,
>>>>>> Lewis
>>>>>>
>>>>>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>>>>
>>>>>>> Lewis,
>>>>>>>
>>>>>>> Many pipeline stages implement save/load methods, which you can use if you instantiate the underlying pipeline stages and call their `transform` methods individually (instead of using the Pipeline.setStages API). See the associated JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>>>>>
>>>>>>> Pipeline persistence is on the 1.6 roadmap, JIRA here <https://issues.apache.org/jira/browse/SPARK-6725>.
>>>>>>>
>>>>>>> Feynman
>>>>>>>
>>>>>>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have a question about the ML pipeline's ability to cache intermediate results. I posted this question on Stack Overflow <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline> but got no answer; I hope someone here can help me out.
>>>>>>>>
>>>>>>>> ===========
>>>>>>>> Lately I have been planning to migrate my standalone Python ML code to Spark. The ML pipeline in spark.ml turns out to be quite handy, with a streamlined API for chaining up algorithm stages and hyper-parameter grid search.
>>>>>>>>
>>>>>>>> Still, I found its support for one important feature obscure in the existing documentation: caching of intermediate results. This feature matters when the pipeline involves computation-intensive stages.
>>>>>>>>
>>>>>>>> For example, in my case I use a huge sparse matrix to perform multiple moving averages on time-series data in order to form input features. The structure of the matrix is determined by a hyper-parameter, and this step turns out to be a bottleneck for the entire pipeline because I have to construct the matrix at runtime.
>>>>>>>>
>>>>>>>> During parameter search, I usually have other parameters to examine in addition to this "structure parameter". So if I could reuse the huge matrix whenever the "structure parameter" is unchanged, I could save a great deal of time. For this reason, I intentionally structured my code to cache and reuse these intermediate results.
>>>>>>>>
>>>>>>>> So my question is: can Spark's ML pipeline handle this intermediate caching automatically, or do I have to write code to do it manually? If the latter, is there any best practice to learn from?
>>>>>>>>
>>>>>>>> P.S. I have looked into the official documentation and some other material, but none of it seems to discuss this topic.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Lewis
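
Below is a minimal Scala sketch of the "cache it first and pass it in as an argument" pattern suggested above, applied to a hyper-parameter search. The helper name evaluateGrid, the column names, and the placeholder inner loop are illustrative assumptions, not part of the spark.ml API; the point is simply that the expensive intermediate DataFrame is materialized once per value of the outer "structure" parameter and reused for every setting of the remaining parameters.

==============================================
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.DataFrame

// Hypothetical helper: search over an expensive outer parameter (numFeatures)
// and a cheaper inner parameter, reusing the cached intermediate DataFrame
// across the inner loop instead of recomputing it on every run.
def evaluateGrid(rawData: DataFrame,
                 numFeaturesGrid: Seq[Int],
                 innerParamGrid: Seq[Double]): Unit = {
  for (numFeatures <- numFeaturesGrid) {
    val hashingTF = new HashingTF()
      .setInputCol("words")       // assumed input column name
      .setOutputCol("features")
      .setNumFeatures(numFeatures)

    // Expensive intermediate: computed and cached once per outer value ...
    val hashed = hashingTF.transform(rawData).cache()
    hashed.count() // force materialization

    // ... and reused for every inner parameter value.
    for (innerParam <- innerParamGrid) {
      // fit / evaluate the downstream stages on `hashed` using innerParam
    }

    hashed.unpersist()
  }
}
==============================================

Alternatively, if the search is a plain grid over an Estimator's params, ml.tuning.CrossValidator (linked earlier in the thread) already handles caching of the dataset during fitting, as Feynman noted.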