Thanks buddy, I'll try it out in my project.

Best,
Lewis
2015-09-16 13:29 GMT+08:00 Feynman Liang <fli...@databricks.com>:

> If you're doing hyperparameter grid search, consider using ml.tuning.CrossValidator, which does cache the dataset
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L85>.
>
> Otherwise, perhaps you can elaborate more on your particular use case for caching intermediate results, and if the current API doesn't support it we can create a JIRA for it.
>
> On Tue, Sep 15, 2015 at 10:26 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>
>> Yeah, I understand that at the low level we should do as you said.
>>
>> But since the ML pipeline is a high-level API, it is pretty natural to expect it to recognize overlapping parameters between successive runs. (Actually, this happens A LOT when we have many hyper-parameters to search over.)
>>
>> I can also imagine an implementation that appends parameter information to the cached results. Say we implemented an "equals" method for param1; by comparing param1 with its value in the previous run, the program would know that data1 is reusable, and the time spent generating data1 could be saved.
>>
>> Best,
>> Lewis
>>
>> 2015-09-15 23:05 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>
>>> Nope, and that's intentional. There is no guarantee that rawData did not change between intermediate calls to searchRun, so reusing a cached data1 would be incorrect.
>>>
>>> If you want data1 to be cached between multiple runs, you have a few options:
>>> * cache it first and pass it in as an argument to searchRun
>>> * use a creational pattern like a singleton to ensure only one instantiation
>>>
>>> On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com> wrote:
>>>
>>>> Hey Feynman,
>>>>
>>>> I doubt DataFrame persistence will work in my case. Consider the following code:
>>>> ==============================================
>>>> def searchRun(params = [param1, param2]):
>>>>     data1 = hashing1.transform(rawData, param1)
>>>>     data1.cache()
>>>>     data2 = hashing2.transform(data1, param2)
>>>>     data2.someAction()
>>>> ==============================================
>>>> Say we run "searchRun()" twice with the same "param1" but a different "param2". Will Spark recognize that the local variable "data1" has the same content in both runs?
>>>>
>>>> Best,
>>>> Lewis
>>>>
>>>> 2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>>
>>>>> You can persist the transformed DataFrames, for example:
>>>>>
>>>>> val data: DataFrame = ...
>>>>> val hashedData = hashingTF.transform(data)
>>>>> hashedData.cache() // cache the DataFrame in memory
>>>>>
>>>>> Future uses of hashedData will now read from the in-memory cache.
>>>>>
>>>>> You can also persist to disk, e.g.:
>>>>>
>>>>> hashedData.write.parquet(FilePath) // write the DataFrame to disk in Parquet format
>>>>> ...
>>>>> val savedHashedData = sqlContext.read.parquet(FilePath)
>>>>>
>>>>> Future uses of savedHashedData will then read from the Parquet file on disk.
>>>>>
>>>>> As in my earlier response, this still requires you to call each PipelineStage's `transform` method individually (i.e. to NOT use the overall Pipeline.setStages API).
>>>>>
>>>>> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>>>>>
>>>>>> Hey Feynman,
>>>>>>
>>>>>> Thanks for your response, but I'm afraid "model save/load" is not exactly the feature I'm looking for.
>>>>>>
>>>>>> What I need to cache and reuse are the intermediate outputs of the transformations, not the transformers themselves. Do you know of any related development activities or plans?
>>>>>>
>>>>>> Best,
>>>>>> Lewis
>>>>>>
>>>>>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>>>>
>>>>>>> Lewis,
>>>>>>>
>>>>>>> Many pipeline stages implement save/load methods, which you can use if you instantiate the underlying pipeline stages and call their `transform` methods individually (instead of using the Pipeline.setStages API). See the associated JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>>>>>
>>>>>>> Pipeline persistence is on the 1.6 roadmap, JIRA here <https://issues.apache.org/jira/browse/SPARK-6725>.
>>>>>>>
>>>>>>> Feynman
>>>>>>>
>>>>>>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have a question about the ML pipeline's ability to cache intermediate results. I posted this question on Stack Overflow <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline> but got no answer; I hope someone here can help me out.
>>>>>>>>
>>>>>>>> ===========
>>>>>>>> Lately I have been planning to migrate my standalone Python ML code to Spark. The ML pipeline in spark.ml turns out to be quite handy, with a streamlined API for chaining up algorithm stages and hyper-parameter grid search.
>>>>>>>>
>>>>>>>> Still, I found its support for one important feature obscure in the existing documentation: caching of intermediate results. This feature matters when the pipeline involves computation-intensive stages.
>>>>>>>>
>>>>>>>> For example, in my case I use a huge sparse matrix to perform multiple moving averages on time-series data in order to form input features. The structure of the matrix is determined by a hyper-parameter, and this step turns out to be a bottleneck for the entire pipeline because I have to construct the matrix at runtime.
>>>>>>>>
>>>>>>>> During parameter search, I usually have other parameters to examine in addition to this "structure parameter". So if I could reuse the huge matrix whenever the "structure parameter" is unchanged, I could save a great deal of time. For this reason, I intentionally structured my code to cache and reuse these intermediate results.
>>>>>>>>
>>>>>>>> So my question is: can Spark's ML pipeline handle this intermediate caching automatically, or do I have to write code to do it manually? If the latter, is there any best practice to learn from?
>>>>>>>>
>>>>>>>> P.S. I have looked into the official documentation and some other material, but none of it seems to discuss this topic.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Lewis
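
Below is a minimal Scala sketch of the "cache it first and pass it in as an argument" pattern suggested above, applied to a hyper-parameter search. The helper name evaluateGrid, the column names, and the placeholder inner loop are illustrative assumptions, not part of the spark.ml API; the point is simply that the expensive intermediate DataFrame is materialized once per value of the outer "structure" parameter and reused for every setting of the remaining parameters.

==============================================
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.DataFrame

// Hypothetical helper: search over an expensive outer parameter (numFeatures)
// and a cheaper inner parameter, reusing the cached intermediate DataFrame
// across the inner loop instead of recomputing it on every run.
def evaluateGrid(rawData: DataFrame,
                 numFeaturesGrid: Seq[Int],
                 innerParamGrid: Seq[Double]): Unit = {
  for (numFeatures <- numFeaturesGrid) {
    val hashingTF = new HashingTF()
      .setInputCol("words")       // assumed input column name
      .setOutputCol("features")
      .setNumFeatures(numFeatures)

    // Expensive intermediate: computed and cached once per outer value ...
    val hashed = hashingTF.transform(rawData).cache()
    hashed.count() // force materialization

    // ... and reused for every inner parameter value.
    for (innerParam <- innerParamGrid) {
      // fit / evaluate the downstream stages on `hashed` using innerParam
    }

    hashed.unpersist()
  }
}
==============================================

Alternatively, if the search is a plain grid over an Estimator's params, ml.tuning.CrossValidator (linked earlier in the thread) already handles caching of the dataset during fitting, as Feynman noted.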