The RDD will just be attached to the model, so you can simply reference it.

Here's an example where I also transform the RDD into a DataFrame in the
predict method.

import org.apache.spark.sql.SQLContext

val sqlContext = SQLContext.getOrCreate(model.sc)
import sqlContext.implicits._
val dfWithSchema = sqlContext.createDataFrame(model.dcRDD).toDF("col1", "col2", "colN")

On Fri, 23 Feb 2018 at 17:12 Shane Johnson <shanewaldenjohn...@gmail.com>
wrote:

> Thanks Daniel,
>
> This is the one I am using now. I was hoping to see an example of how the
> RDD is used in the predict() method, as this example stops at train().
> I will follow the steps here and see if I am able to pull the saved RDD
> from train() into the predict() method.
>
> Best,
>
> Shane
>
> *Shane Johnson | 801.360.3350*
> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
> <https://www.facebook.com/shane.johnson.71653>
>
> 2018-02-23 7:09 GMT-10:00 Daniel O' Shaughnessy <
> danieljamesda...@gmail.com>:
>
>> Hi Shane,
>>
>> There's an example of a PAlgorithm here:
>>
>> https://predictionio.apache.org/templates/vanilla/dase/
>>
>>
>>
>> On Fri, 23 Feb 2018 at 16:10 Shane Johnson <shanewaldenjohn...@gmail.com>
>> wrote:
>>
>>> Thanks Donald. I used this command but am still getting the error below;
>>> it doesn't seem to be adjusting the configuration. Do you see a problem in
>>> how I used the spark-submit options? The train function ran, but the error
>>> makes me think spark.driver.maxResultSize was not adjusted.
>>>
>>> command:
>>>
>>> bin/pio build --verbose; bin/pio train -- --driver-memory 14G -- --conf
>>> spark.driver.maxResultSize=4g; bin/pio deploy
>>>
>>> error:
>>>
>>> Job aborted due to stage failure: Total size of serialized results of 8
>>> tasks (1236.7 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
>>>
>>> Regarding PAlgorithm, what I am trying to do is save a Map in the train
>>> method and reuse it in the predict method. Because of the error above I am
>>> not able to convert my RDD to a map, as collectAsMap tries to bring it all
>>> to the driver. If I use PAlgorithm, I should be able to just save the RDD
>>> in the Model class and then use it in the predict method. I am going down
>>> that path now. *Do you know of any templates that are using PAlgorithm?*
>>> The docs say that "Similar Product" uses it, but it looks like it actually
>>> uses P2LAlgorithm.
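>>>
>>> Roughly, what I have in mind is something like this (class, field, and
>>> type names are just placeholders):
>>>
>>> import org.apache.spark.SparkContext
>>> import org.apache.spark.rdd.RDD
>>>
>>> // Placeholder sketch: instead of collecting to the driver in train(),
>>> // keep the RDD on the model so it never has to fit in driver memory.
>>> class MyModel(val featureRDD: RDD[(String, Double)])
>>>
>>> def train(sc: SparkContext, data: PreparedData): MyModel = {
>>>   new MyModel(data.features) // data.features: RDD[(String, Double)]
>>> }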
>>>
>>> Thank you for your help.
>>>
>>> *Shane Johnson | 801.360.3350*
>>> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
>>> <https://www.facebook.com/shane.johnson.71653>
>>>
>>> 2018-02-22 8:16 GMT-10:00 Donald Szeto <don...@apache.org>:
>>>
>>>> Hi Shane,
>>>>
>>>> I think what you are looking for, to set the max result size on the
>>>> driver, is to pass in a spark-submit argument that looks something like
>>>> this:
>>>>
>>>> pio train ... -- --conf spark.driver.maxResultSize=4g ...
>>>>
>>>> Regarding PAlgorithm, the predict() method does not actually have the
>>>> SparkContext in it (
>>>> http://predictionio.apache.org/api/current/#org.apache.predictionio.controller.PAlgorithm).
>>>> The "model" argument, unlike in P2LAlgorithm, can contain RDDs. In
>>>> PAlgorithm.predict(), you can perform RDD operations directly on the
>>>> model argument. If the SparkContext is needed, the context() method can
>>>> be called on the RDD held by the model.
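>>>>
>>>> For illustration, a predict() on such a model might look roughly like
>>>> this (the model, field, and query type names are made up):
>>>>
>>>> def predict(model: MyModel, query: Query): PredictedResult = {
>>>>   // RDD operations run directly against the RDD held by the model.
>>>>   val values = model.featureRDD.lookup(query.id)
>>>>   // A SparkContext, if needed, can be recovered from the RDD itself.
>>>>   val sc = model.featureRDD.context
>>>>   PredictedResult(values.headOption.getOrElse(0.0))
>>>> }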
>>>>
>>>> Hope these help.
>>>>
>>>> Regards,
>>>> Donald
>>>>
>>>> On Wed, Feb 21, 2018 at 12:08 PM Shane Johnson <
>>>> shanewaldenjohn...@gmail.com> wrote:
>>>>
>>>>> Hi team,
>>>>>
>>>>> We have a specific use case where we are trying to save off a map from
>>>>> the train function and reuse it in the predict function to improve our
>>>>> predict function's response time. I know collect() forces everything to
>>>>> the driver. We are collecting the RDD to a map because we don't have a
>>>>> SparkContext in the predict function.
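>>>>>
>>>>> The pattern is roughly this (names are illustrative):
>>>>>
>>>>> // collectAsMap() pulls every partition's results to the driver, which
>>>>> // is what trips spark.driver.maxResultSize once they exceed the limit.
>>>>> val lookup: Map[String, Double] = featureRDD.collectAsMap().toMap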
>>>>>
>>>>> I am getting the error below and am looking for a way to raise the
>>>>> parameter from 1G to 4G+. I can see a way to do it in Spark 1.6, but we
>>>>> are using Spark 2.1.1 and I have not found a way to set it there. *Has
>>>>> anyone been able to adjust maxResultSize to something more than 1G?*
>>>>>
>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>>>>> due to stage failure: Total size of serialized results of 7 tasks
>>>>> (1156.3 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
>>>>>
>>>>>
>>>>> I have tried to set this parameter but get this as a result with Spark
>>>>> 2.1.1:
>>>>>
>>>>> Error: Unrecognized option: --driver-maxResultSize
>>>>>
>>>>> Our other option is to do the work to obtain a SparkContext in the
>>>>> predict function so we can pass the RDD through from the train function
>>>>> to the predict function. The PredictionIO documentation was a little
>>>>> unclear to me on this. *Is this the right place to learn how to get a
>>>>> SparkContext in the predict function?*
>>>>> https://predictionio.incubator.apache.org/templates/vanilla/dase/
>>>>>
>>>>> Also, I am not seeing in this documentation how to get the SparkContext
>>>>> into the predict function; it looks like it is only used in the train
>>>>> function.
>>>>>
>>>>> Thanks in advance for your expertise.
>>>>>
>>>>> *Shane Johnson | 801.360.3350*
>>>>> LinkedIn <https://www.linkedin.com/in/shanewjohnson> | Facebook
>>>>> <https://www.facebook.com/shane.johnson.71653>
>>>>>
>>>>
>>>
>
