Hi Kenneth & Donald,

That was really clarifying, thank you. I really appreciate it. So now I know that:
1- I should use LEventStore and query HBase without sc, with as little processing as possible in predict().
2- In this case I don't have to extend PersistentModel, since I will not need sc in predict.
3- If I need sc and batch processing in predict, I can save the RandomForest trees to a file and load them from there.

As far as I can see, my only option to access sc for PEventStore is to add a dummy RDD to the model and use dummyRDD.context. Am I correct, especially on the 3rd point?

Thank you again,
Hasan

On Tue, Sep 27, 2016 at 9:00 AM, Kenneth Chan <kenn...@apache.org> wrote:
> Hasan,
>
> Spark's random forest algo doesn't need an RDD. It is much simpler to
> serialize it and use it in local memory in predict().
> See the example here:
> https://github.com/PredictionIO/template-scala-parallel-leadscoring/blob/develop/src/main/scala/RFAlgorithm.scala
>
> For accessing the event store in predict(), you should use the LEventStore
> API (not the PEventStore API) to have fast queries for specific events.
>
> (Use the PEventStore API if you really want to do batch processing again
> in predict() and need an RDD for it.)
>
> Kenneth
>
> On Mon, Sep 26, 2016 at 9:19 PM, Donald Szeto <don...@apache.org> wrote:
>> Hi Hasan,
>>
>> Does your randomForestModel contain any RDD?
>>
>> If so, implement your algorithm by extending PAlgorithm, have your model
>> extend PersistentModel, and implement PersistentModelLoader to save and
>> load your model. You will be able to perform RDD operations within
>> predict() by using the model's RDD.
>>
>> If not, implement your algorithm by extending P2LAlgorithm, and see if
>> PredictionIO can automatically persist the model for you. The convention
>> assumes that a non-RDD model does not require Spark to perform any RDD
>> operations, so there will be no SparkContext access.
>>
>> Are these conventions not fitting your use case? Feedback is always
>> welcome for improving PredictionIO.
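To make the first convention concrete, here is a rough sketch. The class names, the storage path, and the org.apache.predictionio package are illustrative only (older releases ship the same classes under io.prediction); your engine will define its own params and model types:

```scala
import org.apache.predictionio.controller.{Params, PersistentModel, PersistentModelLoader}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.model.RandomForestModel

// Illustrative params class; a real engine defines its own.
case class SomeAlgorithmParams(appName: String) extends Params

class SomeModel(val forest: RandomForestModel)
  extends PersistentModel[SomeAlgorithmParams] {

  // MLlib models implement Saveable, so the forest can be written with
  // save(sc, path). The path here is made up for the example.
  def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = {
    forest.save(sc, s"/tmp/models/$id/forest")
    true // tell PredictionIO the model was persisted
  }
}

object SomeModel
  extends PersistentModelLoader[SomeAlgorithmParams, SomeModel] {

  // Called on deploy; sc is supplied by PredictionIO when available.
  def apply(id: String, params: SomeAlgorithmParams, sc: Option[SparkContext]): SomeModel =
    new SomeModel(RandomForestModel.load(sc.get, s"/tmp/models/$id/forest"))
}
```

Returning false from save() instead would make PredictionIO retrain on deployment, which is the other route discussed below.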
>>
>> Regards,
>> Donald
>>
>> On Mon, Sep 26, 2016 at 9:05 AM, Hasan Can Saral <hasancansa...@gmail.com> wrote:
>>> Hi Marcin,
>>>
>>> I did look at the definition of PersistentModel, and indeed replaced
>>> LocalFileSystemPersistentModel with PersistentModel. Thank you for this,
>>> I really appreciate your help.
>>>
>>> However, I am having quite a hard time understanding how I can access the
>>> sc object that PredictionIO provides to the save and apply methods from
>>> within the predict method.
>>>
>>> class SomeModel(randomForestModel: RandomForestModel, dummyRDD: RDD[_])
>>>     extends PersistentModel[SomeAlgorithmParams] {
>>>
>>>   override def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = {
>>>     // Here I should save randomForestModel to a file, but how to?
>>>     // Tried saveAsObjectFile but no luck.
>>>     true
>>>   }
>>> }
>>>
>>> object SomeModel extends PersistentModelLoader[SomeAlgorithmParams, SomeModel] {
>>>   override def apply(id: String, params: SomeAlgorithmParams, sc: Option[SparkContext]): SomeModel = {
>>>     // Here should I load randomForestModel from a file? How?
>>>     new SomeModel(randomForestModel)
>>>   }
>>> }
>>>
>>> So, my questions have become:
>>> 1- Can I save randomForestModel? If yes, how? If I cannot, I will have to
>>> return false and retrain upon deployment. How do I skip pio train in that
>>> case?
>>> 2- How do I load the saved randomForestModel from a file? If I cannot,
>>> should I remove object SomeModel extends PersistentModelLoader altogether?
>>> 3- How do I access sc within predict? Do I save a dummy RDD, load it in
>>> apply, and call .context on it? In that case, what happens to
>>> randomForestModel?
>>>
>>> I am really quite confused and would really appreciate some help/sample
>>> code if you have time.
>>> Thank you.
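In case it clarifies what I am after, the direction I have been experimenting with (untested) is plain Java serialization, since the trained forest itself should not hold any RDDs. The paths and helper names are made up:

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.tree.model.RandomForestModel

// Untested sketch: the trained forest is a plain Serializable object,
// so it can be written and read without touching Spark at all.
def saveForest(model: RandomForestModel, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
}

def loadForest(path: String): RandomForestModel = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[RandomForestModel] finally in.close()
}
```

If this works, save() would call saveForest and return true, and the PersistentModelLoader's apply would call loadForest without ever needing sc.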
>>> Hasan
>>>
>>> On Mon, Sep 26, 2016 at 2:56 PM, Marcin Ziemiński <ziem...@gmail.com> wrote:
>>>> Hi Hasan,
>>>>
>>>> So I guess there are two things here:
>>>> 1. You need SparkContext for predictions.
>>>> 2. You also need to retrain your model during loading.
>>>>
>>>> Please look at the definition of PersistentModel and the comments attached:
>>>>
>>>> trait PersistentModel[AP <: Params] {
>>>>   /** Save the model to some persistent storage.
>>>>     *
>>>>     * This method should return true if the model has been saved successfully so
>>>>     * that PredictionIO knows that it can be restored later during deployment.
>>>>     * This method should return false if the model cannot be saved (or should
>>>>     * not be saved due to configuration) so that PredictionIO will re-train the
>>>>     * model during deployment. All arguments of this method are provided
>>>>     * automatically by PredictionIO.
>>>>     *
>>>>     * @param id ID of the run that trained this model.
>>>>     * @param params Algorithm parameters that were used to train this model.
>>>>     * @param sc An Apache Spark context.
>>>>     */
>>>>   def save(id: String, params: AP, sc: SparkContext): Boolean
>>>> }
>>>>
>>>> In order to achieve the desired result you could simply use PersistentModel
>>>> instead of LocalFileSystemPersistentModel and return false from save. Then
>>>> during deployment your model will be retrained through your Algorithm
>>>> implementation. You shouldn't need to retrain your model in implementations
>>>> of PersistentModelLoader - this is rather for loading models that are
>>>> already trained and stored somewhere.
>>>> You can save the SparkContext instance provided to the train method for
>>>> usage in predict(...) (assuming that your algorithm is an instance of
>>>> PAlgorithm or P2LAlgorithm). Thus you should have what you need.
>>>>
>>>> Regards,
>>>> Marcin
>>>>
>>>> On Fri, Sep 23, 2016 at 17:46, Hasan Can Saral <hasancansa...@gmail.com> wrote:
>>>>> Hi Marcin!
>>>>>
>>>>> Thank you for your answer.
>>>>>
>>>>> I do only need SparkContext, but I have no idea:
>>>>> 1- How to retrieve it from PersistentModelLoader?
>>>>> 2- How do I access sc in the predict method using the configuration below?
>>>>>
>>>>> class SomeModel() extends LocalFileSystemPersistentModel[SomeAlgorithmParams] {
>>>>>   override def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = {
>>>>>     false
>>>>>   }
>>>>> }
>>>>>
>>>>> object SomeModel extends LocalFileSystemPersistentModelLoader[SomeAlgorithmParams, SomeModel] {
>>>>>   override def apply(id: String, params: SomeAlgorithmParams, sc: Option[SparkContext]): SomeModel = {
>>>>>     new SomeModel() // HERE I TRAIN AND RETURN THE TRAINED MODEL
>>>>>   }
>>>>> }
>>>>>
>>>>> Thank you very much, I really appreciate it!
>>>>>
>>>>> Hasan
>>>>>
>>>>> On Thu, Sep 22, 2016 at 7:05 PM, Marcin Ziemiński <ziem...@gmail.com> wrote:
>>>>>> Hi Hasan,
>>>>>>
>>>>>> I think that your problem comes from using a deserialized RDD, which has
>>>>>> already lost its connection with SparkContext.
>>>>>> A similar case can be found here:
>>>>>> http://stackoverflow.com/questions/29567247/serializing-rdd
>>>>>>
>>>>>> If you only really need SparkContext, you could probably use the one
>>>>>> provided to PersistentModelLoader, which would be implemented by your
>>>>>> model.
>>>>>> Alternatively, you could also implement PersistentModel to return false
>>>>>> from the save method. In this case your algorithm would be retrained on
>>>>>> deploy, which would also provide you with an instance of SparkContext.
>>>>>>
>>>>>> Regards,
>>>>>> Marcin
>>>>>>
>>>>>> On Thu, Sep 22, 2016 at 13:34, Hasan Can Saral <hasancansa...@gmail.com> wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> I am trying to query the Event Server with the PEventStore API in the
>>>>>>> predict method to fetch events per entity to create my features.
>>>>>>> PEventStore needs sc, and for this I have:
>>>>>>>
>>>>>>> - Extended PAlgorithm
>>>>>>> - Extended LocalFileSystemPersistentModel and LocalFileSystemPersistentModelLoader
>>>>>>> - Put a dummy emptyRDD into my model
>>>>>>> - Tried to access sc with model.dummyRDD.context, only to receive this error:
>>>>>>>
>>>>>>> org.apache.spark.SparkException: RDD transformations and actions can only
>>>>>>> be invoked by the driver, not inside of other transformations; for example,
>>>>>>> rdd1.map(x => rdd2.values.count() * x) is invalid because the values
>>>>>>> transformation and count action cannot be performed inside of the rdd1.map
>>>>>>> transformation. For more information, see SPARK-5063.
>>>>>>>
>>>>>>> Just like this user got it here
>>>>>>> <https://groups.google.com/forum/#!topic/predictionio-user/h4kIltGIIYE>
>>>>>>> in the predictionio-user group. Any suggestions?
>>>>>>>
>>>>>>> Here's more of my predict method:
>>>>>>>
>>>>>>> def predict(model: SomeModel, query: Query): PredictedResult = {
>>>>>>>
>>>>>>>   val appName = sys.env.getOrElse[String]("APP_NAME", ap.appName)
>>>>>>>
>>>>>>>   val previousEvents = try {
>>>>>>>     PEventStore.find(
>>>>>>>       appName = appName,
>>>>>>>       entityType = Some(ap.entityType),
>>>>>>>       entityId = Some(query.entityId.getOrElse(""))
>>>>>>>     )(model.dummyRDD.context).map(event => {
>>>>>>>       Try(new CustomEvent(
>>>>>>>         Some(event.event),
>>>>>>>         Some(event.entityType),
>>>>>>>         Some(event.entityId),
>>>>>>>         Some(event.eventTime),
>>>>>>>         Some(event.creationTime),
>>>>>>>         Some(new Properties(
>>>>>>>           ...
>>>>>>>         ))
>>>>>>>       ))
>>>>>>>     }).filter(_.isSuccess).map(_.get)
>>>>>>>   } catch {
>>>>>>>     case e: Exception => // fatal because of error, an empty query
>>>>>>>       logger.error(s"Error when reading events: ${e}")
>>>>>>>       throw e
>>>>>>>   }
>>>>>>>
>>>>>>>   ...
>>>>>>>
>>>>>>> }
>>>>>
>>>>> --
>>>>> Hasan Can Saral
>>>>> hasancansa...@gmail.com

--
Hasan Can Saral
hasancansa...@gmail.com
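Kenneth's LEventStore suggestion boils down to something like the following inside predict(), with no SparkContext involved. This is a minimal sketch, assuming PredictionIO 0.10-style packages (older releases use io.prediction); the helper name and parameter values are illustrative:

```scala
import org.apache.predictionio.data.storage.Event
import org.apache.predictionio.data.store.LEventStore
import scala.concurrent.duration._

// Illustrative query: fetch the latest events of one entity directly
// from the event store, suitable for a low-latency predict() path.
def recentEvents(appName: String, entityType: String, entityId: String): List[Event] =
  try {
    LEventStore.findByEntity(
      appName = appName,
      entityType = entityType,
      entityId = entityId,
      limit = Some(20),          // only the handful of events needed for features
      latest = true,             // newest first
      timeout = 200.millis       // keep predict() responsive
    ).toList
  } catch {
    // Degrade gracefully if the event store is slow rather than failing the query.
    case _: scala.concurrent.TimeoutException => List.empty
  }
```

Because LEventStore talks to the storage backend directly, this avoids both the dummy-RDD workaround and the SPARK-5063 error quoted above.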