Hi Kenneth & Donald,

That was really clarifying, thank you. I really appreciate it. So now I know that:
1- I should use LEventStore and query HBase without sc, with as little processing as possible in predict().
2- In this case I don't have to extend PersistentModel, since I will not need sc in predict.
3- If I need sc and batch processing in predict, I can save the RandomForest trees to a file and load them from there.

As far as I can see, my only option to access sc for PEventStore is to add a dummy RDD to the model and use dummyRDD.context. Am I correct, especially on the 3rd point?

Thank you again,
Hasan

On Tue, Sep 27, 2016 at 9:00 AM, Kenneth Chan <kenn...@apache.org> wrote:
> Hasan,
>
> Spark's random forest algo doesn't need an RDD. It is much simpler to
> serialize it and use it in local memory in predict().
> See the example here:
> https://github.com/PredictionIO/template-scala-parallel-leadscoring/blob/develop/src/main/scala/RFAlgorithm.scala
>
> For accessing the event store in predict(), you should use the LEventStore
> API (not the PEventStore API) to have fast queries for specific events.
>
> (Use the PEventStore API if you really want to do batch processing again
> in predict() and need an RDD for it.)
>
> Kenneth
>
> On Mon, Sep 26, 2016 at 9:19 PM, Donald Szeto <don...@apache.org> wrote:
>> Hi Hasan,
>>
>> Does your randomForestModel contain any RDD?
>>
>> If so, implement your algorithm by extending PAlgorithm, have your model
>> extend PersistentModel, and implement PersistentModelLoader to save and
>> load your model. You will be able to perform RDD operations within
>> predict() by using the model's RDD.
>>
>> If not, implement your algorithm by extending P2LAlgorithm, and see if
>> PredictionIO can automatically persist the model for you. The convention
>> assumes that a non-RDD model does not require Spark to perform any RDD
>> operations, so there will be no SparkContext access.
>>
>> Are these conventions not fitting your use case? Feedback is always
>> welcome for improving PredictionIO.
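To make the first convention concrete, here is a rough sketch. The class names, the storage path, and the org.apache.predictionio package are illustrative only (older releases ship the same classes under io.prediction); your engine will define its own params and model types:

```scala
import org.apache.predictionio.controller.{Params, PersistentModel, PersistentModelLoader}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.model.RandomForestModel

// Illustrative params class; a real engine defines its own.
case class SomeAlgorithmParams(appName: String) extends Params

class SomeModel(val forest: RandomForestModel)
  extends PersistentModel[SomeAlgorithmParams] {

  // MLlib models implement Saveable, so the forest can be written with
  // save(sc, path). The path here is made up for the example.
  def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = {
    forest.save(sc, s"/tmp/models/$id/forest")
    true // tell PredictionIO the model was persisted
  }
}

object SomeModel
  extends PersistentModelLoader[SomeAlgorithmParams, SomeModel] {

  // Called on deploy; sc is supplied by PredictionIO when available.
  def apply(id: String, params: SomeAlgorithmParams, sc: Option[SparkContext]): SomeModel =
    new SomeModel(RandomForestModel.load(sc.get, s"/tmp/models/$id/forest"))
}
```

Returning false from save() instead would make PredictionIO retrain on deployment, which is the other route discussed below.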
>>
>> Regards,
>> Donald
>>
>> On Mon, Sep 26, 2016 at 9:05 AM, Hasan Can Saral <hasancansa...@gmail.com> wrote:
>>> Hi Marcin,
>>>
>>> I did look at the definition of PersistentModel, and indeed replaced
>>> LocalFileSystemPersistentModel with PersistentModel. Thank you for this,
>>> I really appreciate your help.
>>>
>>> However, I am having quite a hard time understanding how I can access the
>>> sc object that PredictionIO provides to the save and apply methods from
>>> within the predict method.
>>>
>>> class SomeModel(randomForestModel: RandomForestModel, dummyRDD: RDD[_])
>>>     extends PersistentModel[SomeAlgorithmParams] {
>>>
>>>   override def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = {
>>>     // Here I should save randomForestModel to a file, but how to?
>>>     // Tried saveAsObjectFile but no luck.
>>>     true
>>>   }
>>> }
>>>
>>> object SomeModel extends PersistentModelLoader[SomeAlgorithmParams, SomeModel] {
>>>   override def apply(id: String, params: SomeAlgorithmParams, sc: Option[SparkContext]): SomeModel = {
>>>     // Here should I load randomForestModel from a file? How?
>>>     new SomeModel(randomForestModel)
>>>   }
>>> }
>>>
>>> So, my questions have become:
>>> 1- Can I save randomForestModel? If yes, how? If I cannot, I will have to
>>> return false and retrain upon deployment. How do I skip pio train in that
>>> case?
>>> 2- How do I load the saved randomForestModel from a file? If I cannot,
>>> should I remove object SomeModel extends PersistentModelLoader altogether?
>>> 3- How do I access sc within predict? Do I save a dummy RDD, load it in
>>> apply, and call .context on it? In that case, what happens to
>>> randomForestModel?
>>>
>>> I am really quite confused and would really appreciate some help/sample
>>> code if you have time.
>>> Thank you.
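In case it clarifies what I am after, the direction I have been experimenting with (untested) is plain Java serialization, since the trained forest itself should not hold any RDDs. The paths and helper names are made up:

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.tree.model.RandomForestModel

// Untested sketch: the trained forest is a plain Serializable object,
// so it can be written and read without touching Spark at all.
def saveForest(model: RandomForestModel, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
}

def loadForest(path: String): RandomForestModel = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[RandomForestModel] finally in.close()
}
```

If this works, save() would call saveForest and return true, and the PersistentModelLoader's apply would call loadForest without ever needing sc.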
>>> Hasan
>>>
>>> On Mon, Sep 26, 2016 at 2:56 PM, Marcin Ziemiński <ziem...@gmail.com> wrote:
>>>> Hi Hasan,
>>>>
>>>> So I guess there are two things here:
>>>> 1. You need SparkContext for predictions.
>>>> 2. You also need to retrain your model during loading.
>>>>
>>>> Please look at the definition of PersistentModel and the comments attached:
>>>>
>>>> trait PersistentModel[AP <: Params] {
>>>>   /** Save the model to some persistent storage.
>>>>     *
>>>>     * This method should return true if the model has been saved successfully so
>>>>     * that PredictionIO knows that it can be restored later during deployment.
>>>>     * This method should return false if the model cannot be saved (or should
>>>>     * not be saved due to configuration) so that PredictionIO will re-train the
>>>>     * model during deployment. All arguments of this method are provided
>>>>     * automatically by PredictionIO.
>>>>     *
>>>>     * @param id ID of the run that trained this model.
>>>>     * @param params Algorithm parameters that were used to train this model.
>>>>     * @param sc An Apache Spark context.
>>>>     */
>>>>   def save(id: String, params: AP, sc: SparkContext): Boolean
>>>> }
>>>>
>>>> In order to achieve the desired result you could simply use PersistentModel
>>>> instead of LocalFileSystemPersistentModel and return false from save. Then
>>>> during deployment your model will be retrained through your Algorithm
>>>> implementation. You shouldn't need to retrain your model in implementations
>>>> of PersistentModelLoader - this is rather for loading models that are
>>>> already trained and stored somewhere.
>>>> You can save the SparkContext instance provided to the train method for
>>>> usage in predict(...) (assuming that your algorithm is an instance of
>>>> PAlgorithm or P2LAlgorithm). Thus you should have what you need.
>>>>
>>>> Regards,
>>>> Marcin
>>>>
>>>> On Fri, Sep 23, 2016 at 17:46, Hasan Can Saral <hasancansa...@gmail.com> wrote:
>>>>> Hi Marcin!
>>>>>
>>>>> Thank you for your answer.
>>>>>
>>>>> I do only need SparkContext, but I have no idea:
>>>>> 1- How to retrieve it from PersistentModelLoader?
>>>>> 2- How do I access sc in the predict method using the configuration below?
>>>>>
>>>>> class SomeModel() extends LocalFileSystemPersistentModel[SomeAlgorithmParams] {
>>>>>   override def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = {
>>>>>     false
>>>>>   }
>>>>> }
>>>>>
>>>>> object SomeModel extends LocalFileSystemPersistentModelLoader[SomeAlgorithmParams, SomeModel] {
>>>>>   override def apply(id: String, params: SomeAlgorithmParams, sc: Option[SparkContext]): SomeModel = {
>>>>>     new SomeModel() // HERE I TRAIN AND RETURN THE TRAINED MODEL
>>>>>   }
>>>>> }
>>>>>
>>>>> Thank you very much, I really appreciate it!
>>>>>
>>>>> Hasan
>>>>>
>>>>> On Thu, Sep 22, 2016 at 7:05 PM, Marcin Ziemiński <ziem...@gmail.com> wrote:
>>>>>> Hi Hasan,
>>>>>>
>>>>>> I think that your problem comes from using a deserialized RDD, which has
>>>>>> already lost its connection with SparkContext.
>>>>>> A similar case can be found here:
>>>>>> http://stackoverflow.com/questions/29567247/serializing-rdd
>>>>>>
>>>>>> If you only really need SparkContext, you could probably use the one
>>>>>> provided to PersistentModelLoader, which would be implemented by your
>>>>>> model.
>>>>>> Alternatively, you could also implement PersistentModel to return false
>>>>>> from the save method. In this case your algorithm would be retrained on
>>>>>> deploy, which would also provide you with an instance of SparkContext.
>>>>>>
>>>>>> Regards,
>>>>>> Marcin
>>>>>>
>>>>>> On Thu, Sep 22, 2016 at 13:34, Hasan Can Saral <hasancansa...@gmail.com> wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> I am trying to query the Event Server with the PEventStore API in the
>>>>>>> predict method to fetch events per entity to create my features.
>>>>>>> PEventStore needs sc, and for this I have:
>>>>>>>
>>>>>>> - Extended PAlgorithm
>>>>>>> - Extended LocalFileSystemPersistentModel and LocalFileSystemPersistentModelLoader
>>>>>>> - Put a dummy emptyRDD into my model
>>>>>>> - Tried to access sc with model.dummyRDD.context, only to receive this error:
>>>>>>>
>>>>>>> org.apache.spark.SparkException: RDD transformations and actions can only
>>>>>>> be invoked by the driver, not inside of other transformations; for example,
>>>>>>> rdd1.map(x => rdd2.values.count() * x) is invalid because the values
>>>>>>> transformation and count action cannot be performed inside of the rdd1.map
>>>>>>> transformation. For more information, see SPARK-5063.
>>>>>>>
>>>>>>> Just like this user got it here
>>>>>>> <https://groups.google.com/forum/#!topic/predictionio-user/h4kIltGIIYE>
>>>>>>> in the predictionio-user group. Any suggestions?
>>>>>>>
>>>>>>> Here's more of my predict method:
>>>>>>>
>>>>>>> def predict(model: SomeModel, query: Query): PredictedResult = {
>>>>>>>
>>>>>>>   val appName = sys.env.getOrElse[String]("APP_NAME", ap.appName)
>>>>>>>
>>>>>>>   val previousEvents = try {
>>>>>>>     PEventStore.find(
>>>>>>>       appName = appName,
>>>>>>>       entityType = Some(ap.entityType),
>>>>>>>       entityId = Some(query.entityId.getOrElse(""))
>>>>>>>     )(model.dummyRDD.context).map(event => {
>>>>>>>       Try(new CustomEvent(
>>>>>>>         Some(event.event),
>>>>>>>         Some(event.entityType),
>>>>>>>         Some(event.entityId),
>>>>>>>         Some(event.eventTime),
>>>>>>>         Some(event.creationTime),
>>>>>>>         Some(new Properties(
>>>>>>>           ...
>>>>>>>         ))
>>>>>>>       ))
>>>>>>>     }).filter(_.isSuccess).map(_.get)
>>>>>>>   } catch {
>>>>>>>     case e: Exception => // fatal because of error, an empty query
>>>>>>>       logger.error(s"Error when reading events: ${e}")
>>>>>>>       throw e
>>>>>>>   }
>>>>>>>
>>>>>>>   ...
>>>>>>>
>>>>>>> }
>>>>>
>>>>> --
>>>>> Hasan Can Saral
>>>>> hasancansa...@gmail.com

--
Hasan Can Saral
hasancansa...@gmail.com
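Kenneth's LEventStore suggestion boils down to something like the following inside predict(), with no SparkContext involved. This is a minimal sketch, assuming PredictionIO 0.10-style packages (older releases use io.prediction); the helper name and parameter values are illustrative:

```scala
import org.apache.predictionio.data.storage.Event
import org.apache.predictionio.data.store.LEventStore
import scala.concurrent.duration._

// Illustrative query: fetch the latest events of one entity directly
// from the event store, suitable for a low-latency predict() path.
def recentEvents(appName: String, entityType: String, entityId: String): List[Event] =
  try {
    LEventStore.findByEntity(
      appName = appName,
      entityType = entityType,
      entityId = entityId,
      limit = Some(20),          // only the handful of events needed for features
      latest = true,             // newest first
      timeout = 200.millis       // keep predict() responsive
    ).toList
  } catch {
    // Degrade gracefully if the event store is slow rather than failing the query.
    case _: scala.concurrent.TimeoutException => List.empty
  }
```

Because LEventStore talks to the storage backend directly, this avoids both the dummy-RDD workaround and the SPARK-5063 error quoted above.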