Re: How to access Spark Context in predict?

2016-09-27 Thread Hasan Can Saral
Hi Kenneth & Donald,

That was really clarifying, thank you. I really appreciate it. So now I
know that;

1- I should use LEventStore and query HBase without sc, with as little
processing as possible in predict,
2- In this case I don't have to extend PersistentModel, since I will not
need sc in predict.
3- If I need sc and batch processing in predict, I can save the RandomForest
trees to a file, then load them from there. As far as I can see, my only
option to access sc for PEventStore is to add a dummy RDD to the
model and use dummyRDD.context (see the sketch below).
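
For illustration, a minimal sketch of the dummy-RDD idea in point 3 (the class
and field names are placeholders, not a PredictionIO API):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical sketch: any RDD keeps a reference to the SparkContext that
// created it, so a tiny placeholder RDD stored in the model lets predict()
// get sc back without PredictionIO passing it in.
class SomeModel(val dummyRDD: RDD[Int]) {
  // Called from predict(): recover the context from the placeholder RDD.
  def sc: SparkContext = dummyRDD.context
}

// In train(): new SomeModel(sc.parallelize(Seq(0)))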

Am I correct, especially in the 3rd point?

Thank you again,
Hasan

On Tue, Sep 27, 2016 at 9:00 AM, Kenneth Chan  wrote:

> Hasan,
>
> Spark's random forest model doesn't need an RDD. It's much simpler to
> serialize it and use it in local memory in predict().
> See the example here:
> https://github.com/PredictionIO/template-scala-parallel-leadscoring/blob/
> develop/src/main/scala/RFAlgorithm.scala
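
(For illustration only, a rough sketch of what that in-memory call looks like;
the Query and PredictedResult fields and the feature encoding below are made up,
see the template above for the real code.)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.RandomForestModel

case class Query(amount: Double, numEventsLastHour: Double) // made-up query fields
case class PredictedResult(label: Double)

// Hypothetical predict(): the RandomForestModel lives in local memory, so
// scoring is a plain method call -- no SparkContext and no RDDs involved.
def predict(model: RandomForestModel, query: Query): PredictedResult = {
  val features = Vectors.dense(query.amount, query.numEventsLastHour)
  PredictedResult(label = model.predict(features)) // local, in-memory evaluation
}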
>
> For accessing the event store in predict(), you should use the LEventStore API
> (not the PEventStore API) to get fast queries for specific events.
>
> (Use the PEventStore API if you really want to do batch processing again in
> predict() and need an RDD for it.)
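
(A sketch of the kind of LEventStore call meant here; the app name, entity
values and event names are assumptions, and the package prefix is io.prediction
for pre-Apache releases, org.apache.predictionio afterwards.)

import io.prediction.data.store.LEventStore
import scala.concurrent.duration.Duration

// Fetch a small, bounded set of recent events for one entity -- cheap enough
// to call from predict() on every query.
val recentEvents = LEventStore.findByEntity(
  appName = "MyApp",                       // assumption: your app name
  entityType = "user",
  entityId = "u123",                       // assumption: taken from the query
  eventNames = Some(Seq("transaction")),   // assumption: your event name
  limit = Some(20),
  latest = true,
  timeout = Duration(200, "millis")).toList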
>
>
> Kenneth
>
>
> On Mon, Sep 26, 2016 at 9:19 PM, Donald Szeto  wrote:
>
>> Hi Hasan,
>>
>> Does your randomForestModel contain any RDD?
>>
>> If so, implement your algorithm by extending PAlgorithm, have your model
>> extend PersistentModel, and implement PersistentModelLoader to save and
>> load your model. You will be able to perform RDD operations within
>> predict() by using the model's RDD.
>>
>> If not, implement your algorithm by extending P2LAlgorithm, and see if
>> PredictionIO can automatically persist the model for you. The convention
>> assumes that a non-RDD model does not require Spark to perform any RDD
>> operations, so there will be no SparkContext access.
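
(A skeleton of the second case Donald describes, sketched with made-up type
names and a plain MLlib random forest; your template's actual Query,
PreparedData and parameter classes will differ.)

import io.prediction.controller.{P2LAlgorithm, Params}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Placeholder types standing in for whatever the template defines.
case class AlgoParams(numTrees: Int) extends Params
case class PreparedData(labeledPoints: RDD[LabeledPoint])
case class Query(features: Array[Double])
case class PredictedResult(label: Double)

// The trained model is a plain serializable object with no RDD inside, so
// PredictionIO can persist it automatically and predict() never touches Spark.
class SomeAlgorithm(ap: AlgoParams)
  extends P2LAlgorithm[PreparedData, RandomForestModel, Query, PredictedResult] {

  def train(sc: SparkContext, data: PreparedData): RandomForestModel =
    RandomForest.trainClassifier(data.labeledPoints, numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](), numTrees = ap.numTrees,
      featureSubsetStrategy = "auto", impurity = "gini", maxDepth = 5, maxBins = 32)

  def predict(model: RandomForestModel, query: Query): PredictedResult =
    PredictedResult(model.predict(Vectors.dense(query.features)))
}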
>>
>> Are these conventions not fitting your use case? Feedback is always
>> welcome for improving PredictionIO.
>>
>> Regards,
>> Donald
>>
>>
>> On Mon, Sep 26, 2016 at 9:05 AM, Hasan Can Saral wrote:
>>
>>> Hi Marcin,
>>>
>>> I did look at the definition of PersistentModel, and indeed replaced
>>> LocalFileSystemPersistentModel with PersistentModel. Thank you for this, I
>>> really appreciate your help.
>>>
>>> However, I am having quite a hard time understanding how I can access the sc
>>> object that PredictionIO provides to the save and apply methods from within
>>> the predict method.
>>>
>>> class SomeModel(randomForestModel: RandomForestModel, dummyRDD: RDD[_])
>>>   extends PersistentModel[SomeAlgorithmParams] {
>>>
>>>   override def save(id: String, params: SomeAlgorithmParams,
>>>                     sc: SparkContext): Boolean = {
>>>     // Here I should save randomForestModel to a file, but how to?
>>>     // Tried saveAsObjectFile but no luck.
>>>     true
>>>   }
>>> }
>>>
>>> object SomeModel extends PersistentModelLoader[SomeAlgorithmParams, SomeModel] {
>>>   override def apply(id: String, params: SomeAlgorithmParams,
>>>                      sc: Option[SparkContext]): SomeModel = {
>>>     // Here should I load randomForestModel from file? How?
>>>     new SomeModel(randomForestModel)
>>>   }
>>> }
>>>
>>> So, my questions have become:
>>> 1- Can I save randomForestModel? If yes, how? If I cannot, I will have
>>> to return false and retrain upon deployment. How do I skip pio train in
>>> this case?
>>> 2- How do I load the saved randomForestModel from file? If I cannot, will I
>>> remove object SomeModel extends PersistentModelLoader altogether?
>>> 3- How do I access sc within predict? Do I save a dummy RDD, load it in
>>> apply, and call .context on it? In this case what happens to randomForestModel?
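
(One way questions 1-3 could fit together, sketched on the assumption that
MLlib's RandomForestModel.save/load is acceptable; the path is arbitrary and
SomeAlgorithmParams is the existing params class from the code above.)

import io.prediction.controller.{PersistentModel, PersistentModelLoader}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

class SomeModel(val randomForestModel: RandomForestModel, val dummyRDD: RDD[Int])
  extends PersistentModel[SomeAlgorithmParams] {

  override def save(id: String, params: SomeAlgorithmParams, sc: SparkContext): Boolean = {
    // Q1: MLlib models come with their own save/load; the path is just an example.
    randomForestModel.save(sc, s"/tmp/$id/randomForestModel")
    true // true tells PredictionIO the model can be restored at deploy time
  }
}

object SomeModel extends PersistentModelLoader[SomeAlgorithmParams, SomeModel] {
  override def apply(id: String, params: SomeAlgorithmParams,
                     sc: Option[SparkContext]): SomeModel = {
    // Q2/Q3: sc is handed in here at deploy time; rebuild the dummy RDD so that
    // predict() can later reach the context via model.dummyRDD.context.
    val spark = sc.get
    new SomeModel(
      RandomForestModel.load(spark, s"/tmp/$id/randomForestModel"),
      spark.parallelize(Seq(0)))
  }
}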
>>>
>>> I am really quite confused and would really appreciate some help/sample
>>> code if you have time.
>>> Thank you.
>>> Hasan
>>>
>>>
>>> On Mon, Sep 26, 2016 at 2:56 PM, Marcin Ziemiński 
>>> wrote:
>>>
 Hi Hasan,

 So I guess, there are two things here:
 1. You need SparkContext for predictions
 2. You also need to retrain you model during loading

 Please, look at the definition of PersistentModel and the comments
 attached:

 trait PersistentModel[AP <: Params] {
   /** Save the model to some persistent storage.
     *
     * This method should return true if the model has been saved successfully so
     * that PredictionIO knows that it can be restored later during deployment.
     * This method should return false if the model cannot be saved (or should
     * not be saved due to configuration) so that PredictionIO will re-train the
     * model during deployment. All arguments of this method are provided
     * automatically by PredictionIO.
     *
     * @param id ID of the run that trained this model.
     * @param params Algorithm parameters that were used to train this model.
     * @param sc An Apache Spark context.
     */
   def save(id: String, params: AP, sc: SparkContext): Boolean

Re: delay of engines

2016-09-27 Thread Georg Heiler
Indeed. I am mostly interested in 2).

Kenneth Chan wrote on Tue., Sep. 27, 2016 at 09:22:

> Just want to clarify, since you mix "evaluation" and "prediction":
> 1. "evaluation" means evaluating the performance of the model
> 2. "prediction" means calculating the prediction (fraud or not in your case)
> I understand that evaluation needs to generate predictions in order to
> evaluate the accuracy of the model.
> But you are referring to low latency for 2, right?
>
> "Regarding the low volume data: some features will require some sort of
> SQL for extraction"
> So if this can be fast, and if using the model to calculate the likelihood
> of fraud can be done in memory (without RDD), then the latency should be
> low.
>
>
>
> On Tue, Sep 27, 2016 at 12:04 AM, Georg Heiler 
> wrote:
>
>> For me, the latency of model evaluation is more important than training
>> latency. This holds true for retraining / model updates as well. I would
>> say that the "evaluation / prediction" latency is the most critical one.
>>
>> Your point regarding 3) is very interesting for me. I have 2 types of
>> data:
>>
>>- low volume information about a customer
>>- high volume usage data
>>
>> The high volume data will require aggregation (e.g. Spark SQL) before the
>> model can be evaluated. Here, a higher latency would be OK.
>> Regarding the low volume data: some features will require some sort of
>> SQL for extraction.
>>
>>
>>
>> Kenneth Chan wrote on Tue., Sep. 27, 2016 at 07:43:
>>
>>> re: kappa vs lambda.
>>> As far as I understand, at a high level, kappa is more like a subset of
>>> lambda (i.e. it only keeps the real-time part).
>>>
>>> https://www.ericsson.com/research-blog/data-knowledge/data-processing-architectures-lambda-and-kappa/
>>>
>>> Georg, could you be more specific about what you mean by "latency
>>> requirement"?
>>>
>>> 1. latency of training a model with new data?
>>> 2. latency of deploying a new model? or
>>> 3. latency of getting a predicted result from the previously trained
>>> model given a query?
>>>
>>> If you are talking about 3, it depends on how your model calculates the
>>> prediction. It doesn't need Spark if the model can fit into memory.
>>>
>>>
>>>
>>>
>>> On Mon, Sep 26, 2016 at 9:41 PM, Georg Heiler wrote:
>>>
 Hi Donald
 For me it is more about stacking and meta learning. The selection of
 models could be performed offline.

 But
 1 I am concerned about keeping the model up to date - retraining
 2 having some sort of reinforcement learning to improve / punish based
 on correctness of new ground truth 1/month
 3 to have very quick responses. Especially more like an evaluation of a
 random forest / gbt / nnet without starting a YARN job.

 Thank you all for the feedback so far
 Best regards to
 Georg
 Donald Szeto wrote on Tue. Sep. 27, 2016 at 06:34:

> Sorry for side-tracking. I think Kappa architecture is a promising
> paradigm, but including batch processing from the canonical store to the
> serving layer store should still be necessary. I believe this somewhat
> hybrid Kappa-Lambda architecture would be generic enough to handle many 
> use
> cases. If this is something that sounds good to everyone, we should drive
> PredictionIO to that direction.
>
> Georg, are you talking about updating an existing model in different
> ways, evaluate them, and select one within a time constraint, say every 1
> second?
>
> On Mon, Sep 26, 2016 at 4:11 PM, Pat Ferrel 
> wrote:
>
>> If you need the model updated in realtime you are talking about a
>> kappa architecture and PredictionIO does not support that. It does Lambda
>> only.
>>
>> The MLlib-based recommenders use live contexts to serve from
>> in-memory copies of the ALS models but the models themselves were
>> calculated in the background. There are several scaling issues with doing
>> this but it can be done.
>>
>> On Sep 25, 2016, at 10:23 AM, Georg Heiler 
>> wrote:
>>
>> Wow thanks. This is a great explanation.
>>
>> So when I think about writing a Spark template for fraud detection (a
>> combination of Spark SQL and XGBoost) that would require <1 second latency,
>> how should I store the model?
>>
>> As far as I know, the startup of YARN jobs, e.g. a Spark job, is too slow
>> for that.
>> So it would be great if the model could be evaluated without using
>> the cluster, or at least have a hot Spark context similar to spark
>> jobserver or SnappyData.io. Is this possible
>> for prediction.io?
>>
>> Regards,
>> Georg
>> Pat Ferrel wrote on Sun. Sep. 25, 2016 at 18:19:
>>
>>> Gustavo it 

Re: delay of engines

2016-09-27 Thread Georg Heiler
For me, the latency of model evaluation is more important than training
latency. This holds true for retraining / model updates as well. I would
say that the "evaluation / prediction" latency is the most critical one.

Your point regarding 3) is very interesting for me. I have 2 types of data:

   - low volume information about a customer
   - high volume usage data

The high volume data will require aggregation (e.g. Spark SQL) before the
model can be evaluated. Here, a higher latency would be OK.
Regarding the low volume data: some features will require some sort of SQL
for extraction.
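
(As a rough illustration of that aggregation step; the table and column names
are invented, and this uses the Spark 1.x SQLContext API that was current at
the time.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical sketch: pre-aggregate the high-volume usage data per customer
// with Spark SQL, so only a small feature row per customer reaches the model.
// Assumes a usage_events table has been registered, e.g. via
// someDataFrame.registerTempTable("usage_events").
val sc = new SparkContext(new SparkConf().setAppName("usage-aggregation"))
val sqlContext = new SQLContext(sc)

val usageFeatures = sqlContext.sql(
  """SELECT customerId,
    |       COUNT(*)    AS numEvents,
    |       SUM(amount) AS totalAmount
    |FROM usage_events
    |GROUP BY customerId""".stripMargin)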



Kenneth Chan wrote on Tue., Sep. 27, 2016 at 07:43:

> re: kappa vs lambda.
> As far as I understand, at a high level, kappa is more like a subset of
> lambda (i.e. it only keeps the real-time part).
>
> https://www.ericsson.com/research-blog/data-knowledge/data-processing-architectures-lambda-and-kappa/
>
> Georg, could you be more specific about what you mean by "latency requirement"?
>
> 1. latency of training a model with new data?
> 2. latency of deploying a new model? or
> 3. latency of getting a predicted result from the previously trained model
> given a query?
>
> If you are talking about 3, it depends on how your model calculates the
> prediction. It doesn't need Spark if the model can fit into memory.
>
>
>
>
> On Mon, Sep 26, 2016 at 9:41 PM, Georg Heiler 
> wrote:
>
>> Hi Donald
>> For me it is more about stacking and meta learning. The selection of
>> models could be performed offline.
>>
>> But
>> 1 I am concerned about keeping the model up to date - retraining
>> 2 having some sort of reinforcement learning to improve / punish based on
>> correctness of new ground truth 1/month
>> 3 to have very quick responses. Especially more like an evaluation of a
>> random forest / gbt / nnet without starting a YARN job.
>>
>> Thank you all for the feedback so far
>> Best regards to
>> Georg
>> Donald Szeto wrote on Tue. Sep. 27, 2016 at 06:34:
>>
>>> Sorry for side-tracking. I think Kappa architecture is a promising
>>> paradigm, but including batch processing from the canonical store to the
>>> serving layer store should still be necessary. I believe this somewhat
>>> hybrid Kappa-Lambda architecture would be generic enough to handle many use
>>> cases. If this is something that sounds good to everyone, we should drive
>>> PredictionIO to that direction.
>>>
>>> Georg, are you talking about updating an existing model in different
>>> ways, evaluate them, and select one within a time constraint, say every 1
>>> second?
>>>
>>> On Mon, Sep 26, 2016 at 4:11 PM, Pat Ferrel 
>>> wrote:
>>>
 If you need the model updated in realtime you are talking about a kappa
 architecture and PredictionIO does not support that. It does Lambda only.

 The MLlib-based recommenders use live contexts to serve from in-memory
 copies of the ALS models but the models themselves were calculated in the
 background. There are several scaling issues with doing this but it can be
 done.

 On Sep 25, 2016, at 10:23 AM, Georg Heiler 
 wrote:

 Wow thanks. This is a great explanation.

 So when I think about writing a Spark template for fraud detection (a
 combination of Spark SQL and XGBoost) that would require <1 second latency,
 how should I store the model?

 As far as I know, the startup of YARN jobs, e.g. a Spark job, is too slow for
 that.
 So it would be great if the model could be evaluated without using the
 cluster, or at least have a hot Spark context similar to spark jobserver
 or SnappyData.io. Is this possible for
 prediction.io?

 Regards,
 Georg
 Pat Ferrel wrote on Sun. Sep. 25, 2016 at 18:19:

> Gustavo is correct. To put it another way, both Oryx and PredictionIO are
> based on what is called a Lambda Architecture. Loosely speaking, this means
> a potentially slow background task computes the predictive “model” but
> this does not interfere with serving queries. Then when the model is ready
> (stored in HDFS or Elasticsearch depending on the template) it is deployed,
> and the switch happens in microseconds.
>
> In the case of the Universal Recommender the model is stored in
> Elasticsearch. During `pio train` the new model is inserted into
> Elasticsearch and indexed. Once the indexing is done, the index alias used
> to serve queries is switched to the new index in one atomic action, so there
> is no downtime and any slow operation happens in the background without
> impeding queries.
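
(For reference, the atomic alias switch described above corresponds roughly to
something like the following against the Elasticsearch Java client; the index
and alias names are made up, and the Universal Recommender's real code lives in
its template.)

import org.elasticsearch.client.Client

// Hypothetical sketch: repoint the serving alias from the old index to the
// freshly built one in a single aliases request, which Elasticsearch applies
// atomically, so queries never see a half-updated state.
def switchAlias(client: Client, alias: String, oldIndex: String, newIndex: String): Unit = {
  client.admin().indices().prepareAliases()
    .removeAlias(oldIndex, alias)
    .addAlias(newIndex, alias)
    .get()
}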
>
> The answer will vary somewhat with the template. Templates that use
> HDFS for storage may need to be re-deployed, but even then the switch from
> the old model to having the new one running takes microseconds.
>