Re: Eventserver API in an Engine?

Pat Ferrel Mon, 10 Jul 2017 08:11:57 -0700

Good to know, but if there is an event blocker and sniffer then they should be a 
concern of the Engine. Otherwise you are hiding Engine specifics from the 
Engine. The strongest case for the “input” method is Kappa's requirements 
and Lambda's need for realtime changes to the model.

I understand what you are aiming for—namely data independence from model and 
engine—but it is impossible and seems a very odd place to abstract when you put 
it in real terms. A Recommender will never need the same data as a neural net, 
a clusterer, or a classifier. This abstraction does not exist in the data 
because it is not there in the algorithm and should not be forced away from the 
Engine.

BTW the way the prototype server handles this data independence is by allowing 
the user to ignore the engine (which may be under tuning or development and not 
reliable for validation) and simply mirror un-validated events (PIO has this 
built into some client SDKs, but that only captures a single client's 
events). The mirrored events can then be replayed or modified, as with exported 
PIO events. The server also imports these, maintaining event-level 
compatibility with PIO. This even works with Kappa: if you want to re-create a 
Kappa model you simply replay the mirrored events. But mirroring is optional 
and likely to be turned off once the Engine is running correctly. IMO it is a 
more flexible model than forcing data independence away from the Engine and 
maintaining it into the storage layer.
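
A minimal sketch of the mirror-and-replay idea, in Python. The names 
(`Mirror`, `receive`, `replay`) are hypothetical and this is not the prototype 
server's code; it only assumes that the raw request body is logged before the 
Engine sees it, and that replaying means feeding the log to a fresh Engine.

```python
import json
from typing import Callable, List


class Mirror:
    """Append-only log of raw request bodies (a list stands in for a file)."""

    def __init__(self) -> None:
        self.lines: List[str] = []

    def write(self, raw: str) -> None:
        self.lines.append(raw)


def receive(raw: str, mirror: Mirror, engine_input: Callable[[dict], None]) -> bool:
    """Mirror the raw body first, then let the Engine accept or reject it."""
    mirror.write(raw)  # kept even if the Engine rejects the event
    try:
        engine_input(json.loads(raw))
        return True
    except (ValueError, KeyError, TypeError):  # malformed JSON or a bad/missing field
        return False


def replay(mirror: Mirror, engine_input: Callable[[dict], None]) -> int:
    """Feed every mirrored body to a (possibly new) Engine; return how many it accepted."""
    scratch = Mirror()  # replayed events are not mirrored a second time
    return sum(receive(raw, scratch, engine_input) for raw in list(mirror.lines))
```

Because the mirror holds the input exactly as it arrived, the same log can be 
replayed after the Engine's validation or parameters change.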

So far I’ve written 3 PIO Templates from scratch: the UR, the Contextual Bandit 
(MAB type online learner), and the db-cleaner. What I have found with these 
rather different algorithms is:
1) PIO works ok with the UR but could use realtime validation and a better way 
of dropping old events.
2) Kappa doesn’t work well at all with PIO but does with the prototype server.
3) event/dataset compatibility can be maintained between PIO and prototype 
Engines.
4) there is no need for a db-cleaner in the prototype. The Engine persists 
mutable objects and makes realtime changes to their state, and event streams 
can be handled as the Engine needs (Kappa discards without storing, Lambda may 
store). Since these streams are separate from $set and $unset, they can have db 
TTLs to age out old data for Lambda Engines. The system is always self-cleaning, 
with no heavyweight operation required to keep just the right data (the db 
cleaner is heavyweight and slow), and by design the data does not grow forever. 
This was never addressed as a design requirement for PIO and the add-on we did 
is not a very good solution.
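
To make point 4 concrete, here is a small in-memory sketch in Python. 
`EngineStore` and its methods are made-up names, and a real deployment would 
presumably lean on the database's own TTL support rather than this loop. It 
assumes events arrive in time order and that the caller passes the current 
time as `now`.

```python
from collections import deque
from typing import Any, Deque, Dict, Tuple


class EngineStore:
    """Mutable objects live in a dict; the event stream ages out by TTL."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self.objects: Dict[str, Dict[str, Any]] = {}      # never aged out
        self.stream: Deque[Tuple[float, dict]] = deque()  # oldest event first

    def handle(self, event: dict, now: float) -> None:
        name = event["event"]
        props = event.get("properties", {})
        if name == "$set":
            self.objects.setdefault(event["entityId"], {}).update(props)
        elif name == "$unset":
            current = self.objects.get(event["entityId"], {})
            for key in props:
                current.pop(key, None)
        else:
            self.stream.append((now, event))
        self._expire(now)

    def _expire(self, now: float) -> None:
        # Drop stream events older than the TTL; objects are untouched.
        while self.stream and now - self.stream[0][0] > self.ttl:
            self.stream.popleft()
```

Expiry happens as a side effect of normal input, so no separate cleaning job 
is needed in this sketch.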

On Jul 9, 2017, at 7:09 PM, Kenneth Chan <kenn...@apache.org> wrote:

i think there is a philosophical discussion:
1) as PIO user, should i collect my event data based on my application 
uniqueness and ML needs (of course, i can use the template format as 
reference), then create engine or modify engine template to use these data to 
train model 
or 
2) as PIO user, because i'm using this specific engine template, i must import 
and transform my data into the exact format required by template, and send to 
event server in order to make it work.

however, regardless of above, PIO event server currently supports "event 
blocker" and "event sniffer" to solve these issues you mentioned
1) "event blocker" can be used for "event validation in real time" - the engine 
template can provide a sample event blocker implementation and can be used to 
reject improper events.

2) "event sniffer" can be used for "forwarding specific event to other 
processing system in real time" - the engine template can also provide a sample 
sniffer (e.g. send to UR's elasticsearch to update meta data) 

for advanced user, they can modify these based on their application needs (say, 
if they have multiple engines). for starter, they may use out of the box along 
with template.

see 
http://mail-archives.apache.org/mod_mbox/incubator-predictionio-dev/201706.mbox/%3CCAF_HxLtEonOVALSQgrCRGXctAbL7eypxwG0ErHpaBJJym15j5Q%40mail.gmail.com%3E

On Sun, Jul 9, 2017 at 5:28 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
I must disagree here. The Engine should decide the disposition of data, which 
cannot be left to a generic EventServer. Data is the concern of the Engine, not 
the EventServer or PIO framework, for these reasons:

1) input needs to be validated and since it is defined by the Engine it seems 
rather obvious that the Engine must provide an “input” method like the 
“predict” method. This input method parses and validates input responding with 
errors of format that only it knows about. It also decides…
2) Kappa learners must get data in realtime and do not save datasets, only 
buffers of data at most.
3) Kappa and even some Lambda algorithms need to modify/update the model in 
realtime. Realtime model updates define "Kappa online learners" but there are 
also Lambda learners like the UR that need to update parts of the model when, 
for instance, item attributes change (out of stock, ...). As PIO stands now this 
can only be done at train time, which is a rather troublesome limitation.
4) It is the Engine’s concern whether input modifies mutable or immutable 
data. One engine may use a named event to do something but the name of the 
event is only known by the Engine. So if you agree that data come in 2 forms, 
only the Engine can define and enforce this.
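
As an illustration of point 1, a sketch in Python of an Engine-side input 
method that validates against rules only the Engine knows. The class name, the 
status codes, and the rule that named events need a `targetEntityId` are 
assumptions made for the example; only the field names follow the PIO event 
JSON.

```python
from typing import List, Set

REQUIRED_FIELDS = ("event", "entityType", "entityId")


class RecommenderInput:
    """Hypothetical Engine input: rejects events this Engine cannot use."""

    def __init__(self, known_events: Set[str]) -> None:
        self.known_events = known_events  # event names defined by this Engine

    def validate(self, event: dict) -> List[str]:
        errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in event]
        name = event.get("event")
        # Reserved "$" events ($set, $unset, ...) are not checked against the name list.
        if isinstance(name, str) and not name.startswith("$"):
            if name not in self.known_events:
                errors.append(f"unknown event name: {name}")
            elif "targetEntityId" not in event:
                errors.append("missing field: targetEntityId")
        return errors

    def input(self, event: dict) -> dict:
        errors = self.validate(event)
        if errors:
            return {"status": 400, "errors": errors}
        return {"status": 201}
```

A generic event server would accept a misspelled event name as valid JSON; 
here the client gets the error back on the request that caused it.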

This is certainly not to denigrate the EventStore, which is most certainly 
required by every existing PIO Lambda Engine. But it should be the concern of 
the Engine how it is used and the only way to do this is make “input” the 
concern of the Engine. This can be done generically if there is truly no 
validation needed beyond the current, and so does not needlessly complicate 
Engines.

I am also not arguing for a different encoding of data. The PIO event JSON is 
quite flexible and I have not seen a need to alter it. However because of its 
flexibility the EventServer cannot really validate it. The PIO events are even 
quite sufficient for Lambda and Kappa data encoding. In fact, we have a Lambda 
Template in PIO that we made into a Kappa Template with the prototype server, 
using exactly the same event encoding. Since the prototype requires that the 
Engine validate it and respond to the input request, we immediately found event 
encoding errors that were very serious and had been in the client for a long 
time. Because the events looked perfectly fine to the PIO EventServer, the 
errors were never detected and the data was in fact ignored. Within a day of 
replaying exported PIO events to the prototype server the issue was resolved 
and fixed in the client.

On Jul 8, 2017, at 12:48 AM, Kenneth Chan <kenn...@apache.org> wrote:

re: "bundling event server as engine"

depending on how we wanna separate the concern.

the way i look at it is to decouple 1. data collection service (PIO event 
server) and 2. modeling and prediction service (PIO engine) - that's the 
separation of concern.

Ideally data is agnostic to engine, and should be tied to user application.
The original vision is user collect data, then can create multiple PIO engines 
which use the collected data.
if combine 1 and 2, how could user create engine A and engine B to train model 
on collected data for different ML use case?

for your input data problem, maybe other way is that the template should also 
provide a "event validator" which can be loaded into event server and advanced 
user can also customize it.

On Sat, Jul 8, 2017 at 12:31 AM, Kenneth Chan <kenn...@apache.org> wrote:
# re: " I see it as objects you see it as data stores"

not really. I see things based on what functionality and purpose it provides. 
like you mentioned - the way Elasticsearch is used in UR is part of the model: 
it is where the algorithm writes the computation result, which is then used for 
serving. In a way, it's the model. just a more complex model than a simple 
linear regression function.
If we define "Model" as output of the train() function, then UR is storing the 
model into Elasticsearch - and it is required because UR relies on 
Elasticsearch computation - meaning it's part of UR's "model".predict()

# re:  "In reality the input comes in 2 types, persistent mutable objects and 
immutable streams of events (that may well be usable as a time window of data, 
dropping old events)"

like you said, basically there are two types of data 
1. mutable object (e.g. meta data of a product, user profile, etc) 
2. immutable event (e.g. behavior data)

However, 1 can be considered as 2 if we treat the "changes" of mutable object 
as "event" as well - basically this is the current event server design.

But i agree some use case may not care about changes of mutable object - for 
this, we can provide some API/option for people to store mutable objects and 
always overwrite. or use better storage structure to capture the changes of 
mutable object.
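
To illustrate treating object changes as events, a small Python sketch that 
folds a log of `$set` / `$unset` / `$delete` events into the current state of 
each object. The function name is hypothetical and this is not the event 
server's implementation. It assumes every `eventTime` is an ISO 8601 string in 
the same format and time zone, so that plain string sorting gives time order.

```python
from typing import Any, Dict, Iterable


def aggregate_properties(events: Iterable[dict]) -> Dict[str, Dict[str, Any]]:
    """Replay change events in time order and return the latest state per entity."""
    state: Dict[str, Dict[str, Any]] = {}
    for event in sorted(events, key=lambda e: e["eventTime"]):
        entity = event["entityId"]
        props = event.get("properties", {})
        if event["event"] == "$set":
            state.setdefault(entity, {}).update(props)
        elif event["event"] == "$unset":
            for key in props:
                state.get(entity, {}).pop(key, None)
        elif event["event"] == "$delete":
            state.pop(entity, None)
    return state
```

The current value of an object only makes sense as the aggregate of all its 
changes, so these events cannot be dropped by age the way behavior events can.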

On Fri, Jun 30, 2017 at 5:29 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
Actually I think it’s a great solution. The question about different storage 
config (https://issues.apache.org/jira/browse/PIO-96) arises because 
Elasticsearch performs the last step of the algorithm, it is not just a store 
for models, so it’s an integral part of the compute engine, not the storage. If 
it looks that way I hardly think it matters in the way implied (see below where 
Templates should come with composable containers). This is actually the primary 
difference in the way you and I look at the problem. I see it as objects, you 
see it as data stores. Let’s add the question of compute backends and 
unfortunately users will have to pick the solution along with the engines they 
require (TensorFlow anyone?) If PIO is going to be a viable ML/AI server in the 
long term it has to be a lot more flexible, not less so. In the proto server I 
mention, the Engine decides on the compute backend and the example Template 
does not use Spark. 

The prototype server I mentioned actually only handles metadata, installs 
engines, and mirrors input. To handle Kappa as well as Lambda algorithms the 
Engine must decide what and if it needs to store. Therefore instead of assuming 
an EventServer we have mirroring of un-validated events. This has many 
benefits. For one thing we can require validation from the Engine with every 
event. This is because the single most frequent mistake by users I’ve dealt 
with is malformed input. PIO’s input scheme is great because it is so flexible 
but because of that validation is nil. I have seen users that have been using a 
Template for a year without understanding that most of their data was ignored 
by the Template code (not the UR in this case) . I have spent literally 
thousands of hours helping correct bad input over email even though the UR has 
orders of magnitude better docs than any other Template. Yes, it’s also a lot 
more complicated but anyway, I’m tired of this—we need validation of every 
input. Then maybe I will only spend 90% of those hours :-P

Anyway I think the separation of concerns should be: the Server handles 
metadata, installs engines, and mirrors input. The Template framework provides 
required APIs for Engines that must be implemented and a set of Tools they can 
use or ignore, using whatever they need. If the Engine provides an input method 
it can validate and, if it is Kappa, learn immediately (update models in real 
time); if it is Lambda, it can store the valid data using something like an 
Event Store. The train method is then optional and, of course, query.
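
A rough Python sketch of that split, using a toy popularity count as the 
"model". The `Engine` base class and both subclasses are hypothetical names, 
not the Template framework's API; the point is only that the same input method 
either updates the model immediately (Kappa) or stores the event for a later 
train (Lambda).

```python
from abc import ABC, abstractmethod
from typing import Dict, List


def top_items(counts: Dict[str, int], num: int) -> List[str]:
    """Most frequent items first; ties broken by item id."""
    return sorted(counts, key=lambda item: (-counts[item], item))[:num]


class Engine(ABC):
    """Hypothetical Engine contract: input and query required, train optional."""

    @abstractmethod
    def input(self, event: dict) -> None: ...

    @abstractmethod
    def query(self, q: dict) -> dict: ...

    def train(self) -> None:
        """No-op by default, so Kappa engines can ignore it."""


class KappaPopularity(Engine):
    def __init__(self) -> None:
        self.model: Dict[str, int] = {}

    def input(self, event: dict) -> None:
        # Learn immediately; the event itself is not stored.
        item = event["targetEntityId"]
        self.model[item] = self.model.get(item, 0) + 1

    def query(self, q: dict) -> dict:
        return {"items": top_items(self.model, q.get("num", 1))}


class LambdaPopularity(Engine):
    def __init__(self) -> None:
        self.event_store: List[dict] = []  # stand-in for an Event Store
        self.model: Dict[str, int] = {}

    def input(self, event: dict) -> None:
        self.event_store.append(event)  # store now, learn at train time

    def train(self) -> None:
        model: Dict[str, int] = {}
        for event in self.event_store:
            item = event["targetEntityId"]
            model[item] = model.get(item, 0) + 1
        self.model = model

    def query(self, q: dict) -> dict:
        return {"items": top_items(self.model, q.get("num", 1))}
```

With identical input, the Kappa engine answers queries from the first event 
on, while the Lambda engine's answers only change after `train()` runs.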

BTW the reason I call it a PredictionServer (in PIO) is because it is not an 
Engine Server, all it does is provide a query endpoint. This corresponds to 
only one method of an Engine and there is no reason to look at a query endpoint 
any differently than the other public APIs of the Engine.

I guess I look at this in an object oriented way, not a data oriented way. This 
leads to Template code/Engines making more decisions. The Kappa template we 
have for this proto server never uses Spark. Why would it to implement Kappa 
online learning? It also does not need an Event Store because it only stores 
models. This is also fine for Lambda where an Event Store is required because 
the Engine provides the input method too, where it can make the store/no-store 
decision.

This has other benefits. Treating input as an immutable stream has some major 
flaws. Some of the data has to be dropped, we cannot store forever—no one can 
afford that much disk. And some data can never be dropped because only the 
aggregate of all object changes makes any sense. In reality the input comes in 
2 types, persistent mutable objects and immutable streams of events (that may 
well be usable as a time window of data, dropping old events). With the above 
split, the mirror always has all input in case it’s needed, the Engine can 
decide what events operate on mutable objects and store the rest as a stream in 
the Event Store (with TTL for time windows). Once this is trusted to work 
correctly mirroring can be stopped. In fact the mutable objects can affect the 
model in real time now, even with Lambda Templates like the UR. When an object 
property changes in today’s PIO we have to wait till train before the model 
changes because the Engine does not have an input method. If it did, then input 
that should affect the model can.

This solves all my pet peeves, internal API-wise, and allows one implementation 
of an SaaS capable multi-tenant, secure Server. And here multi-tenancy is super 
lightweight. Since most users have only one Template, they may have to install 
supporting compute engines or stores. This is a one time issue for them and 
Templates should come with containers and scripts to compose them. We’re 
already doing this with PIO. A fully clustered install takes an hour. Admin of 
such a monster is another issue that is not necessarily better or even good in 
this model but a subject for another day.

On Jun 30, 2017, at 1:40 AM, Kenneth Chan <kenn...@apache.org> wrote:

I agree that there is confusion regarding event server VS event storage  and  
the unclear usage definition of types of data storage (e.g. meta-data vs model)
but i'm not sure if bundling Event Server with Engine Server (or Pat calls it 
PredictionServer)  is a good solution.

currently PIO has 3 "types" of storage
- METADATA  : store PIO's administrative data ("Apps", etc)
- EVENTDATA: store the pure events
- MODELDATA : store the model

1. one confusion is when universal recommendation is used, Elasticsearch is 
required in order to serve the Predicted Results. Is this type of storage 
considered as "MODELDATA" or "METADATA" or should introduce a new type of 
storage for "Serving" purpose (which can be tied to engine specific) ?

2. question regarding the problem described in ticket 
https://issues.apache.org/jira/browse/PIO-96

```
 Problems emerge when a developer tries running multiple engines with different 
storage configs on the same underlying database, such as:
a Classifier with Postgres meta, event, & model storage, and
the Universal Recommender with Elasticsearch meta plus Postgres event & model 
storage.
```

why user want to use different storage config for different engine? can the 
classifier match the same configuration as universal recommender?
because i thought the storage configuration is more tied to PIO as a whole 
rather than per engine.

Kenneth

On Thu, Jun 29, 2017 at 10:22 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
Are you asking about the EventServer or PredictionServer? The EventServer is 
multi-tenant with access keys, not really pure REST. We (ActionML) did a hack 
for a client to the PredictionServer to allow Actors to respond on the same 
port for several engine queries. We used REST addressing for this, which adds 
yet another id. This makes for one process for the EventServer and one for the 
PredictionServer. Each responding engine was behind an Actor, not a new process. 
So it’s possible but IMO makes the API as a whole rather messy. We also had to 
change the workflow so metadata was read on `pio deploy` so one build could 
then deploy many times with different engine.jsons and different 
PredictionServer endpoints for queries only. This comes pretty close to clean 
multi-tenancy but is not SaaS capable without solving SSL and Auth for both 
services.

The hack was pretty ugly in the code and after doing that I concluded that a 
big chunk needed a rewrite, hence the prototype. It depends on what you want 
but if you want SaaS I think that means SSL + Auth + multi-tenancy, and you also 
mention minimizing process boundaries. There are rather many implications to 
this.

On Jun 29, 2017, at 9:57 AM, Mars Hall <m...@heroku.com> wrote:

Donald, Pat, great to hear that this is a well-pondered design challenge of PIO 
😄 The prototype, composable, all-in-one server sounds promising.

I'm wondering if there's a more immediate possibility to address adding the 
`/events` REST API to Engine? Would it make sense to try invoking an 
`EventServiceActor` in the tools.commands.Engine#deploy method? If that would 
be a distasteful hack, just say so. I'm trying to understand the possibility of 
solving this in the current codebase vs a visionary new version of PIO.

*Mars

( <> .. <> )

> On Jun 28, 2017, at 18:01, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> Ah, one of my favorite subjects.
>
> I’m working on a prototype server that handles online learning as well as 
> Lambda style. There is only one server with everything going through REST. 
> There are 2 resource types, Engines and Commands. Engines have REST APIs with 
> endpoints for Events and Queries. So something like POST 
> /engines/resource-id/events would send an event to what is like a PIO app and 
> POST /engines/resource-id/queries does the PIO query equivalent. Note that 
> this is fully multi-tenant and has only one important id. It’s based on 
> akka-http in a fully microservice type architecture. While the Server is 
> running you can add completely new Templates for any algorithm, thereby 
> adding new endpoints for Events and Queries. Each “tenant” is super 
> lightweight since it’s just an Actor not a new JVM. The CLI is actually 
> Python that hits the REST API with a Python SDK, and there is a Java SDK too. 
> We support SSL and OAuth2 so having those baked into an SDK is really 
> important. Though a prototype it can support multi-tenant SaaS.
>
> We have a prototype online learner Template which does not save events at all 
> though it ingests events exactly like PIO, in the same format. In fact we have 
> the same template for both servers taking identical input. Instead of an 
> EventServer it mirrors received events before validation (yes we have 
> full event validation that is template specific.) This allows some events to 
> affect mutable data in a database and some to just be an immutable stream or 
> even be thrown away for Kappa learners. For an online learner, each event 
> updates the model, which is stored periodically as a watermark. If you want 
> to change algo params you destroy the engine instance and replay the mirrored 
> events. For a Lambda learner the Events may be stored like PIO.
>
> This is very much along the lines of the proposal I put up for future PIO but 
> the philosophy internally is so different that I’m now not sure how it would 
> fit. I’d love to talk about it sometime and once we do a Lambda Template 
> we’ll at least have some nice comparisons to make. We migrated the Kappa 
> style Template to it so we have a good idea that it’s not that hard. I’d love 
> to donate it to PIO but only if it makes sense.
>
>
> On Jun 28, 2017, at 4:27 PM, Donald Szeto <don...@apache.org> wrote:
>
> Hey Mars,
>
> Thanks for the suggestion and I agree with your point on the metadata part. 
> Essentially I think the app and channel concept should be instead logically 
> grouped together with event, not metadata.
>
> I think in some advanced use cases, event storage should not even be a hard 
> requirement as engine templates can source data differently. In the long run, 
> it might be cleaner to have event server (and all relevant concepts such as 
> its API, access keys, apps, etc) as a separable package, that is by default 
> turned on, embedded to engine server. Advanced users can either make it 
> standalone or even turn it off completely.
>
> I imagine this kind of refactoring would echo Pat's proposal on making a 
> clean and separate engine and metadata management system down the road.
>
> Regards,
> Donald
>
> On Wed, Jun 28, 2017 at 3:29 PM Mars Hall <m...@heroku.com> wrote:
> One of the ongoing challenges we face with PredictionIO is the separation of 
> Engine & Eventserver APIs. This separation leads to several problems:
>
> 1. Deploying a complete PredictionIO app requires multiple processes, each 
> with its own network listener
> 2. Eventserver & Engine must be configured to share exactly the same storage 
> backends (same `pio-env.sh`)
> 3. Confusion between "Eventserver" (an optional REST API) & "event storage" 
> (a required database)
>
> These challenges are exacerbated by the fact that PredictionIO's docs & `pio 
> app` CLI make it appear that sharing an Eventserver between Engines is a good 
> idea. I recently filed a JIRA issue about this topic. TL;DR sharing an 
> eventserver between engines with different Meta Storage config will cause 
> data corruption:
>  https://issues.apache.org/jira/browse/PIO-96
>
>
> I believe a lot of these issues could be alleviated with one change to 
> PredictionIO core:
>
> By default, expose the Eventserver API from the `pio deploy` Engine process, 
> so that it is not necessary to deploy a second Eventserver-only process. 
> Separate `pio eventserver` could still be optional if you need the separation 
> of concerns for scalability.
>
>
> I'd love to hear what you folks think. I will file a JIRA enhancement issue 
> if this seems like an acceptable approach.
>
> *Mars Hall
> Customer Facing Architect
> Salesforce Platform / Heroku
> San Francisco, California
>
>
