Ah, one of my favorite subjects.

I'm working on a prototype server that handles online (Kappa-style) learning
as well as Lambda-style. There is only one server, with everything going
through REST. There are two resource types, Engines and Commands. Engines have
REST APIs with endpoints for Events and Queries, so something like POST
/engines/resource-id/events sends an event to what is like a PIO app, and POST
/engines/resource-id/queries does the PIO query equivalent. Note that this is
fully multi-tenant and has only one important ID, the engine's resource-id.
It's based on akka-http in a fully microservice-style architecture. While the
server is running you can add completely new Templates for any algorithm,
thereby adding new endpoints for Events and Queries. Each "tenant" is super
lightweight since it's just an Actor, not a new JVM. The CLI is actually
Python that hits the REST API through a Python SDK, and there is a Java SDK
too. We support SSL and OAuth2, so having those baked into an SDK is really
important. Though a prototype, it can support multi-tenant SaaS.
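
To make those endpoints concrete, here is a rough akka-http sketch of the
Events and Queries routes. The object and handler names are placeholders, not
the prototype's actual code, which dispatches to a per-engine Actor and layers
on SSL and OAuth2:

    import akka.actor.ActorSystem
    import akka.http.scaladsl.Http
    import akka.http.scaladsl.server.Directives._
    import akka.stream.ActorMaterializer

    object EngineServerSketch extends App {
      implicit val system: ActorSystem = ActorSystem("engine-server")
      implicit val materializer: ActorMaterializer = ActorMaterializer()

      // Placeholder handlers; in the prototype each engine id resolves to a
      // lightweight Actor that validates and processes the JSON body.
      def handleEvent(engineId: String, json: String): String = """{"status":"ok"}"""
      def handleQuery(engineId: String, json: String): String = """{"result":[]}"""

      val routes =
        pathPrefix("engines" / Segment) { engineId =>
          path("events") {
            post {
              entity(as[String]) { json => complete(handleEvent(engineId, json)) }
            }
          } ~
          path("queries") {
            post {
              entity(as[String]) { json => complete(handleQuery(engineId, json)) }
            }
          }
        }

      Http().bindAndHandle(routes, "0.0.0.0", 9090)
    }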

We have a prototype online learner Template which does not save events at all,
though it ingests events exactly like PIO and in the same format; in fact we
have the same template for both servers taking identical input. Instead of
using an EventServer, it mirrors received events before validation (yes, we
have full event validation that is template-specific). This allows some events
to affect mutable data in a database, some to remain an immutable stream, and
some to simply be thrown away for Kappa learners. For an online learner, each
event updates the model, which is stored periodically as a watermark. If you
want to change algorithm params you destroy the engine instance and replay the
mirrored events. For a Lambda learner the Events may be stored like PIO.
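
Here is a rough Scala sketch of that per-event flow. The trait and method
names are illustrative, not the prototype's real interfaces; the point is the
ordering: mirror first, then validate, then update the model, with periodic
checkpoints, and replay from the mirror when an instance is rebuilt:

    // Illustrative sketch only; names and types are hypothetical.
    trait KappaEngine[E, M] {
      def validate(event: E): Either[String, E] // template-specific validation
      def updateModel(model: M, event: E): M    // incremental, per-event update
      def checkpoint(model: M): Unit            // persist the model "watermark"
    }

    class KappaRunner[E, M](engine: KappaEngine[E, M],
                            mirror: E => Unit,  // write-ahead mirror of raw input
                            checkpointEvery: Int,
                            initial: M) {
      private var model: M = initial
      private var count = 0L

      // Live path: mirror the raw event before validation, then learn from it.
      def receive(event: E): Unit = { mirror(event); learn(event) }

      // Replay path: rebuild the model from already-mirrored events, e.g.
      // after destroying an instance to change algorithm parameters.
      def replay(events: Iterator[E]): Unit = events.foreach(learn)

      private def learn(event: E): Unit =
        engine.validate(event) match {
          case Right(e) =>
            model = engine.updateModel(model, e)
            count += 1
            if (count % checkpointEvery == 0) engine.checkpoint(model)
          case Left(_) => () // invalid or rejected events are simply dropped
        }
    }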

This is very much along the lines of the proposal I put up for future PIO, but
the internal philosophy is so different that I'm now not sure how it would
fit. I'd love to talk about it sometime, and once we do a Lambda Template
we'll at least have some nice comparisons to make. We migrated the Kappa-style
Template to it, so we have a good idea that it's not that hard. I'd love to
donate it to PIO, but only if it makes sense.


On Jun 28, 2017, at 4:27 PM, Donald Szeto <[email protected]> wrote:

Hey Mars,

Thanks for the suggestion, and I agree with your point on the metadata part.
Essentially I think the app and channel concepts should instead be logically
grouped with the event concept, not with metadata.

I think in some advanced use cases, event storage should not even be a hard
requirement, since engine templates can source data differently. In the long
run, it might be cleaner to have the event server (and all relevant concepts
such as its API, access keys, apps, etc.) as a separable package that is
embedded in the engine server and turned on by default. Advanced users can
either make it standalone or even turn it off completely.

I imagine this kind of refactoring would echo Pat's proposal on making a clean 
and separate engine and metadata management system down the road.

Regards,
Donald

On Wed, Jun 28, 2017 at 3:29 PM Mars Hall <[email protected]> wrote:
One of the ongoing challenges we face with PredictionIO is the separation of 
Engine & Eventserver APIs. This separation leads to several problems:

1. Deploying a complete PredictionIO app requires multiple processes, each with 
its own network listener
2. Eventserver & Engine must be configured to share exactly the same storage 
backends (same `pio-env.sh`)
3. Confusion between "Eventserver" (an optional REST API) & "event storage" (a 
required database)

These challenges are exacerbated by the fact that PredictionIO's docs & `pio 
app` CLI make it appear that sharing an Eventserver between Engines is a good 
idea. I recently filed a JIRA issue about this topic. TL;DR: sharing an
Eventserver between engines with different Meta Storage config will cause data
corruption:
  https://issues.apache.org/jira/browse/PIO-96


I believe a lot of these issues could be alleviated with one change to 
PredictionIO core:

By default, expose the Eventserver API from the `pio deploy` Engine process, so
that it is not necessary to deploy a second Eventserver-only process. A
separate `pio eventserver` would still be an option if you need the separation
of concerns for scalability.


I'd love to hear what you folks think. I will file a JIRA enhancement issue if 
this seems like an acceptable approach.

Mars Hall
Customer Facing Architect
Salesforce Platform / Heroku
San Francisco, California

