I agree that there is confusion regarding Event Server vs. event storage,
and that the usage of the different types of data storage (e.g. metadata
vs. model) is not clearly defined,
but I'm not sure that bundling the Event Server with the Engine Server (or
PredictionServer, as Pat calls it) is a good solution.
Currently PIO has 3 "types" of storage:
- METADATA: stores PIO's administrative data ("Apps", etc.)
- EVENTDATA: stores the pure events
- MODELDATA: stores the model
1. One source of confusion: when the Universal Recommender is used, Elasticsearch
is required in order to serve the predicted results. Is this type of
storage considered "MODELDATA" or "METADATA", or should we introduce a new
type of storage for "serving" purposes (which could be engine
specific)?
2. A question regarding the problem described in ticket
https://issues.apache.org/jira/browse/PIO-96
```
Problems emerge when a developer tries running multiple engines with
different storage configs on the same underlying database, such as:
- a Classifier with *Postgres* meta, event, & model storage, and
- the Universal Recommender with *Elasticsearch* meta plus *Postgres* event
& model storage.
```
Why would a user want to use different storage configs for different engines? Can the
classifier use the same configuration as the Universal Recommender?
I ask because I thought the storage configuration was tied to PIO as a whole
rather than to each engine.
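
To make the scenario from the ticket concrete, here is a rough Python sketch
(not PIO code) that compares the two storage configs quoted above and reports
which storage type they disagree on. The dict layout, the source names "PGSQL"
and "ELASTICSEARCH", and the helper `mismatched_repositories` are my own
assumptions for illustration only.

```python
# Hypothetical, simplified view of per-engine storage configs.
# Keys are the three PIO storage types; values are illustrative source names.
REPOSITORIES = ("METADATA", "EVENTDATA", "MODELDATA")

classifier = {
    "METADATA": "PGSQL",
    "EVENTDATA": "PGSQL",
    "MODELDATA": "PGSQL",
}

universal_recommender = {
    "METADATA": "ELASTICSEARCH",
    "EVENTDATA": "PGSQL",
    "MODELDATA": "PGSQL",
}


def mismatched_repositories(a, b):
    """Return the storage types whose backing source differs between two configs."""
    return [repo for repo in REPOSITORIES if a[repo] != b[repo]]


mismatches = mismatched_repositories(classifier, universal_recommender)
print(mismatches)
```

If the list is non-empty, the two engines do not agree on where that type of
data lives, which is the situation the ticket describes for meta storage.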
Kenneth
On Thu, Jun 29, 2017 at 10:22 AM, Pat Ferrel <[email protected]> wrote:
> Are you asking about the EventServer or PredictionServer? The EventServer
> is multi-tenant with access keys, not really pure REST. We (ActionML) did a
> hack for a client to the PredictionServer to allow Actors to respond on the
> same port for several engine queries. We used REST addressing for this,
> which adds yet another id. This makes for one process for the EventServer
> and one for the PredictionServer. Each responding engine was behind an
> Actor, not a new process. So it’s possible but IMO makes the API as a whole
> rather messy. We also had to change the workflow so metadata was read on
> `pio deploy` so one build could then deploy many times with different
> engine.jsons and different PredictionServer endpoints for queries only.
> This comes pretty close to clean multi-tenancy but is not SaaS capable
> without solving SSL and Auth for both services.
>
> The hack was pretty ugly in the code and after doing that I concluded that
> a big chunk needed a rewrite and hence the prototype. It depends on what
> you want, but if you want SaaS I think that means SSL + Auth + multi-tenancy,
> and you also mention minimizing process boundaries. There are rather many
> implications to this.
>
> On Jun 29, 2017, at 9:57 AM, Mars Hall <[email protected]> wrote:
>
> Donald, Pat, great to hear that this is a well-pondered design challenge
> of PIO 😄 The prototype, composable, all-in-one server sounds promising.
>
> I'm wondering if there's a more immediate possibility to address adding
> the `/events` REST API to Engine? Would it make sense to try invoking an
> `EventServiceActor` in the tools.commands.Engine#deploy method? If that
> would be a distasteful hack, just say so. I'm trying to understand
> the possibility of solving this in the current codebase vs a visionary new
> version of PIO.
>
> *Mars
>
> ( <> .. <> )
>
> > On Jun 28, 2017, at 18:01, Pat Ferrel <[email protected]> wrote:
> >
> > Ah, one of my favorite subjects.
> >
> > I’m working on a prototype server that handles online learning as well
> as Lambda style. There is only one server with everything going through
> REST. There are 2 resource types, Engines and Commands. Engines have REST
> APIs with endpoints for Events and Queries. So something like POST
> /engines/resource-id/events would send an event to what is like a PIO app
> and POST /engines/resource-id/queries does the PIO query equivalent. Note
> that this is fully multi-tenant and has only one important id. It’s based
> on akka-http in a fully microservice type architecture. While the Server is
> running you can add completely new Templates for any algorithm, thereby
> adding new endpoints for Events and Queries. Each “tenant” is super
> lightweight since it’s just an Actor not a new JVM. The CLI is actually
> Python that hits the REST API with a Python SDK, and there is a Java SDK
> too. We support SSL and OAuth2 so having those baked into an SDK is really
> important. Though a prototype it can support multi-tenant SaaS.
> >
> > We have a prototype online learner Template which does not save events
> at all, though it ingests events exactly like PIO, in the same format; in fact
> we have the same template for both servers taking identical input. Instead
> of an EventServer it mirrors received events before validation (yes
> we have full event validation that is template specific.) This allows some
> events to affect mutable data in a database and some to just be an
> immutable stream or even be thrown away for Kappa learners. For an online
> learner, each event updates the model, which is stored periodically as a
> watermark. If you want to change algo params you destroy the engine
> instance and replay the mirrored events. For a Lambda learner the Events
> may be stored like PIO.
> >
> > This is very much along the lines of the proposal I put up for future
> PIO but the philosophy internally is so different that I’m now not sure how
> it would fit. I’d love to talk about it sometime and once we do a Lambda
> Template we’ll at least have some nice comparisons to make. We migrated the
> Kappa style Template to it so we have a good idea that it’s not that hard.
> I’d love to donate it to PIO but only if it makes sense.
> >
> >
> > On Jun 28, 2017, at 4:27 PM, Donald Szeto <[email protected]> wrote:
> >
> > Hey Mars,
> >
> > Thanks for the suggestion and I agree with your point on the metadata
> part. Essentially I think the app and channel concepts should instead be
> logically grouped together with events, not metadata.
> >
> > I think in some advanced use cases, event storage should not even be a
> hard requirement as engine templates can source data differently. In the
> long run, it might be cleaner to have event server (and all relevant
> concepts such as its API, access keys, apps, etc) as a separable package,
> that is by default turned on, embedded in the engine server. Advanced users can
> either make it standalone or even turn it off completely.
> >
> > I imagine this kind of refactoring would echo Pat's proposal on making a
> clean and separate engine and metadata management system down the road.
> >
> > Regards,
> > Donald
> >
> > On Wed, Jun 28, 2017 at 3:29 PM Mars Hall <[email protected]> wrote:
> > One of the ongoing challenges we face with PredictionIO is the
> separation of Engine & Eventserver APIs. This separation leads to several
> problems:
> >
> > 1. Deploying a complete PredictionIO app requires multiple processes,
> each with its own network listener
> > 2. Eventserver & Engine must be configured to share exactly the same
> storage backends (same `pio-env.sh`)
> > 3. Confusion between "Eventserver" (an optional REST API) & "event
> storage" (a required database)
> >
> > These challenges are exacerbated by the fact that PredictionIO's docs &
> `pio app` CLI make it appear that sharing an Eventserver between Engines is
> a good idea. I recently filed a JIRA issue about this topic. TL;DR sharing
> an eventserver between engines with different Meta Storage config will
> cause data corruption:
> > https://issues.apache.org/jira/browse/PIO-96
> >
> >
> > I believe a lot of these issues could be alleviated with one change to
> PredictionIO core:
> >
> > By default, expose the Eventserver API from the `pio deploy` Engine
> process, so that it is not necessary to deploy a second Eventserver-only
> process. Separate `pio eventserver` could still be optional if you need the
> separation of concerns for scalability.
> >
> >
> > I'd love to hear what you folks think. I will file a JIRA enhancement
> issue if this seems like an acceptable approach.
> >
> > *Mars Hall
> > Customer Facing Architect
> > Salesforce Platform / Heroku
> > San Francisco, California
> >
> >
>
>
>