# re: " I see it as objects you see it as data stores" not really. I see things based on what functionality and purpose it provides. like you mentioned - The way Elasticseach is used in UR is part of the model and where the algorithm write the computation result into and then used as serving. In a way, it's the model. just a more complex model than a simple linear regression function. If we define "Model" as output of the train() function, then UR is storing the model into Elasticsearch - and it is required because UR relies on Elasticsearch computation - meaning it's part of UR's "model".predict()
# re: "In reality the input comes in 2 types, persistent mutable objects and immutable streams of events (that may well be usable as a time window of data, dropping old events)" like you said, basically there are two types of data type 1. mutable object (e.g meta data of a product, user profile, etc) 2. immutable event (e.g. behavior data) However, 1 can be considered as 2 if we treat the "changes" of mutable object as "event" as well - basically this's the current event server design. But i agree some use case may not care about changes of mutable object - for this, we can provide some API/option for people to store mutable objects and always overwrite. or use better storage structure to capture the changes of mutable object. On Fri, Jun 30, 2017 at 5:29 AM, Pat Ferrel <[email protected]> wrote: > Actually I think it’s a great solution. The question about different > storage config (https://issues.apache.org/jira/browse/PIO-96) is because > Elasticsearch performs the last step of the algorithm, it is not just a > store for models, so it’s an integral part of the compute engine, not the > storage. If it looks that way I hardly think it matters in the way implied > (see below where Templates should come with compassable containers). This > is actually the primary difference in the way you and I look at the > problem. I see it as objects you see it as data stores. Let’s add the > question of compute backends and unfortunately users will have to pick the > solution along with the engines they require (TensorFlow anyone?) If PIO is > going to be a viable ML/AI server in the long term it has to be a lot more > flexible, not less so. In the proto server I mention, the Engine decides on > the compute backend and the example Template does not use Spark. > > The prototype server I mentioned actually only handles metadata, installs > engines, and mirrors input. To handle Kappa as well as Lambda algorithms > the Engine must decide what and if it needs to store. 
> Therefore, instead of assuming an EventServer, we have mirroring of un-validated events. This has many benefits. For one thing, we can require validation from the Engine with every event. This is because the single most frequent mistake by users I've dealt with is malformed input. PIO's input scheme is great because it is so flexible, but because of that, validation is nil. I have seen users that had been using a Template for a year without understanding that most of their data was ignored by the Template code (not the UR in this case). I have spent literally thousands of hours helping correct bad input over email, even though the UR has orders of magnitude better docs than any other Template. Yes, it's also a lot more complicated, but anyway, I'm tired of this: we need validation of every input. Then maybe I will only spend 90% of those hours :-P
>
> Anyway, I think the separation of concerns should be: the Server handles metadata, installs engines, and mirrors input. The Template framework provides required APIs that Engines must implement, plus a set of Tools they can use or ignore in order to use whatever they need. If an Engine provides an input method it can validate, and if it is Kappa, learn immediately (update models in real time); if it is Lambda, store the valid data using something like an Event Store. The train method is then optional and, of course, so is query.
>
> BTW, the reason I call it a PredictionServer (in PIO) is because it is not an Engine Server; all it does is provide a query endpoint. This corresponds to only one method of an Engine, and there is no reason to look at a query endpoint any differently than the other public APIs of the Engine.
>
> I guess I look at this in an object-oriented way, not a data-oriented way. This leads to Template code/Engines making more decisions. The Kappa template we have for this proto server never uses Spark; why would it, to implement Kappa online learning?
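The validation point above can be sketched as a small, template-specific validator over PIO-style event JSON. The required fields follow PIO's event format; the set of accepted event names is a hypothetical template configuration, not anything from PIO itself:

```python
# Sketch of per-template input validation over PIO-style event JSON.
# Events the template does not recognize are flagged instead of being
# silently ignored, which is the failure mode described above.

REQUIRED = ("event", "entityType", "entityId")

def validate(event, accepted_events=("buy", "view")):
    """Return a list of problems; an empty list means the event is well formed."""
    problems = [f"missing field: {f}" for f in REQUIRED if f not in event]
    name = event.get("event", "")
    # Reserved events like "$set" pass through; unknown names are flagged.
    if name and not name.startswith("$") and name not in accepted_events:
        problems.append(f"unknown event name: {name!r} (would be silently ignored)")
    if "targetEntityId" in event and "targetEntityType" not in event:
        problems.append("targetEntityId without targetEntityType")
    return problems
```

With validation at the Engine's input method, a misspelled event name is rejected at POST time instead of being discovered a year later at train time.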
> It also does not need an Event Store because it only stores models. This is also fine for Lambda, where an Event Store is required, because the Engine provides the input method too, so it can make the store/no-store decision.
>
> This has other benefits. Treating all input as an immutable stream has some major flaws. Some of the data has to be dropped, since we cannot store it forever; no one can afford that much disk. And some data can never be dropped, because only the aggregate of all object changes makes any sense. In reality the input comes in 2 types: persistent mutable objects and immutable streams of events (which may well be usable as a time window of data, dropping old events). With the above split, the mirror always has all input in case it's needed; the Engine can decide which events operate on mutable objects and store the rest as a stream in the Event Store (with TTL for time windows). Once this is trusted to work correctly, mirroring can be stopped. In fact the mutable objects can affect the model in real time now, even with Lambda Templates like the UR. When an object property changes in today's PIO, we have to wait until train before the model changes, because the Engine does not have an input method. If it did, then input that should affect the model could.
>
> This solves all my pet peeves, internal API-wise, and allows one implementation of a SaaS-capable, multi-tenant, secure Server. And here multi-tenancy is super lightweight. Since most users have only one Template, they may have to install supporting compute engines or stores. This is a one-time issue for them, and Templates should come with containers and scripts to compose them. We're already doing this with PIO; a fully clustered install takes an hour. Admin of such a monster is another issue that is not necessarily better, or even good, in this model, but that's a subject for another day.
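The split described above (mirror everything, route object changes to a mutable store, keep the rest as a TTL-trimmed stream) can be sketched roughly as follows. This is an illustrative toy, not the prototype's actual code; it uses PIO's `$set` reserved event to stand for "operates on a mutable object":

```python
import time

# Toy sketch of an Engine-side input method: every raw event is mirrored
# first, then object-changing events update mutable state and everything
# else is appended to an immutable stream trimmed by a time window (TTL).

class EngineInput:
    def __init__(self, ttl_seconds=3600):
        self.mirror = []          # raw, un-validated input, kept verbatim
        self.objects = {}         # mutable objects (latest properties win)
        self.stream = []          # immutable (timestamp, event) pairs
        self.ttl = ttl_seconds

    def receive(self, event, now=None):
        now = time.time() if now is None else now
        self.mirror.append(event)                      # always mirrored
        if event["event"] == "$set":                   # mutable-object change
            key = (event["entityType"], event["entityId"])
            self.objects.setdefault(key, {}).update(event.get("properties", {}))
        else:                                          # immutable usage event
            self.stream.append((now, event))
        # drop stream events older than the window
        cutoff = now - self.ttl
        self.stream = [(t, e) for t, e in self.stream if t >= cutoff]
```

Once the routing is trusted, the mirror can be switched off, and property changes reach the model without waiting for the next train.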
> On Jun 30, 2017, at 1:40 AM, Kenneth Chan <[email protected]> wrote:
>
> I agree that there is confusion regarding the event server vs. event storage, and the unclear usage definition of the types of data storage (e.g. metadata vs. model), but I'm not sure if bundling the Event Server with the Engine Server (or, as Pat calls it, the PredictionServer) is a good solution.
>
> Currently PIO has 3 "types" of storage:
> - METADATA: stores PIO's administrative data ("Apps", etc.)
> - EVENTDATA: stores the pure events
> - MODELDATA: stores the model
>
> 1. One confusion is that when the Universal Recommender is used, Elasticsearch is required in order to serve the predicted results. Is this type of storage considered "MODELDATA" or "METADATA", or should we introduce a new type of storage for the "serving" purpose (which can be engine-specific)?
>
> 2. A question regarding the problem described in ticket https://issues.apache.org/jira/browse/PIO-96:
>
> ```
> Problems emerge when a developer tries running multiple engines with
> different storage configs on the same underlying database, such as:
>
> - a Classifier with *Postgres* meta, event, & model storage, and
> - the Universal Recommender with *Elasticsearch* meta plus *Postgres* event & model storage.
> ```
>
> Why would a user want to use a different storage config for a different engine? Can the classifier match the same configuration as the Universal Recommender? I thought the storage configuration was tied to PIO as a whole rather than per engine.
>
> Kenneth
>
> On Thu, Jun 29, 2017 at 10:22 AM, Pat Ferrel <[email protected]> wrote:
>
>> Are you asking about the EventServer or the PredictionServer? The EventServer is multi-tenant with access keys, not really pure REST. We (ActionML) did a hack for a client on the PredictionServer to allow Actors to respond on the same port for several engine queries. We used REST addressing for this, which adds yet another id.
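Kenneth's three storage types and the PIO-96 clash quoted above can be pictured with plain dicts standing in for each engine's repository-to-source mapping in `pio-env.sh` (illustrative only, not real PIO config handling):

```python
# Sketch of the PIO-96 situation: two engines on the same deployment,
# each mapping PIO's three storage repositories to a backing source.
# Dicts stand in for pio-env.sh settings; not actual PIO code.

classifier = {"METADATA": "PGSQL", "EVENTDATA": "PGSQL", "MODELDATA": "PGSQL"}
ur         = {"METADATA": "ELASTICSEARCH", "EVENTDATA": "PGSQL", "MODELDATA": "PGSQL"}

def conflicts(a, b):
    """Repositories that the two engine configs back with different sources."""
    return [repo for repo in a if a[repo] != b.get(repo)]
```

Here `conflicts(classifier, ur)` reports that the two engines disagree on METADATA, which is exactly the shared-meta-storage corruption scenario the ticket describes.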
>> This makes for one process for the EventServer and one for the PredictionServer. Each responding engine was behind an Actor, not a new process. So it's possible, but IMO it makes the API as a whole rather messy. We also had to change the workflow so metadata was read on `pio deploy`, so one build could then deploy many times with different engine.jsons and different PredictionServer endpoints for queries only. This comes pretty close to clean multi-tenancy but is not SaaS capable without solving SSL and Auth for both services.
>>
>> The hack was pretty ugly in the code, and after doing it I concluded that a big chunk needed a rewrite, hence the prototype. It depends on what you want, but if you want SaaS I think that means SSL + Auth + multi-tenancy, and you also mention minimizing process boundaries. There are rather many implications to this.
>>
>> On Jun 29, 2017, at 9:57 AM, Mars Hall <[email protected]> wrote:
>>
>> Donald, Pat, great to hear that this is a well-pondered design challenge of PIO 😄 The prototype, composable, all-in-one server sounds promising.
>>
>> I'm wondering if there's a more immediate possibility to address adding the `/events` REST API to the Engine. Would it make sense to try invoking an `EventServiceActor` in the tools.commands.Engine#deploy method? If that would be a distasteful hack, just say so. I'm trying to understand the possibility of solving this in the current codebase vs. a visionary new version of PIO.
>>
>> *Mars
>>
>> ( <> .. <> )
>>
>> > On Jun 28, 2017, at 18:01, Pat Ferrel <[email protected]> wrote:
>> >
>> > Ah, one of my favorite subjects.
>> >
>> > I'm working on a prototype server that handles online learning as well as Lambda style. There is only one server, with everything going through REST. There are 2 resource types, Engines and Commands. Engines have REST APIs with endpoints for Events and Queries.
>> > So something like POST /engines/resource-id/events would send an event to what is like a PIO app, and POST /engines/resource-id/queries does the PIO query equivalent. Note that this is fully multi-tenant and has only one important id. It's based on akka-http in a fully microservice-style architecture. While the Server is running you can add completely new Templates for any algorithm, thereby adding new endpoints for Events and Queries. Each "tenant" is super lightweight since it's just an Actor, not a new JVM. The CLI is actually Python that hits the REST API with a Python SDK, and there is a Java SDK too. We support SSL and OAuth2, so having those baked into an SDK is really important. Though a prototype, it can support multi-tenant SaaS.
>> >
>> > We have a prototype online learner Template which does not save events at all, though it ingests events exactly like PIO, in the same format; in fact we have the same template for both servers taking identical input. Instead of an EventServer it mirrors received events before validation (yes, we have full event validation that is template specific). This allows some events to affect mutable data in a database, some to just be an immutable stream, and some to even be thrown away for Kappa learners. For an online learner, each event updates the model, which is stored periodically as a watermark. If you want to change algo params you destroy the engine instance and replay the mirrored events. For a Lambda learner the events may be stored like PIO.
>> >
>> > This is very much along the lines of the proposal I put up for future PIO, but the philosophy internally is so different that I'm now not sure how it would fit. I'd love to talk about it sometime, and once we do a Lambda Template we'll at least have some nice comparisons to make. We migrated the Kappa-style Template to it, so we have a good idea that it's not that hard.
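The two-endpoint scheme described above can be sketched as a toy in-process router: one server, engines addressed by a single resource id, each engine a lightweight in-process handler (the real prototype uses an akka-http Actor per engine). All names here are illustrative:

```python
# Toy sketch of the prototype's REST scheme: POST /engines/{id}/events and
# POST /engines/{id}/queries, with each "tenant" a lightweight in-process
# object rather than a separate JVM. Not the actual prototype's code.

class Engine:
    def __init__(self):
        self.events = []

    def input(self, event):                    # POST /engines/{id}/events
        self.events.append(event)
        return {"status": "accepted"}

    def query(self, q):                        # POST /engines/{id}/queries
        return {"result": len(self.events)}    # placeholder "model"

class Server:
    def __init__(self):
        self.engines = {}                      # resource id -> Engine ("tenant")

    def add_engine(self, resource_id):         # add a new Template at runtime
        self.engines[resource_id] = Engine()

    def post(self, path, body):
        # expects paths of the form /engines/{resource-id}/{endpoint}
        _, resource_id, endpoint = path.strip("/").split("/")
        engine = self.engines[resource_id]
        if endpoint == "events":
            return engine.input(body)
        if endpoint == "queries":
            return engine.query(body)
        raise ValueError(f"unknown endpoint: {endpoint}")
```

Adding a tenant is just `add_engine`, which is the sense in which multi-tenancy here is "super lightweight."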
>> > I'd love to donate it to PIO, but only if it makes sense.
>> >
>> > On Jun 28, 2017, at 4:27 PM, Donald Szeto <[email protected]> wrote:
>> >
>> > Hey Mars,
>> >
>> > Thanks for the suggestion, and I agree with your point on the metadata part. Essentially I think the app and channel concepts should instead be logically grouped together with events, not metadata.
>> >
>> > I think in some advanced use cases, event storage should not even be a hard requirement, as engine templates can source data differently. In the long run, it might be cleaner to have the event server (and all relevant concepts such as its API, access keys, apps, etc.) as a separable package that is turned on by default and embedded in the engine server. Advanced users could either make it standalone or even turn it off completely.
>> >
>> > I imagine this kind of refactoring would echo Pat's proposal on making a clean and separate engine and metadata management system down the road.
>> >
>> > Regards,
>> > Donald
>> >
>> > On Wed, Jun 28, 2017 at 3:29 PM Mars Hall <[email protected]> wrote:
>> >
>> > One of the ongoing challenges we face with PredictionIO is the separation of the Engine & Eventserver APIs. This separation leads to several problems:
>> >
>> > 1. Deploying a complete PredictionIO app requires multiple processes, each with its own network listener
>> > 2. Eventserver & Engine must be configured to share exactly the same storage backends (same `pio-env.sh`)
>> > 3. Confusion between "Eventserver" (an optional REST API) & "event storage" (a required database)
>> >
>> > These challenges are exacerbated by the fact that PredictionIO's docs & `pio app` CLI make it appear that sharing an Eventserver between Engines is a good idea. I recently filed a JIRA issue about this topic.
>> > TL;DR: sharing an eventserver between engines with different Meta Storage config will cause data corruption:
>> > https://issues.apache.org/jira/browse/PIO-96
>> >
>> > I believe a lot of these issues could be alleviated with one change to PredictionIO core:
>> >
>> > By default, expose the Eventserver API from the `pio deploy` Engine process, so that it is not necessary to deploy a second Eventserver-only process. A separate `pio eventserver` could still be optional if you need the separation of concerns for scalability.
>> >
>> > I'd love to hear what you folks think. I will file a JIRA enhancement issue if this seems like an acceptable approach.
>> >
>> > *Mars Hall
>> > Customer Facing Architect
>> > Salesforce Platform / Heroku
>> > San Francisco, California
