I understand from a theoretical perspective that the model itself is not distributed, so it can be used to make predictions for a Vector or an RDD. But speaking in terms of the APIs provided by Spark 2.0.0: when I create a model from large data, the recommended way is to use the ml library's fit. That gives me the option of getting a http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/classification/NaiveBayesModel.html or wrapping it as a http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/PipelineModel.html
Neither of these has any method which accepts a Vector. How do I bridge this gap in the API from my side? Is there something in Spark's API which I have missed? Or do I need to extract the parameters and use another library to make predictions for a single row?

On Thu, Sep 1, 2016 at 6:38 PM, Sean Owen <so...@cloudera.com> wrote:
> How the model is built isn't that related to how it scores things.
> Here we're just talking about scoring. NaiveBayesModel can score a
> Vector, which is not a distributed entity. That's what you want to use.
> You do not want to use a whole distributed operation to score one
> record. This isn't related to .ml vs .mllib APIs.
>
> On Thu, Sep 1, 2016 at 2:01 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
> > I understand your point.
> >
> > Is there something like a bridge? Is it possible to convert the model
> > trained using Dataset<Row> (i.e. the distributed one) to the one which
> > uses Vectors? In Spark 1.6 the mllib package had everything in terms
> > of Vectors, and that should be faster as per my understanding. But in
> > many Spark blogs we saw that Spark is moving towards the ml package
> > and the mllib package will be phased out. So how can someone train
> > using huge data and then use the model on a row-by-row basis?
> >
> > Thanks for your inputs.
> >
> > On Thu, Sep 1, 2016 at 6:15 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> If you're trying to score a single example by way of an RDD or
> >> Dataset, then no, it will never be that fast. It's a whole distributed
> >> operation, and while you might manage low latency for one job at a
> >> time, consider what will happen when hundreds of them are running at
> >> once. It's just huge overkill for scoring a single example (but
> >> pretty fine for higher-latency, high-throughput batch operations).
> >>
> >> However, if you're scoring a Vector locally I can't imagine it's that
> >> slow. It does some linear algebra, but it's not that complicated. Even
> >> something unoptimized should be fast.
> >>
> >> On Thu, Sep 1, 2016 at 1:37 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
> >> > Hi
> >> >
> >> > Currently trying to use NaiveBayes to make predictions, but facing
> >> > the issue that doing the predictions takes on the order of a few
> >> > seconds. I tried with other model examples shipped with Spark, but
> >> > they also ran in a minimum of 500 ms when I used the Scala API.
> >> >
> >> > Has anyone used Spark ML to do predictions for a single row in
> >> > under 20 ms?
> >> >
> >> > I am not doing premature optimization. The use case is that we are
> >> > doing real-time predictions and we need results in 20 ms, maximum
> >> > 30 ms. This is a hard limit for our use case.
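
If it comes down to extracting the parameters, the fitted NaiveBayesModel does expose pi (log class priors) and theta (log feature conditionals), and multinomial prediction is just argmax over k of pi(k) + theta(k) . x, which is the "some linear algebra" Sean mentions. A rough, Spark-free sketch of that arithmetic in plain Scala, with made-up parameter values standing in for ones copied out of a real fitted model:

```scala
// Local multinomial Naive Bayes scoring, assuming pi (log class priors)
// and theta (log feature likelihoods) have already been extracted from a
// fitted model. The concrete values below are made up for illustration.
object LocalNB {
  val pi: Array[Double] = Array(math.log(0.6), math.log(0.4))
  val theta: Array[Array[Double]] = Array(
    Array(math.log(0.7), math.log(0.3)), // log P(feature i | class 0)
    Array(math.log(0.2), math.log(0.8))  // log P(feature i | class 1)
  )

  // argmax over k of pi(k) + theta(k) . x -- plain dot products, no Spark.
  def predict(x: Array[Double]): Int = {
    val scores = pi.indices.map { k =>
      pi(k) + theta(k).zip(x).map { case (t, xi) => t * xi }.sum
    }
    scores.indexOf(scores.max)
  }

  def main(args: Array[String]): Unit = {
    println(predict(Array(5.0, 1.0))) // counts dominated by feature 0 -> class 0
    println(predict(Array(1.0, 5.0))) // counts dominated by feature 1 -> class 1
  }
}
```

Scoring a single row this way is a handful of multiply-adds, so it should sit comfortably under a 20 ms budget; the open question above is only whether Spark's own API offers a supported path to the same thing.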