thanks nick. i'll take a look at oryx and prediction.io. re: private val model in word2vec ;) yes, i couldn't wait, so i just changed it in the word2vec source code. but i'm running into a compilation issue now. hopefully i can fix it soon so i can get things going.
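The in-memory serving approach described in the quoted thread below - hold the factor matrices in memory and score items with a dot product at request time - can be sketched in plain Python. The users, items, and factor values here are invented for illustration; real factors would come from ALS training:

```python
# Tiny in-memory factor "matrices" (hypothetical values); in practice these
# would be the userFeatures/productFeatures learned by ALS.
user_factors = {"u1": [0.9, 0.1], "u2": [0.2, 0.8]}
item_factors = {"item_a": [1.0, 0.0], "item_b": [0.0, 1.0], "item_c": [0.5, 0.5]}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def recommend(user_id, top_n=2):
    # Score every item for this user and return the highest-scoring ones.
    u = user_factors[user_id]
    ranked = sorted(item_factors, key=lambda i: dot(item_factors[i], u), reverse=True)
    return ranked[:top_n]

print(recommend("u1"))  # ['item_a', 'item_c']
```

As the thread notes, at tens of millions of users/items this brute-force scan is too slow for real-time response, which is where LSH-style candidate pruning comes in.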
On Fri, Nov 7, 2014 at 12:52 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> For ALS, if you want real-time recs (and usually this means 10s to a few
> 100s of ms response), then Spark is not the way to go - a serving layer like
> Oryx or prediction.io is what you want.
>
> (At Graphflow we've built our own.)
>
> You hold the factor matrices in memory and do the dot product in real time
> (with optional caching). Again, even for huge models (10s of millions of
> users/items) this can be handled on a single, powerful instance. The issue
> at this scale is winnowing down the search space, using LSH or a similar
> approach, to get to real-time speeds.
>
> For word2vec it's pretty much the same thing, as what you have is very
> similar to one of the ALS factor matrices.
>
> One problem is that you can't access the word2vec vectors, as they are a
> private val. I think this should actually be changed, so that just the word
> vectors could be saved and used in a serving layer.
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks <evan.spa...@gmail.com>
> wrote:
>
>> There are a few examples where this is the case. Let's take ALS, where
>> the result is a MatrixFactorizationModel, which is assumed to be big - the
>> model consists of two matrices, one (users x k) and one (k x products).
>> These are represented as RDDs.
>>
>> You can save these RDDs out to disk by doing something like
>>
>> model.userFeatures.saveAsObjectFile(...) and
>> model.productFeatures.saveAsObjectFile(...)
>>
>> to save out to HDFS or Tachyon or S3.
>>
>> Then, when you want to reload, you'd have to instantiate them into a
>> MatrixFactorizationModel. That class is package-private to MLlib right
>> now, so you'd need to copy the logic over to a new class, but that's the
>> basic idea.
>>
>> That said - using Spark to serve these recommendations on a
>> point-by-point basis might not be optimal.
>> There's some work going on in the AMPLab to address this issue.
>>
>> On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh <duy.huynh....@gmail.com>
>> wrote:
>>
>>> you're right, serialization works.
>>>
>>> what is your suggestion on saving a "distributed" model? so part of the
>>> model is in one cluster, and some other parts of the model are in other
>>> clusters. during runtime, these sub-models run independently in their own
>>> clusters (load, train, save), and at some point during runtime these
>>> sub-models merge into the master model, which also loads, trains, and
>>> saves at the master level.
>>>
>>> much appreciated.
>>>
>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks <evan.spa...@gmail.com>
>>> wrote:
>>>
>>>> There's some work going on to support PMML -
>>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it hasn't yet
>>>> been merged into master.
>>>>
>>>> What are you used to doing in other environments? In R I'm used to
>>>> running save(); same with MATLAB. In Python, either pickling things or
>>>> dumping to JSON seems pretty common (even the scikit-learn docs
>>>> recommend pickling -
>>>> http://scikit-learn.org/stable/modules/model_persistence.html). These
>>>> all seem basically equivalent to Java serialization to me.
>>>>
>>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>>> something) make sense to add?
>>>>
>>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh <duy.huynh....@gmail.com>
>>>> wrote:
>>>>
>>>>> that works. is there a better way in spark? this seems like the most
>>>>> common feature for any machine learning work - to be able to save your
>>>>> model after training it and load it later.
>>>>>
>>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks <evan.spa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Plain old Java serialization is one straightforward approach if
>>>>>> you're in Java/Scala.
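The plain-serialization route discussed above (pickle in Python, save() in R, Java serialization on the JVM) amounts to a simple save/reload round trip. A minimal sketch in Python, where the model dict is a hypothetical stand-in for real learned parameters:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model's parameters.
model = {"weights": [0.1, 0.2, 0.3], "bias": -0.5}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save after training...
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and reload later, possibly in a different process.
with open(path, "rb") as f:
    reloaded = pickle.load(f)

print(reloaded == model)  # True
```

The same shape applies on the JVM with ObjectOutputStream/ObjectInputStream, with the usual caveat that serialized form is tied to the class definitions that wrote it.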
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll <duy.huynh....@gmail.com> wrote:
>>>>>>
>>>>>>> what is the best way to save an mllib model that you just trained
>>>>>>> and reload it in the future? specifically, i'm using the mllib
>>>>>>> word2vec model... thanks.
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
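For the original word2vec question, once the word vectors are accessible, persisting just the vectors and serving similarity lookups from them is straightforward. A minimal sketch in Python - the 2-d vectors here are invented for illustration, while real word2vec vectors are learned and much higher-dimensional:

```python
import json
import math
import os
import tempfile

# Hypothetical 2-d word vectors standing in for trained word2vec output.
vectors = {"king": [0.9, 0.1], "queen": [0.85, 0.2], "apple": [0.1, 0.9]}

# Persist just the vectors (here as JSON; any format works).
path = os.path.join(tempfile.mkdtemp(), "vectors.json")
with open(path, "w") as f:
    json.dump(vectors, f)

# Reload in a serving process.
with open(path) as f:
    reloaded = json.load(f)

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.hypot(*a) * math.hypot(*b))

sim = cosine(reloaded["king"], reloaded["queen"])
```

This is the serving-layer pattern from earlier in the thread: the saved vectors play the same role as an ALS factor matrix, with nearest-word queries scored by cosine similarity instead of a raw dot product.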