subject:"Why are ml models repartition(1)'d in save methods?"

Re: Why are ml models repartition(1)'d in save methods?

2017-01-16 Thread Asher Krim

Cool, thanks! Jira: https://issues.apache.org/jira/browse/SPARK-19247 PR: https://github.com/apache/spark/pull/16607 I think the LDA model has the exact same issues - currently the `topicsMatrix` (which is on the order of numWords*k, 4GB for numWords=3m and k=1000) is saved as a single element in a

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Sean Owen

Yes, certainly debatable for word2vec. You have a good point that this could overrun the 2GB limit if the model is one big datum, for large but not crazy models. This model could probably easily be serialized as individual vectors in this case. It would introduce a backwards-compatibility issue
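Editor's sketch of the "individual vectors" idea mentioned above. It does not use Spark and is not the change made in the linked PR. It only shows, in plain Scala, how one flat array of vector values could be split into one record per word, so that no single record holds the whole model. The `WordVector` type, the `toRows` helper, and the flat back-to-back layout are assumptions made for illustration.

```scala
object PerWordRows {
  // Hypothetical record type: one row per vocabulary word.
  final case class WordVector(word: String, vector: Array[Float])

  // Assumed layout: `flat` stores the vectors back to back, and `wordIndex`
  // maps each word to its 0-based position in that flat array.
  def toRows(wordIndex: Map[String, Int], flat: Array[Float], vectorSize: Int): Seq[WordVector] =
    wordIndex.toSeq.sortBy(_._2).map { case (word, i) =>
      // Copy out the slice of `vectorSize` values that belongs to this word.
      WordVector(word, flat.slice(i * vectorSize, (i + 1) * vectorSize))
    }

  def main(args: Array[String]): Unit = {
    val rows = toRows(Map("a" -> 0, "b" -> 1), Array(1f, 2f, 3f, 4f), vectorSize = 2)
    rows.foreach(r => println(s"${r.word}: ${r.vector.mkString(", ")}"))
  }
}
```

Each record stays small however large the vocabulary is, so a collection of such rows could be written across many partitions rather than as one large datum.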

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim

I guess it depends on the definition of "small". A Word2Vec model with vectorSize=300 and vocabulary=3m takes nearly 4 GB. While it does fit on a single machine (so isn't really "big" data), I don't see the benefit in having the model stored in one file. On the contrary, it seems that we would want
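Editor's sketch of the size estimate quoted above. It assumes each vector component is a 4-byte float and counts only the raw vector values, ignoring vocabulary strings and any serialization overhead. The object and method names are hypothetical. The comparison against `Int.MaxValue` bytes is an assumed stand-in for the 2GB limit mentioned elsewhere in the thread.

```scala
object Word2VecSizeEstimate {
  // Bytes needed for the raw vector values only.
  // Assumption: 4 bytes per value (single-precision float).
  def vectorBytes(vocabSize: Long, vectorSize: Long, bytesPerValue: Long = 4L): Long =
    vocabSize * vectorSize * bytesPerValue

  // Assumption: the "2GB limit" is modeled as the maximum JVM array length in bytes.
  def exceedsTwoGb(bytes: Long): Boolean = bytes > Int.MaxValue

  def main(args: Array[String]): Unit = {
    val bytes = vectorBytes(vocabSize = 3000000L, vectorSize = 300L)
    println(f"${bytes / 1e9}%.1f GB, over the assumed 2GB limit: ${exceedsTwoGb(bytes)}")
  }
}
```

Under these assumptions the vectors alone come to 3.6 billion bytes, which is consistent with "nearly 4 GB" and is well over a 2GB single-datum limit.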

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Nick Pentreath

Yup - it's because almost all model data in Spark ML (model coefficients) is "small", i.e. non-distributed. If you look at ALS you'll see there is no repartitioning, since the factor dataframes can be large. On Fri, 13 Jan 2017 at 19:42, Sean Owen wrote: > You're referring to

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Sean Owen

You're referring to code that serializes models, which are quite small. For example, a PCA model consists of a few principal component vectors. It's a Dataset of just one element being saved here. It's re-using the code path normally used to save big data sets, to output 1 file with 1 thing as

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim

But why is that beneficial? The data is supposedly quite large; distributing it across many partitions/files would seem to make sense. On Fri, Jan 13, 2017 at 12:25 PM, Sean Owen wrote: > That is usually so the result comes out in one file, not partitioned over > n files. >

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Sean Owen

That is usually so the result comes out in one file, not partitioned over n files. On Fri, Jan 13, 2017 at 5:23 PM Asher Krim wrote: > Hi, > > I'm curious why it's common for data to be repartitioned to 1 partition > when saving ml models: > >

Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Asher Krim

Hi, I'm curious why it's common for data to be repartitioned to 1 partition when saving ml models:

    sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)

This shows up in most ml models I've seen (Word2Vec


8 matches
