Re: what is the best way to implement mini batches?

2014-12-11 Thread Duy Huynh
the dataset i'm working on has about 100,000 records, and the batches we're
training on have a size of around 10.  can you call repartition(10000) to
get 10,000 partitions?
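for reference, here is a minimal plain-scala sketch of the per-partition
batching matei describes below.  the iterator stands in for one RDD
partition; with spark you would wrap the body in rdd.mapPartitions.  the
batch size and sample data are made up for illustration.

```scala
object MiniBatchSketch {
  // grouped() yields non-overlapping mini-batches; sliding() yields
  // overlapping windows.  neither crosses a partition boundary, which is
  // the caveat matei mentions below.
  def miniBatches[A](partition: Iterator[A], batchSize: Int): Iterator[Seq[A]] =
    partition.grouped(batchSize).map(_.toSeq)
}
```

with spark this becomes roughly rdd.mapPartitions(it =>
MiniBatchSketch.miniBatches(it, 10)); the repartition(10000) route would
instead map each record to a batch id first and shuffle on that.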

On Thu, Dec 11, 2014 at 2:36 PM, Matei Zaharia 
wrote:

> You can just do mapPartitions on the whole RDD, and then called sliding()
> on the iterator in each one to get a sliding window. One problem is that
> you will not be able to slide "forward" into the next partition at
> partition boundaries. If this matters to you, you need to do something more
> complicated to get those, such as the repartition that you said (where you
> map each record to the partition it should be in).
>
> Matei
>
> > On Dec 11, 2014, at 10:16 AM, ll  wrote:
> >
> > any advice/comment on this would be much appreciated.
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/what-is-the-best-way-to-implement-mini-batches-tp20264p20635.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>


Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
just to give a reference point, with the same algorithm running on the
mnist dataset:

1.  python implementation:  ~10 milliseconds per iteration (can be faster
if i switch to gpu)

2.  local version (scala + breeze):  ~2 seconds per iteration

3.  distributed version (spark + scala + breeze):  15 seconds per iteration

i love spark and really enjoy writing scala code.  but this huge difference
in performance makes it really hard to do any kind of machine learning work.




On Thu, Dec 11, 2014 at 2:18 PM, Duy Huynh  wrote:

> both.
>
> first, the distributed version is so much slower than python.  i tried a
> few things like broadcasting variables, replacing Seq with Array, and a few
> other little things.  they helped improve performance, but it's still
> slower than the python code.
>
> so, i wrote a local version that's pretty much just running a bunch of
> breeze/blas operations.  i guess that's purely scala (no spark).  this
> local version is faster than the distributed version but still much slower
> than the python code.
>
>
>
>
>
>
>
> On Thu, Dec 11, 2014 at 2:09 PM, Natu Lauchande 
> wrote:
>
>> Are you using Scala in a distributed environment or in standalone mode?
>>
>> Natu
>>
>> On Thu, Dec 11, 2014 at 8:23 PM, ll  wrote:
>>
>>> hi.. i'm converting some of my machine learning python code into scala +
>>> spark.  i haven't been able to run it on a large dataset yet, but on small
>>> datasets (like http://yann.lecun.com/exdb/mnist/), my spark + scala code
>>> is much slower than my python code (5 to 10 times slower than python).
>>>
>>> i already tried everything to improve my spark + scala code like
>>> broadcasting variables, caching the RDD, replacing all my matrix/vector
>>> operations with breeze/blas, etc.  i saw some improvements, but it's
>>> still a
>>> lot slower than my python code.
>>>
>>> why is that?
>>>
>>> how do you improve your spark + scala performance today?
>>>
>>> or is spark + scala just not the right tool for small to medium datasets?
>>>
>>> when would you use spark + scala vs. python?
>>>
>>> thanks!
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/why-is-spark-scala-code-so-slow-compared-to-python-tp20636.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>>
>>
>


Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
both.

first, the distributed version is so much slower than python.  i tried a
few things like broadcasting variables, replacing Seq with Array, and a few
other little things.  they helped improve performance, but it's still
slower than the python code.

so, i wrote a local version that's pretty much just running a bunch of
breeze/blas operations.  i guess that's purely scala (no spark).  this
local version is faster than the distributed version but still much slower
than the python code.







On Thu, Dec 11, 2014 at 2:09 PM, Natu Lauchande 
wrote:

> Are you using Scala in a distributed environment or in standalone mode?
>
> Natu
>
> On Thu, Dec 11, 2014 at 8:23 PM, ll  wrote:
>
>> hi.. i'm converting some of my machine learning python code into scala +
>> spark.  i haven't been able to run it on a large dataset yet, but on small
>> datasets (like http://yann.lecun.com/exdb/mnist/), my spark + scala code
>> is much slower than my python code (5 to 10 times slower than python).
>>
>> i already tried everything to improve my spark + scala code like
>> broadcasting variables, caching the RDD, replacing all my matrix/vector
>> operations with breeze/blas, etc.  i saw some improvements, but it's
>> still a
>> lot slower than my python code.
>>
>> why is that?
>>
>> how do you improve your spark + scala performance today?
>>
>> or is spark + scala just not the right tool for small to medium datasets?
>>
>> when would you use spark + scala vs. python?
>>
>> thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/why-is-spark-scala-code-so-slow-compared-to-python-tp20636.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>


Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
thanks nick.  i'll take a look at oryx and prediction.io.

re: private val model in word2vec ;) yes, i couldn't wait so i just changed
it in the word2vec source code.  but i'm running into a compilation issue
now.  hopefully i can fix it soon to get things going.

On Fri, Nov 7, 2014 at 12:52 PM, Nick Pentreath 
wrote:

> For ALS if you want real time recs (and usually this is order 10s to a few
> 100s ms response), then Spark is not the way to go - a serving layer like
> Oryx, or prediction.io is what you want.
>
> (At graphflow we've built our own).
>
> You hold the factor matrices in memory and do the dot product in real time
> (with optional caching). Again, even for huge models (10s of millions
> users/items) this can be handled on a single, powerful instance. The issue
> at this scale is winnowing down the search space using LSH or similar
> approach to get to real time speeds.
>
> For word2vec it's pretty much the same thing as what you have is very
> similar to one of the ALS factor matrices.
>
> One problem is you can't access the word2vec vectors as they are a private
> val. I think this should be changed actually, so that just the word vectors
> could be saved and used in a serving layer.
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Fri, Nov 7, 2014 at 7:37 PM, Evan R. Sparks 
> wrote:
>
>> There are a few examples where this is the case. Let's take ALS, where
>> the result is a MatrixFactorizationModel, which is assumed to be big - the
>> model consists of two matrices, one (users x k) and one (k x products).
>> These are represented as RDDs.
>>
>> You can save these RDDs out to disk by doing something like
>>
>> model.userFeatures.saveAsObjectFile(...) and
>> model.productFeatures.saveAsObjectFile(...)
>>
>> to save out to HDFS or Tachyon or S3.
>>
>> Then, when you want to reload you'd have to instantiate them into a class
>> of MatrixFactorizationModel. That class is package private to MLlib right
>> now, so you'd need to copy the logic over to a new class, but that's the
>> basic idea.
>>
>> That said - using spark to serve these recommendations on a
>> point-by-point basis might not be optimal. There's some work going on in
>> the AMPLab to address this issue.
>>
>> On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh 
>> wrote:
>>
>>> you're right, serialization works.
>>>
>>> what is your suggestion on saving a "distributed" model?  so part of the
>>> model is in one cluster, and some other parts of the model are in other
>>> clusters.  during runtime, these sub-models run independently in their own
>>> clusters (load, train, save).  and at some point during run time these
>>> sub-models merge into the master model, which also loads, trains, and saves
>>> at the master level.
>>>
>>> much appreciated.
>>>
>>>
>>>
>>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
>>> wrote:
>>>
>>>> There's some work going on to support PMML -
>>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>>> been merged into master.
>>>>
>>>> What are you used to doing in other environments? In R I'm used to
>>>> running save(), same with matlab. In python either pickling things or
>>>> dumping to json seems pretty common. (even the scikit-learn docs recommend
>>>> pickling -
>>>> http://scikit-learn.org/stable/modules/model_persistence.html). These
>>>> all seem basically equivalent to java serialization to me.
>>>>
>>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>>> something) make sense to add?
>>>>
>>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
>>>> wrote:
>>>>
>>>>> that works.  is there a better way in spark?  this seems like the most
>>>>> common feature for any machine learning work - to be able to save your
>>>>> model after training it and load it later.
>>>>>
>>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
>>>>> wrote:
>>>>>
>>>>>> Plain old java serialization is one straightforward approach if
>>>>>> you're in java/scala.
>>>>>>
>>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>>>>>>
>>>>>>> what is the best way to save an mllib model that you just trained
>>>>>>> and reload
>>>>>>> it in the future?  specifically, i'm using the mllib word2vec
>>>>>>> model...
>>>>>>> thanks.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
yep, but that's only if they are already represented as RDDs, which makes
saving and loading much more convenient.

my question is for the use case where they are not represented as RDDs yet.

do you think it makes sense to convert them into RDDs just for the
convenience of saving and loading them in a distributed way?

On Fri, Nov 7, 2014 at 12:36 PM, Evan R. Sparks 
wrote:

> There are a few examples where this is the case. Let's take ALS, where the
> result is a MatrixFactorizationModel, which is assumed to be big - the
> model consists of two matrices, one (users x k) and one (k x products).
> These are represented as RDDs.
>
> You can save these RDDs out to disk by doing something like
>
> model.userFeatures.saveAsObjectFile(...) and
> model.productFeatures.saveAsObjectFile(...)
>
> to save out to HDFS or Tachyon or S3.
>
> Then, when you want to reload you'd have to instantiate them into a class
> of MatrixFactorizationModel. That class is package private to MLlib right
> now, so you'd need to copy the logic over to a new class, but that's the
> basic idea.
>
> That said - using spark to serve these recommendations on a point-by-point
> basis might not be optimal. There's some work going on in the AMPLab to
> address this issue.
>
> On Fri, Nov 7, 2014 at 7:44 AM, Duy Huynh  wrote:
>
>> you're right, serialization works.
>>
>> what is your suggestion on saving a "distributed" model?  so part of the
>> model is in one cluster, and some other parts of the model are in other
>> clusters.  during runtime, these sub-models run independently in their own
>> clusters (load, train, save).  and at some point during run time these
>> sub-models merge into the master model, which also loads, trains, and saves
>> at the master level.
>>
>> much appreciated.
>>
>>
>>
>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
>> wrote:
>>
>>> There's some work going on to support PMML -
>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>> been merged into master.
>>>
>>> What are you used to doing in other environments? In R I'm used to
>>> running save(), same with matlab. In python either pickling things or
>>> dumping to json seems pretty common. (even the scikit-learn docs recommend
>>> pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
>>> These all seem basically equivalent to java serialization to me.
>>>
>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>> something) make sense to add?
>>>
>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
>>> wrote:
>>>
>>>> that works.  is there a better way in spark?  this seems like the most
>>>> common feature for any machine learning work - to be able to save your
>>>> model after training it and load it later.
>>>>
>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
>>>> wrote:
>>>>
>>>>> Plain old java serialization is one straightforward approach if you're
>>>>> in java/scala.
>>>>>
>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>>>>>
>>>>>> what is the best way to save an mllib model that you just trained and
>>>>>> reload
>>>>>> it in the future?  specifically, i'm using the mllib word2vec model...
>>>>>> thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
hi nick.. sorry about the confusion.  originally i had a question
specifically about word2vec, but my follow-up question on distributed models
is a more general question about saving different types of models.

on the distributed model, i was hoping to implement model parallelism, so
that different workers can work on different parts of the model and then
merge the results at the end into a single master model.



On Fri, Nov 7, 2014 at 12:20 PM, Nick Pentreath 
wrote:

> Currently I see the word2vec model is collected onto the master, so the
> model itself is not distributed.
>
> I guess the question is why do you need a distributed model? Is the vocab
> size so large that it's necessary? For model serving in general, unless the
> model is truly massive (ie cannot fit into memory on a modern high end box
> with 64, or 128GB ram) then single instance is way faster and simpler
> (using a cluster of machines is more for load balancing / fault tolerance).
>
> What is your use case for model serving?
>
> —
> Sent from Mailbox <https://www.dropbox.com/mailbox>
>
>
> On Fri, Nov 7, 2014 at 5:47 PM, Duy Huynh  wrote:
>
>> you're right, serialization works.
>>
>> what is your suggestion on saving a "distributed" model?  so part of the
>> model is in one cluster, and some other parts of the model are in other
>> clusters.  during runtime, these sub-models run independently in their own
>> clusters (load, train, save).  and at some point during run time these
>> sub-models merge into the master model, which also loads, trains, and saves
>> at the master level.
>>
>> much appreciated.
>>
>>
>>
>> On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
>> wrote:
>>
>>> There's some work going on to support PMML -
>>> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet
>>> been merged into master.
>>>
>>> What are you used to doing in other environments? In R I'm used to
>>> running save(), same with matlab. In python either pickling things or
>>> dumping to json seems pretty common. (even the scikit-learn docs recommend
>>> pickling - http://scikit-learn.org/stable/modules/model_persistence.html).
>>> These all seem basically equivalent to java serialization to me.
>>>
>>> Would some helper functions (in, say, mllib.util.modelpersistence or
>>> something) make sense to add?
>>>
>>> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
>>> wrote:
>>>
>>>> that works.  is there a better way in spark?  this seems like the most
>>>> common feature for any machine learning work - to be able to save your
>>>> model after training it and load it later.
>>>>
>>>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
>>>> wrote:
>>>>
>>>>> Plain old java serialization is one straightforward approach if you're
>>>>> in java/scala.
>>>>>
>>>>> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>>>>>
>>>>>> what is the best way to save an mllib model that you just trained and
>>>>>> reload
>>>>>> it in the future?  specifically, i'm using the mllib word2vec model...
>>>>>> thanks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: sparse x sparse matrix multiplication

2014-11-07 Thread Duy Huynh
thanks reza.  i'm not familiar with "block matrix multiplication", but is it
a good fit for very large but extremely sparse matrices?

if not, what is your recommendation for implementing matrix multiplication
in spark on very large but extremely sparse matrices?
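in the meantime, here is a plain-scala sketch of the recipe xiangrui gives
below: multiply sparse pieces locally, joining entries of A on A's column
index with entries of B on B's row index, then summing the products per
(i, j) cell.  this is the same shape as the RDD join + aggregateByKey
version, just with a Map standing in for the distributed collection.

```scala
object SparseMultiply {
  // a sparse matrix as (row, col) -> value; absent cells are zero
  type Sparse = Map[(Int, Int), Double]

  def multiply(a: Sparse, b: Sparse): Sparse = {
    // index B's entries by row, the "join key" against A's column
    val bByRow = b.toSeq.groupBy { case ((row, _), _) => row }
    a.toSeq
      .flatMap { case ((i, k), av) =>
        // join: A(i, k) pairs with every B(k, j)
        bByRow.getOrElse(k, Seq.empty).map { case ((_, j), bv) => ((i, j), av * bv) }
      }
      .groupBy(_._1)                                   // aggregateByKey analogue
      .map { case (ij, prods) => ij -> prods.map(_._2).sum }
  }
}
```

in spark you would hold blocks (or entries) in an RDD keyed the same way
and replace the groupBy calls with join and aggregateByKey.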




On Thu, Nov 6, 2014 at 5:50 PM, Reza Zadeh  wrote:

> See this thread for examples of sparse matrix x sparse matrix:
> https://groups.google.com/forum/#!topic/spark-users/CGfEafqiTsA
>
> We thought about providing matrix multiplies on CoordinateMatrix, however,
> the matrices have to be very dense for the overhead of having many little
> (i, j, value) objects to be worth it. For this reason, we are focused on
> doing block matrix multiplication first. The goal is version 1.3.
>
> Best,
> Reza
>
> On Wed, Nov 5, 2014 at 11:48 PM, Wei Tan  wrote:
>
>> I think Xiangrui's ALS code implement certain aspect of it. You may want
>> to check it out.
>> Best regards,
>> Wei
>>
>> -
>> Wei Tan, PhD
>> Research Staff Member
>> IBM T. J. Watson Research Center
>>
>>
>>
>> From: Xiangrui Meng 
>> To: Duy Huynh 
>> Cc: user 
>> Date: 11/05/2014 01:13 PM
>> Subject: Re: sparse x sparse matrix multiplication
>> --
>>
>>
>>
>> You can use breeze for local sparse-sparse matrix multiplication and
>> then define an RDD of sub-matrices
>>
>> RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix)
>>
>> and then use join and aggregateByKey to implement this feature, which
>> is the same as in MapReduce.
>>
>> -Xiangrui
>>
>>
>>
>>
>


Re: word2vec: how to save an mllib model and reload it?

2014-11-07 Thread Duy Huynh
you're right, serialization works.

what is your suggestion on saving a "distributed" model?  so part of the
model is in one cluster, and some other parts of the model are in other
clusters.  during runtime, these sub-models run independently in their own
clusters (load, train, save).  and at some point during run time these
sub-models merge into the master model, which also loads, trains, and saves
at the master level.

much appreciated.



On Fri, Nov 7, 2014 at 2:53 AM, Evan R. Sparks 
wrote:

> There's some work going on to support PMML -
> https://issues.apache.org/jira/browse/SPARK-1406 - but it's not yet been
> merged into master.
>
> What are you used to doing in other environments? In R I'm used to running
> save(), same with matlab. In python either pickling things or dumping to
> json seems pretty common. (even the scikit-learn docs recommend pickling -
> http://scikit-learn.org/stable/modules/model_persistence.html). These all
> seem basically equivalent to java serialization to me.
>
> Would some helper functions (in, say, mllib.util.modelpersistence or
> something) make sense to add?
>
> On Thu, Nov 6, 2014 at 11:36 PM, Duy Huynh 
> wrote:
>
>> that works.  is there a better way in spark?  this seems like the most
>> common feature for any machine learning work - to be able to save your
>> model after training it and load it later.
>>
>> On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
>> wrote:
>>
>>> Plain old java serialization is one straightforward approach if you're
>>> in java/scala.
>>>
>>> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>>>
>>>> what is the best way to save an mllib model that you just trained and
>>>> reload
>>>> it in the future?  specifically, i'm using the mllib word2vec model...
>>>> thanks.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>>
>>>>
>>>
>>
>


Re: word2vec: how to save an mllib model and reload it?

2014-11-06 Thread Duy Huynh
that works.  is there a better way in spark?  this seems like the most
common feature for any machine learning work - to be able to save your
model after training it and load it later.
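for the archives, a small sketch of the plain java serialization evan
suggests below.  WordVectors here is a stand-in for whatever your trained
model holds — it is not an mllib class — but any Serializable model saves
and reloads the same way.

```scala
import java.io._

// stand-in model class, not part of mllib; case classes are Serializable
case class WordVectors(vectors: Map[String, Array[Double]])

object ModelIO {
  def save(model: WordVectors, path: String): Unit = {
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(model) finally out.close()
  }

  def load(path: String): WordVectors = {
    val in = new ObjectInputStream(new FileInputStream(path))
    try in.readObject().asInstanceOf[WordVectors] finally in.close()
  }
}
```

this writes to the local filesystem; for HDFS/S3 you would swap the streams
for ones from the hadoop FileSystem API, or use saveAsObjectFile on an RDD.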

On Fri, Nov 7, 2014 at 2:30 AM, Evan R. Sparks 
wrote:

> Plain old java serialization is one straightforward approach if you're in
> java/scala.
>
> On Thu, Nov 6, 2014 at 11:26 PM, ll  wrote:
>
>> what is the best way to save an mllib model that you just trained and
>> reload
>> it in the future?  specifically, i'm using the mllib word2vec model...
>> thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-how-to-save-an-mllib-model-and-reload-it-tp18329.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>


Re: sparse x sparse matrix multiplication

2014-11-05 Thread Duy Huynh
in case this won't be available anytime soon, what would be a good way to
implement this multiplication feature in spark?

On Wed, Nov 5, 2014 at 4:59 AM, Duy Huynh  wrote:

> distributed.  something like CoordinateMatrix.multiply(CoordinateMatrix).
>
>
> thanks xiangrui!
>
> On Wed, Nov 5, 2014 at 4:24 AM, Xiangrui Meng  wrote:
>
>> local matrix-matrix multiplication or distributed?
>>
>> On Tue, Nov 4, 2014 at 11:58 PM, ll  wrote:
>> > what is the best way to implement a sparse x sparse matrix
>> multiplication
>> > with spark?
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/sparse-x-sparse-matrix-multiplication-tp18163.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> >
>>
>
>


Re: Matrix multiplication in spark

2014-11-05 Thread Duy Huynh
ok great.  when will this be ready?

On Wed, Nov 5, 2014 at 4:27 AM, Xiangrui Meng  wrote:

> We are working on distributed block matrices. The main JIRA is at:
>
> https://issues.apache.org/jira/browse/SPARK-3434
>
> The goal is to support basic distributed linear algebra, (dense first
> and then sparse).
>
> -Xiangrui
>
> On Wed, Nov 5, 2014 at 12:23 AM, ll  wrote:
> > @sowen.. i am looking for distributed operations, especially very large
> > sparse matrix x sparse matrix multiplication.  what is the best way to
> > implement this in spark?
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Matrix-multiplication-in-spark-tp12562p18164.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> >
>


Re: sparse x sparse matrix multiplication

2014-11-05 Thread Duy Huynh
distributed.  something like CoordinateMatrix.multiply(CoordinateMatrix).

thanks xiangrui!

On Wed, Nov 5, 2014 at 4:24 AM, Xiangrui Meng  wrote:

> local matrix-matrix multiplication or distributed?
>
> On Tue, Nov 4, 2014 at 11:58 PM, ll  wrote:
> > what is the best way to implement a sparse x sparse matrix multiplication
> > with spark?
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/sparse-x-sparse-matrix-multiplication-tp18163.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> >
>


Re: object in an rdd: serializable?

2014-10-17 Thread Duy Huynh
interesting.  why does case class work for this?  thanks boromir!
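a quick sketch of why the case class fix works: case classes mix in
Serializable automatically, so their instances survive the java
serialization spark uses to ship RDD elements, while a plain class needs
`extends Serializable` added by hand.  the round-trip helper below is just
for illustration, not spark code.

```scala
import java.io._

class PlainHelloWorld(val count: Int)   // no Serializable -> fails to ship
case class HelloWorld(count: Int)       // case classes are Serializable for free

object SerializationCheck {
  // serialize to bytes and back, the way spark ships RDD elements
  def roundTrip[A](a: A): A = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(a)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    in.readObject().asInstanceOf[A]
  }
}
```

roundTrip(HelloWorld(1)) succeeds, while roundTrip(new PlainHelloWorld(1))
throws NotSerializableException — the same error spark surfaces.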

On Thu, Oct 16, 2014 at 10:41 PM, Boromir Widas  wrote:

> making it a case class should work.
>
> On Thu, Oct 16, 2014 at 8:30 PM, ll  wrote:
>
>> i got an exception complaining about serializable.  the sample code is
>> below...
>>
>> class HelloWorld(val count: Int) {
>>   ...
>>   ...
>> }
>>
>> object Test extends App {
>>   ...
>>   val data = sc.parallelize(List(new HelloWorld(1), new HelloWorld(2)))
>>   ...
>> }
>>
>> what is the best way to serialize HelloWorld so that it can be contained
>> in
>> an RDD?
>>
>> thanks!
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/object-in-an-rdd-serializable-tp16638.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>


Re: scala: java.net.BindException?

2014-10-16 Thread Duy Huynh
thanks marcelo.  i only instantiated sparkcontext once, at the beginning,
in this code.  the exception was thrown right at the beginning.

i also tried to run other programs, which worked fine previously, but now
also got the same error.

it looks like something put a "global block" on creating a sparkcontext,
preventing any program from creating one.



On Oct 16, 2014 6:26 PM, "Marcelo Vanzin"  wrote:

> This error is not fatal, since Spark will retry on a different port..
> but this might be a problem, for different reasons, if somehow your
> code is trying to instantiate multiple SparkContexts.
>
> I assume "nn.SimpleNeuralNetwork" is part of your application, and
> since it seems to be instantiating a new SparkContext and also is
> being called from an iteration, that looks sort of fishy.
>
> On Thu, Oct 16, 2014 at 2:51 PM, ll  wrote:
> > hello... does anyone know how to resolve this issue?  i'm running this
> > locally on my computer.  keep getting this BindException.  much
> appreciated.
> >
> > 14/10/16 17:48:13 WARN component.AbstractLifeCycle: FAILED
> > SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address
> already
> > in use
> > java.net.BindException: Address already in use
> > at sun.nio.ch.Net.bind0(Native Method)
> > at sun.nio.ch.Net.bind(Net.java:444)
> > at sun.nio.ch.Net.bind(Net.java:436)
> > at
> > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
> > at
> sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> > at
> >
> org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
> > at
> >
> org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
> > at
> >
> org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
> > at
> >
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> > at org.eclipse.jetty.server.Server.doStart(Server.java:293)
> > at
> >
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
> > at
> >
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:192)
> > at
> org.apache.spark.ui.JettyUtils$$anonfun$3.apply(JettyUtils.scala:202)
> > at
> org.apache.spark.ui.JettyUtils$$anonfun$3.apply(JettyUtils.scala:202)
> > at
> >
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1446)
> > at
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> > at
> org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1442)
> > at
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:202)
> > at org.apache.spark.ui.WebUI.bind(WebUI.scala:102)
> > at org.apache.spark.SparkContext.<init>(SparkContext.scala:224)
> > at
> >
> nn.SimpleNeuralNetwork$delayedInit$body.apply(SimpleNeuralNetwork.scala:15)
> > at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
> > at
> scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
> > at scala.App$$anonfun$main$1.apply(App.scala:71)
> > at scala.App$$anonfun$main$1.apply(App.scala:71)
> > at scala.collection.immutable.List.foreach(List.scala:318)
> > at
> >
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> > at scala.App$class.main(App.scala:71)
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/scala-java-net-BindException-tp16624.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> >
>
>
>
> --
> Marcelo
>


Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
great, thanks!

On Tue, Oct 14, 2014 at 5:08 PM, Ankur Dave  wrote:

> On Tue, Oct 14, 2014 at 1:57 PM, Duy Huynh 
>  wrote:
>
>> a related question, what is the best way to update the values of existing
>> vertices and edges?
>>
>
> Many of the Graph methods deal with updating the existing values in bulk,
> including mapVertices, mapEdges, mapTriplets, mapReduceTriplets, and
> outerJoinVertices.
>
> To update just a small number of existing values, IndexedRDD would be
> ideal, but until it makes it into GraphX the best way is to use one of the
> above methods. This will be slower since it's touching all of the vertices,
> but it will achieve the same goal.
>
> For example, if you had a graph and wanted to update the value of vertex 1
> to "a", you could do the following:
>
> val graph = ...
> val updates = sc.parallelize(List((1L, "a")))
> val newGraph = graph.outerJoinVertices(updates) { (id, a, b) => b }
>
> Ankur <http://www.ankurdave.com/>
>


Re: mllib CoordinateMatrix

2014-10-14 Thread Duy Huynh
thanks reza!
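for the archives, a plain-scala sketch of what reza suggests below: a
CoordinateMatrix is backed by an entries RDD of MatrixEntry(i, j, value),
so get/set/row are just filter/map over those entries.  Entry and Seq
stand in here for MatrixEntry and the RDD.

```scala
// stand-in for org.apache.spark.mllib.linalg.distributed.MatrixEntry
final case class Entry(i: Long, j: Long, value: Double)

object CoordinateOps {
  // absent entries are zero, since the matrix is sparse
  def get(entries: Seq[Entry], i: Long, j: Long): Double =
    entries.find(e => e.i == i && e.j == j).map(_.value).getOrElse(0.0)

  // "update" = drop the old entry (if any) and append the new one
  def set(entries: Seq[Entry], i: Long, j: Long, v: Double): Seq[Entry] =
    entries.filterNot(e => e.i == i && e.j == j) :+ Entry(i, j, v)

  // all values on row i, keyed by column
  def row(entries: Seq[Entry], i: Long): Map[Long, Double] =
    entries.filter(_.i == i).map(e => e.j -> e.value).toMap
}
```

with spark, find/filterNot/filter become the usual rdd filter/map
operations on the entries rdd, as reza describes.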

On Tue, Oct 14, 2014 at 5:02 PM, Reza Zadeh  wrote:

> Hello,
>
> CoordinateMatrix is in its infancy, and right now is only a placeholder.
>
> To get/set the value at (i,j), you should map the entries rdd using the
> usual rdd map operation, and change the relevant entries.
>
> To get the values on a specific row, you can call toIndexedRowMatrix(),
> which returns a RowMatrix
> 
> with indices.
>
> Best,
> Reza
>
>
> On Tue, Oct 14, 2014 at 1:18 PM, ll  wrote:
>
>> after creating a coordinate matrix from my rdd[matrixentry]...
>>
>> 1.  how can i get/query the value at coordinate (i, j)?
>>
>> 2.  how can i set/update the value at coordinate (i, j)?
>>
>> 3.  how can i get all the values on a specific row i, ideally as a vector?
>>
>> thanks!
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/mllib-CoordinateMatrix-tp16412.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>


Re: graphx - mutable?

2014-10-14 Thread Duy Huynh
thanks ankur.  indexedrdd sounds super helpful!

a related question, what is the best way to update the values of existing
vertices and edges?
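here is a plain-scala sketch of the rebuild pattern ankur describes below:
keep the source vertex and edge collections, append the new pieces, and
reconstruct the graph each iteration.  the Graph/Edge case classes stand in
for graphx's types, and the lists for the source RDDs.

```scala
// stand-ins for graphx's Edge and Graph, just for illustration
final case class Edge(src: Long, dst: Long)
final case class Graph(vertices: Map[Long, String], edges: List[Edge])

object GrowGraph {
  def addVertex(vs: Map[Long, String], es: List[Edge],
                id: Long, attr: String, newEdges: List[Edge]): Graph = {
    // append to the source collections, then rebuild -- with graphx this
    // would be vertexRDD.union(...), edgeRDD.union(...), then Graph(v, e)
    Graph(vs + (id -> attr), es ++ newEdges)
  }
}
```

the rebuild touches everything each iteration, which is the cost ankur's
SPARK-2365 work aims to avoid.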

On Tue, Oct 14, 2014 at 4:30 PM, Ankur Dave  wrote:

> On Tue, Oct 14, 2014 at 12:36 PM, ll  wrote:
>
>> hi again.  just want to check in again to see if anyone could advise on
>> how
>> to implement a "mutable, growing graph" with graphx?
>>
>> we're building a graph that grows over time.  it adds more vertices and
>> edges every iteration of our algorithm.
>>
>> it doesn't look like there is an obvious way to add a new vertex & a set
>> of edges to an existing graph.
>
>
> Currently the only way to do this is to rebuild the graph in each
> iteration by adding more vertices and edges to the source RDDs, then
> calling the graph constructor.
>
> I'm working on a way to support this more efficiently (SPARK-2365
> ), but GraphX doesn't
> take advantage of this yet.
>
> Ankur 
>