Yes, I know that RowSimilarityJob does not do what I really want; it does a
lot of unnecessary computations (similarities I don't need). But I thought it
would give me an idea of what could happen with a distributed version of the
problem with A = 1,000,000 rows and B = 50,000 rows, for example. The system
I'm working on should be prepared for that case, though it may happen less
than 0.1% of the time. A more likely situation will be A = 500,000 rows and
B = 5,000 rows.

I agree that the problem can be the extra load due to distribution. It will
be interesting for me to find the point where the distributed version gives
better performance than the non-distributed one.

I think a distributed version should be quite easy to implement, since I'm
using cosine similarity. Maybe it would be more efficient to do something
like this pseudo-Mahout code:

numerator = A.times(B.transpose());
normA = A.norm();   /* For each row of A */
normB = B.norm();   /* For each row of B */
denominator = normA * normB;   /* Elementwise (outer product of the row norms) */
similarity = numerator / denominator;   /* Elementwise */
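As a sanity check of the pseudocode above, here is a minimal pure-Python sketch (a hypothetical helper using plain lists instead of Mahout vectors and matrices) of the A-to-B cosine similarity; it computes exactly the numerator/denominator structure shown above, just without the distribution:

```python
import math

def cosine_similarities(A, B):
    """Cosine similarity of every row of A against every row of B.

    Returns a len(A) x len(B) matrix, where entry [i][j] is the
    similarity of A's row i and B's row j.
    """
    # Precompute the row norms of B once (the "normB" step).
    norm_b = [math.sqrt(sum(x * x for x in row)) for row in B]
    result = []
    for a in A:
        norm_a = math.sqrt(sum(x * x for x in a))  # the "normA" step
        row = []
        for b, nb in zip(B, norm_b):
            dot = sum(x * y for x, y in zip(a, b))  # numerator: a . b
            row.append(dot / (norm_a * nb))         # denominator: |a| * |b|
        result.append(row)
    return result

# Tiny example: 2 rows of A against 2 rows of B.
A = [[1.0, 0.0], [1.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
sims = cosine_similarities(A, B)
```

A distributed version would split A row-wise across nodes, ship a full copy of B (and its precomputed norms) to each node, and run this same inner loop on each partition.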


Thank you all again, for your opinions.

Fernando.

2010/12/20 Sean Owen <[email protected]>

> I don't think that's what the job does. It is computing the similarity of
> every row of A with every other row of A, which is not what you are trying
> to do.
>
> Even at tens of thousands of rows, that's not large at all and comfortably
> fits in memory. I would just continue with your non-distributed version.
>
> I think one fact that's overlooked is that distributing a computation
> typically introduces a load of overhead -- some constant scalar factor, and
> not a small one. It takes a lot of work to move all that data around.
> Distributing is a necessary evil, and I believe it should be avoided if you
> can avoid it.
>
> 2010/12/20 Fernando Fernández <[email protected]>
>
> > Hi Sebastian,
> >
> > Actually, this is related to another message I sent a couple of days ago.
> > What I really want to implement is an A-to-B similarity job. A is at the
> > moment about 50K rows and B 1000 rows, but this will grow in the future
> > (possibly A to hundreds of thousands and B to tens of thousands), so I
> > thought a RowSimilarityJob over a matrix C (C being the rows of A and B
> > put together) would give me an idea of the possible performance of this
> > future distributed A-to-B similarity job, plus some results to check
> > whether the methodology works for my problem. I have a non-distributed
> > version right now that solves the "50000 to 1000" problem in about 40
> > minutes on a single machine, so I expect that a distributed version can
> > solve it in approximately (time / # of nodes), since I could simply split
> > A row-wise and put each piece on a node together with a whole copy of B.
> > So, as you say, something is going really wrong in my RowSimilarityJob
> > run... maybe I should just forget RowSimilarityJob and implement a job
> > that is not specialized for sparse matrices...
> >
> > Thank you all!!
> >
> > 2010/12/20 Sebastian Schelter <[email protected]>
> >
> > > Hi  Fernando,
> > >
> > > If you set maxSimilaritiesPerRow to 100, it will return only the 100
> > > most similar rows for each row.
> > >
> > > The density of your matrix could maybe explain the long execution time,
> > > as the number of comparisons that need to be made might become
> > > quadratic: every row needs to be compared with every other row (50K
> > > times 50K is up in the billions). RowSimilarityJob's purpose is to work
> > > on sparse matrices.
> > >
> > > Could you give us some details about your usecase?
> > >
> > > --sebastian
> > >
> > >
> > > On 20.12.2010 12:58, Fernando Fernández wrote:
> > >
> > >> Ok, understood now :)
> > >>
> > >> About the parameters:
> > >>
> > >> It's a 50000x100 dense matrix, so I set the --numberOfColumns
> > >> parameter to 100 and left the rest at the default values (this means
> > >> that maxSimilaritiesPerRow is set to 100, but I don't know which 100
> > >> it will return...)
> > >>
> > >> 2010/12/20 Sebastian Schelter<[email protected]>
> > >>
> > >>> Hi,
> > >>>
> > >>> Most of Mahout's algorithm implementations need to run a series of
> > >>> map/reduce jobs to compute their results. By specifying a start and
> > >>> end phase you can make the implementation run only some of these
> > >>> internal jobs. You could e.g. use this to restart a failed execution.
> > >>>
> > >>> --sebastian
> > >>>
> > >>>
> > >>>
> > >>> On 20.12.2010 12:41, Fernando Fernández wrote:
> > >>>
> > >>>> But does this affect the result? What will I get if I launch
> > >>>> RowSimilarityJob (cosine similarity) with --startPhase=1 and
> > >>>> --endPhase=2? I don't fully understand what "phases" exactly are in
> > >>>> this case.
> > >>>>
> > >>>> 2010/12/20 Niall Riddell<[email protected]>
> > >>>>
> > >>>>> Startphase and endphase shouldn't impact overall performance in
> > >>>>> any way; however, they do mean that you can start at a later stage
> > >>>>> in a job pipeline.
> > >>>>>
> > >>>>> You can execute specific MR jobs by designating a start phase and
> > >>>>> an end phase. It goes without saying that the correct inputs must
> > >>>>> be available to start a phase correctly.
> > >>>>>
> > >>>>> The first MR job is index 0, so setting --startPhase 1 will execute
> > >>>>> the 2nd job onwards. Setting --endPhase 2 would stop after the 3rd
> > >>>>> job.
> > >>>>> On 20 Dec 2010 11:17, "Fernando Fernández"
> > >>>>> <[email protected]> wrote:
> > >>>>>
> > >>>>>> Hello everyone,
> > >>>>>>
> > >>>>>> Can anyone explain what exactly these two parameters (startPhase
> > >>>>>> and endPhase) are and how to use them? I'm trying to launch a
> > >>>>>> RowSimilarityJob on a 50K-row matrix (100 columns) with cosine
> > >>>>>> similarity and the default startPhase and endPhase parameters, and
> > >>>>>> I'm getting extremely poor performance on a quite big cluster
> > >>>>>> (after 16 hours it has only reached 3% of the process). I think
> > >>>>>> this could have something to do with the startPhase and endPhase
> > >>>>>> parameters. What do you think? How do these parameters affect the
> > >>>>>> RowSimilarityJob?
> > >>>>>>
> > >>>>>> Thanks in advance.
> > >>>>>> Fernando.
> > >>>>>>
> > >>>>>>
> > >
> >
>
