Yes, I know that RowSimilarityJob does not do what I really want; it performs
a lot of unnecessary computations (similarities I don't need). But I thought
it would give me an idea of what could happen with a distributed version of
the problem with A = 1,000,000 rows and B = 50,000 rows, for example. The
system I'm working on should be prepared for that case, though it may happen
less than 0.1% of the time. A more likely situation will be A = 500,000 rows
and B = 5,000 rows.
I agree that the problem can be the extra load due to distribution. It will be
interesting for me to detect the point at which the distributed version gives
better performance than a non-distributed one. I think a distributed version
should be quite easy to implement, since I'm using cosine similarity. Maybe it
would be more efficient to do something like this pseudo-Mahout code:

numerator = A.times(B.transpose());
normA = A.norm();                      /* For each row of A */
normB = B.norm();                      /* For each row of B */
denominator = normA * normB;           /* Elementwise */
similarity = numerator / denominator;  /* Elementwise */

Thank you all again for your opinions.
Fernando.

2010/12/20 Sean Owen <[email protected]>

> I don't think that's what the job does. It is computing the similarity of
> every row of A with every other row of A, which is not what you are trying
> to do.
>
> Even at tens of thousands of rows in memory, that's not large at all and
> comfortably fits in memory. I would just continue with your non-distributed
> version.
>
> I think one fact that's overlooked is that distributing a computation
> typically introduces a load of overhead -- some constant scalar factor, and
> not a small one. It takes a lot of work to move all that data around.
> Distributing is a necessary evil, and I believe it should be avoided if you
> can avoid it.
>
> 2010/12/20 Fernando Fernández <[email protected]>
>
> > Hi Sebastian,
> >
> > Actually, this is related to another message I sent a couple of days ago.
> > What I really want to implement is an A-to-B similarity job. A is at this
> > moment about 50K rows and B 1,000 rows, but this will grow in the future
> > (possibly A to hundreds of thousands and B to tens of thousands), so I
> > thought a RowSimilarityJob over a C matrix (C being the rows of A and B
> > put together) would give me an idea of the possible performance of this
> > future distributed A-to-B similarity job, and some results to check
> > whether the methodology works for my problem.
> > I have a non-distributed version right now that solves the "50,000 to
> > 1,000" problem in about 40 minutes on a single machine, so I expect that
> > a distributed version can solve the problem in approximately
> > (time / # of nodes), since I could simply split A row-wise and put each
> > piece on a node along with a whole copy of B. So, as you say, something
> > is going really wrong in my RowSimilarityJob run... maybe I should just
> > forget about RowSimilarityJob and implement a job that is not designed
> > to deal with sparse matrices...
> >
> > Thank you all!!
> >
> > 2010/12/20 Sebastian Schelter <[email protected]>
> >
> > > Hi Fernando,
> > >
> > > If you set maxSimilaritiesPerRow to 100, it will return only the 100
> > > most similar rows for each row.
> > >
> > > The density of your matrix could maybe explain the long execution
> > > time, as the number of comparisons that need to be made might become
> > > quadratic, because every row needs to be compared with every other row
> > > (50K times 50K is up in the billions). RowSimilarityJob's purpose is
> > > to work on sparse matrices.
> > >
> > > Could you give us some details about your use case?
> > >
> > > --sebastian
> > >
> > > On 20.12.2010 12:58, Fernando Fernández wrote:
> > >
> > >> Ok, understood now :)
> > >>
> > >> About the parameters:
> > >>
> > >> It's a 50000x100 dense matrix, so I set the --numberOfColumns
> > >> parameter to 100, and the rest now have the default values (this
> > >> means that maxSimilaritiesPerRow is set to 100, but I don't know
> > >> which 100 it will return...)
> > >>
> > >> 2010/12/20 Sebastian Schelter <[email protected]>
> > >>
> > >>> Hi,
> > >>>
> > >>> Most of Mahout's algorithm implementations need to run a series of
> > >>> map/reduce jobs to compute their results. By specifying a start and
> > >>> end phase you can make the implementation run only some of these
> > >>> internal jobs. You could, e.g.,
> > >>> use this to restart a failed execution.
> > >>>
> > >>> --sebastian
> > >>>
> > >>> On 20.12.2010 12:41, Fernando Fernández wrote:
> > >>>
> > >>>> But does this affect the result? What will I get if I launch
> > >>>> RowSimilarityJob (cosine similarity) with --startPhase=1 and
> > >>>> --endPhase=2? I don't fully understand what "phases" exactly are
> > >>>> in this case.
> > >>>>
> > >>>> 2010/12/20 Niall Riddell <[email protected]>
> > >>>>
> > >>>>> Startphase and endphase shouldn't impact overall performance in
> > >>>>> any way; however, they do mean that you can start at a later
> > >>>>> stage in a job pipeline.
> > >>>>>
> > >>>>> You can execute specific MR jobs by designating a startPhase and
> > >>>>> endPhase. It goes without saying that the correct inputs must be
> > >>>>> available to start a phase correctly.
> > >>>>>
> > >>>>> The first MR job is index 0. So setting --startPhase 1 will
> > >>>>> execute the 2nd job onwards. Putting in --endPhase 2 would stop
> > >>>>> after the 3rd job.
> > >>>>>
> > >>>>> On 20 Dec 2010 11:17, "Fernando Fernández"
> > >>>>> <[email protected]> wrote:
> > >>>>>
> > >>>>>> Hello everyone,
> > >>>>>>
> > >>>>>> Can anyone explain what exactly these two parameters (startPhase
> > >>>>>> and endPhase) are and how to use them? I'm trying to launch a
> > >>>>>> RowSimilarityJob on a 50K-row matrix (100 columns) with cosine
> > >>>>>> similarity and the default startPhase and endPhase parameters,
> > >>>>>> and I'm getting extremely poor performance on a quite big
> > >>>>>> cluster (after 16 hours it has only reached 3% of the process),
> > >>>>>> and I think this could have something to do with the startPhase
> > >>>>>> and endPhase parameters. What do you think?
> > >>>>>> How do these parameters affect the RowSimilarityJob?
> > >>>>>>
> > >>>>>> Thanks in advance.
> > >>>>>> Fernando.
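[Editor's note: the phase convention Niall describes (jobs indexed from 0; --startPhase 1 runs the 2nd job onwards, --endPhase 2 stops after the 3rd) can be illustrated with a tiny sketch. The function and names below are hypothetical for illustration, not Mahout's actual driver code.]

```python
def run_pipeline(jobs, start_phase=0, end_phase=None):
    """Run only the jobs whose index falls in [start_phase, end_phase].

    Hypothetical sketch of the start/end-phase convention: a pipeline of
    MR jobs is indexed from 0 and a contiguous window of it is executed.
    """
    if end_phase is None:
        end_phase = len(jobs) - 1
    executed = []
    for i, job in enumerate(jobs):
        if start_phase <= i <= end_phase:
            job()
            executed.append(i)
    return executed

jobs = [lambda: None] * 4                  # four MR jobs, indexed 0..3
print(run_pipeline(jobs, start_phase=1))   # 2nd job onwards: [1, 2, 3]
print(run_pipeline(jobs, end_phase=2))     # stop after 3rd job: [0, 1, 2]
```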
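[Editor's note: the pseudo-Mahout code in Fernando's reply maps directly onto a dense matrix computation. The following is a minimal NumPy sketch (toy sizes; names are illustrative, not Mahout API) of the A-to-B cosine similarity the thread discusses. It also checks the row-wise split Fernando proposes: each "node" gets a slice of A plus a whole copy of B, and stacking the partial results reproduces the full answer.]

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.random((6, 4))   # stand-in for the ~50K x 100 matrix A
B = rng.random((3, 4))   # stand-in for the smaller matrix B

def cosine_similarity(A, B):
    """Cosine similarity of every row of A against every row of B."""
    numerator = A @ B.T                     # A.times(B.transpose())
    norm_a = np.linalg.norm(A, axis=1)      # norm of each row of A
    norm_b = np.linalg.norm(B, axis=1)      # norm of each row of B
    denominator = np.outer(norm_a, norm_b)  # elementwise products normA*normB
    return numerator / denominator          # elementwise division

full = cosine_similarity(A, B)              # shape (6, 3)

# Row-wise split: compute each slice of A against B, then stack.
parts = [cosine_similarity(chunk, B) for chunk in np.array_split(A, 2)]
assert np.allclose(np.vstack(parts), full)
```

Because the partial results are independent, this is the embarrassingly parallel structure that makes the (time / # of nodes) estimate plausible, up to the distribution overhead Sean mentions.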
