Hi Sebastian,

Actually, this is related to another message I sent a couple of days ago.
What I really want to implement is an A-to-B similarity job. A is
currently about 50K rows and B about 1,000 rows, but this will grow in
the future (possibly A to hundreds of thousands and B to tens of
thousands), so I thought a rowsimilarity run over a matrix C (C being the
rows of A and B put together) would give me an idea of the possible
performance of this future distributed A-to-B similarity job, and some
results to check whether the methodology works for my problem. I have a
non-distributed version right now that solves the "50,000 to 1,000"
problem in about 40 minutes on a single machine, so I expect a
distributed version to solve the problem in approximately
(time / # of nodes), since I could simply split A row-wise and put each
piece on a node along with a whole copy of B. So, as you say, something
is going really wrong in my rowsimilarity process... maybe I should just
forget about using rowsimilarity and implement a job that is not built
around sparse matrices...

Thank you all!!

2010/12/20 Sebastian Schelter <[email protected]>

> Hi  Fernando,
>
> If you set maxSimilaritiesPerRow to 100 it will return only the 100 most
> similar rows for each row.
>
> The density of your matrix could explain the long execution time, as
> the number of comparisons that need to be made becomes quadratic:
> every row needs to be compared with every other row (50K times 50K is
> up in the billions). RowSimilarityJob's purpose is to work on sparse
> matrices.
>
> Could you give us some details about your use case?
>
> --sebastian
>
>
>
>
>
> On 20.12.2010 12:58, Fernando Fernández wrote:
>
>> Ok, understood now :)
>>
>> About the parameters:
>>
>> It's a 50000x100 dense matrix, so I set the --numberOfColumns parameter
>> to 100, and the rest have the default values (this means that
>> maxSimilaritiesPerRow is set to 100, but I don't know which 100 it will
>> return...)
>>
>> 2010/12/20 Sebastian Schelter<[email protected]>
>>
>>> Hi,
>>>
>>> Most of Mahout's algorithm implementations need to run a series of
>>> map/reduce jobs to compute their results. By specifying a start and
>>> end phase you can make the implementation run only some of these
>>> internal jobs. You could e.g. use this to restart a failed execution.
>>>
>>> --sebastian
>>>
>>>
>>>
>>> On 20.12.2010 12:41, Fernando Fernández wrote:
>>>
>>>> But, does this affect the result? What will I get if I launch
>>>> RowSimilarity (cosine similarity) with --startPhase=1 and
>>>> --endPhase=2? I don't fully understand what "phases" exactly are in
>>>> this case.
>>>>
>>>> 2010/12/20 Niall Riddell<[email protected]>
>>>>
>>>>> Startphase and endphase shouldn't impact overall performance in any
>>>>> way; however, they do mean that you can start at a later stage in a
>>>>> job pipeline.
>>>>>
>>>>> You can execute specific MR jobs by designating a startphase and
>>>>> endphase. It goes without saying that the correct inputs must be
>>>>> available to start a phase correctly.
>>>>>
>>>>> The first MR job is index 0, so setting --startPhase 1 will execute
>>>>> from the 2nd job onwards. Putting in --endPhase 2 would stop after
>>>>> the 3rd job.
>>>>> On 20 Dec 2010 11:17, "Fernando Fernández"<
>>>>> [email protected]>   wrote:
>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> Can anyone explain exactly what these two parameters (startphase
>>>>>> and endphase) are and how to use them? I'm trying to launch a
>>>>>> RowSimilarity job on a 50K row matrix (100 columns) with cosine
>>>>>> similarity and default startphase and endphase parameters, and I'm
>>>>>> getting extremely poor performance on a quite big cluster (after 16
>>>>>> hours, it has only reached 3% of the process), and I think this
>>>>>> could have something to do with the startphase and endphase
>>>>>> parameters. What do you think? How do these parameters affect the
>>>>>> RowSimilarity job?
>>>>>>
>>>>>> Thanks in advance.
>>>>>> Fernando.
>>>>>>
>>>>>>
>
