Hi Matt,

Computing pairwise similarity is an inherently quadratic problem. The
runtime depends not so much on the amount of data as on its
distribution: if a few things in your data co-occur with nearly
everything else, you will get quadratically sized intermediate results.
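To see why a few very frequent features hurt so much: a column that is
non-zero in n rows contributes on the order of n*(n-1)/2 row pairs to
the intermediate output of the co-occurrence step. A quick
back-of-the-envelope sketch (the occurrence counts below are made up,
plug in your own):

// Rough estimate only: a column occurring in n rows yields
// about n * (n - 1) / 2 row pairs in the co-occurrence step.
public class PairEstimate {
  public static void main(String[] args) {
    long[] occurrenceCounts = { 1000000L, 10000L, 100L }; // hypothetical
    for (long n : occurrenceCounts) {
      long pairs = n * (n - 1) / 2;
      System.out.println(n + " occurrences -> ~" + pairs + " pairs");
    }
  }
}

A single column present in a million rows already accounts for roughly
5 * 10^11 pairs, far more than all the rare columns combined.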

In the collaborative filtering code, users with an enormous number of
interactions are down-sampled to avoid this. If you use the "raw"
RowSimilarityJob, you might have to do this yourself.
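If you go the manual route, the idea is simply to cap the number of
non-zeros per over-represented row or column, e.g. by keeping a random
subset. A minimal sketch of that idea (this is not the Mahout
implementation, the class and method names are mine):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustration only: keep at most 'cap' randomly chosen interactions.
public class DownSampleSketch {
  static <T> List<T> downSample(List<T> interactions, int cap, Random random) {
    if (interactions.size() <= cap) {
      return interactions;
    }
    List<T> copy = new ArrayList<T>(interactions);
    Collections.shuffle(copy, random);
    return copy.subList(0, cap);
  }
}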

You should have a look at your data and see whether this is the case.
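If you have a text dump of your matrix with one row/feature pair per
line (format assumed here, adjust the parsing to yours), counting how
many rows each feature occurs in will tell you quickly whether a
handful of features dominate:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

// Counts per-feature occurrences, assuming "rowId<TAB>featureId" lines.
public class FeatureCounts {
  public static void main(String[] args) throws Exception {
    Map<String, Long> counts = new HashMap<String, Long>();
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      String feature = line.split("\t")[1];
      Long c = counts.get(feature);
      counts.put(feature, c == null ? 1L : c + 1);
    }
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      if (e.getValue() > 100000) { // arbitrary threshold, tune to taste
        System.out.println(e.getKey() + "\t" + e.getValue());
      }
    }
  }
}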

Best,
Sebastian



On 05.11.2012 17:22, Matt Molek wrote:
> Having found a few mentions of running rowsimilarity with multiple
> reducers, I assume it's ok.
> 
> I'm having a problem with the RowSimilarityJob-CooccurrencesMapper-Reducer
> job though. I'm running over a data set of ~5 million entries x ~3 million
> boolean features, where each entry has no more than 10 non-zeros. With 256
> mappers, ~95% of them finish within 10 minutes. The last 5% get stuck at
> random levels of completeness, like 44.47%, and just sit there for ages
> spilling more and more output but never increasing the completeness
> counter. Eventually after as much as 8 hours they jump to 100%, merge their
> output, and finish.
> 
> It's usually the early map tasks that have trouble. Right now I'm sitting
> with all tasks done except mappers 0-4 which are stuck at various states of
> completeness.
> 
> Is there something about the ordering of the output of the
> RowSimilarityJob-VectorNormMapper-Reducer job that would consistently cause
> the early map tasks on RowSimilarityJob-CooccurrencesMapper-Reducer job to
> take forever? Is there any tuning I can do to more evenly distribute this
> load so 5% of my mappers don't slow my job down so horribly?
> 
