Interesting.

I have another requirement, which is to do something like real-time vector-based queries. Imagine taking a doc vector, reweighting some terms, then doing a query with it, perhaps in truncated form. There are several ways to do this, but only Solr would offer real-time results afaik. It looks like I could use your approach below to do this. A quick look at eDisMax, however, suggests some problems. The use of pf2 and pf3 would jam the query vector into synthesized bigrams and trigrams, for instance.
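For concreteness, here is roughly what I have in mind; just a sketch with made-up names, not working code from my project. Take a doc's term weights, reweight/truncate to the top k terms, and emit a boosted query string (leaving pf2 and pf3 unset so nothing gets jammed into synthetic phrases):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class QueryFromVector {
    // Turn a doc's term->weight map into a truncated, boosted query string.
    public static String toBoostedQuery(Map<String, Double> termWeights, int k) {
        List<Map.Entry<String, Double>> terms =
            new ArrayList<Map.Entry<String, Double>>(termWeights.entrySet());
        // highest weight first
        Collections.sort(terms, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> a, Map.Entry<String, Double> b) {
                return Double.compare(b.getValue(), a.getValue());
            }
        });
        StringBuilder q = new StringBuilder();
        for (Map.Entry<String, Double> e : terms.subList(0, Math.min(k, terms.size()))) {
            if (q.length() > 0) q.append(' ');
            q.append(e.getKey()).append('^')
             .append(String.format(Locale.US, "%.2f", e.getValue()));
        }
        return q.toString(); // e.g. "hadoop^2.40 mahout^1.10 solr^0.75"
    }
}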

I'd be interested in hearing more about how you use it. Is there a better venue than the Mahout list?

On 7/13/12 9:41 PM, Ken Krugler wrote:
Hi Pat,

On Jul 13, 2012, at 12:47pm, Pat Ferrel wrote:

I also do clustering, so that's an obvious optimization I just haven't gotten to yet (doing similarity only on docs clustered together). I'm also trying to decide how to downsample. However, the results from similarity are quite good, so understanding how to scale is #1.

Clustering gives docs closest to a centroid. RowSimilarity finds docs similar to each doc.

What I really need is to calculate the k most similar docs to a short list, known ahead of time. I don't know of an algorithm to do this (other than brute force). It would take a relatively small set of docs and find similar docs in a much, much larger set. RowSimilarity finds all pair-wise similarities; strictly speaking I need only a tiny number of those.

I think Lucene has a weighted vector-based search that I need to investigate further.
As one point of reference, I've used Solr (Lucene) to do this: take the set of features (small, heavily reduced) from the target doc, use them (with weights) via edismax to find some top-N candidate documents in the Lucene index (which I'd built using the same approach, a small set of features), and then calculate pair-wise similarity to rank the results.
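Roughly, in SolrJ terms (host, field, and boost values made up, and from memory, so treat it as a sketch rather than working code):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class CandidateSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        // boosted top terms taken from the target doc's reduced feature set
        SolrQuery q = new SolrQuery("hadoop^2.4 mahout^1.1 solr^0.75");
        q.set("defType", "edismax");
        q.set("qf", "text");  // field indexed with the same reduced features
        q.setRows(100);       // top-N candidates; re-rank these pair-wise afterwards
        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
    }
}

The expensive exact similarity then only runs against those N candidates instead of the whole collection.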

-- Ken

On 7/13/12 9:32 AM, Sebastian Schelter wrote:
Pat,

RowSimilarityJob compares all pairs of rows, which is by definition a quadratic and therefore non-scalable problem. However, the comparison is done in a way that only rows with at least one non-zero value in a common dimension are compared.

Therefore, if you have certain sparse types of input, such as ratings, you only have to look at a relatively small number of pairs and the comparison scales.

RowSimilarityJob is mainly used for the collaborative filtering stuff in Mahout. We have a special job to prepare the data (PreparePreferenceMatrixJob) that takes care of sampling down entries in the rating matrix that might cause too many cooccurrences.

If you directly use RowSimilarityJob, you have to ensure that your input data is of a shape suitable for the job. It seems to me that this is not the case: you created 76GB of intermediate output (cooccurring terms) from 35k documents, so it's clear that it takes Hadoop a long time to sort that in the shuffle phase.
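To see why: a term that occurs in f rows produces f*(f-1)/2 cooccurring pairs, so a single term present in, say, 10,000 of your 35k documents would by itself account for roughly 50 million pairs (these numbers are illustrative, not taken from your counters). A handful of such terms easily explains tens of GB of intermediate output.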

My advice would be to either take a deeper look at your data and downsample highly frequent terms more aggressively, or take a look at other techniques such as clustering or locality-sensitive hashing to find similar documents.
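For example, since seq2sparse was run with -x 40 below, lowering that to something like -x 10 would prune any term appearing in more than 10% of the docs and should shrink the cooccurrence space considerably.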

Best,
Sebastian



On 13.07.2012 18:03, Pat Ferrel wrote:
I increased the timeout to 100 minutes and added another machine (does
the new machine matter in this case?). The job completed successfully.

You say the algorithm is non-scalable--did you mean it's not
parallelizable? I assume I'll need to keep increasing this limit?

I'm sure you know better than I that it is not really good for the
efficiency of a cluster to increase the timeout so far since it means
jobs can take much longer in the case of transient task failures.

On 7/12/12 8:26 AM, Pat Ferrel wrote:
OK, thanks. I haven't checked for sparsity. However I have many
successful runs of rowsimilarity with up to 150,000 docs and 250,000
terms as I said below. This run has a much smaller matrix. I
understand that sparsity is a different question but anyway since the
data in all cases is a crawl of the same sites I'd expect the same
sparsity in all the data sets whether they succeeded or timed out.

My issue has nothing to do with the elapsed time, although I'll have to consider it in larger data sets (thanks for the heads up). Isn't it possible for the task to check in with the task tracker, avoiding a timeout? Or is there some other issue?

On 7/12/12 8:06 AM, Sebastian Schelter wrote:
It's important to note that the performance of RowSimilarityJob
heavily depends on the sparsity of the input data, because in general
comparing all pairs of things is a quadratic (non-scalable) problem.

2012/7/12 Sebastian Schelter <[email protected]>:
Sorry, I overlooked that it's more than one machine. Could you provide
the values for the counters from RowSimilarityJob (ROWS,
COOCCURRENCES, PRUNED_COOCCURRENCES)?

Best,
Sebastian

2012/7/12 Pat Ferrel <[email protected]>:
Thanks, actually there are two machines. I am testing before
spending on
AWS. It's failing the test in this case.

BTW I ran the same setup with 150,000 docs and 250,000 terms with a much lower timeout (30000000) and all worked fine. I was using 0.6 at the time and am not sure if 0.8 has ever completed a rowsimilarity of any size. Small runs work fine on my laptop.

I smell some kind of problem other than simple performance. In any case, in a perfect world, isn't the code supposed to check in often enough that the cluster config doesn't need to be tweaked for a specific job?
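By "check in" I mean an explicit heartbeat, something like this rough sketch with the old mapred API (the class and the loop are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class HeartbeatMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        for (int i = 0; i < 1000; i++) {
            // ... long-running pair-wise work on this record ...
            reporter.progress(); // heartbeat; resets the task timeout clock
        }
        out.collect(value, value);
    }
}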

It may be some problem of mine, of course. I see no obvious hadoop
or mahout
errors but there are many places to look.

With a 100 minute timeout I am currently at the pause between map and
reduce. If it fails would you like any specific logs?


On 7/11/12 4:00 PM, Sebastian Schelter wrote:
To be honest, I don't think it makes a lot of sense to test a Hadoop
job on a single machine. It's pretty obvious that you will get
terrible performance.

2012/7/12 Pat Ferrel <[email protected]>:
BTW the timeout is 1800 but the task in total runs over 9 hours before each failure. This causes the job to take (after three tries) 27 hrs to completely fail. Oh, bother...

The timeout seems to be during the last map, when the mappers reach 100% but are still running. Maybe some kind of cleanup is happening? The first reducer is still "pending"; it never gets a chance to start.

12/07/11 11:09:45 INFO mapred.JobClient:  map 92% reduce 0%
12/07/11 11:11:06 INFO mapred.JobClient:  map 93% reduce 0%
12/07/11 11:12:51 INFO mapred.JobClient:  map 94% reduce 0%
12/07/11 11:15:22 INFO mapred.JobClient:  map 95% reduce 0%
12/07/11 11:18:43 INFO mapred.JobClient:  map 96% reduce 0%
12/07/11 11:24:32 INFO mapred.JobClient:  map 97% reduce 0%
12/07/11 11:27:40 INFO mapred.JobClient:  map 98% reduce 0%
12/07/11 11:30:53 INFO mapred.JobClient:  map 99% reduce 0%
12/07/11 11:36:35 INFO mapred.JobClient:  map 100% reduce 0%
---after a very long wait (9hrs or so) insert fail here--->

8-core, 2-machine cluster with 8G RAM per machine; 32,000 docs, 76,000 terms

Any other info you need please ask.

I'm about to try cranking the timeout up to a couple of hours, but I suspect there is something else going on here.


On 7/11/12 10:35 AM, Pat Ferrel wrote:
I have a custom Lucene stemming analyzer that filters out stop words and use the following seq2sparse. The -x 40 is the only other thing that affects tossing frequent terms; as I understand things, it tosses any term that appears in over 40% of the docs.

mahout seq2sparse \
       -i b2/seqfiles/ \
       -o b2/vectors/ \
       -ow \
       -chunk 2000 \
       -x 40 \
       -seq \
       -n 2 \
       -nv \
       -a com.finderbots.analyzers.LuceneStemmingAnalyzer
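(For reference, and assuming I have the options right: -chunk 2000 is the dictionary chunk size in MB, -seq writes sequential-access vectors, -n 2 applies L2 normalization, and -nv keeps named vectors.)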


On 7/11/12 9:18 AM, Sebastian Schelter wrote:
Hi Pat,

have you removed highly frequent terms before launching the rowsimilarity job?

On 11.07.2012 18:14, Pat Ferrel wrote:
I've been trying to get a rowsimilarity job to complete. It continues to time out on a RowSimilarityJob-CooccurrencesMapper-Reducer task, so I've upped the timeout to 30 minutes now. There are no errors in the logs that I can see, and no other task I've tried is acting like this. Is this expected? Shouldn't the task check in more often?

It's doing 34,000 docs with 40 sim docs each on 8 cores, so it is a bit slow anyway. Still, I shouldn't have to turn the timeout up so high, should I?




--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






