Just like to add that we too have seen extremely slow tasks as a result of ridiculously long URLs. Adding a URL filter that rejects URLs longer than 2000 characters (or something like that) is pretty much mandatory for any serious internet crawl.
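
For reference, if you use the standard regex-urlfilter plugin, a rule along these lines should do it (the exact threshold is of course up to you); just make sure it sits before the final catch-all rule, since the first matching rule wins:

  # reject any URL longer than roughly 2000 characters
  -^.{2000,}$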

On 08/31/2011 11:49 AM, Markus Jelsma wrote:
Thanks for shedding some light. I was already looking for filters/normalizers
in the reduce step but couldn't find them. I forgot to think about the job's
output format. Makes sense indeed.

Cheers

On Wednesday 31 August 2011 11:26:46 Julien Nioche wrote:
Hi Markus,

You are right in thinking that the reduce step does not do much in itself.
It is not so much the reduce step that is likely to be the source of your
problem as the URL filtering / normalizing within ParseOutputFormat.
Basically we get outlinks as a result of the parse, and when writing the
output to HDFS we need to filter / normalise them.

I have seen problems on large crawls with ridiculously long URLs which threw
the normalisation into disarray, with the symptoms you described. You can add
a trace to the log before normalising to see what the URLs look like, and
add a custom normaliser which prevents large URLs from being processed.
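
Something like the rough sketch below, for instance. The class and package
names are made up, and you'd still have to wrap it up as a plugin (plugin.xml,
plugin.includes, etc.), but it shows the idea of short-circuiting before the
expensive regex-based normalisers get a chance to run:

  package org.example.nutch.net;  // hypothetical package

  import java.net.MalformedURLException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLNormalizer;

  /** Rough sketch: skip / drop ridiculously long URLs before normalisation. */
  public class MaxLengthURLNormalizer implements URLNormalizer {

    private static final int MAX_LENGTH = 2000; // pick whatever threshold suits you

    private Configuration conf;

    public String normalize(String urlString, String scope)
        throws MalformedURLException {
      if (urlString != null && urlString.length() > MAX_LENGTH) {
        // Returning null should drop the outlink downstream (ParseOutputFormat
        // skips null URLs); return urlString unchanged instead if you'd rather
        // keep it and only skip the costly normalisation.
        return null;
      }
      return urlString;
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    public Configuration getConf() {
      return conf;
    }
  }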

As usual jstack is your friend and will confirm that this is where the
problem is.

HTH

Julien

On 30 August 2011 23:39, Markus Jelsma <[email protected]> wrote:
I should add that I sometimes see a URL filter exception written to the
reduce log. I don't understand why this is the case; all the
ParseSegment.reduce() code does is collect key/value data.

I should also point out that most reducers finish in reasonable time and it's
always one task stalling the job to excessive lengths. The cluster is
homogeneous; this is not an assumption (I know the fallacies of distributed
computing ;) ). A server stalling the process is identical to all others, and
the replication factor is only 2 for all files except the crawl db.

Please enlighten me.

Hi,

Any idea why the reducer of the parse job is as slow as a snail taking
a detour? There is no processing in the reducer; all it does is copy the
keys and values.

The reduce step (meaning the last 33% of the reducer) is even slower
than the whole parsing done in the mapper! It is even slower than the
whole fetch job while it is the fetcher that produces the most output
(high I/O).

A running cycle has the fetcher writing 70GiB and 10GiB to HDFS (total
amount) while the reducer has 7 times less data to write and no
processing! Yet it takes about 3 times longer to complete; stunning
figures!

This excessive run time only became apparent when I significantly
increased the number of URLs to generate (topN). When the topN was
lower, the difference between run times of the fetch and parse jobs
was a lot smaller; usually it was the fetcher being slow because of
merging the spills.

Any thoughts?

Thanks
