Yes, we hit another trap: an endless list of crap URLs on many hosts.

Be very careful when you see www.museum-zeeaquarium.netspirit.nl popping up in 
the logs a few too many times: never as the host, but always tailing the URL. 
It has 'infected' URLs on many different hosts.

http://<ANY_HOST>/<URI_SEGMENT>/<PAGE>.html;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/contact.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/contact.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/beheren.php;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Vluchtelingen;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/Museum_Friesland;/Museum_Gelderland;/Museum_Utrecht;/Museum_Gelderland;/http;/www.museum-zeeaquarium.netspirit.nl
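
One way to kill this particular trap early is a deny rule in the stock
conf/regex-urlfilter.txt. Something along these lines should catch the
repeating ';/segment' tail (untested sketch; the repetition count of 10 is
arbitrary):

  # reject URLs whose path carries ten or more ';/segment' repetitions
  -(?:;/[^;/]+){10,}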


> Just like to add that we too have seen extremely slow tasks as a result
> of ridiculously long URLs. Adding a URL filter that filters out URLs
> longer than 2000 characters (or something like that) is pretty much
> mandatory for any serious internet crawl.
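
For the record, a hard length cutoff is easy to bolt on. The stock
regex-urlfilter can do it with a deny rule like -.{2000,}, or you can write a
tiny plugin against Nutch's URLFilter extension point. A minimal sketch,
assuming a 2000-character cutoff; the class name and the urlfilter.max.length
property are made up for illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  // Hypothetical filter: rejects any URL longer than a configured maximum.
  public class MaxLengthURLFilter implements URLFilter {

    private Configuration conf;
    private int maxLength = 2000;

    // Nutch contract: return the URL to keep it, null to discard it.
    public String filter(String urlString) {
      if (urlString == null || urlString.length() > maxLength) {
        return null;
      }
      return urlString;
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
      // urlfilter.max.length is a made-up property name
      maxLength = conf.getInt("urlfilter.max.length", 2000);
    }

    public Configuration getConf() {
      return conf;
    }
  }

Like any filter plugin it still needs a plugin.xml and an entry in
plugin.includes before Nutch will load it.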
> 
> On 08/31/2011 11:49 AM, Markus Jelsma wrote:
> > Thanks for shedding some light. I was already looking for
> > filters/normalizers in that step but couldn't find them. I forgot to
> > think about the job's output format. Makes sense indeed.
> > 
> > Cheers
> > 
> > On Wednesday 31 August 2011 11:26:46 Julien Nioche wrote:
> >> Hi Markus,
> >> 
> >> You are right in thinking that the reduce step does not do much in
> >> itself. It is not so much the reduce step that is likely to be the
> >> source of your problem, but the URL filtering / normalizing within
> >> ParseOutputFormat. Basically we get outlinks as a result of the parse,
> >> and when writing the output to HDFS we need to filter / normalise them.
> >> 
> >> I have seen problems on large crawls where ridiculously large URLs
> >> put the normalisation in disarray, with the symptoms you described. You
> >> can add a trace in the log before normalising to see what the URLs look
> >> like, and add a custom normaliser which prevents large URLs from being
> >> processed.
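
A sketch of such a trace, as a pass-through normalizer you would slot in
front of the others via the urlnormalizer.order property. The class name and
the 2000-character threshold are illustrative, and actual rejection is
arguably cleaner in a URLFilter:

  import java.net.MalformedURLException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLNormalizer;
  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  // Hypothetical normalizer: logs oversized URLs, then passes them through
  // unchanged so the task logs reveal the culprits before normalisation.
  public class TracingURLNormalizer implements URLNormalizer {

    private static final Logger LOG =
        LoggerFactory.getLogger(TracingURLNormalizer.class);
    private Configuration conf;

    public String normalize(String urlString, String scope)
        throws MalformedURLException {
      if (urlString != null && urlString.length() > 2000) {
        // Log only a prefix so the log file itself doesn't blow up.
        LOG.warn("Oversized URL ({} chars, scope {}): {}...",
            urlString.length(), scope, urlString.substring(0, 100));
      }
      return urlString;
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
    }

    public Configuration getConf() {
      return conf;
    }
  }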
> >> 
> >> As usual jstack is your friend and will confirm that this is where the
> >> problem is.
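
For anyone following along, the routine is roughly: ssh to the node running
the stalled reduce task, find the child task JVM (e.g. with jps -l), then

  jstack <pid>

If the dump keeps showing threads deep in java.util.regex / normalizer
frames, that's the confirmation.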
> >> 
> >> HTH
> >> 
> >> Julien
> >> 
> >> On 30 August 2011 23:39, Markus Jelsma <[email protected]> wrote:
> >>> I should add that I sometimes see a URL filter exception written to
> >>> the reduce log. I don't understand why this is the case; all the
> >>> ParseSegment.reduce() code does is collect key/value data.
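
For context, the reduce in question really is a pass-through; in Nutch 1.x
ParseSegment it looks roughly like this (paraphrased from memory, check your
version):

  public void reduce(Text key, Iterator<Writable> values,
      OutputCollector<Text, Writable> output, Reporter reporter)
      throws IOException {
    output.collect(key, values.next()); // just forward the value
  }

The URLFilterException is thrown not here but in ParseOutputFormat, whose
record writer runs the filters and normalizers over the outlinks as the
reduce output is written; that is why it surfaces in the reduce log.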
> >>> 
> >>> I should also point out that most reducers finish in reasonable time;
> >>> it's always one task stalling the job to excessive lengths. The
> >>> cluster is homogeneous; this is not an assumption (I know the
> >>> fallacies of distributed computing ;) ). A server stalling the process
> >>> is identical to all the others, and the replication factor is only 2
> >>> for all files except the crawl db.
> >>> 
> >>> Please enlighten me.
> >>> 
> >>>> Hi,
> >>>> 
> >>>> Any idea why the reducer of the parse job is as slow as a snail
> >>>> taking a detour? There is no processing in the reducer; all it does
> >>>> is copy the keys and values.
> >>>> 
> >>>> The reduce step (meaning the last 33% of the reducer) is even slower
> >>>> than the whole parsing done in the mapper! It is even slower than the
> >>>> whole fetch job while it is the fetcher that produces the most output
> >>>> (high I/O).
> >>>> 
> >>>> A running cycle has the fetcher writing 70GiB and 10GiB to HDFS
> >>>> (total amount), while the reducer has 7 times less data to write and
> >>>> no processing! Yet it takes about 3 times longer to complete;
> >>>> stunning figures!
> >>>> 
> >>>> This excessive run time became apparent only when I significantly
> >>>> increased the number of URLs to generate (topN). When the topN was
> >>>> lower, the difference between the run times of the fetch and parse
> >>>> jobs was a lot smaller; usually it was the fetcher being slow because
> >>>> of merging the spills.
> >>>> 
> >>>> Any thoughts?
> >>>> 
> >>>> Thanks
