Re: Funky duplicate url's, getting much worse!

Markus Jelsma Tue, 28 Sep 2010 09:52:05 -0700

Anyone? Where is a proper solution for this issue? As expected, the regex won't 
catch all imaginable kinds of funky URL's that somehow ended up in the CrawlDB. 
Before the weekend, i added another news site to the tests i conduct and let it 
run continuously. Unfortunately, the generator now comes up with all kinds of 
completely useless URL's, although they do exist but that's just the web 
application ignoring most parts of the URL's.


 

This is the URL that should be considered as proper URL:

http://www.blikopnieuws.nl/nieuwsblok

 

Here are two URL's that are completely useless:

http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie

http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/

 

It is very hard to use deduplication on these simply because the content is 
actually changes too much as time progresses - the latest news block for 
example. It is therefore a necessity to keep these URL's from ending up in the 
CrawlDB and so not to waste disk space and update time of the CrawlDB and and 
huge load of bandwidth - i'm in my current fetch probably going to waste at 
least a few GB's.

 

Looking at the HTML source, it looks like the parser cannot properly handle 
relative URL's. It is, of course, quite ugly for a site to do this but the 
parser must not fool itself and come up with URL's that really aren't there. 
Combined with the issue i began the thread with i believe the following two 
problems are present - the parser returns imaginary (false) URL's because of:

1. relative href's;

2. URL's in anchors (that is the XML element's body) next to the rhef attribute.

 

Please help in finding the source of the problem (Tika? Nutch?) and how to 
proceed in having it fixed so other users won't waste bandwidth, disk space and 
CPU cycles =)

 

 

 

Oh, here's a snippet of the fetch job that's currently running, also, notice 
the news item with the 119039 ID, it's the same as above although that 
copy/paste was 15 minutes ago. Most item ID's you see below continue to return 
in the current log output.

 

fetching 
http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
fetching 
http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
fetching 
http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
fetching 
http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
-activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
fetching 
http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
fetching 
http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
fetching 
http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
fetching 
http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
fetching 
http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
fetching 
http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
fetching 
http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
fetching 
http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
fetching 
http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
fetching 
http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
fetching 
http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
fetching 
http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
-activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
fetching 
http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
fetching 
http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
fetching 
http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/

 
-----Original message-----
From: Markus Jelsma <[email protected]>
Sent: Wed 22-09-2010 20:47
To: [email protected]; 
Subject: RE: Re: Funky duplicate url's

Thanks! I've already implemented a similar (but not as generic) regex to catch 
these url's. But it is, of course, not a proper solution to solve a parsing 
problem with subsequent regex's. Please, correct me if i'm wrong, but i'm quite 
sure those url's are not to be found in the HTML sources. I'd better to be 
fixed where the problem seems to be.

 

I'll test your regex but i'd still like to know where the exact problem lies 
and hopefully fix or help fixing it.

 

Thanks
 
-----Original message-----
From: AJ Chen <[email protected]>
Sent: Wed 22-09-2010 20:29
To: [email protected]; 
Subject: Re: Funky duplicate url's

the conf/regex-urlfilter.txt file has an exclusion rule that should skip
these viral urls.

# skip URLs with slash-delimited segment that repeats 3+ times, to break
loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

-aj

On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <[email protected]>wrote:

> Well, using a regex to catch these troublemakers isn't going to be useful.
> Although i caught the first faulty url's, there can be many more and it's
> unpredictable; here's just a random pick from the list of errors:
>
>
>
>
>
>
> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
>
>
>
>
>
> Here's another very disturbing url it's trying to fetch:
>
>
>
>
>
> http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
>
>
>
>
>
> I'm seems these bad url's are somehow found by the parser and get fetched
> the next time, and the next time making the url grow longer and longer for
> each fetch and parse and updateDB cycle.
>
>
>
>
>
>
> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus
>
>
>
>
>
> This doesn't look good at all. Anyone got a suggestion or some pointer?
>
>
>
>
>
>
> -----Original message-----
> From: Markus Jelsma <[email protected]>
> Sent: Wed 22-09-2010 12:12
> To: [email protected];
> Subject: Funky duplicate url's
>
> Hi,
>
>
>
> This is not about deduplication, but about preventing certain url's to end
> up in the CrawlDB. I'm crawling a news site for testing purposes, it has the
> usual categories etc. News item pages feature a gray text block that's got
> some url's as well. See
> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example.
>
>
>
> The problem is, the parser somehow manages to concatenate the href with the
> inner anchor text (which happens to be an url as you can see). So,
> subsequent fetches are completely messed up, i'm almost only fetching
> duplicates:
>
>
>
> fetching
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> fetching
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> fetching
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
>
>
>
> This is not desired behavior, as you'd expect. The question is, where to
> fix and how to fix it? Is it a problem with the parser? Or is it fixable
> using some freaky url filter for this one?
>
>
>
>
>
> Cheers,
>
>
>
>
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: Funky duplicate url's, getting much worse!

Reply via email to