Hi Markus,

I remember Nutch had some trouble honoring the page's BASE tag when resolving relative outlinks. However, I don't see a BASE tag being used in the HTML pages you provided, so that might not be it.
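For what it's worth, the snowballing paths in this thread are exactly what standard relative-reference resolution (RFC 3986) produces when a scheme-less string such as `www.trouw.nl/nieuws/economie` is handed to the resolver as if it were a relative href. A quick sketch with plain `java.net.URI`, just to illustrate the mechanism (this is not Nutch's actual code path):

```java
import java.net.URI;

public class RelativeSnowball {
    public static void main(String[] args) {
        // The page being parsed; example URL taken from this thread.
        URI page = URI.create("http://www.trouw.nl/opinie/columnisten/article2018983.ece");

        // A scheme-less "link" (e.g. anchor text mistaken for an href) is a
        // relative reference, so it is resolved against the page's directory.
        URI bogus = page.resolve("www.trouw.nl/nieuws/economie");
        System.out.println(bogus);
        // -> http://www.trouw.nl/opinie/columnisten/www.trouw.nl/nieuws/economie

        // If that result is fetched and parsed again, the next round of
        // "relative" links grows the path by another run of segments,
        // which would explain URLs getting longer every crawl cycle.
        URI worse = bogus.resolve("www.trouw.nl/opinie/weblogs");
        System.out.println(worse);
    }
}
```

If this is the mechanism, the fix belongs in the link extractor (treating anchor text or scheme-less strings as outlinks), not in URL filters.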
Mathijs

On Sep 28, 2010, at 18:51, Markus Jelsma wrote:

> Anyone? Is there a proper solution for this issue? As expected, the regex
> won't catch all imaginable kinds of funky URLs that somehow ended up in the
> CrawlDB. Before the weekend, I added another news site to the tests I
> conduct and let it run continuously. Unfortunately, the generator now comes
> up with all kinds of completely useless URLs; they do exist, but only
> because the web application ignores most parts of the URL.
>
> This is the URL that should be considered a proper URL:
>
> http://www.blikopnieuws.nl/nieuwsblok
>
> Here are two URLs that are completely useless:
>
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
>
> It is very hard to use deduplication on these, simply because the content
> actually changes too much as time progresses - the latest news block, for
> example. It is therefore a necessity to keep these URLs from ending up in
> the CrawlDB, so as not to waste disk space, CrawlDB update time, and a huge
> load of bandwidth - in my current fetch I'm probably going to waste at
> least a few GBs.
>
> Looking at the HTML source, it looks like the parser cannot properly handle
> relative URLs. It is, of course, quite ugly for a site to do this, but the
> parser must not fool itself and come up with URLs that really aren't there.
> Combined with the issue I began the thread with, I believe the following
> two problems are present - the parser returns imaginary (false) URLs
> because of:
>
> 1. relative hrefs;
> 2. URLs in anchor bodies (that is, the element's text content) next to the
>    href attribute.
>
> Please help in finding the source of the problem (Tika? Nutch?) and how to
> proceed in having it fixed, so other users won't waste bandwidth, disk
> space and CPU cycles =)
>
> Oh, here's a snippet of the fetch job that's currently running. Also,
> notice the news item with ID 119039; it's the same as above, although that
> copy/paste was 15 minutes ago. Most item IDs you see below keep returning
> in the current log output.
>
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
>
> -----Original message-----
> From: Markus Jelsma <markus.jelsma@buyways.nl>
> Sent: Wed 22-09-2010 20:47
> To: user@nutch.apache.org
> Subject: RE: Re: Funky duplicate url's
>
> Thanks! I've already implemented a similar (but not as generic) regex to
> catch these URLs. But it is, of course, not a proper solution to fix a
> parsing problem with subsequent regexes. Please correct me if I'm wrong,
> but I'm quite sure those URLs are not to be found in the HTML sources. It
> had better be fixed where the problem actually lies.
>
> I'll test your regex, but I'd still like to know where the exact problem
> lies, and hopefully fix or help fix it.
> Thanks
>
> -----Original message-----
> From: AJ Chen <canovaj@gmail.com>
> Sent: Wed 22-09-2010 20:29
> To: user@nutch.apache.org
> Subject: Re: Funky duplicate url's
>
> The conf/regex-urlfilter.txt file has an exclusion rule that should skip
> these viral URLs:
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> -aj
>
> On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <markus.jelsma@buyways.nl> wrote:
>
>> Well, using a regex to catch these troublemakers isn't going to be
>> useful. Although I caught the first faulty URLs, there can be many more
>> and it's unpredictable; here's just a random pick from the list of
>> errors:
>>
>> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
>>
>> Here's another very disturbing URL it's trying to fetch:
>>
http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ >> >> >> >> >> >> I'm seems these bad url's are somehow found by the parser and get fetched >> the next time, and the next time making the url grow longer and longer for >> each fetch and parse and updateDB cycle. 
>> >> >> >> >> >> >> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus >> >> >> >> >> >> This doesn't look good at all. Anyone got a suggestion or some pointer? >> >> >> >> >> >> >> -----Original message----- >> From: Markus Jelsma <[email protected]> >> Sent: Wed 22-09-2010 12:12 >> To: [email protected]; >> Subject: Funky duplicate url's >> >> Hi, >> >> >> >> This is not about deduplication, but about preventing certain url's to end >> up in the CrawlDB. I'm crawling a news site for testing purposes, it has the >> usual categories etc. News item pages feature a gray text block that's got >> some url's as well. See >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example. >> >> >> >> The problem is, the parser somehow manages to concatenate the href with the >> inner anchor text (which happens to be an url as you can see). 
So, >> subsequent fetches are completely messed up, i'm almost only fetching >> duplicates: >> >> >> >> fetching >> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece >> fetching >> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece >> fetching >> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece >> >> >> >> This is not desired behavior, as you'd expect. The question is, where to >> fix and how to fix it? Is it a problem with the parser? Or is it fixable >> using some freaky url filter for this one? 
>> Cheers,
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
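The exclusion rule AJ quotes above can be exercised directly. Below is a standalone sketch (not Nutch's actual RegexURLFilter, just the pattern applied with `java.util.regex`); the first URL is a constructed example in the shape of the microsoft.com/office loop from this thread. Note that the rule only fires when the repeating segment is separated by exactly one other segment, which is consistent with Markus's report that many of the blikopnieuws.nl URLs slip through it:

```java
import java.util.regex.Pattern;

public class LoopFilterCheck {
    // The regex part of the rule from conf/regex-urlfilter.txt. The leading
    // '-' in the config file means "exclude"; it is not part of the pattern.
    static final Pattern LOOP =
        Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    static boolean excluded(String url) {
        return LOOP.matcher(url).find();
    }

    public static void main(String[] args) {
        // A segment repeating 3 times with one segment in between: caught.
        System.out.println(excluded(
            "http://example.com/a/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus"));
        // -> true

        // A clean URL: not caught.
        System.out.println(excluded("http://www.blikopnieuws.nl/nieuwsblok"));
        // -> false

        // Repeats separated by more than one segment slip through, which
        // matches the behaviour reported earlier in the thread.
        System.out.println(excluded(
            "http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie"));
        // -> false
    }
}
```

A broader rule (e.g. counting total occurrences of any segment) would catch more of these, at the cost of false positives; either way, a filter only limits the damage while the link extraction bug keeps generating the URLs.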

