Hello Mathijs,

I inspected the code base and found that the problem is most likely in the parse-tika code, where the text is extracted and the OutlinkExtractor is called. The OutlinkExtractor uses a regular expression that can output a lot of garbage. I've added a test to TestOutlinkExtractor which makes clear that at least one URL does not pass, but it does not point me in the right direction for solving the relative path problem.

Unless someone knows, I'll try to find out how the OutlinkExtractor works with the current base URL, because a plain relative URL in the test will obviously fail.

Thanks for the pointer =)

Cheers,

-----Original message-----
From: Mathijs Homminga <[email protected]>
Sent: Tue 28-09-2010 21:01
To: [email protected];
Subject: Re: Funky duplicate url's, getting much worse!

Hi Markus,

I remember Nutch had some trouble honoring the page's BASE tag when resolving relative outlinks. However, I don't see this BASE tag being used in the HTML pages you provide, so that might not be it.

Mathijs

On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:

> Anyone? Where is a proper solution for this issue? As expected, the regex
> won't catch all imaginable kinds of funky URL's that somehow ended up in the
> CrawlDB. Before the weekend, i added another news site to the tests i conduct
> and let it run continuously. Unfortunately, the generator now comes up with
> all kinds of completely useless URL's, although they do exist but that's just
> the web application ignoring most parts of the URL's.
>
> This is the URL that should be considered the proper URL:
>
> http://www.blikopnieuws.nl/nieuwsblok
>
> Here are two URL's that are completely useless:
>
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
>
> It is very hard to use deduplication on these simply because the content
> actually changes too much as time progresses - the latest news block for
> example. It is therefore a necessity to keep these URL's from ending up in
> the CrawlDB, so as not to waste disk space, CrawlDB update time and a huge
> load of bandwidth - in my current fetch i'm probably going to waste at least
> a few GB's.
>
> Looking at the HTML source, it looks like the parser cannot properly handle
> relative URL's. It is, of course, quite ugly for a site to do this but the
> parser must not fool itself and come up with URL's that really aren't there.
> Combined with the issue i began the thread with, i believe the following two
> problems are present - the parser returns imaginary (false) URL's because of:
>
> 1. relative href's;
> 2. URL's in anchors (that is the XML element's body) next to the href
>    attribute.
>
> Please help in finding the source of the problem (Tika? Nutch?) and how to
> proceed in having it fixed so other users won't waste bandwidth, disk space
> and CPU cycles =)
>
> Oh, here's a snippet of the fetch job that's currently running. Also, notice
> the news item with the 119039 ID, it's the same as above although that
> copy/paste was 15 minutes ago. Most item ID's you see below continue to
> return in the current log output.
>
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> fetching
> http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
>
> -----Original message-----
> From: Markus Jelsma <[email protected]>
> Sent: Wed 22-09-2010 20:47
> To: [email protected];
> Subject: RE: Re: Funky duplicate url's
>
> Thanks! I've already implemented a similar (but not as generic) regex to
> catch these url's. But it is, of course, not a proper solution to solve a
> parsing problem with subsequent regex's. Please, correct me if i'm wrong, but
> i'm quite sure those url's are not to be found in the HTML sources. It had
> better be fixed where the problem seems to be.
>
> I'll test your regex but i'd still like to know where the exact problem lies
> and hopefully fix or help fixing it.
>
> Thanks
>
> -----Original message-----
> From: AJ Chen <[email protected]>
> Sent: Wed 22-09-2010 20:29
> To: [email protected];
> Subject: Re: Funky duplicate url's
>
> the conf/regex-urlfilter.txt file has an exclusion rule that should skip
> these viral urls.
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> -aj
>
> On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma
> <[email protected]> wrote:
>
>> Well, using a regex to catch these troublemakers isn't going to be useful.
>> Although i caught the first faulty url's, there can be many more and it's >> unpredictable; here's just a random pick from the list of errors: >> >> >> >> >> >> >> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/ >> >> >> >> >> >> Here's another very disturbing url it's trying to fetch: >> >> >> >> >> >> http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/el
pida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02
/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ >> >> >> >> >> >> I'm seems these bad url's are somehow found by the parser and get fetched >> the next time, and the next time making the url grow longer and longer for >> each fetch and parse and updateDB cycle. >> >> >> >> >> >> >> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus >> >> >> >> >> >> This doesn't look good at all. Anyone got a suggestion or some pointer? >> >> >> >> >> >> >> -----Original message----- >> From: Markus Jelsma <[email protected]> >> Sent: Wed 22-09-2010 12:12 >> To: [email protected]; >> Subject: Funky duplicate url's >> >> Hi, >> >> >> >> This is not about deduplication, but about preventing certain url's to end >> up in the CrawlDB. I'm crawling a news site for testing purposes, it has the >> usual categories etc. News item pages feature a gray text block that's got >> some url's as well. 
See >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example. >> >> >> >> The problem is, the parser somehow manages to concatenate the href with the >> inner anchor text (which happens to be an url as you can see). So, >> subsequent fetches are completely messed up, i'm almost only fetching >> duplicates: >> >> >> >> fetching >> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece >> fetching >> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece >> fetching >> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece >> >> >> >> This is not desired behavior, as you'd expect. 
The question is, where to fix and how to fix it? Is it a problem with the
>> parser? Or is it fixable using some freaky url filter for this one?
>>
>> Cheers,
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
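
For anyone who wants to try the exclusion rule quoted above against URLs from this thread, here is a minimal sketch in plain Java. It assumes the filter strips the leading '-' and applies the rest of the rule with find(), which is an assumption about how regex-urlfilter rules are evaluated and is not confirmed anywhere in this thread. The class and method names are made up for illustration, and the microsoft.com URL is a shortened form of the one quoted above.

```java
import java.util.regex.Pattern;

public class RepeatSegmentFilterDemo {
    // Body of the rule from conf/regex-urlfilter.txt as quoted in the thread.
    // The leading '-' of the rule marks it as an exclusion and is not part of the regex.
    static final Pattern REPEATING =
            Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

    // Hypothetical helper: true when the rule would exclude the URL.
    // Assumes the rule is applied with find(), not with a full match.
    static boolean isRejected(String url) {
        return REPEATING.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            // the URL named in the thread as the proper one
            "http://www.blikopnieuws.nl/nieuwsblok",
            // shortened form of the microsoft.com example from the thread
            "http://www.nrc.nl/article1513468.ece/www.microsoft.com/office/"
                + "www.microsoft.com/office/www.microsoft.com/office/antivirus",
            // one of the "completely useless" URLs from the thread
            "http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/"
                + "bericht/119033/bericht/119047/economie"
        };
        for (String url : urls) {
            System.out.println((isRejected(url) ? "reject " : "accept ") + url);
        }
    }
}
```

Under these assumptions the rule rejects the shortened microsoft.com URL, where the same segment returns at every second position, but accepts the blikopnieuws.nl URL, because its repeated segments do not alternate in that way. That is consistent with the remark above that the regex won't catch all kinds of funky URL's.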


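The relative path question raised at the top of this thread can be illustrated with plain java.net.URI, independent of Nutch or Tika. This is only a sketch of standard relative-reference resolution: the href values are hypothetical and not taken from the actual pages, the helper name is made up, and whether the OutlinkExtractor or the Tika parser resolves outlinks this way is not established here.

```java
import java.net.URI;

public class RelativeOutlinkDemo {
    // Hypothetical helper: resolve an href against the URL of the page it was found on.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/";

        // A relative href without a leading slash is appended to the current path,
        // so every fetched page yields a longer URL if the site answers any path.
        System.out.println(resolve(page, "archief/"));
        // prints http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/

        // A root-relative href replaces the whole path.
        System.out.println(resolve(page, "/nieuwsblok"));
        // prints http://www.blikopnieuws.nl/nieuwsblok

        // Text that looks like a host name but has no scheme is just another relative path.
        System.out.println(resolve(
                "http://www.trouw.nl/opinie/columnisten/article2018983.ece",
                "www.trouw.nl/nieuws/economie"));
        // prints http://www.trouw.nl/opinie/columnisten/www.trouw.nl/nieuws/economie
    }
}
```

If the extractor treats scheme-less anchor text such as www.trouw.nl/nieuws/economie as a link, resolving it against the page URL would give URLs of the kind seen in the fetch log. That is one possible explanation, not a confirmed diagnosis.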