Hi guys, IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could you please open a JIRA issue and attach a patch for TestOutlinkExtractor so that we can reproduce the problem?

Thanks

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 28 September 2010 21:14, Markus Jelsma <[email protected]> wrote:

> Hello Mathijs,
>
> I inspected the code base and found that the problem is most likely in the
> parse-tika code, where the text is extracted and the OutlinkExtractor is
> called. The OutlinkExtractor uses a regular expression that can output a
> lot of garbage. I've added a test to TestOutlinkExtractor which shows that
> at least one URL does not pass, but it does not point me in the right
> direction for solving the relative path problem.
>
> Unless someone knows, I'll try to find out how the OutlinkExtractor works
> with the current base URL, because a plain relative URL in the test will
> obviously fail.
>
> Thanks for the pointer =)
>
> Cheers,
>
> -----Original message-----
> From: Mathijs Homminga <[email protected]>
> Sent: Tue 28-09-2010 21:01
> To: [email protected];
> Subject: Re: Funky duplicate url's, getting much worse!
>
> Hi Markus,
>
> I remember Nutch had some trouble honoring the page's BASE tag when
> resolving relative outlinks. However, I don't see a BASE tag in the HTML
> pages you provided, so that might not be it.
>
> Mathijs
>
> On Sep 28, 2010, at 18:51, Markus Jelsma wrote:
>
> > Anyone? Where is a proper solution for this issue? As expected, the regex
> > won't catch all imaginable kinds of funky URLs that somehow ended up in
> > the CrawlDB. Before the weekend, I added another news site to the tests I
> > conduct and let it run continuously. Unfortunately, the generator now
> > comes up with all kinds of completely useless URLs. They do exist, but
> > only because the web application ignores most parts of the URL.
> >
> > This is the URL that should be considered a proper URL:
> >
> > http://www.blikopnieuws.nl/nieuwsblok
> >
> > Here are two URLs that are completely useless:
> >
> > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> >
> > It is very hard to use deduplication on these, simply because the content
> > changes too much as time progresses (the latest news block, for example).
> > It is therefore necessary to keep these URLs from ending up in the
> > CrawlDB, so as not to waste disk space, CrawlDB update time and a huge
> > load of bandwidth. My current fetch is probably going to waste at least a
> > few GB.
> >
> > Looking at the HTML source, it looks like the parser cannot properly
> > handle relative URLs. It is, of course, quite ugly for a site to do this,
> > but the parser must not fool itself and come up with URLs that really
> > aren't there. Combined with the issue I began the thread with, I believe
> > the following two problems are present. The parser returns imaginary
> > (false) URLs because of:
> >
> > 1. relative hrefs;
> > 2. URLs in anchors (that is, the XML element's body) next to the href
> >    attribute.
> >
> > Please help in finding the source of the problem (Tika? Nutch?) and how
> > to proceed in having it fixed so other users won't waste bandwidth, disk
> > space and CPU cycles =)
> >
> > Oh, here's a snippet of the fetch job that's currently running. Notice
> > the news item with ID 119039: it's the same as above, although that
> > copy/paste was 15 minutes ago. Most item IDs you see below keep returning
> > in the current log output.
> >
> > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> >
> > -----Original message-----
> > From: Markus Jelsma <[email protected]>
> > Sent: Wed 22-09-2010 20:47
> > To: [email protected];
> > Subject: RE: Re: Funky duplicate url's
> >
> > Thanks! I've already implemented a similar (but not as generic) regex to
> > catch these URLs. But it is, of course, not a proper solution to solve a
> > parsing problem with subsequent regexes. Please correct me if I'm wrong,
> > but I'm quite sure those URLs are not to be found in the HTML sources.
> > It'd better be fixed where the problem seems to be.
> >
> > I'll test your regex, but I'd still like to know where the exact problem
> > lies and hopefully fix it or help fix it.
> >
> > Thanks
> >
> > -----Original message-----
> > From: AJ Chen <[email protected]>
> > Sent: Wed 22-09-2010 20:29
> > To: [email protected];
> > Subject: Re: Funky duplicate url's
> >
> > the conf/regex-urlfilter.txt file has an exclusion rule that should skip
> > these viral urls.
> > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break > > loops > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/ > > > > -aj > > > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <[email protected] > >wrote: > > > >> Well, using a regex to catch these troublemakers isn't going to be > useful. > >> Although i caught the first faulty url's, there can be many more and > it's > >> unpredictable; here's just a random pick from the list of errors: > >> > >> > >> > >> > >> > >> > >> > http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/ > >> > >> > >> > >> > >> > >> Here's another very disturbing url it's trying to fetch: > >> > >> > >> > >> > >> > >> > http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/el
pida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02
/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ > >> > >> > >> > >> > >> > >> I'm seems these bad url's are somehow found by the parser and get > fetched > >> the next time, and the next time making the url grow longer and longer > for > >> each fetch and parse and updateDB cycle. > >> > >> > >> > >> > >> > >> > >> > http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus > >> > >> > >> > >> > >> > >> This doesn't look good at all. Anyone got a suggestion or some pointer? 
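
To see why the quoted exclusion rule misses some of these URLs, here is a minimal sketch that applies the same pattern to URLs of the shapes shown above. It is only an illustration: Python's `re` stands in for the Java regex engine Nutch uses (that both treat this pattern the same way is an assumption), the helper name `is_excluded` is made up, and two of the URLs are shortened variants of the ones in the thread.

```python
import re

# Exclusion rule quoted in the thread (conf/regex-urlfilter.txt), minus the
# leading "-" that marks it as an exclusion. Python's `re` stands in for the
# Java regex engine here; that the two agree on this pattern is an assumption.
REPEATED_SEGMENT = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_excluded(url):
    """Return True if the repeated-segment rule matches somewhere in url."""
    return REPEATED_SEGMENT.search(url) is not None


# Shortened variants of URLs from the thread (fewer repetitions, same shape).
period_two = (
    "http://www.nrc.nl/article1513468.ece"
    "/www.microsoft.com/office/www.microsoft.com/office"
    "/www.microsoft.com/office/antivirus"
)
period_three = (
    "http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece"
    "/www.invest.is/Key-Sectors/Data-Centers-in-Iceland"
    "/www.invest.is/Key-Sectors/Data-Centers-in-Iceland"
    "/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/"
)
# Copied unchanged from the thread.
irregular = (
    "http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief"
    "/bericht/119033/bericht/119047/economie"
)

for name, url in [
    ("period_two", period_two),
    ("period_three", period_three),
    ("irregular", irregular),
]:
    print(name, is_excluded(url))
```

The rule needs the same segment at every second position, three times in a row (/X/a/X/b/X/). A two-segment loop fits that shape, but a three-segment loop or an irregular mix of segments does not, which matches the observation that the regex only catches part of the bad URLs.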
> >>
> >> -----Original message-----
> >> From: Markus Jelsma <[email protected]>
> >> Sent: Wed 22-09-2010 12:12
> >> To: [email protected];
> >> Subject: Funky duplicate url's
> >>
> >> Hi,
> >>
> >> This is not about deduplication, but about preventing certain URLs from
> >> ending up in the CrawlDB. I'm crawling a news site for testing purposes;
> >> it has the usual categories etc. News item pages feature a gray text
> >> block that contains some URLs as well. See
> >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example.
> >>
> >> The problem is that the parser somehow manages to concatenate the href
> >> with the inner anchor text (which happens to be a URL, as you can see).
> >> So subsequent fetches are completely messed up, and I'm fetching almost
> >> nothing but duplicates:
> >>
> >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> >>
> >> This is not desired behavior, as you'd expect. The question is where and
> >> how to fix it. Is it a problem with the parser? Or is it fixable using
> >> some freaky URL filter for this one?
> >>
> >> Cheers,
> >
> > --
> > AJ Chen, PhD
> > Chair, Semantic Web SIG, sdforum.org
> > http://web2express.org
> > twitter @web2express
> > Palo Alto, CA, USA
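
For anyone trying to reproduce the growing URLs quoted above, here is a rough sketch of how relative resolution could produce that shape. It is not taken from the Nutch or Tika code. It uses Python's `urllib.parse.urljoin` as a stand-in resolver, the href value is hypothetical (a link written without "http://"), and the helper `resolve_repeatedly` together with its "treat every page URL as a directory" step are assumptions made only to mimic the logged pattern.

```python
from urllib.parse import urljoin


def resolve_repeatedly(start_url, href, rounds):
    """Resolve the same relative href against each URL produced so far.

    ASSUMPTION: every page URL is treated as a directory (a trailing slash
    is added before resolving). This mimics the pattern seen in the fetch
    logs; it is not taken from the Nutch or Tika code.
    """
    urls = [start_url]
    for _ in range(rounds):
        base = urls[-1] if urls[-1].endswith("/") else urls[-1] + "/"
        urls.append(urljoin(base, href))
    return urls


page = "http://www.trouw.nl/opinie/columnisten/article2018983.ece"
href = "www.trouw.nl/opinie/weblogs"  # hypothetical href without "http://"

# Standard resolution replaces the last path segment of the base URL.
print(urljoin(page, href))

# Under the directory assumption, the URL grows by one copy of href per round.
for url in resolve_repeatedly(page, href, 3):
    print(url)
```

If the server answers any of these paths with the same page, each fetch, parse and updatedb cycle would add one more copy of the href. Whether this is what actually happens in the parser is exactly the open question in this thread.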

