Yes, but I need to do a little more testing first. Does anyone know how I can run only that test class? I currently run ant -v test -l logfile and then have to dig through the log file. It also takes too long because all the other tests run as well.
On Wednesday 29 September 2010 09:43:04 Julien Nioche wrote:
> Hi guys,
>
> IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could
> you please open a JIRA and attach a patch for the TestOutlinkExtractor so
> that we can reproduce the problem?
>
> Thanks
>
> Julien
>
> > Hello Mathijs,
> >
> > I inspected the code base and found that the problem is most likely in
> > the parse-tika code where the text is being extracted and the
> > OutlinkExtractor is called. The OutlinkExtractor uses a regular
> > expression that can output a lot of garbage. I've added a test to the
> > TestOutlinkExtractor where it's clear that at least one URL does not pass
> > but it does not point me in the right direction for solving the relative
> > path problem.
> >
> > Unless someone knows, I'll try to find out how the OutlinkExtractor works
> > with the current base URL because just a plain relative URL in the test
> > will obviously fail.
> >
> > Thanks for the pointer =)
> >
> > Cheers,
> >
> > -----Original message-----
> > From: Mathijs Homminga <[email protected]>
> > Sent: Tue 28-09-2010 21:01
> > To: [email protected];
> > Subject: Re: Funky duplicate url's, getting much worse!
> >
> > Hi Markus,
> >
> > I remember Nutch had some troubles with honoring the page's BASE tag when
> > resolving relative outlinks.
> > However, I don't see this BASE tag being used in the HTML pages you
> > provide so that might not be it.
> >
> > Mathijs
> >
> > On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
> > > Anyone? Where is a proper solution for this issue? As expected, the
> > > regex won't catch all imaginable kinds of funky URLs that somehow
> > > ended up in the CrawlDB. Before the weekend, I added another news site
> > > to the tests I conduct and let it run continuously.
> > > Unfortunately, the generator now comes up with all kinds of completely
> > > useless URLs. They do exist, but only because the web application
> > > ignores most parts of the URL.
> > >
> > > This is the URL that should be considered the proper URL:
> > >
> > > http://www.blikopnieuws.nl/nieuwsblok
> > >
> > > Here are two URLs that are completely useless:
> > >
> > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> > >
> > > It is very hard to use deduplication on these simply because the
> > > content actually changes too much as time progresses - the latest news
> > > block for example. It is therefore a necessity to keep these URLs from
> > > ending up in the CrawlDB, so as not to waste disk space, CrawlDB update
> > > time and a huge load of bandwidth - in my current fetch I'm probably
> > > going to waste at least a few GB.
> > >
> > > Looking at the HTML source, it looks like the parser cannot properly
> > > handle relative URLs. It is, of course, quite ugly for a site to do
> > > this, but the parser must not fool itself and come up with URLs that
> > > really aren't there. Combined with the issue I began the thread with, I
> > > believe the following two problems are present - the parser returns
> > > imaginary (false) URLs because of:
> > >
> > > 1. relative hrefs;
> > > 2. URLs in anchors (that is, the XML element's body) next to the href
> > >    attribute.
> > >
> > > Please help in finding the source of the problem (Tika? Nutch?) and how
> > > to proceed in having it fixed so other users won't waste bandwidth,
> > > disk space and CPU cycles =)
> > >
> > > Oh, here's a snippet of the fetch job that's currently running. Also
> > > notice the news item with the 119039 ID: it's the same as above,
> > > although that copy/paste was 15 minutes ago. Most item IDs you see
> > > below keep returning in the current log output.
> > >
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> > >
> > > -----Original message-----
> > > From: Markus Jelsma <[email protected]>
> > > Sent: Wed 22-09-2010 20:47
> > > To: [email protected];
> > > Subject: RE: Re: Funky duplicate url's
> > >
> > > Thanks! I've already implemented a similar (but not as generic) regex
> > > to catch these URLs. But it is, of course, not a proper solution to
> > > solve a parsing problem with subsequent regexes. Please correct me if
> > > I'm wrong, but I'm quite sure those URLs are not to be found in the
> > > HTML sources. It had better be fixed where the problem seems to be.
> > >
> > > I'll test your regex but I'd still like to know where the exact problem
> > > lies and hopefully fix or help fixing it.
> > >
> > > Thanks
> > >
> > > -----Original message-----
> > > From: AJ Chen <[email protected]>
> > > Sent: Wed 22-09-2010 20:29
> > > To: [email protected];
> > > Subject: Re: Funky duplicate url's
> > >
> > > the conf/regex-urlfilter.txt file has an exclusion rule that should
> > > skip these viral urls.
> > >
> > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > >
> > > -aj
> > >
> > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <[email protected]> wrote:
> > >> Well, using a regex to catch these troublemakers isn't going to be
> > >> useful. Although I caught the first faulty URLs, there can be many
> > >> more and it's unpredictable; here's just a random pick from the list
> > >> of errors:
> > >>
> > >> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> > >>
> > >> Here's another very disturbing URL it's trying to fetch:
> > >>
> > >> http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
> > >>
> > >> It seems these bad URLs are somehow found by the parser and get
> > >> fetched the next time, and the next time, making the URL grow longer
> > >> and longer for each fetch, parse and updateDB cycle.
> > >>
> > >> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus
> > >>
> > >> This doesn't look good at all. Anyone got a suggestion or some
> > >> pointer?
> > >>
> > >> -----Original message-----
> > >> From: Markus Jelsma <[email protected]>
> > >> Sent: Wed 22-09-2010 12:12
> > >> To: [email protected];
> > >> Subject: Funky duplicate url's
> > >>
> > >> Hi,
> > >>
> > >> This is not about deduplication, but about preventing certain URLs
> > >> from ending up in the CrawlDB. I'm crawling a news site for testing
> > >> purposes; it has the usual categories etc. News item pages feature a
> > >> gray text block that's got some URLs as well. See
> > >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an
> > >> example. The problem is, the parser somehow manages to concatenate
> > >> the href with the inner anchor text (which happens to be a URL, as
> > >> you can see).
> > >> So, subsequent fetches are completely messed up; I'm almost only
> > >> fetching duplicates:
> > >>
> > >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> > >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> > >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> > >>
> > >> This is not desired behavior, as you'd expect. The question is, where
> > >> to fix and how to fix it? Is it a problem with the parser? Or is it
> > >> fixable using some freaky url filter for this one?
> > >>
> > >> Cheers,
> > >
> > > --
> > > AJ Chen, PhD
> > > Chair, Semantic Web SIG, sdforum.org
> > > http://web2express.org
> > > twitter @web2express
> > > Palo Alto, CA, USA
>

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
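
For anyone following along, here is a minimal sketch of why the exclusion rule quoted in this thread catches some of the looping URLs but not others. It is written in Python purely for illustration (Nutch itself is Java), and it assumes the rule is applied with "find" semantics, so a match anywhere in the URL is enough. The function name is_excluded and the example.com URL are made up for the example; the second URL is one of the blikopnieuws.nl URLs reported above.

```python
import re

# The exclusion rule quoted in the thread from conf/regex-urlfilter.txt,
# without the leading "-" that marks it as an exclusion.
LOOP_RULE = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_excluded(url: str) -> bool:
    """Return True if the rule matches somewhere in the URL.

    Assumption: a match anywhere in the URL is enough to exclude it,
    which re.search mimics here.
    """
    return LOOP_RULE.search(url) is not None


# Hypothetical URL: the segment "a" recurs at every second position
# (/a/x/a/y/a/), which is the shape the rule is written for.
alternating = "http://example.com/a/x/a/y/a/z"

# URL taken from the thread: "bericht" occurs three times, but not at
# every second position, so the rule does not match it.
from_thread = (
    "http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/"
    "bericht/119033/bericht/119047/economie"
)

print(is_excluded(alternating))
print(is_excluded(from_thread))
```

The rule only fires when the same segment comes back at every second position three times in a row, so URLs whose repeats are spaced irregularly slip past it. That is consistent with the observation earlier in the thread that the regex does not catch every kind of funky URL.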

