The following regex

-.*(/[^/]+)/[^/]+\1/[^/]+\1/

prevents URLs such as

http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/

from ending up in the CrawlDB. The problem with the blikopnieuws URLs is that they don't contain exactly repeating parts. They do contain patterns like http://HOST/path/item/ID_1/item/ID_2, but that's quite a common scheme on the internet.
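A quick way to see the difference is to run the rule against both kinds of URL. The sketch below uses Python's re module as a stand-in for the Java regex engine that Nutch uses (an assumption; as far as I know the two agree on this particular pattern), and the helper name is_filtered is made up for illustration. The microsoft.com URL is a shortened form of one reported earlier in this thread.

```python
import re

# Exclusion rule from conf/regex-urlfilter.txt, without the leading "-".
# Assumption: Python's re handles this pattern the way the Java engine does.
LOOP_RULE = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_filtered(url):
    """Hypothetical helper: True if the rule matches anywhere in the URL."""
    return LOOP_RULE.search(url) is not None


# Shortened form of a looping URL from this thread: the same segment
# comes back at every second position (/a/x/a/y/a/).
looping = (
    "http://www.nrc.nl/dossiers/computerbeveiliging/virussen/"
    "melissa_maart_1999/article1513468.ece/"
    "www.microsoft.com/office/www.microsoft.com/office/"
    "www.microsoft.com/office/antivirus"
)

# blikopnieuws URL: segments recur, but never at every second position.
blik = (
    "http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/"
    "archief/bericht/119033/bericht/119047/economie"
)

print(is_filtered(looping))  # True
print(is_filtered(blik))     # False
```

Because each [^/]+ spans exactly one path segment, the rule only fires on the shape /a/x/a/y/a/. Cycles longer than two segments, or the blikopnieuws mix of bericht/ID pairs, can slip through.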
Adding a regex that filters these occurrences would silently discard many other valid URLs.

http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie

Thanks for your comments, it looks like I'm stuck with this, at least for now =)

On Wednesday 29 September 2010 14:58:10 Julien Nioche wrote:
> What I did for similarpages.com was to write a custom URL filter that
> detected repetition of path elements and discarded a URL if it had a path
> element occurring more than N times. I don't know what regex AJ suggested but
> the approach above was generic and also quite fast.
>
> We also had other things like filtering out ridiculously long URLs (not
> only do they tend to be rubbish but they cause the normalisation to take a
> lot of CPU) or dynamically generated host names, by splitting on, say, dashes
> and removing the URL if the hostname had more than N tokens.
>
> These are all small tricks but they help control the content of the
> crawldb and avoid wasting time trying to fetch rubbish or scanning an
> unnecessarily large number of entries during generation or update.
>
> Detecting adult pages is also quite important for large-scale crawls as
> these tend to quickly take over the whole crawldb and they generally yield
> an awful lot of outlinks.
>
> HTH
>
> Julien
>
> > Thanks!
> >
> > We're back with the base URL issue. The stuff I `found` in the
> > TestOutlinkExtractor was my own doing. No patch here. Using the ParserChecker
> > it was clear that the problem came up because the http:// URL scheme was not
> > present in some hrefs. The problem is also present when using an
> > ordinary browser and it can be solved by using the regex AJ supplied.
> >
> > The problem with the blikopnieuws site (relative URLs without base URL)
> > remains, though. Check this link: http://www.blikopnieuws.nl/nieuwsblok
> > On the right side you'll see a latest news block with (in the browser) proper URLs.
> > Check the source and you'll see relative URLs. It, of course,
> > also stops working in the browser when you have a trailing slash.
> >
> > Now use the parser checker:
> > bin/nutch org.apache.nutch.parse.ParserChecker http://www.blikopnieuws.nl/nieuwsblok
> >
> > And you'll see that Nutch uses http://www.blikopnieuws.nl/nieuwsblok/ as base
> > URL for relative URLs, just as the browser does. Everything works as expected
> > because of the relative URLs.
> >
> > The problem is, the website itself is not consistent. It mostly features the
> > URL in the footer without trailing slash but from some unknown page I got the
> > same URL with the trailing slash. From there on, everything starts to go wrong.
> >
> > To conclude, I got fooled! But how can we prevent this from happening in the
> > future? I could use URL filtering but that would mean the index
> > already contains garbage because I cannot filter what I don't know.
> >
> > Cheers,
> >
> > On Wednesday 29 September 2010 11:25:55 Julien Nioche wrote:
> > > Don't know how to run a single test but if you do ant test you should
> > > be able to find the logs for each individual class in ./build/test with
> > > a separate log for TEST-org.apache.nutch.parse.TestOutlinkExtractor.txt;
> > > that will be easier than going through a single huge file.
> > >
> > > J.
> > >
> > > On 29 September 2010 10:11, Markus Jelsma <[email protected]> wrote:
> > > Yes but I need a little more testing. Does anyone know how I can test only that
> > > class? I currently use ant -v test -l logfile and need to dig through
> > > the log file; also, it takes too long because of the other tests.
> > >
> > > On Wednesday 29 September 2010 09:43:04 Julien Nioche wrote:
> > > > Hi guys,
> > > >
> > > > IIRC the OutlinkExtractor is the same in parse-tika and parse-html.
> > > > Could you please open a JIRA and attach a patch for the TestOutlinkExtractor
> > > > so that we can reproduce the problem?
> > > >
> > > > Thanks
> > > >
> > > > Julien
> > > >
> > > > > Hello Mathijs,
> > > > >
> > > > > I inspected the code base and found that the problem is most likely in
> > > > > the parse-tika code where the text is being extracted and the
> > > > > OutlinkExtractor is called. The OutlinkExtractor uses a regular
> > > > > expression that can output a lot of garbage. I've added a test to
> > > > > the TestOutlinkExtractor where it's clear that at least one URL
> > > > > does not pass but it does not point me in the right direction for
> > > > > solving the relative path problem.
> > > > >
> > > > > Unless someone knows, I'll try to find out how the OutlinkExtractor
> > > > > works with the current base URL because just a plain relative URL
> > > > > in the test will obviously fail.
> > > > >
> > > > > Thanks for the pointer =)
> > > > >
> > > > > Cheers,
> > > > >
> > > > > -----Original message-----
> > > > > From: Mathijs Homminga <[email protected]>
> > > > > Sent: Tue 28-09-2010 21:01
> > > > > To: [email protected];
> > > > > Subject: Re: Funky duplicate url's, getting much worse!
> > > > >
> > > > > Hi Marcus,
> > > > >
> > > > > I remember Nutch had some trouble with honoring the page's BASE
> > > > > tag when resolving relative outlinks.
> > > > > However, I don't see this BASE tag being used in the HTML pages you
> > > > > provide so that might not be it.
> > > > >
> > > > > Mathijs
> > > > >
> > > > > On Sep 28, 2010, at 18:51, Markus Jelsma wrote:
> > > > > > Anyone? Where is a proper solution for this issue? As expected, the regex
> > > > > > won't catch all imaginable kinds of funky URLs that somehow ended
> > > > > > up in the CrawlDB.
> > > > > > Before the weekend, I added another news site to
> > > > > > the tests I conduct and let it run continuously. Unfortunately, the
> > > > > > generator now comes up with all kinds of completely useless URLs.
> > > > > > They do exist, but that's just the web application ignoring
> > > > > > most parts of the URLs.
> > > > > >
> > > > > > This is the URL that should be considered the proper URL:
> > > > > >
> > > > > > http://www.blikopnieuws.nl/nieuwsblok
> > > > > >
> > > > > > Here are two URLs that are completely useless:
> > > > > >
> > > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> > > > > >
> > > > > > It is very hard to use deduplication on these simply because the content
> > > > > > actually changes too much as time progresses - the latest news block,
> > > > > > for example. It is therefore necessary to keep these URLs from
> > > > > > ending up in the CrawlDB, so as not to waste disk space, update time
> > > > > > of the CrawlDB and a huge load of bandwidth - in my current fetch I'm
> > > > > > probably going to waste at least a few GBs.
> > > > > >
> > > > > > Looking at the HTML source, it looks like the parser cannot properly
> > > > > > handle relative URLs. It is, of course, quite ugly for a site to
> > > > > > do this but the parser must not fool itself and come up with URLs
> > > > > > that really aren't there. Combined with the issue I began the
> > > > > > thread with, I believe the following two problems are present - the
> > > > > > parser returns imaginary (false) URLs because of:
> > > > > >
> > > > > > 1. relative hrefs;
> > > > > > 2. URLs in anchors (that is, the XML element's body) next to the href
> > > > > >    attribute.
> > > > > >
> > > > > > Please help in finding the source of the problem (Tika? Nutch?) and how
> > > > > > to proceed in having it fixed so other users won't waste bandwidth,
> > > > > > disk space and CPU cycles =)
> > > > > >
> > > > > > Oh, here's a snippet of the fetch job that's currently running. Also
> > > > > > notice the news item with the 119039 ID: it's the same as above,
> > > > > > although that copy/paste was 15 minutes ago. Most item IDs you see
> > > > > > below keep coming back in the current log output.
> > > > > >
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> > > > > >
> > > > > > -----Original message-----
> > > > > > From: Markus Jelsma <[email protected]>
> > > > > > Sent: Wed 22-09-2010 20:47
> > > > > > To: [email protected];
> > > > > > Subject: RE: Re: Funky duplicate url's
> > > > > >
> > > > > > Thanks! I've already implemented a similar (but not as generic) regex to
> > > > > > catch these URLs. But it is, of course, not a proper solution to solve
> > > > > > a parsing problem with subsequent regexes. Please correct me if
> > > > > > I'm wrong, but I'm quite sure those URLs are not to be found in
> > > > > > the HTML sources. It would be better to fix this where the problem seems to be.
> > > > > >
> > > > > > I'll test your regex but I'd still like to know where the exact problem
> > > > > > lies and hopefully fix or help fix it.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > -----Original message-----
> > > > > > From: AJ Chen <[email protected]>
> > > > > > Sent: Wed 22-09-2010 20:29
> > > > > > To: [email protected];
> > > > > > Subject: Re: Funky duplicate url's
> > > > > >
> > > > > > The conf/regex-urlfilter.txt file has an exclusion rule that
> > > > > > should skip these viral URLs.
> > > > > >
> > > > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > > > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > > > > >
> > > > > > -aj
> > > > > >
> > > > > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <[email protected]> wrote:
> > > > > >> Well, using a regex to catch these troublemakers isn't going to be useful.
> > > > > >> Although I caught the first faulty URLs, there can be many more and it's
> > > > > >> unpredictable; here's just a random pick from the list of errors:
> > > > > >>
> > > > > >> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> > > > > >>
> > > > > >> Here's another very disturbing URL it's trying to fetch:
> > > > > >>
> > > > > >> http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
> > > > > >>
> > > > > >> It seems these bad URLs are somehow found by the parser and get fetched
> > > > > >> the next time, and the next time, making the URL grow longer and longer
> > > > > >> with each fetch, parse and updateDB cycle.
> > > > > >>
> > > > > >> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus
> > > > > >>
> > > > > >> This doesn't look good at all. Anyone got a suggestion or a pointer?
> > > > > >>
> > > > > >> -----Original message-----
> > > > > >> From: Markus Jelsma <[email protected]>
> > > > > >> Sent: Wed 22-09-2010 12:12
> > > > > >> To: [email protected];
> > > > > >> Subject: Funky duplicate url's
> > > > > >>
> > > > > >> Hi,
> > > > > >>
> > > > > >> This is not about deduplication, but about preventing certain URLs from
> > > > > >> ending up in the CrawlDB. I'm crawling a news site for testing purposes; it
> > > > > >> has the usual categories etc. News item pages feature a gray text block that's
> > > > > >> got some URLs as well.
> > > > > >> See http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example.
> > > > > >> The problem is, the parser somehow manages to concatenate the href with the
> > > > > >> inner anchor text (which happens to be a URL, as you can see).
> > > > > >> So, subsequent fetches are completely messed up; I'm almost only
> > > > > >> fetching duplicates:
> > > > > >>
> > > > > >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> > > > > >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> > > > > >> fetching http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> > > > > >>
> > > > > >> This is not desired behavior, as you'd expect. The question is,
> > > > > >> where to fix it and how? Is it a problem with the parser? Or
> > > > > >> is it fixable using some freaky URL filter for this one?
> > > > > >>
> > > > > >> Cheers,
> > > > > >
> > > > > > --
> > > > > > AJ Chen, PhD
> > > > > > Chair, Semantic Web SIG, sdforum.org
> > > > > > http://web2express.org
> > > > > > twitter @web2express
> > > > > > Palo Alto, CA, USA
> > >
> > > Markus Jelsma - Technisch Architect - Buyways BV
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
> >
> > Markus Jelsma - Technisch Architect - Buyways BV
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
>

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
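
For reference, the generic filter Julien describes in this thread (drop a URL when a path element occurs more than N times, when the URL is ridiculously long, or when the host name splits into too many dash-separated tokens) might be sketched roughly as follows. This is plain Python rather than a real Nutch URL filter plugin, and the function name accept_url and all three thresholds are assumptions made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def accept_url(url, max_repeats=2, max_length=512, max_host_tokens=4):
    """Return False for URLs that look like crawler traps.

    All three thresholds are illustrative guesses, not values from the thread.
    """
    # Very long URLs tend to be rubbish and are costly to normalise.
    if len(url) > max_length:
        return False

    parts = urlsplit(url)

    # Generated host names: split on dashes and count the tokens.
    host = parts.hostname or ""
    if len(host.split("-")) > max_host_tokens:
        return False

    # Reject when any single path element occurs more than max_repeats times.
    counts = Counter(seg for seg in parts.path.split("/") if seg)
    if counts and max(counts.values()) > max_repeats:
        return False

    return True


print(accept_url("http://www.blikopnieuws.nl/nieuwsblok"))
print(accept_url("http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/"
                 "archief/archief/bericht/119033/bericht/119047/economie"))
print(accept_url("http://HOST/path/item/ID_1/item/ID_2"))
```

With max_repeats=2 the common item/ID_1/item/ID_2 scheme still passes, while the blikopnieuws URL with three bericht elements is dropped. A legitimate URL that repeats an element three times would be dropped as well, so N needs tuning per crawl.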

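The trailing-slash effect discussed in this thread follows from standard relative-reference resolution and can be reproduced outside Nutch. Below is a small sketch using Python's urllib.parse.urljoin; the relative link bericht/119039/ is a hypothetical href chosen for illustration, not one copied from the site.

```python
from urllib.parse import urljoin

# Hypothetical relative link; the real hrefs on the site may differ.
href = "bericht/119039/"

# Same page, requested without and with a trailing slash.
without_slash = urljoin("http://www.blikopnieuws.nl/nieuwsblok", href)
with_slash = urljoin("http://www.blikopnieuws.nl/nieuwsblok/", href)

print(without_slash)  # http://www.blikopnieuws.nl/bericht/119039/
print(with_slash)     # http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/

# If the resulting page serves the same relative link again,
# every further hop nests one level deeper.
url = with_slash
for _ in range(2):
    url = urljoin(url, href)
print(url)
```

The last URL printed contains bericht/119039/ three times in a row, which is the kind of ever-growing path that shows up in the fetch log when the web application ignores the extra segments and answers anyway.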
