Re: Funky duplicate url's, getting much worse!

Markus Jelsma Wed, 29 Sep 2010 06:46:14 -0700

The following regex 

-.*(/[^/]+)/[^/]+\1/[^/]+\1/


prevents URL's such as

http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/

to end up in the CrawlDB. The problem with the blikopnieuws URL's is that
they don't contain exactly repeating parts. They do have patterns like
http://HOST/path/item/ID_1/item/ID_2 but that's quite a common scheme on the
internet. Adding a regex that filters these occurrences would silently discard
many other valid URL's.

http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
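
To make the difference concrete, here is a minimal Python sketch of how the rule behaves. It assumes the filter applies the pattern as an unanchored search; the looping URL is hypothetical, while the second URL is the one from this thread.

```python
import re

# Pattern copied from the rule above, without the leading "-" that marks
# an exclusion rule in conf/regex-urlfilter.txt.
REPEAT = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_rejected(url: str) -> bool:
    """Return True if the pattern is found anywhere in the URL (assumed search semantics)."""
    return REPEAT.search(url) is not None


# Hypothetical URL: segment "a" comes back at every second position, three times.
looping = "http://www.example.com/a/x/a/y/a/"

# URL from this thread: no segment repeats in that exact /X/Y/X/Z/X/ rhythm.
blik = ("http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/"
        "archief/bericht/119033/bericht/119047/economie")

print(is_rejected(looping))
print(is_rejected(blik))
```

The pattern only fires when one segment returns three times with exactly one other segment in between, which is why the blikopnieuws URL slips through.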

Thanks for your comments, it looks like I'm stuck with this, at least for now
=)


On Wednesday 29 September 2010 14:58:10 Julien Nioche wrote:
> What I did for similarpages.com was to write a custom URL filter that
> detected repetition of path elements and discarded a URL if it had a path
> element occurring more than N times. I don't know what regex AJ suggested
> but the approach above was generic and also quite fast.
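
A rough Python sketch of that idea, purely as an illustration: this is not the similarpages.com code, and the function name and the default threshold are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def has_repeated_segment(url: str, max_repeats: int = 3) -> bool:
    """Return True if any path segment occurs more than max_repeats times.

    max_repeats plays the role of N; the default of 3 is an assumed value.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return False
    return max(Counter(segments).values()) > max_repeats


# URL taken from the fetch log in this thread: "rss" occurs three times.
rss_url = "http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto"

print(has_repeated_segment(rss_url, max_repeats=2))
print(has_repeated_segment(rss_url, max_repeats=3))
```

The choice of N is the hard part: a low value catches the blikopnieuws URL's but risks discarding legitimate URL's that reuse a segment name.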
> 
> We also had other things like filtering out ridiculously long URLs (not
> only do they tend to be rubbish but they cause the normalisation to take a
> lot of CPU) or dynamically generated host names, by splitting on, say,
> dashes and removing the URL if the hostname had more than N tokens.
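
Both checks are cheap to express. The Python sketch below is only an illustration; the two thresholds and the function names are assumptions, not values from this thread.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 512   # assumed cut-off for "ridiculously long"
MAX_HOST_TOKENS = 4    # assumed N for dash-separated hostname tokens


def is_too_long(url: str) -> bool:
    """Return True if the URL exceeds the assumed length limit."""
    return len(url) > MAX_URL_LENGTH


def has_generated_hostname(url: str) -> bool:
    """Return True if the hostname splits into more than N tokens on dashes."""
    host = urlsplit(url).hostname or ""
    return len(host.split("-")) > MAX_HOST_TOKENS


# Hypothetical examples
print(is_too_long("http://www.example.com/" + "a/" * 300))
print(has_generated_hostname("http://cheap-blue-widgets-buy-now-online.example.com/"))
print(has_generated_hostname("http://www.blikopnieuws.nl/nieuwsblok"))
```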
> 
> These are all small tricks but they help to control the content of the
> crawldb and avoid wasting time trying to fetch rubbish or scanning an
> unnecessarily large number of entries during the generation or update.
> 
> Detecting adult pages is also quite important for large scale crawls as
> these tend to quickly take over the whole crawldb and they generally yield
> an awful lot of outlinks.
> 
> HTH
> 
> Julien
> 
> > Thanks!
> >
> > We're back with the base URL issue. The stuff I `found` in the
> > TestOutlinkExtractor was my own doing. No patch here. Using the
> > ParserChecker it was clear that the problem came up because the http://
> > URL scheme was not present in some href's. The problem is also present
> > when using an ordinary browser and it can be solved by using the regex AJ
> > supplied.
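
A small Python sketch of why a missing http:// makes URL's grow. It uses standard relative-reference resolution via urllib.parse.urljoin; the host names and the href are hypothetical, and it assumes every generated page contains the same scheme-less href again.

```python
from urllib.parse import urljoin


def grow(page_url: str, href: str, cycles: int) -> list:
    """Resolve the same href against each newly produced URL, once per cycle."""
    produced = []
    for _ in range(cycles):
        page_url = urljoin(page_url, href)
        produced.append(page_url)
    return produced


# Hypothetical values: an href that was meant to be absolute but lacks "http://",
# so it is treated as a relative path.
href = "www.other.example/office/"
start = "http://news.example/a/article.ece"

for url in grow(start, href, 3):
    print(url)
```

Each fetch, parse and update cycle appends one more copy of the href, which is the growth pattern seen in the long URL's quoted in this thread.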
> >
> > The problem with the blikopnieuws site (relative URL's without base URL)
> > remains, though. Check this link: http://www.blikopnieuws.nl/nieuwsblok
> > On the right side you'll see a latest news block with (in the browser)
> > proper URL's. Check the source and you'll see relative URL's. It, of
> > course, also stops working in the browser when you have a trailing slash.
> >
> > Now use the parser checker:
> > bin/nutch org.apache.nutch.parse.ParserChecker
> > http://www.blikopnieuws.nl/nieuwsblok
> >
> > And you'll see that Nutch uses http://www.blikopnieuws.nl/nieuwsblok/ as
> > base URL for relative URL's, just as the browser does. Everything works as
> > expected because of the relative URL's.
> >
> > The problem is, the website itself is not consistent. It mostly features
> > the URL in the footer without trailing slash but from some unknown page I
> > got the same URL with the trailing slash. From there on, everything starts
> > to go wrong.
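
The effect of the trailing slash on relative resolution can be shown with Python's urllib.parse.urljoin, which follows the generic URL resolution rules. The relative href below is hypothetical; the real hrefs on the site may differ.

```python
from urllib.parse import urljoin

# Hypothetical relative href, modelled on the item paths seen in this thread.
href = "bericht/119039"

without_slash = urljoin("http://www.blikopnieuws.nl/nieuwsblok", href)
with_slash = urljoin("http://www.blikopnieuws.nl/nieuwsblok/", href)

# Without the trailing slash the last segment of the base is replaced;
# with it, the href is appended below "nieuwsblok/".
print(without_slash)
print(with_slash)
```

Once a page is reached through a URL ending in a slash, every relative link on it resolves one level deeper, and the same happens again on each page found that way.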
> >
> > To conclude, I got fooled! But how can we prevent this from happening in
> > the future? I could use URL filtering but that would mean the index
> > already contains garbage because I cannot filter what I don't know.
> >
> > Cheers,
> >
> > On Wednesday 29 September 2010 11:25:55 Julien Nioche wrote:
> > > Don't know how to run a single test but if you do ant test you should
> > > be able to find the logs for each individual class in ./build/test with
> > > a separate log for TEST-org.apache.nutch.parse.TestOutlinkExtractor.txt;
> > > that will be easier than going through a single huge file.
> > >
> > > J.
> > >
> > >
> > > On 29 September 2010 10:11, Markus Jelsma <[email protected]> wrote:
> > > > Yes, but I need a little more testing. Does anyone know how I can test
> > > > only that class? I currently use ant -v test -l logfile and need to dig
> > > > through the log file; also, it takes too long because of other tests.
> > >
> > > On Wednesday 29 September 2010 09:43:04 Julien Nioche wrote:
> > > > Hi guys,
> > > >
> > > > > IIRC the OutlinkExtractor is the same in parse-tika and parse-html.
> > > > > Could you please open a JIRA and attach a patch for the
> > > > > TestOutlinkExtractor so that we can reproduce the problem?
> > > >
> > > > Thanks
> > > >
> > > > Julien
> > > >
> > > > > Hello Mathijs,
> > > > >
> > > > >
> > > > >
> > > > > > I inspected the code base and found that the problem is most likely
> > > > > > in the parse-tika code where the text is being extracted and the
> > > > > > OutlinkExtractor is called. The OutlinkExtractor uses a regular
> > > > > > expression that can output a lot of garbage. I've added a test to
> > > > > > the TestOutlinkExtractor where it's clear that at least one URL
> > > > > > does not pass but it does not point me in the right direction for
> > > > > > solving the relative path problem.
> > > > >
> > > > >
> > > > >
> > > > > Unless someone knows, i'll try to find out how the OutlinkExtractor
> > > > > works with the current base URL because just a plain relative URL
> > > > > in the test will obviously fail.
> > > > >
> > > > >
> > > > >
> > > > > Thanks for the pointer =)
> > > > >
> > > > >
> > > > >
> > > > > Cheers,
> > > > >
> > > > > -----Original message-----
> > > > > From: Mathijs Homminga <[email protected]>
> > > > > Sent: Tue 28-09-2010 21:01
> > > > > To: [email protected];
> > > > > Subject: Re: Funky duplicate url's, getting much worse!
> > > > >
> > > > > Hi Marcus,
> > > > >
> > > > > I remember Nutch had some troubles with honoring the page's BASE
> > > > > tag when resolving relative outlinks.
> > > > > > However, I don't see this BASE tag being used in the HTML pages you
> > > > > > provide so that might not be it.
> > > > >
> > > > > Mathijs
> > > > >
> > > > > On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
> > > > > > > Anyone? Where is a proper solution for this issue? As expected,
> > > > > > > the regex won't catch all imaginable kinds of funky URL's that
> > > > > > > somehow ended up in the CrawlDB. Before the weekend, I added
> > > > > > > another news site to the tests I conduct and let it run
> > > > > > > continuously. Unfortunately, the generator now comes up with all
> > > > > > > kinds of completely useless URL's, although they do exist but
> > > > > > > that's just the web application ignoring most parts of the URL's.
> > > > > > >
> > > > > > > This is the URL that should be considered as proper URL:
> > > > > > >
> > > > > > > http://www.blikopnieuws.nl/nieuwsblok
> > > > > > >
> > > > > > > Here are two URL's that are completely useless:
> > > > > > >
> > > > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > > > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> > > > >
> > > > > > > It is very hard to use deduplication on these simply because the
> > > > > > > content actually changes too much as time progresses - the latest
> > > > > > > news block for example. It is therefore a necessity to keep these
> > > > > > > URL's from ending up in the CrawlDB, so as not to waste disk
> > > > > > > space, update time of the CrawlDB and a huge load of bandwidth -
> > > > > > > in my current fetch I'm probably going to waste at least a few
> > > > > > > GB's.
> > > > >
> > > > > > > Looking at the HTML source, it looks like the parser cannot
> > > > > > > properly handle relative URL's. It is, of course, quite ugly for
> > > > > > > a site to do this but the parser must not fool itself and come up
> > > > > > > with URL's that really aren't there. Combined with the issue I
> > > > > > > began the thread with, I believe the following two problems are
> > > > > > > present - the parser returns imaginary (false) URL's because of:
> > > > > > >
> > > > > > > 1. relative href's;
> > > > > > >
> > > > > > > 2. URL's in anchors (that is the XML element's body) next to the
> > > > > > >    href attribute.
> > > > > > >
> > > > > > > Please help in finding the source of the problem (Tika? Nutch?)
> > > > > > > and how to proceed in having it fixed so other users won't waste
> > > > > > > bandwidth, disk space and CPU cycles =)
> > > > >
> > > > > > > Oh, here's a snippet of the fetch job that's currently running;
> > > > > > > also, notice the news item with the 119039 ID, it's the same as
> > > > > > > above although that copy/paste was 15 minutes ago. Most item ID's
> > > > > > > you see below continue to return in the current log output.
> > > > >
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > > > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > > > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> > > > >
> > > > > > -----Original message-----
> > > > > > From: Markus Jelsma <[email protected]>
> > > > > > Sent: Wed 22-09-2010 20:47
> > > > > > To: [email protected];
> > > > > > Subject: RE: Re: Funky duplicate url's
> > > > > >
> > > > > > > Thanks! I've already implemented a similar (but not as generic)
> > > > > > > regex to catch these url's. But it is, of course, not a proper
> > > > > > > solution to solve a parsing problem with subsequent regex's.
> > > > > > > Please, correct me if I'm wrong, but I'm quite sure those url's
> > > > > > > are not to be found in the HTML sources. It had better be fixed
> > > > > > > where the problem seems to be.
> > > > > > >
> > > > > > > I'll test your regex but I'd still like to know where the exact
> > > > > > > problem lies and hopefully fix or help fixing it.
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > > -----Original message-----
> > > > > > From: AJ Chen <[email protected]>
> > > > > > Sent: Wed 22-09-2010 20:29
> > > > > > To: [email protected];
> > > > > > Subject: Re: Funky duplicate url's
> > > > > >
> > > > > > the conf/regex-urlfilter.txt file has an exclusion rule that
> > > > > > should skip these viral urls.
> > > > > >
> > > > > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > > > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > > > > >
> > > > > > -aj
> > > > > >
> > > > > > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <[email protected]> wrote:
> > > > > > >> Well, using a regex to catch these troublemakers isn't going to
> > > > > > >> be useful. Although I caught the first faulty url's, there can be
> > > > > > >> many more and it's unpredictable; here's just a random pick from
> > > > > > >> the list of errors:
> > > > > > >>
> > > > > > >> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> > > > > > >>
> > > > > >> Here's another very disturbing url it's trying to fetch:
> >
> > http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/200
> >
> > > > >5/
> > > > > 02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida
> >
> > > > >_li
> > > > > censes_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovony
> >
> > > > >x/h
> > > > > ttp/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.ther
> >
> > > > >egi
> >
> > ster.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/200
> >
> > > > >5/0
> > > > > 2/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_
> >
> > > > >lic
> > > > > enses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx
> >
> > > > >/ht
> > > > > tp/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.there
> >
> > > > >gis
> >
> > ter.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005
> >
> > > > >/02
> > > > > /04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_l
> >
> > > > >ice
> > > > > nses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
> >
> > > > >htt
> > > > > p/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.thereg
> >
> > > > >ist
> >
> > er.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/
> >
> > > > >02/
> > > > > 04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_li
> >
> > > > >cen
> > > > > ses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/h
> >
> > > > >ttp
> > > > > /
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregi
> >
> > > > >ste
> >
> > r.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/0
> >
> > > > >2/0
> > > > > 4/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_lic
> >
> > > > >ens
> > > > > es_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ht
> >
> > > > >tp/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregis
> >
> > > > >ter
> > > > > .com/2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02
> >
> > > > >/04
> > > > > /elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_lice
> >
> > > > >nse
> > > > > s_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/htt
> >
> > > > >p/w
> >
> > ww.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregist
> >
> > > > >er.
> > > > > com/2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/
> >
> > > > >04/
> > > > > elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licen
> >
> > > > >ses
> > > > > _ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http
> >
> > > > >/ww
> >
> > w.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregiste
> >
> > > > >r.c
> > > > > om/2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/0
> >
> > > > >4/e
> > > > > lpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licens
> >
> > > > >es_
> > > > > ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/
> >
> > > > >www
> > > > > .
> >
> > theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister
> >
> > > > >.co
> > > > > m/2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04
> >
> > > > >/el
> > > > > pida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_license
> >
> > > > >s_o
> > > > >
> > > > >
> > > > >vonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/ww
> >
> > > > >w.
> >
> > theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.
> >
> > > > >com
> > > > > /2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/
> >
> > > > >elp
> > > > > ida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses
> >
> > > > >_ov
> > > > >
> > > > >onyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
> >
> > > > >.t
> >
> > heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.c
> >
> > > > >om/
> > > > > 2005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/e
> >
> > > > >lpi
> > > > > da_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_
> >
> > > > >ovo
> > > > >
> > > > >nyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.
> >
> > > > >th
> >
> > eregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.co
> >
> > > > >m/2
> > > > > 005/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/el
> >
> > > > >pid
> > > > > a_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_o
> >
> > > > >von
> > > > >
> > > > >yx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
> >
> > > > >he
> >
> > register.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com
> >
> > > > >/20
> > > > > 05/02/04/elpida_licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elp
> >
> > > > >ida
> > > > > _licenses_ovonyx/http/
> >
> > www.theregister.com/2005/02/04/elpida_licenses_ov
> >
> > > > >ony x/
> > > > >
> > > > > > >> It seems these bad url's are somehow found by the parser and get
> > > > > > >> fetched the next time, and the next time, making the url grow
> > > > > > >> longer and longer with each fetch, parse and updateDB cycle.
> >
> > http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1
> >
> > > > >99
> > > > > 9/article1513468.ece/
> >
> > www.microsoft.com/office/www.microsoft.com/office/
> >
> > > > >www
> > > > > .
> >
> > microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office
> >
> > > > >/ww
> >
> > w.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offic
> >
> > > > >e/w
> >
> > ww.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offi
> >
> > > > >ce/
> >
> > www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/off
> >
> > > > >ice
> > > > > /
> >
> > www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/of
> >
> > > > >fic
> > > > > e/
> >
> > www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/o
> >
> > > > >ffi
> > > > > ce/
> >
> > www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/
> >
> > > > >off
> > > > > ice/
> >
> > www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com
> >
> > > > >/of
> > > > >
> > > > >fice/
> >
> > www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com
> >
> > > > >/o
> > > > >
> > > > >ffice/www.microsoft.com/office/www.microsoft.com/office/antivirus
> > > > >
> > > > > >> This doesn't look good at all. Anyone got a suggestion or some
> > > > > >> pointer?
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> -----Original message-----
> > > > > >> From: Markus Jelsma <[email protected]>
> > > > > >> Sent: Wed 22-09-2010 12:12
> > > > > >> To: [email protected];
> > > > > >> Subject: Funky duplicate url's
> > > > > >>
> > > > > >> Hi,
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > > >> This is not about deduplication, but about preventing certain
> > > > > > >> url's from ending up in the CrawlDB. I'm crawling a news site for
> > > > > > >> testing purposes; it has the usual categories etc. News item
> > > > > > >> pages feature a gray text block that's got some url's as well.
> > > > > > >> See http://www.trouw.nl/opinie/columnisten/article2018983.ece for
> > > > > > >> an example. The problem is, the parser somehow manages to
> > > > > > >> concatenate the href with the inner anchor text (which happens to
> > > > > > >> be an url as you can see). So, subsequent fetches are completely
> > > > > > >> messed up, I'm almost only fetching duplicates:
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> fetching
> >
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/
> >
> > > > >ni
> > > > > euws/economie/
> >
> > www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/w
> >
> > > > >ww.
> >
> > trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuw
> >
> > > > >s/e
> > > > > conomie/
> >
> > www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.tro
> >
> > > > >uw.
> > > > > nl/opinie/weblogs/
> >
> > www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblo
> >
> > > > >gs/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/n
> >
> > > > >ieu ws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> > > > >
> > > > > >> fetching
> >
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/
> >
> > > > >ni
> > > > > euws/economie/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/
> >
> > > > >www
> > > > > .
> >
> > trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opin
> >
> > > > >ie/
> > > > > weblogs/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.tr
> >
> > > > >ouw
> > > > > .nl/nieuws/economie/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/ec
> >
> > > > >ono
> > > > > mie/
> >
> > www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.
> >
> > > > >nl/ nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> > > > >
> > > > > >> fetching
> >
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/
> >
> > > > >op
> > > > > inie/weblogs/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/
> >
> > > > >www
> > > > > .
> >
> > trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opin
> >
> > > > >ie/
> > > > > weblogs/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.tr
> >
> > > > >ouw
> > > > > .nl/nieuws/economie/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/we
> >
> > > > >blo
> > > > >
> > > > >
> > > > >gs/
> >
> > www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl
> >
> > > > >/o
> > > > >
> > > > >pinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> > > > >
> > > > > > >> This is not desired behavior, as you'd expect. The question is,
> > > > > > >> where to fix and how to fix it? Is it a problem with the parser?
> > > > > > >> Or is it fixable using some freaky url filter for this one?
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> Cheers,
> > > > > >
> > > > > > --
> > > > > > AJ Chen, PhD
> > > > > > Chair, Semantic Web SIG, sdforum.org
> > > > > > http://web2express.org
> > > > > > twitter @web2express
> > > > > > Palo Alto, CA, USA
> > >
> > > Markus Jelsma - Technisch Architect - Buyways BV
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
> >
> > Markus Jelsma - Technisch Architect - Buyways BV
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
