Re: Funky duplicate url's, getting much worse!

Julien Nioche Wed, 29 Sep 2010 05:58:46 -0700

What I did for similarpages.com was to write a custom URL filter that
detected repetition of path elements and discarded a URL if it had a path
element occurring more than N times. I don't know what regex AJ suggested but
the approach above was generic and also quite fast.
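
For readers who want to try the idea, here is a minimal sketch of such a check in Python. It is not the similarpages.com code, which is not shown in this thread; the function name `has_repeated_path_element` and the default threshold are made up for illustration, and an actual Nutch URL filter plugin would be written in Java.

```python
from collections import Counter
from urllib.parse import urlsplit


def has_repeated_path_element(url: str, max_repeats: int = 3) -> bool:
    """Return True if any single path element occurs more than max_repeats times.

    Hypothetical helper: the name and the default threshold are assumptions.
    """
    path = urlsplit(url).path
    # Split on slashes and drop the empty strings produced by leading,
    # trailing or doubled slashes.
    elements = [part for part in path.split("/") if part]
    if not elements:
        return False
    return max(Counter(elements).values()) > max_repeats


if __name__ == "__main__":
    looping = "http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto"
    # "rss" occurs three times, so a threshold of 2 flags this URL.
    print(has_repeated_path_element(looping, max_repeats=2))
```

The threshold needs some care: on a site like the one discussed here, an element such as `bericht` legitimately appears once in every article URL, so N should be high enough to let normal URLs through.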


We also had other things like filtering out ridiculously long URLs (not only
do they tend to be rubbish but they cause the normalisation to take a lot of
CPU) or dynamically generated host names, by splitting on, say, dashes and
removing the URL if the hostname had more than N tokens.
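
The two checks above could look roughly like this in Python. Again this is only a sketch under assumptions: the function name `looks_like_rubbish` and both thresholds (`max_length`, `max_host_tokens`) are invented for the example and would need tuning against a real crawl.

```python
from urllib.parse import urlsplit


def looks_like_rubbish(url: str, max_length: int = 512, max_host_tokens: int = 4) -> bool:
    """Return True for overly long URLs or host names with too many dash-separated tokens.

    Hypothetical helper: the name and both thresholds are assumptions.
    """
    # Very long URLs tend to be rubbish and are expensive to normalise.
    if len(url) > max_length:
        return True
    host = urlsplit(url).hostname or ""
    # Splitting on dashes is a crude signal for dynamically generated host names.
    return len(host.split("-")) > max_host_tokens


if __name__ == "__main__":
    print(looks_like_rubbish("http://www.digitalpebble.com"))
    print(looks_like_rubbish("http://cheap-fast-best-new-deals.example.com/"))
```

Both tests are cheap string operations, so they can run before the more expensive normalisation step.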

These are all small tricks but they help control the content of the
crawldb and avoid wasting time trying to fetch rubbish or scanning an
unnecessarily large number of entries during the generation or update.

Detecting adult pages is also quite important for large scale crawls as
these tend to quickly take over the whole crawldb and they generally yield
an awful lot of outlinks.

HTH

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 29 September 2010 13:27, Markus Jelsma <[email protected]> wrote:

> Thanks!
>
> We're back with the base URL issue. The stuff i `found` in the
> TestOutlinkExtractor was my own doing. No patch here. Using the
> ParserChecker it was clear that the problem came up because the http://
> URL scheme was not present in some href's. The problem is also present
> when using an ordinary browser and it can be solved by using the regex AJ
> supplied.
>
> The problem with the blikopnieuws site (relative URL's without base URL)
> remains, though. Check this link http://www.blikopnieuws.nl/nieuwsblok
> On the right side you'll see a latest news block with (in the browser)
> proper URL's. Check the source and you'll see relative URL's. It, of
> course, also stops working in the browser when you have a trailing slash.
>
> Now use the parser checker:
> bin/nutch org.apache.nutch.parse.ParserChecker http://www.blikopnieuws.nl/nieuwsblok
>
> And you'll see that Nutch uses http://www.blikopnieuws.nl/nieuwsblok/ as
> base URL for relative URL's, just as the browser does. Everything works as
> expected because of the relative URL's.
>
> The problem is, the website is itself not consistent. It mostly features
> the URL in the footer without trailing slash but from some unknown page i
> got the same URL with the trailing slash. From there on, everything starts
> to go wrong.
>
> To conclude, i got fooled! But how can we in the future prevent this from
> happening? I could use url filtering but that would mean the index already
> contains garbage because i cannot filter what i don't know.
>
> Cheers,
>
> On Wednesday 29 September 2010 11:25:55 Julien Nioche wrote:
> > Don't know how to run a single test but if you do ant test you should be
> > able to find the logs for each individual class in ./build/test with a
> > separate log for TEST-org.apache.nutch.parse.TestOutlinkExtractor.txt
> > that will be easier than going through a single huge file
> >
> > J.
> >
> >
> > On 29 September 2010 10:11, Markus Jelsma <[email protected]> wrote:
> > Yes but i need a little more testing. Does anyone know how i can test
> > only that class? I currently use ant -v test -l logfile and need to dig
> > through the log file, also, it takes too long because of other tests.
> >
> > On Wednesday 29 September 2010 09:43:04 Julien Nioche wrote:
> > > Hi guys,
> > >
> > > > IIRC the OutlinkExtractor is the same in parse-tika and parse-html.
> > > > Could you please open a JIRA and attach a patch for the
> > > > TestOutlinkExtractor so that we can reproduce the problem?
> > >
> > > Thanks
> > >
> > > Julien
> > >
> > > > Hello Mathijs,
> > > >
> > > >
> > > >
> > > > > I inspected the code base and found that the problem is most likely
> > > > > in the parse-tika code where the text is being extracted and the
> > > > > OutlinkExtractor is called. The OutlinkExtractor uses a regular
> > > > > expression that can output a lot of garbage. I've added a test to the
> > > > > TestOutlinkExtractor where it's clear that at least one URL does not
> > > > > pass but it does not point me in the right direction for solving the
> > > > > relative path problem.
> > > >
> > > >
> > > >
> > > > Unless someone knows, i'll try to find out how the OutlinkExtractor
> > > > works with the current base URL because just a plain relative URL in
> > > > the test will obviously fail.
> > > >
> > > >
> > > >
> > > > Thanks for the pointer =)
> > > >
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > -----Original message-----
> > > > From: Mathijs Homminga <[email protected]>
> > > > Sent: Tue 28-09-2010 21:01
> > > > To: [email protected];
> > > > Subject: Re: Funky duplicate url's, getting much worse!
> > > >
> > > > Hi Marcus,
> > > >
> > > > I remember Nutch had some troubles with honoring the page's BASE tag
> > > > when resolving relative outlinks.
> > > > > However, I don't see this BASE tag being used in the HTML pages you
> > > > > provide so that might not be it.
> > > >
> > > > Mathijs
> > > >
> > > > On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
> > > > > > Anyone? Where is a proper solution for this issue? As expected, the
> > > > > > regex won't catch all imaginable kinds of funky URL's that somehow
> > > > > > ended up in the CrawlDB. Before the weekend, i added another news
> > > > > > site to the tests i conduct and let it run continuously.
> > > > > > Unfortunately, the generator now comes up with all kinds of
> > > > > > completely useless URL's, although they do exist but that's just the
> > > > > > web application ignoring most parts of the URL's.
> > > >
> > > > > This is the URL that should be considered as proper URL:
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok
> > > > >
> > > > >
> > > > >
> > > > > > Here are two URL's that are completely useless:
> > > > > >
> > > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > > > > >
> > > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> > > >
> > > > > > It is very hard to use deduplication on these simply because the
> > > > > > content actually changes too much as time progresses - the latest
> > > > > > news block for example. It is therefore a necessity to keep these
> > > > > > URL's from ending up in the CrawlDB, so as not to waste disk space,
> > > > > > CrawlDB update time and a huge load of bandwidth - in my current
> > > > > > fetch i'm probably going to waste at least a few GB's.
> > > >
> > > > > > Looking at the HTML source, it looks like the parser cannot properly
> > > > > > handle relative URL's. It is, of course, quite ugly for a site to do
> > > > > > this but the parser must not fool itself and come up with URL's that
> > > > > > really aren't there. Combined with the issue i began the thread with,
> > > > > > i believe the following two problems are present - the parser returns
> > > > > > imaginary (false) URL's because of:
> > > > > >
> > > > > > 1. relative href's;
> > > > > >
> > > > > > 2. URL's in anchors (that is the XML element's body) next to the href
> > > > > >    attribute.
> > > > > >
> > > > > > Please help in finding the source of the problem (Tika? Nutch?) and
> > > > > > how to proceed in having it fixed so other users won't waste
> > > > > > bandwidth, disk space and CPU cycles =)
> > > >
> > > > > > Oh, here's a snippet of the fetch job that's currently running. Also
> > > > > > notice the news item with the 119039 ID, it's the same as above
> > > > > > although that copy/paste was 15 minutes ago. Most item ID's you see
> > > > > > below continue to return in the current log output.
> > > >
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> > > >
> > > > > -----Original message-----
> > > > > From: Markus Jelsma <[email protected]>
> > > > > Sent: Wed 22-09-2010 20:47
> > > > > To: [email protected];
> > > > > Subject: RE: Re: Funky duplicate url's
> > > > >
> > > > > > Thanks! I've already implemented a similar (but not as generic) regex
> > > > > > to catch these url's. But it is, of course, not a proper solution to
> > > > > > solve a parsing problem with subsequent regex's. Please, correct me
> > > > > > if i'm wrong, but i'm quite sure those url's are not to be found in
> > > > > > the HTML sources. It'd better be fixed where the problem seems to be.
> > > > > >
> > > > > > I'll test your regex but i'd still like to know where the exact
> > > > > > problem lies and hopefully fix or help fixing it.
> > > >
> > > > > Thanks
> > > > >
> > > > > -----Original message-----
> > > > > From: AJ Chen <[email protected]>
> > > > > Sent: Wed 22-09-2010 20:29
> > > > > To: [email protected];
> > > > > Subject: Re: Funky duplicate url's
> > > > >
> > > > > the conf/regex-urlfilter.txt file has an exclusion rule that should
> > > > > skip these viral urls.
> > > > >
> > > > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > > > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > > > >
> > > > > -aj
> > > > >
> > > > > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <[email protected]> wrote:
> > > > > >> Well, using a regex to catch these troublemakers isn't going to be
> > > > > >> useful. Although i caught the first faulty url's, there can be many
> > > > > >> more and it's unpredictable; here's just a random pick from the list
> > > > > >> of errors:
> > > >
> > > >
> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.
> > > >is
> > > > /Key-Sectors/Data-Centers-in-Iceland/
> www.invest.is/Key-Sectors/Data-Cen
> > > >ter
> > > > s-in-Iceland/
> www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.inve
> > > >st.
> > > > is/Key-Sectors/Data-Centers-in-Iceland/
> www.invest.is/Key-Sectors/Data-C
> > > >ent
> > > >
> > > >
> > > >
> > > >ers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> > > >
> > > > >> Here's another very disturbing url it's trying to fetch:
> > > >
> > > >
> http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/200
> > > >5/
> > > > 02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida
> > > >_li
> > > > censes_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovony
> > > >x/h
> > > > ttp/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.ther
> > > >egi
> > > >
> ster.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/200
> > > >5/0
> > > > 2/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_
> > > >lic
> > > > enses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx
> > > >/ht
> > > > tp/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.there
> > > >gis
> > > >
> ter.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005
> > > >/02
> > > > /04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_l
> > > >ice
> > > > nses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
> > > >htt
> > > > p/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.thereg
> > > >ist
> > > >
> er.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/
> > > >02/
> > > > 04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_li
> > > >cen
> > > > ses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/h
> > > >ttp
> > > > /
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregi
> > > >ste
> > > >
> r.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/0
> > > >2/0
> > > > 4/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_lic
> > > >ens
> > > > es_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ht
> > > >tp/
> > > >
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregis
> > > >ter
> > > > .com/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02
> > > >/04
> > > > /elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_lice
> > > >nse
> > > > s_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/htt
> > > >p/w
> > > >
> ww.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregist
> > > >er.
> > > > com/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/
> > > >04/
> > > > elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licen
> > > >ses
> > > > _ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http
> > > >/ww
> > > >
> w.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregiste
> > > >r.c
> > > > om/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/0
> > > >4/e
> > > > lpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licens
> > > >es_
> > > > ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/
> > > >www
> > > > .
> theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister
> > > >.co
> > > > m/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04
> > > >/el
> > > > pida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_license
> > > >s_o
> > > >
> > > >
> > > >vonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/ww
> > > >w.
> > > >
> theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.
> > > >com
> > > > /2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/
> > > >elp
> > > > ida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses
> > > >_ov
> > > >
> > > >onyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
> > > >.t
> > > >
> heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.c
> > > >om/
> > > > 2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/e
> > > >lpi
> > > > da_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_
> > > >ovo
> > > >
> > > >nyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.
> > > >th
> > > >
> eregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.co
> > > >m/2
> > > > 005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/el
> > > >pid
> > > > a_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_o
> > > >von
> > > >
> > > >yx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
> > > >he
> > > >
> register.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com
> > > >/20
> > > > 05/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elp
> > > >ida
> > > > _licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ov
> > > >ony x/
> > > >
> > > > > >> It seems these bad url's are somehow found by the parser and get
> > > > > >> fetched the next time, and the next time, making the url grow longer
> > > > > >> and longer for each fetch and parse and updateDB cycle.
> > > >
> > > >
> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1
> > > >99
> > > > 9/article1513468.ece/
> www.microsoft.com/office/www.microsoft.com/office/
> > > >www
> > > > .
> microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office
> > > >/ww
> > > >
> w.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offic
> > > >e/w
> > > >
> ww.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offi
> > > >ce/
> > > >
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/off
> > > >ice
> > > > /
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/of
> > > >fic
> > > > e/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/o
> > > >ffi
> > > > ce/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/
> > > >off
> > > > ice/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com
> > > >/of
> > > >
> > > >fice/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com
> > > >/o
> > > >
> > > >ffice/www.microsoft.com/office/www.microsoft.com/office/antivirus
> > > >
> > > > >> This doesn't look good at all. Anyone got a suggestion or some
> > > > >> pointer?
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> -----Original message-----
> > > > >> From: Markus Jelsma <[email protected]>
> > > > >> Sent: Wed 22-09-2010 12:12
> > > > >> To: [email protected];
> > > > >> Subject: Funky duplicate url's
> > > > >>
> > > > >> Hi,
> > > > >>
> > > > >>
> > > > >>
> > > > > >> This is not about deduplication, but about preventing certain url's
> > > > > >> from ending up in the CrawlDB. I'm crawling a news site for testing
> > > > > >> purposes, it has the usual categories etc. News item pages feature a
> > > > > >> gray text block that's got some url's as well. See
> > > > > >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an
> > > > > >> example.
> > > > > >>
> > > > > >> The problem is, the parser somehow manages to concatenate the href
> > > > > >> with the inner anchor text (which happens to be an url as you can
> > > > > >> see). So, subsequent fetches are completely messed up, i'm almost
> > > > > >> only fetching duplicates:
> > > > >>
> > > > >>
> > > > >>
> > > > >> fetching
> > > >
> > > >
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/
> > > >ni
> > > > euws/economie/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/w
> > > >ww.
> > > >
> trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuw
> > > >s/e
> > > > conomie/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.tro
> > > >uw.
> > > > nl/opinie/weblogs/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblo
> > > >gs/
> > > >
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/n
> > > >ieu ws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> > > >
> > > > >> fetching
> > > >
> > > >
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/
> > > >ni
> > > > euws/economie/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/
> > > >www
> > > > .
> trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opin
> > > >ie/
> > > > weblogs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.tr
> > > >ouw
> > > > .nl/nieuws/economie/
> www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/ec
> > > >ono
> > > > mie/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.
> > > >nl/ nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> > > >
> > > > >> fetching
> > > >
> > > >
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/
> > > >op
> > > > inie/weblogs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/
> > > >www
> > > > .
> trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opin
> > > >ie/
> > > > weblogs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.tr
> > > >ouw
> > > > .nl/nieuws/economie/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/we
> > > >blo
> > > >
> > > >
> > > >gs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl
> > > >/o
> > > >
> > > >pinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> > > >
> > > > > >> This is not desired behavior, as you'd expect. The question is,
> > > > > >> where to fix and how to fix it? Is it a problem with the parser? Or
> > > > > >> is it fixable using some freaky url filter for this one?
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> Cheers,
> > > > >
> > > > > --
> > > > > AJ Chen, PhD
> > > > > Chair, Semantic Web SIG, sdforum.org
> > > > > http://web2express.org
> > > > > twitter @web2express
> > > > > Palo Alto, CA, USA
> >
> > Markus Jelsma - Technisch Architect - Buyways BV
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> >
>
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
>
