Re: Funky duplicate url's, getting much worse!

Markus Jelsma Wed, 29 Sep 2010 05:30:13 -0700

Thanks!

We're back to the base URL issue. The stuff I `found` in the
TestOutlinkExtractor was my own doing, so there is no patch here. Using the
ParserChecker it was clear that the problem came up because the http:// URL
scheme was not present in some href's. The same problem shows up in an
ordinary browser, and it can be solved by using the regex AJ supplied.


The problem with the blikopnieuws site (relative URL's without base URL)
remains, though. Check this link: http://www.blikopnieuws.nl/nieuwsblok
On the right side you'll see a latest news block with (in the browser) proper
URL's. Check the source and you'll see relative URL's. It, of course, also
stops working in the browser when you have a trailing slash.

Now use the parser checker:
bin/nutch org.apache.nutch.parse.ParserChecker 
http://www.blikopnieuws.nl/nieuwsblok

And you'll see that Nutch uses http://www.blikopnieuws.nl/nieuwsblok/ as base
URL for relative URL's, just as the browser does. Everything works as
expected, given that the URL's are relative.

The problem is, the website itself is not consistent. It mostly features the
URL in the footer without trailing slash, but from some unknown page I got the
same URL with the trailing slash. From there on, everything starts to go
wrong.
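
A minimal sketch of why the trailing slash matters. It uses Python's
urllib.parse.urljoin as a stand-in for the resolution a browser or Nutch
performs; the relative href values are hypothetical and not taken from the
site's source.

```python
from urllib.parse import urljoin

# Hypothetical relative href, standing in for a link in the latest news block.
HREF = "bericht/119039"

# Without a trailing slash the last path segment of the base is replaced.
no_slash = urljoin("http://www.blikopnieuws.nl/nieuwsblok", HREF)

# With a trailing slash the href is appended below the base path.
with_slash = urljoin("http://www.blikopnieuws.nl/nieuwsblok/", HREF)


def nest(start, href, rounds):
    """Resolve the same relative href repeatedly, each time against the
    URL produced by the previous round."""
    url = start
    for _ in range(rounds):
        url = urljoin(url, href)
    return url


if __name__ == "__main__":
    print(no_slash)
    print(with_slash)
    # A relative href that itself ends in a slash nests one level per round.
    print(nest("http://www.blikopnieuws.nl/nieuwsblok/", "archief/", 2))
```

If the web application answers every one of those nested paths with a page
that again contains relative links, each crawl round can add another level.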

To conclude, I got fooled! But how can we prevent this from happening in the
future? I could use URL filtering but that would mean the index already
contains garbage, because I cannot filter what I don't know.
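
To illustrate that limitation, here is a rough sketch that tries the
exclusion rule AJ posted (quoted further down) against a few URLs. Python's
re module stands in for the Java regex engine Nutch uses, and the assumption
that the rule is applied as a "find" anywhere in the URL is mine. The first
test URL is a shortened form of one of the nrc.nl examples from this thread.

```python
import re

# Pattern from the quoted conf/regex-urlfilter.txt rule, without the
# leading "-" that marks it as an exclusion.
LOOP_RULE = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_excluded(url):
    """Return True if the loop rule would reject this URL.

    Assumption: the rule matches if it is found anywhere in the URL.
    """
    return LOOP_RULE.search(url) is not None


if __name__ == "__main__":
    samples = [
        # Same segment at every second position: caught by the rule.
        "http://www.nrc.nl/article1513468.ece/www.microsoft.com/office/"
        "www.microsoft.com/office/www.microsoft.com/office/antivirus",
        # Repeated segments directly next to each other: not caught.
        "http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto",
        # The proper URL: not caught, as intended.
        "http://www.blikopnieuws.nl/nieuwsblok",
    ]
    for url in samples:
        print(is_excluded(url), url)
```

The rule only fires when one segment comes back at every second position,
so other repetition patterns slip through until someone writes a rule for
them.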

Cheers,

On Wednesday 29 September 2010 11:25:55 Julien Nioche wrote:
> Don't know how to run a single test but if you do ant test you should be
>  able to find the logs for each individual class in ./build/test with a
>  separate log for TEST-org.apache.nutch.parse.TestOutlinkExtractor.txt that
>  will be easier than going through a single huge file
> 
> J.
> 
> 
> On 29 September 2010 10:11, Markus Jelsma <[email protected]> wrote:
> Yes, but i need a little more testing. Does anyone know how i can test only
> that class? I currently use ant -v test -l logfile and need to dig through
>  the log file. It also takes too long because of the other tests.
> 
> On Wednesday 29 September 2010 09:43:04 Julien Nioche wrote:
> > Hi guys,
> >
> > IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could
> > you please open a JIRA and attach a patch for the TestOutlinkExtractor so
> > that we can reproduce the problem?
> >
> > Thanks
> >
> > Julien
> >
> > > Hello Mathijs,
> > >
> > >
> > >
> > > I inspected the code base and found that the problem is most likely in
> > > the parse-tika code where the text is being extracted and the
> > > OutlinkExtractor is called. The OutlinkExtractor uses a regular
> > > expression that can output a lot of garbage. I've added a test to the
> > > TestOutlinkExtractor where it's clear that at least one URL does not
> > > pass but it does not point me in the right direction for solving the
> > > relative path problem.
> > >
> > >
> > >
> > > Unless someone knows, i'll try to find out how the OutlinkExtractor
> > > works with the current base URL because just a plain relative URL in
> > > the test will obviously fail.
> > >
> > >
> > >
> > > Thanks for the pointer =)
> > >
> > >
> > >
> > > Cheers,
> > >
> > > -----Original message-----
> > > From: Mathijs Homminga <[email protected]>
> > > Sent: Tue 28-09-2010 21:01
> > > To: [email protected];
> > > Subject: Re: Funky duplicate url's, getting much worse!
> > >
> > > Hi Marcus,
> > >
> > > I remember Nutch had some troubles with honoring the page's BASE tag
> > > when resolving relative outlinks.
> > > > However, I don't see this BASE tag being used in the HTML pages you
> > > > provide so that might not be it.
> > >
> > > Mathijs
> > >
> > > On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
> > > > > Anyone? Where is a proper solution for this issue? As expected, the
> > > > > regex won't catch all imaginable kinds of funky URL's that somehow
> > > > > ended up in the CrawlDB. Before the weekend, i added another news
> > > > > site to the tests i conduct and let it run continuously.
> > > > > Unfortunately, the generator now comes up with all kinds of
> > > > > completely useless URL's. They do exist, but only because the web
> > > > > application ignores most parts of the URL's.
> > >
> > > > > This is the URL that should be considered as proper URL:
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok
> > > > >
> > > > > Here are two URL's that are completely useless:
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> > >
> > > > > It is very hard to use deduplication on these simply because the
> > > > > content actually changes too much as time progresses - the latest
> > > > > news block for example. It is therefore a necessity to keep these
> > > > > URL's from ending up in the CrawlDB, so as not to waste disk space,
> > > > > CrawlDB update time and a huge load of bandwidth - in my current
> > > > > fetch i'm probably going to waste at least a few GB's.
> > >
> > > > > Looking at the HTML source, it looks like the parser cannot properly
> > > > > handle relative URL's. It is, of course, quite ugly for a site to do
> > > > > this but the parser must not fool itself and come up with URL's that
> > > > > really aren't there. Combined with the issue i began the thread with,
> > > > > i believe the following two problems are present - the parser
> > > > > returns imaginary (false) URL's because of:
> > > > >
> > > > > 1. relative href's;
> > > > >
> > > > > 2. URL's in anchors (that is, the XML element's body) next to the
> > > > >    href attribute.
> > >
> > > > > Please help in finding the source of the problem (Tika? Nutch?) and
> > > > > how to proceed in having it fixed so other users won't waste
> > > > > bandwidth, disk space and CPU cycles =)
> > >
> > > > > Oh, here's a snippet of the fetch job that's currently running.
> > > > > Also notice the news item with the 119039 ID: it's the same as above
> > > > > although that copy/paste was 15 minutes ago. Most item ID's you see
> > > > > below keep coming back in the current log output.
> > >
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> > >
> > > > -----Original message-----
> > > > From: Markus Jelsma <[email protected]>
> > > > Sent: Wed 22-09-2010 20:47
> > > > To: [email protected];
> > > > Subject: RE: Re: Funky duplicate url's
> > > >
> > > > > Thanks! I've already implemented a similar (but not as generic) regex
> > > > > to catch these url's. But it is, of course, not a proper solution to
> > > > > solve a parsing problem with subsequent regex's. Please, correct me
> > > > > if i'm wrong, but i'm quite sure those url's are not to be found in
> > > > > the HTML sources. It'd better be fixed where the problem seems to be.
> > > > >
> > > > > I'll test your regex but i'd still like to know where the exact
> > > > > problem lies and hopefully fix or help fixing it.
> > >
> > > > Thanks
> > > >
> > > > -----Original message-----
> > > > From: AJ Chen <[email protected]>
> > > > Sent: Wed 22-09-2010 20:29
> > > > To: [email protected];
> > > > Subject: Re: Funky duplicate url's
> > > >
> > > > the conf/regex-urlfilter.txt file has an exclusion rule that should
> > > > skip these viral urls.
> > > >
> > > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > > >
> > > > -aj
> > > >
> > > > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <[email protected]> wrote:
> > > > >> Well, using a regex to catch these troublemakers isn't going to be
> > > > >> useful. Although i caught the first faulty url's, there can be many
> > > > >> more and it's unpredictable; here's just a random pick from the list
> > > > >> of errors:
> > > > >>
> > > > >> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> > >
> > > >> Here's another very disturbing url it's trying to fetch:
> > >
> > > http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/200
> > >5/
> > > 02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida
> > >_li
> > > censes_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovony
> > >x/h
> > > ttp/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.ther
> > >egi
> > > ster.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/200
> > >5/0
> > > 2/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_
> > >lic
> > > enses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx
> > >/ht
> > > tp/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.there
> > >gis
> > > ter.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005
> > >/02
> > > /04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_l
> > >ice
> > > nses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
> > >htt
> > > p/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.thereg
> > >ist
> > > er.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/
> > >02/
> > > 04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_li
> > >cen
> > > ses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/h
> > >ttp
> > > /www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregi
> > >ste
> > > r.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/0
> > >2/0
> > > 4/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_lic
> > >ens
> > > es_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ht
> > >tp/
> > > www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregis
> > >ter
> > > .com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02
> > >/04
> > > /elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_lice
> > >nse
> > > s_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/htt
> > >p/w
> > > ww.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregist
> > >er.
> > > com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/
> > >04/
> > > elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licen
> > >ses
> > > _ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http
> > >/ww
> > > w.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregiste
> > >r.c
> > > om/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/0
> > >4/e
> > > lpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licens
> > >es_
> > > ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/
> > >www
> > > .theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister
> > >.co
> > > m/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04
> > >/el
> > > pida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_license
> > >s_o
> > >
> > >
> > >vonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/ww
> > >w.
> > > theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.
> > >com
> > > /2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/
> > >elp
> > > ida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses
> > >_ov
> > >
> > >onyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
> > >.t
> > > heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.c
> > >om/
> > > 2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/e
> > >lpi
> > > da_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_
> > >ovo
> > >
> > >nyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.
> > >th
> > > eregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.co
> > >m/2
> > > 005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/el
> > >pid
> > > a_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_o
> > >von
> > >
> > >yx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
> > >he
> > > register.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com
> > >/20
> > > 05/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elp
> > >ida
> > > _licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ov
> > >ony x/
> > >
> > > > >> It seems these bad url's are somehow found by the parser and get
> > > > >> fetched the next time, and the next time, making the url grow longer
> > > > >> and longer for each fetch and parse and updateDB cycle.
> > >
> > > http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1
> > >99
> > > 9/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/
> > >www
> > > .microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office
> > >/ww
> > > w.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offic
> > >e/w
> > > ww.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offi
> > >ce/
> > > www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/off
> > >ice
> > > /www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/of
> > >fic
> > > e/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/o
> > >ffi
> > > ce/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/
> > >off
> > > ice/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com
> > >/of
> > >
> > >fice/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com
> > >/o
> > >
> > >ffice/www.microsoft.com/office/www.microsoft.com/office/antivirus
> > >
> > > >> This doesn't look good at all. Anyone got a suggestion or some
> > > >> pointer?
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> -----Original message-----
> > > >> From: Markus Jelsma <[email protected]>
> > > >> Sent: Wed 22-09-2010 12:12
> > > >> To: [email protected];
> > > >> Subject: Funky duplicate url's
> > > >>
> > > >> Hi,
> > > >>
> > > >>
> > > >>
> > > > >> This is not about deduplication, but about preventing certain url's
> > > > >> from ending up in the CrawlDB. I'm crawling a news site for testing
> > > > >> purposes, it has the usual categories etc. News item pages feature a
> > > > >> gray text block that's got some url's as well. See
> > > > >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an
> > > > >> example.
> > > > >>
> > > > >> The problem is, the parser somehow manages to concatenate the href
> > > > >> with the inner anchor text (which happens to be an url as you can
> > > > >> see). So, subsequent fetches are completely messed up, i'm almost
> > > > >> only fetching duplicates:
> > > >>
> > > >>
> > > >>
> > > >> fetching
> > >
> > > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/
> > >ni
> > > euws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/w
> > >ww.
> > > trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuw
> > >s/e
> > > conomie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.tro
> > >uw.
> > > nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblo
> > >gs/
> > > www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/n
> > >ieu ws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> > >
> > > >> fetching
> > >
> > > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/
> > >ni
> > > euws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/
> > >www
> > > .trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opin
> > >ie/
> > > weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.tr
> > >ouw
> > > .nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/ec
> > >ono
> > > mie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.
> > >nl/ nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> > >
> > > >> fetching
> > >
> > > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/
> > >op
> > > inie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/
> > >www
> > > .trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opin
> > >ie/
> > > weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.tr
> > >ouw
> > > .nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/we
> > >blo
> > >
> > >
> > >gs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl
> > >/o
> > >
> > >pinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> > >
> > > >> This is not desired behavior, as you'd expect. The question is,
> > > >> where to fix and how to fix it? Is it a problem with the parser? Or
> > > >> is it fixable using some freaky url filter for this one?
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Cheers,
> > > >
> > > > --
> > > > AJ Chen, PhD
> > > > Chair, Semantic Web SIG, sdforum.org
> > > > http://web2express.org
> > > > twitter @web2express
> > > > Palo Alto, CA, USA
> 
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
