Re: Funky duplicate url's, getting much worse!

Julien Nioche Wed, 29 Sep 2010 02:26:27 -0700

I don't know how to run a single test, but if you run *ant test* you should be
able to find the logs for each individual class in ./build/test, with a
separate log named *TEST-org.apache.nutch.parse.TestOutlinkExtractor.txt*.
That will be easier than going through a single huge file.
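
As a small illustration of picking out the per-class report, here is a sketch in Python. It assumes only what is stated above: the reports sit in ./build/test and are named TEST-<fully qualified class name>.txt. The helper name find_test_logs is hypothetical and not part of Nutch or Ant.

```python
from pathlib import Path


def find_test_logs(build_test_dir, class_name):
    """Return the per-class text reports whose file name mentions class_name.

    Assumes reports are named TEST-<fully.qualified.ClassName>.txt, as
    described in the message above. This helper is illustrative only.
    """
    root = Path(build_test_dir)
    return sorted(p for p in root.glob("TEST-*.txt") if class_name in p.name)


if __name__ == "__main__":
    # Print the report(s) for the class under discussion, if any exist.
    for log in find_test_logs("./build/test", "TestOutlinkExtractor"):
        print(log)
```

Run from the Nutch checkout after *ant test* has finished; it prints nothing if no matching report exists.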


J.


On 29 September 2010 10:11, Markus Jelsma <[email protected]> wrote:

> Yes, but I need a little more testing. Does anyone know how I can test only
> that class? I currently use ant -v test -l logfile and need to dig through
> the log file; it also takes too long because of the other tests.
>
>
> On Wednesday 29 September 2010 09:43:04 Julien Nioche wrote:
> > Hi guys,
> >
> > IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could
> > you please open a JIRA and attach a patch for the TestOutlinkExtractor so
> > that we can reproduce the problem?
> >
> > Thanks
> >
> > Julien
> >
> > > Hello Mathijs,
> > >
> > >
> > >
> > > > I inspected the code base and found that the problem is most likely in
> > > > the parse-tika code where the text is being extracted and the
> > > > OutlinkExtractor is called. The OutlinkExtractor uses a regular
> > > > expression that can output a lot of garbage. I've added a test to the
> > > > TestOutlinkExtractor where it's clear that at least one URL does not
> > > > pass but it does not point me in the right direction for solving the
> > > > relative path problem.
> > >
> > >
> > >
> > > > Unless someone knows, I'll try to find out how the OutlinkExtractor
> > > > works with the current base URL, because just a plain relative URL in
> > > > the test will obviously fail.
> > >
> > >
> > >
> > > Thanks for the pointer =)
> > >
> > >
> > >
> > > Cheers,
> > >
> > > -----Original message-----
> > > From: Mathijs Homminga <[email protected]>
> > > Sent: Tue 28-09-2010 21:01
> > > To: [email protected];
> > > Subject: Re: Funky duplicate url's, getting much worse!
> > >
> > > Hi Marcus,
> > >
> > > > I remember Nutch had some trouble honoring the page's BASE tag when
> > > > resolving relative outlinks.
> > > > However, I don't see this BASE tag being used in the HTML pages you
> > > > provide, so that might not be it.
> > >
> > > Mathijs
> > >
> > > On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
> > > > > Anyone? Where is a proper solution for this issue? As expected, the
> > > > > regex won't catch all imaginable kinds of funky URLs that somehow
> > > > > ended up in the CrawlDB. Before the weekend, I added another news
> > > > > site to the tests I conduct and let it run continuously.
> > > > > Unfortunately, the generator now comes up with all kinds of
> > > > > completely useless URLs. They do exist, but only because the web
> > > > > application ignores most parts of the URLs.
> > >
> > > > This is the URL that should be considered as proper URL:
> > > >
> > > > http://www.blikopnieuws.nl/nieuwsblok
> > > >
> > > >
> > > >
> > > > > Here are two URL's that are completely useless:
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > > > >
> > > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> > >
> > > > > It is very hard to use deduplication on these simply because the
> > > > > content actually changes too much as time progresses - the latest
> > > > > news block, for example. It is therefore necessary to keep these
> > > > > URLs from ending up in the CrawlDB, so as not to waste disk space,
> > > > > CrawlDB update time and a huge load of bandwidth - in my current
> > > > > fetch I'm probably going to waste at least a few GBs.
> > >
> > > > > Looking at the HTML source, it looks like the parser cannot properly
> > > > > handle relative URLs. It is, of course, quite ugly for a site to do
> > > > > this, but the parser must not fool itself and come up with URLs that
> > > > > really aren't there. Combined with the issue I began the thread
> > > > > with, I believe the following two problems are present - the parser
> > > > > returns imaginary (false) URLs because of:
> > > > >
> > > > > 1. relative hrefs;
> > > > >
> > > > > 2. URLs in anchors (that is, the XML element's body) next to the href
> > > > > attribute.
> > >
> > > > > Please help in finding the source of the problem (Tika? Nutch?) and
> > > > > how to proceed in having it fixed so other users won't waste
> > > > > bandwidth, disk space and CPU cycles =)
> > >
> > > > > Oh, here's a snippet of the fetch job that's currently running. Also
> > > > > notice the news item with the 119039 ID: it's the same as above,
> > > > > although that copy/paste was 15 minutes ago. Most item IDs you see
> > > > > below keep returning in the current log output.
> > >
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> > >
> > > > -----Original message-----
> > > > From: Markus Jelsma <[email protected]>
> > > > Sent: Wed 22-09-2010 20:47
> > > > To: [email protected];
> > > > Subject: RE: Re: Funky duplicate url's
> > > >
> > > > > Thanks! I've already implemented a similar (but not as generic) regex
> > > > > to catch these URLs. But it is, of course, not a proper solution to
> > > > > solve a parsing problem with subsequent regexes. Please correct me if
> > > > > I'm wrong, but I'm quite sure those URLs are not to be found in the
> > > > > HTML sources. It would be better to fix this where the problem seems
> > > > > to be.
> > > > >
> > > > > I'll test your regex but I'd still like to know where the exact
> > > > > problem lies and hopefully fix or help fixing it.
> > >
> > > > Thanks
> > > >
> > > > -----Original message-----
> > > > From: AJ Chen <[email protected]>
> > > > Sent: Wed 22-09-2010 20:29
> > > > To: [email protected];
> > > > Subject: Re: Funky duplicate url's
> > > >
> > > > the conf/regex-urlfilter.txt file has an exclusion rule that should
> > > > skip these viral urls.
> > > >
> > > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > > >
> > > > -aj
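
To see what the rule quoted above does and does not catch, here is a minimal sketch in Python. Two assumptions are made: the leading "-" is the filter file's exclusion marker rather than part of the pattern, and the pattern is applied as a search anywhere in the URL. Nutch evaluates the rule with Java's regex engine, so Python's re module is only a stand-in here, and is_excluded is a hypothetical helper name. The first URL is a shortened form of one reported in this thread.

```python
import re

# Pattern from conf/regex-urlfilter.txt without the leading "-" marker.
# It needs the same segment at positions i, i+2 and i+4, followed by "/".
LOOP_RULE = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_excluded(url):
    """Return True if the loop-breaking rule would match this URL."""
    return LOOP_RULE.search(url) is not None


if __name__ == "__main__":
    urls = [
        # Shortened form of a URL from this thread: a/b/a/b/a/ repetition.
        "http://www.nrc.nl/dossiers/computerbeveiliging/virussen/"
        "melissa_maart_1999/article1513468.ece/www.microsoft.com/office/"
        "www.microsoft.com/office/www.microsoft.com/office/antivirus",
        # URL from this thread: segments repeat, but not in that pattern.
        "http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/"
        "archief/bericht/119033/bericht/119047/economie",
        "http://www.blikopnieuws.nl/nieuwsblok",
    ]
    for url in urls:
        print(is_excluded(url), url)
```

Under these assumptions only URLs whose repeated segment alternates with exactly one other segment are matched, so a useless URL with irregular repetition can still slip through.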
> > > >
> > > > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <[email protected]> wrote:
> > > > >> Well, using a regex to catch these troublemakers isn't going to be
> > > > >> useful. Although i caught the first faulty url's, there can be many
> > > > >> more and it's unpredictable; here's just a random pick from the list
> > > > >> of errors:
> > > > >>
> > > > >> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> > >
> > > >> Here's another very disturbing url it's trying to fetch:
> > >
> > >
> http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/
> > >02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_li
> > >censes_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/h
> > >ttp/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregi
> > >
> ster.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/0
> > >2/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_lic
> > >enses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ht
> > >tp/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregis
> > >
> ter.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02
> > >/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_lice
> > >nses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/htt
> > >p/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregist
> > >
> er.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/
> > >04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licen
> > >ses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http
> > >/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregiste
> > >
> r.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/0
> > >4/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licens
> > >es_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/
> > >
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister
> > >.com/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04
> > >/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_license
> > >s_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/w
> > >
> ww.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.
> > >com/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/
> > >elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses
> > >_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/ww
> > >
> w.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.c
> > >om/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/e
> > >lpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_
> > >ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
> > >.
> theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.co
> > >m/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/el
> > >pida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_o
> > >vonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.
> > >
> theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com
> > >/2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elp
> > >ida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ov
> > >onyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
> > >
> heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/
> > >2005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpi
> > >da_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovo
> > >nyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.th
> > >
> eregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2
> > >005/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpid
> > >a_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovon
> > >yx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.the
> > >
> register.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/20
> > >05/02/04/elpida_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida
> > >_licenses_ovonyx/http/
> www.theregister.com/2005/02/04/elpida_licenses_ovony
> > >x/
> > >
> > > > >> It seems these bad URLs are somehow found by the parser and get
> > > > >> fetched the next time, and the next time, making the URL grow longer
> > > > >> and longer with each fetch, parse and updateDB cycle.
> > >
> > >
> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_199
> > >9/article1513468.ece/
> www.microsoft.com/office/www.microsoft.com/office/www
> > >.
> microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/ww
> > >
> w.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/w
> > >
> ww.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/
> > >
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office
> > >/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offic
> > >e/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offi
> > >ce/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/off
> > >ice/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/of
> > >fice/
> www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/o
> > >ffice/www.microsoft.com/office/www.microsoft.com/office/antivirus
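
One way URLs of the shape shown above can arise is ordinary relative-URL resolution of a link that lacks a scheme. The sketch below uses Python's urllib.parse.urljoin to illustrate the growth per cycle. It is an assumption about the mechanism, not a confirmed description of what Nutch or Tika does; the hosts www.example.com and www.example.org and the helper simulate_cycles are hypothetical.

```python
from urllib.parse import urljoin


def simulate_cycles(page_url, href, cycles):
    """Resolve the same href against each newly produced URL in turn.

    Mimics a crawl where every fetched page yields the same outlink text,
    which is then resolved against the page's own URL.
    """
    urls = []
    current = page_url
    for _ in range(cycles):
        current = urljoin(current, href)
        urls.append(current)
    return urls


if __name__ == "__main__":
    # Assumed inputs: a base URL ending in "/" and a link text without
    # "http://", so it is treated as a relative path.
    base = "http://www.example.com/news/article.ece/"
    for url in simulate_cycles(base, "www.example.org/office/", 3):
        print(len(url), url)

    # With an explicit scheme the link is absolute and nothing accumulates.
    print(simulate_cycles(base, "http://www.example.org/office/", 3)[-1])
```

Because the resolved URL again ends in "/", each further cycle appends the link text once more, which matches the pattern of ever longer URLs described above.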
> > >
> > > >> This doesn't look good at all. Anyone got a suggestion or some
> > > >> pointer?
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> -----Original message-----
> > > >> From: Markus Jelsma <[email protected]>
> > > >> Sent: Wed 22-09-2010 12:12
> > > >> To: [email protected];
> > > >> Subject: Funky duplicate url's
> > > >>
> > > >> Hi,
> > > >>
> > > >>
> > > >>
> > > > >> This is not about deduplication, but about preventing certain URLs
> > > > >> from ending up in the CrawlDB. I'm crawling a news site for testing
> > > > >> purposes; it has the usual categories etc. News item pages feature a
> > > > >> gray text block that's got some URLs as well. See
> > > > >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an
> > > > >> example. The problem is, the parser somehow manages to concatenate
> > > > >> the href with the inner anchor text (which happens to be a URL, as
> > > > >> you can see). So subsequent fetches are completely messed up; I'm
> > > > >> almost only fetching duplicates:
> > > >>
> > > >>
> > > >>
> > > >> fetching
> > >
> > >
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/ni
> > >euws/economie/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.
> > >
> trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/e
> > >conomie/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.
> > >nl/opinie/weblogs/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/
> > >
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieu
> > >ws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> > >
> > > >> fetching
> > >
> > >
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/ni
> > >euws/economie/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www
> > >.
> trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/
> > >weblogs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw
> > >.nl/nieuws/economie/
> www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/econo
> > >mie/
> www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/
> > >nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> > >
> > > >> fetching
> > >
> > >
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/op
> > >inie/weblogs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www
> > >.
> trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/
> > >weblogs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw
> > >.nl/nieuws/economie/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblo
> > >gs/
> www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/o
> > >pinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> > >
> > > > >> This is not desired behavior, as you'd expect. The question is,
> > > > >> where to fix and how to fix it? Is it a problem with the parser? Or
> > > > >> is it fixable using some freaky url filter for this one?
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Cheers,
> > > >
> > > > --
> > > > AJ Chen, PhD
> > > > Chair, Semantic Web SIG, sdforum.org
> > > > http://web2express.org
> > > > twitter @web2express
> > > > Palo Alto, CA, USA
> >
>
> Markus Jelsma - Technisch Architect - Buyways BV
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
