Re: Funky duplicate url's, getting much worse!

Markus Jelsma Wed, 29 Sep 2010 02:15:01 -0700

Yes, but I need a little more testing. Does anyone know how I can run only
that test class? I currently use ant -v test -l logfile and have to dig through
the log file. It also takes too long because of all the other tests.



On Wednesday 29 September 2010 09:43:04 Julien Nioche wrote:
> Hi guys,
> 
> IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could
> you please open a JIRA and attach a patch for the TestOutlinkExtractor so
> that we can reproduce the problem?
> 
> Thanks
> 
> Julien
> 
> > Hello Mathijs,
> >
> >
> >
> > I inspected the code base and found that the problem is most likely in
> > the parse-tika code where the text is being extracted and the
> > OutlinkExtractor is called. The OutlinkExtractor uses a regular
> > expression that can output a lot of garbage. I've added a test to the
> > TestOutlinkExtractor where it's clear that at least one URL does not pass
> > but it does not point me in the right direction for solving the relative
> > path problem.
> >
> >
> >
> > Unless someone already knows, I'll try to find out how the OutlinkExtractor
> > handles the current base URL, because a plain relative URL in the test
> > will obviously fail.
> >
> >
> >
> > Thanks for the pointer =)
> >
> >
> >
> > Cheers,
> >
> > -----Original message-----
> > From: Mathijs Homminga <[email protected]>
> > Sent: Tue 28-09-2010 21:01
> > To: [email protected];
> > Subject: Re: Funky duplicate url's, getting much worse!
> >
> > Hi Marcus,
> >
> > I remember Nutch had some troubles with honoring the page's BASE tag when
> > resolving relative outlinks.
> > However, I don't see this BASE tag being used in the HTML pages you
> > provided, so that might not be it.
> >
> > Mathijs
> >
> > On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
> > > > Anyone? Where is a proper solution for this issue? As expected, the
> > > > regex won't catch all imaginable kinds of funky URLs that somehow ended
> > > > up in the CrawlDB. Before the weekend, I added another news site to the
> > > > tests I conduct and let it run continuously. Unfortunately, the generator
> > > > now comes up with all kinds of completely useless URLs. They do exist,
> > > > but only because the web application ignores most parts of the URL.
> >
> > > This is the URL that should be considered as proper URL:
> > >
> > > http://www.blikopnieuws.nl/nieuwsblok
> > >
> > >
> > >
> > > > Here are two URLs that are completely useless:
> > > >
> > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> > > >
> > > > http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> >
> > > > It is very hard to use deduplication on these, simply because the
> > > > content changes too much as time progresses (the latest news block, for
> > > > example). It is therefore necessary to keep these URLs from ending up in
> > > > the CrawlDB, so as not to waste disk space, CrawlDB update time and a
> > > > huge amount of bandwidth. My current fetch is probably going to waste at
> > > > least a few GB.
> >
> > > > Looking at the HTML source, it looks like the parser cannot properly
> > > > handle relative URLs. It is, of course, quite ugly for a site to do this,
> > > > but the parser must not fool itself and come up with URLs that really
> > > > aren't there. Combined with the issue I began the thread with, I believe
> > > > two problems are present. The parser returns imaginary (false) URLs
> > > > because of:
> > > >
> > > > 1. relative hrefs;
> > > >
> > > > 2. URLs in anchors (that is, the XML element's body) next to the href
> > > >    attribute.
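
To make the relative-href hypothesis concrete, here is a minimal Python sketch. It is not Nutch or Tika code and it does not establish the actual cause. It only shows what standard URL resolution (Python's `urllib.parse.urljoin`) does when scheme-less text is treated as a relative reference. The `anchor_text` value and the `resolve_cycle` helper are assumptions made up for the illustration; the page URL is the example from the first message in this thread.

```python
from urllib.parse import urljoin


def resolve_cycle(base, relative, cycles):
    """Resolve `relative` against `base` repeatedly, using each result as the next base."""
    urls = []
    for _ in range(cycles):
        base = urljoin(base, relative)
        urls.append(base)
    return urls


page = "http://www.trouw.nl/opinie/columnisten/article2018983.ece"

# Assumed anchor text: it has no scheme, so it parses as a relative path.
anchor_text = "www.trouw.nl/nieuws/economie/"

# Standard resolution replaces the last path segment of the base.
print(urljoin(page, anchor_text))

# If the page URL is treated as a directory (trailing slash), the result
# itself ends in a slash, so every further cycle appends another copy.
for url in resolve_cycle(page + "/", anchor_text, 3):
    print(url)
```

Under these assumptions the second loop prints URLs of the same shape as the ones in the fetch logs: the page URL followed by a growing chain of `www.trouw.nl/...` segments.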
> >
> > > > Please help in finding the source of the problem (Tika? Nutch?) and how
> > > > to proceed in having it fixed so other users won't waste bandwidth, disk
> > > > space and CPU cycles =)
> >
> > > > Oh, here's a snippet of the fetch job that's currently running. Notice
> > > > the news item with ID 119039: it's the same as above, although that
> > > > copy/paste was 15 minutes ago. Most item IDs you see below keep
> > > > returning in the current log output.
> >
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > > > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > > > fetching http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> >
> > > -----Original message-----
> > > From: Markus Jelsma <[email protected]>
> > > Sent: Wed 22-09-2010 20:47
> > > To: [email protected];
> > > Subject: RE: Re: Funky duplicate url's
> > >
> > > > Thanks! I've already implemented a similar (but not as generic) regex to
> > > > catch these URLs. But it is, of course, not a proper solution to fix a
> > > > parsing problem with subsequent regexes. Please correct me if I'm wrong,
> > > > but I'm quite sure those URLs are not to be found in the HTML sources.
> > > > It had better be fixed where the problem actually is.
> > > >
> > > > I'll test your regex, but I'd still like to know where the exact problem
> > > > lies and hopefully fix it or help fix it.
> >
> > > Thanks
> > >
> > > -----Original message-----
> > > From: AJ Chen <[email protected]>
> > > Sent: Wed 22-09-2010 20:29
> > > To: [email protected];
> > > Subject: Re: Funky duplicate url's
> > >
> > > the conf/regex-urlfilter.txt file has an exclusion rule that should
> > > skip these viral urls.
> > >
> > > > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > >
> > > -aj
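
As a quick check of what that rule does and does not catch, here is a minimal Python sketch. It assumes the filter applies the rule as a regex search over the whole URL and that the leading `-` only marks the rule as an exclusion; both are assumptions about the filter, not taken from its source. The first URL is a shortened form of the microsoft.com example elsewhere in this thread, and the second is one of the blikopnieuws.nl URLs. `LOOP_RULE` and `is_excluded` are names invented for the sketch.

```python
import re

# Rule body as quoted above, without the leading "-" exclusion marker.
LOOP_RULE = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_excluded(url: str) -> bool:
    """True if the loop rule finds a match anywhere in the URL."""
    return LOOP_RULE.search(url) is not None


# Shortened from a URL in this thread: the same segment recurs at every
# other position, i.e. the /X/Y/X/Z/X/ shape the rule looks for.
looping = ("http://www.nrc.nl/article1513468.ece/www.microsoft.com/office/"
           "www.microsoft.com/office/www.microsoft.com/office/antivirus")

# URL from this thread: segments repeat, but never as /X/Y/X/Z/X/.
shuffled = ("http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/"
            "archief/bericht/119033/bericht/119047/economie")

print(is_excluded(looping))   # True
print(is_excluded(shuffled))  # False
```

The second result is consistent with the complaint later in the thread that a regex alone does not keep every looping URL out of the CrawlDB.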
> > >
> > > > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <[email protected]> wrote:
> > > >> Well, using a regex to catch these troublemakers isn't going to be
> > > >> useful. Although I caught the first faulty URLs, there can be many more
> > > >> and it's unpredictable. Here's just a random pick from the list of
> > > >> errors:
> >
> > http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is
> >/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Center
> >s-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.
> >is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Cent
> >ers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> >
> > >> Here's another very disturbing url it's trying to fetch:
> >
> > http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/
> >02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_li
> >censes_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/h
> >ttp/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregi
> >ster.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/0
> >2/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_lic
> >enses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/ht
> >tp/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregis
> >ter.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02
> >/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_lice
> >nses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/htt
> >p/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregist
> >er.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/
> >04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licen
> >ses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http
> >/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregiste
> >r.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/0
> >4/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licens
> >es_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/
> >www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister
> >.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04
> >/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_license
> >s_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/w
> >ww.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.
> >com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/
> >elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses
> >_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/ww
> >w.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.c
> >om/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/e
> >lpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_
> >ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
> >.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.co
> >m/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/el
> >pida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_o
> >vonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.
> >theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com
> >/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elp
> >ida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ov
> >onyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
> >heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/
> >2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpi
> >da_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovo
> >nyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.th
> >eregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2
> >005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpid
> >a_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovon
> >yx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.the
> >register.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/20
> >05/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida
> >_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovony
> >x/
> >
> > > >> It seems these bad URLs are somehow found by the parser and get fetched
> > > >> the next time, and the time after that, making the URL grow longer and
> > > >> longer with each fetch, parse and updateDB cycle.
> >
> > http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_199
> >9/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www
> >.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/ww
> >w.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/w
> >ww.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/
> >www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office
> >/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offic
> >e/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offi
> >ce/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/off
> >ice/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/of
> >fice/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/o
> >ffice/www.microsoft.com/office/www.microsoft.com/office/antivirus
> >
> > >> This doesn't look good at all. Anyone got a suggestion or some
> > >> pointer?
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> -----Original message-----
> > >> From: Markus Jelsma <[email protected]>
> > >> Sent: Wed 22-09-2010 12:12
> > >> To: [email protected];
> > >> Subject: Funky duplicate url's
> > >>
> > >> Hi,
> > >>
> > >>
> > >>
> > > >> This is not about deduplication, but about preventing certain URLs from
> > > >> ending up in the CrawlDB. I'm crawling a news site for testing purposes;
> > > >> it has the usual categories etc. News item pages feature a gray text
> > > >> block that contains some URLs as well. See
> > > >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example.
> > > >>
> > > >> The problem is that the parser somehow manages to concatenate the href
> > > >> with the inner anchor text (which happens to be a URL, as you can see).
> > > >> So subsequent fetches are completely messed up, and I'm fetching almost
> > > >> nothing but duplicates:
> > >>
> > >>
> > >>
> > >> fetching
> >
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/ni
> >euws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.
> >trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/e
> >conomie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.
> >nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/
> >www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieu
> >ws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> >
> > >> fetching
> >
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/ni
> >euws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www
> >.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/
> >weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw
> >.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/econo
> >mie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/
> >nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> >
> > >> fetching
> >
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/op
> >inie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www
> >.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/
> >weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw
> >.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblo
> >gs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/o
> >pinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> >
> > >> This is not desired behavior, as you'd expect. The question is, where
> > >> to fix and how to fix it? Is it a problem with the parser? Or is it
> > >> fixable using some freaky url filter for this one?
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Cheers,
> > >
> > > --
> > > AJ Chen, PhD
> > > Chair, Semantic Web SIG, sdforum.org
> > > http://web2express.org
> > > twitter @web2express
> > > Palo Alto, CA, USA
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
