Re: Re: Funky duplicate url's, getting much worse!

Julien Nioche Wed, 29 Sep 2010 00:43:38 -0700

Hi guys,

IIRC the OutlinkExtractor is the same in parse-tika and parse-html. Could
you please open a JIRA and attach a patch for the TestOutlinkExtractor so
that we can reproduce the problem?


Thanks

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


On 28 September 2010 21:14, Markus Jelsma <[email protected]> wrote:

> Hello Mathijs,
>
>
>
> I inspected the code base and found that the problem is most likely in the
> parse-tika code where the text is being extracted and the OutlinkExtractor
> is called. The OutlinkExtractor uses a regular expression that can output a
> lot of garbage. I've added a test to the TestOutlinkExtractor where it's
> clear that at least one URL does not pass but it does not point me in the
> right direction for solving the relative path problem.
>
>
>
> Unless someone knows, I'll try to find out how the OutlinkExtractor works
> with the current base URL, because a plain relative URL in the test will
> obviously fail.
>
>
>
> Thanks for the pointer =)
>
>
>
> Cheers,
>
> -----Original message-----
> From: Mathijs Homminga <[email protected]>
> Sent: Tue 28-09-2010 21:01
> To: [email protected];
> Subject: Re: Funky duplicate url's, getting much worse!
>
> Hi Markus,
>
> I remember Nutch had some troubles with honoring the page's BASE tag when
> resolving relative outlinks.
> However, I don't see this BASE tag being used in the HTML pages you provide,
> so that might not be it.
>
> Mathijs
>
>
> On Sep 28, 2010, at 18:51 , Markus Jelsma wrote:
>
> > Anyone? Is there a proper solution for this issue? As expected, the regex
> won't catch all imaginable kinds of funky URLs that somehow end up in the
> CrawlDB. Before the weekend, I added another news site to the tests I
> conduct and let it run continuously. Unfortunately, the generator now comes
> up with all kinds of completely useless URLs. They do resolve, but only
> because the web application ignores most parts of the URL.
> >
> >
> >
> > This is the URL that should be considered as proper URL:
> >
> > http://www.blikopnieuws.nl/nieuwsblok
> >
> >
> >
> > Here are two URL's that are completely useless:
> >
> >
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119033/bericht/119047/economie
> >
> >
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/bericht/119035/archief/bericht/119038/archief/
> >
> >
> >
> > It is very hard to use deduplication on these, simply because the content
> changes too much as time progresses (the latest-news block, for
> example). It is therefore necessary to keep these URLs from ending up in
> the CrawlDB, so as not to waste disk space, CrawlDB update time,
> and a huge load of bandwidth. In my current fetch I'm probably going to
> waste at least a few GB.
> >
> >
> >
> > Looking at the HTML source, it looks like the parser cannot properly
> handle relative URL's. It is, of course, quite ugly for a site to do this
> but the parser must not fool itself and come up with URL's that really
> aren't there. Combined with the issue i began the thread with i believe the
> following two problems are present - the parser returns imaginary (false)
> URL's because of:
> >
> > 1. relative href's;
> >
> > 2. URLs in anchors (that is, the XML element's body) next to the href
> attribute.
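
To make the two suspected causes above concrete, here is a minimal Python sketch of standard relative-reference resolution. It is only an illustration of the mechanism, not Nutch or Tika code, and the href and anchor-text values are hypothetical inputs modelled on the URLs quoted in this thread. It assumes the base URL ends with a slash.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical inputs, modelled on URLs quoted in this thread.
PAGE = "http://www.blikopnieuws.nl/nieuwsblok/archief/"
ARTICLE = "http://www.trouw.nl/opinie/columnisten/article2018983.ece/"

# 1. A path-relative href is appended below the directory of the page URL.
nested = urljoin(PAGE, "bericht/119039/")

# A root-relative href replaces the whole path, so it cannot nest.
rooted = urljoin(PAGE, "/nieuwsblok/bericht/119039/")

# 2. Link text without a scheme has no host part when parsed as a URL
#    reference, so it would be resolved like any other relative path.
text = "www.trouw.nl/nieuws/economie/"
has_host = bool(urlsplit(text).netloc)
joined = urljoin(ARTICLE, text)

print(nested)
print(rooted)
print(has_host)
print(joined)
```

If link text like this is ever fed into the same resolution step as a real href, the result has the same shape as the nested URLs reported in this thread. Whether that is what actually happens inside the parser is exactly the open question.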
> >
> >
> >
> > Please help in finding the source of the problem (Tika? Nutch?) and how
> to proceed in having it fixed so other users won't waste bandwidth, disk
> space and CPU cycles =)
> >
> >
> >
> >
> >
> >
> >
> > Oh, here's a snippet of the fetch job that's currently running, also,
> notice the news item with the 119039 ID, it's the same as above although
> that copy/paste was 15 minutes ago. Most item ID's you see below continue to
> return in the current log output.
> >
> >
> >
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/bericht/119042/hetweer/game/persberichtaanleveren
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/bericht/119034/bericht/119036/game/tipons
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/archief/game/bericht/119035/bericht/119033/disclaimer
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/game/rss/archief/bericht/119035/bericht/119036/groningen
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/bericht/119039/rss/bericht/119042/persberichtaanleveren
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119037/archief/bericht/119036/bericht/119038/zuidholland
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/bericht/119036/game/hetweer/vandaag
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/game/archief/bericht/119035/game/archief/donderdag
> > fetching http://www.blikopnieuws.nl/nieuwsblok/hetweer/rss/rss/rss/auto
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119037/hetweer/bericht/119034/archief/zeeland
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/bericht/119041/bericht/119047/lifestyle
> > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2488
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119035/archief/bericht/119037/game/bericht/119037/N381_moet_mooi_in_landschap_worden_gelegd.html
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/game/game/bericht/119037/archief/bericht/119038/game/lennythelizard
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/bericht/119041/archief/game/bericht/119039/bericht/119050/A-brug_in_Groningen_opnieuw_defect.html
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/game/bericht/119035/game/bericht/119035/noordbrabant
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/bericht/119036/rss/bericht/119036/
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119033/archief/archief/bericht/119043/game/bioballboom
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/archief/bericht/119033/archief/bericht/119046/wetenschap
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/bericht/119042/archief/hetweer/bericht/119042/Kernreactor_Petten_weer_stilgelegd.html
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/hetweer/bericht/119034/archief/game/archief/rss/
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/bericht/119039/hetweer/game/archief/overijssel
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/hetweer/game/archief/bericht/119038/bericht/119048/binnenland
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119038/game/bericht/119042/bericht/119038/game/auto
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119042/game/archief/archief/bericht/119049/zeeland
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/game/game/archief/bericht/119043/archief/meewerken
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/game/bericht/119035/game/bericht/119034/gelderland
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119037/bericht/119042/game/bericht/119042/game/binnenland
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119036/bericht/119042/archief/bericht/119035/bericht/119035/gelderland
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/game/archief/archief/game/bericht/119038/archief/lifestyle
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119034/archief/archief/bericht/119041/hetweer/archief/woensdag
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119035/archief/bericht/119042/archief/bericht/119047/lifestyle
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/archief/bericht/119034/bericht/119047/glossy
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119038/game/bericht/119038/bericht/119045/glossy
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119035/bericht/119036/game/bericht/119042/archief/zaterdag
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/game/bericht/119036/bericht/119035/archief/bericht/119046/bericht/119064/A4_ritueel_begraven.html
> > -activeThreads=50, spinWaiting=45, fetchQueues.totalSize=2493
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/archief/bericht/119037/bericht/119037/archief/bericht/119046/economie
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/hetweer/archief/archief/bericht/119033/bericht/119037/overijssel
> > fetching
> http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/game/bericht/119036/bericht/119037/
> >
> >
> > -----Original message-----
> > From: Markus Jelsma <[email protected]>
> > Sent: Wed 22-09-2010 20:47
> > To: [email protected];
> > Subject: RE: Re: Funky duplicate url's
> >
> > Thanks! I've already implemented a similar (but not as generic) regex to
> catch these URLs. But it is, of course, not a proper solution to solve a
> parsing problem with subsequent regexes. Please correct me if I'm wrong,
> but I'm quite sure those URLs are not to be found in the HTML sources. It
> had better be fixed where the problem seems to be.
> >
> >
> >
> > I'll test your regex, but I'd still like to know where the exact problem
> lies and hopefully fix it or help fix it.
> >
> >
> >
> > Thanks
> >
> > -----Original message-----
> > From: AJ Chen <[email protected]>
> > Sent: Wed 22-09-2010 20:29
> > To: [email protected];
> > Subject: Re: Funky duplicate url's
> >
> > the conf/regex-urlfilter.txt file has an exclusion rule that should skip
> > these viral urls.
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
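
The rule above can be tried outside Nutch with a short Python sketch. The leading `-` is the filter's "exclude" marker and is not part of the pattern. The helper name `is_excluded` is made up for this example, and the sketch assumes the filter searches for the pattern anywhere in the URL. The first sample URL is a shortened form of the microsoft.com example from this thread, and the second is one of the blikopnieuws.nl URLs quoted in it.

```python
import re

# The pattern from conf/regex-urlfilter.txt, without the leading "-" marker.
# It needs the same segment X in the shape /X/<any>/X/<any>/X/.
LOOP_RULE = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")


def is_excluded(url: str) -> bool:
    """Return True if the loop rule would reject this URL (hypothetical helper)."""
    return LOOP_RULE.search(url) is not None


# Shortened from the thread: one segment recurs at every second position.
strict_loop = (
    "http://www.nrc.nl/dossiers/article1513468.ece/"
    "www.microsoft.com/office/www.microsoft.com/office/"
    "www.microsoft.com/office/antivirus"
)

# From the thread: segments repeat, but never three times at a stride of two.
irregular_loop = (
    "http://www.blikopnieuws.nl/nieuwsblok/bericht/119039/archief/archief/"
    "bericht/119033/bericht/119047/economie"
)

print(is_excluded(strict_loop))
print(is_excluded(irregular_loop))
```

The rule only fires when one segment recurs at every second position three times in a row, so URLs whose repeats are spaced irregularly pass through unfiltered.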
> >
> > -aj
> >
> > On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <[email protected]
> >wrote:
> >
> >> Well, using a regex to catch these troublemakers isn't going to be
> useful.
> >> Although i caught the first faulty url's, there can be many more and
> it's
> >> unpredictable; here's just a random pick from the list of errors:
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> >>
> >>
> >>
> >>
> >>
> >> Here's another very disturbing url it's trying to fetch:
> >>
> >>
> >>
> >>
> >>
> >>
> http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
> >>
> >>
> >>
> >>
> >>
> >> It seems these bad URLs are somehow found by the parser and get fetched
> >> the next time, and the time after that, making the URL grow longer and
> >> longer with each fetch, parse and updateDB cycle.
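
The growth per cycle can be simulated with a small Python sketch. It is a toy model and not Nutch code: the function `crawl_rounds`, the seed and the two relative hrefs are all hypothetical, and it assumes every fetched page contains the same relative links and that the server answers any path.

```python
from urllib.parse import urljoin


def crawl_rounds(seed, hrefs, rounds):
    """Return the sorted URL list discovered in each round of a toy crawl.

    Every page is assumed to contain the same relative hrefs, so each
    round resolves all hrefs against all URLs found in the previous round.
    """
    frontier = {seed}
    history = []
    for _ in range(rounds):
        frontier = {urljoin(page, href) for page in frontier for href in hrefs}
        history.append(sorted(frontier))
    return history


# Hypothetical seed and relative links.
history = crawl_rounds(
    "http://www.blikopnieuws.nl/nieuwsblok/",
    ["archief/", "bericht/119039/"],
    rounds=3,
)

for number, urls in enumerate(history, start=1):
    print(number, len(urls), max(len(url) for url in urls))
```

Under these assumptions the number of distinct URLs doubles every round and the longest URL keeps growing, which is the pattern described above.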
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus
> >>
> >>
> >>
> >>
> >>
> >> This doesn't look good at all. Anyone got a suggestion or some pointer?
> >>
> >>
> >>
> >>
> >>
> >>
> >> -----Original message-----
> >> From: Markus Jelsma <[email protected]>
> >> Sent: Wed 22-09-2010 12:12
> >> To: [email protected];
> >> Subject: Funky duplicate url's
> >>
> >> Hi,
> >>
> >>
> >>
> >> This is not about deduplication, but about preventing certain URLs from
> >> ending up in the CrawlDB. I'm crawling a news site for testing purposes; it
> >> has the usual categories etc. News item pages feature a gray text block
> >> that contains some URLs as well. See
> >> http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example.
> >>
> >>
> >>
> >> The problem is, the parser somehow manages to concatenate the href with the
> >> inner anchor text (which happens to be a URL, as you can see). So
> >> subsequent fetches are completely messed up; I'm almost only fetching
> >> duplicates:
> >>
> >>
> >>
> >> fetching
> >>
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> >> fetching
> >>
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> >> fetching
> >>
> http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> >>
> >>
> >>
> >> This is not desired behavior, as you'd expect. The question is, where to
> >> fix and how to fix it? Is it a problem with the parser? Or is it fixable
> >> using some freaky url filter for this one?
> >>
> >>
> >>
> >>
> >>
> >> Cheers,
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> > --
> > AJ Chen, PhD
> > Chair, Semantic Web SIG, sdforum.org
> > http://web2express.org
> > twitter @web2express
> > Palo Alto, CA, USA
>
>
