Hi,

That pattern works nicely with some of the repeating URLs, but not all of them.
I did manage to find a pattern that looks for repeating substrings and modified
it to match 3 out of my 4 example URLs. The 4th URL is caught by your pattern,
so everything seems fine.

The problem is, I'm not too familiar with regexes and the differences between 
the PCRE and Java variants.

In PHP I came up with:

/(?=((.+)(.?\2{8,})+))/

This detects a substring that is followed by at least 8 more copies of itself
(the {8,} applies to the backreference \2, not to the length of the substring).
It matches the following URLs:

http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_1999/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivirus
http://www.nrc.nl/dossiers/orkanen/slachtoffers_hulp/article1636844.ece/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/http/www.nytimes.com/2005/09/02/opinion/02krugman.html

The problem is that the pattern fails to match in Java! Does anyone here have 
any insight into modifying the pattern so that it works with Java's regex library?
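
For reference, here is a minimal sketch of how the PHP pattern above could be
tried with java.util.regex. It is not a confirmed diagnosis of the failure.
The class name, method names and example.com/example.org URLs are made up for
illustration. The sketch relies on three known differences: Java has no
/.../ delimiter syntax, backslashes must be doubled inside a Java string
literal, and Matcher.find() searches anywhere in the input whereas
Matcher.matches() requires the whole input to match.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative only.
class RepeatingUrlCheck {

    // The PHP pattern without its /.../ delimiters. The backslash is doubled
    // only because this is a Java string literal.
    static final Pattern REPEATING =
            Pattern.compile("(?=((.+)(.?\\2{8,})+))");

    // The same pattern with the PHP delimiters left in. Java treats the
    // slashes as literal characters that must occur in the URL.
    static final Pattern WITH_DELIMITERS =
            Pattern.compile("/(?=((.+)(.?\\2{8,})+))/");

    // find() looks for a match anywhere in the URL, like preg_match() does.
    static boolean isRepeating(String url) {
        return REPEATING.matcher(url).find();
    }

    // Builds a made-up URL whose tail repeats the given segment.
    static String buildUrl(String prefix, String segment, int times) {
        StringBuilder sb = new StringBuilder(prefix);
        for (int i = 0; i < times; i++) {
            sb.append(segment);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bad = buildUrl("http://example.com/a.ece/",
                              "www.example.org/office/", 10);
        String good = "http://example.com/opinie/columnisten/article2018983.ece";

        System.out.println(isRepeating(bad));                     // true
        System.out.println(isRepeating(good));                    // false
        System.out.println(WITH_DELIMITERS.matcher(bad).find());  // false
    }
}
```

Judging by the rule quoted below, a pattern in conf/regex-urlfilter.txt is
written with a leading - and single backslashes, so the doubled backslash
applies only when the pattern lives in Java source code. Leftover delimiters
or the use of matches() are only two possible causes of the failure; on very
long URLs the nested quantifiers may also backtrack heavily.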

Cheers,

On Wednesday 22 September 2010 20:29:01 AJ Chen wrote:
> the conf/regex-urlfilter.txt file has an exclusion rule that should skip
> these viral urls.
> 
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> 
> -aj
> 
> On Wed, Sep 22, 2010 at 4:48 AM, Markus Jelsma <[email protected]> wrote:
> > Well, using a regex to catch these troublemakers isn't going to be
> > useful. Although I caught the first faulty URLs, there can be many more
> > and it's unpredictable; here's just a random pick from the list of
> > errors:
> > 
> > http://www.trouw.nl/achtergrond/Dossiers/article1851907.ece/www.invest.is
> > /Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-Cente
> > rs-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/www.inves
> > t.is/Key-Sectors/Data-Centers-in-Iceland/www.invest.is/Key-Sectors/Data-C
> > enters-in-Iceland/www.invest.is/Key-Sectors/Data-Centers-in-Iceland/
> > 
> > Here's another very disturbing URL it's trying to fetch:
> > 
> > http://www.nrc.nl/krant/article1860140.ece/http/www.theregister.com/2005/
> > 02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_l
> > icenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx
> > /http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.ther
> > egister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/20
> > 05/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpid
> > a_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovo
> > nyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.t
> > heregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com
> > /2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/el
> > pida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_
> > ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/ww
> > w.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.
> > com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04
> > /elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licens
> > es_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http
> > /www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregist
> > er.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02
> > /04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_lic
> > enses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/h
> > ttp/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.thereg
> > ister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005
> > /02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_
> > licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovony
> > x/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.the
> > register.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2
> > 005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpi
> > da_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ov
> > onyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.
> > theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.co
> > m/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/e
> > lpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses
> > _ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/w
> > ww.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister
> > .com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/0
> > 4/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licen
> > ses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/htt
> > p/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregis
> > ter.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/0
> > 2/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_li
> > censes_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/
> > http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.there
> > gister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/200
> > 5/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida
> > _licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovon
> > yx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.th
> > eregister.com/2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/
> > 2005/02/04/elpida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elp
> > ida_licenses_ovonyx/http/www.theregister.com/2005/02/04/elpida_licenses_o
> > vonyx/http/www.theregister.com/2005/02/04/elpida_licenses_ovonyx/http/www
> > .theregister.com/2005/02/04/elpida_licenses_ovonyx/
> > 
> > It seems these bad URLs are somehow found by the parser and get fetched
> > the next time, and the time after that, making the URL grow longer and
> > longer with each fetch, parse and updateDB cycle.
> > 
> > http://www.nrc.nl/dossiers/computerbeveiliging/virussen/melissa_maart_199
> > 9/article1513468.ece/www.microsoft.com/office/www.microsoft.com/office/ww
> > w.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/office/
> > www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/offic
> > e/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/off
> > ice/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com/o
> > ffice/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.com
> > /office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft.c
> > om/office/www.microsoft.com/office/www.microsoft.com/office/www.microsoft
> > .com/office/www.microsoft.com/office/www.microsoft.com/office/www.microso
> > ft.com/office/www.microsoft.com/office/www.microsoft.com/office/www.micro
> > soft.com/office/www.microsoft.com/office/www.microsoft.com/office/antivir
> > us
> > 
> > This doesn't look good at all. Does anyone have a suggestion or a pointer?
> > 
> > -----Original message-----
> > From: Markus Jelsma <[email protected]>
> > Sent: Wed 22-09-2010 12:12
> > To: [email protected];
> > Subject: Funky duplicate url's
> > 
> > Hi,
> > 
> > This is not about deduplication, but about preventing certain URLs from
> > ending up in the CrawlDB. I'm crawling a news site for testing purposes; it
> > has the usual categories etc. News item pages feature a gray text block
> > that contains some URLs as well. See
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece for an example.
> > 
> > The problem is, the parser somehow manages to concatenate the href with
> > the inner anchor text (which happens to be a URL, as you can see). So,
> > subsequent fetches are completely messed up, and I'm almost only fetching
> > duplicates:
> > 
> > fetching
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/ni
> > euws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www
> > .trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws
> > /economie/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.tro
> > uw.nl/opinie/weblogs/www.trouw.nl/opinie/weblogs/www.trouw.nl/opinie/webl
> > ogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw.nl
> > /nieuws/economie/www.trouw.nl/opinie/weblogs/article2012945.ece
> > fetching
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/ni
> > euws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/ww
> > w.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opini
> > e/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.tr
> > ouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/e
> > conomie/www.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trou
> > w.nl/nieuws/economie/www.trouw.nl/nieuws/economie/article1504915.ece
> > fetching
> > http://www.trouw.nl/opinie/columnisten/article2018983.ece/www.trouw.nl/op
> > inie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/ww
> > w.trouw.nl/opinie/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opini
> > e/weblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.tr
> > ouw.nl/nieuws/economie/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/w
> > eblogs/www.trouw.nl/nieuws/economie/www.trouw.nl/opinie/weblogs/www.trouw
> > .nl/opinie/weblogs/www.trouw.nl/nieuws/economie/article1504915.ece
> > 
> > This is not the desired behavior, as you'd expect. The question is where
> > and how to fix it. Is it a problem with the parser? Or is it fixable
> > using some freaky URL filter for this one?
> > 
> > Cheers,

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
