Hi Remi,

it's not a bug, the pattern is wrong: the substitution refers to a
captured group $1, but the pattern does not capture anything.
The pattern should be:

    <pattern>http://google1.com/(.+)</pattern>

Now $1 is defined and contains the part matched by .+
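
To see the difference outside of Nutch, here is a minimal sketch in plain
Java. It assumes the rule is applied with Matcher.replaceAll, as the stack
trace suggests; the class and method names are made up for illustration and
are not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class CaptureGroupDemo {
    // Applies one pattern/substitution pair with Matcher.replaceAll.
    static String rewrite(String pattern, String substitution, String url) {
        return Pattern.compile(pattern).matcher(url).replaceAll(substitution);
    }

    // Returns true if applying the rule throws IndexOutOfBoundsException,
    // which is what happens when $1 refers to a group the pattern lacks.
    static boolean failsWithMissingGroup(String pattern, String substitution, String url) {
        try {
            rewrite(pattern, substitution, url);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String url = "http://google1.com/whatever";
        String substitution = "http://google.com/$1";

        // No parentheses in the pattern, so group 1 does not exist.
        System.out.println(failsWithMissingGroup("http://google1.com/.+", substitution, url));

        // With (.+) the rest of the URL is captured and reused as $1.
        System.out.println(rewrite("http://google1.com/(.+)", substitution, url));
    }
}
```

The first call reports the failure, the second one returns
http://google.com/whatever. Note that the exception only shows up when the
pattern actually matches the URL; a URL that does not match is returned
unchanged.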

Besides, the rule

<regex>
   <pattern>^http://google1\.com/</pattern>
   <substitution>http://google.com/</substitution>
</regex>

will do (almost) the same and should be faster - capturing
content has some cost.
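
As for the "almost": a rough sketch in plain Java comparing a capturing
rule (http://google1.com/(.+) with substitution http://google.com/$1) and
the anchored prefix rule. The class and method names are invented for the
example, and it assumes both rules are applied with Matcher.replaceAll.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class PrefixRuleDemo {
    // Capturing rule: unanchored, unescaped dots, needs at least one
    // character after the slash.
    static final Pattern CAPTURING = Pattern.compile("http://google1.com/(.+)");
    // Prefix rule: anchored at the start, literal dot, no capturing.
    static final Pattern PREFIX = Pattern.compile("^http://google1\\.com/");

    static String withCapture(String url) {
        return CAPTURING.matcher(url).replaceAll("http://google.com/$1");
    }

    static String withPrefix(String url) {
        return PREFIX.matcher(url).replaceAll("http://google.com/");
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://google1.com/whatever",               // both rules rewrite it
            "http://google1.com/",                       // only the prefix rule matches
            "http://google1xcom/whatever",               // only the unescaped dot matches
            "http://example.com/?u=http://google1.com/a" // only the unanchored rule matches
        };
        for (String url : urls) {
            System.out.println(url + " -> " + withCapture(url) + " | " + withPrefix(url));
        }
    }
}
```

For ordinary URLs both rules give the same result. They differ only in the
corner cases listed in the comments.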

Sebastian


On 04/02/2012 09:40 AM, remi tassing wrote:
Hi all,

I just found a weird error and it looks like a JDK bug but I'm not sure.
Whenever replacing a URL-A, that contains a number, with a URL-B, then I
get an error: "IndexOutOfBoundsException: No group 1"

In my regex-normalize.xml, I have:
<regex>
   <pattern>http://google1.com/.+</pattern>
   <substitution>http://google.com/$1</substitution>
</regex>

and trying:
echo 'http://google2.com/whatever' | bin/nutch org.apache.nutch.net.URLNormalizerChecker
gives:
Checking combination of all URLNormalizers available
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
         at java.util.regex.Matcher.start(Matcher.java:374)
         at java.util.regex.Matcher.appendReplacement(Matcher.java:830)
         at java.util.regex.Matcher.replaceAll(Matcher.java:905)
         at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:181)
         at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:188)
         at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
         at org.apache.nutch.net.URLNormalizerChecker.checkAll(URLNormalizerChecker.java:83)
         at org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:110)

Have you experienced this before?

Remi

