Hi all,

I just ran into a strange error. It looks like a JDK bug, but I'm not sure.
Whenever I replace a URL A that contains a number with a URL B, I
get this error: "IndexOutOfBoundsException: No group 1"

In my regex-normalize.xml, I have:
<regex>
  <pattern>http://google1.com/.+</pattern>
  <substitution>http://google.com/$1</substitution>
</regex>

and trying:
echo 'http://google2.com/whatever' | bin/nutch org.apache.nutch.net.URLNormalizerChecker
gives:
Checking combination of all URLNormalizers available
Exception in thread "main" java.lang.IndexOutOfBoundsException: No group 1
        at java.util.regex.Matcher.start(Matcher.java:374)
        at java.util.regex.Matcher.appendReplacement(Matcher.java:830)
        at java.util.regex.Matcher.replaceAll(Matcher.java:905)
        at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.regexNormalize(RegexURLNormalizer.java:181)
        at org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.normalize(RegexURLNormalizer.java:188)
        at org.apache.nutch.net.URLNormalizers.normalize(URLNormalizers.java:286)
        at org.apache.nutch.net.URLNormalizerChecker.checkAll(URLNormalizerChecker.java:83)
        at org.apache.nutch.net.URLNormalizerChecker.main(URLNormalizerChecker.java:110)
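
To narrow this down outside Nutch, here is a minimal standalone sketch. It assumes the normalizer does little more than call Matcher.replaceAll with the pattern and substitution from the config, which is what the stack trace suggests. The class name NoGroupDemo and the helper normalize are made up for illustration, and the input URL is chosen so that it matches the pattern. It compares the pattern as written above with a variant that wraps the trailing part in a capture group.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class NoGroupDemo {

    // Applies one pattern/substitution pair with Matcher.replaceAll,
    // the JDK call that appears in the stack trace.
    static String normalize(String regex, String substitution, String url) {
        return Pattern.compile(regex).matcher(url).replaceAll(substitution);
    }

    public static void main(String[] args) {
        String url = "http://google1.com/whatever";

        // Variant with a capture group: $1 refers to whatever (.+) matched.
        System.out.println(
            normalize("http://google1.com/(.+)", "http://google.com/$1", url));

        // Pattern as in the config above: no capture group, but the
        // substitution still refers to $1.
        try {
            System.out.println(
                normalize("http://google1.com/.+", "http://google.com/$1", url));
        } catch (IndexOutOfBoundsException e) {
            System.out.println("IndexOutOfBoundsException: " + e.getMessage());
        }
    }
}
```

If the second call throws here as well, the exception may come from the substitution referring to $1 while the pattern defines no group, rather than from the digit in the host name.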

Have you experienced this before?

Remi
