The URL is stored in a different order (reversed domain name:protocol:port and path) from the one you normally see in your web browser, so that it can be searched more quickly in NoSQL data stores like HBase. Nutch has a brief explanation of this, plus convenience utility methods, in TableUtil (http://nutch.apache.org/apidocs-2.0/org/apache/nutch/util/TableUtil.html).
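
To make the key layout concrete, here is a minimal sketch of the reversal in plain Java. It is not Nutch's actual code: the class name ReversedUrlKey and its reverseUrl method are made up for illustration, and it assumes the input is an absolute URL with a host. The point is that once the domain is reversed, rows for the same domain sort next to each other, so a range scan over a key prefix finds them all.

```java
import java.net.URI;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not the Nutch TableUtil implementation.
final class ReversedUrlKey {

    // Builds "reversed.domain:protocol[:port]/path[?query]" from an absolute URL.
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }

        // Reverse the dot-separated host labels: bar.foo.com -> com.foo.bar
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }

        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            // The port is only written when the URL names one explicitly.
            key.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // A TreeSet keeps keys sorted, roughly the way a row-key-ordered store would.
        TreeSet<String> keys = new TreeSet<>();
        keys.add(reverseUrl("http://bar.foo.com:8983/to/index.html?a=b"));
        keys.add(reverseUrl("http://www.example.org/"));
        keys.add(reverseUrl("http://baz.foo.com/page"));
        for (String k : keys) {
            System.out.println(k);
        }
    }
}
```

With these inputs the two foo.com keys end up adjacent in the sorted output, ahead of the example.org key, which is what makes per-domain scans cheap.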

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Monday, August 13, 2012 9:25 AM
To: [email protected]
Subject: updatedb error in nutch-2.0

Hello,

I get the following error when I do bin/nutch updatedb in nutch-2.0 with hbase

java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
    at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:54)
    at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:37)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

I see this is because of reversing and unreversing urls. What is the idea behind this reversal and unreversal in nutch-2.0?

Thanks.
Alex.

