The URL is stored in a different order (reversed domain
name:protocol:port and path) from the one you normally see in your web
browser, so that URLs from the same domain sort next to each other and
can be scanned more quickly in NoSQL data stores like HBase. Nutch has a
brief explanation and convenience utility methods for this in TableUtil
(http://nutch.apache.org/apidocs-2.0/org/apache/nutch/util/TableUtil.html)
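
To make the reordering concrete, here is a rough sketch of how such a
key could be built. This is not the Nutch source: the class name
UrlKeySketch and its reverseUrl method are made up for illustration, and
the exact key format TableUtil produces should be checked against its
Javadoc.

```java
import java.net.URI;

public class UrlKeySketch {

    /**
     * Hypothetical helper (not Nutch's TableUtil): builds a key of the
     * form reversed-host:protocol[:port]path[?query].
     */
    public static String reverseUrl(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + url);
        }

        // Reverse the dot-separated host labels: bar.foo.com -> com.foo.bar
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }

        // Append the protocol, then the port only if one was given explicitly.
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }

        // Path and query are kept in their original order.
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseUrl("http://bar.foo.com:8983/to/index.html?a=b"));
        System.out.println(reverseUrl("http://www.foo.com/"));
    }
}
```

With keys built this way, every page under foo.com starts with
"com.foo", so a store that keeps rows sorted by key holds them next to
each other.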


-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Monday, August 13, 2012 9:25 AM
To: [email protected]
Subject: updatedb error in nutch-2.0



Hello,


I get the following error when I do bin/nutch updatedb in nutch-2.0 with
hbase

java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
        at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:54)
        at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:37)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

I see this is because of reversing and unreversing urls. What is the
idea behind this reversal and unreversal in nutch-2.0?

Thanks.
Alex.

 
