FYI, I have attached a patch in NUTCH-1448.

On Mon, Aug 13, 2012 at 7:54 PM, <[email protected]> wrote:
> I found out that the key sent to
> unreverseUrl in DbUpdateMapper.map was ":index.php/http"
>
>
> This happened in the depth 3 and I checked seed file there was no line in
> the form of http:/index.php
>
> Thanks.
> Alex.
>
>
>
> -----Original Message-----
> From: Ferdy Galema <[email protected]>
> To: user <[email protected]>
> Sent: Mon, Aug 13, 2012 1:53 am
> Subject: Re: updatedb error in nutch-2.0
>
>
> Hi,
>
> In the specific case of Alex, it means that a row name in the database is
> malformed. Looking at the stacktrace lines in TableUtil, it looks like an
> url is stored without protocol (at least without a ":"). This is probably
> because of redirected urls not correctly being checked for wellformedness.
> If you look at line 664 in the FetcherReducer (HEAD) it writes out a new
> url directly as a row in the database. I have never experienced this
> exception and this might be because I changed some behaviour that makes
> sure a redirected url is handled a bit more like a general outlink. I have
> created an issue for this that I will update shortly:
> https://issues.apache.org/jira/browse/NUTCH-1448
>
> Ferdy.
>
> On Mon, Aug 13, 2012 at 2:52 AM, <[email protected]> wrote:
>
> > The url is stored in a different order (reversed domain
> > name:protocol:port and path) from the order normally seen in your web
> > browser so that it can be searched more quickly in NoSQL data stores
> > like hbase. Nutch has a brief explanation and convenience utility
> > methods around this at TableUtil
> > (http://nutch.apache.org/apidocs-2.0/org/apache/nutch/util/TableUtil.html)
> >
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > Sent: Monday, August 13, 2012 9:25 AM
> > To: [email protected]
> > Subject: updatedb error in nutch-2.0
> >
> >
> >
> > Hello,
> >
> >
> > I get the following error when I do bin/nutch updatedb in nutch-2.0 with
> > hbase
> >
> > java.lang.ArrayIndexOutOfBoundsException: 1
> >         at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
> >         at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:54)
> >         at org.apache.nutch.crawl.DbUpdateMapper.map(DbUpdateMapper.java:37)
> >         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> >
> > I see this is because of reversing and unreversing urls. What is the
> > idea behind this reversal and unreversal in nutch-2.0?
> >
> > Thanks.
> > Alex.
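
For anyone following the thread, here is a rough sketch of the key layout described above (reversed domain name:protocol:port and path) and of why a row key without a host or protocol cannot be turned back into a url. This is not the actual TableUtil code or the NUTCH-1448 patch. The class name UrlKeySketch, its methods, and the choice to throw IllegalArgumentException are all my own assumptions for illustration. If the real code indexes the ":"-separated pieces without checking them, a key containing no ":" would yield a single piece, which would be consistent with the ArrayIndexOutOfBoundsException: 1 in the stack trace.

```java
import java.net.URI;

// Illustrative sketch only: NOT Nutch's TableUtil. All names here are hypothetical.
public class UrlKeySketch {

    // "bar.foo.com" -> "com.foo.bar" (applying it twice restores the original)
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder out = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            out.append(labels[i]);
            if (i > 0) out.append('.');
        }
        return out.toString();
    }

    // Builds "reversedHost:protocol[:port]/path[?query]", the order described in the thread.
    static String reverseUrl(String urlString) {
        URI uri = URI.create(urlString);
        // "http:/index.php" parses, but has no host, so it is rejected here.
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Url lacks protocol or host: " + urlString);
        }
        StringBuilder key = new StringBuilder(reverseHost(uri.getHost()));
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) key.append(':').append(uri.getPort());
        if (uri.getRawPath() != null) key.append(uri.getRawPath());
        if (uri.getRawQuery() != null) key.append('?').append(uri.getRawQuery());
        return key.toString();
    }

    // Inverse of reverseUrl, validating the pieces before indexing into them.
    static String unreverseUrl(String key) {
        int pathBegin = key.indexOf('/');
        if (pathBegin == -1) pathBegin = key.length();
        // limit -1 keeps empty tokens, so ":index.php" yields ["", "index.php"]
        String[] parts = key.substring(0, pathBegin).split(":", -1);
        if (parts.length < 2 || parts.length > 3
                || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException("Malformed row key: " + key);
        }
        StringBuilder url = new StringBuilder();
        url.append(parts[1]).append("://").append(reverseHost(parts[0]));
        if (parts.length == 3) url.append(':').append(parts[2]);
        url.append(key.substring(pathBegin));
        return url.toString();
    }

    public static void main(String[] args) {
        String key = reverseUrl("http://bar.foo.com:8983/to/index.html?a=b");
        System.out.println(key);
        System.out.println(unreverseUrl(key));
        try {
            unreverseUrl(":index.php/http"); // the key Alex reported
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A check of this kind at the point where a redirected url is written as a row would keep a form like http:/index.php out of the table in the first place, instead of failing later in updatedb. Note the sketch does not handle hosts that themselves contain ":" (IPv6 literals).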

