RE: readdb to dump a specific url

Markus Jelsma Sat, 04 Mar 2017 02:02:05 -0800

Hi - this very long-standing problem has been fixed in a Hadoop release more 
recent than the one you are using. Upgrade to 2.7.3, or to 2.8.0 once it is out.


Markus
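
To see which Hadoop release is on the path before upgrading, something like the following may help. It is only a minimal sketch: version_ge is a hypothetical helper (not part of Hadoop or Nutch), it assumes GNU sort with the -V option is available, and it assumes the first line of `hadoop version` output has the form "Hadoop X.Y.Z". The 2.7.3 threshold is taken from the note above.

```shell
# Hypothetical helper: succeeds if version $1 is >= version $2.
# Relies on GNU sort -V (version sort); the smaller version sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only run the check when a hadoop binary is actually on the PATH.
if command -v hadoop >/dev/null 2>&1; then
  # Assumption: the first output line looks like "Hadoop 2.7.3".
  current="$(hadoop version 2>/dev/null | head -n 1 | awk '{print $2}')"
  if [ -n "$current" ] && version_ge "$current" "2.7.3"; then
    echo "Hadoop $current is at or above 2.7.3"
  else
    echo "Hadoop '$current' looks older than 2.7.3; consider upgrading"
  fi
fi
```

If the reported version is below 2.7.3, the readdb -url lookup quoted below can be expected to keep failing until Hadoop is upgraded.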

 
 
-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Saturday 4th March 2017 3:49
> To: User <[email protected]>
> Subject: readdb to dump a specific url
> 
> I want to find out what the crawldb knows about some specific urls. According 
> to the nutch wiki, I should use nutch readdb with the -url option. But when I 
> do a command like the following, I get nasty "can't find class" exceptions.
> 
> 
> $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/popular/data/crawldb -url 
> http://fabulous.com/
> 
> The error message is:
> 
> Exception in thread "main" java.io.IOException: can't 
> find class: org.apache.nutch.protocol.ProtocolStatus because 
> org.apache.nutch.protocol.ProtocolStatus
>         at 
> org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:212)
>         at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:167)
>         at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:317)
>         at 
> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2256)
>         at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:680)
>         at 
> org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:99)
>         at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:465)
>         at 
> org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:472)
>         at org.apache.nutch.crawl.CrawlDbReader.run(CrawlDbReader.java:717)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:736)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> 
> 
> 
> The above message occurs for any url that is actually in the crawldb. If I 
> specify a url that does not exist, I get a more understandable message. Also, 
> nutch readdb -stats works reliably.
> How can we make this work?
>
