Hi,

The dedup job is now independent of any specific backend, as you can see by typing './nutch dedup':

*Usage: DeduplicationJob <crawldb>*

What it does is mark the duplicates within the crawldb; the indexer then uses those marks to delete the corresponding entries from the index. The "No FileSystem for scheme: http" error you saw is Hadoop trying to interpret the Solr URL as a filesystem path, because the command now expects a crawldb path rather than a Solr URL.
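For example, a minimal run might look like this, assuming your crawl data lives under crawl/ (that path is just an illustration; point the command at your own crawldb):

    # mark duplicate entries in the crawldb; the next indexing run
    # then deletes the corresponding documents from the index
    bin/nutch dedup crawl/crawldb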
I have updated the description at https://wiki.apache.org/nutch/bin/nutch%20dedup.

Thanks!

Julien

On 15 May 2014 05:29, Bayu Widyasanyata <[email protected]> wrote:
> Hi All,
>
> I want to run deduplication on Nutch 1.8 using the command: nutch dedup
> <solr_URL>, since the "nutch solrdedup" command is no longer supported in
> 1.8. But this command raised an error:
>
> 2014-05-15 11:19:59,334 INFO crawl.DeduplicationJob - DeduplicationJob:
> starting at 2014-05-15 11:19:59
> 2014-05-15 11:19:59,749 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2014-05-15 11:19:59,831 ERROR security.UserGroupInformation -
> PriviledgedActionException as:root cause:java.io.IOException: No FileSystem
> for scheme: http
> 2014-05-15 11:19:59,833 ERROR crawl.DeduplicationJob - DeduplicationJob:
> java.io.IOException: No FileSystem for scheme: http
>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1434)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1455)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
>         at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>         at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:251)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:294)
>
> What does this "No FileSystem for scheme: http" error mean?
> What am I missing here?
>
> I used Nutch 1.7 previously with success.
>
> Thank you.
>
> --
> wassalam,
> [bayu]

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

