Hi Julien,

IMHO, the usage message of dedup is confusing, since the "DeduplicationJob" shown there is the name of the Java class, not the name used on the nutch command line.
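For example, running the command without arguments on 1.8 prints the class name (the output below is exactly what you quoted in your mail):

    $ bin/nutch dedup
    Usage: DeduplicationJob <crawldb>

A new user can easily read "DeduplicationJob" as something to type on the command line, instead of "bin/nutch dedup".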
Maybe it should be:

Usage: bin/nutch dedup <crawldb>

Quote from the page you updated: "This command takes a path to a crawldb as parameter...", i.e. a crawldb path as the parameter. That is clearer for users who run deduplication from the nutch command. CMIIW.

On Fri, May 16, 2014 at 6:21 PM, Julien Nioche <[email protected]> wrote:

> Hi
>
> The dedup is now independent from any specific backend, as you can see by
> typing './nutch dedup':
>
> *Usage: DeduplicationJob <crawldb>*
>
> What it does is mark the duplicates within the crawldb; this is then used
> by the indexer to delete the corresponding entries.
>
> I have updated the description on
> https://wiki.apache.org/nutch/bin/nutch%20dedup.
>
> Thanks!
>
> Julien
>
> On 15 May 2014 05:29, Bayu Widyasanyata <[email protected]> wrote:
>
> > Hi All,
> >
> > I want to run deduplication on nutch 1.8 using the command: nutch dedup
> > <solr_URL>, since the "nutch solrdedup" command is not supported anymore
> > on 1.8. But this command raised an error:
> >
> > 2014-05-15 11:19:59,334 INFO crawl.DeduplicationJob - DeduplicationJob: starting at 2014-05-15 11:19:59
> > 2014-05-15 11:19:59,749 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-05-15 11:19:59,831 ERROR security.UserGroupInformation - PriviledgedActionException as:root cause:java.io.IOException: No FileSystem for scheme: http
> > 2014-05-15 11:19:59,833 ERROR crawl.DeduplicationJob - DeduplicationJob: java.io.IOException: No FileSystem for scheme: http
> >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1434)
> >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1455)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
> >         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
> >         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
> >         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
> >         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
> >         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
> >         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
> >         at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
> >         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
> >         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:422)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> >         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> >         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> >         at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:251)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >         at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:294)
> >
> > What does "No FileSystem for scheme: http" mean?
> > What am I missing here?
> >
> > I used nutch 1.7 previously with success.
> >
> > Thank you.-
> >
> > --
> > wassalam,
> > [bayu]
>
> --
> Open Source Solutions for Text Engineering
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
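If I read your explanation above correctly, the 1.8 replacement for the old solrdedup is a two-step sequence, roughly like this (crawl/crawldb and crawl/segments are only example paths, and the exact indexing options depend on the setup):

    # step 1: mark duplicate entries inside the crawldb (a filesystem path, not a URL)
    bin/nutch dedup crawl/crawldb

    # step 2: a subsequent indexing run deletes the entries marked as duplicates
    bin/nutch index crawl/crawldb -dir crawl/segments

That would also explain my original error above: the job expects a filesystem path, so when it got a URL, Hadoop looked for a FileSystem implementation for the "http" scheme and found none.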
--
wassalam,
[bayu]

