Hi

The dedup job is now independent of any specific backend, as you can see by
typing './nutch dedup':

Usage: DeduplicationJob <crawldb>

What it does is mark the duplicates within the crawldb; the indexer then uses
that information to delete the corresponding entries from the index. This is
also why passing a Solr URL now fails: the job expects a crawldb path, and as
the stack trace shows, Hadoop has no FileSystem implementation for the 'http'
scheme.
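
For example, assuming a standard layout with the crawldb under crawl/crawldb
(the paths are illustrative, adjust them to your setup), the sequence would be
something like:

  # mark the duplicates in the crawldb -- note it takes a path, not a Solr URL
  bin/nutch dedup crawl/crawldb

  # then remove the entries marked as duplicates from the index, e.g. with
  # the cleaning job, which issues deletes for gone/duplicate entries
  bin/nutch clean crawl/crawldb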

I have updated the description on
https://wiki.apache.org/nutch/bin/nutch%20dedup.

Thanks!

Julien



On 15 May 2014 05:29, Bayu Widyasanyata <[email protected]> wrote:

> Hi All,
>
> I want to run deduplication on Nutch 1.8 using the command: nutch dedup
> <solr_URL>, since the "nutch solrdedup" command is no longer supported in 1.8.
> But this command raised an error:
>
> 2014-05-15 11:19:59,334 INFO  crawl.DeduplicationJob - DeduplicationJob: starting at 2014-05-15 11:19:59
> 2014-05-15 11:19:59,749 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-05-15 11:19:59,831 ERROR security.UserGroupInformation - PriviledgedActionException as:root cause:java.io.IOException: No FileSystem for scheme: http
> 2014-05-15 11:19:59,833 ERROR crawl.DeduplicationJob - DeduplicationJob: java.io.IOException: No FileSystem for scheme: http
>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1434)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1455)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
>         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
>         at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
>         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
>         at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:251)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:294)
>
> What does this "No FileSystem for scheme: http" error mean?
> What am I missing here?
>
> I used Nutch 1.7 previously with success.
>
> Thank you.-
>
> --
> wassalam,
> [bayu]
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
