Hi Julien,

IMHO, the usage message of dedup is confusing, since the "DeduplicationJob" shown there is the name of the Java class, not the name used on the nutch command line.
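For example, running the command without arguments on 1.8 prints the class name (the output below is exactly what you quoted in your mail):

    $ bin/nutch dedup
    Usage: DeduplicationJob <crawldb>

A new user can easily read "DeduplicationJob" as something to type on the command line, instead of "bin/nutch dedup".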
Maybe it should be:

Usage: bin/nutch dedup <crawldb>

Quote from the page you updated: "This command takes a path to a crawldb as parameter...", i.e. a crawldb path as the parameter. That is clearer for users who run deduplication from the nutch command. CMIIW.

On Fri, May 16, 2014 at 6:21 PM, Julien Nioche <[email protected]> wrote:

> Hi
>
> The dedup is now independent from any specific backend, as you can see by
> typing './nutch dedup':
>
> *Usage: DeduplicationJob <crawldb>*
>
> What it does is mark the duplicates within the crawldb; this is then used
> by the indexer to delete the corresponding entries.
>
> I have updated the description on
> https://wiki.apache.org/nutch/bin/nutch%20dedup.
>
> Thanks!
>
> Julien
>
> On 15 May 2014 05:29, Bayu Widyasanyata <[email protected]> wrote:
>
> > Hi All,
> >
> > I want to run deduplication on nutch 1.8 using the command: nutch dedup
> > <solr_URL>, since the "nutch solrdedup" command is not supported anymore
> > on 1.8. But this command raised an error:
> >
> > 2014-05-15 11:19:59,334 INFO crawl.DeduplicationJob - DeduplicationJob: starting at 2014-05-15 11:19:59
> > 2014-05-15 11:19:59,749 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-05-15 11:19:59,831 ERROR security.UserGroupInformation - PriviledgedActionException as:root cause:java.io.IOException: No FileSystem for scheme: http
> > 2014-05-15 11:19:59,833 ERROR crawl.DeduplicationJob - DeduplicationJob: java.io.IOException: No FileSystem for scheme: http
> >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1434)
> >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1455)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
> >         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
> >         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
> >         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
> >         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
> >         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
> >         at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
> >         at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
> >         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
> >         at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> >         at java.security.AccessController.doPrivileged(Native Method)
> >         at javax.security.auth.Subject.doAs(Subject.java:422)
> >         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> >         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> >         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> >         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> >         at org.apache.nutch.crawl.DeduplicationJob.run(DeduplicationJob.java:251)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >         at org.apache.nutch.crawl.DeduplicationJob.main(DeduplicationJob.java:294)
> >
> > What does "No FileSystem for scheme: http" mean?
> > What am I missing here?
> >
> > I used nutch 1.7 previously with success.
> >
> > Thank you.-
> >
> > --
> > wassalam,
> > [bayu]
>
> --
> Open Source Solutions for Text Engineering
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
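If I read your explanation above correctly, the 1.8 replacement for the old solrdedup is a two-step sequence, roughly like this (crawl/crawldb and crawl/segments are only example paths, and the exact indexing options depend on the setup):

    # step 1: mark duplicate entries inside the crawldb (a filesystem path, not a URL)
    bin/nutch dedup crawl/crawldb

    # step 2: a subsequent indexing run deletes the entries marked as duplicates
    bin/nutch index crawl/crawldb -dir crawl/segments

That would also explain my original error above: the job expects a filesystem path, so when it got a URL, Hadoop looked for a FileSystem implementation for the "http" scheme and found none.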
--
wassalam,
[bayu]

