Hi Jerritt,

> $ bin/crawl -D C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seeds.txt Test Crawl http://localhost:8983/solr/ 2

Afaics, there are two issues with the command:

1. The option -D expects a key=value pair to set a property, e.g.
     -D solr.server.url=http://localhost:8983/solr/

2. If the crawl or seed directory path contains spaces, the argument
   needs to be quoted, see the example below. However, it's better not
   to use spaces in file or directory names at all. :)
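
Putting both together, a call with the property set via -D and quoted
paths could look like this (the paths are just placeholders, adapt them
to your setup):

$ bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ \
    "/path/with spaces/urls" "/path/with spaces/TestCrawl" 2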

Running bin/crawl without arguments shows a command-line help:

$ apache-nutch-1.11/bin/crawl
Usage: crawl [-i|--index] [-D "key=value"] [-w|--wait] <Seed Dir> <Crawl Dir> <Num Rounds>
        -i|--index      Indexes crawl results into a configured indexer
        -D              A Java property to pass to Nutch calls
        -w|--wait       NUMBER[SUFFIX] Time to wait before generating a new segment when no URLs
                        are scheduled for fetching. Suffix can be: s for second,
                        m for minute, h for hour and d for day. If no suffix is
                        specified second is used by default.
        Seed Dir        Directory in which to look for a seeds file
        Crawl Dir       Directory where the crawl/link/segments dirs are saved
        Num Rounds      The number of rounds to run this crawl for
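
For example, a crawl that waits 5 minutes before generating a new
segment when no URLs are scheduled could look like this (again with
placeholder paths):

$ bin/crawl -i -w 5m -D solr.server.url=http://localhost:8983/solr/ \
    urls/ TestCrawl 2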

I guess this should do the job for you:

$ bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ .../urls/seeds.txt TestCrawl 2
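
If you don't want to pass the property on every call, you can also set
it once in conf/nutch-site.xml (same property name as above, value
adapted to your Solr instance):

<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr/</value>
</property>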

Cheers,
Sebastian


On 12/26/2015 07:16 PM, Jerritt Pace wrote:
> I have tried a lot of different things, but I can't get nutch to run a crawl
> command.
> 
> I am using cygwin on windows 7.
> I have the java classpath set, and I am getting feedback when I run bin/nutch.
> But the crawl execute gives me an error:
> Error running:
>   /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject http://localhost:8983/solr//crawldb TestCrawl
> Failed with exit value 127.
> My command is
> 
> 
> $ bin/crawl -D C:/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/urls/seeds.txt Test Crawl http://localhost:8983/solr/ 2
> The full output is:
> 
> Injecting seed URLs
> /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject http://localhost:8983/solr//crawldb TestCrawl
> Injector: starting at 2015-12-26 13:11:12
> Injector: crawlDb: http://localhost:8983/solr/crawldb
> Injector: urlDir: TestCrawl
> Injector: Converting injected urls to crawl db entries.
> Injector: java.lang.IllegalArgumentException: Wrong FS: http://localhost:8983/solr/crawldb, expected: file:///
>         at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
>         at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:79)
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:506)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:298)
>         at org.apache.nutch.crawl.Injector.run(Injector.java:379)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>         at org.apache.nutch.crawl.Injector.main(Injector.java:369)
> 
> Error running:
>   /cygdrive/c/Users/User5/Documents/Nutch/apache-nutch-1.11/runtime/local/bin/nutch inject http://localhost:8983/solr//crawldb TestCrawl
> Failed with exit value 127.
> 
> Any help with this would be much appreciated!!
