BTW, I tried the steps below with several Nutch and Solr versions and got
errors, but now I am using Nutch 1.7 and Solr 5.5.2 on Ubuntu, and I am
trying to crawl a subfolder and everything under it.  The subfolder
contains yearly subfolders for every year since 2005 (12 year subfolders),
each year subfolder has 12 month subfolders, and each month subfolder has
at least 30 day subfolders.  I know there are more than 3,960 index.phtml
files plus some other regular .html, .phtml and PDF files.
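
To keep Nutch inside that subfolder, I restrict conf/regex-urlfilter.txt
to it; the host and path below are placeholders, not my real site:

```
# conf/regex-urlfilter.txt -- example host/path, substitute the real site
# accept anything under the clips subfolder...
+^http://www\.example\.com/clips/
# ...and reject everything else
-.
```

The seed file under urls/ then contains just the top of that tree, so the
crawl starts at the subfolder itself.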

OK, so I start the crawl and follow the step-by-step instructions:

bin/nutch inject crawl/crawldb urls

I then run the following cycle at least 7 times:

 bin/nutch generate crawl/crawldb crawl/segments -topN 10000000
 s7=`ls -d crawl/segments/2* | tail -1`
 bin/nutch fetch $s7
 bin/nutch parse $s7
 bin/nutch updatedb crawl/crawldb $s7
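
For what it's worth, the whole cycle can be scripted.  This is only a
dry-run sketch: the `echo` in front of bin/nutch makes it print the
commands instead of running them (remove the echo to execute for real),
and 7 rounds is just the count I happened to use:

```shell
#!/bin/sh
# Dry-run sketch of the generate/fetch/parse/updatedb cycle.
# NUTCH starts with "echo" so the commands are only printed.
NUTCH="echo bin/nutch"
ROUNDS=7
done_rounds=0
i=1
while [ "$i" -le "$ROUNDS" ]; do
  $NUTCH generate crawl/crawldb crawl/segments -topN 10000000
  # newest segment directory (segment names start with a timestamp, e.g. 2016...)
  seg=`ls -d crawl/segments/2* 2>/dev/null | tail -1`
  $NUTCH fetch "$seg"
  $NUTCH parse "$seg"
  $NUTCH updatedb crawl/crawldb "$seg"
  done_rounds=$((done_rounds + 1))
  i=$((i + 1))
done
echo "completed $done_rounds rounds"
```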

Followed by:

 bin/nutch invertlinks crawl/linkdb -dir crawl/segments
 bin/nutch solrindex http://localhost:9191/solr/clips crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20161004205432/ -filter -normalize

But it only finds 289 records (docs) when I look at the Solr page.
It seems it only sees clips/2016, clips/2015 and clips/2011.
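
Could the problem be that my solrindex command above is given a single
segment, so only that segment's pages get indexed?  A sketch that pushes
every segment instead (again a dry run, with `echo` in front of
bin/nutch; remove the echo to run it for real):

```shell
#!/bin/sh
# Dry-run sketch: index every segment under crawl/segments, not just one.
NUTCH="echo bin/nutch"
SOLR=http://localhost:9191/solr/clips
count=0
for seg in crawl/segments/*/; do
  $NUTCH solrindex "$SOLR" crawl/crawldb/ -linkdb crawl/linkdb/ "$seg" -filter -normalize
  count=$((count + 1))
done
echo "indexed $count segments"
```

If the build supports it, passing `-dir crawl/segments` instead of a
single segment path should do the same thing in one call.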

I also tried the all-in-one command, but it FAILS:
bin/nutch crawl urls -solr http://localhost:9191/solr/clips -dir newcrawl
-depth 3 -topN 3
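
Those -depth 3 -topN 3 limits also look far too small for this tree; a
quick back-of-the-envelope check using the folder counts from above:

```shell
#!/bin/sh
# Rough sizing: 12 year dirs x 12 month dirs x ~30 day dirs, so files
# under a day folder sit at roughly depth 5 (seed -> year -> month -> day -> file).
day_dirs=$((12 * 12 * 30))
echo "day-level folders: $day_dirs"
# -depth 3 -topN 3 fetches at most 3 new URLs per round for 3 rounds:
max_fetched=$((3 * 3))
echo "max URLs fetched with -depth 3 -topN 3: $max_fetched"
```

So even when it works, that command could never reach more than a handful
of the thousands of pages in the tree.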

Indexer: starting at 2016-10-14 18:53:55
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default
        solr.auth : use authentication (default false)
        solr.auth.username : username for authentication
        solr.auth.password : password for authentication

Indexer: finished at 2016-10-14 18:53:57, elapsed: 00:00:01
SolrDeleteDuplicates: starting at 2016-10-14 18:53:57
SolrDeleteDuplicates: Solr url: http://localhost:9191/solr/clips
*Exception in thread "main" Job failed!*
        at org.apache.hadoop.mapred.JobClient.runJob(
        at org.apache.nutch.crawl.Crawl.main(

*How can I make it crawl the entire subfolder?*
*And what does that error mean?*



Né§t☼r  *Authority gone to one's head is the greatest enemy of Truth*
