Hi all,

When I do a full re-crawl, the old URLs whose content has been modified should
be updated, correct? That is not happening.

Please correct me where I am wrong. Here are the steps I followed:


   - Set properties: db.fetch.interval.default=600 (sec), db.injector.update=true
   - Crawl: bin/nutch crawl urls -solr
http://localhost:8080/solrnutch -dir crawltest -depth 10
   - Wait 600 sec
   - Crawl again: bin/nutch crawl urls -solr
http://localhost:8080/solrnutch -dir crawltest -depth 10
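
For reference, the two properties from the first step would look roughly like this. This is only a sketch: it assumes they are set as overrides in conf/nutch-site.xml rather than edited in nutch-default.xml, and the values are the ones listed above, with the interval given in seconds.

```xml
<!-- conf/nutch-site.xml (assumed location of the overrides) -->
<property>
  <name>db.fetch.interval.default</name>
  <!-- re-fetch interval in seconds: 600 sec, as in the steps above -->
  <value>600</value>
</property>
<property>
  <name>db.injector.update</name>
  <!-- update existing crawl db records with re-injected ones -->
  <value>true</value>
</property>
```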


   - Nothing was updated. The data in the Solr index remains the same. I
   checked the fetched segments (bin/nutch readseg) and their content is also
   old, even though the fetch took place. Please see the brief log trace below.
   - I also deleted one page so that its URL returns "not found", expecting it
   to be deleted from the index as well (using -deleteGone), but it was not.
   The log shows it as deleted, but it is still in the index, and the URL is
   still searchable.
   Is this some cache problem (I cleared the web server cache), or is there a
   setting that I have to change? Please let me know.


Please note: this question is related to my earlier thread, but it is a
different question, about the update not succeeding: the data is not re-fetched.


Thanks very much - David
*Brief log trace from the second crawl:*
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
http://david.wordpress.in/ overwritten with injected record but update was
specified.
Injector: finished at 2013-03-05 23:25:49, elapsed: 00:00:03
Generator: starting at 2013-03-05 23:25:49
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Fetcher: segment: crawltest/segments/20130305232551
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 5 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
fetching http://david.wordpress.in/2011_09_01_archive.html
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=4
* queue: david.wordpress.in

(and so on ...)

Indexing 10 documents
Deleting 1 documents
SolrIndexer: finished at 2013-03-05 23:27:37, elapsed: 00:00:09
SolrDeleteDuplicates: starting at 2013-03-05 23:27:37
SolrDeleteDuplicates: Solr url:
http://localhost:8080/nutch_solr4/collection1/
SolrDeleteDuplicates: finished at 2013-03-05 23:27:38, elapsed: 00:00:01
crawl finished: crawltest
