Hi Paul,

Try expanding your last parameter (which is the # of crawling rounds).
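For example, taking the bin/crawl invocation you posted further down and only bumping the last argument (the value 10 here is purely illustrative; pick whatever discovery depth you need):

  bin/crawl urls crawl http://localhost:8983/solr/collection1 10

Each round generates, fetches and parses roughly one more "hop" of newly discovered links, so with too few rounds the URLs sitting deeper in the directory tree may never be reached.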
Also make sure to check these properties:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored. This is an effective way to limit the
  size of the link database, keeping only the highest quality links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

The first can be set to false so that Nutch actually processes links from the same host, and the second to true so that Nutch ignores external links (if necessary). Also check your max outlinks per page property.

HTH,
Chris

________________________________________
From: Paul Rogers [[email protected]]
Sent: Monday, September 08, 2014 2:09 PM
To: [email protected]
Subject: Nutch not crawling deep enough into directory structure

Hi Guys

Reposting this since I think it got lost in the tail end of the last post.

I have a web site serving a series of documents (pdf's) and am using Nutch 1.8 to index them in solr. The base url is http://localhost/ and the documents are stored in a series of directories in the directory http://localhost/doccontrol/, e.g.

/
|_doccontrol
  |_DC-10 Incoming Correspondence
  |_DC-11 Outgoing Correspondence

If, when I first run nutch, the folders DC-10 and DC-11 contain all the files to be indexed, then nutch crawls everything without a problem - GOOD :-)

If I add a new folder or documents to the root or doc control folder, then the next time nutch runs it crawls all the new files and indexes them - GOOD :-)

However, any new files that are added to the DC-10 or DC-11 directories are not indexed, with nutch's output as follows (summarised):

Injector: starting at 2014-08-29 15:19:59
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: overwrite: true
Injector: update: false
Injector: finished at 2014-08-29 15:20:02, elapsed: 00:00:02
Fri Aug 29 15:20:02 EST 2014 : Iteration 1 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:02
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20140829152005
Generator: finished at 2014-08-29 15:20:06, elapsed: 00:00:03
Operating on segment : 20140829152005
Fetching : 20140829152005
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2014-08-29 15:20:06
Fetcher: segment: crawl/segments/20140829152005
Fetcher Timelimit set for : 1409354406733
Using queue mode : byHost
Fetcher: threads: 50
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
fetching http://ws0895/doccontrol/ (queue crawl delay=5000ms)
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
. . .
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2014-08-29 15:20:09, elapsed: 00:00:02
Parsing : 20140829152005
ParseSegment: starting at 2014-08-29 15:20:09
ParseSegment: segment: crawl/segments/20140829152005
Parsed (3ms):http://ws0895/doccontrol/
ParseSegment: finished at 2014-08-29 15:20:10, elapsed: 00:00:01
CrawlDB update
CrawlDb update: starting at 2014-08-29 15:20:11
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20140829152005]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2014-08-29 15:20:12, elapsed: 00:00:01
Link inversion
LinkDb: starting at 2014-08-29 15:20:13
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: crawl/segments/20140829152005
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2014-08-29 15:20:15, elapsed: 00:00:02
Dedup on crawldb
Indexing 20140829152005 on SOLR index -> http://localhost:8983/solr/collection1
Indexer: starting at 2014-08-29 15:20:19
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication
Indexer: finished at 2014-08-29 15:20:20, elapsed: 00:00:01
Cleanup on SOLR index -> http://localhost:8983/solr/collection1
Fri Aug 29 15:20:22 EST 2014 : Iteration 2 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...

BAD - :-(

What I'd like nutch to do is to index any newly added docs whatever level they were added at.
My nutch command is as follows:

bin/crawl urls crawl http://localhost:8983/solr/collection1 4

My nutch-site.xml contains:

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.
  </description>
</property>

<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
  <description>Whether existing records in the CrawlDB will be overwritten
  by injected records.
  </description>
</property>

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule
  simply adds the original fetchInterval to the last fetch time, regardless
  of page changes.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>86400.0</value>
  <description>Minimum fetchInterval, in seconds.</description>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>1209600</value>
  <description>The default number of seconds between re-fetches of a page
  (14 days).
  </description>
</property>

Is what I am trying to do (recrawl any newly added documents at any level) impossible? Or (more likely) am I missing something in the config? Can anyone point me in the right direction?

Many thanks

P

