Hi Chris

Many thanks for the response. Sorry it's taken a few days to test things here.
I have added the properties you suggested:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored. This is an effective way to limit the
  size of the link database, keeping only the highest quality links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

They were indeed set to the defaults. I have also updated my command to:

bin/crawl urls crawl http://localhost:8983/solr/collection1 16

I have checked db.max.outlinks.per.page and it is set as follows:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a
  page. If this value is nonnegative (>=0), at most
  db.max.outlinks.per.page outlinks will be processed for a page;
  otherwise, all outlinks will be processed.
  </description>
</property>

which I believe should be OK.

After each change I have deleted the crawl database and run my initial crawl (everything is added no matter what depth), then added some additional documents and re-run the crawl. As before, no change unfortunately. If the files are added at the root or doccontrol level everything is added/crawled (I can even add a new directory at this level and the files within it are crawled/added). But any new documents added into the other folders (DC-10 Incoming Correspondence or DC-11 Outgoing Correspondence) are ignored.
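For what it's worth, my reading of the db.max.outlinks.per.page description above is that any negative value disables the cap entirely, so -1 should be fine. A rough sketch of that interpretation (illustrative only, not Nutch's actual code):

```python
def limit_outlinks(outlinks, max_per_page=-1):
    """Illustrative reading of db.max.outlinks.per.page (not Nutch's code):
    a nonnegative value caps how many outlinks are kept per page; any
    negative value (such as -1) means all outlinks are processed."""
    if max_per_page >= 0:
        return outlinks[:max_per_page]
    return outlinks
```

So with -1 the outlink limit shouldn't be dropping the links to the deeper folders.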
Whenever I run the crawl/add it always stops at the second pass (irrespective of the final parameter on the command line), i.e. it always finishes with:

Fri Aug 29 15:20:22 EST 2014 : Iteration 2 of 4
Generating a new segment
Generator: starting at 2014-08-29 15:20:23
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...

Would the seed.txt or regex-url.txt affect things in this manner?

seed.txt entry: http://ws0895/doccontrol/

regex-url.txt entry: http://([a-z0-9]*\.)*ws0895/

Any further suggestions?

Many thanks

P

On 8 September 2014 16:15, Mattmann, Chris A (3980) <[email protected]> wrote:

> Hi Paul,
>
> Try expanding your last parameter (which is the # of crawling rounds).
>
> Also make sure to check these properties:
>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored. This is an effective way to limit the
>   size of the link database, keeping only the highest quality links.
>   </description>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> The first can be set to false so that Nutch actually processes inlinks
> from the same host, and the second to true so that Nutch ignores external
> links (if necessary).
>
> Also check your max outlinks per page property.
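As a quick sanity check on that regex-url.txt pattern, here is a small test (Python's re module for illustration; Nutch's RegexURLFilter uses Java regexes, but this simple pattern behaves the same way in both, and the filter accepts a URL when the pattern matches anywhere in it, which re.search approximates):

```python
import re

# The regex-url.txt pattern quoted above. Assumption: in the actual file the
# line carries a leading "+" (accept), as RegexURLFilter requires; only the
# pattern itself is exercised here.
pattern = re.compile(r"http://([a-z0-9]*\.)*ws0895/")

urls = [
    "http://ws0895/doccontrol/",
    "http://ws0895/doccontrol/DC-10%20Incoming%20Correspondence/",
    "http://ws0895/doccontrol/DC-11%20Outgoing%20Correspondence/new.pdf",
]

for url in urls:
    print(url, "->", "accepted" if pattern.search(url) else "rejected")
```

All three URLs come back "accepted" here, which suggests the pattern itself is not what's excluding the files under DC-10/DC-11.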
>
> HTH,
> Chris
>
> ________________________________________
> From: Paul Rogers [[email protected]]
> Sent: Monday, September 08, 2014 2:09 PM
> To: [email protected]
> Subject: Nutch not crawling deep enough into directory structure
>
> Hi Guys
>
> Reposting this since I think it got lost in the tail end of the last post.
>
> I have a web site serving a series of documents (PDFs) and am using Nutch
> 1.8 to index them in Solr. The base url is http://localhost/ and the
> documents are stored in a series of directories under
> http://localhost/doccontrol/, e.g.
>
> /
> |_doccontrol
>   |_DC-10 Incoming Correspondence
>   |_DC-11 Outgoing Correspondence
>
> If, when I first run Nutch, the folders DC-10 and DC-11 contain all the
> files to be indexed, then Nutch crawls everything without a problem - GOOD :-)
>
> If I add a new folder or documents to the root or doccontrol folder then
> the next time Nutch runs it crawls all the new files and indexes them -
> GOOD :-)
>
> However, any new files that are added to the DC-10 or DC-11 directories
> are not indexed, with Nutch's output as follows (summarised):
>
> Injector: starting at 2014-08-29 15:19:59
> Injector: crawlDb: crawl/crawldb
> Injector: urlDir: urls
> Injector: Converting injected urls to crawl db entries.
> Injector: total number of urls rejected by filters: 0
> Injector: total number of urls injected after normalization and filtering: 1
> Injector: Merging injected urls into crawl db.
> Injector: overwrite: true
> Injector: update: false
> Injector: finished at 2014-08-29 15:20:02, elapsed: 00:00:02
> Fri Aug 29 15:20:02 EST 2014 : Iteration 1 of 4
> Generating a new segment
> Generator: starting at 2014-08-29 15:20:02
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 50000
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20140829152005
> Generator: finished at 2014-08-29 15:20:06, elapsed: 00:00:03
> Operating on segment : 20140829152005
> Fetching : 20140829152005
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2014-08-29 15:20:06
> Fetcher: segment: crawl/segments/20140829152005
> Fetcher Timelimit set for : 1409354406733
> Using queue mode : byHost
> Fetcher: threads: 50
> Fetcher: time-out divisor: 2
> QueueFeeder finished: total 1 records + hit by time limit :0
> Using queue mode : byHost
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> fetching http://ws0895/doccontrol/ (queue crawl delay=5000ms)
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Using queue mode : byHost
> -finishing thread FetcherThread, activeThreads=1
> .
> .
> .
> -finishing thread FetcherThread, activeThreads=1
> Using queue mode : byHost
> Fetcher: throughput threshold: -1
> -finishing thread FetcherThread, activeThreads=1
> Fetcher: throughput threshold retries: 5
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2014-08-29 15:20:09, elapsed: 00:00:02
> Parsing : 20140829152005
> ParseSegment: starting at 2014-08-29 15:20:09
> ParseSegment: segment: crawl/segments/20140829152005
> Parsed (3ms):http://ws0895/doccontrol/
> ParseSegment: finished at 2014-08-29 15:20:10, elapsed: 00:00:01
> CrawlDB update
> CrawlDb update: starting at 2014-08-29 15:20:11
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20140829152005]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2014-08-29 15:20:12, elapsed: 00:00:01
> Link inversion
> LinkDb: starting at 2014-08-29 15:20:13
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: internal links will be ignored.
> LinkDb: adding segment: crawl/segments/20140829152005
> LinkDb: merging with existing linkdb: crawl/linkdb
> LinkDb: finished at 2014-08-29 15:20:15, elapsed: 00:00:02
> Dedup on crawldb
> Indexing 20140829152005 on SOLR index -> http://localhost:8983/solr/collection1
> Indexer: starting at 2014-08-29 15:20:19
> Indexer: deleting gone documents: false
> Indexer: URL filtering: false
> Indexer: URL normalizing: false
> Active IndexWriters :
> SOLRIndexWriter
> solr.server.url : URL of the SOLR instance (mandatory)
> solr.commit.size : buffer size when sending to SOLR (default 1000)
> solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> solr.auth : use authentication (default false)
> solr.auth.username : use authentication (default false)
> solr.auth : username for authentication
> solr.auth.password : password for authentication
>
> Indexer: finished at 2014-08-29 15:20:20, elapsed: 00:00:01
> Cleanup on SOLR index -> http://localhost:8983/solr/collection1
> Fri Aug 29 15:20:22 EST 2014 : Iteration 2 of 4
> Generating a new segment
> Generator: starting at 2014-08-29 15:20:23
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: false
> Generator: normalizing: true
> Generator: topN: 50000
> Generator: 0 records selected for fetching, exiting ...
>
> BAD - :-(
>
> What I'd like Nutch to do is to index any newly added docs, whatever level
> they were added at.
>
> My nutch command is as follows:
>
> bin/crawl urls crawl http://localhost:8983/solr/collection1 4
>
> My nutch-site.xml contains:
>
> <property>
>   <name>db.update.additions.allowed</name>
>   <value>true</value>
>   <description>If true, updatedb will add newly discovered URLs, if false
>   only already existing URLs in the CrawlDb will be updated and no new
>   URLs will be added.
>   </description>
> </property>
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>   outlinks will be processed for a page; otherwise, all outlinks will be
>   processed.
>   </description>
> </property>
>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
>   <description>Whether existing records in the CrawlDB will be overwritten
>   by injected records.
>   </description>
> </property>
>
> <property>
>   <name>db.fetch.schedule.class</name>
>   <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
>   <description>The implementation of fetch schedule. DefaultFetchSchedule
>   simply adds the original fetchInterval to the last fetch time, regardless
>   of page changes.</description>
> </property>
>
> <property>
>   <name>db.fetch.schedule.adaptive.min_interval</name>
>   <value>86400.0</value>
>   <description>Minimum fetchInterval, in seconds.</description>
> </property>
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>1209600</value>
>   <description>The default number of seconds between re-fetches of a page
>   (14 days).
>   </description>
> </property>
>
> Is what I am trying to do (recrawl any newly added documents at any level)
> impossible?
>
> Or (more likely) am I missing something in the config?
>
> Can anyone point me in the right direction?
>
> Many thanks
>
> P
>
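One observation on the fetch-schedule settings quoted above: the generator only selects pages whose next-fetch time has passed, so with db.fetch.interval.default at 1209600 seconds (14 days), an already-fetched doccontrol page is not re-fetched — and its new outlinks are not discovered — until that interval elapses, which would produce exactly the "0 records selected for fetching" message. A rough sketch of that eligibility check (illustrative only, not Nutch's actual code):

```python
# Illustrative sketch of the generator's "due for fetch" test, using the
# intervals quoted from nutch-site.xml above (not Nutch's actual code).
DEFAULT_INTERVAL = 1209600  # db.fetch.interval.default: 14 days, in seconds
MIN_INTERVAL = 86400        # db.fetch.schedule.adaptive.min_interval: 1 day

def due_for_fetch(last_fetch_time, now, fetch_interval=DEFAULT_INTERVAL):
    """A page is selected into a new segment only once its interval elapses."""
    return now >= last_fetch_time + fetch_interval

# Under these settings a page fetched just now is not due again for 14 days
# by default, and never sooner than MIN_INTERVAL under the adaptive schedule.
```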

