I'd have suspected db.max.outlinks.per.page, but you seem to have set it up
correctly. Are you running Nutch in runtime/local? If so, you modified the
nutch-site.xml in runtime/local/conf, right?

nutch readdb <crawldb> -stats will give you the total number of pages known,
their status counts, etc.
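
As a rough sketch of that check, assuming the crawl directory is the "crawl"
passed to -dir in the command quoted below and that Nutch is run from
runtime/local (both paths are assumptions taken from your message, and the
NUTCH_BIN and CRAWL_DIR variable names are only illustrative):

```shell
#!/bin/sh
# Assumed locations; override via the environment if your layout differs.
NUTCH_BIN="${NUTCH_BIN:-runtime/local/bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

if [ -x "$NUTCH_BIN" ]; then
  # Print crawldb statistics: total URLs known plus per-status counts.
  "$NUTCH_BIN" readdb "$CRAWL_DIR/crawldb" -stats
else
  # Do not fail hard; just report that the assumed path is wrong.
  echo "nutch not found at $NUTCH_BIN; set NUTCH_BIN to your install" >&2
fi
```

If the total there already covers the later entries of the listing, that
would suggest the crawl itself is fine and the problem is more likely on the
Solr side; if it does not, the links were probably lost before they reached
the crawldb.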


On 20 July 2011 14:51, Chip Calhoun <ccalh...@aip.org> wrote:

> Hi,
> I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem
> to crawl the entire thing.  I'm probably missing something simple, so I hope
> somebody can help me.
> My urls/nutch file contains a single URL:
> http://www.aip.org/history/ohilist/transcripts.html , which is an
> alphabetical listing of other pages.  It looks like the indexer stops
> partway down this page, meaning that entries later in the alphabet aren't
> indexed.
> My nutch-site.xml has the following content:
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
> <property>
>  <name>http.agent.name</name>
>  <value>OHI Spider</value>
> </property>
> <property>
>  <name>db.max.outlinks.per.page</name>
>  <value>-1</value>
>  <description>The maximum number of outlinks that we'll process for a page.
>  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>  will be processed for a page; otherwise, all outlinks will be processed.
>  </description>
> </property>
> </configuration>
> My regex-urlfilter.txt and crawl-urlfilter.txt both include the following,
> which should allow access to everything I want:
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*aip.org/history/ohilist/
> # skip everything else
> -.
> I've crawled with the following command:
> runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 500000
> Note that since we don't have NutchBean anymore, I can't tell whether this
> is actually a Nutch problem or whether something is failing when I port to
> Solr.  What am I missing?
> Thanks,
> Chip

Open Source Solutions for Text Engineering