I'd have suspected db.max.outlinks.per.page, but you seem to have configured it correctly. Are you running Nutch from runtime/local? If so, you modified nutch-site.xml in runtime/local/conf (not the top-level conf/), right?
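For reference, here is a sketch of where those files sit in a stock 1.3 source checkout (paths assumed from the standard layout; `ant runtime` is the build target that regenerates runtime/):

```
conf/nutch-site.xml                 <- edits here only take effect after re-running 'ant runtime'
runtime/local/conf/nutch-site.xml   <- the copy that bin/nutch in runtime/local actually reads
runtime/local/bin/nutch             <- the script you are invoking
```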
nutch readdb crawl/crawldb -stats will give you the total number of pages known, etc.

Julien

On 20 July 2011 14:51, Chip Calhoun <ccalh...@aip.org> wrote:
> Hi,
>
> I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem
> to crawl the entire thing. I'm probably missing something simple, so I hope
> somebody can help me.
>
> My urls/nutch file contains a single URL:
> http://www.aip.org/history/ohilist/transcripts.html, which is an
> alphabetical listing of other pages. It looks like the indexer stops
> partway down this page, meaning that entries later in the alphabet aren't
> indexed.
>
> My nutch-site.xml has the following content:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>OHI Spider</value>
>   </property>
>   <property>
>     <name>db.max.outlinks.per.page</name>
>     <value>-1</value>
>     <description>The maximum number of outlinks that we'll process for a page.
>     If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>     outlinks will be processed for a page; otherwise, all outlinks will be
>     processed.</description>
>   </property>
> </configuration>
>
> My regex-urlfilter.txt and crawl-urlfilter.txt both include the following,
> which should allow access to everything I want:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*aip.org/history/ohilist/
>
> # skip everything else
> -.
>
> I've crawled with the following command:
>
> runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 500000
>
> Note that since we don't have NutchBean anymore, I can't tell whether this
> is actually a Nutch problem or whether something is failing when I port to
> Solr. What am I missing?
>
> Thanks,
> Chip

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
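As a quick sanity check on the urlfilter line itself, here's a minimal sketch (Python, for illustration only; Nutch evaluates these patterns with Java regexes, which agree with Python for this expression) showing which URLs the `+` rule admits. The test URLs other than transcripts.html are made up for the example:

```python
import re

# The accept rule from regex-urlfilter.txt, verbatim. Note the
# unescaped dot in "aip.org" matches any character, but that is
# harmless for these URLs.
accept = re.compile(r"^http://([a-z0-9]*\.)*aip.org/history/ohilist/")

tests = {
    "http://www.aip.org/history/ohilist/transcripts.html": True,
    "http://www.aip.org/history/other.html": False,         # outside ohilist/
    "http://example.com/history/ohilist/foo.html": False,   # wrong host
}

for url, expected in tests.items():
    matched = accept.match(url) is not None
    print(f"{url}: matched={matched} (expected {expected})")
```

If the transcripts URL matches and the other two don't, the filter is behaving as intended, and the truncation is happening somewhere on the fetch/parse side rather than in the URL filtering.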