Hi, I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to crawl the entire thing. I'm probably missing something simple, so I hope somebody can help me.
My urls/nutch file contains a single URL: http://www.aip.org/history/ohilist/transcripts.html , which is an alphabetical listing of other pages. It looks like the indexer stops partway down this page, meaning that entries later in the alphabet aren't indexed.

My nutch-site.xml has the following content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>OHI Spider</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.</description>
  </property>
</configuration>

My regex-urlfilter.txt and crawl-urlfilter.txt both include the following, which should allow access to everything I want:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*aip.org/history/ohilist/

# skip everything else
-.

I've crawled with the following command:

runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 500000

Note that since we don't have NutchBean anymore, I can't tell whether this is actually a Nutch problem or whether something is failing when I port to Solr.

What am I missing?

Thanks,
Chip
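P.S. In case it's useful: one way to sanity-check the accept pattern outside Nutch is to run it through plain grep (this is just a quick ERE check with my seed URL, not anything Nutch-specific), and the seed does appear to pass:

```shell
# Feed the seed URL through the same accept pattern from regex-urlfilter.txt;
# grep -E prints the URL if the extended regex matches it.
echo "http://www.aip.org/history/ohilist/transcripts.html" \
  | grep -E '^http://([a-z0-9]*\.)*aip.org/history/ohilist/'
```

So I don't think the filter is rejecting the listing page itself, though of course this doesn't tell me anything about the outlinks further down the page.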