Hi, I'm using Nutch 1.3 to crawl a section of our website, and it doesn't seem to crawl the entire thing. I'm probably missing something simple, so I hope somebody can help me.
My urls/nutch file contains a single URL: http://www.aip.org/history/ohilist/transcripts.html , which is an alphabetical listing of other pages. It looks like the indexer stops partway down this page, meaning that entries later in the alphabet aren't indexed.

My nutch-site.xml has the following content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>OHI Spider</value>
  </property>
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>The maximum number of outlinks that we'll process for a page.
    If this value is nonnegative (>=0), at most db.max.outlinks.per.page
    outlinks will be processed for a page; otherwise, all outlinks will be
    processed.</description>
  </property>
</configuration>

My regex-urlfilter.txt and crawl-urlfilter.txt both include the following, which should allow access to everything I want:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*aip.org/history/ohilist/

# skip everything else
-.

I've crawled with the following command:

runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 500000

Note that since we don't have NutchBean anymore, I can't tell whether this is actually a Nutch problem or whether something is failing when I port to Solr.

What am I missing?

Thanks,
Chip
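P.S. In case it's useful: one way to sanity-check the accept pattern outside Nutch is to run it through plain grep (this is just a quick ERE check with my seed URL, not anything Nutch-specific), and the seed does appear to pass:

```shell
# Feed the seed URL through the same accept pattern from regex-urlfilter.txt;
# grep -E prints the URL if the extended regex matches it.
echo "http://www.aip.org/history/ohilist/transcripts.html" \
  | grep -E '^http://([a-z0-9]*\.)*aip.org/history/ohilist/'
```

So I don't think the filter is rejecting the listing page itself, though of course this doesn't tell me anything about the outlinks further down the page.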