I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, and I'm pretty sure that's the correct file. I run my commands from $NUTCH_HOME/, which means all of my commands begin with "runtime/local/bin/nutch...". That means my urls directory is $NUTCH_HOME/urls/ and my crawl directory ends up being $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/ and so forth), but it does seem to at least be picking up my urlfilters from $NUTCH_HOME/runtime/local/conf/.
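For reference, running from runtime/local instead would make the relative urls/ and crawl/ paths resolve under that directory too. A sketch, assuming the stock Nutch 1.3 layout and the same crawl command from the original message below:

```shell
# Run from runtime/local so that relative paths resolve under it;
# the seed list and crawl data would then live in runtime/local/urls/
# and runtime/local/crawl/ rather than directly under $NUTCH_HOME.
cd $NUTCH_HOME/runtime/local
bin/nutch crawl urls -dir crawl -depth 15 -topN 500000
```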
I get no output when I try runtime/local/bin/nutch readdb -stats, so that's weird. I dimly recall there being a total index size value somewhere in Nutch or Solr which has to be increased, but I can no longer find any reference to it.

Chip

-----Original Message-----
From: Julien Nioche [mailto:[email protected]]
Sent: Wednesday, July 20, 2011 10:06 AM
To: [email protected]
Subject: Re: Nutch not indexing full collection

I'd have suspected db.max.outlinks.per.page but you seem to have set it up correctly. Are you running Nutch in runtime/local? In which case you modified nutch-site.xml in runtime/local/conf, right?

nutch readdb -stats will give you the total number of pages known etc....

Julien

On 20 July 2011 14:51, Chip Calhoun <[email protected]> wrote:
> Hi,
>
> I'm using Nutch 1.3 to crawl a section of our website, and it doesn't
> seem to crawl the entire thing. I'm probably missing something
> simple, so I hope somebody can help me.
>
> My urls/nutch file contains a single URL:
> http://www.aip.org/history/ohilist/transcripts.html , which is an
> alphabetical listing of other pages. It looks like the indexer stops
> partway down this page, meaning that entries later in the alphabet
> aren't indexed.
>
> My nutch-site.xml has the following content:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>OHI Spider</value>
>   </property>
>   <property>
>     <name>db.max.outlinks.per.page</name>
>     <value>-1</value>
>     <description>The maximum number of outlinks that we'll process for a page.
>     If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>     outlinks will be processed for a page; otherwise, all outlinks will
>     be processed.
>     </description>
>   </property>
> </configuration>
>
> My regex-urlfilter.txt and crawl-urlfilter.txt both include the
> following, which should allow access to everything I want:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*aip.org/history/ohilist/
> # skip everything else
> -.
>
> I've crawled with the following command:
>
> runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 500000
>
> Note that since we don't have NutchBean anymore, I can't tell whether
> this is actually a Nutch problem or whether something is failing when
> I port to Solr. What am I missing?
>
> Thanks,
> Chip

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
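Worth noting on the empty readdb output: in Nutch 1.x, readdb expects the path to the crawldb as its first argument, which could explain getting nothing back. A sketch, assuming the crawl was run with -dir crawl from $NUTCH_HOME as described above:

```shell
# readdb takes the crawldb directory; a crawl run with "-dir crawl"
# stores it at crawl/crawldb, so point readdb there for the stats.
runtime/local/bin/nutch readdb crawl/crawldb -stats
```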

