RE: Nutch not indexing full collection

Chip Calhoun Mon, 25 Jul 2011 10:18:42 -0700

I'm still having trouble.  I've set a Windows environment variable, NUTCH_HOME, 
which for me is C:\Apache\nutch-1.3\runtime\local .  I now have my urls and 
crawl directories in that C:\Apache\nutch-1.3\runtime\local folder.  But I'm 
still not crawling files later in my urls list, and apparently I can't search 
for words or phrases toward the end of any of my documents.  Am I 
misremembering that there was a total file size value somewhere in Nutch or 
Solr that needs to be increased?
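
For reference, one setting that matches this symptom is Nutch's 
http.content.limit property.  In nutch-default.xml it defaults to 65536 bytes, 
and fetched content beyond that length is truncated before parsing, which would 
drop both the outlinks and the text near the end of a long page.  The fragment 
below is a minimal sketch of overriding it in 
runtime/local/conf/nutch-site.xml; the value -1 (no limit) is an assumption 
about what suits this crawl, and a large positive byte count would also work.

```xml
<!-- Add inside the existing <configuration> element of nutch-site.xml. -->
<property>
  <name>http.content.limit</name>
  <!-- Assumption: -1 disables truncation; the default is 65536 bytes. -->
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes.  If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all.
  </description>
</property>
```

Pages fetched before the change stay truncated in the existing segments until 
they are crawled again.  If the pages are fetched in full and text is still 
missing at search time, the maxFieldLength element in a Solr 3.x 
solrconfig.xml (default 10000 tokens per field) is another candidate, though 
whether it applies depends on the Solr version in use.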


-----Original Message-----
From: lewis john mcgibbney [mailto:[email protected]] 
Sent: Wednesday, July 20, 2011 5:23 PM
To: [email protected]
Subject: Re: Nutch not indexing full collection

Hi Chip,

I would try running your scripts after setting the environment variable 
$NUTCH_HOME to nutch/runtime/local/NUTCH_HOME

On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun <[email protected]> wrote:

> I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml, 
> and I'm pretty sure that's the correct file.  I run my commands while 
> in $NUTCH_HOME/ , which means all of my commands begin with 
> "runtime/local/bin/nutch..." .  That means my urls directory is 
> $NUTCH_HOME/urls/ and my crawl directory ends up being 
> $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/ and 
> so forth), but it does seem to at least be getting my urlfilters from 
> $NUTCH_HOME/runtime/local/conf/ .
>
> I get no output when I try runtime/local/bin/nutch readdb -stats , so 
> that's weird.
>
> I dimly recall there being a total index size value somewhere in Nutch 
> or Solr which has to be increased, but I can no longer find any 
> reference to it.
>
> Chip
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Wednesday, July 20, 2011 10:06 AM
> To: [email protected]
> Subject: Re: Nutch not indexing full collection
>
> I'd have suspected db.max.outlinks.per.page but you seem to have set 
> it up correctly. Are you running Nutch in runtime/local? in which case 
> you modified nutch-site.xml in runtime/local/conf, right?
>
> nutch readdb -stats will give you the total number of pages known etc....
>
> Julien
>
> On 20 July 2011 14:51, Chip Calhoun <[email protected]> wrote:
>
> > Hi,
> >
> > I'm using Nutch 1.3 to crawl a section of our website, and it 
> > doesn't seem to crawl the entire thing.  I'm probably missing 
> > something simple, so I hope somebody can help me.
> >
> > My urls/nutch file contains a single URL:
> > http://www.aip.org/history/ohilist/transcripts.html , which is an 
> > alphabetical listing of other pages.  It looks like the indexer 
> > stops partway down this page, meaning that entries later in the 
> > alphabet aren't indexed.
> >
> > My nutch-site.xml has the following content:
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > <!-- Put site-specific property overrides in this file. --> 
> > <configuration>
> >  <property>
> >   <name>http.agent.name</name>
> >   <value>OHI Spider</value>
> >  </property>
> >  <property>
> >   <name>db.max.outlinks.per.page</name>
> >   <value>-1</value>
> >   <description>The maximum number of outlinks that we'll process for a page.
> >   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
> >   outlinks will be processed for a page; otherwise, all outlinks will be
> >   processed.
> >   </description>
> >  </property>
> > </configuration>
> >
> > My regex-urlfilter.txt and crawl-urlfilter.txt both include the 
> > following, which should allow access to everything I want:
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*aip.org/history/ohilist/
> > # skip everything else
> > -.
> >
> > I've crawled with the following command:
> > runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 500000
> >
> > Note that since we don't have NutchBean anymore, I can't tell 
> > whether this is actually a Nutch problem or whether something is 
> > failing when I port to Solr.  What am I missing?
> >
> > Thanks,
> > Chip
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



--
*Lewis*
