I'm still having trouble. I've set a Windows environment variable, NUTCH_HOME, which for me is C:\Apache\nutch-1.3\runtime\local . My urls and crawl directories are now in that C:\Apache\nutch-1.3\runtime\local folder. But I'm still not crawling the files that come later in my urls list, and apparently I can't search for words or phrases toward the end of any of my documents. Am I misremembering that there was a total file size value somewhere in Nutch or Solr that needs to be increased?
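
One candidate for the setting in question, offered as an assumption rather than a confirmed diagnosis: Nutch's nutch-default.xml defines http.content.limit, which caps the number of bytes kept per fetched document (65536 by default) and truncates anything longer. A long listing page cut off at that limit would lose its later links, and long documents would lose the text near their end, which matches both symptoms. A minimal sketch of an override for runtime/local/conf/nutch-site.xml, assuming this property is the relevant one, might look like:

```xml
<!-- Assumption: truncation by http.content.limit is the cause. -->
<!-- Goes inside the existing <configuration> element of nutch-site.xml. -->
<property>
  <name>http.content.limit</name>
  <!-- A negative value disables truncation; the default is 65536 bytes. -->
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all.
  </description>
</property>
```

If this is the cause, the pages would need to be fetched again after the change, since segments already on disk would still hold the truncated content. The separate file.content.limit property plays the same role for file: URLs.
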
-----Original Message-----
From: lewis john mcgibbney [mailto:[email protected]]
Sent: Wednesday, July 20, 2011 5:23 PM
To: [email protected]
Subject: Re: Nutch not indexing full collection

Hi Chip,

I would try running your scripts after setting the environment variable $NUTCH_HOME to nutch/runtime/local/NUTCH_HOME

On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun <[email protected]> wrote:

> I've been working with $NUTCH_HOME/runtime/local/conf/nutch-site.xml,
> and I'm pretty sure that's the correct file. I run my commands while
> in $NUTCH_HOME/ , which means all of my commands begin with
> "runtime/local/bin/nutch..." . That means my urls directory is
> $NUTCH_HOME/urls/ and my crawl directory ends up being
> $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/ and
> so forth), but it does seem to at least be getting my urlfilters from
> $NUTCH_HOME/runtime/local/conf/ .
>
> I get no output when I try runtime/local/bin/nutch readdb -stats , so
> that's weird.
>
> I dimly recall there being a total index size value somewhere in Nutch
> or Solr which has to be increased, but I can no longer find any
> reference to it.
>
> Chip
>
> -----Original Message-----
> From: Julien Nioche [mailto:[email protected]]
> Sent: Wednesday, July 20, 2011 10:06 AM
> To: [email protected]
> Subject: Re: Nutch not indexing full collection
>
> I'd have suspected db.max.outlinks.per.page but you seem to have set
> it up correctly. Are you running Nutch in runtime/local? in which case
> you modified nutch-site.xml in runtime/local/conf, right?
>
> nutch readdb -stats will give you the total number of pages known etc....
>
> Julien
>
> On 20 July 2011 14:51, Chip Calhoun <[email protected]> wrote:
>
> > Hi,
> >
> > I'm using Nutch 1.3 to crawl a section of our website, and it
> > doesn't seem to crawl the entire thing. I'm probably missing
> > something simple, so I hope somebody can help me.
> >
> > My urls/nutch file contains a single URL:
> > http://www.aip.org/history/ohilist/transcripts.html , which is an
> > alphabetical listing of other pages. It looks like the indexer
> > stops partway down this page, meaning that entries later in the
> > alphabet aren't indexed.
> >
> > My nutch-site.xml has the following content:
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > <!-- Put site-specific property overrides in this file. -->
> > <configuration>
> > <property>
> > <name>http.agent.name</name>
> > <value>OHI Spider</value>
> > </property>
> > <property>
> > <name>db.max.outlinks.per.page</name>
> > <value>-1</value>
> > <description>The maximum number of outlinks that we'll process for a page.
> > If this value is nonnegative (>=0), at most
> > db.max.outlinks.per.page outlinks will be processed for a page;
> > otherwise, all outlinks will be processed.
> > </description>
> > </property>
> > </configuration>
> >
> > My regex-urlfilter.txt and crawl-urlfilter.txt both include the
> > following, which should allow access to everything I want:
> > # accept hosts in MY.DOMAIN.NAME
> > +^http://([a-z0-9]*\.)*aip.org/history/ohilist/
> > # skip everything else
> > -.
> >
> > I've crawled with the following command:
> > runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 500000
> >
> > Note that since we don't have NutchBean anymore, I can't tell
> > whether this is actually a Nutch problem or whether something is
> > failing when I port to Solr. What am I missing?
> >
> > Thanks,
> > Chip
> >
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>

--
*Lewis*

