That did it! For the convenience of anyone who finds this in the list archives later on, I'll paste what it took:
$NUTCH_HOME/runtime/local/conf/nutch-site.xml (full contents):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>http.agent.name</name>
  <value>OHI Spider</value>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a
  page. If this value is nonnegative (>=0), at most
  db.max.outlinks.per.page outlinks will be processed for a page;
  otherwise, all outlinks will be processed.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes. If this
  value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.</description>
</property>
</configuration>

$NUTCH_HOME/runtime/local/conf/schema.xml & $SOLR_HOME/example/solr/conf/schema.xml:

Replace this:
  <field name="content" type="text" stored="false" indexed="true"/>
With this:
  <field name="content" type="text" stored="true" indexed="true"/>

$SOLR_HOME/example/solr/conf/solrconfig.xml:

Replace this:
  <maxFieldLength>10000</maxFieldLength>
With this:
  <maxFieldLength>2147483647</maxFieldLength>
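As a quick sanity check after re-crawling and re-indexing (this assumes the
stock Solr example instance on port 8983; swap in a phrase that actually
appears near the end of one of your long documents), a fielded phrase query
should now return the document:

curl "http://localhost:8983/solr/select?q=content:%22phrase+near+the+end%22&fl=url,title"

One thing that tripped me up: http.content.limit is measured in bytes, while
Solr's maxFieldLength is the maximum number of tokens indexed per field, so
both limits had to be raised before text at the end of the longer documents
became searchable.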
-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Monday, August 01, 2011 3:45 PM
To: [email protected]
Cc: Chip Calhoun
Subject: Re: Nutch not indexing full collection

Nutch truncates content longer than configured and Solr truncates content
exceeding max field length. Maybe check your limits.

> I'm still having trouble with this. In addition to the nutch-site.xml
> posted below, I have now modified my schema.xml (in both nutch and
> solr) to include the following important line:
> <field name="content" type="text" stored="true" indexed="true"/>
>
> Now, when I search, the full text of each document shows up under
> <str name="content">. I'm clearly getting everything. And yet, when I
> search for text toward the end of a long document, I still don't get
> that document in my search results.
>
> It sounds like this might be an issue with my Solr setup. Can anyone
> think of what I might be missing?
>
> Chip
>
> -----Original Message-----
> From: Chip Calhoun [mailto:[email protected]]
> Sent: Thursday, July 28, 2011 3:29 PM
> To: [email protected]
> Subject: RE: Nutch not indexing full collection
>
> Thanks! This has solved half of my problem. I am now indexing
> material from every document I want. However, I'm still not indexing
> words from toward the end of longer documents. I'm not sure what else
> I could be missing.
>
> The current contents of my nutch-site.xml are:
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
> <property>
>   <name>http.agent.name</name>
>   <value>OHI Spider</value>
> </property>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>The maximum number of outlinks that we'll process for a
>   page. If this value is nonnegative (>=0), at most
>   db.max.outlinks.per.page outlinks will be processed for a page;
>   otherwise, all outlinks will be processed.</description>
> </property>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
> </property>
> </configuration>
>
> And I'm still indexing with this command:
> bin/nutch crawl urls -dir crawl -depth 15 -topN 500000
>
> -----Original Message-----
> From: lewis john mcgibbney [mailto:[email protected]]
> Sent: Wednesday, July 27, 2011 12:18 PM
> To: [email protected]
> Subject: Re: Nutch not indexing full collection
>
> Has this been solved?
>
> If your http.content.limit has not been increased in nutch-site.xml
> then you will not be able to store this data and index with Solr.
>
> On Mon, Jul 25, 2011 at 6:18 PM, Chip Calhoun <[email protected]> wrote:
> > I'm still having trouble. I've set a Windows environment variable,
> > NUTCH_HOME, which for me is C:\Apache\nutch-1.3\runtime\local . I
> > now have my urls and crawl directories in that
> > C:\Apache\nutch-1.3\runtime\local folder. But I'm still not
> > crawling files later on my urls list, and apparently I can't search
> > for words or phrases toward the end of any of my documents. Am I
> > misremembering that there was a total file size value somewhere in
> > Nutch or Solr that needs to be increased?
> >
> > -----Original Message-----
> > From: lewis john mcgibbney [mailto:[email protected]]
> > Sent: Wednesday, July 20, 2011 5:23 PM
> > To: [email protected]
> > Subject: Re: Nutch not indexing full collection
> >
> > Hi Chip,
> >
> > I would try running your scripts after setting the environment
> > variable $NUTCH_HOME to nutch/runtime/local/NUTCH_HOME
> >
> > On Wed, Jul 20, 2011 at 4:01 PM, Chip Calhoun <[email protected]> wrote:
> > > I've been working with
> > > $NUTCH_HOME/runtime/local/conf/nutch-site.xml,
> > > and I'm pretty sure that's the correct file. I run my commands
> > > while in $NUTCH_HOME/ , which means all of my commands begin with
> > > "runtime/local/bin/nutch..." . That means my urls directory is
> > > $NUTCH_HOME/urls/ and my crawl directory ends up being
> > > $NUTCH_HOME/crawl/ (as opposed to $NUTCH_HOME/runtime/local/urls/
> > > and so forth), but it does seem to at least be getting my
> > > urlfilters from $NUTCH_HOME/runtime/local/conf/ .
> > >
> > > I get no output when I try runtime/local/bin/nutch readdb -stats ,
> > > so that's weird.
> > >
> > > I dimly recall there being a total index size value somewhere in
> > > Nutch or Solr which has to be increased, but I can no longer find
> > > any reference to it.
> > >
> > > Chip
> > >
> > > -----Original Message-----
> > > From: Julien Nioche [mailto:[email protected]]
> > > Sent: Wednesday, July 20, 2011 10:06 AM
> > > To: [email protected]
> > > Subject: Re: Nutch not indexing full collection
> > >
> > > I'd have suspected db.max.outlinks.per.page, but you seem to have
> > > set it up correctly. Are you running Nutch in runtime/local? In
> > > which case you modified nutch-site.xml in runtime/local/conf, right?
> > >
> > > nutch readdb -stats will give you the total number of pages known
> > > etc.
> > >
> > > Julien
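(A note for anyone reading this in the archives: the reason "readdb -stats"
printed nothing for me earlier in this thread is, as far as I can tell, that
readdb expects the path to the crawldb as its first argument. With the crawl
directory layout used here, the working invocation is:

runtime/local/bin/nutch readdb crawl/crawldb -stats

which prints the total number of URLs in the crawldb along with their fetch
statuses.)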
> > > On 20 July 2011 14:51, Chip Calhoun <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I'm using Nutch 1.3 to crawl a section of our website, and it
> > > > doesn't seem to crawl the entire thing. I'm probably missing
> > > > something simple, so I hope somebody can help me.
> > > >
> > > > My urls/nutch file contains a single URL:
> > > > http://www.aip.org/history/ohilist/transcripts.html , which is
> > > > an alphabetical listing of other pages. It looks like the
> > > > indexer stops partway down this page, meaning that entries later
> > > > in the alphabet aren't indexed.
> > > >
> > > > My nutch-site.xml has the following content:
> > > > <?xml version="1.0"?>
> > > > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > > > <!-- Put site-specific property overrides in this file. -->
> > > > <configuration>
> > > > <property>
> > > >   <name>http.agent.name</name>
> > > >   <value>OHI Spider</value>
> > > > </property>
> > > > <property>
> > > >   <name>db.max.outlinks.per.page</name>
> > > >   <value>-1</value>
> > > >   <description>The maximum number of outlinks that we'll process
> > > >   for a page. If this value is nonnegative (>=0), at most
> > > >   db.max.outlinks.per.page outlinks will be processed for a page;
> > > >   otherwise, all outlinks will be processed.</description>
> > > > </property>
> > > > </configuration>
> > > >
> > > > My regex-urlfilter.txt and crawl-urlfilter.txt both include the
> > > > following, which should allow access to everything I want:
> > > > # accept hosts in MY.DOMAIN.NAME
> > > > +^http://([a-z0-9]*\.)*aip.org/history/ohilist/
> > > > # skip everything else
> > > > -.
> > > >
> > > > I've crawled with the following command:
> > > > runtime/local/bin/nutch crawl urls -dir crawl -depth 15 -topN 500000
> > > >
> > > > Note that since we don't have NutchBean anymore, I can't tell
> > > > whether this is actually a Nutch problem or whether something is
> > > > failing when I port to Solr. What am I missing?
> > > >
> > > > Thanks,
> > > > Chip
> > >
> > > --
> > > *
> > > *Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> >
> > --
> > *Lewis*
>
> --
> *Lewis*

