Hi Sherban,

On Mon, Sep 28, 2015 at 10:54 PM, <[email protected]> wrote:

> I made progress. I downloaded and installed the release candidate from
> https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1

OK, great.

> <property>
> <name>plugin.includes</name>
> <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basi
> c|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|url
> filter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basi
> c|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-rege
> x|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
> </property>

The above property is hellishly out of date. Many of these plugins no longer exist. The plugin directory lists the ones that are available:
https://github.com/apache/nutch/tree/2.x/src/plugin

> I verified my SOLR is up and running. The SOLR web gui says solr-spec
> 5.1.0. Do I have to configure SOLR for nutch indexing? If so, are there
> instructions to configure SOLR for nutch?

You need to copy the schema.xml from Nutch [0] into each Solr core you intend to use, then restart your Solr server.

[0] https://github.com/apache/nutch/blob/2.x/conf/schema.xml

> Unrelated question…
> How does nutch crawl every link in pages in the seeds.txt file?

Sorry, this is an extremely vague question. Can you be more specific?

> Is there a
> difference between a URL directory entry vs specific page URL?

No. Each is treated as an individual WebPage. If we successfully fetch a page from the URL, its outlinks are parsed out (along with a bunch of other data) and we then attempt to fetch them. This process runs in cycles.

> For example, let’s say http://foo.com/index.html contains 100 links. Will
> nutch crawl these 2 seed.txt entries the same way(i.e. crawl each 100
> links)?
> http://foo.com/index.html
> http://foo.com

Yes, provided http://foo.com resolves to http://foo.com/index.html.

> Thanks again for your help. I’ll give +1 vote for 2.3.1 candidate once
> SOLR indexing works ;).

OK, grand. Note that the supported Solr version is 4.6.0.

Thanks
Lewis
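
The fetch/parse cycle described above (fetch a page, parse out its outlinks, fetch those in the next cycle) can be illustrated with a toy sketch. This is not Nutch code: the in-memory `TOY_WEB` graph, the `REDIRECTS` table, and the `fetch` and `crawl` functions are all hypothetical stand-ins, and the redirect handling is a simplification that assumes http://foo.com resolves to http://foo.com/index.html.

```python
# Toy model of a cyclic crawl. Everything here is hypothetical and
# only illustrates the idea; it does not use or mirror Nutch's API.

# Assumed page graph: URL -> outlinks found on that page.
TOY_WEB = {
    "http://foo.com/index.html": ["http://foo.com/a.html", "http://foo.com/b.html"],
    "http://foo.com/a.html": ["http://foo.com/c.html"],
    "http://foo.com/b.html": [],
    "http://foo.com/c.html": [],
}

# Assumed resolution of the bare host to its index page.
REDIRECTS = {"http://foo.com": "http://foo.com/index.html"}


def fetch(url):
    """Return (final_url, outlinks), or None when the fetch fails."""
    final_url = REDIRECTS.get(url, url)
    if final_url not in TOY_WEB:
        return None
    return final_url, TOY_WEB[final_url]


def crawl(seeds, cycles):
    """Run a fixed number of fetch/parse cycles starting from the seeds."""
    fetched = set()
    frontier = list(seeds)
    for _ in range(cycles):
        next_frontier = []
        for url in frontier:
            result = fetch(url)
            if result is None:
                continue  # failed fetch: no outlinks to follow
            final_url, outlinks = result
            if final_url in fetched:
                continue  # already handled in an earlier cycle
            fetched.add(final_url)
            # Outlinks are queued for the next cycle, not fetched now.
            next_frontier.extend(l for l in outlinks if l not in fetched)
        frontier = next_frontier
    return fetched


if __name__ == "__main__":
    for seed in ("http://foo.com", "http://foo.com/index.html"):
        print(seed, "->", sorted(crawl([seed], cycles=3)))
```

In this sketch both seed forms end up fetching the same set of pages, because the bare host resolves to the index page before its outlinks are parsed. Each extra cycle reaches one more level of links.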

