Re: Unable to use Nutch 2.3 crawl script for MySQL, Mongo, or Cassandra

Lewis John Mcgibbney Tue, 29 Sep 2015 21:38:40 -0700

Hi Sherban,

On Mon, Sep 28, 2015 at 10:54 PM, <[email protected]> wrote:


>
> I made progress. I downloaded and installed the release candidate from
> https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1
>

OK great.


>
>
>         <property>
>                 <name>plugin.includes</name>
>
> <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basi
> c|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|url
> filter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basi
> c|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-rege
> x|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
>         </property>
>
>
The above property is hellishly out of date. Many of these plugins no
longer exist. The plugins directory lists the ones that are currently
available:
https://github.com/apache/nutch/tree/2.x/src/plugin
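
As a rough way to check a plugin.includes value against that directory, here is a minimal Python sketch. It assumes the value is a regular expression that must match a whole plugin id, which is my understanding of how Nutch applies it. The AVAILABLE list is only an illustrative sample and should be replaced with the real directory listing. The active_plugins helper is hypothetical and not part of Nutch.

```python
import re

# Illustrative sample only: replace with the actual names found in the
# plugins directory of your Nutch checkout.
AVAILABLE = [
    "index-anchor", "index-basic", "indexer-solr",
    "parse-html", "parse-tika",
    "protocol-http", "protocol-httpclient",
    "scoring-opic", "urlfilter-regex",
    "urlnormalizer-basic", "urlnormalizer-pass", "urlnormalizer-regex",
]


def active_plugins(plugin_includes, available):
    """Return the available plugin ids that the whole regex matches.

    Hypothetical helper: assumes full-match semantics for plugin.includes.
    """
    pattern = re.compile(plugin_includes)
    return sorted(name for name in available if pattern.fullmatch(name))


if __name__ == "__main__":
    value = ("protocol-http|urlfilter-regex|parse-(html|tika)"
             "|index-(basic|anchor)|indexer-solr")
    for name in active_plugins(value, AVAILABLE):
        print(name)

    # A missing "|" between two alternatives matches nothing at all.
    print(active_plugins("urlnormalizer-(pass|regex|basic)protocol-http",
                         AVAILABLE))
```

Any name in the value that matches nothing in the directory is silently ignored under this assumption, so comparing the printed list with what you expected is a quick sanity check.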


>
> I verified my SOLR is up and running. The SOLR web gui says solr-spec
> 5.1.0. Do I have to configure SOLR for nutch indexing? If so, are there
> instructions to configure SOLR for nutch?
>

You need to copy the schema.xml from Nutch [0] into each Solr core you
intend to use, then restart your Solr server.

[0] https://github.com/apache/nutch/blob/2.x/conf/schema.xml


>
>
> Unrelated question…
> How does nutch crawl every link in pages in the seeds.txt file?


This is an extremely vague question, sorry. Can you be more specific?


> Is there a
> difference between a URL directory entry vs specific page URL?
>

No. Each is treated as an individual WebPage. If we successfully fetch
a page from the URL, its outlinks are parsed out (along with a bunch of
other data) and we then attempt to fetch them. This process runs in cycles.
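
The cycle described above can be sketched as a toy loop in Python. This is not Nutch code: crawl, fake_fetch and FAKE_WEB are hypothetical names, and the in-memory link table stands in for real fetching and parsing. It only illustrates that every URL is its own record and that outlinks found in one round are fetched in the next.

```python
# Hypothetical in-memory "web": URL -> outlinks found on that page.
# A URL missing from the table stands for a failed fetch.
FAKE_WEB = {
    "http://foo.com": ["http://foo.com/a.html", "http://foo.com/b.html"],
    "http://foo.com/index.html": ["http://foo.com/a.html",
                                  "http://foo.com/b.html"],
    "http://foo.com/a.html": ["http://foo.com/c.html"],
    "http://foo.com/b.html": [],
}


def fake_fetch(url):
    """Return the outlinks of url, or None if the fetch fails."""
    return FAKE_WEB.get(url)


def crawl(seeds, fetch, rounds):
    """Run up to `rounds` cycles; return (attempted, known) URL sets."""
    known = set(seeds)   # every URL seen so far, each its own page record
    attempted = set()    # URLs we have already tried to fetch
    for _ in range(rounds):
        batch = sorted(known - attempted)  # pick URLs not yet attempted
        if not batch:
            break
        for url in batch:
            outlinks = fetch(url)
            attempted.add(url)
            if outlinks is not None:
                # Outlinks become candidates for the next cycle.
                known.update(outlinks)
    return attempted, known


if __name__ == "__main__":
    attempted, known = crawl(["http://foo.com"], fake_fetch, rounds=2)
    print(sorted(attempted))
    print(sorted(known - attempted))  # discovered but not yet fetched
```

In this toy table, seeding with http://foo.com or with http://foo.com/index.html discovers the same outlinks, because both entries list the same links.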


> For example, let’s say http://foo.com/index.html contains 100 links. Will
> nutch crawl these 2 seed.txt entries the same way(i.e. crawl each 100
> links)?
> http://foo.com/index.html
> http://foo.com
>

Yes, provided http://foo.com resolves to http://foo.com/index.html.


>
>
> Thanks again for your help. I’ll give +1 vote for 2.3.1 candidate once
> SOLR indexing works ;).
>

OK grand. It should be noted that the supported Solr version is 4.6.0.
Thanks
Lewis
