Hello,

I have configured Nutch 2.2.1 following the Nutch2Tutorial
<https://wiki.apache.org/nutch/Nutch2Tutorial> and integrated it with
Solr 4.7, and it is working fine. Next I wanted to parse HTML and index
meta tags in Solr.
Since parse-metatags is not supported by default, I followed
"Parse-metatags and index-metadata plugin for Nutch 2.x series"
<https://issues.apache.org/jira/browse/NUTCH-1478> and installed the patch
NUTCH-1478v5.patch
<https://issues.apache.org/jira/secure/attachment/12631702/NUTCH-1478v5.patch>.

I think I have installed it correctly, because I get the following output
when I try to parse a URL:

$ ./bin/nutch parsechecker http://nutch.apache.org/
fetching: http://nutch.apache.org/
parsing: http://nutch.apache.org/
contentType: text/html
signature: 030a8fe7684b5357663e041327e3d96b
---------
Url
---------------
http://nutch.apache.org/
---------
Metadata
---------
metatag.forrest-skin-name :     nutch
metatag.forrest-version :     0.10-dev
metatag.generator :     Apache Forrest
metatag.content-type :     text/html; charset=UTF-8

Now I am trying to index the metadata along with the other content in Solr.
I have updated the Solr schema.xml with <field name="meta_*" type="string"
stored="true" indexed="true"/> to accept every generated field.
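
For reference, here is a sketch of how I understand that schema.xml entry
may need to look. As far as I know, Solr applies wildcard names such as
meta_* only to dynamicField declarations, so the fragment uses dynamicField
instead of field. The meta_ prefix is an assumption on my part: parsechecker
prints the keys with a metatag. prefix, so the pattern may have to be changed
to match whatever the indexing plugin actually emits.

```xml
<!-- schema.xml, next to the other field declarations.
     Sketch only: the meta_ prefix is an assumption, not confirmed
     against the fields the plugin really sends to Solr. -->
<dynamicField name="meta_*" type="string" stored="true" indexed="true"/>
```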

My questions are:
1. How do I index the metadata in Solr? When I execute ./bin/nutch
parsechecker http://nutch.apache.org/ it extracts the meta tags and prints
them on standard output. How do I get these meta tags indexed in Solr?
2. Is it possible to integrate this with the default bin/crawl script, with
modifications?
    bin/crawl urls/seed.txt TestCrawl1.3 http://localhost:8983/solr/ 1
    This indexes the sites' content in Solr, but not the metadata.

Can anyone please help me? Thanks in advance.
