I would like to know how to add a field to an index using Nutch 1.6 and Solr
4.0. I have tried using the index-static, index-extra and index-metadata
plugins, all to no avail. I have modified the following two files.
nutch-default.xml:
<property>
<name>index.static</name>
<value>display_type:page</value>
<description>
A simple plugin called at indexing that adds fields with static data.
You can specify a list of fieldname:fieldcontent per nutch job.
It can be useful when collections can't be created by urlpatterns,
like in subcollection, but on a job-basis.
</description>
</property>
nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(anchor|basic|metadata|static)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-suffix</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with
the underlying commons-httpclient library.
</description>
</property>
I expected the result in Solr to look similar to the following:
<doc>
<arr name="content">
<str>Untitled Document text goes here and more text and more</str>
</arr>
<str name="title">Untitled Document</str>
<str name="segment">20130603095157</str>
<float name="boost">0.65465367</float>
<str name="digest">30fd854c798cf159085934c50561dccb</str>
<date name="tstamp">2013-06-03T13:52:12.593Z</date>
<str name="id">https://...</str>
<str name="url">https://...</str>
<long name="_version_">1436829905573642240</long>
<str name="display_type">page</str>
</doc>
But I do not see my added field.
I believe index-extra is deprecated, but I thought index-static and
index-metadata should still work.
Must I write a custom plugin? If so, I would ultimately like to set the
value of the added field according to the parsed MIME type, e.g.:
if (application/msword or application/pdf) { doc.add("display_type", "doc") }
if (text/html) { doc.add("display_type", "pages") }
if (video/mpeg) { doc.add("display_type", "video") }
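To make the intent concrete, here is a minimal sketch in plain Java of the mapping I have in mind. DisplayTypeMapper and displayTypeFor are hypothetical names of my own, not part of Nutch. I am assuming a custom indexing filter would call something like this with the parsed content type and add the result to the document as display_type.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not a Nutch class): maps a MIME type to a display_type value. */
final class DisplayTypeMapper {
    private static final Map<String, String> TYPES = new HashMap<>();
    static {
        TYPES.put("application/msword", "doc");
        TYPES.put("application/pdf", "doc");
        TYPES.put("text/html", "pages");
        TYPES.put("video/mpeg", "video");
    }

    /** Returns the display_type for a MIME type, or null when no mapping is known. */
    static String displayTypeFor(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case before the lookup.
        int semi = mimeType.indexOf(';');
        String base = (semi >= 0 ? mimeType.substring(0, semi) : mimeType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return TYPES.get(base);
    }

    public static void main(String[] args) {
        System.out.println(displayTypeFor("application/pdf"));          // prints "doc"
        System.out.println(displayTypeFor("text/html; charset=UTF-8")); // prints "pages"
    }
}
```

Unknown types return null here, so the caller could simply skip adding the field for them.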
Any assistance would be greatly appreciated.
--
View this message in context:
http://lucene.472066.n3.nabble.com/How-to-add-field-to-index-tp4067894.html
Sent from the Nutch - User mailing list archive at Nabble.com.