I would like to know how to add a field to an index using Nutch 1.6 and Solr
4.0. I have tried the index-static, index-extra and index-metadata plugins,
all to no avail. I have modified the following two files.

nutch-default.xml:

<property>
  <name>index.static</name>
  <value>display_type:page</value>
  <description>
  A simple plugin called at indexing that adds fields with static data. 
  You can specify a list of fieldname:fieldcontent per nutch job.
  It can be useful when collections can't be created by urlpatterns, 
  like in subcollection, but on a job-basis.
  </description>
</property>

nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika|metatags)|index-(anchor|basic|metadata|static)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-suffix</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with
  the underlying commons-httpclient library.
  </description>
</property>

I expected the result in Solr to look similar to the following:

<doc>
  <arr name="content">
    <str>Untitled Document text goes here and more text and more</str>
  </arr>
  <str name="title">Untitled Document</str>
  <str name="segment">20130603095157</str>
  <float name="boost">0.65465367</float>
  <str name="digest">30fd854c798cf159085934c50561dccb</str>
  <date name="tstamp">2013-06-03T13:52:12.593Z</date>
  <str name="id">https://...</str>
  <str name="url">https://...</str>
  <long name="_version_">1436829905573642240</long>
  <str name="display_type">page</str>
</doc>

But I do not see my added field.

I believe index-extra is deprecated, but I thought index-static and
index-metadata should still work. 

Must I write a custom plugin? If so, I would ultimately like to set the
value of the added field depending on the MIME type that was parsed, e.g.:

if (application/msword or application/pdf) {doc.add("display_type", "doc")}
if (text/html) {doc.add("display_type", "pages")}
if (video/mpeg) {doc.add("display_type", "video")}
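
For illustration, the MIME-type mapping described above could be sketched as a
small standalone Java helper. This is only a sketch under assumptions: the
class and method names (DisplayTypeMapper, displayTypeFor) are hypothetical
and not part of Nutch, and it deliberately leaves out the plugin wiring. In a
custom indexing plugin, the returned value would be the one added to the
document as display_type.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: maps a parsed MIME type to the
// value intended for the display_type field.
final class DisplayTypeMapper {

    private DisplayTypeMapper() {
    }

    // Returns the display_type for a MIME type, or null when no rule applies.
    static String displayTypeFor(String mimeType) {
        if (mimeType == null) {
            return null;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        String base = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        switch (base) {
            case "application/msword":
            case "application/pdf":
                return "doc";
            case "text/html":
                return "pages";
            case "video/mpeg":
                return "video";
            default:
                return null;
        }
    }

    public static void main(String[] args) {
        // Example call with a content type that carries a charset parameter.
        System.out.println(displayTypeFor("text/html; charset=UTF-8"));
    }
}
```

Returning null for unmatched types lets the caller decide whether to skip the
field or fall back to a default value.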

Any assistance would be greatly appreciated.


