I have "index-(basic|anchor|more|metadata)" and
"parse-(html|tika|metatags)" included in plugin.includes, but despite:
# bin/nutch parsechecker https:/..... |grep -i date
Date : Tue, 18 Oct 2016 14:37:40 GMT
The 'date' field in Solr for the document is wrong:
Why is this? Also, since I think 'date' is being inferred from the
'last-modified' header, I'd like it to go into 'lastModified' too.
I saw a reference to adding the following to solrindex-mapping.xml:
<field dest="lastModified" source="date"/>
but this dies during IndexingJob with
ERROR: [doc=com.abloz:http/hbase/book.html] multiple values encountered
for non multiValued field lastModified: [Tue Jun 16 10:55:02 UTC 2015,
Tue Jun 16 10:55:02 UTC 2015]
which makes no sense. Surely there aren't two last-modified HTTP headers?
It does at least confirm that the value is going in.
The Solr schema is correct, I think (there's no real-world reason for
lastModified to be multivalued):
<field name="lastModified" type="date" stored="true" indexed="false"/>
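One possible reading of the "multiple values encountered" error above, offered as a guess rather than a confirmed diagnosis: if an indexing plugin already populates lastModified on its own, then mapping 'date' onto the same destination would leave two values in that field, even though only one HTTP header was sent. The toy model below is a minimal sketch of that situation. It is not Nutch or Solr code, and apply_mapping and the sample document are hypothetical names made up for illustration.

```python
from collections import defaultdict


def apply_mapping(doc, mapping):
    """Rename fields according to `mapping` (source -> dest).

    Values that land on the same destination are accumulated,
    not overwritten. This is a toy model, not the Nutch implementation.
    """
    out = defaultdict(list)
    for field, values in doc.items():
        dest = mapping.get(field, field)  # unmapped fields keep their name
        out[dest].extend(values)
    return dict(out)


# Hypothetical document: ASSUMES both fields are already filled
# with the same timestamp before the mapping is applied.
doc = {
    "date": ["Tue Jun 16 10:55:02 UTC 2015"],
    "lastModified": ["Tue Jun 16 10:55:02 UTC 2015"],
}

mapped = apply_mapping(doc, {"date": "lastModified"})

# Fields that would violate a single-valued schema definition.
multi_valued = {name: vals for name, vals in mapped.items() if len(vals) > 1}
print(multi_valued)
```

If that assumption holds, the duplicate would come from the mapping itself rather than from the HTTP response, and checking whether lastModified is already present without the mapping line would confirm or rule it out.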