I have "index-(basic|anchor|more|metadata)" and "parse-(html|tika|metatags)" included in plugin.includes, but even though parsechecker reports a sensible date:

# bin/nutch parsechecker https:/..... |grep -i date
Date :  Tue, 18 Oct 2016 14:37:40 GMT

the 'date' field in Solr for the document is wrong:

    "date": "1970-01-01T00:00:00Z",
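For what it's worth, 1970-01-01T00:00:00Z is the Unix epoch, i.e. what a timestamp of zero formats to. That suggests (I have not confirmed this) that 'date' is being filled from an unset timestamp rather than from the header. A minimal sketch of the value I would expect versus the one I get; this is plain Python for illustration, not Nutch code, and the helper names are my own:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Solr's canonical UTC date representation.
SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def http_date_to_solr(header_value: str) -> str:
    """Convert an HTTP date header value into Solr's UTC date format."""
    parsed = parsedate_to_datetime(header_value)
    return parsed.astimezone(timezone.utc).strftime(SOLR_FORMAT)


def epoch_millis_to_solr(millis: int) -> str:
    """Format milliseconds since the Unix epoch in Solr's UTC date format."""
    moment = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return moment.strftime(SOLR_FORMAT)


# The header value that parsechecker reports:
print(http_date_to_solr("Tue, 18 Oct 2016 14:37:40 GMT"))  # 2016-10-18T14:37:40Z

# A zero timestamp, which matches what ends up in Solr:
print(epoch_millis_to_solr(0))  # 1970-01-01T00:00:00Z
```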

Why is this? Also, since I think 'date' is derived from the Last-Modified header, I'd like the same value to go into 'lastModified' too.

I saw a reference to adding this to solrindex-mapping.xml:
    <field dest="lastModified" source="date"/>
but with that in place IndexingJob dies with:
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=com.abloz:http/hbase/book.html] multiple values encountered for non multiValued field lastModified: [Tue Jun 16 10:55:02 UTC 2015, Tue Jun 16 10:55:02 UTC 2015]

which makes no sense. There aren't two Last-Modified HTTP headers, are there? It does at least confirm that the value is getting through.
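One way I can imagine this happening without duplicate headers (an assumption on my part, not checked against the Nutch source): if the document already carries its own 'lastModified' field before the mapping runs, then mapping 'date' onto the same destination would leave two values there. A small sketch of that idea; `apply_mapping` and the sample document are hypothetical and only mimic a source-to-dest rename:

```python
from collections import defaultdict


def apply_mapping(doc, mapping):
    """Rename fields by source -> dest; unmapped fields keep their own name."""
    out = defaultdict(list)
    for source, values in doc.items():
        dest = mapping.get(source, source)
        out[dest].extend(values)
    return dict(out)


# Hypothetical document: assumes both fields were already populated
# from the same header value before the mapping is applied.
doc = {
    "date": ["2015-06-16T10:55:02Z"],
    "lastModified": ["2015-06-16T10:55:02Z"],
}

# Equivalent of <field dest="lastModified" source="date"/>
mapping = {"date": "lastModified"}

mapped = apply_mapping(doc, mapping)
print(mapped)
# {'lastModified': ['2015-06-16T10:55:02Z', '2015-06-16T10:55:02Z']}
```

Two identical values in a field that is not multiValued would match the shape of the error above, but I don't know whether this is what actually happens.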

The Solr schema is correct, I think (there's no real-world reason for lastModified to be multiValued!):
     <field name="lastModified" type="date" stored="true" indexed="false"/>

*Tom Chiverton*
Lead Developer