Hello Tom,

To get parse metadata fields indexed, you need the index-metadata plugin. Add 
it to plugin.includes, then use the index.parse.md property to list the fields 
you want indexed (in your case, date). Use indexchecker to verify that the 
field ends up in the indexed document.
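
As a rough sketch, the relevant part of conf/nutch-site.xml could look like the
fragment below. The plugin.includes value is only an illustrative assumption:
keep whatever list you already have and just make sure index-metadata is part
of it. The date value is the parse metadata key from your parsechecker output.

```xml
<!-- Sketch only: this plugin list is an assumed example, not a recommended
     default. Keep your existing value and add index-metadata to it. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<!-- Comma-separated list of parse metadata keys to copy into the index. -->
<property>
  <name>index.parse.md</name>
  <value>date</value>
</property>
```

With this in place, running bin/nutch indexchecker against the PDF's URL should
show a date field if the filter picked it up. The Solr schema still needs a
field of that name for the document to be accepted.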

Regards,
Markus

 
 
-----Original message-----
> From:Tom Potter <tom.pot...@orangebus.co.uk>
> Sent: Wednesday 13th February 2019 11:51
> To: user@nutch.apache.org
> Subject: Difficulty getting data from Nutch parse data into Solr document
> 
> I'm not sure how to get some of the data from a crawled PDF document into
> my Solr index. When I run the parsechecker tool I can see the date I need
> as an attribute of the Content Metadata (date=2018-08-06T14:14:00Z), but
> I'm not sure how I configure the solrindex-mapping.xml to successfully map
> this to a Solr field.
> 
> I tried adding the below mapping, but it didn't work:
> 
> <field dest="date" source="date"/>
> 
> Below is an example of the result of the parsechecker data showing the date
> attribute in the Content Metadata:
> ---------
> ParseData
> ---------
> 
> Version: 5
> Status: success(1,0)
> Title: XXXXXXX
> Outlinks: 1
>   outlink: toUrl: https://xxx.zzz anchor:
> Content Metadata: Server=Microsoft-IIS/7.5 Connection=close
> Last-Modified=Mon, 06 Aug 2018 15:16:28 GMT Date=Wed, 13 Feb 2019 10:36:52
> GMT nutch.crawl.score=0.0 nutch.fetch.time=1550054216537
> Cache-Control=no-cache, no-store ETag="8727b79f5faf0086a80c86df4cbbac12"
> Content-Disposition=inline; filename=xxxxx.pdf" X-AspNet-Version=4.0.30319
> Content-Length=81903 Content-Type=application/pdf X-Powered-By=ASP.NET
> Parse Metadata: date=2018-08-06T14:14:00Z pdf:PDFVersion=1.5
> xmp:CreatorTool=Microsoft Office Word
> access_permission:modify_annotations=true
> access_permission:can_print_degraded=true dc:creator=XXXXX
> dcterms:created=2018-08-06T14:14:00Z Last-Modified=2018-08-06T14:14:00Z
> dcterms:modified=2018-08-06T14:14:00Z dc:format=application/pdf;
> version=1.5 Last-Save-Date=2018-08-06T14:14:00Z
> access_permission:fill_in_form=true meta:save-date=2018-08-06T14:14:00Z
> pdf:encrypted=false dc:title=xxxxxxxx modified=2018-08-06T14:14:00Z
> Content-Type=application/pdf creator=XXXXXX meta:author=XXXXX
> meta:creation-date=2018-08-06T14:14:00Z created=Mon Aug 06 15:14:00 BST
> 2018 access_permission:extract_for_accessibility=true
> access_permission:assemble_document=true xmpTPg:NPages=7
> Creation-Date=2018-08-06T14:14:00Z access_permission:extract_content=true
> access_permission:can_print=true Author=XXXXXX producer=Aspose.Words for
> .NET 16.2.0.0 access_permission:can_modify=true
> 
> 
