Hi,
I used the Tika extractor today and it works, but it doesn't extract the text
content of the documents.
What is the name of the field in which Tika returns the text content?
"stream_name":"201801010200100000005782L.pdf",
"createdon":"Fri Dec 22 10:37:04 CET 2017",
"id":"file://///srvics01/ways_holding/gestion_ged/gerance/3004/3004100812019699/201801010200100000005782L.pdf",
"pdf_docinfo_created":"2017-12-22T09:37:03Z",
"pdf_docinfo_producer":"Apache FOP Version 1.1",
"xmp_creatortool":"Apache FOP Version 1.1",
"access_permission_fill_in_form":"true",
"meta_creation_date":"2017-12-22T09:37:03Z",
"content_type":["application/pdf",
"text/plain; charset=UTF-8"],
"stream_size":143674,
"dcterms_created":"2017-12-22T09:37:03Z",
"access_permission_can_print":"true",
"access_permission_modify_annotations":"true",
"pdf_pdfversion":"1.4",
"dc_format":"application/pdf; version=1.4",
"x_parsed_by":["org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.txt.TXTParser"],
"access_permission_extract_for_accessibility":"true",
"producer":"Apache FOP Version 1.1",
"lastmodified":"Fri Dec 22 10:37:04 CET 2017",
"pdf_docinfo_creator_tool":"Apache FOP Version 1.1",
"created":"Fri Dec 22 10:37:03 CET 2017",
"resourcename":["201801010200100000005782L.pdf",
"201801010200100000005782L.pdf"],
"filelastmodified":"2017-12-22T09:37:04.070Z",
"creation_date":"2017-12-22T09:37:03Z",
"xmptpg_npages":"1",
"access_permission_can_print_degraded":"true",
"filecreatedon":"2017-12-22T09:37:04.000Z",
"access_permission_can_modify":"true",
"access_permission_extract_content":"true",
"attributes":"32",
"access_permission_assemble_document":"true",
"sharename":"ways_holding",
"pdf_encrypted":"false",
"stream_content_type":"application/pdf",
"stream_source_info":"201801010200100000005782L.pdf",
"content_encoding":["UTF-8"],
"_version_":1588768212845068289}]
}}
Regards,
From: Karl Wright [mailto:[email protected]]
Sent: Friday, January 5, 2018 18:40
To: [email protected]
Subject: Re: OCR Tika to read PDF, txt and doc docx
Hi,
It's pretty straightforward. EITHER you configure your Solr output connection
to use the extracting update handler and Solr Cell (the default), so that Tika
is used on the Solr side, OR you configure it to use the standard update handler
and insert the Tika Extractor as a document transformer in your job's pipeline.
Karl
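To make the first option above more concrete, here is a minimal Python sketch
of the kind of request URL Solr's extracting update handler (Solr Cell)
accepts. It is an illustration only, not taken from the thread: the base URL,
the core name "mycore", the document id and the helper name build_extract_url
are all hypothetical. The parameters literal.id, fmap.content, uprefix and
commit are standard Solr Cell request parameters; by default Solr Cell places
the extracted body text in a field named "content", and fmap.content lets you
map it to another field.

```python
from urllib.parse import urlencode


def build_extract_url(solr_base, core, doc_id, content_field="content"):
    """Build a URL for Solr's extracting update handler (/update/extract).

    solr_base, core and doc_id are placeholders chosen for illustration.
    """
    params = {
        "literal.id": doc_id,           # set the document id explicitly
        "fmap.content": content_field,  # field that receives the extracted body text
        "uprefix": "attr_",             # prefix for metadata fields unknown to the schema
        "commit": "true",               # commit right after the update
    }
    return f"{solr_base.rstrip('/')}/{core}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance and core name.
    print(build_extract_url("http://localhost:8983/solr", "mycore", "doc1"))
```

Note that the target field only shows up in query results if it is declared as
stored in the Solr schema; a field that is indexed but not stored is
searchable yet absent from the returned document.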
On Fri, Jan 5, 2018 at 12:19 PM, msaunier <[email protected]> wrote:
Sorry, that was a mistake. I need the text content of PDF, txt, doc and docx
files to index in Solr.
Thanks for your help.
From: msaunier [mailto:[email protected]]
Sent: Friday, January 5, 2018 18:05
To: [email protected]
Subject: OCR Tika to read PDF, txt and doc docx
Hello,
How can I install and use an OCR tool to extract the content_html from files
with ManifoldCF?
I need the HTML content.
Thanks for your help,