Re: OCR Tika to read PDF, txt and doc docx

Karl Wright Fri, 05 Jan 2018 09:53:00 -0800

The Tika transformer replaces the binary stream of the document with the
extracted content of the document.  The mapping between the document's
content stream and Solr fields is handled entirely by Solr.
It sounds like you will need to read up on Solr and review the
documentation on the Solr output connector here:


https://manifoldcf.apache.org/release/release-2.9/en_US/end-user-documentation.html#solroutputconnector

Karl
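
To see which field, if any, received the extracted text, one option is to inspect an indexed document and look for long free-text values. The following is only a minimal sketch and is not part of ManifoldCF or Solr: the helper name `find_text_fields`, the `min_chars` threshold, and the sample field name `content` are assumptions made for illustration.

```python
import json


def find_text_fields(doc, min_chars=200):
    """Return names of fields whose string value(s) look like body text.

    `doc` is one Solr result document already decoded from JSON.
    `min_chars` is an arbitrary illustrative threshold, not a Solr setting.
    """
    hits = []
    for name, value in doc.items():
        # Solr multi-valued fields arrive as lists; treat single values alike.
        values = value if isinstance(value, list) else [value]
        if any(isinstance(v, str) and len(v) >= min_chars for v in values):
            hits.append(name)
    return sorted(hits)


# A metadata-only document, shaped like the one quoted in this thread.
metadata_only = json.loads(
    '{"stream_name": "201801010200100000005782L.pdf",'
    ' "content_type": ["application/pdf", "text/plain; charset=UTF-8"],'
    ' "stream_size": 143674}'
)

# A hypothetical document where a field named "content" holds extracted text.
with_text = dict(metadata_only, content="Lorem ipsum " * 50)

print(find_text_fields(metadata_only))  # prints []
print(find_text_fields(with_text))      # prints ['content']
```

An empty result suggests that no returned field stores the text. The field may still be indexed but not stored, so the Solr schema is the place to check.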


On Fri, Jan 5, 2018 at 12:46 PM, msaunier <[email protected]> wrote:

> Hi,
>
> I used the Tika extractor today and it works, but it does not extract the
> text content of the documents.
>
> What is the name of the field in which Tika returns the text content?
>
>
>
> "stream_name":"201801010200100000005782L.pdf",
>         "createdon":"Fri Dec 22 10:37:04 CET 2017",
>         "id":"file://///srvics01/ways_holding/gestion_ged/gerance/3004/3004100812019699/201801010200100000005782L.pdf",
>         "pdf_docinfo_created":"2017-12-22T09:37:03Z",
>         "pdf_docinfo_producer":"Apache FOP Version 1.1",
>         "xmp_creatortool":"Apache FOP Version 1.1",
>         "access_permission_fill_in_form":"true",
>         "meta_creation_date":"2017-12-22T09:37:03Z",
>         "content_type":["application/pdf",
>           "text/plain; charset=UTF-8"],
>         "stream_size":143674,
>         "dcterms_created":"2017-12-22T09:37:03Z",
>         "access_permission_can_print":"true",
>         "access_permission_modify_annotations":"true",
>         "pdf_pdfversion":"1.4",
>         "dc_format":"application/pdf; version=1.4",
>         "x_parsed_by":["org.apache.tika.parser.DefaultParser",
>           "org.apache.tika.parser.DefaultParser",
>           "org.apache.tika.parser.txt.TXTParser"],
>         "access_permission_extract_for_accessibility":"true",
>         "producer":"Apache FOP Version 1.1",
>         "lastmodified":"Fri Dec 22 10:37:04 CET 2017",
>         "pdf_docinfo_creator_tool":"Apache FOP Version 1.1",
>         "created":"Fri Dec 22 10:37:03 CET 2017",
>         "resourcename":["201801010200100000005782L.pdf",
>           "201801010200100000005782L.pdf"],
>         "filelastmodified":"2017-12-22T09:37:04.070Z",
>         "creation_date":"2017-12-22T09:37:03Z",
>         "xmptpg_npages":"1",
>         "access_permission_can_print_degraded":"true",
>         "filecreatedon":"2017-12-22T09:37:04.000Z",
>         "access_permission_can_modify":"true",
>         "access_permission_extract_content":"true",
>         "attributes":"32",
>         "access_permission_assemble_document":"true",
>         "sharename":"ways_holding",
>         "pdf_encrypted":"false",
>         "stream_content_type":"application/pdf",
>         "stream_source_info":"201801010200100000005782L.pdf",
>         "content_encoding":["UTF-8"],
>         "_version_":1588768212845068289}]
>   }}
>
> Regards,
>
>
> *From:* Karl Wright [mailto:[email protected]]
> *Sent:* Friday, January 5, 2018 18:40
> *To:* [email protected]
> *Subject:* Re: OCR Tika to read PDF, txt and doc docx
>
>
>
> Hi,
>
>
>
> It's pretty straightforward.  EITHER you configure your Solr output
> connection to use the extracting update handler and Solr Cell (the
> default), so that Tika is used on the Solr side, OR you configure it to use
> the standard update handler and insert the Tika Extractor as a document
> transformer in your job's pipeline.
>
>
>
> Karl
>
>
>
> On Fri, Jan 5, 2018 at 12:19 PM, msaunier <[email protected]> wrote:
>
> Sorry, that was a mistake. I need the text *content* of PDF, txt, doc and
> docx files to be indexed in Solr.
>
>
>
> Thanks for your help.
>
>
>
>
>
> *From:* msaunier [mailto:[email protected]]
> *Sent:* Friday, January 5, 2018 18:05
> *To:* [email protected]
> *Subject:* OCR Tika to read PDF, txt and doc docx
>
>
>
> Hello,
>
>
>
> How can I install and use OCR to extract the content_html of files with
> ManifoldCF?
>
> I need the HTML content.
>
>
>
> Thanks for your help,
