Re: Parsing order issue

Tilman Hausherr Tue, 17 Dec 2019 10:02:31 -0800

I already answered... we need the PDF.

But... about the config:


<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default Parser for most things, except for 2 mime types, and never
         use the Executable Parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>

      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>

    </parser>

    <!-- Use a different parser for PDF -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <property name="sortByPosition" value="true"/>
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>

Is this a correct setting for PDFs in Tika? I notice that the same parser class is used twice.

And the file was named "tika.config"; shouldn't it be named "tika-config.xml"?
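
For comparison, here is a minimal sketch of how a PDF-specific setting is usually written. It assumes Tika 1.x config syntax: the explicit org.apache.tika.parser.pdf.PDFParser class and the <params>/<param> form are assumptions about the Tika version in use, not something confirmed in this thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default Parser for everything except PDF -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>

    <!-- Assumed: configure the PDF parser itself, once, with its own params -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

This variant excludes the PDF parser from the DefaultParser and declares it once, instead of declaring DefaultParser twice. Whether it changes the reading order still depends on the PDF itself.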


Tilman

On 17.12.2019 at 13:33, Tim Allison wrote:

PDFBox Colleagues,
   Any recommendations?

On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <vistax...@gmail.com> wrote:

Dear Tika Dev Team,



Hope this email finds you well.



I have been actively using Tika to read PDF files. One issue I found is
the parsing order. As shown in the attached image, the parsing order of the
PDF file is not based on the position of the text.



As suggested in this GitHub link
<https://github.com/chrismattmann/tika-python/issues/266>, I used a
customized config file (see attached), hoping to solve the issue, but this
has not worked. If possible, could you please review this issue and
provide any insights or solutions?



Thanks so much in advance.



Regards,

Luke



