I already answered this: we need the PDF.
But about the config:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default Parser for most things, except for 2 mime types, and never
         use the Executable Parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/pdf</mime-exclude>
      <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
    </parser>
    <!-- Use a different parser for PDF -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <property name="sortByPosition" value="true"/>
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
Is this a correct setting for PDFs in Tika? I notice that the same
parser class is used twice, even though the comment says "a different
parser".
Also, the file was named "tika.config"; shouldn't it be named
"tika-config.xml"?
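For comparison, here is a minimal sketch of what a PDF-specific setup might look like. It assumes Tika 1.x and its <params> syntax for parser settings. The PDF parser is excluded from the DefaultParser and declared on its own as org.apache.tika.parser.pdf.PDFParser, with sortByPosition switched on. It is untested against the file in question, so treat it as a starting point and not a confirmed fix:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parser for everything except PDF (sketch, assumes Tika 1.x) -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Dedicated PDF parser with position-based text sorting enabled -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Whether sorting by position actually gives the expected reading order still depends on the PDF itself.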
Tilman
On 17.12.2019 at 13:33, Tim Allison wrote:
PDFBox Colleagues,
Any recommendations?
On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <vistax...@gmail.com> wrote:
Dear Tika Dev Team,
Hope this email finds you well.
I have been actively using Tika to read PDF files. One issue I found is
the parsing order. As shown in the attached image, the text of the PDF
file is not extracted in order of its position on the page.
As suggested in this GitHub link
<https://github.com/chrismattmann/tika-python/issues/266>, I used a
customized config file (see attached), hoping to solve the issue, but this
has not worked. If possible, could you please review this issue and
provide any insights or solutions?
Thanks so much in advance.
Regards,
Luke