[2nd attempt]
As I understand it, to use sortByPosition in Tika you need a segment
like this in your config file:
...
<parser class="org.apache.tika.parser.pdf.PDFParser">
  <params>
    <param name="sortByPosition" type="bool">true</param>
  </params>
</parser>
...
So your whole file would look like this:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default Parser for most things, except for PDF -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
    <!-- Use a different parser for PDF -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <mime>application/pdf</mime>
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
I just tried this file with tika-app. With the default configuration the
text was not sorted; with this file it was. I passed it by adding
"--config=config.xml" on the command line.
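
If the config still seems to be ignored, a quick sanity check along these
lines may help. This is only a sketch in Python using the standard library;
the function name pdf_sort_by_position is made up for illustration. It only
verifies that the XML is well-formed and that the sortByPosition param sits
under the PDFParser entry. It does not prove that Tika actually loads the file.

```python
import xml.etree.ElementTree as ET

PDF_PARSER = "org.apache.tika.parser.pdf.PDFParser"

# Same content as the config file shown above.
CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>application/pdf</mime-exclude>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <mime>application/pdf</mime>
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""


def pdf_sort_by_position(config_xml: str) -> bool:
    """Return True if the PDFParser entry sets sortByPosition to true.

    Hypothetical helper: raises xml.etree.ElementTree.ParseError if the
    XML is not well-formed, and returns False if the param is missing.
    """
    root = ET.fromstring(config_xml.encode("utf-8"))
    for parser in root.iter("parser"):
        if parser.get("class") != PDF_PARSER:
            continue
        for param in parser.iter("param"):
            if param.get("name") == "sortByPosition":
                return (param.text or "").strip().lower() == "true"
    return False


if __name__ == "__main__":
    print(pdf_sort_by_position(CONFIG))
```

To check a file on disk, read it first, for example
pdf_sort_by_position(open("config.xml", encoding="utf-8").read()).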
Tilman
On 07.01.2020 at 00:04, Lu Sun wrote:
Dear PDFBox Dev Team,
After searching online
<https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>, I
am fairly sure that using setSortByPosition(true) would help. However, I am
struggling to get the config file right. Could you please offer any advice
on it?
Thanks so much in advance. Regards, Luke
On Fri, 20 Dec 2019 at 18:06, Lu Sun <vistax...@gmail.com> wrote:
Dear PDFBox Dev Team,
Hope this message finds you well.
I just wanted to bring this to your attention. Could you please suggest a
solution to the parsing order issue? Attached are my config file, an
example PDF file and my parsing results.
Thanks so much in advance. Wish you and your team a Merry Christmas and
Happy New Year.
Regards,
Luke
On Tue, 17 Dec 2019 at 12:34, Tim Allison <talli...@apache.org> wrote:
PDFBox Colleagues,
Any recommendations?
On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <vistax...@gmail.com> wrote:
Dear Tika Dev Team,
Hope this email finds you well.
I have been actively using Tika to read PDF files. One issue I found
is the parsing order. As shown in the attached image, the text of the PDF
file is not extracted in the order of its position on the page.
As suggested in this GitHub issue
<https://github.com/chrismattmann/tika-python/issues/266>, I used a
customized config file (see attached), hoping to solve the issue, but this
has not worked. If possible, could you please review this issue and
provide any insights or solutions?
Thanks so much in advance.
Regards,
Luke