Hi,
This is a known problem. When a page has multiple columns it is usually
better to use the unsorted order (sortByPosition switched off). Even that
may not work properly if the sequence of the text in the PDF doesn't make
sense.
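
For Tika users, a minimal sketch of a config file that leaves position sorting off for the PDF parser. It assumes the Tika 1.x parameter syntax and the parameter name "sortByPosition"; please check both against the Tika version in use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parsers for everything except PDF -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- PDF parser with position sorting off (assumed parameter syntax) -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
```
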
Somebody with a lot of time should write software that identifies
"blocks" (these are often rectangular, but not always) and then extracts
them in sequence.
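
A very rough sketch of the "blocks" idea in plain Java, not PDFBox code. It assumes each text line is already available as a fragment with a left edge x and a top edge y (y growing downward), that every fragment belongs to exactly one column, and that columns are separated by a horizontal jump larger than columnGap. The names ColumnOrder, Fragment and readByColumn are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ColumnOrder {

    /** Hypothetical line fragment: left edge x, top edge y (y grows downward), and its text. */
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Groups fragments into columns by their left edge, then reads each
     * column top to bottom, with columns taken left to right.
     */
    static List<String> readByColumn(List<Fragment> fragments, double columnGap) {
        // Sort a copy by left edge so fragments of one column end up adjacent.
        List<Fragment> byX = new ArrayList<>(fragments);
        byX.sort(Comparator.comparingDouble((Fragment f) -> f.x));

        List<List<Fragment>> columns = new ArrayList<>();
        double lastX = Double.NEGATIVE_INFINITY;
        for (Fragment f : byX) {
            // A jump in the left edge larger than columnGap starts a new column.
            if (columns.isEmpty() || f.x - lastX > columnGap) {
                columns.add(new ArrayList<>());
            }
            columns.get(columns.size() - 1).add(f);
            lastX = f.x;
        }

        // Within each column, read from top to bottom.
        List<String> ordered = new ArrayList<>();
        for (List<Fragment> column : columns) {
            column.sort(Comparator.comparingDouble((Fragment f) -> f.y));
            for (Fragment f : column) {
                ordered.add(f.text);
            }
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Two columns, listed in left-to-right, top-to-bottom order.
        List<Fragment> page = Arrays.asList(
                new Fragment(50, 100, "left 1"),
                new Fragment(300, 100, "right 1"),
                new Fragment(50, 120, "left 2"),
                new Fragment(300, 120, "right 2"));
        System.out.println(readByColumn(page, 100)); // [left 1, left 2, right 1, right 2]
    }
}
```

This only handles the simple case of clean, left-aligned columns. Indented lines, headings that span columns and non-rectangular blocks would need real layout analysis.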
Tilman
On 10.01.2020 at 16:53, Lu Sun wrote:
Hi Tilman,
Sorry for the late response. I implemented the config file, and it
worked well. Sadly, I noticed that "sortByPosition" does not parse
PDF files well when a page has multiple columns, as it follows an
order of left to right, then top to bottom.
Please see the attached images. Could you advise me on how to
deal with such a case?
Thanks so much.
Luke
On Tue, 7 Jan 2020 at 13:00, Tilman Hausherr <thaush...@t-online.de
<mailto:thaush...@t-online.de>> wrote:
hi,
I answered that one on the mailing list. You need to subscribe,
or read the archives. I'll see if I can forward it.
Tilman
------------------------------------------------------------------------
--- Original Message ---
*From: *Lu Sun
*Subject: *Re: Parsing order issue
*Date: *07.01.2020, 0:04
*To: *users@pdfbox.apache.org <mailto:users@pdfbox.apache.org>
*Cc: *talli...@apache.org <mailto:talli...@apache.org>,
<d...@tika.apache.org <mailto:d...@tika.apache.org>>
Dear PDFBox Dev Team,
After searching online
<https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>,
I am certain that using setSortByPosition(true) would help. However, I am
struggling to get the config file right. Can you please provide any advice
on it?
Thanks so much in advance. Regards, Luke
On Fri, 20 Dec 2019 at 18:06, Lu Sun <vistax...@gmail.com
<mailto:vistax...@gmail.com>> wrote:
> Dear PDFBox Dev Team,
>
> Hope this message finds you well.
>
> Just wanted to raise this for your attention. Could you please provide any
> solutions to the parsing order issue? Attached are my config file, an
> example PDF file and my parsing results.
>
> Thanks so much in advance. Wishing you and your team a Merry Christmas and
> a Happy New Year.
>
> Regards,
> Luke
>
> On Tue, 17 Dec 2019 at 12:34, Tim Allison <talli...@apache.org <mailto:talli...@apache.org>> wrote:
>
>> PDFBox Colleagues,
>> Any recommendations?
>>
>> On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <vistax...@gmail.com <mailto:vistax...@gmail.com>> wrote:
>>
>>> Dear Tika Dev Team,
>>>
>>> Hope this email finds you well.
>>>
>>> I have been actively using Tika for reading PDF files. One issue I found
>>> is the parsing order. As shown in the attached image, the parsing order of
>>> the PDF file is not based on the position of the text.
>>>
>>> As suggested in this GitHub issue
>>> <https://github.com/chrismattmann/tika-python/issues/266>, I used a
>>> customized config file (see attached), hoping to solve the issue. But this
>>> has not worked. If possible, can you please review this issue and
>>> provide any insights or solutions?
>>>
>>> Thanks so much in advance.
>>>
>>> Regards,
>>>
>>> Luke
>>>
>>