Hi Tilman,

Sorry for the late response. I implemented the config file, and it worked well. Unfortunately, I noticed that "sortByPosition" does not handle PDF files with multiple columns on a page well, because it orders the text left to right and top to bottom across the whole page.
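
To make the problem concrete, here is a minimal sketch of what I think is happening. It is not PDFBox's actual algorithm: the `Fragment` type, the coordinates, the `column_split_x` value, and both helper functions are hypothetical and only illustrate why a page-wide top-to-bottom, left-to-right sort interleaves two columns, and how grouping by column first would give the reading order I expect.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the text fragment (hypothetical units)
    y: float   # distance from the top of the page
    text: str


def sort_by_position(fragments: List[Fragment]) -> List[str]:
    """Order fragments top to bottom, then left to right, across the whole page."""
    return [f.text for f in sorted(fragments, key=lambda f: (f.y, f.x))]


def sort_by_column(fragments: List[Fragment], column_split_x: float) -> List[str]:
    """Read everything left of column_split_x first, then everything right of it."""
    left = [f for f in fragments if f.x < column_split_x]
    right = [f for f in fragments if f.x >= column_split_x]
    by_line = lambda f: (f.y, f.x)
    return [f.text for f in sorted(left, key=by_line) + sorted(right, key=by_line)]


# Hypothetical two-column page with two lines per column.
page = [
    Fragment(50, 100, "left 1"),
    Fragment(300, 100, "right 1"),
    Fragment(50, 120, "left 2"),
    Fragment(300, 120, "right 2"),
]

print(sort_by_position(page))     # ['left 1', 'right 1', 'left 2', 'right 2']
print(sort_by_column(page, 200))  # ['left 1', 'left 2', 'right 1', 'right 2']
```

The first result mixes the two columns line by line, which matches what I see in my output. The second is the order I would like to get. The fixed split position is an assumption that only works when the column boundary is known in advance.
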
Please see the attached images. Could you advise me on how to deal with such a case?

Thanks so much.

Luke

On Tue, 7 Jan 2020 at 13:00, Tilman Hausherr <thaush...@t-online.de> wrote:

> Hi,
> I answered that one on the mailing list. You need to subscribe, or read
> the archives. I'll see if I can forward it.
> Tilman
>
> --- Original Message ---
> From: Lu Sun
> Subject: Re: Parsing order issue
> Date: 07.01.2020, 0:04
> To: users@pdfbox.apache.org
> Cc: talli...@apache.org, <d...@tika.apache.org>
>
> Dear PDFBox Dev Team,
>
> After searching online
> <https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>,
> I am certain that using setSortByPosition(true) would help. However, I am
> struggling to get the config file right. Can you please provide any advice
> on it?
>
> Thanks so much in advance.
>
> Regards,
> Luke
>
> On Fri, 20 Dec 2019 at 18:06, Lu Sun <vistax...@gmail.com> wrote:
>
> > Dear PDFBox Dev Team,
> >
> > Hope this message finds you well.
> >
> > Just wanted to raise this for your attention. Could you please provide any
> > solutions to the parsing order issue? Attached are my config file, an
> > example PDF file, and my parsing results.
> >
> > Thanks so much in advance. Wish you and your team a Merry Christmas and
> > Happy New Year.
> >
> > Regards,
> > Luke
> >
> > On Tue, 17 Dec 2019 at 12:34, Tim Allison <talli...@apache.org> wrote:
> >
> >> PDFBox Colleagues,
> >> Any recommendations?
> >>
> >> On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <vistax...@gmail.com> wrote:
> >>
> >>> Dear Tika Dev Team,
> >>>
> >>> Hope this email finds you well.
> >>>
> >>> I have been actively using Tika for reading PDF files. One issue I found
> >>> is the parsing order. As shown in the attached image, the parsing order of
> >>> the PDF file is not based on the position of the text.
> >>>
> >>> As suggested in this GitHub link
> >>> <https://github.com/chrismattmann/tika-python/issues/266>, I used a
> >>> customized config file (see attached), hoping to solve the issue. But this
> >>> has not worked. If possible, can you please review this issue and
> >>> provide any insights or solutions?
> >>>
> >>> Thanks so much in advance.
> >>>
> >>> Regards,
> >>>
> >>> Luke