Hi Tim,

I hope you are all doing well.

I just want to bring this issue to your attention. As shown in the attached images, "sortByPosition" does not parse PDF files correctly when a page has multiple columns. Could you please advise on any solutions?

By the way, I didn't see my last email on the mailing list. Is that because I am not subscribed?

Thanks so much in advance.
Luke

[image: tesco_parsing_order.JPG]
[image: DeLaRue_parsing_order.JPG]

On Fri, 10 Jan 2020 at 15:53, Lu Sun <vistax...@gmail.com> wrote:

> Hi Tilman,
>
> Sorry for the late response. I implemented the config file, and it worked
> well. Sadly, I noticed that this "sortByPosition" cannot parse well when
> PDF files have multiple columns in a page, as it follows an order of
> left->right, and top->down.
>
> Pls see attached images. Is it possible you can advise me on how to deal
> with such case?
>
> Thanks so much.
> Luke
>
>
> On Tue, 7 Jan 2020 at 13:00, Tilman Hausherr <thaush...@t-online.de>
> wrote:
>
>> hi,
>> I answered that one in the mailing lists. You need to subscribe, or read
>> the archives. I'll see if I can fwd it.
>> Tilman
>>
>> --- Original message ---
>> From: Lu Sun
>> Subject: Re: Parsing order issue
>> Date: 07.01.2020, 0:04
>> To: users@pdfbox.apache.org
>> Cc: talli...@apache.org, <d...@tika.apache.org>
>>
>> Dear PDFBox Dev Team,
>>
>> After searching through online
>> <https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>,
>> I am certain that using setSortByPosition(true) would help. However, I am
>> struggling to get the config file right. Can you please provide any advice
>> on it?
>>
>> Thanks so much in advance. Regards, Luke
>>
>> On Fri, 20 Dec 2019 at 18:06, Lu Sun <vistax...@gmail.com> wrote:
>>
>> > Dear PDFBox Dev Team,
>> >
>> > Hope this message finds you well.
>> >
>> > Just wanted to raise this for your attention. Please can you provide any
>> > solutions on the parsing order issue? Attached is my config file, an
>> > example of pdf file and my parsing results.
>> >
>> > Thanks so much in advance. Wish you and your team a Merry Christmas and
>> > Happy New Year.
>> >
>> > Regards,
>> > Luke
>> >
>> > On Tue, 17 Dec 2019 at 12:34, Tim Allison <talli...@apache.org> wrote:
>> >
>> >> PDFBox Colleagues,
>> >> Any recommendations?
>> >>
>> >> On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <vistax...@gmail.com> wrote:
>> >>
>> >>> Dear Tika Dev Team,
>> >>>
>> >>> Hope this email finds you well.
>> >>>
>> >>> I have been actively using Tika for pdf file reading. One issue I found
>> >>> is the parsing order. As shown in attached image, the parsing order of pdf
>> >>> file is not based on position of texts.
>> >>>
>> >>> As suggested in this github link
>> >>> <https://github.com/chrismattmann/tika-python/issues/266>, I used a
>> >>> customized config file (see attached), hoping to solve the issue. But this
>> >>> has not worked out. If any chance, can you please review this issue, and
>> >>> provide any insights or solutions?
>> >>>
>> >>> Thanks so much in advance.
>> >>>
>> >>> Regards,
>> >>>
>> >>> Luke
>> >>>
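
To make the ordering problem described at the top of this thread concrete, here is a minimal Python sketch. It is not PDFBox or Tika code: `Fragment`, `sort_by_position`, `sort_by_column` and the `column_edges` parameter are hypothetical names made up for illustration, and the coordinates are invented. It only shows why a pure top->down, left->right sort interleaves two columns, and how grouping fragments by column first would restore reading order, assuming the column boundaries are known in advance.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with the position of its top-left corner (hypothetical model)."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def sort_by_position(frags, line_tol=2.0):
    """Naive order: top->down first, then left->right within the same line band."""
    return sorted(frags, key=lambda f: (int(f.y // line_tol), f.x))


def sort_by_column(frags, column_edges):
    """Column-aware order: assign each fragment to a column by x, then read
    each column top->down. column_edges holds the left edge of every column
    in ascending order (assumed to be known, e.g. measured by hand)."""
    def column_of(frag):
        idx = 0
        for i, edge in enumerate(column_edges):
            if frag.x >= edge:
                idx = i
        return idx

    return sorted(frags, key=lambda f: (column_of(f), f.y, f.x))


# A two-column page: column A starts at x=50, column B at x=320 (invented values).
page = [
    Fragment("B2", 320, 120),
    Fragment("A1", 50, 100),
    Fragment("B1", 320, 100),
    Fragment("A2", 50, 120),
]

print([f.text for f in sort_by_position(page)])        # columns interleaved
print([f.text for f in sort_by_column(page, [0, 300])])  # column A, then column B
```

With PDFBox itself, one possible way to get a similar effect is to define one rectangular region per column with `PDFTextStripperByArea` and extract the regions in reading order. That still requires knowing or estimating the column boundaries for each layout, so it is a workaround under that assumption rather than a general fix.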