Hi Tilman,

Sorry for the late response. I implemented the config file, and it worked
well. Unfortunately, I noticed that "sortByPosition" does not parse pages
with multiple columns correctly, because it orders text left to right and
then top to bottom across the whole page, so lines from adjacent columns
are interleaved.
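
To make the problem concrete, here is a toy sketch in Python. It is not
PDFBox's actual code: the fragment coordinates, the function names and the
split_x column boundary are all invented for illustration. It only shows
why a page-wide position sort interleaves two columns, and how sorting each
column separately would avoid that if the boundary were known.

```python
# Toy text fragments as (x, y, text); y grows downward.
# All coordinates are invented for illustration.
fragments = [
    (50, 100, "L1"), (300, 100, "R1"),
    (50, 120, "L2"), (300, 120, "R2"),
]

def sort_by_position(frags):
    """Order fragments top to bottom, then left to right (page-wide)."""
    return [text for x, y, text in sorted(frags, key=lambda f: (f[1], f[0]))]

def sort_by_column(frags, split_x):
    """Read everything left of split_x first, then everything right of it.

    split_x is a hypothetical, manually chosen column boundary.
    """
    left = [f for f in frags if f[0] < split_x]
    right = [f for f in frags if f[0] >= split_x]
    return sort_by_position(left) + sort_by_position(right)

naive = sort_by_position(fragments)                  # columns interleaved
by_column = sort_by_column(fragments, split_x=200)   # left column, then right

print(naive)
print(by_column)
```

The second function assumes the column boundary is known in advance,
which is not the case for arbitrary PDF files.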

Please see the attached images. Could you advise me on how to deal with
such a case?

Thanks so much.
Luke


On Tue, 7 Jan 2020 at 13:00, Tilman Hausherr <thaush...@t-online.de> wrote:

> hi,
> I answered that one in the mailing lists. You need to subscribe, or read
> the archives. I'll see if I can fwd it.
> Tilman
>
>
> --- Original Message ---
> *From: *Lu Sun
> *Subject: *Re: Parsing order issue
> *Date: *07.01.2020, 0:04
> *To: *users@pdfbox.apache.org
> *Cc: *talli...@apache.org, <d...@tika.apache.org>
>
>
>
>
> Dear PDFBox Dev Team,
>
> After searching online
> <https://stackoverflow.com/search?page=5&tab=Relevance&q=pdfbox%20order>,
> I am certain that using setSortByPosition(true) would help. However, I am
> struggling to get the config file right. Can you please provide any advice
> on it?
>
> Thanks so much in advance. Regards, Luke
>
> On Fri, 20 Dec 2019 at 18:06, Lu Sun <vistax...@gmail.com> wrote:
>
> > Dear PDFBox Dev Team,
> >
> > Hope this message finds you well.
> >
> > Just wanted to raise this for your attention. Could you please suggest a
> > solution to the parsing order issue? Attached are my config file, an
> > example PDF file and my parsing results.
> >
> > Thanks so much in advance. Wish you and your team a Merry Christmas and
> > Happy New Year.
> >
> > Regards,
> > Luke
> >
> > On Tue, 17 Dec 2019 at 12:34, Tim Allison <talli...@apache.org> wrote:
> >
> >> PDFBox Colleagues,
> >> Any recommendations?
> >>
> >> On Mon, Dec 16, 2019 at 7:05 AM Lu Sun <vistax...@gmail.com> wrote:
> >>
> >>> Dear Tika Dev Team,
> >>>
> >>>
> >>>
> >>> Hope this email finds you well.
> >>>
> >>>
> >>>
> >>> I have been actively using Tika to read PDF files. One issue I found
> >>> is the parsing order. As shown in the attached image, the text of the
> >>> PDF file is not parsed in order of its position on the page.
> >>>
> >>>
> >>>
> >>> As suggested in this GitHub issue
> >>> <https://github.com/chrismattmann/tika-python/issues/266>, I used a
> >>> customized config file (see attached), hoping to solve the issue, but
> >>> this has not worked. If possible, could you please review this issue
> >>> and provide any insights or solutions?
> >>>
> >>>
> >>>
> >>> Thanks so much in advance.
> >>>
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Luke
> >>>
> >>
>