Re: Paragraph identification in apache pdf box

Tilman Hausherr Tue, 11 Aug 2020 12:22:30 -0700

Am 11.08.2020 um 10:22 schrieb Aravind Swarana:

Hi Tilman,


Thank you so much for your reply.
I have looked at Icecite; it seems to be buggy. The Apache PDFBox algorithm is far
better. Do you know of anything else?

No.

I doubt icecite is buggy because it was part of a master's thesis and also a conference paper, but I didn't test it myself.

The PDFTextStripper has some tweaks (indentThreshold, dropThreshold, spacingTolerance, averageCharTolerance) which you could try.
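
To make the role of two of these tweaks more concrete, here is a minimal, self-contained sketch of how an indent threshold and a drop threshold could be used to split lines into paragraphs. It is a simplified illustration and not PDFBox's actual implementation: the class ParagraphSketch, the Line type, its fields and the sample coordinates are all hypothetical, and the threshold values 2.0 and 2.5 are only chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not PDFBox code.
final class ParagraphSketch {

    /** Hypothetical stand-in for one extracted line of text. */
    public static final class Line {
        final String text;
        final float x;         // left edge of the line
        final float y;         // vertical position, growing down the page
        final float height;    // line height
        final float charWidth; // average character width on the line

        public Line(String text, float x, float y, float height, float charWidth) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
            this.charWidth = charWidth;
        }
    }

    /**
     * Starts a new paragraph when a line drops further than
     * dropThreshold * (previous line height) below the previous line, or is
     * indented by more than indentThreshold * (average char width) relative to it.
     */
    public static List<String> groupIntoParagraphs(
            List<Line> lines, float indentThreshold, float dropThreshold) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line previous = null;
        for (Line line : lines) {
            boolean startsParagraph = false;
            if (previous != null) {
                float drop = line.y - previous.y;
                float indent = line.x - previous.x;
                startsParagraph = drop > dropThreshold * previous.height
                        || indent > indentThreshold * line.charWidth;
            }
            if (startsParagraph) {
                // Close the paragraph collected so far.
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text);
            previous = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        List<Line> lines = Arrays.asList(
                new Line("First paragraph starts here", 72f, 100f, 12f, 6f),
                new Line("and continues.", 72f, 114f, 12f, 6f),
                new Line("Indented second paragraph", 90f, 128f, 12f, 6f),
                new Line("wraps back to the margin.", 72f, 142f, 12f, 6f),
                new Line("Third paragraph after a gap.", 72f, 190f, 12f, 6f));
        for (String paragraph : groupIntoParagraphs(lines, 2.0f, 2.5f)) {
            System.out.println(paragraph);
            System.out.println();
        }
    }
}
```

In PDFBox itself the corresponding values are set on the stripper before calling getText, through setIndentThreshold, setDropThreshold, setSpacingTolerance and setAverageCharTolerance. Raising or lowering them on a few representative documents is probably the quickest way to see whether the built-in paragraph detection can be made good enough.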


Tilman


Thanks & regards,
Aravind Swarna


On Mon, Aug 10, 2020 at 11:59 PM Aravind Swarana <aravindswar...@gmail.com>
wrote:

Missed Analysis Attachment


Thanks & regards,
Aravind Swarna


On Mon, Aug 10, 2020 at 11:49 PM Aravind Swarana <aravindswar...@gmail.com>
wrote:

Hi,

I want to extract text as paragraphs using Apache PDFBox. From my
reading, I understand that extracting text from PDF is not that simple.

I have extracted paragraphs from PDFs using the PDFBox API, but the
results are not that great.

Meanwhile, I have evaluated a paid PDF parsing product, Aspose
PDF, which extracts paragraphs with very few errors.

I'm trying to implement a similar algorithm for Apache PDFBox. Can you
suggest any recent research paper or open-source library that has
efficient paragraph identification algorithms? I'll need to evaluate and
implement them.

So far I have found:
https://github.com/elacin/PDFExtract (I observed some errors while
evaluating this, and it is not as accurate as Aspose)

https://github.com/BMKEG/lapdftext/wiki/System-Overview (Not based on
Apache PDFBox)

I just need some suggestions on whether there are any other algorithms I can
look at and implement.




Thanks & regards,
Aravind Swarna



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org
