Re: Paragraph identification in apache pdf box

Tilman Hausherr Mon, 10 Aug 2020 11:33:22 -0700

Maybe icecite?

https://github.com/ckorzen/icecite


Tilman

On 10.08.2020 at 20:19, Aravind Swarana wrote:

Hi,

I want to extract text as paragraphs using Apache PDFBox. From my reading,
I understand that extracting text from a PDF is not that simple.

I have extracted paragraphs from PDFs using the PDFBox API, but the results
are not that great.

Meanwhile, I have evaluated a paid PDF parsing library, Aspose PDF, which
extracts paragraphs with very few errors.

I'm trying to implement a similar algorithm for Apache PDFBox. Can you
suggest any recent research papers or open source libraries that have
effective paragraph identification algorithms? I'll need to evaluate and
implement them.

So far I have found:
https://github.com/elacin/PDFExtract (I observed some errors while
evaluating it, and it is not as accurate as Aspose)

https://github.com/BMKEG/lapdftext/wiki/System-Overview (not based on
Apache PDFBox)

I would just like suggestions on whether there are any other algorithms I
can look at and implement.




Thanks & regards,
Aravind Swarna



