Paragraph identification in apache pdf box

Aravind Swarana Mon, 10 Aug 2020 11:31:10 -0700

Hi,

I want to extract text as paragraphs using Apache PDFBox. From my reading,
I understand that extracting text from a PDF is not that simple.


I have extracted paragraphs from PDFs using the PDFBox API, but the results
are not very good.
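
For illustration, the kind of simple heuristic that tends to give these
mediocre results can be sketched as below. This is only a minimal sketch
under stated assumptions: it does not call PDFBox at all, the Line class
and the two thresholds (gapFactor, indentTolerance) are hypothetical names
made up for this example, and it assumes the lines have already been
extracted in reading order with their positions.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphGrouper {

    /** One extracted text line. Hypothetical type, not a PDFBox class. */
    public static final class Line {
        final String text;
        final float x;      // left edge of the line
        final float y;      // vertical position, increasing down the page
        final float height; // font height of the line

        public Line(String text, float x, float y, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Groups lines into paragraphs. A new paragraph starts when the vertical
     * gap to the previous line exceeds gapFactor * previous line height, or
     * when the line is indented by more than indentTolerance relative to the
     * previous line. Both thresholds are assumptions to be tuned per document.
     */
    public static List<String> group(List<Line> lines, float gapFactor, float indentTolerance) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line prev = null;
        for (Line line : lines) {
            if (prev != null) {
                boolean largeGap = (line.y - prev.y) > gapFactor * prev.height;
                boolean indented = (line.x - prev.x) > indentTolerance;
                if ((largeGap || indented) && current.length() > 0) {
                    // Close the paragraph collected so far.
                    paragraphs.add(current.toString());
                    current.setLength(0);
                }
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text.trim());
            prev = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

A heuristic like this breaks down on multi-column layouts, hanging
indents, and headings set close to the body text, which is roughly where
the extraction quality falls short.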

Meanwhile, I have evaluated a paid PDF parsing library, Aspose PDF, which
extracts paragraphs with very few errors.

I'm trying to implement a similar algorithm on top of Apache PDFBox. Can you
suggest any recent research papers or open source libraries that have
efficient paragraph identification algorithms? I will need to evaluate and
implement them.

So far I have found:
https://github.com/elacin/PDFExtract (I observed some errors while
evaluating it, and it is not as accurate as Aspose)

https://github.com/BMKEG/lapdftext/wiki/System-Overview (not based on
Apache PDFBox)

I would just like suggestions on whether there are any other algorithms I
could look at and implement.




Thanks & regards,
Aravind Swarna
