On 11.08.2020 at 10:15, Aravind Swarana wrote:
Hi,
I tried icecite. It is very buggy, and Apache PDFBox's paragraph identification
actually works better. Are there any other solutions, or does anyone know how
Aspose PDF does it internally?


If Aspose works for you, then you should buy / license it. It's probably cheaper than working out your own algorithm.

No, I don't know how Aspose works.

Tilman




On 2020/08/10 18:32:58, Tilman Hausherr <thaush...@t-online.de> wrote:
Maybe icecite?

https://github.com/ckorzen/icecite

Tilman

On 10.08.2020 at 20:19, Aravind Swarana wrote:
Hi,

I want to extract text as paragraphs using Apache PDFBox. From my reading, I
understand that extracting text from a PDF is not that simple.

I have extracted paragraphs from PDFs using the PDFBox API, but the results
are not that great.

Meanwhile, I have evaluated a paid PDF parsing library, Aspose PDF,
which extracts paragraphs with very few errors.

I'm trying to implement a similar algorithm for Apache PDFBox. Can you
suggest any recent research papers or open source libraries that have
efficient paragraph identification algorithms? I'll need to evaluate and
implement them.

So far I have found:
https://github.com/elacin/PDFExtract (I observed some errors while
evaluating it, and it is not as accurate as Aspose)

https://github.com/BMKEG/lapdftext/wiki/System-Overview (not based on
Apache PDFBox)

I just need some suggestions on whether there are any other algorithms I can
look at and implement.
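
For comparison purposes, here is a minimal sketch of a common baseline
heuristic for paragraph identification: start a new paragraph when the
vertical gap between two consecutive lines is clearly larger than the typical
line spacing, or when a line is indented relative to the previous one. This is
not how Aspose or PDFBox work internally. The ParagraphGrouper class, the Line
type, and the gapFactor and indentTolerance parameters are hypothetical names
chosen for illustration, and the input is assumed to be single-column lines
already sorted top to bottom, with y growing down the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: groups positioned text lines into paragraphs. */
final class ParagraphGrouper {

    /** One text line: its text, left edge x, and baseline y (assumed to grow downward). */
    static final class Line {
        final String text;
        final double x;
        final double y;

        Line(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups lines (sorted top to bottom) into paragraphs. A new paragraph starts
     * when the gap to the previous line exceeds gapFactor times the median gap,
     * or when the line starts more than indentTolerance to the right of the previous line.
     */
    static List<String> group(List<Line> lines, double gapFactor, double indentTolerance) {
        List<String> paragraphs = new ArrayList<>();
        if (lines.isEmpty()) {
            return paragraphs;
        }
        double typicalGap = medianGap(lines);
        StringBuilder current = new StringBuilder(lines.get(0).text);
        for (int i = 1; i < lines.size(); i++) {
            Line prev = lines.get(i - 1);
            Line cur = lines.get(i);
            boolean bigGap = typicalGap > 0 && (cur.y - prev.y) > gapFactor * typicalGap;
            boolean indented = (cur.x - prev.x) > indentTolerance;
            if (bigGap || indented) {
                // Close the current paragraph and start a new one with this line.
                paragraphs.add(current.toString());
                current = new StringBuilder(cur.text);
            } else {
                // Same paragraph: join the lines with a single space.
                current.append(' ').append(cur.text);
            }
        }
        paragraphs.add(current.toString());
        return paragraphs;
    }

    /** Median vertical distance between consecutive lines (upper median for even counts). */
    static double medianGap(List<Line> lines) {
        List<Double> gaps = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            gaps.add(lines.get(i).y - lines.get(i - 1).y);
        }
        if (gaps.isEmpty()) {
            return 0.0;
        }
        Collections.sort(gaps);
        return gaps.get(gaps.size() / 2);
    }

    public static void main(String[] args) {
        // Made-up coordinates: three tightly spaced lines, a larger gap, then two more lines.
        List<Line> lines = Arrays.asList(
                new Line("First line", 72, 100),
                new Line("second line", 72, 112),
                new Line("third line.", 72, 124),
                new Line("Next paragraph", 72, 150),
                new Line("continues here.", 72, 162));
        for (String paragraph : group(lines, 1.5, 5.0)) {
            System.out.println(paragraph);
        }
    }
}
```

With PDFBox, the line coordinates would have to be collected separately, for
example from the text positions reported during text stripping, and both
thresholds would need tuning per document. Multi-column layouts, tables, and
headers/footers are not handled by this sketch.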




Thanks & regards,
Aravind Swarna


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org


