Maybe icecite? https://github.com/ckorzen/icecite
Tilman

On 10.08.2020 at 20:19, Aravind Swarana wrote:
Hi,

I want to extract text as paragraphs using Apache PDFBox. From my reading, I understand that extracting text from a PDF is not simple. I have extracted paragraphs from a PDF using the PDFBox API, but the results are not great.

Meanwhile, I have evaluated a paid PDF parsing library, Aspose PDF, which extracts paragraphs with very few errors. I'm trying to implement a similar algorithm for Apache PDFBox.

Can you suggest any recent research papers or open source libraries with efficient paragraph identification algorithms? I'll need to evaluate and implement them.

So far I have found:

https://github.com/elacin/PDFExtract (I observed some errors while evaluating it, and it is not as accurate as Aspose)
https://github.com/BMKEG/lapdftext/wiki/System-Overview (not based on Apache PDFBox)

Are there any other algorithms I could look at and implement?

Thanks & regards,
Aravind Swarna
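
For illustration only, here is a minimal sketch of the simplest kind of paragraph identification: grouping lines by vertical gap. It is not the algorithm used by PDFBox, Aspose, or either project linked above. The Line class, the group method, and the gapFactor threshold are all hypothetical names, and the sketch assumes lines are already extracted, sorted top to bottom, and carry a baseline position and font height.

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphGrouper {

    /** Hypothetical extracted text line (not a PDFBox type). */
    public static final class Line {
        final String text;
        final double y;      // baseline position, increasing down the page
        final double height; // font height, same units as y

        public Line(String text, double y, double height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Groups lines (assumed sorted top to bottom) into paragraphs.
     * A new paragraph starts when the baseline distance to the previous
     * line exceeds gapFactor times the previous line's font height.
     */
    public static List<String> group(List<Line> lines, double gapFactor) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line prev = null;
        for (Line line : lines) {
            if (prev != null && (line.y - prev.y) > gapFactor * prev.height) {
                // Gap is larger than normal line spacing: close the paragraph.
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text);
            prev = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

A heuristic this simple ignores indentation, multi-column layouts, and font changes, which is one reason results from basic approaches tend to fall short of a dedicated layout-analysis library.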