Hi Tilman,

Thank you so much for your reply. I have looked at Icecite, but it seemed buggy; the Apache PDFBox algorithm is far better. Do you know of anything else?

Thanks & regards,
Aravind Swarna

On Mon, Aug 10, 2020 at 11:59 PM Aravind Swarana <aravindswar...@gmail.com> wrote:

> Missed Analysis Attachment
>
> Thanks & regards,
> Aravind Swarna
>
> On Mon, Aug 10, 2020 at 11:49 PM Aravind Swarana <aravindswar...@gmail.com> wrote:
>
>> Hi,
>>
>> I want to extract text as paragraphs using Apache PDFBox. From my
>> reading, I understand that extracting text from a PDF is not that simple.
>>
>> I have extracted paragraphs from a PDF using the PDFBox API, but the
>> results are not that great.
>>
>> Meanwhile, I have evaluated a paid PDF parsing product called Aspose
>> PDF, which extracts paragraphs with very few errors.
>>
>> I'm trying to implement a similar algorithm for Apache PDFBox. Can you
>> suggest any recent research papers or open source libraries that have
>> efficient paragraph identification algorithms? I'll need to evaluate and
>> implement them.
>>
>> So far I have found:
>>
>> https://github.com/elacin/PDFExtract (I observed some errors while
>> evaluating this, and it is not as accurate as Aspose)
>>
>> https://github.com/BMKEG/lapdftext/wiki/System-Overview (not based on
>> Apache PDFBox)
>>
>> I just need some suggestions on whether there are any other algorithms
>> I can look at and implement.
>>
>> Thanks & regards,
>> Aravind Swarna