Hi, I tried icecite. It is very buggy, and Apache PDFBox's own paragraph identification works better. Are there any other solutions? Or does anyone know how Aspose PDF does it internally?
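
For anyone following the thread, here is a minimal sketch of the kind of paragraph-identification heuristic being discussed: start a new paragraph when the vertical gap between consecutive lines is large, or when a line is indented relative to the previous one. This is not how Aspose or icecite work internally (that is unknown here). The class `ParagraphGrouper`, its `Line` type, and the two threshold parameters are hypothetical names made up for illustration. In a real setup the coordinates would presumably come from PDFBox `TextPosition` objects; here they are plain floats so the sketch has no dependencies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: groups positioned text lines into paragraphs. */
final class ParagraphGrouper {

    /** One text line: its text, left edge x, baseline y (growing downward), and line height. */
    static final class Line {
        final String text;
        final float x;
        final float y;
        final float height;

        Line(String text, float x, float y, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Walks the lines in reading order and starts a new paragraph when either
     * (a) the vertical distance to the previous line exceeds gapFactor * previous line height, or
     * (b) the line starts more than indentThreshold units to the right of the previous line.
     * Both thresholds are assumptions that need tuning per document.
     */
    static List<String> group(List<Line> lines, float gapFactor, float indentThreshold) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Line prev = null;
        for (Line line : lines) {
            boolean startsNew = false;
            if (prev != null) {
                boolean largeGap = (line.y - prev.y) > gapFactor * prev.height;
                boolean indented = (line.x - prev.x) > indentThreshold;
                startsNew = largeGap || indented;
            }
            if (startsNew && current.length() > 0) {
                // Close the paragraph collected so far.
                paragraphs.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line.text);
            prev = line;
        }
        if (current.length() > 0) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }
}
```

Calling `ParagraphGrouper.group(lines, 1.5f, 10f)` on lines sorted top to bottom returns one string per detected paragraph. A heuristic this simple will fail on multi-column layouts, hanging indents, and justified text with varying line spacing, which is likely where the dedicated tools differ.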

On 2020/08/10 18:32:58, Tilman Hausherr <thaush...@t-online.de> wrote:
> Maybe icecite?
>
> https://github.com/ckorzen/icecite
>
> Tilman
>
> On 10.08.2020 at 20:19, Aravind Swarana wrote:
> > Hi,
> >
> > I wanted to extract text as paragraphs using Apache PDFBox. I came to know
> > from my reading that extracting text from PDF is not that simple.
> >
> > I have extracted paragraphs from PDF using the PDFBox API, but they are not
> > that great.
> >
> > Meanwhile I have evaluated a paid PDF parsing product called Aspose PDF,
> > which extracts paragraphs with very minimal error.
> >
> > I'm trying to implement a similar algorithm for Apache PDFBox. Can you guys
> > suggest any recent research paper or open source library which has
> > efficient paragraph identification algorithms? I'll need to evaluate and
> > implement them.
> >
> > So far I found:
> > https://github.com/elacin/PDFExtract (some errors were observed while
> > evaluating this, and it is not as accurate as Aspose)
> >
> > https://github.com/BMKEG/lapdftext/wiki/System-Overview (not based on
> > Apache PDFBox)
> >
> > I just need some suggestions on whether there are any other algorithms I
> > can look at and implement.
> >
> > Thanks & regards,
> > Aravind Swarna
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org