Re: Paragraph identification in apache pdf box

2020-08-12 Thread Peter Murray-Rust
In our experience, identifying paragraphs and other local structures depends on the corpus and type of document. We use GROBID (https://github.com/kermitt2/grobid) for scholarly papers, as it has been trained on them. It's a very active project. The result is TEI-XML, a standard in acade

Re: Paragraph identification in apache pdf box

2020-08-12 Thread Aravind Swarana
OK, I think buying Aspose works. I'll go ahead with that. Thank you.

On 2020/08/11 19:23:11, Tilman Hausherr wrote:
> On 11.08.2020 at 10:15, Aravind Swarana wrote:
> > Hi,
> > I tried icecite; it is very buggy, and Apache PDFBox paragraph
> > identification works even better. Any other soluti