Re: Paragraph identification in apache pdf box

Aravind Swarana Tue, 11 Aug 2020 01:37:22 -0700

Hi,
I tried icecite, but it is very buggy; Apache PDFBox's paragraph identification
actually works better. Are there any other solutions? Or does anyone know how
Aspose PDF does it internally?


On 2020/08/10 18:32:58, Tilman Hausherr <thaush...@t-online.de> wrote: 
> Maybe icecite?
> 
> https://github.com/ckorzen/icecite
> 
> Tilman
> 
> On 10.08.2020 at 20:19, Aravind Swarana wrote:
> > Hi,
> >
> > I want to extract text as paragraphs using Apache PDFBox. From my reading, I
> > understand that extracting text from a PDF is not that simple.
> >
> > I have extracted paragraphs from PDFs using the PDFBox API, but the results
> > are not that great.
> >
> > Meanwhile, I have evaluated a paid PDF parsing library, Aspose PDF, which
> > extracts paragraphs with very few errors.
> >
> > I'm trying to implement a similar algorithm for Apache PDFBox. Can you
> > suggest any recent research papers or open source libraries with
> > efficient paragraph identification algorithms? I'll need to evaluate and
> > implement them.
> >
> > So far I have found:
> > https://github.com/elacin/PDFExtract (I observed some errors while
> > evaluating it, and it is not as accurate as Aspose)
> >
> > https://github.com/BMKEG/lapdftext/wiki/System-Overview (not based on
> > Apache PDFBox)
> >
> > I would just like some suggestions on whether there are any other algorithms
> > I can look at and implement.
> >
> >
> >
> >
> > Thanks & regards,
> > Aravind Swarna
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org
> 
> 

