Re: Paragraph identification in apache pdf box

Peter Murray-Rust Wed, 12 Aug 2020 03:31:37 -0700

Our own experience is that identifying paragraphs and other local
structures depends on the corpus / type of document. We use GROBID
https://github.com/kermitt2/grobid for scholarly papers as it has been
trained on them. It's a very active project. The result is TEI-XML, a
standard in academia.
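
Once the output is TEI-XML, the paragraphs can be read back with any XML
parser. A minimal sketch using only the JDK's DOM parser follows. It
assumes paragraphs are marked up as <p> elements in the TEI namespace
(http://www.tei-c.org/ns/1.0); the class name TeiParagraphs, the method
paragraphs() and the sample document are made up for illustration and are
not part of GROBID.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects the text of every TEI <p> element.
public class TeiParagraphs {
    // Standard TEI namespace; assumed to be the one used by the input.
    static final String TEI_NS = "http://www.tei-c.org/ns/1.0";

    static List<String> paragraphs(String teiXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Needed so that lookups by namespace URI work.
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(teiXml.getBytes(StandardCharsets.UTF_8)));

        // Note: this matches <p> anywhere in the document, header included.
        NodeList nodes = doc.getElementsByTagNameNS(TEI_NS, "p");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // Collapse line breaks and runs of whitespace inside a paragraph.
            String text = nodes.item(i).getTextContent().trim().replaceAll("\\s+", " ");
            result.add(text);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Tiny made-up sample, not real GROBID output.
        String sample = "<TEI xmlns=\"http://www.tei-c.org/ns/1.0\"><text><body><div>"
                + "<p>First paragraph.</p><p>Second\n paragraph.</p>"
                + "</div></body></text></TEI>";
        for (String p : paragraphs(sample)) {
            System.out.println(p);
        }
    }
}
```

To restrict this to the body text, one could walk down from the <body>
element instead of searching the whole document.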


I wasn't clear whether you are using the PDF character stream or bitmaps. We
are working on extracting components from scientific diagrams such as
plots, graphs, etc.: https://github.com/petermr/ami3 . It's still a work in
progress, but we are making good headway.

Tables, lists, and diagrams are harder than flowing text, and scientific
text (with sub and superscripts, styles, unusual symbols) is harder than
(say) newspaper articles.

Happy to hear from people who want to collaborate on this type of activity.


On Wed, Aug 12, 2020 at 10:57 AM Aravind Swarana <aravindswar...@gmail.com>
wrote:

> Ok, I think buying aspose works..I'll go ahead with that..Thank you
>
> On 2020/08/11 19:23:11, Tilman Hausherr <thaush...@t-online.de> wrote:
> > Am 11.08.2020 um 10:15 schrieb Aravind Swarana:
> > > Hi ,
> > > I tried icecite, it is very buggy and Apache pdf box paragraph
> Identification works even better. Any other solutions.. or any one know how
> Aspose PDF does it internally ?
> >
> >
> > If Aspose works for you, then you should buy / license it. It's probably
> > cheaper than to work out your own algorithm.
> >
> > No, I don't know how Aspose works.
> >
> > Tilman
> >
> >
> >
> > >
> > > On 2020/08/10 18:32:58, Tilman Hausherr <thaush...@t-online.de> wrote:
> > >> Maybe icecite?
> > >>
> > >> https://github.com/ckorzen/icecite
> > >>
> > >> Tilman
> > >>
> > >> Am 10.08.2020 um 20:19 schrieb Aravind Swarana:
> > >>> Hi,
> > >>>
> > >>> I wanted to extract text as paragraphs using Apache PDFBox. I came
> to know
> > >>> from my reading that extracting text from PDF is not that simple.
> > >>>
> > >>> I have extracted Paragraphs from pdf using PDFBox API but they are
> not that
> > >>> great.
> > >>>
> > >>> Meanwhile I have evaluated a Paid version of PDF Parsing called
> Aspose PDF
> > >>> which is extracting paragraphs with very minimal error.
> > >>>
> > >>> I'm trying to implement a similar algorithm for Apache PDFBox. Can
> you guys
> > >>> suggest any recent Research paper or open source library which has
> > >>> efficient paragraph Identification algorithms. I'll need to evaluate
> and
> > >>> implement them.
> > >>>
> > >>> So far I found :
> > >>> https://github.com/elacin/PDFExtract (There were some errors
> Observed while
> > >>> evaluating this and not as perfect as Aspose)
> > >>>
> > >>> https://github.com/BMKEG/lapdftext/wiki/System-Overview (Not based
> on
> > >>> apache pdf box)
> > >>>
> > >>> I just need some suggestions whether there are any other algorithms
> I can
> > >>> look at and implement them ?
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Thanks & regards,
> > >>> Aravind Swarna
> > >>>
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> > >> For additional commands, e-mail: users-h...@pdfbox.apache.org
> > >>
> > >>

-- 
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
