Peter, as an organofluorine chemist, this is precisely what I am looking for: we have PDFs that contain organic structures along with text and tables, and we need to extract this data and transfer it into readable Word documents. I guess that in this case, since .docx is XML-based, that format should be a lot easier to create.

Thanks,
Marc

On Oct 10, 2014, at 10:05 AM, Peter Murray-Rust <[email protected]> wrote:

> On Fri, Oct 10, 2014 at 2:33 PM, Maruan Sahyoun <[email protected]> wrote:
>
>> Hi Marc,
>>
>> text and image extraction is one of the regular use cases. Keeping the
>> formatting is also possible but there is a different concept behind the PDF
>> format and text processing. E.g. what is a paragraph within a text
>> processor might be individually placed characters (glyphs) within a PDF
>> file. You might want to look into PDFStreamEngine and its subclasses to see
>> how to process graphics and text information of a PDF.
>>
>> Another sample is PDF2SVG which uses PDFBox
>> [https://bitbucket.org/petermr/pdf2svg/wiki/Home]
>
> Thanks for the link. See also http://www.contentmine.org
>
> The PDF2SVG project is active and the first part of a pipeline which
> includes:
>
> PDF -> (SVG, PNG) -> (SVG, XHTML, PNG) -> (SVG, XHTML, SVG) (where bitmaps
> have been converted to SVG) -> (Shapes, Text) -> Semantic Documents ->
> Science
>
> We are now able to take (most) PDFs and extract primitives which are
> heuristically combined to create Characters and Paths, which are combined
> to Shapes and Text. This is structured into XHTML, along with
> sub/superscripts and styling (italics). In favourable cases we can extract
> semantic science (currently evolutionary trees from pixel diagrams in PDFs,
> and chemical reactions also from pixels in PDFs).
>
> We have to do a significant amount of OCR because (a) diagrams have
> characters in pixels and (b) scientific publishers use the worst-ever
> non-compliant Fonts in their PDFs. This means we have to guess the
> character / codePoint from the outline glyph or pixel map.
>
> Some of this is good beta, some is raw alpha. We'd be delighted if anyone
> is interested in hacking pixels or glyph outlines in PDFs - it's painful
> but you get a warm glow of having helped the human race. Same goes for
> tables and document structuring...
>
> BR
>
> P
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
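
To make the point above about individually placed glyphs concrete, here is a minimal Python sketch of how positioned characters might be heuristically regrouped into lines and words. It is not PDFBox or PDF2SVG code: the `Glyph` record, the `group_into_lines` function, and the `y_tol` and `gap_factor` thresholds are all hypothetical names and values chosen for illustration, and it assumes the usual PDF convention that y grows upward on the page.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned character taken from a PDF page."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; assumed to grow upward, as in PDF user space
    width: float  # advance width of the glyph


def group_into_lines(glyphs, y_tol=2.0, gap_factor=0.3):
    """Rebuild lines of text from loose glyphs (illustrative heuristic only).

    Glyphs whose baselines lie within y_tol of a line's first glyph share
    that line; a horizontal gap wider than gap_factor times the previous
    glyph's width is treated as a word space.
    """
    lines = []
    # Walk from the top of the page downward (largest y first).
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    texts = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left-to-right reading order
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_factor * prev.width:
                text += " "
            text += cur.char
        texts.append(text)
    return texts


if __name__ == "__main__":
    # Glyphs deliberately given out of reading order.
    sample = [
        Glyph("k", 40, 700, 6), Glyph("a", 10, 688, 6),
        Glyph("C", 10, 700, 6), Glyph("o", 34, 700, 6),
        Glyph("b", 16, 688, 6), Glyph("F", 16, 700, 6),
    ]
    print(group_into_lines(sample))  # ['CF ok', 'ab']
```

Real documents need far more than this (rotated text, sub/superscripts, columns, fonts with unreliable widths), which is where the heuristics described in the thread come in.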

