Re: problem with pdf eof

Marc Davis Fri, 10 Oct 2014 07:54:02 -0700

Peter, speaking as an organofluorine chemist, this is precisely what we are 
seeking: the ability to extract content from PDFs that contain organic 
structures along with text and tables. We need to extract this data and 
transfer it into readable Word docs.  I guess in this case, since it’s XML, 
the .docx format is a lot easier to create.
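
To make the "it's XML" point concrete: a .docx file is a zip archive of XML parts, so a bare-bones one can be written with nothing but the Python standard library. The sketch below is only an illustration under assumptions, not anything from PDFBox or PDF2SVG; the function names write_docx and build_document_xml are made up for this example, and it handles plain paragraphs only (no tables, images or chemical structures).

```python
import zipfile
from xml.sax.saxutils import escape

# Part 1: tells the reader which content type each part of the package has.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Part 2: package-level relationship pointing at the main document part.
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_document_xml(paragraphs):
    """Hypothetical helper: wrap each string in a w:p / w:r / w:t element."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(p)
        for p in paragraphs
    )
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    )


def write_docx(path, paragraphs):
    """Hypothetical helper: write the three parts into a zip archive."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("[Content_Types].xml", CONTENT_TYPES)
        z.writestr("_rels/.rels", RELS)
        z.writestr("word/document.xml", build_document_xml(paragraphs))
```

Calling write_docx("out.docx", ["First paragraph", "Second paragraph"]) produces a three-part package; anything beyond plain paragraphs would need further parts and markup.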


Thanks,
Marc



On Oct 10, 2014, at 10:05 AM, Peter Murray-Rust <[email protected]> wrote:

> On Fri, Oct 10, 2014 at 2:33 PM, Maruan Sahyoun <[email protected]>
> wrote:
> 
>> Hi Marc,
>> 
>> text and image extraction is one of the regular use cases. Keeping the
>> formatting is also possible but there is a different concept behind the PDF
>> format and text processing. E.g. what is a paragraph within a text
>> processor might be individually placed characters (glyphs) within a PDF
>> file. You might want to look into PDFStreamEngine and its subclasses to
>> see how to process graphics and text information of a PDF.
>> 
>> Another sample is PDF2SVG which uses PDFBox [
>> https://bitbucket.org/petermr/pdf2svg/wiki/Home]
>> 
> 
> Thanks for the link. See also http://www.contentmine.org
> 
> The PDF2SVG project is active and the first part of a pipeline which
> includes:
> 
> PDF -> (SVG, PNG) -> (SVG, XHTML, PNG) -> (SVG, XHTML, SVG) (where bitmaps
> have been converted to SVG) -> (Shapes, Text) -> Semantic Documents ->
> Science
> 
> We are now able to take (most) PDFs and extract primitives which are
> heuristically combined to create Characters and Paths, which are combined
> to Shapes and Text. This is structured into XHTML, along with
> sub/superscripts and styling (italics). In favourable cases we can extract
> semantic science (currently evolutionary trees from pixel diagrams in PDFs,
> and chemical reactions also from pixels in PDFs).
> 
> 
> We have to do a significant amount of OCR because (a) diagrams have
> characters in pixels and (b) scientific publishers use the worst-ever
> non-compliant Fonts in their PDFs. This means we have to guess the
> character / codePoint from the outline glyph or pixel map.
> 
> Some of this is good beta, some is raw alpha. We'd be delighted if anyone
> is interested in hacking pixels or glyph outlines in PDFs - it's painful
> but you get a warm glow of having helped the human race. Same goes for
> tables and document structuring...
> 
> BR
> 
> P
> 
> 
> 
> 
> -- 
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
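
To illustrate the point made in the thread that a paragraph in a word processor may be individually placed glyphs in a PDF, here is a rough sketch of one way such glyphs could be combined back into text lines. It is not the PDF2SVG or PDFBox algorithm; the Glyph class, the glyphs_to_lines function and the two thresholds (y_tol, gap_factor) are assumptions invented for this example, and it assumes PDF-style coordinates where y increases up the page.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline position (larger y = higher on the page)
    width: float  # advance width of the glyph


def glyphs_to_lines(glyphs, y_tol=2.0, gap_factor=0.3):
    """Group glyphs into text lines, top to bottom, left to right.

    y_tol:      glyphs whose baselines differ by at most this much
                (from the first glyph of the line) share a line.
    gap_factor: a horizontal gap wider than gap_factor * previous glyph
                width is treated as a word space.
    Both values are illustrative guesses, not tuned constants.
    """
    lines = []  # each entry is [baseline_y, [glyphs on that line]]
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0] - g.y) <= y_tol:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])

    result = []
    for _, members in lines:
        members.sort(key=lambda g: g.x)
        text = members[0].char
        for prev, cur in zip(members, members[1:]):
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_factor * prev.width:
                text += " "  # gap is wide enough to count as a space
            text += cur.char
        result.append(text)
    return result
```

Real documents need much more than this (sub/superscripts, columns, rotated text), which is where the heuristics described above come in.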
