Re: [libreoffice-users] A word of warning about PDF text

Dominique Michel Fri, 31 Jan 2014 18:56:31 -0800

On Sat, 1 Feb 2014 01:18:22 +0100,
Dominique Michel <[email protected]> wrote:


> On Fri, 31 Jan 2014 13:22:41 +1000,
> Peter West <[email protected]> wrote:
> 
> > A word of warning about text retrieved from PDF documents.
> > 
> > Recovering text blocks from PDFs is inherently risky.  PDF is a
> > page description format, and so it has no notion of the semantics of
> > the text it contains. It places bits of text at certain positions
> > on the page. You can create a whole page of text by taking the
> > individual characters and their attributes and position on the
> > page, shuffling them, and writing them to the file.  That will
> > produce a readable file, but try extracting the text from that
> > file. Unless you have a very, very smart text extractor that
> > reverse-engineers the process of creating the page, then calculates
> > the _visual_ order of the text elements, you will end up with
> > gibberish.
> > 
> > _Most_ pdf text, _most_ of the time, is laid on the page in visual 
> > order, but in even the best-behaved files, you are likely to be
> > surprised.
> > 
> > If you don't _know_ that your PDF text extractor program is
> > completely visually accurate by design, don't tell your boss that
> > you can easily extract that PDF text, without allowing time for
> > proof-reading every page. You will get burned.
> 
> That is why I open the PDF file in a separate program, select the
> text with the mouse, and copy/paste it (Ctrl-C/Ctrl-V). That way, I
> have full control over how the text appears when I select it.
> 
> I also use other programs such as pdfimages, pdftoppm and convert to
> extract the images directly from the PDF. The extracted images can
> come out rotated or mirrored, which is why convert is useful too.
> When the images are split into small pieces, pdftoppm gives me an
> exact copy of each page of the PDF, one ppm file per page, which is
> then converted to jpeg. In that case, gimp is useful to extract only
> the images from these files and crop out the text.
> 
> The script I use for the images is attached. To use it, place it
> somewhere in your path, make sure it is executable, go into the
> directory where your PDF file is, and run 'pdf2jpg'. Run that way,
> it will only print a help message. Be aware that it will process
> every PDF file in that directory. Also be aware that, when the final
> output is jpeg files, ppm files are automatically used as
> intermediates when needed; the conversion is then much slower and
> the ppm files can use a lot of disk space.
> 
> So, if you want to extract pictures from a 100MB PDF file, allow for
> at least 2GB of temporary disk usage to be safe in all cases. (This
> is an estimate from memory, so run your own tests if you don't have
> a lot of free disk space.)
> 
> Also, with some distributions, you may have to adjust the names of
> the pdfimages and pdftoppm commands in the script. They are part of
> poppler on Gentoo (poppler-utils or something like that on Debian);
> in the past, they were part of xpdf.
> 
> Dominique

The script didn't make it. Here it is:
http://fvwm-crystal.sourceforge.net/other/pdf2jpg
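
For anyone who cannot reach the link, the workflow described in the
quoted message can be sketched roughly as below. This is not the
pdf2jpg script itself, only an illustration in Python. The function
names, the 150 dpi default and the output layout are assumptions; the
only external pieces relied on are the pdfimages, pdftoppm and convert
commands with their -j, -r, -rotate and -flop options.

```python
import subprocess
from pathlib import Path


def extract_command(pdf, out_dir, whole_pages=False, dpi=150):
    """Build (but do not run) the extraction command for one PDF."""
    pdf = Path(pdf)
    prefix = str(Path(out_dir) / pdf.stem)
    if whole_pages:
        # pdftoppm renders every page to a PPM file at the given resolution.
        return ["pdftoppm", "-r", str(dpi), str(pdf), prefix]
    # pdfimages pulls out the embedded images; -j keeps JPEG data as .jpg.
    return ["pdfimages", "-j", str(pdf), prefix]


def convert_command(image, rotate=0, mirror=False):
    """Build an ImageMagick command turning one extracted image into a JPEG."""
    image = Path(image)
    cmd = ["convert", str(image)]
    if rotate:
        cmd += ["-rotate", str(rotate)]
    if mirror:
        cmd.append("-flop")  # mirror left-to-right
    cmd.append(str(image.with_suffix(".jpg")))
    return cmd


def pdf_to_jpg(pdf, out_dir, whole_pages=False):
    """Run the extraction, then convert the PPM/PBM intermediates to JPEG."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_command(pdf, out_dir, whole_pages), check=True)
    pattern = Path(pdf).stem + "-*.p[bp]m"
    for image in sorted(Path(out_dir).glob(pattern)):
        subprocess.run(convert_command(image), check=True)
        image.unlink()  # intermediates can take a lot of disk space
```

Calling pdf_to_jpg("report.pdf", "out") would extract the embedded
images, and whole_pages=True would render full pages instead, for the
case where the images are split into small pieces.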

Dominique

> 
> > 
> > I don't know how LO extracts PDF text; perhaps it is very
> > sophisticated. I have my doubts.
> > 
> 
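
To make the original warning concrete, here is a small Python sketch
of the problem. It does not read a real PDF and says nothing about how
LO does it; the fragment list, the coordinates and the line tolerance
are made up for illustration. It only shows that text stored in
arbitrary order reads as gibberish until it is re-sorted by position,
and even then only under the assumption of a single column of
horizontal, left-to-right text.

```python
import random


def visual_order(fragments, line_tolerance=2.0):
    """Sort (x, y, text) fragments top-to-bottom, then left-to-right.

    Assumes PDF-style coordinates (y grows upward), a single column,
    and horizontal left-to-right text.
    """
    lines = []  # each entry: [baseline_y, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return "\n".join(
        " ".join(text for _, text in sorted(parts)) for _, parts in lines
    )


fragments = [
    (72, 700, "PDF"), (100, 700, "has"), (125, 700, "no"),
    (72, 686, "notion"), (115, 686, "of"), (132, 686, "semantics"),
]
random.Random(0).shuffle(fragments)  # the order in the file is now arbitrary
naive = " ".join(text for _, _, text in fragments)  # file order: scrambled
recovered = visual_order(fragments)                 # position order: readable
```

Multiple columns, rotated text or tables would defeat this simple
sort, which is why proof-reading the extracted text is still needed.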

-- 
To unsubscribe e-mail to: [email protected]
Problems? http://www.libreoffice.org/get-help/mailing-lists/how-to-unsubscribe/
Posting guidelines + more: http://wiki.documentfoundation.org/Netiquette
List archive: http://listarchives.libreoffice.org/global/users/
All messages sent to this list will be publicly archived and cannot be deleted
