Re: [libreoffice-users] A word of warning about PDF text

2014-01-31 Thread Cley Faye
2014-01-31 Peter West li...@pbw.id.au:

 A word of warning about text retrieved from PDF documents.

 Recovering text blocks from PDFs is inherently risky.  PDF is a page
 definition format, and so it has no notion of the semantics of the text it
 contains. It places bits of text at certain positions on the page. You can
 create a whole page of text by taking the individual characters and their
 attributes and position on the page, shuffling them, and writing them to
 the file.  That will produce a readable file, but try extracting the text
 from that file. Unless you have a very, very smart text extractor that
 reverse-engineers the process of creating the page, then calculates the
 _visual_ order of the text elements, you will end up with gibberish.

 _Most_ pdf text, _most_ of the time, is laid on the page in visual order,
 but in even the best-behaved files, you are likely to be surprised.

 If you don't _know_ that your PDF text extractor program is completely
 visually accurate by design, don't tell your boss that you can easily
 extract that PDF text, without allowing time for proof-reading every page.
 You will get burned.

 I don't know how LO extracts PDF text; perhaps it is very sophisticated. I
 have my doubts.


You are right that a PDF is not meant to be opened for modification or
text recovery. However, that is hardly relevant here, as LO is not (as far
as I know...) marketed as a PDF extractor.

While it is possible to open a PDF with Draw, even the simplest file will
show you that it is not meant for full and easy recovery: embedded fonts
are not used, some graphics are off by a few pixels (sometimes more), and
yes, text gets split into an unexpected number of parts, even when the PDF
content is layered correctly in the final file.
For example, you can get a single line of text split into three text
elements, or a single text element with (seemingly) random spaces inserted
in the middle of words. General page layout is also an issue: a very simple
PDF, containing only a single page of text, shows up as two pages in Draw,
with the footer of the first page at the beginning of the second one.
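
To illustrate the split-line case, here is a minimal Python sketch of how
fragments on the same baseline could be glued back together. It is not how
Draw or any real importer works; the (x, y, width, text) fragment model, the
function name and both thresholds are assumptions made up for this example.

```python
# A fragment is (x, y, width, text). Names and thresholds are hypothetical.
def merge_fragments(fragments, y_tolerance=1.0, max_gap=2.0):
    """Join fragments that share a baseline and nearly touch horizontally."""
    merged = []
    # Sort top to bottom, then left to right (y grows downward here).
    for x, y, width, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if merged:
            px, py, pwidth, ptext = merged[-1]
            same_line = abs(y - py) <= y_tolerance
            gap = x - (px + pwidth)
            if same_line and gap <= max_gap:
                # Extend the previous fragment instead of starting a new one.
                merged[-1] = (px, py, x + width - px, ptext + text)
                continue
        merged.append((x, y, width, text))
    return merged

fragments = [
    (10.0, 100.0, 30.0, "Openi"),
    (40.5, 100.0, 18.0, "ng "),
    (58.5, 100.0, 24.0, "a PDF"),
    (10.0, 112.0, 40.0, "in Draw"),
]
for fragment in merge_fragments(fragments):
    print(fragment)
```

The three pieces of the first line come back as one element, and the second
line stays separate. This does nothing about stray spaces inside words, and
baselines that differ slightly can still be sorted into the wrong order, so
it is only a starting point.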

But I do not think any of this is relevant as long as users know that
opening a PDF is at most useful for recovering some select elements. Unless
the documentation states otherwise, it is fine, as it works very well for
this specific usage. Opening a PDF in Draw does just this: it shows the
various elements present in the PDF.

-- 
To unsubscribe e-mail to: users+unsubscr...@global.libreoffice.org
Problems? http://www.libreoffice.org/get-help/mailing-lists/how-to-unsubscribe/
Posting guidelines + more: http://wiki.documentfoundation.org/Netiquette
List archive: http://listarchives.libreoffice.org/global/users/
All messages sent to this list will be publicly archived and cannot be deleted


Re: [libreoffice-users] A word of warning about PDF text

2014-01-31 Thread Dominique Michel
On Fri, 31 Jan 2014 13:22:41 +1000,
Peter West li...@pbw.id.au wrote:

 A word of warning about text retrieved from PDF documents.
 
 Recovering text blocks from PDFs is inherently risky.  PDF is a page 
 definition format, and so it has no notion of the semantics of the
 text it contains. It places bits of text at certain positions on the
 page. You can create a whole page of text by taking the individual
 characters and their attributes and position on the page, shuffling
 them, and writing them to the file.  That will produce a readable
 file, but try extracting the text from that file. Unless you have a
 very, very smart text extractor that reverse-engineers the process of
 creating the page, then calculates the _visual_ order of the text
 elements, you will end up with gibberish.
 
 _Most_ pdf text, _most_ of the time, is laid on the page in visual 
 order, but in even the best-behaved files, you are likely to be
 surprised.
 
 If you don't _know_ that your PDF text extractor program is
 completely visually accurate by design, don't tell your boss that
 you can easily extract that PDF text, without allowing time for
 proof-reading every page. You will get burned.

That is why I open the PDF file in a separate program and use the
mouse to select the text, then copy/paste it (Ctrl-C/Ctrl-V). That way, I
have full control over how the text will appear when I select it.

I also use other programs like pdfimages, pdftoppm and convert to
extract the images directly from the PDF. The images can be rotated or
mirrored, which is why convert is useful too. When they are split into
small pieces, pdftoppm gives me an exact copy of each page of the PDF,
one ppm file per page, which is then converted to JPEG. In that case, GIMP
is useful to extract only the images from these files and cut out the text.
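
As a rough sketch of that workflow (this is not the pdf2jpg script itself),
the Python below builds the three command lines and chains pdftoppm and
convert. The function names and the "page" prefix are made up for this
example, and the flags are the ones I know from poppler and ImageMagick;
check them against your installed versions.

```python
import subprocess
from pathlib import Path

def pdfimages_cmd(pdf, out_prefix):
    # -j: write JPEG (DCT) encoded images out as .jpg files
    return ["pdfimages", "-j", str(pdf), str(out_prefix)]

def pdftoppm_cmd(pdf, out_prefix, dpi=150):
    # -r: rendering resolution in DPI; one ppm file is written per page
    return ["pdftoppm", "-r", str(dpi), str(pdf), str(out_prefix)]

def convert_cmd(src, dst, rotate=0, mirror=False):
    # ImageMagick: -rotate turns the image, -flop mirrors it left to right
    cmd = ["convert", str(src)]
    if rotate:
        cmd += ["-rotate", str(rotate)]
    if mirror:
        cmd.append("-flop")
    return cmd + [str(dst)]

def render_pages_as_jpeg(pdf, workdir):
    """Render every page to ppm, then convert each ppm to a JPEG."""
    workdir = Path(workdir)
    subprocess.run(pdftoppm_cmd(pdf, workdir / "page"), check=True)
    for ppm in sorted(workdir.glob("page-*.ppm")):
        subprocess.run(convert_cmd(ppm, ppm.with_suffix(".jpg")), check=True)

# Only prints the command; nothing is executed here.
print(convert_cmd("page-1.ppm", "page-1.jpg", rotate=90, mirror=True))
```

Calling render_pages_as_jpeg("file.pdf", ".") would run the tools for real,
so try it on a small PDF first.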

The script I use for the images is attached. To use it, place it
somewhere in your path, check that it is executable, go into the
directory where your PDF file is, and run 'pdf2jpg'. Run that way, it will
only print a help message. Be aware that it will process all the PDF files
in that directory in one go. Also be aware that, when the final output is
JPEG files, ppm files are automatically used as intermediates when needed;
the conversion will then be much slower, and the ppm files can use a lot of
space on the disk.

So, if you want to extract pictures from a 100MB PDF file, allow at
least 2GB of temporary disk usage to be safe in all cases. (This is an
estimate from memory, so run your own tests if you don't have a lot of
free disk space.)
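
To see where a number of that size can come from: a binary RGB ppm file is
uncompressed, about 3 bytes per pixel. The sketch below is only
back-of-the-envelope arithmetic; the page count, the US Letter page size and
the 150 DPI resolution (which I believe is the pdftoppm default) are
assumptions for the example, not measurements.

```python
def ppm_page_bytes(width_in, height_in, dpi):
    """Pixel data of one binary RGB ppm page: 3 bytes per pixel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3

# Assumed example: 300 US Letter pages (8.5 x 11 in) rendered at 150 DPI.
per_page = ppm_page_bytes(8.5, 11, 150)
total = 300 * per_page
print(per_page)     # 6311250 bytes, about 6.3 MB per page
print(total / 1e9)  # 1.893375, so about 1.9 GB if all pages are kept at once
```

Doubling the resolution multiplies these figures by four, so it is worth
checking the free space before rendering a large file.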

Also, with some distributions, you may have to adjust the names of the
pdfimages and pdftoppm commands in the script. They are part of poppler
on Gentoo (poppler-utils or something like that on Debian); in the past,
they were part of xpdf.
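
A quick way to check which command names exist on your system before
editing the script is something like the Python sketch below. The helper
name and the candidate names are assumptions for the example ("magick" is
the ImageMagick 7 front end, as far as I know); adjust them for your
distribution.

```python
import shutil

def find_tool(*candidates):
    """Return the first candidate command name found on PATH, or None."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None

# The candidate lists are assumptions; adjust them for your distribution.
for tool, names in {
    "pdfimages": ("pdfimages",),
    "pdftoppm": ("pdftoppm",),
    "convert": ("magick", "convert"),
}.items():
    print(tool, "->", find_tool(*names) or "not found")
```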

Dominique

 
 




Re: [libreoffice-users] A word of warning about PDF text

2014-01-31 Thread Dominique Michel
On Sat, 1 Feb 2014 01:18:22 +0100,
Dominique Michel dominique.mic...@vtxnet.ch wrote:

 The script I use for the images is attached. To use it, place it
 somewhere in your path, check that it is executable, go into the
 directory where your PDF file is, and run 'pdf2jpg'. Run that way, it
 will only print a help message. Be aware that it will process all the
 PDF files in that directory in one go. Also be aware that, when the
 final output is JPEG files, ppm files are automatically used as
 intermediates when needed; the conversion will then be much slower, and
 the ppm files can use a lot of space on the disk.
 

The script didn't make it. Here it is:
http://fvwm-crystal.sourceforge.net/other/pdf2jpg

Dominique

 
  
  
 




[libreoffice-users] A word of warning about PDF text

2014-01-30 Thread Peter West

A word of warning about text retrieved from PDF documents.

Recovering text blocks from PDFs is inherently risky.  PDF is a page 
definition format, and so it has no notion of the semantics of the text 
it contains. It places bits of text at certain positions on the page. 
You can create a whole page of text by taking the individual characters 
and their attributes and position on the page, shuffling them, and 
writing them to the file.  That will produce a readable file, but try 
extracting the text from that file. Unless you have a very, very smart 
text extractor that reverse-engineers the process of creating the page, 
then calculates the _visual_ order of the text elements, you will end up 
with gibberish.
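
The shuffling argument can be tried out with a toy model. The Python sketch
below is a simplification under stated assumptions: a "glyph" is just
(x, y, char) on a fixed grid with y growing downward, and all function names
are made up for this example. Real PDF content streams, fonts and
coordinate systems are far more involved.

```python
import random

# Each glyph is (x, y, char): a position on the page plus the character
# drawn there. This is a hypothetical model, not a PDF parser.
def layout(lines, char_width=6, line_height=12):
    glyphs = []
    for row, line in enumerate(lines):
        for col, ch in enumerate(line):
            glyphs.append((col * char_width, row * line_height, ch))
    return glyphs

def extract_stream_order(glyphs):
    """Naive extraction: join characters in the order they were written."""
    return "".join(ch for _, _, ch in glyphs)

def extract_visual_order(glyphs):
    """Group by line (y), sort each line by x, then join top to bottom."""
    rows = {}
    for x, y, ch in glyphs:
        rows.setdefault(y, []).append((x, ch))
    return "\n".join(
        "".join(ch for _, ch in sorted(rows[y])) for y in sorted(rows)
    )

page = ["PDF places text", "at positions."]
glyphs = layout(page)
random.Random(0).shuffle(glyphs)  # drawing order changes, positions do not

print(extract_stream_order(glyphs))  # the same characters, scrambled
print(extract_visual_order(glyphs))  # the two original lines
```

The second extractor only works here because the toy page sits on a perfect
grid. With proportional fonts, columns, rotated text or missing space
characters, working out the visual order becomes much harder.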


_Most_ pdf text, _most_ of the time, is laid on the page in visual 
order, but in even the best-behaved files, you are likely to be surprised.


If you don't _know_ that your PDF text extractor program is completely 
visually accurate by design, don't tell your boss that you can easily 
extract that PDF text, without allowing time for proof-reading every 
page. You will get burned.


I don't know how LO extracts PDF text; perhaps it is very sophisticated. 
I have my doubts.


--
Peter West
Other seed fell among thorns, and the thorns grew up and choked it...
