On Sat, 1 Feb 2014 01:18:22 +0100, Dominique Michel <[email protected]> wrote:
> On Fri, 31 Jan 2014 13:22:41 +1000,
> Peter West <[email protected]> wrote:
>
> > A word of warning about text retrieved from PDF documents.
> >
> > Recovering text blocks from PDFs is inherently risky. PDF is a
> > page definition format, and so it has no notion of the semantics of
> > the text it contains. It places bits of text at certain positions
> > on the page. You can create a whole page of text by taking the
> > individual characters and their attributes and position on the
> > page, shuffling them, and writing them to the file. That will
> > produce a readable file, but try extracting the text from that
> > file. Unless you have a very, very smart text extractor that
> > reverse-engineers the process of creating the page, then calculates
> > the _visual_ order of the text elements, you will end up with
> > gibberish.
> >
> > _Most_ pdf text, _most_ of the time, is laid on the page in visual
> > order, but in even the best-behaved files, you are likely to be
> > surprised.
> >
> > If you don't _know_ that your PDF text extractor program is
> > completely visually accurate by design, don't tell your boss that
> > you can easily extract that PDF text, without allowing time for
> > proof-reading every page. You will get burned.
>
> That is why I open the PDF file in a separate program, select the
> text with the mouse, and copy/paste it (Ctrl-C/Ctrl-V). That way, I
> have full control over how the text will appear when I select it.
>
> I also use other programs such as pdfimages, pdftoppm and convert to
> extract the images directly from the PDF. The images can come out
> rotated or mirrored, which is why convert is useful too. When the
> images are split into small pieces, pdftoppm gives me an exact copy
> of each page of the PDF, one ppm file per page, which is then
> converted to JPEG. In that case, GIMP is useful to extract only the
> images from these files and cut out the text.
>
> The script I use for the images is attached.
> To use it, place it somewhere in your path, check that it is
> executable, go into the directory where your PDF file is, and run
> 'pdf2jpg'. It will only print a help message. Be aware that it will
> extract from all the PDF files in that directory on the fly. Also be
> aware that, even though the final output is JPEG files, ppm files are
> automatically used as intermediates when needed; the conversion is
> then much slower, and the ppm files can use a lot of disk space.
>
> So, if you want to extract pictures from a 100MB PDF file, count on
> at least 2GB of temporary disk usage to be safe in all cases. (This
> is an estimate from memory, so run your own tests if you don't have
> a lot of free disk space.)
>
> Also, with some distributions, you may have to adjust the names of
> the pdfimages and pdftoppm commands in the script. They are part of
> poppler on Gentoo (poppler-utils or something like that on Debian);
> in the past, they were part of xpdf.
>
> Dominique

The script didn't make it through. Here it is:
http://fvwm-crystal.sourceforge.net/other/pdf2jpg

Dominique

> > I don't know how LO extracts PDF text; perhaps it is very
> > sophisticated. I have my doubts.
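
To make Peter's point about visual order concrete, here is a small Python sketch. It is only an illustration under assumptions of my own: the fragments, their coordinates and the line_tolerance parameter are made up, and this is not how LO or any real extractor works. It shuffles a few positioned text fragments, as a PDF writer is free to do, then rebuilds a reading order by sorting on position.

```python
import random

# Hypothetical fragments: (x, y, text), with y measured upward from the
# bottom of the page. The values are invented for this illustration.
fragments = [
    (72, 700, "Recovering"), (140, 700, "text"), (170, 700, "blocks"),
    (72, 686, "is"), (90, 686, "inherently"), (150, 686, "risky."),
]

def file_order(frags):
    """Join the text in the order the fragments are stored."""
    return " ".join(text for _, _, text in frags)

def visual_order(frags, line_tolerance=2.0):
    """Naive reading order: top to bottom, then left to right.

    A fragment joins the current line when its y is within
    line_tolerance of that line's first fragment.
    """
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(frags, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return "\n".join(
        " ".join(text for _, text in sorted(items)) for _, items in lines
    )

# Store the fragments in an arbitrary order, as a PDF file may do.
shuffled = fragments[:]
random.Random(0).shuffle(shuffled)

print(file_order(shuffled))    # words in storage order
print(visual_order(shuffled))  # words in reconstructed reading order
```

Sorting like this only works for a single column of straight, unrotated text. Multiple columns, tables, rotated text or per-character placement all defeat it, which is why proof-reading is still needed.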
