Re: [Dorset] Manipulating PDF Files in Linux
You can try pdftohtml[1] to get it into HTML format; from there it should be easier to convert into the document format you want using something like pandoc[2].

[1]: http://pdftohtml.sourceforge.net/
[2]: http://johnmacfarlane.net/pandoc/README.html

On 18 July 2014 14:47, Terry Coles d-...@hadrian-way.co.uk wrote:
> Hi,
>
> Does anyone know how I can use tools available in Linux to convert a PDF file to MS Word .doc or .docx format (or even to LibreOffice .odt)?
>
> I thought I could do it using LibreOffice, but it reads the PDF content as if it is a series of graphical objects with text labels. As a consequence, I can only save it as .odg or export it to a graphical format.
>
> The problem is that we have a number of specifications in PDF format. We need to get them into an editable form (preferably Word) because they need translating. At work I tried the real thing (Adobe Writer), but it seriously mangles the format, even when it works.
>
> The originals seem to have been created using a number of different tools; some were created in MS Word 2010, some with PDFCreator (presumably from a Word source), some with Acrobat Distiller and some by conversion from PostScript. Adobe Writer was only able to save three out of five documents and they were not very good.
>
> --
> Terry Coles

--
Andrew Montgomery-Hurrell
Professional Geek
Blog: http://darkliquid.co.uk
Twitter: http://twitter.com/darkliquid
Fiction: http://www.protagonize.com/author/darkliquid

--
Next meeting: Bournemouth, Tuesday, 2014-08-05 20:00
Meets, Mailing list, IRC, LinkedIn, ... http://dorset.lug.org.uk/
New thread on mailing list: mailto:dorset@mailman.lug.org.uk
How to Report Bugs Effectively: http://goo.gl/4Xue
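The two-step route suggested above (pdftohtml, then pandoc) might look something like the sketch below. It assumes the poppler build of pdftohtml and a pandoc that can write .docx. The function name pdf_to_docx and the file names are made up for illustration, and images are dropped (-i) because the HTML is piped straight into pandoc rather than written to disk.

```shell
# Sketch only: PDF -> HTML (pdftohtml) -> DOCX (pandoc).
# pdf_to_docx is a hypothetical helper name, not part of either tool.
pdf_to_docx() {
    pdf="$1"
    out="${2:-${pdf%.pdf}.docx}"   # default: same base name, .docx extension
    # -stdout: write the HTML to standard output
    # -noframes: a single HTML page instead of a frameset
    # -i: skip images, since they cannot travel through the pipe
    pdftohtml -stdout -noframes -i "$pdf" |
        pandoc -f html -t docx -o "$out"
}
```

Something like `pdf_to_docx spec.pdf` would then be expected to leave spec.docx next to the original; swapping `-t docx` for `-t odt` (and the extension to match) should give a LibreOffice file instead. How much of the layout survives will depend heavily on the PDF.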
Re: [Dorset] Manipulating PDF Files in Linux
Hi Terry

On 18/07/14 14:47, Terry Coles wrote:
> Hi,
>
> Does anyone know how I can use tools available in Linux to convert a PDF file to MS Word .doc or .docx format (or even to LibreOffice .odt)?

The closest I'm aware of is pdftotext (also pdf2text, pdf2txt, etc.), but of course you'll lose the formatting. There's also pdf2ps, from which you may be able to use http://www.coolutils.com/PS-to-DOC or something similar.

Cheers
Tim
Re: [Dorset] Manipulating PDF Files in Linux
I've been using pdftohtml (get the latest version from poppler.freedesktop.org) with the '-complex -xml' options to generate an XML file, which I then process with a Perl program to make an ePub. Depending on the PDF, it does a pretty good job, and you may be able to import the XML directly?

--
best regards,
웃 Victor Churchill, Bournemouth
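For anyone wanting to look at that XML before deciding whether it imports anywhere, a rough sketch follows. It assumes poppler's pdftohtml, where -c selects complex mode and -xml selects XML output, and it assumes each piece of text arrives as a <text ...>...</text> element on a line of its own; that layout is an assumption worth checking against your own output. xml_text is a made-up helper name and the file names are placeholders.

```shell
# Generate the XML first (placeholder file names):
#   pdftohtml -c -xml spec.pdf spec      # should write spec.xml

# xml_text: hypothetical helper that prints the content of each <text>
# element, assuming one element per line as described above.
xml_text() {
    sed -n 's/.*<text[^>]*>\(.*\)<\/text>.*/\1/p' "$1"
}
```

Running `xml_text spec.xml | less` gives a quick view of what was extracted. Inline markup such as bold tags inside an element is passed through untouched, so this is only for inspection, not a substitute for a proper XML parser.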
Re: [Dorset] Manipulating PDF Files in Linux
On Friday 18 Jul 2014 15:02:50 Andrew Montgomery-Hurrell wrote:
> You can try pdftohtml[1] to get it into HTML format, from there it should be easier to convert into a document format you want using something like pandoc[2].

That was incredibly fast :-)

Unfortunately, for some reason the conversion inserted lots of unprintable chars, e.g.:

PRÉAMBULE

pdftotext did a better job of accurately converting the text, but lost all the formatting :-(

--
Terry Coles
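One way to narrow down where stray characters come from is to check whether the generated file is valid UTF-8 at all. If it is, the fault may lie in how the viewer guesses the character set rather than in the conversion itself. The sketch below uses iconv for that check. is_utf8 is a hypothetical helper name, and pdftohtml's -enc option (output text encoding) is mentioned only as something to experiment with, not as a confirmed fix for the problem described above.

```shell
# is_utf8: succeeds if the file decodes cleanly as UTF-8 (hypothetical helper).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Example use (placeholder file names):
#   is_utf8 spec.html && echo "valid UTF-8" || echo "not UTF-8"
#   pdftohtml -enc UTF-8 spec.pdf spec    # ask for UTF-8 output explicitly
```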
Re: [Dorset] Manipulating PDF Files in Linux
On Friday 18 Jul 2014 15:35:58 Terry Coles wrote:
> Unfortunately, for some reason the conversion inserted lots of unprintable chars, e.g.: PRÉAMBULE
>
> pdftotext did a better job of accurately converting the text, but lost all the formatting :-(

What worked was:

    pdftohtml -c -s pdffile

It did a lovely job of retaining both the formatting and the text.

When I get to work on Monday, I'll see if the translation company can work with the HTML. If not, I'll see what happens if I open the HTML in Word (LibreOffice demolished it).

--
Terry Coles
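Since there are several specifications to convert, the command that worked can be wrapped in a loop. This is a minimal sketch: convert_all is a made-up function name, the directory argument is a placeholder, and the exact names of the generated HTML files depend on the pdftohtml version.

```shell
# convert_all: run "pdftohtml -c -s" on every PDF in a directory
# (hypothetical helper; defaults to the current directory).
convert_all() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # no PDFs: the glob stays unexpanded
        pdftohtml -c -s "$pdf" || echo "failed: $pdf" >&2
    done
}
```

For example, `convert_all ~/specs` would process each PDF in that (hypothetical) directory and report any file that pdftohtml could not handle.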
Re: [Dorset] Manipulating PDF Files in Linux
Terry,

I've used pdftotext to good effect. pdftotext -layout will do a lot of what you want, but tabular matter is always a difficulty because it isn't a simple sequence of words.

You say 'because [the documents] need translating'; i.e. to another language or languages? Or have I misunderstood?

Are there many illustrations and of what character? Are they needed in the translated version(s) and if so will they need altering?

There may be some mileage (and also some work) in converting the documents to a notation in which the logical structure (rather than the actual layout on the page) is indicated by mark-up. I'm thinking mainly of LaTeX and its friends, though [X]HTML (used properly) has this character too. Then all you have to do is to swap the text of each English paragraph or other unit of text (caption, for instance) for a Spanish (etc) text and the final layout adjusts itself to fit. If the target language is right-to-left or ideographic it's harder but still within the scope of TeX.

I've read about pandoc and I'll be interested to hear from someone who's tried it. I have mixed feelings about markdown.

Good luck.
John

--
John Palmer
Preston near Weymouth, Dorset, England
e-mail: jo...@bcs.org.uk (plain text preferred)
website: http://www.palmyra.me.uk/
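To see how much of a table survives, the two pdftotext modes can be run side by side. A minimal sketch, with text_both_ways as a made-up name and the output file names chosen arbitrarily:

```shell
# text_both_ways: extract text in reading order and again with -layout
# (hypothetical helper; output names are arbitrary).
text_both_ways() {
    pdf="$1"
    base="${pdf%.pdf}"
    pdftotext "$pdf" "$base.raw.txt"             # default: reading order
    pdftotext -layout "$pdf" "$base.layout.txt"  # keep the physical page layout
}

# Afterwards, something like:
#   diff spec.raw.txt spec.layout.txt | less
```

With -layout, columns of a table tend to stay aligned with spaces, which can make it easier to tell which cell a piece of text came from, though merged or wrapped cells may still need sorting out by hand.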
Re: [Dorset] Manipulating PDF Files in Linux
On Friday 18 Jul 2014 16:35:11 John Palmer wrote:
> Terry, I've used pdftotext to good effect. pdftotext -layout will do a lot of what you want, but tabular matter is always a difficulty because it isn't a simple sequence of words.

Yes. Tables and images were a problem when I tried pdftotext.

> You say 'because [the documents] need translating'; i.e. to another language or languages? or have I misunderstood?

Yes. From French.

> Are there many illustrations and of what character? Are they needed in the translated version(s) and if so will they need altering?

Yes. But I'm more concerned with getting at the text inside the tables without too much re-interpretation of what is what.

> There may be some mileage (and also some work) in converting the documents to a notation in which the logical structure (rather than the actual layout on the page) is indicated by mark-up. I'm thinking mainly of LaTeX and its friends, though [X]HTML (used properly) has this character too. Then all you have to do is to swap the text of each English paragraph or other unit of text (caption, for instance) for a Spanish (etc) text and the final layout adjusts itself to fit. If the target language is right-to-left or ideographic it's harder but still within the scope of TeX.

Yes. As mentioned in my earlier post, I was able to get pdftohtml to do an excellent job.

--
Terry Coles
Re: [Dorset] Manipulating PDF Files in Linux
> You say 'because [the documents] need translating'; i.e. to another language or languages? or have I misunderstood?

Google Translate will have a go with documents (i.e. PDFs, Word files, etc.) if translation is what you're after. It won't tackle text in images, though.
Re: [Dorset] Manipulating PDF Files in Linux
On Friday 18 Jul 2014 17:13:26 Stephen Wolff wrote:
> Google translate will have a go with documents... (ie pdfs / word etc), if translation is what you're after.

Google Translate is pretty good and I've already used it to translate the tables of contents. The trouble is that these are specifications, so we can't afford any ambiguities; hence the intention to use a professional translation service.

--
Terry Coles
Re: [Dorset] Manipulating PDF Files in Linux
Hi Terry,

> Does anyone know how I can use tools available in Linux to convert a PDF file to MS Word .doc or .docx format (or even to LibreOffice .odt)?

Sounds like you've done what you wanted, but another option is to edit the PDF down to a mark-up form you're happy with. PDFs tend to be fairly illegible because they're concerned about file size. They can be turned into a pure-text form with http://qpdf.sourceforge.net/ After editing, the file can be turned back into a more typical PDF. QPDF can do quite a few other things with PDFs too.

mupdf will also decompress streams with its -d option, but QPDF is more aimed at the decompress → edit → compress path.

Cheers, Ralph.
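The decompress, edit, recompress cycle might be sketched as follows. The --qdf mode and the fix-qdf helper are part of QPDF; the function names and file names here are placeholders. The mutool line is an assumption about how current mupdf builds expose the -d option mentioned above.

```shell
# pdf_unpack: write an editable, uncompressed "QDF" copy of a PDF
# (hypothetical helper name).
pdf_unpack() {
    qpdf --qdf --object-streams=disable "$1" "$2"
}

# pdf_repack: after hand-editing, repair offsets and lengths with fix-qdf,
# then let qpdf write an ordinary PDF again (hypothetical helper name).
pdf_repack() {
    fix-qdf "$1" > "$1.fixed" && qpdf "$1.fixed" "$2"
}

# Alternative for decompression only (assumes a mupdf build that ships mutool):
#   mutool clean -d spec.pdf spec-plain.pdf
```

A typical round trip would be `pdf_unpack spec.pdf spec-edit.pdf`, then editing spec-edit.pdf in a text editor, then `pdf_repack spec-edit.pdf spec-new.pdf`. Bear in mind that the text inside the streams is still expressed as PDF drawing operators, so this suits small corrections better than wholesale translation.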