Re: [Dorset] Manipulating PDF Files in Linux

2014-07-18 Thread Andrew Montgomery-Hurrell
You can try pdftohtml[1] to get it into HTML format, from there it should
be easier to convert into a document format you want using something like
pandoc[2].

[1]: http://pdftohtml.sourceforge.net/
[2]: http://johnmacfarlane.net/pandoc/README.html
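
A rough sketch of that two-step route, assuming poppler's pdftohtml and pandoc are both installed. The helper names and the choice of -noframes (one HTML file instead of a frameset) are illustrative assumptions, not something stated in the thread.

```python
import shutil
import subprocess
from pathlib import Path


def pdf_to_doc_commands(pdf_path, out_ext="docx"):
    """Return the two command lines: PDF -> HTML, then HTML -> document."""
    pdf = Path(pdf_path)
    html = pdf.with_suffix(".html")
    doc = pdf.with_suffix("." + out_ext)
    return [
        # -noframes asks pdftohtml for a single HTML file
        ["pdftohtml", "-noframes", str(pdf), str(html)],
        # pandoc picks the output format from the extension given to -o
        ["pandoc", "-f", "html", "-o", str(doc), str(html)],
    ]


def run_pipeline(pdf_path, out_ext="docx"):
    """Run both steps, stopping early if a tool is missing or fails."""
    for cmd in pdf_to_doc_commands(pdf_path, out_ext):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(cmd[0] + " is not installed")
        subprocess.run(cmd, check=True)
```

Calling run_pipeline("spec.pdf", "odt") would aim for an .odt instead; how much layout survives depends heavily on the PDF.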


On 18 July 2014 14:47, Terry Coles d-...@hadrian-way.co.uk wrote:

 Hi,

 Does anyone know how I can use tools available in Linux to convert a PDF
 file to MS Word .doc or .docx format (or even to LibreOffice .odt)?

 I thought I could do it using LibreOffice, but it reads the PDF content as
 if it is a series of graphical objects with text labels.  As a consequence,
 I can only save it as .odg or export it to a graphical format.

 The problem is that we have a number of specifications in PDF format.  We
 need to get them into an editable form (preferably Word) because they need
 translating.

 At work I tried the real thing (Adobe Writer), but it seriously mangles the
 format, even when it works.

 The originals seem to have been created using a number of different tools;
 some were created in MS Word 2010, some with PDFCreator (presumably from a
 Word source), some with Acrobat Distiller and some by conversion from
 PostScript.  Adobe Writer was only able to save three out of five documents
 and they were not very good.

 --

 Terry Coles



 --
 Next meeting:  Bournemouth, Tuesday, 2014-08-05 20:00
 Meets, Mailing list, IRC, LinkedIn, ...  http://dorset.lug.org.uk/
 New thread on mailing list:  mailto:dorset@mailman.lug.org.uk
 How to Report Bugs Effectively:  http://goo.gl/4Xue




-- 
Andrew Montgomery-Hurrell
Professional Geek
Blog: http://darkliquid.co.uk
Twitter: http://twitter.com/darkliquid
Fiction: http://www.protagonize.com/author/darkliquid


Re: [Dorset] Manipulating PDF Files in Linux

2014-07-18 Thread TimA

Hi Terry

On 18/07/14 14:47, Terry Coles wrote:

 Hi,

 Does anyone know how I can use tools available in Linux to convert a PDF file
 to MS Word .doc or .docx format (or even to LibreOffice .odt)?

Closest I'm aware of is pdftotext (also pdf2text, pdf2txt etc). But of
course you'll lose the formatting. There's also pdf2ps, whose PostScript
output you can maybe feed to http://www.coolutils.com/PS-to-DOC or
something similar.

Cheers

Tim





Re: [Dorset] Manipulating PDF Files in Linux

2014-07-18 Thread Victor Churchill
I've been using pdftohtml (get the latest version from
poppler.freedesktop.org) with the '-c -xml' options, to generate an
XML file (which I am then processing with a Perl prog to make an ePub) -
depending on the PDF, it does a pretty good job, and you may be able to
import the XML directly?
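
As a rough idea of what post-processing that XML can look like, here is a minimal sketch. The SAMPLE string is a hypothetical, hand-written fragment shaped like `pdftohtml -xml` output (pages containing positioned text elements); real files carry more attributes and font definitions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of `pdftohtml -xml` output.
SAMPLE = """<pdf2xml>
<page number="1" height="1263" width="892">
<text top="120" left="90" width="200" height="18" font="0"><b>PRÉAMBULE</b></text>
<text top="150" left="90" width="400" height="14" font="1">Premier paragraphe.</text>
</page>
</pdf2xml>"""


def page_lines(xml_text):
    """Yield (page number, top, left, text) for every non-empty <text> element."""
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        number = int(page.get("number"))
        for node in page.iter("text"):
            # itertext() flattens inline children such as <b> or <i>
            text = "".join(node.itertext()).strip()
            if text:
                yield number, int(node.get("top")), int(node.get("left")), text


for line in page_lines(SAMPLE):
    print(line)
```

For a file on disk, ET.parse(path).getroot() would replace ET.fromstring. The top/left coordinates are what make it possible to regroup lines into paragraphs or table rows afterwards.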





-- 
best regards,
웃
Victor Churchill,
Bournemouth

Re: [Dorset] Manipulating PDF Files in Linux

2014-07-18 Thread Terry Coles
On Friday 18 Jul 2014 15:02:50 Andrew Montgomery-Hurrell wrote:
 You can try pdftohtml[1] to get it into HTML format, from there it should
 be easier to convert into a document format you want using something like
 pandoc[2].
 
 [1]: http://pdftohtml.sourceforge.net/
 [2]: http://johnmacfarlane.net/pandoc/README.html

That was incredibly fast :-)

Unfortunately, for some reason the conversion inserted lots of unprintable
chars, e.g.:

PRÉAMBULE

pdftotext did a better job of accurately converting the text, but lost all the 
formatting :-(
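
The thread does not say where the stray characters came from. One common cause, offered here purely as an assumption, is UTF-8 output being read as Latin-1 somewhere along the way. A small sketch of that effect and its reversal:

```python
def misdecode(text, wrong="latin-1"):
    """Show what UTF-8 text turns into when its bytes are read as `wrong`."""
    return text.encode("utf-8").decode(wrong)


def repair(garbled, wrong="latin-1"):
    """Undo misdecode(): re-encode with the wrong codec, then decode as UTF-8."""
    return garbled.encode(wrong).decode("utf-8")


garbled = misdecode("PRÉAMBULE")
# The accented letter becomes two characters; the second is a control code.
print([hex(ord(ch)) for ch in garbled[2:4]])
print(repair(garbled))
```

If this is what happened, telling the viewer or the next tool in the chain that the file is UTF-8 is usually enough, and no repair step is needed.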

-- 

Terry Coles




Re: [Dorset] Manipulating PDF Files in Linux

2014-07-18 Thread Terry Coles
What worked was:

pdftohtml -c -s pdffile

It did a lovely job of retaining both the formatting and the text.  When I get
to work on Monday, I'll see if the translation company can work with the HTML.
If not, I'll see what happens if I open the HTML in Word (LibreOffice
demolished it).
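
With several specifications to convert, the same command could be run over a whole directory. A minimal sketch, assuming poppler's pdftohtml is on the PATH; the function names and the directory layout are made up for illustration.

```python
import subprocess
from pathlib import Path


def complex_single_cmd(pdf_path):
    """Command line for the invocation that worked: -c (complex) and -s (single document)."""
    return ["pdftohtml", "-c", "-s", str(pdf_path)]


def convert_all(directory):
    """Run the command on every .pdf in `directory`; return the names converted."""
    converted = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        # check=True stops at the first PDF that pdftohtml cannot handle
        subprocess.run(complex_single_cmd(pdf), check=True)
        converted.append(pdf.name)
    return converted
```

pdftohtml chooses the output file names itself, so it is worth checking what appears next to each PDF before handing anything to the translators.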

-- 

Terry Coles




Re: [Dorset] Manipulating PDF Files in Linux

2014-07-18 Thread John Palmer
Terry, I've used pdftotext to good effect.
pdftotext -layout will do a lot of what you want, but tabular matter is
always a difficulty because it isn't a simple sequence of words.
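
A small sketch of both halves of that remark: the pdftotext -layout call, and a naive attempt to recover table cells from its output. The column splitter is an illustrative heuristic (cells separated by two or more spaces), not a feature of pdftotext, and the sample row is invented.

```python
import re


def pdftotext_layout_cmd(pdf_path, txt_path):
    """Command line for layout-preserving text extraction, written as UTF-8."""
    return ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def split_columns(line):
    """Guess table cells by splitting on runs of two or more spaces."""
    return [cell for cell in re.split(r" {2,}", line.strip()) if cell]


print(split_columns("Tension nominale      230 V      50 Hz"))
```

The heuristic breaks as soon as a cell is empty or wraps onto a second line, which is exactly why tables are awkward in plain text.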

You say 'because [the documents] need translating'; i.e. to another
language or languages? or have I misunderstood?
Are there many illustrations and of what character?  Are they needed in
the translated version(s) and if so will they need altering?

There may be some mileage (and also some work) in converting the
documents to a notation in which the logical structure (rather than the
actual layout on the page) is indicated by mark-up.  I'm thinking mainly
of LaTeX and its friends, though [X]HTML (used properly) has this
character too.  Then all you have to do is to swap the text of each
English paragraph or other unit of text (caption, for instance) for a
Spanish (etc) text and the final layout adjusts itself to fit.
If the target language is right-to-left or ideographic it's harder but
still within the scope of TeX.
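
To make the idea concrete, here is a minimal sketch of swapping text units inside structural mark-up. It assumes well-formed XHTML and a hypothetical lookup table supplied by the translator; neither comes from the thread.

```python
import xml.etree.ElementTree as ET


def swap_text(xhtml, translations):
    """Replace the text of each element that has an entry in `translations`."""
    root = ET.fromstring(xhtml)
    for node in root.iter():
        if node.text and node.text.strip() in translations:
            node.text = translations[node.text.strip()]
    # The mark-up is untouched, so headings stay headings, captions stay captions
    return ET.tostring(root, encoding="unicode")


source = "<body><h1>Scope</h1><p>This document defines the interface.</p></body>"
table = {
    "Scope": "Alcance",
    "This document defines the interface.": "Este documento define la interfaz.",
}
print(swap_text(source, table))
```

Only the leading text of each element is handled here; paragraphs with inline emphasis would need the tail text of child elements looked up as well.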

I've read about pandoc and I'll be interested to hear from someone who's
tried it.  I have mixed feelings about markdown.
Good luck.
John

-- 
John Palmer
Preston near Weymouth, Dorset, England
e-mail:  jo...@bcs.org.uk (plain text preferred)
website: http://www.palmyra.me.uk/




Re: [Dorset] Manipulating PDF Files in Linux

2014-07-18 Thread Terry Coles
On Friday 18 Jul 2014 16:35:11 John Palmer wrote:
 Terry, I've used pdftotext to good effect.
 pdftotext -layout will do a lot of what you want, but tabular matter is
 always a difficulty because it isn't a simple sequence of words.

Yes.  Tables and images were a problem when I tried pdftotext.
 
 You say 'because [the documents] need translating'; i.e. to another
 language or languages? or have I misunderstood?

Yes.  From French.

 Are there many illustrations and of what character?  Are they needed in
 the translated version(s) and if so will they need altering?

Yes.  But I'm more concerned with getting at the text inside the Tables 
without too much re-interpretation of what is what.

 There may be some mileage (and also some work) in converting the
 documents to a notation in which the logical structure (rather than the
 actual layout on the page) is indicated by mark-up.  I'm thinking mainly
 of LaTeX and its friends, though [X]HTML (used properly) has this
 character too.  Then all you have to do is to swap the text of each
 English paragraph or other unit of text (caption, for instance) for a
 Spanish (etc) text and the final layout adjusts itself to fit.
 If the target language is right-to-left or ideographic it's harder but
 still within the scope of TeX.

Yes.  As mentioned in my earlier post, I was able to get pdftohtml to do an 
excellent job.

-- 

Terry Coles





Re: [Dorset] Manipulating PDF Files in Linux

2014-07-18 Thread Stephen Wolff

 You say 'because [the documents] need translating'; i.e. to another
 language or languages? or have I misunderstood?
Google translate will have a go with documents... (ie pdfs / word etc),
if translation is what you're after.

It won't tackle text in images though



Re: [Dorset] Manipulating PDF Files in Linux

2014-07-18 Thread Terry Coles
On Friday 18 Jul 2014 17:13:26 Stephen Wolff wrote:
 Google translate will have a go with documents... (ie pdfs / word etc),
 if translation is what you're after.

Google Translate is pretty good and I've already used it to translate the 
Tables of Contents.

The trouble is that these are specifications, so we can't afford any
ambiguities, hence the intention to use a professional translation service.

-- 

Terry Coles





Re: [Dorset] Manipulating PDF Files in Linux

2014-07-18 Thread Ralph Corderoy
Hi Terry,

 Does anyone know how I can use tools available in Linux to convert a
 PDF file to MS Word .doc or .docx format (or even to LibreOffice
 .odt)?

Sounds like you've done what you wanted, but another option is to edit
the PDF down to a mark-up form you're happy with.  PDFs tend to be
fairly illegible because they're concerned about file size.  They can be
turned into a pure-text form with QPDF (http://qpdf.sourceforge.net/).
After editing, it can be turned back into a more typical PDF.  QPDF can
do quite a few other things with PDFs too.

mupdf will also decompress streams with its -d option, but QPDF is more
aimed at the decompress → edit → compress path.
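
A sketch of that decompress, edit, compress cycle, assuming the qpdf package (which also ships the fix-qdf helper) is installed. The file names and the `edit` callback are placeholders for whatever hand-editing is done in between.

```python
import subprocess


def qpdf_edit_cycle(src, editable, fixed, final):
    """Return the three command lines for expand, repair and recompress."""
    return [
        # QDF mode writes an uncompressed, editor-friendly PDF
        ["qpdf", "--qdf", "--object-streams=disable", str(src), str(editable)],
        # fix-qdf repairs offsets and stream lengths after hand edits (writes to stdout)
        ["fix-qdf", str(editable)],
        # a plain qpdf run compresses streams again by default
        ["qpdf", str(fixed), str(final)],
    ]


def run_edit_cycle(src, editable, fixed, final, edit):
    """Run the cycle, calling edit(editable) between expand and repair."""
    expand, repair, recompress = qpdf_edit_cycle(src, editable, fixed, final)
    subprocess.run(expand, check=True)
    edit(editable)
    with open(fixed, "wb") as out:
        subprocess.run(repair, check=True, stdout=out)
    subprocess.run(recompress, check=True)
```

Small text changes inside content streams are the realistic use; reflowing whole paragraphs this way is rarely practical.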

Cheers, Ralph.
