On 04/07/2022 17:59, Ian Bertram wrote:
I have been sent a graphic-heavy document in ODT format. However, it looks as if
it has been badly converted from a PDF file. The layout is scrambled, headers
don’t align properly and there are a host of other issues. It is also in
columns. Is there a simple way to strip out everything bar the words? I have
tried saving it as a txt file, but this loses a lot of the paragraph numbering
and introduces other layout issues. Saving in RTF format is even worse.

You might try this.

Open the ODT file with an archive manager; it's just a zip file. Extract content.xml into the current directory, then run this Perl program, which will extract the text from it:

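#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;

# content.xml as extracted from the .odt archive
my $filename = 'content.xml';

# The document text may contain non-ASCII characters, so write UTF-8
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

my $dom = XML::LibXML->load_xml(location => $filename) or die "Cannot parse $filename\n";

# Print each top-level paragraph of the document body on its own line
foreach my $para ($dom->findnodes('/office:document-content/office:body/office:text/text:p')) {
        my $text = $para->to_literal;
        print $text, "\n";
}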


Works for me, but YMMV. Beware of email line wrap in the 'foreach' line: everything from foreach through to p')) { must be on one single line.
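
If no archive manager is to hand, the "extract content.xml" step can probably be scripted as well. What follows is only a sketch, assuming the IO::Uncompress::Unzip module that is bundled with recent Perl releases; the helper name extract_member and the file name document.odt are made up for illustration and are not part of the program above.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use IO::Uncompress::Unzip qw($UnzipError);

# Hypothetical helper (not part of the original program): copy one member
# of a zip archive, such as content.xml inside an .odt, to a file on disk.
sub extract_member {
    my ($archive, $member, $out) = @_;

    my $zip = IO::Uncompress::Unzip->new($archive, Name => $member)
        or die "Cannot read $member from $archive: $UnzipError\n";

    open(my $fh, '>:raw', $out) or die "Cannot write $out: $!\n";

    # Copy the decompressed bytes across in chunks.
    my ($buffer, $status);
    while (($status = $zip->read($buffer)) > 0) {
        print {$fh} $buffer;
    }
    die "Error while reading $member: $UnzipError\n" if $status < 0;

    close($fh) or die "Cannot close $out: $!\n";
    return $out;
}

# Usage (assumed file name): perl extract-content.pl document.odt
extract_member($ARGV[0], 'content.xml', 'content.xml') if @ARGV;
```

The output is written as raw bytes so that the XML parser, rather than this script, decides how to decode it.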


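One place where mileage may vary: the XPath in the program above matches only text:p elements that sit directly under office:text, so headings (text:h) and paragraphs nested inside lists, tables or sections would be skipped. A possible variant is sketched below. It assumes the standard ODF text namespace URI and registers it explicitly via XML::LibXML::XPathContext; the function name odf_text_blocks is invented for this example, and it has not been tried on the document in question.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical helper: return the text of every heading and paragraph
# in a parsed content.xml, in document order.
sub odf_text_blocks {
    my ($dom) = @_;

    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs(text => 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

    # Take text:p and text:h at any depth, but skip ones nested inside
    # another paragraph or heading (e.g. text boxes), because the outer
    # element's text already includes them.
    my $xpath = '//*[(self::text:p or self::text:h)'
              . ' and not(ancestor::text:p) and not(ancestor::text:h)]';

    return map { $_->textContent } $xpc->findnodes($xpath);
}

# Usage: perl odf-text.pl content.xml
if (@ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    my $dom = XML::LibXML->load_xml(location => $ARGV[0]);
    print "$_\n" for odf_text_blocks($dom);
}
```

List numbers are usually generated by the word processor rather than stored as text, so this is unlikely to bring back the lost paragraph numbering; it only widens which blocks of text are collected.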

--
Mike Scott (unet2 <at> [deletethis] scottsonline.org.uk)
Harlow Essex England
"The only way is Brexit" -- anon.
