On 04/07/2022 17:59, Ian Bertram wrote:
> I have been sent a graphic-heavy document in ODT format. However, it looks as
> if it has been badly converted from a PDF file. The layout is scrambled,
> headers don’t align properly and there are a host of other issues. It is also
> in columns. Is there a simple way to strip out everything bar the words? I
> have tried saving it as a txt file, but this loses a lot of the paragraph
> numbering and introduces other layout issues. Saving in RTF format is even
> worse.

You might try this.
Open the .odt file with an archive manager (it's just a zip file), extract
content.xml, then run this Perl program, which will extract the text
from it:
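#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $filename = 'content.xml';

# Encode output as UTF-8 so non-ASCII characters print correctly.
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

# load_xml dies with its own message if the file cannot be read or parsed.
my $dom = XML::LibXML->load_xml(location => $filename) or die "open?\n";

# Print each top-level paragraph of the document body on its own line.
foreach my $para ($dom->findnodes('/office:document-content/office:body/office:text/text:p')) {
    # to_literal gives the paragraph's text with all markup removed.
    # ($text rather than $b, which is reserved for sort blocks.)
    my $text = $para->to_literal;
    print $text, "\n";
}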
Works for me but YMMV. BEWARE email line wrap in the 'foreach' line. The
single line should say foreach ....... p')) {
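
If the archive-manager step gets tedious, or the output is missing headings and text inside tables or frames (the XPath above only matches paragraphs directly under office:text), something along these lines might help. It is an untested-on-your-document sketch, not a definitive tool: the sub name odt_paragraphs is made up for illustration, and I am assuming the standard OpenDocument text namespace and that reading content.xml straight out of the zip with IO::Uncompress::Unzip suits your file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);
use XML::LibXML;

# Hypothetical helper: return the text of every paragraph and heading
# in an ODT file, in document order, one list element per block.
sub odt_paragraphs {
    my ($odt) = @_;

    # An ODT file is a zip archive; read content.xml into memory.
    my $xml;
    unzip($odt => \$xml, Name => 'content.xml')
        or die "Cannot read content.xml from $odt: $UnzipError\n";

    my $dom = XML::LibXML->load_xml(string => $xml);

    # Register the 'text' prefix explicitly (assumed: standard ODF namespace).
    my $xpc = XML::LibXML::XPathContext->new($dom);
    $xpc->registerNs(text => 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');

    # '//' reaches paragraphs and headings at any depth (tables, lists,
    # sections). Blocks nested inside another paragraph, such as text
    # boxes, are skipped here because the outer paragraph's literal
    # value already contains their text.
    my $xpath = '//text:p[not(ancestor::text:p)] | //text:h[not(ancestor::text:p)]';
    return map { $_->to_literal } $xpc->findnodes($xpath);
}

# Run as a script: perl odt2txt.pl document.odt > document.txt
unless (caller) {
    my $odt = shift @ARGV or die "usage: $0 file.odt\n";
    binmode(STDOUT, ':utf8');
    print "$_\n" for odt_paragraphs($odt);
}
```

The order of the output follows the order of the XML, which in a badly converted document may not match the reading order on the page, so expect to do some tidying by hand either way.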
--
Mike Scott (unet2 <at> [deletethis] scottsonline.org.uk)
Harlow Essex England
"The only way is Brexit" -- anon.