On Wednesday, January 10, 2007 9:26 PM [GMT+1=CET],
Thiesmeyer <[EMAIL PROTECTED]> wrote:

I sent a message on June 9/06 inquiring about OO file format and got a
helpful reply from CP Hennessy.  I believe that, with considerable
effort, we can enable our proofreading software to read OO document
files.  But before we embark on such an enterprise, I have a general
question and a related problem.

Background:  Since 1990, our software product, Editor, has been
reading and analyzing document files from Word (through v. 11,
2003--not yet v. 12, 2007), WordPerfect (through v. X3, 2005), Works
(through v. 7, 2002, before they went), RTF, and HTML.  We use filter
routines that ignore all formatting information and graphics in such
a file and produce a plain ANSI/ASCII text image in an internal
buffer for analysis.  Our output is plain text.

My general question: is there a filter that can remove all formatting
information from an OO document to produce a plain text copy of the
document?

The related problem:  A customer sends the following message:  "I
would . . . like to report, as a potential problem, that RTF and Word
documents exported from OpenOffice cannot be read by Editor. I have
to [first] open them and save them in either Word or Wordpad."  Is
there an obvious reason why software that can read Word and RTF
documents produced by Microsoft products cannot read Word or RTF
documents formatted by OpenOffice?  Does OO change the headers or
the file extensions in some way?

Thanks for your help.

John Thiesmeyer
Serenity Software
[EMAIL PROTECTED]
www.serenity-software.com




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

As far as I know, OpenOffice uses the OpenDocument Format, an OASIS standard that has also been published by the International Organization for Standardization as ISO/IEC 26300. The folk at [email protected] have almost certainly got tools that they could sell/give (hopefully sell, they need the money!) you for parsing these documents but, given that the contents are stored as XML (see below), it shouldn't be too hard to do yourself.

Your message prompted me to examine an OpenOffice document. It seems that each OpenOffice Writer document is a zipped collection of files, one of which is named "content.xml". Stripping the tags from this should (a) be fairly simple and (b) give you what you want. I discovered this on my Win XP Pro system by renaming an ".odt" file (an OO Writer document) so that it had a ".zip" extension. Windows then graciously allowed me to open the file in WinZip. I then opened content.xml in my text editor and could see the details. I'd hazard a guess that a 10-line Perl script is all you'd need. Perhaps 10 is a bit on the high side ;-) I seem to remember from when I played on this field that there are any number of freebie XML parsers out there, and I'm sure there's a published API for WinZip or 7-Zip or some such that will let you extract the relevant file from the OO document.
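
To make the idea above concrete, here is a minimal sketch in Python (chosen only for brevity, not the Perl mentioned above) that opens an .odt file as a zip archive, reads content.xml, and keeps only the text of paragraphs and headings. The function name odt_to_text is hypothetical, and the sketch rests on assumptions: it ignores ODF's special elements for repeated spaces, tabs, and line breaks, and it would repeat text that sits in a paragraph nested inside another paragraph (for example in a text box).

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by ODF for text elements such as text:p and text:h.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARAGRAPH_TAGS = {"{%s}p" % TEXT_NS, "{%s}h" % TEXT_NS}


def odt_to_text(path):
    """Return the paragraphs and headings of an .odt file as plain text.

    Hypothetical helper: one output line per text:p or text:h element,
    with all inline formatting tags dropped.
    """
    # An .odt file is an ordinary zip archive; the body is in content.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    lines = []
    for element in root.iter():
        if element.tag in PARAGRAPH_TAGS:
            # itertext() yields the text of the element and of every
            # descendant, so spans and other inline markup fall away.
            lines.append("".join(element.itertext()))
    return "\n".join(lines)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(odt_to_text(sys.argv[1]))
```

Saved under a name of your choosing, the script takes the path to a Writer document as its only argument and prints the text. A production filter would also need to handle tables, footnotes, and tracked changes deliberately, which this sketch does not attempt.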

As to your other question I have no real idea. Microsoft's Word documents are stored in a closed (copyright/patented?) format with many details unpublished. I imagine that the DMCA precludes reverse engineering of many of its features and OO probably therefore has to guess. I've heard of OO producing Word documents that don't look "right" in Word but not of it producing ones that are unreadable.
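
One way to test the "does OO change the headers" question is to compare the first few bytes of a file exported from OpenOffice with those of the same document saved from Word or WordPad. The sketch below is only a diagnostic aid and says nothing about how Editor itself identifies files; the function names are hypothetical. The signatures are the standard ones: RTF files begin with the characters {\rtf, binary Word .doc files are OLE2 compound files beginning with the bytes D0 CF 11 E0 A1 B1 1A E1, and zip-based files such as .odt begin with PK.

```python
# Leading byte signatures of the formats discussed in this thread.
RTF_MAGIC = b"{\\rtf"
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # binary Word .doc container
ZIP_MAGIC = b"PK\x03\x04"                       # zip archives, including .odt


def sniff_format(data):
    """Guess a file's container format from its leading bytes."""
    if data.startswith(RTF_MAGIC):
        return "rtf"
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def sniff_file(path):
    """Read the first eight bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8))
```

If both files report the same format, the container signature is not the problem, and the difference is more likely to lie in the content that follows it, which would need a closer comparison of the two files.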

Again, for more technical details I suggest you contact the OO developers at [email protected]

Harold Fuchs
London, England