Re: [users] OO file reading: a question and a problem

Harold Fuchs Thu, 11 Jan 2007 14:19:34 -0800

On Wednesday, January 10, 2007 9:26 PM [GMT+1=CET],
Thiesmeyer <[EMAIL PROTECTED]> wrote:

I sent a message on June 9/06 inquiring about OO file format and got a
helpful reply from CP Hennessy.  I believe that, with considerable
effort, we can enable our proofreading software to read OO document
files.  But before we embark on such an enterprise, I have a general
question and a related problem.

Background:  Since 1990, our software product, Editor, has been
reading and analyzing document files from Word (through v. 11,
2003--not yet v. 12, 2007), WordPerfect (through v. X3, 2005), Works
(through v. 7, 2002, before they went), RTF, and HTML.  We use filter
routines that ignore all formatting information and graphics in such
a file and produce a plain ANSI/ASCII text image in an internal
buffer for analysis.  Our output is plain text.

My general question: is there a filter that can remove all formatting
information from an OO document to produce a plain text copy of the
document?

The related problem:  A customer sends the following message:  "I
would . . . like to report, as a potential problem, that RTF and Word
documents exported from OpenOffice cannot be read by Editor. I have
to [first] open them and save them in either Word or Wordpad."  Is
there an obvious reason why software that can read Word and RTF
documents produced by Microsoft products cannot read Word or RTF
documents formatted by Open Office?  Does OO change the headers or
the file extensions in some way?

Thanks for your help.

John Thiesmeyer
Serenity Software
[EMAIL PROTECTED]
www.serenity-software.com

This message (with attachments,
if any) was checked for viruses
before it was sent.



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

As far as I know OpenOffice uses the Open Document Format defined by the International Standards Organisation. The folk at [email protected] have almost certainly got tools that they could sell/give (hopefully sell, they need the money!) you for parsing these documents but, given that the contents are stored as XML (see below), it shouldn't be too hard to do.

Your message caused me to examine an OpenOffice document. It seems that each OpenOffice Writer document is a zipped collection of files, one of which is named "content.xml". Stripping the tags from this should (a) be fairly simple and (b) give you what you want. I discovered this on my Win XP Pro system by renaming a ".odt" (OO Writer document) so that it had a ".zip" extension. Windows then graciously allowed me to open the file in WinZip. I then opened content.xml in my text editor and could see the details. I'd hazard a guess that a 10 line perl script is all you'd need. Perhaps 10 is a bit on the high side ;-) I seem to remember from when I played on this field that there are any number of freebie XML parsers out there and I'm sure there's a published API for WinZip or 7zip or some such that will let you extract the relevant file from the OO document.
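
To make the idea concrete, here is a minimal sketch in Python rather than perl (my choice of language, not anything OpenOffice prescribes). It assumes the document body lives in "content.xml" inside the zip, as described above, and that the visible text sits in paragraph and heading elements of the ODF "text" namespace. The function name odt_to_text is made up for illustration. It is a rough filter, not a complete ODF reader: spacing elements such as tabs and line breaks are ignored, and a paragraph nested inside another paragraph (a text box, say) would show up twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF "text" namespace used by the elements in content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
# Paragraphs (text:p) and headings (text:h) carry the visible text.
PARA_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def odt_to_text(source):
    """Return the plain text of an .odt file.

    `source` may be a file path or a binary file-like object.
    (Hypothetical helper name, for illustration only.)
    """
    # An .odt file is an ordinary zip archive; pull out content.xml.
    with zipfile.ZipFile(source) as odt:
        xml_bytes = odt.read("content.xml")

    root = ET.fromstring(xml_bytes)
    lines = []
    for element in root.iter():
        if element.tag in PARA_TAGS:
            # itertext() yields the text of the element and all of its
            # children, which drops the formatting tags in between.
            lines.append("".join(element.itertext()))
    return "\n".join(lines)
```

Calling odt_to_text("letter.odt") (the file name is only an example) would return one line of text per paragraph or heading, which could then be fed to the same internal buffer your other filters fill.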

As to your other question I have no real idea. Microsoft's Word documents are stored in a closed (copyright/patented?) format with many details unpublished. I imagine that the DMCA precludes reverse engineering of many of its features and OO probably therefore has to guess. I've heard of OO producing Word documents that don't look "right" in Word but not of it producing ones that are unreadable.

Again, for more technical details I suggest you contact the OO developers at [email protected]


Harold Fuchs

London, England

