Does anyone out there have experience with:
- manipulating RTF files?
- or writing OpenOffice macros in Python?

I need to pre-process approximately 10,000 medical reports so they can be imported into an EMR. (They were originally saved as Word .docs; I'd like to give hearty thanks to the authors of "ooconvert" (http://sourceforge.net/projects/ooconvert/), which enabled me to do a batch conversion with much less effort than I was expecting...)

Anyway, the files as they now exist each contain a single section, with a header and footer on each page. The EMR's import function wants to see a slug of summary information as the first thing on the first page, which means that the header needs to be suppressed; however, I expect that the users will want to reprint these things in the future, so I don't want to delete the header entirely.

In short, what I want to do is insert a new section at the top of the file. My tasks are:
- figure out which codes I need to insert to create a new section with no header, and then re-enable the header once that section ends
- figure out where in the file to do the inserting (I thought I already had this worked out, but apparently not quite)
THEN
- figure out how to find the correct insertion point programmatically - either agnostically, by finding a particular pattern of symbols that occurs in the right location, or by actually parsing the RTF hierarchy to figure out where the meta-document ends and the document begins.

The agnostic solution would be much easier - and this is a one-off, so I'm not building for the ages here - but it really looks like homogeneous tag soup to me.

I have, of course, tried inserting the section myself and then compared the before-and-after files... but all I've got so far is a headache... (Not quite true - I think I'm close - but I'm getting angrier with Microsoft with every moment I spend looking at this stuff. Hyper-optimized Perl is a freakin' marvel of clarity next to this...)

    {\footerr \ltrpar \pard\plain \ltrpar\s22\qc
    \li0\ri0\nowidctlpar\tqc\tx4153\tqr\tx8306\wrapdefault\faauto\rin0\lin0\itap0
    \rtlch\fcs1 \af0\afs20\alang1025 \ltrch\fcs0
    \fs24\lang3081\langfe255\cgrid\langnp3081\langfenp255
    {\rtlch\fcs1 \af0 \ltrch\fcs0 \f1\fs16\insrsid5703726 \par }
    \pard \ltrpar\s22\qc
    \li0\ri0\nowidctlpar\tqc\tx4153\tqr\tx8306\wrapdefault\faauto\rin0\lin0\itap0\pararsid5703726
    {\rtlch\fcs1 \af0\afs24 \ltrch\fcs0
    \fs16\lang3081\langfe1033\loch\af1\hich\af43\langfenp1033\insrsid5703726

GAAAAAAAHHH!

It occurs to me that there might be another way - maybe I can automate OpenOffice to open each file and insert the blank section and some dummy text, and then, in Python, find the dummy text and replace it with the summary slug? Maybe even do the whole job with a macro? And never have to poke through RTF again?

So I was juggling the RTF spec (the RTFM?), a text editor, Word (so I can make sure the thing still looks right), and the import utility - when it suddenly struck me that someone out there may have done this before. (And yes, I've definitely Googled, but my Google-fu may be weak today.)

If anyone has insight into RTF Zen, or has some tips on batch macros in oO, I'd be obliged...

Marc
--
www.fsrtechnologies.com
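
For what it's worth, here is a minimal sketch of the "agnostic" search described above, somewhere between pattern matching and real parsing: it looks for groups that open with a header or footer control word (like the {\footerr group quoted above) and counts braces to find where each one closes. The names group_end, header_footer_spans, insert_at and the sample string are all made up for illustration, and the sketch assumes the file has been read into a Python string. It does not say which offset is the right insertion point or which control words to insert there; that still has to be worked out against Word and the import utility.

```python
import re

# Groups that open with a header/footer control word: \header, \headerl,
# \headerr, \headerf and the matching \footer variants.
HEADER_FOOTER = re.compile(r"\{\\(?:header|footer)[lrf]?(?![a-z])")


def group_end(rtf, start):
    """Return the index just past the '}' closing the group that opens at start."""
    if rtf[start] != "{":
        raise ValueError("start must point at '{'")
    depth = 0
    i = start
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            # Skip the backslash and the character after it, so that the
            # escaped braces \{ and \} are not counted as group delimiters.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced braces")


def header_footer_spans(rtf):
    """Yield (start, end) offsets of each header/footer group, in file order."""
    pos = 0
    while True:
        match = HEADER_FOOTER.search(rtf, pos)
        if match is None:
            return
        end = group_end(rtf, match.start())
        yield match.start(), end
        pos = end


def insert_at(rtf, offset, payload):
    """Return a copy of rtf with payload spliced in at offset."""
    return rtf[:offset] + payload + rtf[offset:]


# Made-up miniature document, only to exercise the functions above.
sample = r"{\rtf1{\header {\b Clinic}\par}{\footerr page \{1\}\par}\pard Body\par}"
for start, end in header_footer_spans(sample):
    print(start, end, sample[start:end])
```

Once the right offset is known (say, just before the first span or just after the last one), insert_at can splice in the slug plus whatever section codes turn out to be needed. Caveats: raw binary data embedded with \bin could contain stray braces and would confuse the counter, and none of this has been tried on the real files.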

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor