Forgot to Reply All.

---------- Forwarded message ----------
From: Marc Tompkins <marc.tompk...@gmail.com>
Date: Sat, Feb 14, 2009 at 11:35 AM
Subject: Re: [Tutor] Extract image from RTF file
To: Bryan Fodness <bryan.fodn...@gmail.com>

On Sat, Feb 14, 2009 at 8:40 AM, Bryan Fodness <bryan.fodn...@gmail.com> wrote:

> I have a large amount of RTF files where the only thing in them is an
> image. I would like to extract them and save them as a png.
> Eventually, I would like to also grab some text that is on the image.
> I think PIL has something for this.
>
> Does anyone have any suggestion on how to start this?

I'm no kind of expert, but I do have a pointer or two...

RTF files are text with lots and lots of funky-looking formatting, but generally not "binary" in the sense of requiring special handling (although, now that I just read about how pictures are stored in them, it seems there might be some exceptions...)

There's a Python library for dealing with RTF files (http://www.nava.de/2005/04/06/pyrtf/) but I haven't tried it; if you're comfortable opening text files and handling their contents, it might be simpler to roll your own for this task.

You'll want to look at the Microsoft RTF specification; version 1.6 is available here:
http://msdn.microsoft.com/en-us/library/aa140277(office.10).aspx

In particular, you'll be interested in the section on Pictures, which I'll excerpt here:

    Pictures

    An RTF file can include pictures created with other applications. These pictures can be in hexadecimal (the default) or binary format. Pictures are destinations, and begin with the \pict control word. The \pict keyword is preceded by \*\shppict destination control keyword as described in the following example. A picture destination has the following syntax:

    <pict>          '{' \pict (<brdr>? & <shading>? & <picttype> & <pictsize> & <metafileinfo>?) <data> '}'
    <picttype>      | \emfblip | \pngblip | \jpegblip | \macpict | \pmmetafile | \wmetafile | \dibitmap <bitmapinfo> | \wbitmap <bitmapinfo>
    <bitmapinfo>    \wbmbitspixel & \wbmplanes & \wbmwidthbytes
    <pictsize>      (\picw & \pich) \picwgoal? & \pichgoal? \picscalex? & \picscaley? & \picscaled? & \piccropt? & \piccropb? & \piccropr? & \piccropl?
    <metafileinfo>  \picbmp & \picbpp
    <data>          (\bin #BDATA) | #SDATA

Basically, it looks like you can search for "{\pict", then search for the closing "}". Everything in between will be your picture, plus metadata that tells you how to decode it.

Now that you've caught your rabbit... I'm out of advice; I've never used PIL (though I used to listen to them all the time.)

-- 
www.fsrtechnologies.com
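
The search-and-decode idea above could be sketched roughly as follows in Python 3, using only the standard library. Treat it as a minimal sketch that has not been run against real-world files. The names extract_pictures, save_pictures, BLIP_EXTENSIONS and CONTROL_WORD are made up for illustration. It assumes the picture data is in the default hexadecimal form (any picture that uses \bin is skipped), and it counts braces instead of stopping at the first "}", because a picture group can contain nested groups of its own.

```python
import re
from pathlib import Path

# Hypothetical mapping from <picttype> control words to file extensions.
# Only the two formats that are already complete image files are handled.
BLIP_EXTENSIONS = {"pngblip": "png", "jpegblip": "jpg"}

# A control word: backslash, letters, optional numeric parameter, optional space.
CONTROL_WORD = re.compile(r"\\([a-z]+)(-?\d+)? ?")


def extract_pictures(rtf):
    """Yield (extension, image_bytes) for each hex-encoded {\\pict ...} group."""
    pos = 0
    while True:
        start = rtf.find("{\\pict", pos)
        if start == -1:
            return
        depth = 0
        top_level = []  # text that sits directly inside the picture group
        i = start
        while i < len(rtf):
            ch = rtf[i]
            if ch == "\\":
                # Keep the backslash and the next character together so that
                # escaped braces (\{ and \}) are never counted as group marks.
                if depth == 1:
                    top_level.append(rtf[i:i + 2])
                i += 2
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break
            elif depth == 1:
                top_level.append(ch)
            i += 1
        if depth != 0:
            return  # unbalanced braces: give up rather than guess
        pos = i + 1

        body = "".join(top_level)
        words = [m.group(1) for m in CONTROL_WORD.finditer(body)]
        if "bin" in words:
            continue  # assumption: binary pictures are out of scope here
        hex_text = "".join(CONTROL_WORD.sub("", body).split())
        try:
            data = bytes.fromhex(hex_text)
        except ValueError:
            continue  # not clean hex data, skip this picture
        extension = next(
            (BLIP_EXTENSIONS[w] for w in words if w in BLIP_EXTENSIONS), "bin"
        )
        yield extension, data


def save_pictures(rtf_path, out_dir):
    """Write every picture found in one RTF file into out_dir."""
    rtf_path, out_dir = Path(rtf_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # latin-1 maps every byte to a character, so reading never fails.
    text = rtf_path.read_text(encoding="latin-1")
    for n, (extension, data) in enumerate(extract_pictures(text)):
        (out_dir / f"{rtf_path.stem}_{n}.{extension}").write_bytes(data)
```

For a \pngblip picture the decoded bytes should already be a PNG file, so writing them out is enough. Other picture types (metafiles, bitmaps) are saved here with a ".bin" extension and would need a separate conversion step.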
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor