> Jason Haar wrote:
>
>> Speaking of image/rtf/word attachment spam; is there any work going on
>> to standardize this so that the textual output of such attachments could
>> be fed back into SA?

On 24.06.09 19:33, Jonas Eckerman wrote:
> Just as a note:
>
> I'm currently working on a modular plugin for extracting text and adding
> it to SA message parts.

If possible, extract images too, so that FuzzyOcr and similar plugins can
look at them as well.

IIRC spammers even embedded PDFs in .doc files to make detection harder, but
if you manage the above, it shouldn't be hard to extract the PDFs too :)

(and then to extract text/images from those PDFs as well)

> The plugin can use either external tools or its own simple plugin
> modules. How to extract text from parts is configurable, and based on
> mime types and file names, so new formats can be added by simply
> configuring for new external tools or creating a new plugin module.
>
> My *far* from finished module currently manages to extract text from  
> Word documents (using antiword), OpenXML text documents (using a simple  
> plugin) and RTF (using unrtf).
>
> I haven't tested where and how the extracted text is available to
> SpamAssassin yet (as noted, it's *far* from finished), but I am using
> the "set_rendered" method as in the example, so it should work. ;-)

great!
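
For illustration only, here is a rough sketch of the kind of dispatch
described above: a rule table keyed on MIME type and file name that maps
each part to either an external tool or a small built-in extractor. It is
written in Python rather than the Perl a real SA plugin would use, and it
is not Jonas's code. The rule table, the function names, the "{file}"
placeholder and the exact command lines are all assumptions made for the
example.

```python
import fnmatch
import io
import re
import zipfile


def extract_openxml(data: bytes) -> str:
    """Built-in extractor (assumed design): read word/document.xml from a
    .docx zip container and strip the markup. XML entities are left as-is."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        xml = zf.read("word/document.xml").decode("utf-8", errors="replace")
    # Turn paragraph ends into newlines, then drop every remaining tag.
    xml = re.sub(r"</w:p>", "\n", xml)
    return re.sub(r"<[^>]+>", "", xml).strip()


# Hypothetical rule table: (MIME type pattern, file name pattern, handler).
# A handler is either an external command template (list of arguments,
# with "{file}" standing in for a temporary file) or a Python callable.
RULES = [
    ("application/msword", "*.doc", ["antiword", "{file}"]),
    ("application/rtf", "*.rtf", ["unrtf", "--text", "{file}"]),
    ("application/vnd.openxmlformats-officedocument.*", "*.docx",
     extract_openxml),
]


def pick_handler(mime_type: str, filename: str):
    """Return the first handler whose MIME type OR file name pattern
    matches, or None if no rule applies."""
    mime_type = mime_type.lower()
    filename = filename.lower()
    for mime_pat, name_pat, handler in RULES:
        if (fnmatch.fnmatchcase(mime_type, mime_pat)
                or fnmatch.fnmatchcase(filename, name_pat)):
            return handler
    return None
```

Matching on the file name as well as the MIME type is deliberate in this
sketch, since spam attachments often arrive with a generic or wrong type
such as application/octet-stream. Supporting a new format then only means
adding a row to the table.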
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
If Barbie is so popular, why do you have to buy her friends? 
