We can use antiword to render text from MSWord files, and unrtf to render text 
from RTF files.  What is the best tool to render text from PDF files?

(We are running Solaris 9)

L

> -----Original Message-----
> From: Jonas Eckerman [mailto:jonas_li...@frukt.org]
> Sent: Wednesday, June 24, 2009 1:34 PM
> To: users@spamassassin.apache.org
> Subject: Plugin extracting text from docs (was: new spam using large
> images)
> 
> Jason Haar wrote:
> 
> > Speaking of image/rtf/word attachment spam; is there any work going on
> > to standardize this so that the textual output of such attachments
> > could be fed back into SA?
> 
> Just as a note:
> 
> I'm currently working on a modular plugin for extracting text and
> adding it to SA message parts.
> 
> The plugin can use either external tools or its own simple plugin
> modules. How to extract text from parts is configurable, and based on
> mime types and file names, so new formats can be added by simply
> configuring for new external tools or creating a new plugin module.
> 
> My *far* from finished module currently manages to extract text from
> Word documents (using antiword), OpenXML text documents (using a simple
> plugin) and RTF (using unrtf).
> 
> I haven't tested where and how the extracted text is available to
> SpamAssassin yet (as noted, it's *far* from finished), but I am using
> the "set_rendered" method as in the example, so it should work. ;-)
> 
> Regards
> /Jonas
> --
> Jonas Eckerman
> Fruktträdet & Förbundet Sveriges Dövblinda
> http://www.fsdb.org/
> http://www.frukt.org/
> http://whatever.frukt.org/
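
For illustration only, the dispatch Jonas describes (choosing an extractor from the MIME type or file name, then running an external tool) might look roughly like the sketch below. It is written in Python rather than as a SpamAssassin plugin, and it is not Jonas's code: the rule table, the function names and the exact command lines (in particular the --text option for unrtf) are assumptions that should be checked against the tools installed locally.

```python
import fnmatch
import subprocess

# Hypothetical rule table: (MIME type pattern, file name pattern, command).
# The command lines are assumptions; check each tool's man page before use.
RULES = [
    ("application/msword", "*.doc", ["antiword"]),
    ("application/rtf", "*.rtf", ["unrtf", "--text"]),
    ("text/rtf", "*.rtf", ["unrtf", "--text"]),
]


def pick_extractor(mime_type, filename):
    """Return the command of the first rule matching the MIME type or the file name."""
    mime_type = mime_type.lower()
    filename = filename.lower()
    for mime_pattern, name_pattern, command in RULES:
        if (fnmatch.fnmatchcase(mime_type, mime_pattern)
                or fnmatch.fnmatchcase(filename, name_pattern)):
            return list(command)
    return None


def extract_text(path, mime_type):
    """Run the selected external tool on path and return what it prints, or None."""
    command = pick_extractor(mime_type, path)
    if command is None:
        return None
    result = subprocess.run(command + [path], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Supporting another format in this sketch would only mean adding a row to RULES, which mirrors the "configure a new external tool" idea in the quoted message.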
