RE: Plugin extracting text from docs (was: new spam using large images)
We can use antiword to render text from MSWord files, and unrtf to render text from RTF files. What is the best tool to render text from PDF files? (We are running Solaris 9)

L

-Original Message-
From: Jonas Eckerman [mailto:jonas_li...@frukt.org]
Sent: Wednesday, June 24, 2009 1:34 PM
To: users@spamassassin.apache.org
Subject: Plugin extracting text from docs (was: new spam using large images)

Jason Haar wrote:
> Speaking of image/rtf/word attachment spam; is there any work going on to standardize this so that the textual output of such attachments could be fed back into SA?

Just as a note: I'm currently working on a modular plugin for extracting text and adding it to SA message parts.

The plugin can use either external tools or its own simple plugin modules. How to extract text from parts is configurable, and based on mime types and file names, so new formats can be added by simply configuring for new external tools or creating a new plugin module.

My *far* from finished module currently manages to extract text from Word documents (using antiword), OpenXML text documents (using a simple plugin) and RTF (using unrtf).

I haven't tested where and how the extracted text is available to SpamAssassin yet (as noted, it's *far* from finished), but I am using the set_rendered method as in the example, so it should work. ;-)

Regards
/Jonas
--
Jonas Eckerman
Fruktträdet Förbundet Sveriges Dövblinda
http://www.fsdb.org/ http://www.frukt.org/ http://whatever.frukt.org/
RE: Plugin extracting text from docs (was: new spam using large images)
> We can use antiword to render text from MSWord files, and unrtf to render text from RTF files. What is the best tool to render text from PDF files? (We are running Solaris 9)

From what I know, antiword is the best tradeoff between speed and conversion quality. The best converter I know of, even for batch use, is actually OpenOffice with its UNO interface, but it isn't that easy to handle from Perl, since it uses some kind of Java JNDI to exchange Word files and converted text with any implementation of a conversion controller. It also tends to consume a lot of memory: in current versions the process keeps growing with each document you convert, even after you close the documents. Antiword seems more resource-conscious in this respect.

Giampaolo
Re: Plugin extracting text from docs (was: new spam using large images)
Jason Haar wrote:
> Speaking of image/rtf/word attachment spam; is there any work going on to standardize this so that the textual output of such attachments could be fed back into SA?

On 24.06.09 19:33, Jonas Eckerman wrote:
> Just as a note: I'm currently working on a modular plugin for extracting text and adding it to SA message parts.

If possible, extract images too, so that fuzzyocr and similar plugins can look at them as well. IIRC, spammers have even put PDFs inside .doc files to make things harder, but if you manage the above, it shouldn't be hard to extract those PDFs too :) (and then to extract text and images from the PDFs as well).

> The plugin can use either external tools or its own simple plugin modules. How to extract text from parts is configurable, and based on mime types and file names, so new formats can be added by simply configuring for new external tools or creating a new plugin module.
>
> My *far* from finished module currently manages to extract text from Word documents (using antiword), OpenXML text documents (using a simple plugin) and RTF (using unrtf).
>
> I haven't tested where and how the extracted text is available to SpamAssassin yet (as noted, it's *far* from finished), but I am using the set_rendered method as in the example, so it should work. ;-)

Great!

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
If Barbie is so popular, why do you have to buy her friends?
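
For readers following the thread, the configurable lookup Jonas describes (choose an extractor from the MIME type and file name, then run an external tool) might look roughly like the sketch below. This is not the plugin's code: the real plugin would be written in Perl and hand the text to SpamAssassin via set_rendered, which is not shown here. The rule table, the `{path}` placeholder and the function names `select_command` and `extract_text` are all made up for illustration. The sketch assumes antiword and unrtf are on the PATH and that `unrtf --text` is the wanted output mode. A rule matches on either the MIME type or the file name, on the assumption that attachments are often sent with a generic type such as application/octet-stream.

```python
import fnmatch
import subprocess

# Hypothetical rule table: (MIME type glob, file name glob, command template).
# "{path}" stands for the attachment after it has been saved to disk.
RULES = [
    ("application/msword", "*.doc", ["antiword", "{path}"]),
    ("application/rtf", "*.rtf", ["unrtf", "--text", "{path}"]),
    ("text/rtf", "*.rtf", ["unrtf", "--text", "{path}"]),
]


def select_command(mime_type, filename, rules=RULES):
    """Return the command template of the first rule that matches.

    A rule matches when either the MIME type or the file name fits its
    glob. Returns None when no rule applies.
    """
    for mime_glob, name_glob, template in rules:
        if (fnmatch.fnmatch(mime_type.lower(), mime_glob)
                or fnmatch.fnmatch(filename.lower(), name_glob)):
            return template
    return None


def extract_text(path, mime_type, filename, rules=RULES, timeout=30):
    """Run the matching external tool and return its output as text.

    Returns None when no rule matches, the tool is missing, the tool
    exits with an error, or the timeout is reached.
    """
    template = select_command(mime_type, filename, rules)
    if template is None:
        return None
    command = [arg.replace("{path}", path) for arg in template]
    try:
        result = subprocess.run(command, capture_output=True,
                                timeout=timeout, check=True)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Under these assumptions, supporting a new format only means adding a row to the table, which is the property Jonas describes. A PDF row would go in the same place once a suitable converter is chosen.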