RE: Plugin extracting text from docs (was: new spam using large images)

2009-07-01 Thread Rosenbaum, Larry M.
We can use antiword to render text from MSWord files, and unrtf to render text 
from RTF files.  What is the best tool to render text from PDF files?

(We are running Solaris 9)

L

 -Original Message-
 From: Jonas Eckerman [mailto:jonas_li...@frukt.org]
 Sent: Wednesday, June 24, 2009 1:34 PM
 To: users@spamassassin.apache.org
 Subject: Plugin extracting text from docs (was: new spam using large
 images)
 
 Jason Haar wrote:
 
  Speaking of image/rtf/word attachment spam; is there any work going on
  to standardize this so that the textual output of such attachments could
  be fed back into SA?
 
 Just as a note:
 
 I'm currently working on a modular plugin for extracting text and adding
 it to SA message parts.
 
 The plugin can use either external tools or its own simple plugin
 modules. How to extract text from parts is configurable, and based on
 MIME types and file names, so new formats can be added by simply
 configuring new external tools or creating a new plugin module.
 
 My *far* from finished module currently manages to extract text from
 Word documents (using antiword), OpenXML text documents (using a simple
 plugin) and RTF (using unrtf).
 
 I haven't tested where and how the extracted text is available to
 SpamAssassin yet (as noted, it's *far* from finished), but I am using
 the set_rendered method as in the example, so it should work. ;-)
 
 Regards
 /Jonas
 --
 Jonas Eckerman
 Fruktträdet  Förbundet Sveriges Dövblinda
 http://www.fsdb.org/
 http://www.frukt.org/
 http://whatever.frukt.org/


RE: Plugin extracting text from docs (was: new spam using large images)

2009-07-01 Thread Giampaolo Tomassoni
 We can use antiword to render text from MSWord files, and unrtf to
 render text from RTF files.  What is the best tool to render text from
 PDF files?
 
 (We are running Solaris 9)

As far as I know, antiword is the best tradeoff between speed and conversion quality.

The best converter I know of, even for batch use, is actually OpenOffice with 
its UNO interface, but it isn't that easy to drive from Perl, since it uses 
some kind of Java JNDI to exchange Word files and converted text with 
whatever implements the conversion controller. It also tends to consume a lot 
of memory: current versions keep growing their memory footprint with each 
document you convert (even when you close them...).

Antiword seems more resource-conscious in this respect...

Giampaolo



 



Re: Plugin extracting text from docs (was: new spam using large images)

2009-06-25 Thread Matus UHLAR - fantomas
 Jason Haar wrote:

 Speaking of image/rtf/word attachment spam; is there any work going on
 to standardize this so that the textual output of such attachments could
 be fed back into SA?

On 24.06.09 19:33, Jonas Eckerman wrote:
 Just as a note:

 I'm currently working on a modular plugin for extracting text and adding it
 to SA message parts.

If possible, extract images too, so that FuzzyOcr and similar plugins would
be able to look at them as well.

IIRC spammers even embedded PDFs in .doc files to make things harder, but
if you manage the above, it shouldn't be hard to extract the PDFs too :)

(and then to extract text/images from those PDFs as well)
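
For the PDF step, one option is the pdftotext and pdfimages command-line tools from Xpdf/Poppler. The following is only a rough Python sketch, assuming both tools are installed and on the PATH. The helper names (pdf_text_command, pdf_images_command, extract_pdf_text) are made up for illustration and are not part of any existing plugin.

```python
import subprocess


def pdf_text_command(pdf_path):
    # "-" as the output file tells pdftotext to write the text to stdout.
    return ["pdftotext", pdf_path, "-"]


def pdf_images_command(pdf_path, out_prefix):
    # pdfimages writes each embedded image to a file named <out_prefix>-NNN.<ext>,
    # which an OCR plugin could then pick up.
    return ["pdfimages", pdf_path, out_prefix]


def extract_pdf_text(pdf_path, timeout=30):
    """Run pdftotext and return its output, or "" if the tool is missing or fails."""
    try:
        result = subprocess.run(
            pdf_text_command(pdf_path),
            capture_output=True,
            timeout=timeout,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        # Missing binary, non-zero exit status, or timeout: treat as "no text".
        return ""
    return result.stdout.decode("utf-8", errors="replace")
```

Returning an empty string on any failure keeps a broken or hostile attachment from aborting the scan; whether that is the right policy for a real plugin is a separate decision.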

 The plugin can use either external tools or its own simple plugin
 modules. How to extract text from parts is configurable, and based on
 MIME types and file names, so new formats can be added by simply
 configuring new external tools or creating a new plugin module.
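
The configuration idea described above could look roughly like the following Python sketch. This is not the plugin's actual code (which is not shown in this thread); the table names, the "{file}" placeholder and both helper functions are assumptions made for illustration. Only the antiword and unrtf invocations are the tools' real command lines.

```python
import os

# Hypothetical handler tables. Each handler is an argv list for an external
# tool; "{file}" stands for the path of the saved attachment.
HANDLERS_BY_MIME = {
    "application/msword": ["antiword", "{file}"],
    "application/rtf": ["unrtf", "--text", "{file}"],
    "text/rtf": ["unrtf", "--text", "{file}"],
}

HANDLERS_BY_EXT = {
    ".doc": ["antiword", "{file}"],
    ".rtf": ["unrtf", "--text", "{file}"],
}


def find_handler(mime_type, filename):
    """Look up a handler by MIME type first, then fall back to the file extension."""
    handler = HANDLERS_BY_MIME.get((mime_type or "").lower())
    if handler is None:
        ext = os.path.splitext(filename or "")[1].lower()
        handler = HANDLERS_BY_EXT.get(ext)
    return handler


def build_command(handler, path):
    """Substitute the attachment path for the "{file}" placeholder."""
    return [path if arg == "{file}" else arg for arg in handler]
```

With tables like these, supporting a new format is a matter of adding one entry. The file-name fallback matters because attachments are often sent as application/octet-stream.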

 My *far* from finished module currently manages to extract text from  
 Word documents (using antiword), OpenXML text documents (using a simple  
 plugin) and RTF (using unrtf).
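
For the "simple plugin" case, a rough Python sketch of text extraction, assuming "OpenXML text documents" means Office Open XML .docx files: these are zip archives whose main text lives in word/document.xml. The function name docx_text is made up for illustration, and this is not the plugin's actual implementation.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(data):
    """Return the plain text of a .docx given its raw bytes, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph's text is split across one or more <w:t> run elements.
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Working on bytes rather than a file path avoids a temporary file when the attachment is already decoded in memory.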

 I haven't tested where and how the extracted text is available to
 SpamAssassin yet (as noted, it's *far* from finished), but I am using
 the set_rendered method as in the example, so it should work. ;-)

great!
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
If Barbie is so popular, why do you have to buy her friends?