Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv26750/spambayes

Modified Files:
        ImageStripper.py Options.py OptionsClass.py 
Log Message:
Add scale and charset options (ocrad_scale and ocrad_charset, respectively)
to pass to the ocrad command.  Antonio Diaz Diaz, the author of Ocrad,
suggested scaling up the images.  Ocrad does indeed seem to perform better
with the scaled images.  Scaling by a factor of two seems to do
significantly better than not scaling in my 5x5 N-fold test setup.  Scaling
by a factor of three might even be better, improving false negative
percentages in four of the five sets, but it made the false positive score
worse in one of the five sets, so I left the default scale at 2.

I added the charset flag as well and defaulted it to ascii.  So far the
spammers seem to be "GIFting" us with plain English, so having Ocrad search
for accented characters seems like it would just distract it.  This has yet
to be tested, though.
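
For illustration, here is a minimal sketch of how the two options end up on
the ocrad command line.  It is not the code in ImageStripper.py (which uses
os.popen with a shell string, as the diff below shows); the helper names
build_ocrad_args and run_ocrad are hypothetical, and the sketch uses an
argument list with subprocess instead of a shell string.  Only the -s, -c
and -x flags are taken from this checkin.

```python
import subprocess


def build_ocrad_args(orf_path, scale=2, charset="ascii"):
    """Build an ocrad argument list (hypothetical helper, not SpamBayes code).

    A scale of 0 or None falls back to 1, mirroring the
    `options[...] or 1` fallback in the checkin.
    """
    scale = scale or 1
    return ["ocrad", "-s", str(scale), "-c", charset, "-x", orf_path]


def run_ocrad(pnm_path, orf_path, scale=2, charset="ascii"):
    """Feed a PNM file to ocrad on stdin and return the lowercased text.

    Assumes ocrad is installed and on PATH; stderr is discarded, as in
    the original shell command.
    """
    with open(pnm_path, "rb") as pnm:
        result = subprocess.run(
            build_ocrad_args(orf_path, scale, charset),
            stdin=pnm,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    # latin-1 decodes any byte sequence, so this never raises.
    return result.stdout.decode("latin-1").lower()
```

Passing the arguments as a list avoids shell quoting problems with unusual
file names, at the cost of differing from the os.popen call actually
checked in.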


Index: ImageStripper.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/ImageStripper.py,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** ImageStripper.py    13 Aug 2006 16:27:49 -0000      1.3
--- ImageStripper.py    14 Aug 2006 02:58:11 -0000      1.4
***************
*** 232,235 ****
--- 232,237 ----
          textbits = []
          tokens = Set()
+         scale = options["Tokenizer", "ocrad_scale"] or 1
+         charset = options["Tokenizer", "ocrad_charset"]
          for pnmfile in pnmfiles:
              fhash = md5.new(open(pnmfile).read()).hexdigest()
***************
*** 239,243 ****
              else:
                  self.misses += 1
!                 ocr = os.popen("ocrad -x %s < %s 2>/dev/null" % (orf, pnmfile))
                  ctext = ocr.read().lower()
                  ocr.close()
--- 241,246 ----
              else:
                  self.misses += 1
!                 ocr = os.popen("ocrad -s %s -c %s -x %s < %s 2>/dev/null" %
!                                (scale, charset, orf, pnmfile))
                  ctext = ocr.read().lower()
                  ocr.close()

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.137
retrieving revision 1.138
diff -C2 -d -r1.137 -r1.138
*** Options.py  10 Aug 2006 04:07:59 -0000      1.137
--- Options.py  14 Aug 2006 02:58:11 -0000      1.138
***************
*** 139,142 ****
--- 139,154 ----
       PATH, RESTORE),
  
+     ("ocrad_scale", _("Scale factor to use with ocrad."), 2,
+      _("""Specifies the scale factor to apply when running ocrad.  While
+      you can specify a negative scale it probably won't help.  Scaling up
+      by a factor of 2 or 3 seems to work well for the sort of spam images
+      encountered by SpamBayes."""),
+      INTEGER, RESTORE),
+ 
+     ("ocrad_charset", _("Charset to apply with ocrad."), "ascii",
+      _("""Specifies the charset to use when running ocrad.  Valid values
+      are 'ascii', 'iso-8859-9' and 'iso-8859-15'."""),
+      OCRAD_CHARSET, RESTORE),
+ 
      ("max_image_size", _("Max image size to try OCR-ing"), 100000,
       _("""When crack_images is enabled, this specifies the largest

Index: OptionsClass.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/OptionsClass.py,v
retrieving revision 1.32
retrieving revision 1.33
diff -C2 -d -r1.32 -r1.33
*** OptionsClass.py     22 Jun 2006 10:36:58 -0000      1.32
--- OptionsClass.py     14 Aug 2006 02:58:11 -0000      1.33
***************
*** 119,122 ****
--- 119,123 ----
             'IMAP_FOLDER', 'IMAP_ASTRING',
             'RESTORE', 'DO_NOT_RESTORE', 'IP_LIST',
+            'OCRAD_CHARSET',
            ]
  
***************
*** 871,872 ****
--- 872,875 ----
  RESTORE = True
  DO_NOT_RESTORE = False
+ 
+ OCRAD_CHARSET = r"ascii|iso-8859-9|iso-8859-15"
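
As a rough sketch of what the OCRAD_CHARSET pattern accepts: the diff does
not show how OptionsClass.py applies the pattern to a value, so the use of
re.fullmatch and the helper name is_valid_charset here are assumptions made
only for illustration.

```python
import re

# Pattern copied from the checkin.
OCRAD_CHARSET = r"ascii|iso-8859-9|iso-8859-15"


def is_valid_charset(value):
    """Return True if value is exactly one of the listed charsets.

    Hypothetical helper: fullmatch is used so that a value such as
    'iso-8859-1' is rejected rather than accepted as a prefix or
    substring of a longer alternative.
    """
    return re.fullmatch(OCRAD_CHARSET, value) is not None


for candidate in ("ascii", "iso-8859-15", "iso-8859-1", "utf-8"):
    print(candidate, is_valid_charset(candidate))
```

With a prefix-style match (re.match) instead, a value like 'iso-8859-9x'
would slip through, so the matching mode matters for a pattern written as a
bare alternation.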

_______________________________________________
Spambayes-checkins mailing list
Spambayes-checkins@python.org
http://mail.python.org/mailman/listinfo/spambayes-checkins
