Re: pdf tools clarification? - PDFText

James MacLean Mon, 16 Jul 2007 14:53:12 -0700

JT DeLys wrote, on 16/07/07 06:36 PM:

Hi,


    With PDFText2, the found text is added (rendered) to the main
    tests that SpamAssassin does.


Do you mean those tests defined in 80_additional.cf, or others?

It means any test you run on the body of an e-mail will also be run against this text. For example, in your local.cf you might have:


body STOCK_TEST /stock/i
describe STOCK_TEST Found the word stock
score STOCK_TEST 4.5

When PDFText2 is loaded, its rendered text will be tested for the word stock just like everything else that SpamAssassin offers for your tests to match against. You might consider it to be the more SpamAssassin-natural way of matching against PDF text :).
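
To make the idea above concrete, here is a rough Python sketch. It is not SpamAssassin's or PDFText2's actual code; the function names and sample strings are hypothetical and only illustrate how one body rule ends up seeing the PDF text once it has been added to the rendered body.

```python
import re

# Mirrors the rule above: body STOCK_TEST /stock/i
STOCK_TEST = re.compile(r"stock", re.IGNORECASE)


def rendered_body(message_text, pdf_texts):
    """Join the normal body text with text extracted from PDF attachments.

    Hypothetical stand-in for the rendering step; the real plugin differs.
    """
    return "\n".join([message_text, *pdf_texts])


def body_rule_hits(message_text, pdf_texts):
    """Return True if the body rule matches anywhere in the rendered text."""
    return bool(STOCK_TEST.search(rendered_body(message_text, pdf_texts)))


plain = "Please see the attached report."
pdf = ["Hot STOCK tip: buy now"]

print(body_rule_hits(plain, []))   # plain body only, no PDF text added
print(body_rule_hits(plain, pdf))  # same rule, with the PDF text included
```

The point is that the rule itself does not change; only the text it is tested against grows.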


    PDFText2 can also use gocr to do OCR on any PDF images. I'm not
    sold on that as the first one I tested it on gave back :


Is that a different capability/functionality than what FuzzyOCR is undertaking?

Well, I am going to say similar, yet different :). PDFText2 currently runs OCR on the images and adds the result to the rendered text. The OCRed text may not be very accurate, so ordinary rules will not match against it that well.

FuzzyOCR, if I understand what I have seen so far (and the author will be much better placed than I to respond), takes the OCR output from any one of the available OCR engines and uses String::Approx (and maybe other tools) to match it against a word list you supply specifically for FuzzyOCR. That gives a much better chance of getting a hit on images.
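
The difference between the two approaches can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard library's difflib similarity ratio as a stand-in for String::Approx, and the word list, threshold, and sample OCR string are all made up. It is not how FuzzyOCR is actually implemented.

```python
import difflib
import re

# Hypothetical word list of the kind you would supply for fuzzy matching.
WORD_LIST = ["stock", "viagra", "investment"]


def fuzzy_hits(ocr_text, words=WORD_LIST, threshold=0.75):
    """Return the listed words that approximately match a token in ocr_text.

    difflib's similarity ratio stands in for String::Approx here; the
    threshold value is an arbitrary assumption.
    """
    tokens = re.findall(r"[a-z0-9]+", ocr_text.lower())
    hits = set()
    for word in words:
        if difflib.get_close_matches(word, tokens, n=1, cutoff=threshold):
            hits.add(word)
    return hits


# Made-up OCR output with typical misreads (0 for o, 1 for i).
ocr = "Buy this st0ck now, great 1nvestment"

# An exact rule such as /stock/i finds nothing in the noisy text.
print(re.search(r"stock", ocr, re.IGNORECASE))

# Approximate matching against the word list still gets hits.
print(sorted(fuzzy_hits(ocr)))
```

An exact body rule misses the misread word, while approximate matching against a dedicated word list tolerates the OCR noise.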


--
Thanks,

JTDeLys

Quite welcome,
JES
