Re: How SA reactes to a bunch of garbage characters

2016-06-27 Thread Olivier
Hi,

As promised, here is one week's log of FuzzyOcr:

http://pastebin.com/XwwdXkTV

The results are not too good.

Olivier
-- 


Re: How SA reactes to a bunch of garbage characters

2016-06-15 Thread Olivier
Matus, 

>>To a part that would do regexp rules, but not Bayes? I don't know if it
>>is possible.
>
> someone who knows SA internals will have to answer this one, but I doubt
> it's useful, see below.

I will take a look at Bayes OCR; it injects the text OCR'ed from an
image into the body of the message. I'll see how it looks.

>>> the PDF is technically something different: PDF (often) contains plain text,
>>> that does not have to be OCRed and thus it will not be misinterpreted.
>>
>>But doesn't it disturb the Bayes process if we inject the mail body plus
>>the part extracted from the PDF? Wouldn't it be better to submit only the
>>original message? I have no answer to that.
>
> that is just what I would like to know: If OCR produces results good enough
> for BAYES and other rules.

I will make some results available.

Best regards,

Olivier


Re: How SA reactes to a bunch of garbage characters

2016-06-15 Thread Olivier
RW,

> I stopped using OCR a long time ago because I didn't find that image
> spam was particularly hard to catch. These days I find that spams with
> images are mostly either pictures of Russian girls or spoofed corporate
> logos. 

Then you need something able to detect the amount of flesh in a picture
:) A student here once tried to work on something like that; I am not
sure he ever managed to produce something usable.

I am also using ImageCerberus, which tries to classify images based on
metadata such as size, position of text, etc. But it does not do any OCR.

Olivier


Re: How SA reactes to a bunch of garbage characters

2016-06-14 Thread RW
On Tue, 14 Jun 2016 08:56:50 -0400
Joe Quinn wrote:

> On 6/14/2016 8:33 AM, Matus UHLAR - fantomas wrote:
> > that is just what I would like to know: If OCR produces results
> > good enough
> > for BAYES and other rules.
> >
> > I don't think there's a difference between bayes and other rules.
> > It's also possible that BAYES would have better results with misread
> > characters than other rules.  
> I've dealt with OCR in the past, and have always had to go back 
> afterwards and manually proofread the results. I expect the impact on 
> Bayes would be a massively increased dictionary of rare words that 
> result from poor "keming" in the image.

Personally I find that a typical spam adds ~30 new tokens, most of
which will be ephemeral. If image spam is a small minority of spam it's
not likely to make a huge difference. It's also not the worst offender.
A few weeks ago I was getting spam that placed an Asian character
between each letter, and that was averaging ~600 new tokens per spam.


I stopped using OCR a long time ago because I didn't find that image
spam was particularly hard to catch. These days I find that spams with
images are mostly either pictures of Russian girls or spoofed corporate
logos. 

Is OCR really all that useful?

> Some PDFs are written in
> extractable text instead of images, but those tend to use 
> fractional-width spaces for kerning so it's not always easy to figure 
> out what's a real word there either.
> 
> That said, Google seems to use OCR on images in their filtering
> (quoth Wikipedia), so maybe it works when you have a sufficiently
> enormous data set that the OCR glitches are no longer rare and a
> decent inference can be made from them.
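
The "new tokens per message" figure discussed above can be approximated with a few lines of Python. This is only a sketch: the whitespace tokenizer below is a simplified stand-in and is not SpamAssassin's actual Bayes tokenizer, and the sample strings are made up for illustration.

```python
def tokenize(text):
    """Return the set of lowercased, whitespace-delimited tokens.

    Simplified stand-in tokenizer (an assumption for this sketch);
    surrounding punctuation is stripped from each token.
    """
    tokens = set()
    for raw in text.split():
        tok = raw.strip(".,;:!?\"'()[]<>").lower()
        if tok:
            tokens.add(tok)
    return tokens


def count_new_tokens(text, known_tokens):
    """Count tokens in `text` that are not yet in `known_tokens`."""
    return len(tokenize(text) - known_tokens)


# Hypothetical token database built from previously seen mail.
known = tokenize("buy cheap watches online today")

clean = "Buy cheap watches today!"
misread = "Buy chcap watchcs t0day!"  # simulated OCR misreads

print(count_new_tokens(clean, known))    # 0
print(count_new_tokens(misread, known))  # 3
```

Each misread word becomes a distinct token that is unlikely to recur, which is the dictionary growth described above.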


Re: How SA reactes to a bunch of garbage characters

2016-06-14 Thread Joe Quinn

On 6/14/2016 8:33 AM, Matus UHLAR - fantomas wrote:
> that is just what I would like to know: If OCR produces results good
> enough for BAYES and other rules.
>
> I don't think there's a difference between bayes and other rules.
> It's also possible that BAYES would have better results with misread
> characters than other rules.

I've dealt with OCR in the past, and have always had to go back 
afterwards and manually proofread the results. I expect the impact on 
Bayes would be a massively increased dictionary of rare words that 
result from poor "keming" in the image. Some PDFs are written in 
extractable text instead of images, but those tend to use 
fractional-width spaces for kerning so it's not always easy to figure 
out what's a real word there either.


That said, Google seems to use OCR on images in their filtering (quoth 
Wikipedia), so maybe it works when you have a sufficiently enormous data 
set that the OCR glitches are no longer rare and a decent inference can 
be made from them.


Re: How SA reactes to a bunch of garbage characters

2016-06-14 Thread Matus UHLAR - fantomas

>>> Sure, the OCR results are not very precise. But could we imagine that
>>> they are pushed into a part of the message that will not go through Bayes?
>>
>> where do you want to push the OCR'ed text, if not back to SA to check other
>> rules like bayes?

On 14.06.16 13:50, Olivier wrote:
> To a part that would do regexp rules, but not Bayes? I don't know if it
> is possible.

someone who knows SA internals will have to answer this one, but I doubt
it's useful, see below.

>> the PDF is technically something different: PDF (often) contains plain text,
>> that does not have to be OCRed and thus it will not be misinterpreted.
>
> But doesn't it disturb the Bayes process if we inject the mail body plus
> the part extracted from the PDF? Wouldn't it be better to submit only the
> original message? I have no answer to that.

that is just what I would like to know: If OCR produces results good enough
for BAYES and other rules.

I don't think there's a difference between bayes and other rules.
It's also possible that BAYES would have better results with misread
characters than other rules.

>> I would skip gocr and ocrad, since tesseract behaves great now...
>> (the debian fuzzyocr package requires all of them, dunno why)
>
> I'll take your advice; I just noticed that tesseract was not enabled by
> default! I use FreeBSD; could it be required at install only, but
> disabled later in your configuration of FuzzyOcr?

I believe so. if you have spam samples, try running all OCR engines on them
to decide which are useful...

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Honk if you love peace and quiet. 
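
A minimal sketch of the comparison suggested above: run every installed OCR engine on the same sample images and print the outputs side by side. The command lines are assumptions based on the usual invocations (`tesseract IMAGE stdout`, `gocr -i IMAGE`, `ocrad IMAGE`); check them against the installed versions. The images are assumed to be already converted to PNM, which gocr and ocrad expect.

```python
import shutil
import subprocess
import sys

# Assumed command lines for each engine; adjust to the installed versions.
OCR_COMMANDS = {
    "tesseract": lambda image: ["tesseract", image, "stdout"],
    "gocr": lambda image: ["gocr", "-i", image],
    "ocrad": lambda image: ["ocrad", image],
}


def build_command(engine, image):
    """Return the argument list used to run `engine` on `image`."""
    return OCR_COMMANDS[engine](image)


def run_engines(image, timeout=30):
    """Run each installed engine on `image`; return {engine: extracted text}."""
    results = {}
    for engine in OCR_COMMANDS:
        if shutil.which(engine) is None:
            continue  # engine not installed, skip it
        try:
            proc = subprocess.run(
                build_command(engine, image),
                capture_output=True,
                text=True,
                errors="replace",
                timeout=timeout,
                check=False,
            )
        except subprocess.TimeoutExpired:
            continue  # treat a hung engine as "no result"
        results[engine] = proc.stdout
    return results


if __name__ == "__main__":
    # Usage: python compare_ocr.py sample1.pnm sample2.pnm ...
    for path in sys.argv[1:]:
        for engine, text in run_engines(path).items():
            print(f"== {path} [{engine}] ==")
            print(text.strip())
```

Reading the outputs next to each other for a handful of spam images should be enough to decide which engines are worth keeping enabled.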


Re: How SA reactes to a bunch of garbage characters

2016-06-13 Thread Olivier
Matus,

>>Sure, the OCR results are not very precise. But could we imagine that
>>they are pushed into a part of the message that will not go through Bayes?
> where do you want to push the OCR'ed text, if not back to SA to check other
> rules like bayes?

To a part that would do regexp rules, but not Bayes? I don't know if it
is possible.

> the PDF is technically something different: PDF (often) contains plain text,
> that does not have to be OCRed and thus it will not be misinterpreted.

But doesn't it disturb the Bayes process if we inject the mail body plus
the part extracted from the PDF? Wouldn't it be better to submit only the
original message? I have no answer to that.

> I would skip gocr and ocrad, since tesseract behaves great now...
> (the debian fuzzyocr package requires all of them, dunno why)

I'll take your advice; I just noticed that tesseract was not enabled by
default! I use FreeBSD; could it be required at install only, but
disabled later in your configuration of FuzzyOcr?

Best regards,

Olivier

-- 


Re: How SA reactes to a bunch of garbage characters

2016-06-13 Thread Matus UHLAR - fantomas

>> On 09.06.16 10:43, Olivier wrote:
>>> For years I have had the FuzzyOcr plugin running, but it helps little,
>>> because it has its own list of words to keep updated.
>>>
>>> I am wondering whether, instead of using its own list of words, the result
>>> could be injected back into the body of the main message.
>>
>> I raised this issue some years ago. The result was that pushing OCR-ed data
>> back to SA for evaluating BAYES and other rules could cause trouble,
>> because freely available OCR software was not very precise.

On 13.06.16 10:43, Olivier wrote:
> Sure, the OCR results are not very precise. But could we imagine that
> they are pushed into a part of the message that will not go through Bayes?

where do you want to push the OCR'ed text, if not back to SA to check other
rules like bayes?

> If we inject text extracted from a PDF, for example, that also modifies the
> message and influences the Bayes tests. So maybe even PDF extraction
> should not be submitted to Bayes, and SA should have a mechanism for that
> purpose (other than launching a completely separate SA process on that
> extracted part).

the PDF is technically something different: PDF (often) contains plain text,
that does not have to be OCRed and thus it will not be misinterpreted.

>>> Most of the time, what will be injected back is plain garbage:
>>> w_T___l_e?_
>>>
>>> But other times the result is interesting, like a proper English sentence
>>> full of spam.
>>
>> what exactly do you use for OCR? 10 years ago I made a comparison between
>> gocr, ocrad and tesseract, where gocr gave the best results.
>
> I have gocr, ocrad and tesseract configured.

I would skip gocr and ocrad, since tesseract behaves great now...
(the debian fuzzyocr package requires all of them, dunno why)

>> Now, since Google sponsors tesseract development, the scanning looks much
>> much better, and I started thinking about trying that again.
>>
>>> So how will SA react if I reinject the garbage? Will it just ignore it?
>>
>> would be nice to see the results.
>> I'm mostly worried about FUZZY_* rules...
>
> I changed the config of FuzzyOcr so I lost the log of extracted data. I
> will post that in detail in a few days, once I have collected some
> samples.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I feel like I'm diagonally parked in a parallel universe. 


Re: How SA reactes to a bunch of garbage characters

2016-06-12 Thread Olivier
Matus,

Thank you for your reply.

> On 09.06.16 10:43, Olivier wrote:
>>For years I have had the FuzzyOcr plugin running, but it helps little,
>>because it has its own list of words to keep updated.
>>
>>I am wondering whether, instead of using its own list of words, the result
>>could be injected back into the body of the main message.
>
> I raised this issue some years ago. The result was that pushing OCR-ed data
> back to SA for evaluating BAYES and other rules could cause trouble,
> because freely available OCR software was not very precise.

Sure, the OCR results are not very precise. But could we imagine that
they are pushed into a part of the message that will not go through Bayes?

If we inject text extracted from a PDF, for example, that also modifies the
message and influences the Bayes tests. So maybe even PDF extraction
should not be submitted to Bayes, and SA should have a mechanism for that
purpose (other than launching a completely separate SA process on that
extracted part).

>>Most of the time, what will be injected back is plain garbage:
>>w_T___l_e?_
>>
>>But other times the result is interesting, like a proper English sentence
>>full of spam.
>
> what exactly do you use for OCR? 10 years ago I made a comparison between
> gocr, ocrad and tesseract, where gocr gave the best results.

I have gocr, ocrad and tesseract configured.

> Now, since Google sponsors tesseract development, the scanning looks much
> much better, and I started thinking about trying that again.
>
>>So how will SA react if I reinject the garbage? Will it just ignore it?
>
> would be nice to see the results.
> I'm mostly worried about FUZZY_* rules...

I changed the config of FuzzyOcr so I lost the log of extracted data. I
will post that in detail in a few days, once I have collected some
samples.

Best regards,

Olivier

-- 


Re: How SA reactes to a bunch of garbage characters

2016-06-10 Thread Matus UHLAR - fantomas

On 09.06.16 10:43, Olivier wrote:
> For years I have had the FuzzyOcr plugin running, but it helps little,
> because it has its own list of words to keep updated.
>
> I am wondering whether, instead of using its own list of words, the result
> could be injected back into the body of the main message.

I raised this issue some years ago. The result was that pushing OCR-ed data
back to SA for evaluating BAYES and other rules could cause trouble,
because freely available OCR software was not very precise.

> Most of the time, what will be injected back is plain garbage:
> w_T___l_e?_
>
> But other times the result is interesting, like a proper English sentence
> full of spam.

what exactly do you use for OCR? 10 years ago I made a comparison between
gocr, ocrad and tesseract, where gocr gave the best results.

Now, since Google sponsors tesseract development, the scanning looks much
much better, and I started thinking about trying that again.

> So how will SA react if I reinject the garbage? Will it just ignore it?

would be nice to see the results.
I'm mostly worried about FUZZY_* rules...

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Save the whales. Collect the whole set.


How SA reactes to a bunch of garbage characters

2016-06-08 Thread Olivier
Hi,

For years I have had the FuzzyOcr plugin running, but it helps little,
because it has its own list of words to keep updated.

I am wondering whether, instead of using its own list of words, the result
could be injected back into the body of the main message.

Most of the time, what will be injected back is plain garbage:
w_T___l_e?_

But other times the result is interesting, like a proper English sentence
full of spam.

So how will SA react if I reinject the garbage? Will it just ignore it?

Best regards,

Olivier
--
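
One way to keep output like `w_T___l_e?_` out of the message body would be to screen the OCR text before reinjecting it. The following is only a rough heuristic sketch, not part of FuzzyOcr or SpamAssassin: the function names and the 0.5 threshold are hypothetical, and the check only recognizes ASCII letters, so non-English text would wrongly be flagged as garbage.

```python
import re

# Runs of three or more ASCII letters are treated as "word-like".
WORD_RE = re.compile(r"[A-Za-z]{3,}")


def wordlike_ratio(text):
    """Fraction of non-space characters that sit in runs of 3+ letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_words = sum(len(match) for match in WORD_RE.findall(text))
    return in_words / len(chars)


def looks_like_garbage(text, threshold=0.5):
    """Heuristic: OCR output with few word-like characters is garbage.

    The threshold is an arbitrary starting point and would need tuning
    against real OCR logs.
    """
    return wordlike_ratio(text) < threshold


print(looks_like_garbage("w_T___l_e?_"))                     # True
print(looks_like_garbage("Buy cheap watches online today"))  # False
```

Only text that passes such a check would then be handed back for the body rules and Bayes, which limits the number of throwaway tokens that get learned.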