John D. Hardin wrote:

>On Mon, 20 Nov 2006, twofers wrote:
>
>  
>
>>I would like to know what local rule I could invoke to tag email whose
>>subject is not in English.
>>   
>>  header       NOT_IN_ENGLISH     Subject !~ /English/i
>>  describe     NOT_IN_ENGLISH     Subject Contains Non English Characters
>>  score         NOT_IN_ENGLISH     3.5
>>   
>>  What regexp could I use?
>>    
>>
>
>I haven't tested this, but it may work:
>
>header       NOT_IN_ENGLISH     Subject =~ /[\x80-\xFF]{3}/
>
>That should hit on a string of at least three consecutive characters with
>the high bit set.
>
>You may need to drop it down to {2} to get good detection.
>
>Don't score it very high.
>  
>

Of course, that could also catch messages with ISO Latin 1 (ISO 8859-1)
characters like Yen, Pound Sterling, Registered Trademark, etc. Plus, there
are words in English that, when properly written, do contain accents,
such as résumé, daïs, cliché, coöperation, etc.

Excluding words with pounds and yen in the Subject line might be
a good thing, however...
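
How often that actually happens depends on how the Subject is encoded,
since the pattern counts consecutive high-bit bytes, not characters. Here
is a rough sketch in Python (not SpamAssassin itself) that applies the same
pattern to a few made-up subjects. The helper name has_high_bit_run and the
sample subjects are my own inventions, and I am assuming the rule sees the
subject as raw bytes in the given charset, which may not match what your
SpamAssassin version does after decoding the header.

```python
import re

# Same character class and repeat count as the suggested rule,
# compiled as a bytes pattern so it matches raw header bytes.
HIGH_BIT_RUN = re.compile(rb"[\x80-\xFF]{3}")


def has_high_bit_run(subject: str, charset: str) -> bool:
    """Return True if the encoded subject has 3+ consecutive high-bit bytes.

    Hypothetical helper for illustration only; it is not part of SpamAssassin.
    """
    return HIGH_BIT_RUN.search(subject.encode(charset)) is not None


# Made-up sample subjects, each paired with an assumed charset.
samples = [
    ("Updated résumé attached", "latin-1"),
    ("Updated résumé attached", "utf-8"),
    ("Price list in £¥", "latin-1"),
    ("Price list in £¥", "utf-8"),
    ("Привет", "koi8-r"),
]

for subject, charset in samples:
    print(charset, has_high_bit_run(subject, charset), subject)
```

In this sketch an isolated accent never makes a run of three, in either
Latin 1 or UTF-8, while adjacent symbols such as a pound sign next to a yen
sign do trip the pattern once they are encoded as UTF-8 (two bytes each).
Dropping the count to {2} would start catching single accented letters in
UTF-8 as well.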

-Philip
