John D. Hardin wrote:

>On Mon, 20 Nov 2006, twofers wrote:
>
>>I would like to know what local rule I could invoke to tag email that the
>>subject is not in English.
>>
>> header NOT_IN_ENGLISH Subject !~ /English/i
>> describe NOT_IN_ENGLISH Subject Contains Non English Characters
>> score NOT_IN_ENGLISH 3.5
>>
>> What regexp could I use?
>
>I haven't tested this, but it may work:
>
>header NOT_IN_ENGLISH Subject =~ /[\x80-\xFF]{3}/
>
>That should hit on a string of at least three characters with the high
>bit set.
>
>You may need to drop it down to {2} to get good detection.
>
>Don't score it very high.
Of course, that would also catch messages with ISO Latin 1 (ISO 8859-1) characters like Yen, Pound Sterling, Registered Trademark, etc. Plus, there are words in English that, when properly written, do contain accents, such as résumé, daïs, cliché, coöperation, etc.

Excluding messages with pounds and yen in the Subject line might be a good thing, however...

-Philip
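
For anyone who wants to try the pattern quoted above before putting it in a local rule, here is a rough sketch in Python. It only checks raw bytes for a run of high-bit characters; it says nothing about how SpamAssassin itself decodes the Subject header before matching. The function name `has_high_bit_run` and the `min_run` parameter are made up for illustration.

```python
import re


def has_high_bit_run(data: bytes, min_run: int = 3) -> bool:
    """Return True if data contains min_run or more consecutive bytes >= 0x80.

    Byte-level analogue of the quoted pattern /[\\x80-\\xFF]{3}/ (hypothetical
    helper, not part of SpamAssassin).
    """
    pattern = re.compile(rb"[\x80-\xFF]{%d}" % min_run)
    return pattern.search(data) is not None


if __name__ == "__main__":
    samples = [
        ("Plain English subject", "ascii"),
        ("Your résumé", "latin-1"),  # isolated accented letters, one byte each
        ("Your résumé", "utf-8"),    # each accented letter becomes two bytes
        ("£¥®", "latin-1"),          # three Latin 1 symbols in a row
        ("Привет", "utf-8"),         # Cyrillic, every byte has the high bit set
    ]
    for text, encoding in samples:
        raw = text.encode(encoding)
        print(encoding, repr(text), has_high_bit_run(raw), has_high_bit_run(raw, 2))
```

With {3}, an isolated accented letter does not trigger the pattern in either encoding, while a run of three Latin 1 symbols or a Cyrillic subject does. Dropping to {2} makes a single accented letter in UTF-8 match, since it is encoded as two high-bit bytes, which is another reason to keep the score low.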