Accurately deprecating charsets

Philip Prindeville Fri, 17 Nov 2006 15:06:19 -0800

I'll ask again...  Can someone who handles a fair mix of
email content (i.e. not just western European languages)
do a triage (individually) of the rules below for ham versus
spam?


I'd suspect that very little genuine ham contains "IBM852"
or "Unicode" or "CP125[0-8]" these days.

Thanks,

-Philip



Robert Nicholson wrote:

> so what is the conclusion to this issue?
>
> why, when I set ok_locales to it, does it then allow any Charset with
> "Windows" in the name
> to bypass that setting?
>
> Why is is_charset_ok_for_locales written to give exceptions?
>
> sub is_charset_ok_for_locales {
>   my ($cs, @locales) = @_;
>
>   $cs = uc $cs; $cs =~ s/[^A-Z0-9]//g;
>   $cs =~ s/^3D//gs;             # broken by quoted-printable
>   $cs =~ s/:.*$//gs;            # trim off multiple charsets, just use 1st
>
>   study $cs;
>   #warn "JMD $cs";
>
>   # always OK (the net speaks mostly roman charsets)
>   return 1 if ($cs eq 'USASCII');
>   return 1 if ($cs =~ /^ISO8859/);
>   return 1 if ($cs =~ /^ISO10646/);
>   return 1 if ($cs =~ /^UTF/);
>   return 1 if ($cs =~ /^UCS/);
>   return 1 if ($cs =~ /^CP125/);
>   return 1 if ($cs =~ /^WINDOWS/);      # argh, Windows
>   return 1 if ($cs eq 'IBM852');
>   return 1 if ($cs =~ /^UNICODE11UTF[78]/);     # wtf? never heard of it
>   # added by sendmail when converting to 8bit
>   return 1 if ($cs eq 'XUNKNOWN');
>   # Magellan, sending as 'charset=iso 8859-15'. grr
>   return 1 if ($cs eq 'ISO');
>
>   foreach my $locale (@locales) {
>     if (!defined($locale) || $locale eq 'C') { $locale = 'en'; }
>     $locale =~ s/^([a-z][a-z]).*$/$1/;  # zh_TW... => zh
>
>     my $ok_for_loc = $charsets_for_locale{$locale};
>     next if (!defined $ok_for_loc);
>
>     if ($ok_for_loc =~ /(?:^| )\Q${cs}\E(?:$| )/) {
>       return 1;
>     }
>   }
>
>   return 0;
> }
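
To make the bypass Robert describes concrete, here is a minimal sketch that repeats the normalization and the "always OK" patterns from the quoted function, without the per-locale lookup. The names normalize_charset, bypasses_ok_locales and @always_ok are made up for this illustration and are not part of SpamAssassin; the sample charset names are just examples.

```perl
use strict;
use warnings;

# Hypothetical helper: the same uppercase/strip steps as the quoted function.
sub normalize_charset {
    my ($cs) = @_;
    $cs = uc $cs;
    $cs =~ s/[^A-Z0-9]//g;   # "windows-1251" becomes "WINDOWS1251"
    $cs =~ s/^3D//;          # leftover from quoted-printable "=3D"
    return $cs;
}

# Patterns copied from the "always OK" section of the quoted function.
my @always_ok = (
    qr/^USASCII\z/, qr/^ISO8859/, qr/^ISO10646/, qr/^UTF/, qr/^UCS/,
    qr/^CP125/, qr/^WINDOWS/, qr/^IBM852\z/, qr/^UNICODE11UTF[78]/,
    qr/^XUNKNOWN\z/, qr/^ISO\z/,
);

# Hypothetical helper: returns 1 if the charset is accepted before any
# locale is consulted, 0 if it would go on to the per-locale check.
sub bypasses_ok_locales {
    my ($charset) = @_;
    my $cs = normalize_charset($charset);
    return (grep { $cs =~ $_ } @always_ok) ? 1 : 0;
}

for my $charset ('windows-1251', 'koi8-r', 'Big5', 'iso-8859-1') {
    printf "%-14s %s\n", $charset,
        bypasses_ok_locales($charset) ? 'always OK' : 'checked against ok_locales';
}
```

Because these patterns are tested before the foreach loop over @locales, anything that normalizes to a name starting with WINDOWS or CP125 is accepted whatever ok_locales is set to.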
