I'll ask again... Can someone who handles a fair mix of email content (i.e. not just western European languages) do a triage (individually) of the rules below for ham versus spam?
I'd suspect that very little genuine ham contains "IBM852" or "Unicode" or "CP12[0-8]" these days.

Thanks,

-Philip

Robert Nicholson wrote:
> so what is the conclusion to this issue?
>
> why when I set ok_locales to it then does it allow any Charset with
> "Windows" in the name
> to bypass that setting?
>
> Why is it that is_charset_ok_for_locales written to give exceptions
>
> sub is_charset_ok_for_locales {
>   my ($cs, @locales) = @_;
>
>   $cs = uc $cs; $cs =~ s/[^A-Z0-9]//g;
>   $cs =~ s/^3D//gs;             # broken by quoted-printable
>   $cs =~ s/:.*$//gs;            # trim off multiple charsets, just use 1st
>
>   study $cs;
>   #warn "JMD $cs";
>
>   # always OK (the net speaks mostly roman charsets)
>   return 1 if ($cs eq 'USASCII');
>   return 1 if ($cs =~ /^ISO8859/);
>   return 1 if ($cs =~ /^ISO10646/);
>   return 1 if ($cs =~ /^UTF/);
>   return 1 if ($cs =~ /^UCS/);
>   return 1 if ($cs =~ /^CP125/);
>   return 1 if ($cs =~ /^WINDOWS/);      # argh, Windows
>   return 1 if ($cs eq 'IBM852');
>   return 1 if ($cs =~ /^UNICODE11UTF[78]/);  # wtf? never heard of it
>   return 1 if ($cs eq 'XUNKNOWN');      # added by sendmail when converting to 8bit
>   return 1 if ($cs eq 'ISO');           # Magellan, sending as 'charset=iso 8859-15'. grr
>
>   foreach my $locale (@locales) {
>     if (!defined($locale) || $locale eq 'C') { $locale = 'en'; }
>     $locale =~ s/^([a-z][a-z]).*$/$1/;  # zh_TW... => zh
>
>     my $ok_for_loc = $charsets_for_locale{$locale};
>     next if (!defined $ok_for_loc);
>
>     if ($ok_for_loc =~ /(?:^| )\Q${cs}\E(?:$| )/) {
>       return 1;
>     }
>   }
>
>   return 0;
> }
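
For anyone doing the triage, here is a rough sketch of a standalone script that reports which of the unconditional exceptions above a given charset name hits before ok_locales is ever consulted. The helper names (normalize_charset, matching_exception) and the rule labels are made up for illustration and are not part of SpamAssassin. The normalisation and patterns are copied from the quoted function, and the sample charset names are just examples.

```perl
use strict;
use warnings;

# Hypothetical helper: the same normalisation steps as the quoted function.
sub normalize_charset {
    my ($cs) = @_;
    $cs = uc $cs;
    $cs =~ s/[^A-Z0-9]//g;   # drop punctuation, so "windows-1251" => "WINDOWS1251"
    $cs =~ s/^3D//gs;        # leftover from quoted-printable "=3D"
    $cs =~ s/:.*$//gs;       # kept for fidelity with the quoted code
    return $cs;
}

# The "always OK" rules from the quoted code, each with a label for reporting.
my @always_ok = (
    [ 'USASCII'       => qr/^USASCII\z/ ],
    [ 'ISO8859'       => qr/^ISO8859/ ],
    [ 'ISO10646'      => qr/^ISO10646/ ],
    [ 'UTF'           => qr/^UTF/ ],
    [ 'UCS'           => qr/^UCS/ ],
    [ 'CP125'         => qr/^CP125/ ],
    [ 'WINDOWS'       => qr/^WINDOWS/ ],
    [ 'IBM852'        => qr/^IBM852\z/ ],
    [ 'UNICODE11UTF'  => qr/^UNICODE11UTF[78]/ ],
    [ 'XUNKNOWN'      => qr/^XUNKNOWN\z/ ],
    [ 'ISO'           => qr/^ISO\z/ ],
);

# Returns the label of the first exception that matches, or undef if the
# charset would fall through to the per-locale check.
sub matching_exception {
    my ($raw) = @_;
    my $cs = normalize_charset($raw);
    for my $rule (@always_ok) {
        my ($label, $re) = @$rule;
        return $label if $cs =~ $re;
    }
    return;
}

for my $raw ('windows-1251', 'koi8-r', 'IBM852', 'unicode-1-1-utf-7') {
    my $hit = matching_exception($raw);
    printf "%-20s %s\n", $raw,
        defined $hit ? "always OK via $hit" : "falls through to ok_locales check";
}
```

Running something like this over the charset names seen in a ham corpus and a spam corpus should show, rule by rule, which exceptions are still earning their keep. It also makes Robert's point concrete: anything whose normalised name starts with "WINDOWS" returns early, so ok_locales never gets a say for those.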