I would say that this issue in general (and this file in particular) is long overdue for revisiting.

I haven't seen UCS, CP125?, or IBM852 for a long time. Likewise for "UNICODE" or "XUNKNOWN". As for "ISO" (tout court) from Magellan... that's broken, and if it hasn't been fixed by now, then it's their problem, not ours. It's easier to whitelist the few users still clinging to broken mailers than to continue to compromise spamproofness.

As for Windows... I would change the test from:

  $cs =~ /^WINDOWS/

to:

  $cs eq 'WINDOWS1252'

instead (no hyphen, because by that point the function has already stripped everything except A-Z and 0-9 from $cs). There is no reason to use any of the other Windows character sets: they offer nothing that UTF-8 doesn't already have.

Being liberal in what you accept is good if interoperability is your goal. If security and integrity, however, are paramount, then being paranoid in what you accept might actually be more appropriate.

Is there anyone out there (preferably in Central/Eastern Europe) who handles a high volume of traffic and can tell us whether any of these encodings are still in legitimate use? Like "ISO10646" or "UCS" or ISO-8859-8 or CP125?, etc.

The alternative is to add checks per language for each of the Windows-125[0-8] types. Yes, you can encode English in Windows-1256... but a sane mailer would detect that a message fits entirely into 7 bits and use USASCII instead. If it doesn't, then it's broken and needs to be fixed.

I'm not against reinventing the wheel when a new design is offered that's better. But I'm not convinced that Windows-1252 is an improvement over Latin-1. For instance, the glyphs "oe" and "OE" aren't unique letters: they are a presentation form (i.e. a ligature) that renders (displays) differently from writing "o" and "e" separately... but it is in fact just the two letters "o" and "e" that are being represented, without kerning between them (similarly for "ij" in Dutch, etc). The bottom line is that you don't need specific characters for "oe", "ij", etc. You just need a rendering engine that understands when using a ligature is appropriate (same as with "ss" in German, or "ff", "fl", etc. in English).
Making these distinct characters was folly. But I digress.

Just out of curiosity, what are the charsets_for_locale{'en'} anyway? If it were up to me, I'd limit it to USASCII, ISO-8859-1, and UTF-8. Period.

Likewise, for Japanese, how many UAs use anything other than ISO2022JP? This is the blessed standard. Anything else is out of date and requires a fix.

-Philip

Robert Nicholson wrote:
> so what is the conclusion to this issue?
>
> why when I set ok_locales to it then does it allow any Charset with
> "Windows" in the name
> to bypass that setting?
>
> Why is it that is_charset_ok_for_locales written to give exceptions
>
> sub is_charset_ok_for_locales {
>   my ($cs, @locales) = @_;
>
>   $cs = uc $cs; $cs =~ s/[^A-Z0-9]//g;
>   $cs =~ s/^3D//gs;             # broken by quoted-printable
>   $cs =~ s/:.*$//gs;            # trim off multiple charsets, just use 1st
>
>   study $cs;
>   #warn "JMD $cs";
>
>   # always OK (the net speaks mostly roman charsets)
>   return 1 if ($cs eq 'USASCII');
>   return 1 if ($cs =~ /^ISO8859/);
>   return 1 if ($cs =~ /^ISO10646/);
>   return 1 if ($cs =~ /^UTF/);
>   return 1 if ($cs =~ /^UCS/);
>   return 1 if ($cs =~ /^CP125/);
>   return 1 if ($cs =~ /^WINDOWS/);            # argh, Windows
>   return 1 if ($cs eq 'IBM852');
>   return 1 if ($cs =~ /^UNICODE11UTF[78]/);   # wtf? never heard of it
>   return 1 if ($cs eq 'XUNKNOWN');            # added by sendmail when converting to 8bit
>   return 1 if ($cs eq 'ISO');                 # Magellan, sending as 'charset=iso 8859-15'. grr
>
>   foreach my $locale (@locales) {
>     if (!defined($locale) || $locale eq 'C') { $locale = 'en'; }
>     $locale =~ s/^([a-z][a-z]).*$/$1/;        # zh_TW... => zh
>
>     my $ok_for_loc = $charsets_for_locale{$locale};
>     next if (!defined $ok_for_loc);
>
>     if ($ok_for_loc =~ /(?:^| )\Q${cs}\E(?:$| )/) {
>       return 1;
>     }
>   }
>
>   return 0;
> }
>
> On Nov 13, 2006, at 8:30 PM, Giampaolo Tomassoni wrote:
>
> >>> # don't allow windows-1252 text attachments...
> >>>
> >>> mimeheader __CTYPE_MH_WIN1252 Content-Type =~ /charset=(\"windows-125[0-8]\"|windows-125[0-8])/i
> >>> meta WIN_CHARSET ((__CTYPE_MH_HTML || __CTYPE_MH_TEXT_PLAIN) && __CTYPE_MH_WIN1252)
> >>> describe WIN_CHARSET Content-Type is Windows-specific text
> >>> score WIN_CHARSET 0.01
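
P.S. To make the proposal concrete, here is a rough sketch of what a tightened check might look like. It is not the actual SpamAssassin code: the sub names is_charset_ok_strict and normalize_charset are made up for illustration, and the %charsets_for_locale entries are only the values suggested in this thread, not the real table. It keeps the ISO-8859 and UTF exceptions, drops the others, and accepts exactly one Windows code page. It also trims at ':' before stripping punctuation, since otherwise the ':' would already be gone by the time the trim runs.

```perl
use strict;
use warnings;

# Hypothetical locale table: only the values suggested in this thread,
# NOT the real %charsets_for_locale from SpamAssassin.
my %charsets_for_locale = (
    en => 'USASCII ISO88591 UTF8',
    ja => 'ISO2022JP',
);

# Normalize a charset name the same way the quoted function does:
# upper-case it and keep only A-Z and 0-9.
sub normalize_charset {
    my ($cs) = @_;
    $cs = uc $cs;
    $cs =~ s/:.*$//s;        # keep only the first of several charsets
    $cs =~ s/[^A-Z0-9]//g;   # 'windows-1252' becomes 'WINDOWS1252'
    $cs =~ s/^3D//;          # leftover from broken quoted-printable
    return $cs;
}

# Stricter variant (an assumption about what "tightened" should mean):
# only a short list of charsets is always OK, the rest must be
# explicitly listed for one of the configured locales.
sub is_charset_ok_strict {
    my ($cs, @locales) = @_;
    $cs = normalize_charset($cs);

    return 1 if $cs eq 'USASCII';
    return 1 if $cs =~ /^ISO8859/;
    return 1 if $cs =~ /^UTF/;
    return 1 if $cs eq 'WINDOWS1252';   # exact match, not /^WINDOWS/

    foreach my $locale (@locales) {
        my $lang = (!defined($locale) || $locale eq 'C') ? 'en' : $locale;
        $lang =~ s/^([a-z][a-z]).*$/$1/;   # zh_TW... => zh

        my $ok_for_loc = $charsets_for_locale{$lang};
        next if !defined $ok_for_loc;

        return 1 if $ok_for_loc =~ /(?:^| )\Q$cs\E(?:$| )/;
    }

    return 0;
}
```

With ok_locales set to "it", is_charset_ok_strict('windows-1252', 'it') returns 1, while is_charset_ok_strict('windows-1251', 'it') returns 0, which is the behaviour Robert was expecting. Whether the ISO-8859 and UTF prefixes should stay in the always-OK list is a separate question.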