Re: ????? ??? ??????

Philip Prindeville Thu, 16 Nov 2006 14:24:35 -0800

I would say that this issue in general (and this file in particular) is
more than overdue for a revisiting.

I haven't seen UCS, CP125?, or IBM852 for a long time.  Likewise
for "UNICODE" or "XUNKNOWN".

As for "ISO" (tout court) from Magellan... that's broken, and if it
hasn't been fixed by now, then it's their problem, not ours.  Easier to
whitelist the few users still clinging to broken mailers than to
continue to compromise spamproofness.

As for Windows...  I would change the test from:

$cs =~ /^WINDOWS/

to:

$cs eq 'WINDOWS1252'

instead (no hyphen, because the function strips non-alphanumerics from
$cs before testing it).  There is no reason to use any of the other
Windows character sets:  they offer nothing that UTF-8 doesn't
already have.
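
To make the difference concrete, here is a minimal sketch.  The helper
names (normalize_charset, windows_ok_loose, windows_ok_strict) are
mine, not from the SpamAssassin source; normalize_charset just mirrors
the first lines of is_charset_ok_for_locales, which is why the strict
comparison uses the hyphen-free form.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the normalization done at the top of
# is_charset_ok_for_locales: uppercase, then drop non-alphanumerics.
sub normalize_charset {
    my ($cs) = @_;
    $cs = uc $cs;
    $cs =~ s/[^A-Z0-9]//g;
    return $cs;
}

# Current behaviour: any charset name starting with "WINDOWS" passes.
sub windows_ok_loose {
    return normalize_charset($_[0]) =~ /^WINDOWS/ ? 1 : 0;
}

# Proposed behaviour: only Windows-1252 passes unconditionally.
sub windows_ok_strict {
    return normalize_charset($_[0]) eq 'WINDOWS1252' ? 1 : 0;
}

for my $name ('windows-1252', 'Windows-1251', 'windows-1256') {
    printf "%-14s loose=%d strict=%d\n", $name,
        windows_ok_loose($name), windows_ok_strict($name);
}
```

With the strict form, Windows-1251 and Windows-1256 would fall through
to the per-locale table instead of being waved past it.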

Being liberal in what you accept is good if interoperability is
your goal.  If security and integrity, however, are paramount, then
being paranoid in what you accept might actually be more
appropriate.

Is there anyone out there (preferably in Central/Eastern Europe)
who handles a high volume of traffic and can tell us whether
any of these encodings are still in legitimate use?  Like "ISO10646"
or "UCS" or ISO-8859-8 or CP125?, etc.

The alternative is to add checks per language for each of the
Windows-125[0-8] types.  Yes, you can encode English in
Windows-1256... but a sane mailer would detect that a message
fits entirely in 7 bits and use US-ASCII instead.

If it doesn't, then it's broken and needs to be fixed.
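
What I mean by "detect", as a rough sketch: pick_charset_label is a
made-up name, and this assumes the body is a raw byte string and the
declared charset is an ASCII superset such as Windows-125x (it would
be wrong for something like ISO-2022-JP, which is 7-bit but not ASCII).

```perl
use strict;
use warnings;

# Hypothetical helper: choose the charset label a mailer could put on a
# body.  $declared is whatever the user's configuration asks for.
sub pick_charset_label {
    my ($body_bytes, $declared) = @_;

    # No byte outside 0x00-0x7F, so US-ASCII describes the body exactly.
    return 'US-ASCII' unless $body_bytes =~ /[^\x00-\x7F]/;

    # Otherwise the 8-bit charset really is needed.
    return $declared;
}

print pick_charset_label("Hello, world\n", 'Windows-1256'), "\n";
print pick_charset_label("caf\xE9\n",      'ISO-8859-1'),   "\n";
```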

I'm not against reinventing the wheel when a new design is
offered that's better.  But I'm not convinced that Windows-1252
is an improvement over Latin-1.  For instance, the glyphs "oe"
and "OE" aren't distinct letters:  each is a presentation form (i.e. a
ligature) that renders (displays) differently from writing "o" and
"e" separately... but it is in fact just the two letters "o" and "e"
that are being represented, without kerning between them
(similarly for "ij" in Dutch, etc).

The bottom line is you don't need specific characters for
"oe" and "ij", etc.  You just need a rendering engine that
understands when using a ligature is appropriate (same
as with "ss" in German, or "ff", "fl", etc. in English).

Making these distinct characters was folly.

But I digress.

Just out of curiosity, what are the charsets_for_locale{'en'}
anyway?  If it were up to me, I'd limit it to USASCII,
ISO-8859-1, and UTF-8.  Period.
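
Something like the following is what I have in mind.  This table is
hypothetical, not the shipped one, and it assumes entries are stored
in the same normalized form that is_charset_ok_for_locales compares
against; charset_ok_for_locale is a cut-down stand-in for the locale
loop in that function.

```perl
use strict;
use warnings;

# Hypothetical table, NOT the shipped one: names are in normalized form
# (uppercase, no punctuation), separated by single spaces.
my %charsets_for_locale = (
    en => 'USASCII ISO88591 UTF8',
);

# Cut-down stand-in for the per-locale check, using the same
# whole-word match as the original function.
sub charset_ok_for_locale {
    my ($cs, $locale) = @_;
    $cs = uc $cs;
    $cs =~ s/[^A-Z0-9]//g;

    my $ok_for_loc = $charsets_for_locale{$locale};
    return 0 if !defined $ok_for_loc;

    return $ok_for_loc =~ /(?:^| )\Q${cs}\E(?:$| )/ ? 1 : 0;
}

for my $name ('utf-8', 'iso-8859-1', 'windows-1252', 'koi8-r') {
    printf "%-13s %s\n", $name,
        charset_ok_for_locale($name, 'en') ? 'ok' : 'rejected';
}
```

This only has teeth once the unconditional "always OK" returns at the
top of the function are trimmed back, of course.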

Likewise, for Japanese, how many UAs use anything other
than ISO-2022-JP?  This is the blessed standard.  Anything else
is out-of-date and requires a fix.

-Philip

Robert Nicholson wrote:

> so what is the conclusion to this issue?
>
> why when I set ok_locales to it th en does it allow any Charset with
> "Windows" in the name
> to bypass that setting?
>
> Why is is_charset_ok_for_locales written to give exceptions?
>
> sub is_charset_ok_for_locales {
>   my ($cs, @locales) = @_;
>
>   $cs = uc $cs; $cs =~ s/[^A-Z0-9]//g;
>   $cs =~ s/^3D//gs;             # broken by quoted-printable
>   $cs =~ s/:.*$//gs;            # trim off multiple charsets, just use 1st
>
>   study $cs;
>   #warn "JMD $cs";
>
>   # always OK (the net speaks mostly roman charsets)
>   return 1 if ($cs eq 'USASCII');
>   return 1 if ($cs =~ /^ISO8859/);
>   return 1 if ($cs =~ /^ISO10646/);
>   return 1 if ($cs =~ /^UTF/);
>   return 1 if ($cs =~ /^UCS/);
>   return 1 if ($cs =~ /^CP125/);
>   return 1 if ($cs =~ /^WINDOWS/);      # argh, Windows
>   return 1 if ($cs eq 'IBM852');
>   return 1 if ($cs =~ /^UNICODE11UTF[78]/);     # wtf? never heard of it
>   return 1 if ($cs eq 'XUNKNOWN'); # added by sendmail when converting to 8bit
>   return 1 if ($cs eq 'ISO');   # Magellan, sending as 'charset=iso 8859-15'. grr
>
>   foreach my $locale (@locales) {
>     if (!defined($locale) || $locale eq 'C') { $locale = 'en'; }
>     $locale =~ s/^([a-z][a-z]).*$/$1/;  # zh_TW... => zh
>
>     my $ok_for_loc = $charsets_for_locale{$locale};
>     next if (!defined $ok_for_loc);
>
>     if ($ok_for_loc =~ /(?:^| )\Q${cs}\E(?:$| )/) {
>       return 1;
>     }
>   }
>
>   return 0;
> }
>
> On Nov 13, 2006, at 8:30 PM, Giampaolo Tomassoni wrote:
>
>>> # don't allow windows-1252 text attachments...
>>>
>>> mimeheader __CTYPE_MH_WIN1252   Content-Type =~ /charset=(\"windows-125[0-8]\"|windows-125[0-8])/i
>>>
>>> meta WIN_CHARSET                ((__CTYPE_MH_HTML || __CTYPE_MH_TEXT_PLAIN) && __CTYPE_MH_WIN1252)
>>>
>>> describe WIN_CHARSET            Content-Type is Windows-specific text
>>>
>>> score WIN_CHARSET               0.01
>>>
>
