Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Aleksander Matuszak
Ken Hornstein writes: I've been grappling with what to do when we have issues with character set conversion. Unfortunately, I have a lot of experience and troubles with character set conversion. Specifically, I have two issues: - What to do if the character set is unsupported. Should we

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Ken Hornstein
Unfortunately, I have a lot of experience and troubles with character set conversion. Well, if you just bit the bullet and switched to UTF-8, you wouldn't have all of these problems! :-) Should we return the original bytes? It is not the best idea. Some sequences of bytes are control

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Jerrad Pierce
In my personal opinion a very good choice is conversion into HTML entities, like &aogon; or &lstrok;. It remains quite readable and is still unique enough to convert it back in case of need. Um, ouch. Unless there's a common library that already implements that behavior, that's not on

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Ken Hornstein
Um, ouch. Unless there's a common library that already implements that behavior, that's not on the table at all. Supposedly Recode does: http://recode.progiciels-bpi.ca/index.html A super-quick scan of our systems does not show that as something that comes out of the box installed on our

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Lyndon Nerenberg
This gets very icky, very quickly :-P My feeling is that if you don't recognize the source character set, you cannot possibly convert it to a display format in any secure manner. By default I think we should not display the content, but instead spit out a diagnostic, with the option to re-run

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Jerrad Pierce
Recode need not be required, it could just be an option. iconv currently isn't after all, although they seem to complement each other. Recode is part of the core distrib of my older Ubuntu 10.02. Selective recoding would probably require calls for the substrings of interest. As an aside, recode's

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Lyndon Nerenberg
On Feb 28, 2014, at 12:01 PM, Ken Hornstein k...@pobox.com wrote: If we make sure we're converting all non-printable characters into something else, I'm unclear as to how that could happen. But if it can happen, please educate me! It's a case of fooling the GB* and multibyte converters into

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Lyndon Nerenberg
On Feb 28, 2014, at 12:01 PM, Ken Hornstein k...@pobox.com wrote: We'd still have to deal with what happens when you want to convert U+1F4A9 to ISO-8859-1. That's not an illegal parse of the input, it's a composting problem. Not the same thing at all.

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Ken Hornstein
Recode need not be required, it could just be an option. iconv currently isn't after all, although they seem to complement each other. Recode is part of the core distrib of my older Ubuntu 10.02. Fair enough ... but iconv() is part of POSIX, so assuming that it's available is reasonable (if you

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Lyndon Nerenberg
On Feb 28, 2014, at 12:24 PM, Ken Hornstein k...@pobox.com wrote: Fair enough ... but iconv() is part of POSIX, so assuming that it's available is reasonable (if you don't have iconv(), we basically give up in terms of handling different character sets). Sadly, iconv() in practice is a

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Ken Hornstein
We'd still have to deal with what happens when you want to convert U+1F4A9 to ISO-8859-1. That's not an illegal parse of the input, it's a composting problem. Not the same thing at all. Sigh, IT'S THE SAME THING. iconv() returns EILSEQ at a particular point in your conversion buffer. What do

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Ken Hornstein
Sigh, IT'S THE SAME THING. iconv() returns EILSEQ at a particular point in your conversion buffer. What do you do next? In your example, emit a Pile Of Poo. I know you're being flippant ... but it's a serious question. Right now, iconv() returns EILSEQ if you cannot convert an input
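
The choices being argued over in this thread (bail, substitute a placeholder, or emit an entity-like escape) can be illustrated with a small sketch. This is not nmh code and does not call iconv(); it uses Python's built-in codec error handlers as a stand-in for what a C program might do after iconv() reports EILSEQ. The `convert` helper and its policy names are hypothetical. Note also that `xmlcharrefreplace` emits numeric character references, not named entities such as &aogon;.

```python
# Illustrative only: Python's codec error handlers stand in for the
# choices a C program faces after iconv() fails with EILSEQ.
text = "caf\u00e9 \U0001F4A9"  # U+1F4A9 has no ISO-8859-1 equivalent


def convert(s, policy):
    """Hypothetical helper: encode s to ISO-8859-1 under a given policy."""
    if policy == "bail":
        # Refuse to produce output; the caller would print a diagnostic.
        try:
            return s.encode("iso-8859-1", errors="strict")
        except UnicodeEncodeError:
            return None
    if policy == "substitute":
        # Replace each unconvertible character with '?'.
        return s.encode("iso-8859-1", errors="replace")
    if policy == "entity":
        # Replace each unconvertible character with a numeric reference.
        return s.encode("iso-8859-1", errors="xmlcharrefreplace")
    raise ValueError(f"unknown policy: {policy}")


for policy in ("bail", "substitute", "entity"):
    print(policy, convert(text, policy))
```

The "bail" policy corresponds to refusing to display the content, while the other two keep going and lose or escape the offending character; which of these is acceptable is exactly what the thread disputes.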

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Lyndon Nerenberg
On Feb 28, 2014, at 1:01 PM, Ken Hornstein k...@pobox.com wrote: Based on _what you want to happen_, what, exactly, should be done from a programming perspective? Bail? Yes! Bail! Don't be a vector for someone to do nasties! If people want to see invalid content, they have cat(1) at hand.

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Ken Hornstein
Look, software cannot read minds. People would like it to, but I don't work for the NSA, so I don't buy into that concept. We have standards. For a reason. To eliminate ambiguity. MIME has been around for how many years now? There is no excuse in this day and age for any software to generate

Re: [Nmh-workers] General question - unsupported charset conversion

2014-02-28 Thread Ken Hornstein
That is right. On the other hand, you never prevent malformed MIME parameters. Remember that we're not talking about malformed MIME parameters; we're talking about entirely valid ones. It is not a problem in case of one or two missing or substituted symbols in long text. We can guess what is the