Re: [Scons-dev] Merge PR #235 before release
Sounds like we should use sys.getfilesystemencoding(). +1

The only trick might be if you had several filesystems, each of which had a different filesystem encoding...

It may just be best to back out the patch for now.

-Bill

On Thu, May 28, 2015 at 3:54 AM, Gary Oberbrunner wrote:
> Of course we _can_ know the encoding of the filenames in the filesystem,
> that's what sys.getfilesystemencoding() is for (see the unicode link
> above). Reading file contents and handling stdout/stderr from SCons
> subprocesses is much more of a challenge.
>
> --
> Gary
>
> ___
> Scons-dev mailing list
> Scons-dev@scons.org
> https://pairlist2.pair.net/mailman/listinfo/scons-dev
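For reference, a minimal sketch of what sys.getfilesystemencoding() offers, written for Python 3. The helper name decode_filename is hypothetical and is not an SCons API. It only covers filenames, and it reports a single encoding per process, so it does not address the several-filesystems case raised above.

```python
import sys


def decode_filename(raw, encoding=None):
    """Decode a bytes filename to text.

    Hypothetical helper for illustration only; not part of SCons.
    """
    if encoding is None:
        # One value per interpreter process: two mounted filesystems
        # with different encodings cannot be told apart this way.
        encoding = sys.getfilesystemencoding()
    return raw.decode(encoding)


print(sys.getfilesystemencoding())
print(decode_filename(b"SConstruct"))
```

On Python 3, os.fsdecode() and os.fsencode() wrap the same encoding and may be the simpler choice where they are available.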
Re: [Scons-dev] Upgrading Mailman to 3.0?
Anatoly,

No. We don't host mailman. The webhost does.

-Bill

On Thu, May 28, 2015 at 5:20 AM, anatoly techtonik wrote:
> On Thu, May 28, 2015 at 10:26 AM, Dirk Bächle wrote:
> > On 28.05.2015 09:01, anatoly techtonik wrote:
> >> I just wonder if we can try newer Mailman to power SCons
> >> communication.
Re: [Scons-dev] Upgrading Mailman to 3.0?
On Thu, May 28, 2015 at 10:26 AM, Dirk Bächle wrote:
> On 28.05.2015 09:01, anatoly techtonik wrote:
>> I just wonder if we can try newer Mailman to power SCons
>> communication.
>
> I hadn't noticed that our mailing list communication is so bad that we need
> to power it up. And for me, power in discussions and texts and documents
> still comes mainly from their content...not from the tools that transport
> the latter. ;)

Well, I'd prefer stuff like http://try.discourse.org/ for communication. Lists seem a little dated. I am subscribed and rather active now, but for someone who is not so deeply involved, using all that old-school stuff may be hard.

>> That may bring greater good than newer web site.
>
> It may, or it may not. It's a coin toss, but might also only be a
> fifty-fifty chance. :)

Well, a proper hypothesis and assessment tests might improve the chance. =)

>> I expect to finally find search button there.
>
> I'm unsure about what you're trying to say here: Do you simply *wish* for a
> "find" button to be there, or do you actually *know* that Mailman has one?

Well, if Mailman 3 doesn't have search, then we can switch to Discourse.

--
anatoly t.
Re: [Scons-dev] Merge PR #235 before release
If you're interested in this problem, I suggest reading https://docs.python.org/2/howto/unicode.html which has all the details (including how to ignore decode errors), and of course check out the python3 branch of scons, where a lot of unicode handling has been done (but much is still left to do iirc).

I don't think pretending strings are in the cp437 encoding is a particularly good plan. ISO 8859-1 or Windows CP1252 would probably give better results in some cases, but you still need to ignore errors in the decode. And of course if the string actually is utf-8 with non-ascii chars, either of these encodings will return a string of the wrong length, not just wrong characters; and re-encoding it for output or storage will completely mangle it.

Of course we _can_ know the encoding of the filenames in the filesystem; that's what sys.getfilesystemencoding() is for (see the unicode link above). Reading file contents and handling stdout/stderr from SCons subprocesses is much more of a challenge.

On Thu, May 28, 2015 at 3:28 AM, anatoly techtonik wrote:
> I found a way to convert any binary string to Unicode without crashing -
> http://stackoverflow.com/a/27527728/239247 That would correctly
> convert all `ascii` characters (and will probably make it possible to use
> ANSI graphics if unicode font supports that), but it will not work for
> other utf-8 characters.

--
Gary
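A small illustration of the "wrong length" and "mangled" points above, written for Python 3. The sample strings are made up for the example. It decodes the same utf-8 bytes with the right and the wrong codec, then shows what the errors argument does with bytes that are not valid utf-8.

```python
# "naive" with a diaeresis: the non-ascii char takes two bytes in utf-8.
raw = "na\u00efve".encode("utf-8")

as_utf8 = raw.decode("utf-8")          # right guess: 5 characters
as_latin1 = raw.decode("iso-8859-1")   # wrong guess: 6 characters, no error
as_cp1252 = raw.decode("cp1252")       # wrong guess: 6 characters here too

# Re-encoding the wrongly decoded text does not give the original bytes back.
mangled = as_latin1.encode("utf-8")

# Latin-1 bytes that are not valid utf-8.
bad = b"caf\xe9"
try:
    bad.decode("utf-8")                # default errors="strict" raises
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
ignored = bad.decode("utf-8", errors="ignore")    # bad byte silently dropped
replaced = bad.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

print(len(as_utf8), len(as_latin1), mangled != raw, strict_failed)
print(repr(ignored), repr(replaced))
```

Neither "ignore" nor "replace" is reversible: once the byte is dropped or replaced, the original data cannot be recovered from the text.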
Re: [Scons-dev] Merge PR #235 before release
I found a way to convert any binary string to Unicode without crashing:
http://stackoverflow.com/a/27527728/239247

That correctly converts all ASCII characters (and will probably make it possible to use ANSI graphics if the Unicode font supports them), but it will not work for non-ASCII utf-8 characters.

Python 3 adds a surrogateescape error handler, but that is not present in Python 2:
http://stackoverflow.com/questions/19649463/how-to-do-surrogateescape-in-python2

I don't know why they called it "surrogate" - it is a freaky word.

On Wed, May 27, 2015 at 4:33 PM, Kenny, Jason L wrote:
> I would agree with this.
>
> In general the OS today stores file data (i.e. the file system data, not
> the data in the file) in Unicode (be it utf-16 or utf-8). On Linux this is
> not always the case; it could be big5 or some other locale encoding. On
> Linux there are means to see what the “native” encoding is and to use it.
>
> I should note that the idea of converting binary to Unicode does not
> really exist. The point of a binary string is to hold random data (e.g. a
> double in its raw 64-bit form vs. the decimal value 1.2385). One can
> assume that it is in a certain code page encoding and convert from that.
> And like I stated above, there are APIs to see what the locale code page
> encoding is, and that can be used to convert the code to the local
> ANSI/OEM encoding. This is different from a binary string.
>
> Jason
>
> From: Scons-dev [mailto:scons-dev-boun...@scons.org] On Behalf Of Gary
> Oberbrunner
> Sent: Wednesday, May 27, 2015 7:43 AM
> To: SCons developer list
> Subject: Re: [Scons-dev] Merge PR #235 before release
>
> On Wed, May 27, 2015 at 6:52 AM, anatoly techtonik wrote:
>
> What I need is a bulletproof way to convert from anything to unicode. This
> requires some kind of escaping to go forward and back. Some helper
> methods like u2b() (unicode to binary) and b2u(). I am quite surprised that
> so far I found nothing for this "simple" case.
>
> That's because in general the encoding of the "binary" string is unknown.
> Is it ascii, utf-8, Windows CP-1252, shift-JIS, or something else? You
> can't decode such a string to Unicode without knowing the encoding. Check
> out the python-3 branch where we've been working through some of those
> issues. Your u2b is "easy" if you assume you want the binary to be utf-8
> encoded, which is normally safe; this conversion is guaranteed to work.
> Your b2u is not so easy. You can't just assume utf-8 as you might think; if
> the string has invalid utf-8 bytes it'll raise an error or generate dummy
> chars depending on the args you pass to str.decode(). At least it'll get
> mangled if it's in a different encoding than you expect.
>
> --
> Gary

--
anatoly t.
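As a rough sketch of the "escaping to go forward and back" idea, here is what the u2b()/b2u() pair could look like on Python 3 using the built-in surrogateescape error handler. The function names come from the thread; the bodies and the utf-8 default are assumptions for illustration, not SCons code, and this does not work on Python 2. The handler maps each undecodable byte to a lone surrogate code point in the range U+DC80..U+DCFF, which is where the name comes from.

```python
def b2u(data, encoding="utf-8"):
    """Bytes to text; undecodable bytes become lone surrogates (sketch)."""
    return data.decode(encoding, errors="surrogateescape")


def u2b(text, encoding="utf-8"):
    """Text to bytes; escaped surrogates turn back into the raw bytes (sketch)."""
    return text.encode(encoding, errors="surrogateescape")


# Valid utf-8 ("caf" + e-acute) mixed with two bytes that are not valid utf-8.
blob = b"ok \xff\xfe caf\xc3\xa9"
text = b2u(blob)

# The round trip is lossless as long as both directions use the same encoding.
roundtrip_ok = u2b(text) == blob

# The escaped text is not ordinary Unicode: a strict encode refuses it.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(roundtrip_ok, strict_ok)
```

The round trip preserves the bytes, but it still says nothing about what encoding they were really in, so the decoded text can contain wrong characters if the guess is wrong.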
Re: [Scons-dev] Upgrading Mailman to 3.0?
Hi Anatoly,

On 28.05.2015 09:01, anatoly techtonik wrote:
> Hi,
>
> I just wonder if we can try newer Mailman to power SCons
> communication.

I hadn't noticed that our mailing list communication is so bad that we need to power it up. And for me, power in discussions and texts and documents still comes mainly from their content...not from the tools that transport the latter. ;)

> That may bring greater good than newer web site.

It may, or it may not. It's a coin toss, but might also only be a fifty-fifty chance. :)

> I expect to finally find search button there.

I'm unsure about what you're trying to say here: Do you simply *wish* for a "find" button to be there, or do you actually *know* that Mailman has one?

Best regards,

Dirk
[Scons-dev] Upgrading Mailman to 3.0?
Hi,

I just wonder if we can try a newer Mailman to power SCons communication. That may bring greater good than a newer web site. I expect to finally find a search button there.

http://wiki.list.org/Mailman3

--
anatoly t.