Re: [go-nuts] understanding utf-8 for a newbie
On Sun, May 7, 2017 at 8:39 PM peterGo wrote:
> "[Rob Pike and Ken Thompson] they made sure it was backwards compatible with ASCII."
>
> ASCII is 7-bits.

So is any UTF-8 encoded ASCII.

-j

-- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: [go-nuts] understanding utf-8 for a newbie
Sam,

"[Rob Pike and Ken Thompson] they made sure it was backwards compatible with ASCII."

ASCII is 7-bits.

Peter
Re: [go-nuts] understanding utf-8 for a newbie
Sam,

"I'd be surprised if Windows didn't understand UTF-8 these days,"

Be surprised! For Unicode, Microsoft Windows uses UTF-16.

Peter
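Whichever encoding a given Windows program expects, the difference between the two encodings is easy to see for a single character. The following is a minimal sketch, not from the thread, that prints the em dash in both forms using the standard unicode/utf16 package. The helper name utf16Units is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16Units returns the UTF-16 code units for s.
// (Hypothetical helper name, used only in this sketch.)
func utf16Units(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

func main() {
	s := "\u2014" // em dash, code point U+2014

	// Go strings hold UTF-8, so formatting the string in hex shows the UTF-8 bytes.
	fmt.Printf("UTF-8 bytes:  % X\n", s)

	// The same code point expressed as 16-bit UTF-16 code units.
	fmt.Printf("UTF-16 units: %04X\n", utf16Units(s))
}
```

The code point is the same in both cases; only the bytes that end up in a file differ.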
Re: [go-nuts] understanding utf-8 for a newbie
On Sun, May 7, 2017 at 9:44 AM, rob solomon wrote:
> I now understand that the bytes may be different.

It's also worth noting that when Ken Thompson and Rob Pike (yes, the same Rob Pike and Ken Thompson that created Go) created UTF-8, they made sure it was backwards compatible with ASCII. Any characters that are representable in ASCII will be the exact same bytes when encoded to UTF-8. I'd be surprised if Windows didn't understand UTF-8 these days, so it may be that you really don't need to "convert" your file at all.

Here's a fun introduction to Unicode (with a brief discussion of encoding methods), if you're interested:

http://reedbeta.com/blog/programmers-intro-to-unicode/

—Sam
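The ASCII compatibility described above can be checked directly. This is a small sketch, not part of the original message: text that contains only ASCII characters uses one byte per character in UTF-8, each below 0x80, while anything outside ASCII brings in bytes of 0x80 or higher. The function name isASCII and the sample strings are invented for the example.

```go
package main

import "fmt"

// isASCII reports whether every byte of s is below 0x80, that is,
// whether the UTF-8 encoding of s is also valid 7-bit ASCII.
// (Hypothetical helper name, used only in this sketch.)
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return false
		}
	}
	return true
}

func main() {
	plain := "Hello, gophers"
	fancy := "Hello\u2014gophers" // same text with an em dash in the middle

	// "% x" prints the raw bytes of the string in hex, separated by spaces.
	fmt.Printf("% x  ascii=%v\n", plain, isASCII(plain))
	fmt.Printf("% x  ascii=%v\n", fancy, isASCII(fancy))
}
```

A file for which such a check returns true needs no conversion at all: its bytes are already plain ASCII.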
[go-nuts] understanding utf-8 for a newbie
Thanks to those who answered. I grew up in the EBCDIC vs ASCII era, and I've always expected that the bytes in the file were the same as those that represented a character. I now understand that the bytes may be different. Thanks guys.

--rob solomon
Re: [go-nuts] understanding utf-8 for a newbie
On Fri, May 5, 2017 at 8:11 PM, rob solomon wrote:
> I decided to first change ", ' and emdash characters. Using hexdump -C in
> Ubuntu, the runes in the file are:
>
> open quote = 0xE2809C
> close quote = 0xE2809D
> apostrophe = 0xE28099
> emdash = 0xE28094

The output of hexdump will be the actual bytes of the file; these are the UTF-8 encoded values.

> However, when I write a simple program to display these runes from the file,
> using the routines in unicode/utf8, I get very different values. I do not
> understand this.
>
> open quote = 0x201C
> close quote = 0x201D
> apostrophe = 0x2019
> emdash = 0x2014.

These are called Unicode codepoints. In Unicode lots of different things like letters, numbers, emoji, etc. are assigned numbers (Go's type for storing codepoints is called "rune"). These numbers are then encoded using an encoding such as UTF-8 to make the final output which you saw when you used hexdump. The Unicode codepoint of an em dash is always U+2014 (sometimes they're written this way, prefixed by `U+'), but the encoding might be different depending on what system you're on or what file format you're using.

Here is an example of encoding a rune with the value 0x2014 as UTF-8, which gives the number you observed in your hexdump output:

https://play.golang.org/p/ddIfzobKD4

—Sam
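For readers who cannot open the playground link, here is an independent sketch of the same idea. It is not necessarily what the linked snippet contains. It encodes the four code points from the question with utf8.EncodeRune and prints the resulting bytes. The helper name encodeRune is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeRune returns the UTF-8 bytes for a single code point.
// (Hypothetical helper name, used only in this sketch.)
func encodeRune(r rune) []byte {
	buf := make([]byte, utf8.UTFMax) // room for the longest possible encoding
	n := utf8.EncodeRune(buf, r)     // n is the number of bytes written
	return buf[:n]
}

func main() {
	// Open quote, close quote, apostrophe, em dash.
	for _, r := range []rune{0x201C, 0x201D, 0x2019, 0x2014} {
		fmt.Printf("%U -> % X\n", r, encodeRune(r))
	}
	// Prints:
	// U+201C -> E2 80 9C
	// U+201D -> E2 80 9D
	// U+2019 -> E2 80 99
	// U+2014 -> E2 80 94
}
```

The three bytes printed for each code point are the values hexdump reported for the file.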
Re: [go-nuts] understanding utf-8 for a newbie
Hexdump shows the actual bytes in the file: the UTF-8 encoding of the runes (Unicode code points). Apparently you are reading them with utf8.DecodeRune or something like that; those return the code points, without the UTF-8 encoding.

Andy
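The decoding direction can be sketched the same way. This assumes the three bytes hexdump showed for the em dash and passes them to utf8.DecodeRune, which returns the code point and the number of bytes it consumed. The wrapper name decodeFirst is invented for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeFirst returns the first code point encoded in b and the
// number of bytes that encoding occupies.
// (Hypothetical wrapper name, used only in this sketch.)
func decodeFirst(b []byte) (rune, int) {
	return utf8.DecodeRune(b)
}

func main() {
	raw := []byte{0xE2, 0x80, 0x94} // the bytes hexdump shows for the em dash
	r, size := decodeFirst(raw)
	fmt.Printf("% X -> %U (%d bytes)\n", raw, r, size)
	// Prints: E2 80 94 -> U+2014 (3 bytes)
}
```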
[go-nuts] understanding utf-8 for a newbie
Hi. I decided to write a small program in Go to convert UTF-8 to simple ASCII. This need arose from copying a file created in Ubuntu 16.04 amd64 to a Win10 computer and using it there.

I decided to first change ", ' and emdash characters. Using hexdump -C in Ubuntu, the runes in the file are:

open quote = 0xE2809C
close quote = 0xE2809D
apostrophe = 0xE28099
emdash = 0xE28094

However, when I write a simple program to display these runes from the file, using the routines in unicode/utf8, I get very different values. I do not understand this.

open quote = 0x201C
close quote = 0x201D
apostrophe = 0x2019
emdash = 0x2014

Why are the runes returned by utf8.DecodeRuneInString different from what hexdump shows when inspecting the file directly?

--rob solomon
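One way the conversion described above might look is sketched below, using strings.NewReplacer from the standard library. This is not the poster's program. The names toASCII and simplify are hypothetical, and the ASCII stand-ins are assumptions, in particular "--" for the em dash. Because Go source and Go strings are UTF-8, the replacements can be written in terms of code points, and the byte-level encoding never has to be handled by hand.

```go
package main

import (
	"fmt"
	"strings"
)

// toASCII maps a few typographic characters to ASCII stand-ins.
// The choice of stand-ins is an assumption made for this sketch.
var toASCII = strings.NewReplacer(
	"\u201C", `"`, // left double quotation mark
	"\u201D", `"`, // right double quotation mark
	"\u2019", "'", // right single quotation mark (apostrophe)
	"\u2014", "--", // em dash
)

// simplify returns s with the characters above replaced.
// (Hypothetical helper name, used only in this sketch.)
func simplify(s string) string {
	return toASCII.Replace(s)
}

func main() {
	fmt.Println(simplify("\u201CIt\u2019s done\u201D \u2014 rob"))
	// Prints: "It's done" -- rob
}
```

Characters not listed in the replacer pass through unchanged, so the output is only guaranteed to be ASCII if the input contains no other non-ASCII characters.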