Re: [go-nuts] understanding utf-8 for a newbie

2017-05-07 Thread Jan Mercl
On Sun, May 7, 2017 at 8:39 PM peterGo wrote:

> "[Rob Pike and Ken Thompson] they made sure it was backwards compatible
> with ASCII."

> ASCII is 7-bit.

So is any UTF-8 encoded ASCII.

-- 

-j

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [go-nuts] understanding utf-8 for a newbie

2017-05-07 Thread peterGo
Sam,

"[Rob Pike and Ken Thompson] they made sure it was backwards compatible 
with ASCII."

ASCII is 7-bit.

Peter




Re: [go-nuts] understanding utf-8 for a newbie

2017-05-07 Thread peterGo
Sam,

"I'd be surprised if Windows didn't understand UTF-8 these days,"

Be surprised! For Unicode, Microsoft Windows uses UTF-16.

Peter




Re: [go-nuts] understanding utf-8 for a newbie

2017-05-07 Thread Sam Whited
On Sun, May 7, 2017 at 9:44 AM, rob solomon wrote:
> I now understand that the bytes may be different.

It's also worth noting that when Ken Thompson and Rob Pike (yes, the
same Rob Pike and Ken Thompson that created Go) created UTF-8, they
made sure it was backwards compatible with ASCII. Any characters that
are representable in ASCII will be the exact same bytes when encoded
to UTF-8. I'd be surprised if Windows didn't understand UTF-8 these
days, so it may be that you really don't need to "convert" your file
at all.

Here's a fun introduction to Unicode (with a brief discussion of
encoding methods), if you're interested:

http://reedbeta.com/blog/programmers-intro-to-unicode/

—Sam



[go-nuts] understanding utf-8 for a newbie

2017-05-07 Thread rob solomon

Thanks to those who answered.

I grew up in the EBCDIC vs ASCII era, and I had always assumed that the 
bytes in a file were the same as the numbers that represent each character.


I now understand that the bytes may be different.

Thanks guys.

-- rob solomon



Re: [go-nuts] understanding utf-8 for a newbie

2017-05-06 Thread Sam Whited
On Fri, May 5, 2017 at 8:11 PM, rob solomon wrote:
> I decided to first change ", ' and emdash characters.  Using hexdump -C in
> Ubuntu, the runes in the file are:
>
> open quote = 0xE2809C
>
> close quote = 0xE2809D
>
> apostrophe = 0xE28099
>
> emdash = 0xE28094

The output of hexdump will be the actual bytes of the file; these are
the UTF-8 encoded values.

> However, when I write a simple program to display these runes from the file,
> using the routines in unicode/utf8, I get very different values.  I do not
> understand this.
>
> open quote = 0x201C
>
> close quote = 0x201D
>
> apostrophe = 0x2019
>
> emdash = 0x2014.

These are called Unicode code points. Unicode assigns a number to each
character: letters, digits, punctuation, emoji, and so on (Go's type for
storing a code point is called "rune"). These numbers are then encoded
using an encoding such as UTF-8 to produce the bytes that you saw in the
hexdump output. The code point of an em dash is always U+2014 (code
points are conventionally written this way, prefixed by `U+'), but the
encoded bytes depend on which encoding your system or file format
uses.
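
To illustrate how one code point can have different encodings, here is a small sketch that encodes the em dash as both UTF-8 and UTF-16 using the standard unicode/utf8 and unicode/utf16 packages. The helper names utf8Bytes and utf16Units are made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// utf8Bytes (hypothetical helper) returns the UTF-8 encoding of r.
func utf8Bytes(r rune) []byte {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

// utf16Units (hypothetical helper) returns the UTF-16 code units for r.
func utf16Units(r rune) []uint16 {
	return utf16.Encode([]rune{r})
}

func main() {
	const emDash = '\u2014'

	fmt.Printf("%U\n", emDash)                        // the code point itself
	fmt.Printf("UTF-8:  % X\n", utf8Bytes(emDash))    // three bytes
	fmt.Printf("UTF-16: %04X\n", utf16Units(emDash))  // one 16-bit code unit
}
```

The code point stays U+2014 in both cases; only the stored representation changes.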

Here is an example of encoding a rune with the value 0x2014 as UTF-8,
which gives the bytes you observed in your hexdump output:
https://play.golang.org/p/ddIfzobKD4
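
For readers who want to see where E2 80 94 comes from, the following sketch (not the code behind the playground link above) builds the three-byte UTF-8 form by hand and compares it with utf8.EncodeRune. The function encode3 is a hypothetical helper that only handles code points in the range U+0800..U+FFFF, where UTF-8 uses the bit layout 1110xxxx 10xxxxxx 10xxxxxx.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode3 (hypothetical helper) builds the three-byte UTF-8 form used for
// code points in U+0800..U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx.
// It does not validate its input (for example, it does not reject surrogates).
func encode3(r rune) [3]byte {
	return [3]byte{
		0xE0 | byte(r>>12),     // top 4 bits of the code point
		0x80 | byte(r>>6)&0x3F, // middle 6 bits
		0x80 | byte(r)&0x3F,    // low 6 bits
	}
}

func main() {
	// Open quote, close quote, apostrophe, em dash.
	for _, r := range []rune{0x201C, 0x201D, 0x2019, 0x2014} {
		manual := encode3(r)

		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r)

		fmt.Printf("%U  manual: % X  library: % X\n", r, manual[:], buf[:n])
	}
}
```

For 0x2014 the top four bits are 0x2, the middle six are 0x00 and the low six are 0x14, which gives 0xE2, 0x80 and 0x94.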

—Sam



Re: [go-nuts] understanding utf-8 for a newbie

2017-05-05 Thread Andy Balholm
Hexdump shows the actual bytes in the file: the UTF-8 encoding of the runes 
(Unicode code points). Apparently you are reading them with utf8.DecodeRune or 
something like that; those functions return the decoded code points, not the 
UTF-8 bytes.
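
As a rough illustration of that difference, this sketch takes the twelve bytes reported by hexdump and decodes them one rune at a time with utf8.DecodeRune. The helper name decodeAll is made up for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll (hypothetical helper) returns the code points encoded in b,
// decoding one rune at a time. Invalid bytes come back as utf8.RuneError
// (U+FFFD) with a width of 1, so the loop always makes progress.
func decodeAll(b []byte) []rune {
	var runes []rune
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		runes = append(runes, r)
		b = b[size:]
	}
	return runes
}

func main() {
	// The bytes as hexdump would show them: four 3-byte sequences.
	raw := []byte{
		0xE2, 0x80, 0x9C, // open quote
		0xE2, 0x80, 0x9D, // close quote
		0xE2, 0x80, 0x99, // apostrophe
		0xE2, 0x80, 0x94, // em dash
	}

	fmt.Printf("bytes: % X\n", raw)
	for _, r := range decodeAll(raw) {
		fmt.Printf("%U %q\n", r, r)
	}
}
```

The bytes are what is stored on disk; the values printed by %U are the code points recovered from them.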

Andy



[go-nuts] understanding utf-8 for a newbie

2017-05-05 Thread rob solomon
Hi.  I decided to write a small program in Go to convert UTF-8 to simple 
ASCII.  The need arose when I copied a file created on Ubuntu 16.04 
amd64 to a Windows 10 computer.


I decided to first change ", ' and emdash characters.  Using hexdump -C 
in Ubuntu, the runes in the file are:


open quote = 0xE2809C

close quote = 0xE2809D

apostrophe = 0xE28099

emdash = 0xE28094


However, when I write a simple program to display these runes from the 
file, using the routines in unicode/utf8, I get very different values.  
I do not understand this.


open quote = 0x201C

close quote = 0x201D

apostrophe = 0x2019

emdash = 0x2014.


Why are the runes returned by utf8.DecodeRuneInString different from 
what hexdump shows when inspecting the file directly?
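
One possible shape for the conversion described above is sketched below, using strings.NewReplacer from the standard library. The names toASCII and asciify are made up for illustration, and the choice of "--" as the stand-in for an em dash is an assumption, not something specified in this thread. The characters are written as code points (\u201C and so on); Go stores them as UTF-8 bytes in the string.

```go
package main

import (
	"fmt"
	"strings"
)

// toASCII maps the four characters from the question to ASCII stand-ins.
// The "--" replacement for the em dash is an arbitrary choice.
var toASCII = strings.NewReplacer(
	"\u201C", `"`, // open quote
	"\u201D", `"`, // close quote
	"\u2019", "'", // apostrophe
	"\u2014", "--", // em dash
)

// asciify (hypothetical helper) applies the replacements to s.
// Characters not listed above are left unchanged.
func asciify(s string) string {
	return toASCII.Replace(s)
}

func main() {
	in := "\u201CIt\u2019s done\u2014finally\u201D"
	fmt.Println(asciify(in))
}
```

Any other non-ASCII characters in the input pass through untouched, so a real tool would need a policy for those as well.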


--rob solomon
