Re: [HACKERS] SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.

2003-06-16 Thread Ian Barwick
On Friday 13 June 2003 17:28, Roland Glenn McIntosh wrote:
 This is my solution / bug report / RFC cross-posted from [GENERAL]
 regarding insertion of hexadecimal characters from the command line.
 ---

 Okay.  I have NO IDEA why this works.  If someone could enlighten me as to
 the math involved I'd appreciate it.  First, a little background:

 The Euro symbol is unicode value 0x20AC.  UTF-8 encoding is a way of
 representing most unicode characters in two bytes, and most latin
 characters in one byte.

 The only way I have found to insert a euro symbol into the database from
 the command line psql client is this: INSERT INTO mytable
 VALUES('\342\202\254');

 I don't know why this works.  In hex, those octal values are:
   E2 82 AC

My apologies, I forgot to mention converting to UTF-8 in my original
reply.
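
For anyone following along, the octal-to-hex step can be checked with a few lines of Python. This is only a sketch added for illustration; the variable names are made up and nothing here is psql-specific. It converts the three octal escapes to bytes and decodes them as UTF-8.

```python
# The three octal escapes from the INSERT statement, without the backslashes.
octal_escapes = ["342", "202", "254"]

# Interpret each escape as a base-8 number to get the raw byte values.
raw = bytes(int(o, 8) for o in octal_escapes)

# Show the same bytes in hex, then decode them as one UTF-8 sequence.
hex_form = " ".join(f"{b:02X}" for b in raw)
decoded = raw.decode("utf-8")

print(hex_form)               # E2 82 AC
print(decoded == "\u20ac")    # True
```

So the three escapes are not three characters; together they form the single UTF-8 sequence for U+20AC.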

 Additionally, according to the psql online documentation and man page:
 Anything contained in single quotes is furthermore subject to C-like
 substitutions for \n (new line), \t (tab), \digits, \0digits, and \0xdigits
 (the character with the given decimal, octal, or hexadecimal code).

 Those digits *should* be interpreted as decimal digits, but they aren't. 
 The man page for psql is either incorrect, or the implementation is buggy.

The docs are easy to misunderstand if you are scanning them in a hurry.
This section is referring to substitutions in psql's own meta commands,
not SQL statements, e.g. this:

\echo '\0xe2\0x82\0xac'

will display the Euro sign (assuming your terminal can print it).


Ian Barwick
[EMAIL PROTECTED]


---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[HACKERS] SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.

2003-06-13 Thread Roland Glenn McIntosh
This is my solution / bug report / RFC cross-posted from [GENERAL] regarding insertion 
of hexadecimal characters from the command line.
---

Okay.  I have NO IDEA why this works.  If someone could enlighten me as to the math 
involved I'd appreciate it.  First, a little background:

The Euro symbol is unicode value 0x20AC.  UTF-8 encoding is a way of representing most 
unicode characters in two bytes, and most latin characters in one byte.

The only way I have found to insert a euro symbol into the database from the command 
line psql client is this:
INSERT INTO mytable VALUES('\342\202\254');

I don't know why this works.  In hex, those octal values are:
E2 82 AC

I don't know why my 20 byte turned into two bytes of E2 and 82.  Furthermore, I was 
under the impression that a UTF-8 encoding of the Euro sign only took two bytes.  
Corroborating this assumption, upon dumping that table with pg_dump and examining the 
resultant file in a hex editor, I see this in that character position: AC 20

Additionally, according to the psql online documentation and man page:
Anything contained in single quotes is furthermore subject to C-like substitutions 
for \n (new line), \t (tab), \digits, \0digits, and \0xdigits (the character with the 
given decimal, octal, or hexadecimal code).

Those digits *should* be interpreted as decimal digits, but they aren't.  The man page 
for psql is either incorrect, or the implementation is buggy.

I did try the '\0x20AC' method, and '\0x20\0xAC' without success.
It's worth noting that the field I'm inserting into is an SQL_ASCII field, and I'm 
reading my UTF-8 string out of it like this, via JDBC:
String value = new String(resultset.getBytes(1), "UTF-8");

Can anyone help me make sense of this mumbo jumbo?
-Roland 




Re: [HACKERS] SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.

2003-06-13 Thread Jeroen T. Vermeulen
On Fri, Jun 13, 2003 at 11:28:36AM -0400, Roland Glenn McIntosh wrote:
 
 The Euro symbol is unicode value 0x20AC.  UTF-8 encoding is a way of representing 
 most unicode characters in two bytes, and most latin characters in one byte.
 
More precisely, UTF-8 encodes ASCII characters in one byte.  All other
latin-1 characters take 2 bytes IIRC, with the rest taking up to 4 bytes.


 I don't know why my 20 byte turned into two bytes of E2 and 82.  

Haven't got the spec handy, but UTF-8 uses the most-significant bit(s) of
each byte as a marker field.  If the upper bit is zero, the char
is a plain 7-bit ASCII value.  If it's 1, the byte is part of a multibyte
sequence: the leading bits of the first byte give the sequence's length,
and each following byte starts with the bits 10 to mark it as a
continuation.

In a nutshell, you can't just take bits away from your Unicode value and
call it UTF-8; it's a variable-length encoding and it needs some extra
room for the length information to go.
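
To make the math concrete, here is a hand-rolled encoder in Python. It is a minimal sketch for illustration only: the function name utf8_encode is made up, it does no validation (surrogates and out-of-range values are not rejected), and real code should simply use the language's built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (no validation)."""
    if code_point < 0x80:
        # Up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # Up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


euro = utf8_encode(0x20AC)
print(" ".join(f"{b:02x}" for b in euro))  # e2 82 ac
```

For 0x20AC (binary 0010 000010 101100) the 16 payload bits are split 4 + 6 + 6 and placed behind the prefixes 1110, 10 and 10, which gives E2 82 AC.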

Furthermore, I don't think the Euro symbol is in latin-1 at all.  It was
added in latin-9 (iso 8859-15) and so it's not likely to have gotten a
retroactive spot in the bottom 256 character values.  Hence it will take
UTF-8 more bytes to encode it.


 Furthermore, I was under the impression that a UTF-8 encoding of the Euro sign only 
 took two bytes.  Corroborating this assumption, upon dumping that table with pg_dump 
 and examining the resultant file in a hex editor, I see this in that character 
 position: AC 20
 
How does that corroborate the assumption?  You're looking at the Unicode
value now, in a fixed-length 16-bit encoding.
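
A quick way to see the difference, sketched in Python (an illustration only, not something taken from the dump itself): the byte pair AC 20 is what the code point looks like as a little-endian 16-bit value, while the UTF-8 form of the same character is three bytes long.

```python
euro = "\u20ac"

# 16-bit little-endian form: low byte first, so 0x20AC becomes AC 20.
utf16_le = euro.encode("utf-16-le")

# UTF-8 form of the same character.
utf8 = euro.encode("utf-8")

print(" ".join(f"{b:02X}" for b in utf16_le))  # AC 20
print(" ".join(f"{b:02X}" for b in utf8))      # E2 82 AC
```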

 
 I did try the '\0x20AC' method, and '\0x20\0xAC' without success.
 It's worth noting that the field I'm inserting into is an SQL_ASCII field, and I'm 
 reading my UTF-8 string out of it like this, via JDBC:

You can't fit UTF-8 into ASCII.  UTF-8 is an eight-bit encoding; ASCII
is a 7-bit character set.


Jeroen



