Re: [HACKERS] SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.
On Friday 13 June 2003 17:28, Roland Glenn McIntosh wrote:

> This is my solution / bug report / RFC cross-posted from [GENERAL] regarding insertion of hexadecimal characters from the command line.
> ---
> Okay. I have NO IDEA why this works. If someone could enlighten me as to the math involved I'd appreciate it.
> First, a little background: The Euro symbol is unicode value 0x20AC. UTF-8 encoding is a way of representing most unicode characters in two bytes, and most latin characters in one byte.
> The only way I have found to insert a euro symbol into the database from the command line psql client is this:
>     INSERT INTO mytable VALUES('\342\202\254');
> I don't know why this works. In hex, those octal values are: E2 82 AC

My apologies, I forgot to mention converting to UTF-8 in my original reply.

> Additionally, according to the psql online documentation and man page:
> Anything contained in single quotes is furthermore subject to C-like substitutions for \n (new line), \t (tab), \digits, \0digits, and \0xdigits (the character with the given decimal, octal, or hexadecimal code).
> Those digits *should* be interpreted as decimal digits, but they aren't. The man page for psql is either incorrect, or the implementation is buggy.

The docs are easy to misunderstand if you are scanning them in a hurry. This section is referring to substitutions in psql's own meta commands, not SQL statements, e.g. this:

    \echo '\0xe2\0x82\0xac'

will display the Euro sign (assuming your terminal can print it).

Ian Barwick [EMAIL PROTECTED]
[HACKERS] SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.
This is my solution / bug report / RFC cross-posted from [GENERAL] regarding insertion of hexadecimal characters from the command line.

---

Okay. I have NO IDEA why this works. If someone could enlighten me as to the math involved I'd appreciate it.

First, a little background: The Euro symbol is unicode value 0x20AC. UTF-8 encoding is a way of representing most unicode characters in two bytes, and most latin characters in one byte.

The only way I have found to insert a euro symbol into the database from the command line psql client is this:

    INSERT INTO mytable VALUES('\342\202\254');

I don't know why this works. In hex, those octal values are: E2 82 AC

I don't know why my 20 byte turned into two bytes of E2 and 82. Furthermore, I was under the impression that a UTF-8 encoding of the Euro sign only took two bytes. Corroborating this assumption, upon dumping that table with pg_dump and examining the resultant file in a hex editor, I see this in that character position: AC 20

Additionally, according to the psql online documentation and man page:

    Anything contained in single quotes is furthermore subject to C-like substitutions for \n (new line), \t (tab), \digits, \0digits, and \0xdigits (the character with the given decimal, octal, or hexadecimal code).

Those digits *should* be interpreted as decimal digits, but they aren't. The man page for psql is either incorrect, or the implementation is buggy.

I did try the '\0x20AC' method, and '\0x20\0xAC' without success. It's worth noting that the field I'm inserting into is an SQL_ASCII field, and I'm reading my UTF-8 string out of it like this, via JDBC:

    String value = new String(resultset.getBytes(1), "UTF-8");

Can anyone help me make sense of this mumbo jumbo?

-Roland
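For anyone who wants to check the numbers in the message above, here is a minimal Python sketch (Python is an assumption of this note; the thread itself only uses psql and JDBC). It converts the three octal escapes from the INSERT statement to bytes, confirms they are E2 82 AC, and decodes them as UTF-8. It also shows that the byte pair AC 20 is what U+20AC looks like in little-endian UTF-16; whether that is really what the pg_dump file contained cannot be confirmed from the thread.

```python
# The three octal escapes from INSERT INTO mytable VALUES('\342\202\254');
octal_escapes = ["342", "202", "254"]

# Interpret each escape as a base-8 number and collect the raw bytes.
raw = bytes(int(digits, 8) for digits in octal_escapes)
print(" ".join(f"{b:02X}" for b in raw))      # E2 82 AC

# Decoding those three bytes as UTF-8 yields a single character, U+20AC.
text = raw.decode("utf-8")
print(len(text), f"U+{ord(text):04X}")        # 1 U+20AC

# For comparison: the same code point in little-endian UTF-16 is AC 20.
utf16le = "\u20ac".encode("utf-16-le")
print(" ".join(f"{b:02X}" for b in utf16le))  # AC 20
```

So the three bytes in the INSERT are one character in UTF-8, not a two-byte value plus something extra.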
Re: [HACKERS] SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.
On Fri, Jun 13, 2003 at 11:28:36AM -0400, Roland Glenn McIntosh wrote:

> The Euro symbol is unicode value 0x20AC. UTF-8 encoding is a way of representing most unicode characters in two bytes, and most latin characters in one byte.

More precisely, UTF-8 encodes ASCII characters in one byte. All other latin-1 characters take 2 bytes IIRC, with the rest taking up to 4 bytes.

> I don't know why my 20 byte turned into two bytes of E2 and 82.

Haven't got the spec handy, but UTF-8 uses the most-significant bit(s) of each byte as a continuation field. If the upper bit is zero, the char is a plain 7-bit ASCII value. If it's 1, the byte is part of a multibyte sequence with a few most-significant bits indicating the sequence's length and the byte's position in it (IIRC it's something like a countdown to the end of the sequence).

In a nutshell, you can't just take bits away from your Unicode value and call it UTF-8; it's a variable-length encoding and it needs some extra room for the length information to go.

Furthermore, I don't think the Euro symbol is in latin-1 at all. It was added in latin-9 (iso 8859-15) and so it's not likely to have gotten a retroactive spot in the bottom 256 character values. Hence it will take UTF-8 more bytes to encode it.

> Furthermore, I was under the impression that a UTF-8 encoding of the Euro sign only took two bytes. Corroborating this assumption, upon dumping that table with pg_dump and examining the resultant file in a hex editor, I see this in that character position: AC 20

How does that corroborate the assumption? You're looking at the Unicode value now, in a fixed-length 16-bit encoding.

> I did try the '\0x20AC' method, and '\0x20\0xAC' without success. It's worth noting that the field I'm inserting into is an SQL_ASCII field, and I'm reading my UTF-8 string out of it like this, via JDBC:

You can't fit UTF-8 into ASCII. UTF-8 is an eight-bit encoding; ASCII is a 7-bit character set.
Jeroen
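The bit layout described in the reply above can be made concrete with a small hand-rolled encoder. This is only an illustrative Python sketch: `utf8_encode` is a hypothetical helper written for this note, not anything from PostgreSQL or psql, and it skips validation (surrogates, out-of-range values). In the actual format the lead byte carries the sequence length (110, 1110 or 11110 as its top bits) and every following byte starts with the bits 10, rather than counting down.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (no validation)."""
    if code_point < 0x80:        # 1 byte:  0xxxxxxx (plain ASCII)
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


euro = utf8_encode(0x20AC)
print(" ".join(f"{b:02X}" for b in euro))     # E2 82 AC
print("".join(f"\\{b:03o}" for b in euro))    # \342\202\254

# A latin-1 letter such as U+00E9 lands in the two-byte range.
print(" ".join(f"{b:02X}" for b in utf8_encode(0xE9)))  # C3 A9
```

0x20AC needs 14 significant bits, which is more than the 11 payload bits a two-byte sequence offers, so it falls into the three-byte form: 4 bits in the lead byte and 6 in each of the two continuation bytes.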