Re: [HACKERS] SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.
On Friday 13 June 2003 17:28, Roland Glenn McIntosh wrote:
> This is my solution / bug report / RFC cross-posted from [GENERAL]
> regarding insertion of hexadecimal characters from the command line.
> ---
>
> Okay. I have NO IDEA why this works. If someone could enlighten me as to
> the math involved I'd appreciate it. First, a little background:
>
> The Euro symbol is unicode value 0x20AC. UTF-8 encoding is a way of
> representing most unicode characters in two bytes, and most latin
> characters in one byte.
>
> The only way I have found to insert a euro symbol into the database from
> the command line psql client is this: INSERT INTO mytable
> VALUES('\342\202\254');
>
> I don't know why this works. In hex, those octal values are:
> E2 82 AC

My apologies, I forgot to mention converting to UTF-8 in my original reply.

> Additionally, according to the psql online documentation and man page:
> "Anything contained in single quotes is furthermore subject to C-like
> substitutions for \n (new line), \t (tab), \digits, \0digits, and \0xdigits
> (the character with the given decimal, octal, or hexadecimal code)."
>
> Those digits *should* be interpreted as decimal digits, but they aren't.
> The man page for psql is either incorrect, or the implementation is buggy.

The docs are easy to misunderstand if you are scanning them in a hurry. This section refers to substitutions in psql's own meta-commands, not in SQL statements. For example, this:

\echo '\0xe2\0x82\0xac'

will display the Euro sign (assuming your terminal can print it).

Ian Barwick
[EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.
On Fri, Jun 13, 2003 at 11:28:36AM -0400, Roland Glenn McIntosh wrote:
>
> The Euro symbol is unicode value 0x20AC. UTF-8 encoding is a way of representing
> most unicode characters in two bytes, and most latin characters in one byte.

More precisely, UTF-8 encodes ASCII characters in one byte. All other latin-1 characters take 2 bytes IIRC, with the rest taking up to 4 bytes.

> I don't know why my "20" byte turned into two bytes of E2 and 82.

Haven't got the spec handy, but UTF-8 uses the most-significant bit(s) of each byte as a "continuation" field. If the upper bit is zero, the char is a plain 7-bit ASCII value. If it's 1, the byte is part of a multibyte sequence with a few most-significant bits indicating the sequence's length and the byte's position in it (IIRC it's something like a countdown to the end of the sequence).

In a nutshell, you can't just take bits away from your Unicode value and call it UTF-8; it's a variable-length encoding and it needs some extra room for the length information to go.

Furthermore, I don't think the Euro symbol is in latin-1 at all. It was added in latin-9 (ISO 8859-15), so it's not likely to have gotten a retroactive spot in the bottom 256 character values. Hence it will take UTF-8 more bytes to encode it.

> Furthermore, I was under the impression that a UTF-8 encoding of the Euro sign only
> took two bytes. Corroborating this assumption, upon dumping that table with pg_dump
> and examining the resultant file in a hex editor, I see this in that character
> position: AC 20

How does that "corroborate the assumption"? You're looking at the Unicode value now, in a fixed-length 16-bit encoding.

> I did try the '\0x20AC' method, and '\0x20\0xAC' without success.
> It's worth noting that the field I'm inserting into is an SQL_ASCII field, and I'm
> reading my UTF-8 string out of it like this, via JDBC:

You can't fit UTF-8 into ASCII. UTF-8 is an eight-bit encoding; ASCII is a 7-bit character set.

Jeroen

---(end of broadcast)---
TIP 9: most folks find a random_page_cost between 1 or 2 is ideal
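For readers who want the math asked about in this thread, here is a minimal sketch of the UTF-8 bit layout as standardized in RFC 3629. It is not code from any poster: the class name `Utf8Sketch` and its methods `encode` and `hex` are made up for illustration. In that layout the lead byte carries the sequence length in its leading 1 bits, and every following byte starts with the bits 10. Applied to U+20AC this yields E2 82 AC, the same bytes as the octal escapes \342\202\254.

```java
class Utf8Sketch {
    /** Encode one Unicode code point as UTF-8 (1 to 4 bytes, RFC 3629 layout). */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value");
        }
        if (cp < 0x80) {                 // 0xxxxxxx (plain ASCII)
            return new byte[] { (byte) cp };
        }
        if (cp < 0x800) {                // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp < 0x10000) {              // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {              // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    /** Format bytes as space-separated upper-case hex, e.g. "E2 82 AC". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x20AC))); // E2 82 AC (Euro sign, 3 bytes)
        System.out.println(hex(encode(0xE9)));   // C3 A9    (latin-1 e-acute, 2 bytes)
        System.out.println(hex(encode('A')));    // 41       (ASCII, 1 byte)
    }
}
```

The sixteen bits of 0x20AC (0010 0000 1010 1100) are split 4 + 6 + 6 across the three bytes, which is why the "20" does not survive as a byte of its own.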
[HACKERS] SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.
This is my solution / bug report / RFC cross-posted from [GENERAL] regarding insertion of hexadecimal characters from the command line.

---

Okay. I have NO IDEA why this works. If someone could enlighten me as to the math involved I'd appreciate it. First, a little background:

The Euro symbol is unicode value 0x20AC. UTF-8 encoding is a way of representing most unicode characters in two bytes, and most latin characters in one byte.

The only way I have found to insert a euro symbol into the database from the command line psql client is this:

    INSERT INTO mytable VALUES('\342\202\254');

I don't know why this works. In hex, those octal values are:

    E2 82 AC

I don't know why my "20" byte turned into two bytes of E2 and 82. Furthermore, I was under the impression that a UTF-8 encoding of the Euro sign only took two bytes. Corroborating this assumption, upon dumping that table with pg_dump and examining the resultant file in a hex editor, I see this in that character position:

    AC 20

Additionally, according to the psql online documentation and man page:

"Anything contained in single quotes is furthermore subject to C-like substitutions for \n (new line), \t (tab), \digits, \0digits, and \0xdigits (the character with the given decimal, octal, or hexadecimal code)."

Those digits *should* be interpreted as decimal digits, but they aren't. The man page for psql is either incorrect, or the implementation is buggy.

I did try the '\0x20AC' method, and '\0x20\0xAC' without success.

It's worth noting that the field I'm inserting into is an SQL_ASCII field, and I'm reading my UTF-8 string out of it like this, via JDBC:

    String value = new String( resultset.getBytes(1), "UTF-8");

Can anyone help me make sense of this mumbo jumbo?

-Roland

---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
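As a way to check the byte values quoted in this post without a database, the following sketch parses the octal escapes from the INSERT statement and decodes them the same way the JDBC line does. The class `EuroBytes` and its helper methods are hypothetical names chosen for this illustration. The last step shows one possible reading of the AC 20 seen in the hex editor: those are the bytes of U+20AC in little-endian UTF-16. Whether the dump file was really written in that encoding is an assumption, not something the thread confirms.

```java
import java.nio.charset.StandardCharsets;

class EuroBytes {
    /** Turn a run of octal escapes such as \342\202\254 into raw bytes. */
    static byte[] fromOctalEscapes(String escaped) {
        String[] parts = escaped.split("\\\\");   // regex for one backslash
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        for (String p : parts) {
            if (p.isEmpty()) continue;            // text before the first backslash
            out.write(Integer.parseInt(p, 8));    // base 8: "342" is 226, i.e. 0xE2
        }
        return out.toByteArray();
    }

    /** Same decoding step as new String(resultset.getBytes(1), "UTF-8"). */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = fromOctalEscapes("\\342\\202\\254");
        String value = decodeUtf8(raw);

        // Three bytes in, a single character out.
        System.out.println(raw.length + " bytes, " + value.length() + " char");
        System.out.println(Integer.toHexString(value.charAt(0)));  // 20ac

        // The same character written as little-endian UTF-16.
        for (byte b : value.getBytes(StandardCharsets.UTF_16LE)) {
            System.out.printf("%02X ", b & 0xFF);                  // AC 20
        }
        System.out.println();
    }
}
```

The octal digits are read in base 8, so 342, 202 and 254 are 226, 130 and 172 in decimal, which is E2, 82 and AC in hex.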