[SQL] Significance of Database Encoding

2005-05-15 Thread Rajesh Mallah
Hi,

I would like to know what the difference is between databases
that are created using UNICODE encoding and SQL_ASCII encoding.

I have an existing database that has SQL_ASCII encoding, but
I am still able to store multibyte characters that are not
in the ASCII character set. For example:

tradein_clients=# \l
            List of databases
+-----------------+----------+-----------+
|      Name       |  Owner   | Encoding  |
+-----------------+----------+-----------+
| template0       | postgres | SQL_ASCII |
| template1       | postgres | SQL_ASCII |
| tradein_clients | tradein  | SQL_ASCII |
+-----------------+----------+-----------+

tradein_clients=# SELECT  * from t_A;
+------------+
|     a      |
+------------+
| 私はガラス |
+------------+

The above is some Japanese text.

I have seen some postings about migrating databases from
SQL_ASCII to UNICODE. Given the above observation, what
significance does a migration have?

Regards

Rajesh Kumar Mallah.

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [SQL] Significance of Database Encoding [ update ]

2005-05-15 Thread Rajesh Mallah


I am not sure why the characters did not display properly
in the mailing list archives.

http://archives.postgresql.org/pgsql-sql/2005-05/msg00102.php

But when I run the SELECT on my screen (xterm -u8) I do
see the Japanese glyphs properly.


Regds
Mallah.


Re: [SQL] Significance of Database Encoding

2005-05-15 Thread PFC

> +------------+
> | 私はガラス |
> +------------+

You say it displays correctly in xterm (i.e. you didn't see these in your
xterm).
There are HTML/XML Unicode character entities, probably generated by your
mailer from your Unicode cut'n'paste.

Using SQL_ASCII to store UTF-8 encoded data will work, but Postgres won't
know that it's manipulating multibyte characters, so for instance the
length of a string will be its byte length instead of a correct count of
the characters, collation rules will be funky, etc. And substring() may
well cut in the middle of a UTF-8 multibyte character, which will then
screw up your application-side processing...

Apart from that, it'll work ;)
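
The byte-versus-character point can be seen outside the database too. What follows is only a sketch in Python 3 of the arithmetic involved, not of Postgres internals: it treats the sample string from the original post as raw UTF-8 bytes, which is roughly how a SQL_ASCII database sees it. The variable names are made up for illustration.

```python
# The sample text from the original post: five characters.
text = "私はガラス"

# What gets stored when the client sends UTF-8: three bytes per character here.
raw = text.encode("utf-8")

char_count = len(text)   # counting characters
byte_count = len(raw)    # counting bytes, as a byte-oriented length() would

# A byte-oriented substring of 4 bytes ends one byte into the second character.
cut = raw[:4]
try:
    cut.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(char_count, byte_count, cut_is_valid)
```

The truncated value is no longer valid UTF-8, which is the kind of breakage the application side would then have to cope with.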



Re: [SQL] Significance of Database Encoding

2005-05-15 Thread Rajesh Mallah

--- PFC <[EMAIL PROTECTED]> wrote:
>
> > +------------+
> > | 私はガラス |
> > +------------+
>
> You say it displays correctly in xterm (i.e. you didn't see these in your
> xterm).
> There are HTML/XML Unicode character entities, probably generated by your
> mailer from your Unicode cut'n'paste.

That is correct.

Now the question is how to convert from SQL_ASCII to UNICODE.
Mailing list posts suggest running recode or iconv on the dump file
and restoring it. The problem is that when I ran iconv with -f US-ASCII,
the program aborted:

$ iconv -f US-ASCII -t UTF-8  < test.sql > out.sql
iconv: illegal input sequence at position 114500

Any ideas on how the job can be accomplished reliably?

Also, my database may contain data in multiple encodings,
like WINDOWS-1251 and WINDOWS-1256, in various places,
as data has been inserted by different people using
different sources and client software.
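
For context on the iconv failure: US-ASCII only covers byte values 0 to 127, so any byte with the high bit set is an illegal input sequence for -f US-ASCII. A minimal Python 3 sketch of locating the first such byte is below. The helper name and the sample statement are invented for illustration, and whether the reported offset matches iconv's "position" exactly is an assumption.

```python
from typing import Optional


def first_non_ascii_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte outside 7-bit ASCII, or None."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return None


# Hypothetical dump fragment: an ASCII statement wrapping one UTF-8 character.
prefix = b"INSERT INTO t_A VALUES ('"
sample = prefix + "私".encode("utf-8") + b"');"

offset = first_non_ascii_offset(sample)
print(offset)
```

On a real dump the same function could be fed the file contents read in binary mode, for example open("test.sql", "rb").read().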


Regds
Rajesh Kumar Mallah.


Re: [SQL] Significance of Database Encoding

2005-05-15 Thread PFC

> $ iconv -f US-ASCII -t UTF-8  < test.sql > out.sql
> iconv: illegal input sequence at position 114500
>
> Any ideas how the job can be accomplished reliably?
>
> Also my database may contain data in multiple encodings
> like WINDOWS-1251 and WINDOWS-1256 in various places
> as data has been inserted by different people using
> different sources and client software.

You could use a simple program like this (in Python):

    # Try each candidate encoding per line; the first that decodes wins.
    # "your dump" and "whatever" are placeholders to replace.
    with open("your dump", "rb") as dump, open("unidump", "wb") as output:
        for line in dump:
            for encoding in ("utf-8", "iso-8859-15", "whatever"):
                try:
                    output.write(line.decode(encoding).encode("utf-8"))
                    break
                except UnicodeError:
                    pass
            else:
                # Runs only if no encoding in the list decoded the line.
                print("No suitable encoding for line...")

I'd say this might work, if UTF-8 cannot absorb an apostrophe inside a
multibyte character. Can it?
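
On that last question: by the design of UTF-8, every byte of a multibyte sequence has its high bit set, so an ASCII byte such as an apostrophe (0x27) or a newline cannot occur inside one. The following Python 3 sketch checks this by brute force over all encodable code points. The function name is made up for illustration, and the loop takes a moment because it covers about a million code points.

```python
def multibyte_contains_ascii(ch: str) -> bool:
    """True if ch encodes to several UTF-8 bytes and one of them is below 0x80."""
    encoded = ch.encode("utf-8")
    return len(encoded) > 1 and any(byte < 0x80 for byte in encoded)


# Every code point from U+0080 up, skipping surrogates (they cannot be encoded).
offenders = [
    cp
    for cp in range(0x80, 0x110000)
    if not 0xD800 <= cp <= 0xDFFF and multibyte_contains_ascii(chr(cp))
]

print(len(offenders))
```

This only settles the question for data that really is UTF-8. Other multibyte encodings do not all share this property.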

Or you could do that to all your tables using SELECTs, but it's going to be
painful...



Re: [SQL] Significance of Database Encoding

2005-05-15 Thread Rajesh Mallah

--- PFC <[EMAIL PROTECTED]> wrote:
> 
> > $ iconv -f US-ASCII -t UTF-8  < test.sql > out.sql
> > iconv: illegal input sequence at position 114500
> >
> > Any ideas how the job can be accomplished reliably?
> >
> > Also my database may contain data in multiple encodings
> > like WINDOWS-1251 and WINDOWS-1256 in various places
> > as data has been inserted by different people using
> > different sources and client software.
>
> You could use a simple program like this (in Python):
>
>     # Try each candidate encoding per line; the first that decodes wins.
>     # "your dump" and "whatever" are placeholders to replace.
>     with open("your dump", "rb") as dump, open("unidump", "wb") as output:
>         for line in dump:
>             for encoding in ("utf-8", "iso-8859-15", "whatever"):
>                 try:
>                     output.write(line.decode(encoding).encode("utf-8"))
>                     break
>                 except UnicodeError:
>                     pass
>             else:
>                 # Runs only if no encoding in the list decoded the line.
>                 print("No suitable encoding for line...")


This may not work, because the conversion to UTF-8 can succeed (no runtime
error) even for an incorrect guess of the original encoding, and the result
will then be valid UTF-8 holding the wrong characters.
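
A small Python 3 sketch of that failure mode, under the assumption that some rows were written by a WINDOWS-1251 client (cp1251 is Python's name for that code page). The sample word and the helper name are invented for illustration. The guessing loop has the same shape as the suggested program: UTF-8 is rejected, but ISO-8859-15 accepts any byte sequence, so it "succeeds" with the wrong text.

```python
original = "Привет"                  # illustrative sample text
stored = original.encode("cp1251")   # bytes as a WINDOWS-1251 client would store them


def guess_decode(raw, encodings=("utf-8", "iso-8859-15")):
    """Return (text, encoding) for the first encoding that decodes without error."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeError:
            continue
    return None, None


text, used = guess_decode(stored)
print(used, text == original)
```

No exception is raised on the wrong guess, so a trial-decoding loop cannot tell single-byte encodings apart; that needs outside knowledge of which rows came from which client.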

Regds
Rajesh Kumar Mallah

