Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit)chars?

2001-05-07 Thread Tony Grant

On 04 May 2001 15:44:23 -0400, Tom Lane wrote:

Back from the weekend with sunburn (a very important sign that it has
stopped raining here in the west of Europe).

 
>> All this does
>> is move the problem from being one that non-english countries have to
>> being one where it is a non-english and non-western european problem
>> (eg. Eastern Europe, Russia, etc.).
>
> Nonsense.  The non-Western-European folks see broken behavior now
> anyway, unless they compile with MULTIBYTE and set an appropriate
> encoding.  How would this make their lives worse, or even different?
>
> I'm merely suggesting that the default behavior could be made useful
> to a larger set of people than it now is, without making things any
> worse for those that it's not useful to.

This reminds me of e-mail software when I joined the net. Software that
handled only 7-bit ASCII made the use of accents impossible, so we learnt
to type without them or put up with garbage in our mail.

I must agree with Tom here. There is a 256-character alphabet which is
standard in many languages. For North America, Spanish and French spring
to mind. How are you going to build a common market if these two
languages plus Brazilian Portuguese are not supported in business
software?

Multibyte is supported for other alphabets. This is already a wonderful
achievement for those concerned.

The standard backend should in my opinion support the LATIN alphabet. US
ASCII is a subset of that alphabet, it is not _the_ alphabet.

The JDBC and Java itself should also support the whole alphabet. All
this should be transparent for the programmer and the end user. Another
battle to be fought...

Cheers

Tony Grant

-- 
RedHat Linux on Sony Vaio C1XD/S
http://www.animaproductions.com/linux2.html
Macromedia UltraDev with PostgreSQL
http://www.animaproductions.com/ultra.html


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



MULTIBYTE and SQL_ASCII (was Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?)

2001-05-05 Thread Tom Lane

[ thread renamed and cross-posted to pghackers, since this isn't only
about JDBC anymore ]

Barry Lind [EMAIL PROTECTED] writes:
> The basic issue I have is that the server is providing an API to the
> client to get the character encoding for the database and that API can
> report incorrect information to the client.

I don't have any objection to changing the system so that even a
non-MULTIBYTE server can store and return encoding settings.
(Presumably it should only accept encoding settings that correspond
to single-byte encodings.)  That can't happen before 7.2, however,
as the necessary changes are a bit larger than I'd care to shoehorn
into a 7.1.* release.

> Thus I would be happy if getdatabaseencoding() returned 'UNKNOWN' or
> something similar when in fact it doesn't know what the encoding is
> (i.e. when not compiled with multibyte).

I have a philosophical difference with this: basically, I think that
since SQL_ASCII is the default value, you probably ought to assume that
it's not too trustworthy.  The software can *never* be said to KNOW what
the data encoding is; at most it knows what it's been told, and in the
case of a default it probably hasn't been told anything.  I'd argue that
SQL_ASCII should be interpreted in the way you are saying UNKNOWN
ought to be: ie, it's an unspecified 8-bit encoding (and from there
it's not much of a jump to deciding to treat it as LATIN1, if you're
forced to do conversion to Unicode or whatever).  Certainly, seeing
SQL_ASCII from the server is not license to throw away data, which is
what JDBC is doing now.
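
A minimal sketch of the interpretation argued for above, not the actual driver code. The helper name `charsetFor` and the small name table are hypothetical, and reading UNICODE as UTF-8 is an assumption. The point is only that a reported SQL_ASCII is handled as "unspecified 8-bit" and decoded as LATIN1, so no byte is discarded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the real JDBC driver: chooses a Java charset
// for the encoding name reported by getdatabaseencoding().
final class EncodingFallback {
    static Charset charsetFor(String serverEncoding) {
        switch (serverEncoding) {
            case "UNICODE":   // assumed here to denote UTF-8
                return StandardCharsets.UTF_8;
            case "LATIN1":
                return StandardCharsets.ISO_8859_1;
            case "SQL_ASCII": // the default: nobody told the server anything
            default:          // unknown names get the same lossless fallback (an assumption)
                return StandardCharsets.ISO_8859_1;
        }
    }

    public static void main(String[] args) {
        // Two bytes with the high bit set, as a LATIN1 client would store "äö".
        byte[] fromServer = {(byte) 0xE4, (byte) 0xF6};
        String s = new String(fromServer, charsetFor("SQL_ASCII"));
        System.out.println(s.equals("\u00E4\u00F6")); // true
    }
}
```

If the data is really in some other 8-bit encoding the characters come out wrong, but the bytes can still be recovered by re-encoding as ISO-8859-1, which a conversion to '?' does not allow.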

> PS.  Note that if multibyte is enabled, the functionality that is being
> complained about here in the jdbc client is apparently ok for the server
> to do.  If you insert a value into a text column on a SQL_ASCII database
> with multibyte enabled and that value contains 8bit characters, those
> 8bit characters will be quietly replaced with a dummy character since
> they are invalid for the SQL_ASCII 7bit character set.

I have not tried it, but if the backend does that then I'd argue that
that's a bug too.  To my mind, a MULTIBYTE backend operating in
SQL_ASCII encoding ought to behave the same as a non-MULTIBYTE backend:
transparent pass-through of characters with the high bit set.  But I'm
not a multibyte guru.  Comments anyone?
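
To illustrate what "transparent pass-through" means on the Java side, here is a small sketch (the class and method names are made up for this example). It counts how many of the 256 possible byte values survive a bytes -> String -> bytes round trip under a given charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PassThroughCheck {
    // Counts how many of the 256 byte values come back unchanged after
    // decoding to a String and encoding again with the same charset.
    static int survivors(Charset cs) {
        int count = 0;
        for (int b = 0; b < 256; b++) {
            byte[] in = {(byte) b};
            byte[] out = new String(in, cs).getBytes(cs);
            if (out.length == 1 && out[0] == in[0]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(survivors(StandardCharsets.ISO_8859_1)); // 256
        System.out.println(survivors(StandardCharsets.US_ASCII));   // 128
    }
}
```

Under ISO-8859-1 every byte survives, which is the pass-through behavior; under strict US-ASCII every byte with the high bit set is lost.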

regards, tom lane




[JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?

2001-05-04 Thread Jani Averbach

On Thu, 3 May 2001, Barry Lind wrote:

> With regards to your specific problem, my guess is that you haven't
> created your database with the proper character set for the data you are
> storing in it.  I am guessing you simply used the default SQL_ASCII
> character set for your created database and therefore only the first 127
> characters are defined.  Any characters above 127 will be returned by
> java as ?'s.
>
> If this is the case you will need to recreate your database with the
> proper character set for the data you are storing in it and then
> everything should be fine.
 

Thanks, you are right!

The main problem was that I had not enabled multibyte support for the
database. (I believed the fairy tale and supposed that a correct locale
setting would be enough.)

So my humble wish is that the instructions in the INSTALL file should be
corrected.
Because:

 --enable-multibyte
 
  Allows the use of multibyte character encodings. This is
  primarily for languages like Japanese, Korean, and Chinese. Read
  the Administrator's Guide for details.

I think that this is a little bit misleading.

The correct information is in the Administrator's Guide, so I should
have read the Guide, but... The world would be a much better place
if the installation instructions briefly mentioned that this
also concerns 8-bit chars...


But anyway, it works fine now, thanks!

BR, Jani

---
Jani Averbach





Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?

2001-05-04 Thread Tom Lane

Barry Lind [EMAIL PROTECTED] writes:
> With regards to your specific problem, my guess is that you haven't
> created your database with the proper character set for the data you are
> storing in it.  I am guessing you simply used the default SQL_ASCII
> character set for your created database and therefore only the first 127
> characters are defined.  Any characters above 127 will be returned by
> java as ?'s.

Does this happen with a non-multibyte-compiled database?  If so, I'd
argue that's a serious bug in the JDBC code: it makes JDBC unusable
for non-ASCII 8-bit character sets, unless one puts up with the overhead
of MULTIBYTE support.

regards, tom lane




Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit)chars?

2001-05-04 Thread Tony Grant

On 04 May 2001 11:40:48 -0400, Tom Lane wrote:

>> I fought with this for a few days. The solution is to dump the database
>> and create a new database with the correct encoding.
>>
>> MULTIBYTE is not necessary, I just set the type to LATIN1 and it works
>> fine.
>
> But a non-MULTIBYTE backend doesn't even have the concept of setting
> the encoding --- it will always just report SQL_ASCII.

OK, I just read the configure script for my backend. You guessed it:
multibyte support and locale support are compiled in there... So
createdb -E LATIN1 works just fine =:-b

> Perhaps what this really says is that it'd be better if the JDBC code
> assumed LATIN1 translations when the backend claims SQL_ASCII.
> Certainly, translating all high-bit-set characters to '?' is about as
> uselessly obstructionist a policy as I can think of...


I will be adding this snippet to my doc on techdocs in the French
version. It will save somebody a lot of head scratching.

Cheers
Tony Grant


-- 
RedHat Linux on Sony Vaio C1XD/S
http://www.animaproductions.com/linux2.html





Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit)chars?

2001-05-04 Thread Tony Grant

On 04 May 2001 10:29:50 -0400, Tom Lane wrote:

>> With regards to your specific problem, my guess is that you haven't
>> created your database with the proper character set for the data you are
>> storing in it.  I am guessing you simply used the default SQL_ASCII
>> character set for your created database and therefore only the first 127
>> characters are defined.  Any characters above 127 will be returned by
>> java as ?'s.
>
> Does this happen with a non-multibyte-compiled database?  If so, I'd
> argue that's a serious bug in the JDBC code: it makes JDBC unusable
> for non-ASCII 8-bit character sets, unless one puts up with the overhead
> of MULTIBYTE support.

I fought with this for a few days. The solution is to dump the database
and create a new database with the correct encoding.

MULTIBYTE is not necessary, I just set the type to LATIN1 and it works
fine.

Queries even work on accented characters!!!

I have a demo database for those interested.

Cheers

Tony Grant



-- 
RedHat Linux on Sony Vaio C1XD/S
http://www.animaproductions.com/linux2.html
Ultradev and PostgreSQL
http://www.animaproductions.com/ultra.html





Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit)chars?

2001-05-04 Thread Tony Grant

On 04 May 2001 11:40:48 -0400, Tom Lane wrote:

 
> But a non-MULTIBYTE backend doesn't even have the concept of setting
> the encoding --- it will always just report SQL_ASCII.

What kind of error message does createdb -E LATIN1 give on a
non-MULTIBYTE backend?

Maybe there needs to be a note somewhere informing people from Europe
that they too need MULTIBYTE as an option at compile time. i.e. In a
bright yellow box in the HTML docs...

And in the Reference manual and man pages the -E option for createdb
needs a note to specify that it applies to MULTIBYTE backends only. 

Cheers

Tony Grant

-- 
RedHat Linux on Sony Vaio C1XD/S
http://www.animaproductions.com/linux2.html





Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?

2001-05-04 Thread Tom Lane

Tony Grant [EMAIL PROTECTED] writes:
> What kind of error message does createdb -E LATIN1 give on a non
> MULTIBYTE backend?

$ createdb -E LATIN1 foo
/home/postgres/testversion/bin/createdb[143]: /home/postgres/testversion/bin/pg_encoding:  not found.
createdb: LATIN1 is not a valid encoding name
$

> Maybe there needs to be a note somewhere informing people from Europe
> that they too need MULTIBYTE as an option at compile time. i.e. In a
> bright yellow box in the HTML docs...

But they *should not* need it, if they only want to use an 8-bit character
set.  Locale support should be enough.  Or so I would think, anyway.
I have to admit I have not looked very closely at the functionality
that's enabled by MULTIBYTE; is any of it really needed to deal with
LATINn character sets?

regards, tom lane




[JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?

2001-05-04 Thread Barry Lind



Tony Grant wrote:

> On 04 May 2001 11:40:48 -0400, Tom Lane wrote:
>
>> But a non-MULTIBYTE backend doesn't even have the concept of setting
>> the encoding --- it will always just report SQL_ASCII.
>
> What kind of error message does createdb -E LATIN1 give on a non
> MULTIBYTE backend?
>
> Maybe there needs to be a note somewhere informing people from Europe
> that they too need MULTIBYTE as an option at compile time. i.e. In a
> bright yellow box in the HTML docs...
>
> And in the Reference manual and man pages the -E option for createdb
> needs a note to specify that it applies to MULTIBYTE backends only.
>
> Cheers
>
> Tony Grant
 
The errors you get are:
from createdb-

$ createdb -E LATIN1 testdb
/usr/local/pgsql/bin/createdb: /usr/local/pgsql/bin/pg_encoding: No such file or directory
createdb: LATIN1 is not a valid encoding name

and from psql-

template1=# create database testdb with encoding = 'LATIN1';
ERROR:  Multi-byte support is not enabled

thanks,
--Barry





Re: [JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?

2001-05-04 Thread Tom Lane

Barry Lind [EMAIL PROTECTED] writes:
> Now it is an easy change in the jdbc code to use LATIN1 when the server
> reports SQL_ASCII, but I really dislike hardcoding support that only
> works in english speaking countries and Western Europe.

What's wrong with that?  It won't be any more broken for people who are
not really using LATIN1, and it will be considerably less broken for
those who are.  Seems like a net win to me, even without making the
obvious point about where the majority of Postgres users are.

It probably would be a good idea to allow the backend to store an
indication of character set even when not compiled for MULTIBYTE,
but that's not the issue here.  To me, the issue is whether JDBC
makes a reasonable effort not to munge data when presented with
a backend that claims to be using SQL_ASCII (which, let me remind
you, is the default setting).  Converting high-bit-set characters
to '?' is almost certainly NOT what the user wants you to do.
Converting on the assumption of LATIN1 will make a lot of people
happy, and the people who aren't happy with it will certainly not
be happy with '?' conversion either.

> All this does
> is move the problem from being one that non-english countries have to
> being one where it is a non-english and non-western european problem
> (eg. Eastern Europe, Russia, etc.).

Nonsense.  The non-Western-European folks see broken behavior now
anyway, unless they compile with MULTIBYTE and set an appropriate
encoding.  How would this make their lives worse, or even different?

I'm merely suggesting that the default behavior could be made useful
to a larger set of people than it now is, without making things any
worse for those that it's not useful to.

regards, tom lane




[JDBC] Re: A bug with pgsql 7.1/jdbc and non-ascii (8-bit) chars?

2001-05-03 Thread Barry Lind

Since Java uses unicode (ucs2) internally for all Strings, the jdbc code 
always needs to do character set conversions for all strings it gets 
from the database.  In the 7.0 drivers this was done by assuming the 
database was using the same character set as the default on the client, 
which is incorrect for a number of reasons.  In 7.1 the jdbc code asks 
the database what character set it is using and does the conversion from 
the server character set to the java unicode strings.

Now it turns out that Postgres is a little lax in its character set 
support, so you can very easily insert char/varchar/text with values 
that fall outside the range of valid values for a given character set 
(and psql doesn't really care either).  However in java since we must do 
character set conversion to unicode, it does make a difference and any 
values that were inserted that are incorrect with regards to the 
database character set will be reported as ?'s in java.

With regards to your specific problem, my guess is that you haven't
created your database with the proper character set for the data you are
storing in it.  I am guessing you simply used the default SQL_ASCII
character set for your created database and therefore only the first 128
characters (codes 0 through 127) are defined.  Any characters above 127
will be returned by java as ?'s.
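
The effect described here can be reproduced without a database. The sketch below is an illustration only (the class name and byte values are chosen for this example, not taken from the driver): it decodes the LATIN1 bytes for "ÖÄÅ" once as strict 7-bit ASCII and once as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

final class QuestionMarkDemo {
    // "ÖÄÅ" as three single bytes in LATIN1 (ISO-8859-1).
    static final byte[] STORED = {(byte) 0xD6, (byte) 0xC4, (byte) 0xC5};

    // Strict 7-bit decoding: each byte above 127 is undefined and is
    // replaced by a substitution character (the 7.1 driver showed '?').
    static String decodeAsAscii() {
        return new String(STORED, StandardCharsets.US_ASCII);
    }

    // LATIN1 decoding: each byte maps to exactly one Unicode character.
    static String decodeAsLatin1() {
        return new String(STORED, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decodeAsLatin1().equals("\u00D6\u00C4\u00C5")); // true
        System.out.println(decodeAsAscii().equals("\u00D6\u00C4\u00C5"));  // false
    }
}
```

Both results are three characters long; only the LATIN1 decoding keeps the original letters.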

If this is the case you will need to recreate your database with the 
proper character set for the data you are storing in it and then 
everything should be fine.

thanks,
--Barry

Jani Averbach wrote:

> Hi!
>
> I have the following problem:
>
> Environment:
>
> The database is postgresql v7.1, with locale-support
> jdk is sun's jdk1.3.0_02
> and jdbc is that one which comes with postgres (type 2).
>
> Both the database and the jdbc driver have been built by myself.
>
> OS: RH 6.2 based linux with 2.4.3 kernel, glibc 2.1.3.
>
> The problem:
>
> There is a database which contains fields (the field's type is 'text')
> with the Scandinavian alphabet. (Especially ÖÄÅöäå (odiaeresis, adiaeresis,
> aring, or in other words, oe, ae, and a with ring above it)).
>
> The database has been installed, created and used with the
> LC_ALL=finnish and LANG=fi_FI environment variables in effect.
>
> Ok, the problem:
>
> When I try to read those fields, I get question marks instead of those
> 8-bit Scandinavian chars.
>
> I have checked my java programs and the database. (In fact, the same
> problem appears with postgres-7.1/src/interfaces/jdbc/example/psql.java.)
> In general, my java environment works fine with 8-bit chars, and with psql
> (not the java one) everything is fine with those fields with
> 8-bit chars.
>
> So my question is, am I doing something wrong, or is there a bug in the
> pgsql-jdbc?
>
> If this is a bug or you otherwise need help or more information, please
> let me know. I will try to help as much as possible to hunt this one
> down.
>
> If I am doing something stupid, I would very much like to know it...
>
> BR, Jani
>
> ---
> Jani Averbach
 
 
 
 
 
 

