Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-06-05 Thread Tom Lane
Nicolai Tufar [EMAIL PROTECTED] writes:
 Sorry for rising up old issue again but the problem still persists.
 And database cluster is not being created with Turkish locale

I've committed the attached fix, which I believe will solve this
problem.  Could you test it?

(Patch is against 7.4 branch)

regards, tom lane



binuQ5kYk0rs1.bin
Description: downcase.patch

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-02-21 Thread Tom Lane
Nicolai Tufar [EMAIL PROTECTED] writes:
 It looks to me like every use of strcasecmp in the backend has to be
 questioned if we're going to make this work.  I'm starting to lean in
 the direction of tr_TR is hopelessly broken again...

 With this patch applied everything works fine. Thanks!

Did you try running the regression tests under tr_TR locale?  It seems
a few bricks short of fine yet :-(

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-02-21 Thread Nicolai Tufar
 -Original Message-
 From: Tom Lane 
 Did you try running the regression tests under tr_TR locale?  It seems
 a few bricks short of fine yet :-(

I run regression tests under tr_TR locale. To do this I hardcoded
Turkish locale in initdb in pg_regress.sh. Three tests failed, I
attached resulting diff. 

With days of the week, the same problem is with downcasting occurs. I
think it is not that crucial, but the rest of the differences in the
file seem to be important. I was not able to interpret them.

Thanks,
Nicolai







 
   regards, tom lane


regression.diffs
Description: Binary data

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-02-20 Thread Tom Lane
Nicolai Tufar [EMAIL PROTECTED] writes:
 Under locale-ignorant FreeBSD it works fine.
 But under Fedora Core 1 initdb it crashes under all
 locales I tested -C, en_US, tr_TR with message given below.

Hmm.  It seems that tr_TR has problems much more extensive than you've
indicated previously.  I was able to get through initdb with the attached
additional patch, but the regression tests fail in several places.
It looks to me like every use of strcasecmp in the backend has to be
questioned if we're going to make this work.  I'm starting to lean in
the direction of tr_TR is hopelessly broken again...

regards, tom lane


*** src/backend/commands/variable.c~Mon Jan 19 14:04:40 2004
--- src/backend/commands/variable.c Fri Feb 20 23:16:16 2004
***
*** 82,103 
  
/* Ugh. Somebody ought to write a table driven version -- mjl */
  
!   if (strcasecmp(tok, ISO) == 0)
{
newDateStyle = USE_ISO_DATES;
scnt++;
}
!   else if (strcasecmp(tok, SQL) == 0)
{
newDateStyle = USE_SQL_DATES;
scnt++;
}
!   else if (strncasecmp(tok, POSTGRES, 8) == 0)
{
newDateStyle = USE_POSTGRES_DATES;
scnt++;
}
!   else if (strcasecmp(tok, GERMAN) == 0)
{
newDateStyle = USE_GERMAN_DATES;
scnt++;
--- 82,108 
  
/* Ugh. Somebody ought to write a table driven version -- mjl */
  
!   /*
!* Note: SplitIdentifierString already downcased the input, so
!* we needn't use strcasecmp here.
!*/
! 
!   if (strcmp(tok, iso) == 0)
{
newDateStyle = USE_ISO_DATES;
scnt++;
}
!   else if (strcmp(tok, sql) == 0)
{
newDateStyle = USE_SQL_DATES;
scnt++;
}
!   else if (strncmp(tok, postgres, 8) == 0)
{
newDateStyle = USE_POSTGRES_DATES;
scnt++;
}
!   else if (strcmp(tok, german) == 0)
{
newDateStyle = USE_GERMAN_DATES;
scnt++;
***
*** 105,129 
if (ocnt == 0)
newDateOrder = DATEORDER_DMY;
}
!   else if (strcasecmp(tok, YMD) == 0)
{
newDateOrder = DATEORDER_YMD;
ocnt++;
}
!   else if (strcasecmp(tok, DMY) == 0 ||
!strncasecmp(tok, EURO, 4) == 0)
{
newDateOrder = DATEORDER_DMY;
ocnt++;
}
!   else if (strcasecmp(tok, MDY) == 0 ||
!strcasecmp(tok, US) == 0 ||
!strncasecmp(tok, NONEURO, 7) == 0)
{
newDateOrder = DATEORDER_MDY;
ocnt++;
}
!   else if (strcasecmp(tok, DEFAULT) == 0)
{
/*
 * Easiest way to get the current DEFAULT state is to fetch
--- 110,134 
if (ocnt == 0)
newDateOrder = DATEORDER_DMY;
}
!   else if (strcmp(tok, ymd) == 0)
{
newDateOrder = DATEORDER_YMD;
ocnt++;
}
!   else if (strcmp(tok, dmy) == 0 ||
!strncmp(tok, euro, 4) == 0)
{
newDateOrder = DATEORDER_DMY;
ocnt++;
}
!   else if (strcmp(tok, mdy) == 0 ||
!strcmp(tok, us) == 0 ||
!strncmp(tok, noneuro, 7) == 0)
{
newDateOrder = DATEORDER_MDY;
ocnt++;
}
!   else if (strcmp(tok, default) == 0)
{
/*
 * Easiest way to get the current DEFAULT state is to fetch
***
*** 474,480 
HasCTZSet = true;
}
}
!   else if (strcasecmp(value, UNKNOWN) == 0)
{
/*
 * UNKNOWN is the value shown as the default for TimeZone in
--- 479,485 
HasCTZSet = true;
   

Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-02-19 Thread Nicolai Tufar
Sorry for rising up old issue again but the problem still persists.
And database cluster is not being created with Turkish locale

If you have any doubts about how Turkish users will react to the fact
that both I and I WITH DOT will be treated as same character, rest
assured that this behavior is de-facto standard when it comes to file
names, 
identifiers and commands. Greg Stark and Devrim Gunduz will confirm
that, 
no doubt. Please review and apply this patch I send you for the third
time. You will not regret and many users will be grateful. Please note
that to my knowledge it will not break any other locales.


Best regards,
Nicolai


tr20040203.diff
Description: Binary data

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-02-09 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 It seems to me that that's too narrow a definition of the problem.
 I think we should state our goal as we don't want bizarre locale
 definitions to interfere with downcasing of the basic ASCII letters.
 If we put in a special case for 'I' we will fix the known problem
 with Turkish, but what other strange locales might be out there?
 And if we don't trust tolower() for 'I', why should we trust it
 for 'A'-'Z'?

But then wouldn't it be a little weird for Turkish table and column names to
treat I and Ý (I think that's a dotted capital I) as equivalent to i
instead of ý i respectively. (I think that first one was a dotless i).

Perhaps what really ought to be happening is that the downcasing should be
done separately for keywords, or postponed until the point where it's checked
to see if it's a keyword. Then it could be done using an entirely
ascii-centric bit-twiddling implementation.

If it matches an SQL keyword after being downcased the old fashioned way, then
it's an SQL keyword. If not then the locale-aware tolower() would be
appropriate for tables, columns, etc.

But then perhaps that's unnecessarily complex.

-- 
greg


---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-02-09 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 If it matches an SQL keyword after being downcased the old fashioned way, then
 it's an SQL keyword. If not then the locale-aware tolower() would be
 appropriate for tables, columns, etc.

That's exactly what we do already.  The complaint was that the
locale-aware downcasing is broken (not to put it too finely) in Turkish
locales, leading to unexpected/unwanted results for identifiers that are
not keywords.  My own opinion is that the correct response is to fix the
Turkish locale tables, but I can see where that might be beyond the
skills of the average Postgres user.  Thus I thought a reasonable
compromise would be to override the locale for the handling of A-Z,
allowing it to determine what happens to high-bit-set characters only.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-02-09 Thread Greg Stark

Tom Lane [EMAIL PROTECTED] writes:

 Greg Stark [EMAIL PROTECTED] writes:
  If it matches an SQL keyword after being downcased the old fashioned way, then
  it's an SQL keyword. If not then the locale-aware tolower() would be
  appropriate for tables, columns, etc.
 
 That's exactly what we do already.  The complaint was that the
 locale-aware downcasing is broken (not to put it too finely) in Turkish
 locales, leading to unexpected/unwanted results for identifiers that are
 not keywords.  

But the example given was SERIAL. serial is an English word, not a Turkish
word. It shouldn't really be subject to Turkish locale effects at all. Perhaps
keyword wasn't the right word in my message.

I'm wondering if he really expects all identifiers to be subject to this ascii
downcasing. Like, if he had a GÜNAYDIN column he might be surprised to when
günaydýn (where ý is the lowercase dotless i) says column günaydýn doesn't
exist.

Or is the real problem simply that both styles of i really ought to match all
the time, ie, that they should really be considered the same letter for
matches? I wonder if there are other locales where that's an issue.

-- 
greg


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-02-09 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 But the example given was SERIAL. serial is an English word, not a
 Turkish word. It shouldn't really be subject to Turkish locale effects
 at all.

SERIAL is not a keyword according to the grammar.  Neither are PUBLIC,
VOID, INT4, and numerous other examples.  It's not appropriate to try to
fix this by making them all keywords --- that will just create other
problems.  (And where do you draw the line, anyway?  Should every
identifier present in the default system catalogs become a keyword?)

 I'm wondering if he really expects all identifiers to be subject to
 this ascii downcasing.

Without doubt it isn't ideal, but if we don't do something then a lot of
stuff starting with initdb is broken.  We could perhaps work around the
problem by spelling everything in lower-case in all the commands we
issue, but I can't see that as an acceptable answer either.  We can't
expect to control all the SQL sent to a database.

regards, tom lane

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-02-01 Thread Nicolai Tufar
Tom Lane [EMAIL PROTECTED] wrote:

Nicolai Tufar [EMAIL PROTECTED] writes:
 A possible compromise is to apply ASCII downcasing (same as in
 keywords.c) for 7-bit-ASCII characters, and apply tolower() only
 for character codes above 127.  In other words

 If we go this way why not make a special case only and only for 'I'
 Character and not all 7-bit ASCII:

 It seems to me that that's too narrow a definition of the problem.
 I think we should state our goal as we don't want bizarre locale
 definitions to interfere with downcasing of the basic ASCII letters.
 If we put in a special case for 'I' we will fix the known problem
 with Turkish, but what other strange locales might be out there?
 And if we don't trust tolower() for 'I', why should we trust it
 for 'A'-'Z'?

To my knowledge no other locale have similar problems. At least nobody
complained so far while Turk users are rising their voices for many
years 
now. Let try and put this very special case, together with an extensive
explanation in comment and see if someone complains. And by the way,
national characters in table, column, index or function names is
something
that never happens in production databases.

As for 'A'-'Z'^, it was pointed to me that SQL99 standard states that
identifier names need to be downcasted in locale-dependent manner.

Would you like me to create a patch that would touch only 
src/backend/parser/scan.l, introduce a special case for 'I'
and include an explanation in comment?

 What it comes down to is that by training and experience, I always
 expect that any bug might be just one example of a whole class of
bugs.
 You have to look for the related cases that might happen in future,
 not only fix the case that's under your nose.

   regards, tom lane

Regards,
Nicolai Tufar


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-01-31 Thread Nicolai Tufar






 We might think that the Turkish-locale problem Devrim Gunduz pointed out is a must-fix, too. But I'm not convinced yet what to do about it.



Here is a first try
to fix what
Devrim Gunduz talked about.

Please be patient with me for
it is the first major patch
I submit and I realize that I blatantly violated many rules
of good style in PostgreSQL source code.

First, about the problem. Turkish language has two letters i.
One is with dot on top and the other is without.
Simply as that.
The one with
dot has the dot both as capital
and lower-case and
the one without
dot has no dot in both upper and
lower case...
as opposed to English where i has a
dot when lower-case and
has no dot when upper-case.

Problem arise when PostgreSQL, while running with tr_TR locale
converts to lower-case an identifier
as a table, an index or 
a column name. If it is written with capital
I, tolower() with
'I' as argument will return Turkish specific character: 
'i'-without-a-dot what I am afraid
will not be shown correctly
in your e-mail readers.

Let me give
some examples.

initdb script runs apparently innocent script in file
src/backend/utils/mb/conversion_procs/conversion_create.sql
to create a couple of functions whose only fault
was
to declare it their return parameters
as VOID. Backend
returns error message that type
vo d is not found
and
initdb fails.

A nothing suspecting novice user was
excited about 
SERIAL data type he was tail is present
in PostgreSQL.
It took us with Devrim a lot of time to explain why
he
need to type
SERIAL as SERiAL for now till a workaround
is developed.

Another case happened with me
when I wanted to restore
a pg_dump dump. Restore failed because script was creating
scripts that belong to PUBLIC.


For the solution,
after some research we found
out that
offender is tolower() call in src/backend/parser/scan.l
in {identifier} section. tolower() works fine with any
locale and with any character
save for the Turkish locale
and capital 'I' character. So, the obvious solution
is
to put a check for Turkish locale
and 'I' character.
Something like this:

if( locale is Turkish  ident[i]
== 'I' )
 ident[i] =
'i';
else
 ident[i] =
tolower((unsigned char) ident[i]);

Looks rather simple but the hard part was to
figure out 
what is the current locale. To do this I added


const char *get_locale_category(const char *category);

to src/backend/utils/adt/pg_locale.c that would
return
locale identifier for the category
specified or LC_ALL
if category is NULL. I could not find any other function
that will return what I need.
Please help me to find
one because I would hate to
introduce a new function.

I realize that {identifier} section is very performance
critical so I introduced a global variable 

static int isturkishlocale = -1;

at the beginning of src/backend/parser/scan.l
It is set to -1 when not yet initialized, 0 if
locale is not Turkish and 1 if locale
is Turkish.

It might not be the way it is usually
done in PostgreSQL
source code. Could you pleas
advise if the name I chose
is appropriate and whether there is a more appropriate 
place to put declaration and initialization.

Best regards,
Nicolai Tufar  Devrim Gunduz










trpatch.diff
Description: Binary data

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [HACKERS] 7.4.1 release status - Turkish Locale

2004-01-31 Thread Tom Lane
Nicolai Tufar [EMAIL PROTECTED] writes:
 We might think that the Turkish-locale problem Devrim Gunduz pointed out
 is a must-fix, too.  But I'm not convinced yet what to do about it.
 
 Here is a first try to fix what Devrim Gunduz talked about.

I still don't much like having a locale-specific wart in the parser
(and the code you give could not work anyway --- for starters, the
first argument of setlocale is not a pointer).

A possible compromise is to apply ASCII downcasing (same as in
keywords.c) for 7-bit-ASCII characters, and apply tolower() only
for character codes above 127.  In other words

unsigned char ch = (unsigned char) ident[i];

if (ch = 'A'  ch = 'Z')
ch += 'a' - 'A';
else if (ch  127  isupper(ch))
ch = tolower(ch);
ident[i] = (char) ch;

In reasonably sane locales this will have the same effects as currently,
while in unsane locales it will ensure that basic-ASCII identifiers are
treated the way we want.

Comments?

regards, tom lane

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match