Re: [HACKERS] 7.4.1 release status - Turkish Locale
Nicolai Tufar [EMAIL PROTECTED] writes: Sorry for rising up old issue again but the problem still persists. And database cluster is not being created with Turkish locale I've committed the attached fix, which I believe will solve this problem. Could you test it? (Patch is against 7.4 branch) regards, tom lane binuQ5kYk0rs1.bin Description: downcase.patch ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] 7.4.1 release status - Turkish Locale
Nicolai Tufar [EMAIL PROTECTED] writes: It looks to me like every use of strcasecmp in the backend has to be questioned if we're going to make this work. I'm starting to lean in the direction of tr_TR is hopelessly broken again... With this patch applied everything works fine. Thanks! Did you try running the regression tests under tr_TR locale? It seems a few bricks short of fine yet :-( regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] 7.4.1 release status - Turkish Locale
-Original Message- From: Tom Lane Did you try running the regression tests under tr_TR locale? It seems a few bricks short of fine yet :-( I run regression tests under tr_TR locale. To do this I hardcoded Turkish locale in initdb in pg_regress.sh. Three tests failed, I attached resulting diff. With days of the week, the same problem is with downcasting occurs. I think it is not that crucial, but the rest of the differences in the file seem to be important. I was not able to interpret them. Thanks, Nicolai regards, tom lane regression.diffs Description: Binary data ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] 7.4.1 release status - Turkish Locale
Nicolai Tufar [EMAIL PROTECTED] writes: Under locale-ignorant FreeBSD it works fine. But under Fedora Core 1 initdb it crashes under all locales I tested -C, en_US, tr_TR with message given below. Hmm. It seems that tr_TR has problems much more extensive than you've indicated previously. I was able to get through initdb with the attached additional patch, but the regression tests fail in several places. It looks to me like every use of strcasecmp in the backend has to be questioned if we're going to make this work. I'm starting to lean in the direction of tr_TR is hopelessly broken again... regards, tom lane *** src/backend/commands/variable.c~Mon Jan 19 14:04:40 2004 --- src/backend/commands/variable.c Fri Feb 20 23:16:16 2004 *** *** 82,103 /* Ugh. Somebody ought to write a table driven version -- mjl */ ! if (strcasecmp(tok, ISO) == 0) { newDateStyle = USE_ISO_DATES; scnt++; } ! else if (strcasecmp(tok, SQL) == 0) { newDateStyle = USE_SQL_DATES; scnt++; } ! else if (strncasecmp(tok, POSTGRES, 8) == 0) { newDateStyle = USE_POSTGRES_DATES; scnt++; } ! else if (strcasecmp(tok, GERMAN) == 0) { newDateStyle = USE_GERMAN_DATES; scnt++; --- 82,108 /* Ugh. Somebody ought to write a table driven version -- mjl */ ! /* !* Note: SplitIdentifierString already downcased the input, so !* we needn't use strcasecmp here. !*/ ! ! if (strcmp(tok, iso) == 0) { newDateStyle = USE_ISO_DATES; scnt++; } ! else if (strcmp(tok, sql) == 0) { newDateStyle = USE_SQL_DATES; scnt++; } ! else if (strncmp(tok, postgres, 8) == 0) { newDateStyle = USE_POSTGRES_DATES; scnt++; } ! else if (strcmp(tok, german) == 0) { newDateStyle = USE_GERMAN_DATES; scnt++; *** *** 105,129 if (ocnt == 0) newDateOrder = DATEORDER_DMY; } ! else if (strcasecmp(tok, YMD) == 0) { newDateOrder = DATEORDER_YMD; ocnt++; } ! else if (strcasecmp(tok, DMY) == 0 || !strncasecmp(tok, EURO, 4) == 0) { newDateOrder = DATEORDER_DMY; ocnt++; } ! else if (strcasecmp(tok, MDY) == 0 || !strcasecmp(tok, US) == 0 || !strncasecmp(tok, NONEURO, 7) == 0) { newDateOrder = DATEORDER_MDY; ocnt++; } ! else if (strcasecmp(tok, DEFAULT) == 0) { /* * Easiest way to get the current DEFAULT state is to fetch --- 110,134 if (ocnt == 0) newDateOrder = DATEORDER_DMY; } ! else if (strcmp(tok, ymd) == 0) { newDateOrder = DATEORDER_YMD; ocnt++; } ! else if (strcmp(tok, dmy) == 0 || !strncmp(tok, euro, 4) == 0) { newDateOrder = DATEORDER_DMY; ocnt++; } ! else if (strcmp(tok, mdy) == 0 || !strcmp(tok, us) == 0 || !strncmp(tok, noneuro, 7) == 0) { newDateOrder = DATEORDER_MDY; ocnt++; } ! else if (strcmp(tok, default) == 0) { /* * Easiest way to get the current DEFAULT state is to fetch *** *** 474,480 HasCTZSet = true; } } ! else if (strcasecmp(value, UNKNOWN) == 0) { /* * UNKNOWN is the value shown as the default for TimeZone in --- 479,485 HasCTZSet = true;
Re: [HACKERS] 7.4.1 release status - Turkish Locale
Sorry for rising up old issue again but the problem still persists. And database cluster is not being created with Turkish locale If you have any doubts about how Turkish users will react to the fact that both I and I WITH DOT will be treated as same character, rest assured that this behavior is de-facto standard when it comes to file names, identifiers and commands. Greg Stark and Devrim Gunduz will confirm that, no doubt. Please review and apply this patch I send you for the third time. You will not regret and many users will be grateful. Please note that to my knowledge it will not break any other locales. Best regards, Nicolai tr20040203.diff Description: Binary data ---(end of broadcast)--- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] 7.4.1 release status - Turkish Locale
Tom Lane [EMAIL PROTECTED] writes: It seems to me that that's too narrow a definition of the problem. I think we should state our goal as we don't want bizarre locale definitions to interfere with downcasing of the basic ASCII letters. If we put in a special case for 'I' we will fix the known problem with Turkish, but what other strange locales might be out there? And if we don't trust tolower() for 'I', why should we trust it for 'A'-'Z'? But then wouldn't it be a little weird for Turkish table and column names to treat I and Ý (I think that's a dotted capital I) as equivalent to i instead of ý i respectively. (I think that first one was a dotless i). Perhaps what really ought to be happening is that the downcasing should be done separately for keywords, or postponed until the point where it's checked to see if it's a keyword. Then it could be done using an entirely ascii-centric bit-twiddling implementation. If it matches an SQL keyword after being downcased the old fashioned way, then it's an SQL keyword. If not then the locale-aware tolower() would be appropriate for tables, columns, etc. But then perhaps that's unnecessarily complex. -- greg ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] 7.4.1 release status - Turkish Locale
Greg Stark [EMAIL PROTECTED] writes: If it matches an SQL keyword after being downcased the old fashioned way, then it's an SQL keyword. If not then the locale-aware tolower() would be appropriate for tables, columns, etc. That's exactly what we do already. The complaint was that the locale-aware downcasing is broken (not to put it too finely) in Turkish locales, leading to unexpected/unwanted results for identifiers that are not keywords. My own opinion is that the correct response is to fix the Turkish locale tables, but I can see where that might be beyond the skills of the average Postgres user. Thus I thought a reasonable compromise would be to override the locale for the handling of A-Z, allowing it to determine what happens to high-bit-set characters only. regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] 7.4.1 release status - Turkish Locale
Tom Lane [EMAIL PROTECTED] writes: Greg Stark [EMAIL PROTECTED] writes: If it matches an SQL keyword after being downcased the old fashioned way, then it's an SQL keyword. If not then the locale-aware tolower() would be appropriate for tables, columns, etc. That's exactly what we do already. The complaint was that the locale-aware downcasing is broken (not to put it too finely) in Turkish locales, leading to unexpected/unwanted results for identifiers that are not keywords. But the example given was SERIAL. serial is an English word, not a Turkish word. It shouldn't really be subject to Turkish locale effects at all. Perhaps keyword wasn't the right word in my message. I'm wondering if he really expects all identifiers to be subject to this ascii downcasing. Like, if he had a GÜNAYDIN column he might be surprised to when günaydýn (where ý is the lowercase dotless i) says column günaydýn doesn't exist. Or is the real problem simply that both styles of i really ought to match all the time, ie, that they should really be considered the same letter for matches? I wonder if there are other locales where that's an issue. -- greg ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] 7.4.1 release status - Turkish Locale
Greg Stark [EMAIL PROTECTED] writes: But the example given was SERIAL. serial is an English word, not a Turkish word. It shouldn't really be subject to Turkish locale effects at all. SERIAL is not a keyword according to the grammar. Neither are PUBLIC, VOID, INT4, and numerous other examples. It's not appropriate to try to fix this by making them all keywords --- that will just create other problems. (And where do you draw the line, anyway? Should every identifier present in the default system catalogs become a keyword?) I'm wondering if he really expects all identifiers to be subject to this ascii downcasing. Without doubt it isn't ideal, but if we don't do something then a lot of stuff starting with initdb is broken. We could perhaps work around the problem by spelling everything in lower-case in all the commands we issue, but I can't see that as an acceptable answer either. We can't expect to control all the SQL sent to a database. regards, tom lane ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings
Re: [HACKERS] 7.4.1 release status - Turkish Locale
Tom Lane [EMAIL PROTECTED] wrote: Nicolai Tufar [EMAIL PROTECTED] writes: A possible compromise is to apply ASCII downcasing (same as in keywords.c) for 7-bit-ASCII characters, and apply tolower() only for character codes above 127. In other words If we go this way why not make a special case only and only for 'I' Character and not all 7-bit ASCII: It seems to me that that's too narrow a definition of the problem. I think we should state our goal as we don't want bizarre locale definitions to interfere with downcasing of the basic ASCII letters. If we put in a special case for 'I' we will fix the known problem with Turkish, but what other strange locales might be out there? And if we don't trust tolower() for 'I', why should we trust it for 'A'-'Z'? To my knowledge no other locale have similar problems. At least nobody complained so far while Turk users are rising their voices for many years now. Let try and put this very special case, together with an extensive explanation in comment and see if someone complains. And by the way, national characters in table, column, index or function names is something that never happens in production databases. As for 'A'-'Z'^, it was pointed to me that SQL99 standard states that identifier names need to be downcasted in locale-dependent manner. Would you like me to create a patch that would touch only src/backend/parser/scan.l, introduce a special case for 'I' and include an explanation in comment? What it comes down to is that by training and experience, I always expect that any bug might be just one example of a whole class of bugs. You have to look for the related cases that might happen in future, not only fix the case that's under your nose. regards, tom lane Regards, Nicolai Tufar ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] 7.4.1 release status - Turkish Locale
We might think that the Turkish-locale problem Devrim Gunduz pointed out is a must-fix, too. But I'm not convinced yet what to do about it. Here is a first try to fix what Devrim Gunduz talked about. Please be patient with me for it is the first major patch I submit and I realize that I blatantly violated many rules of good style in PostgreSQL source code. First, about the problem. Turkish language has two letters i. One is with dot on top and the other is without. Simply as that. The one with dot has the dot both as capital and lower-case and the one without dot has no dot in both upper and lower case... as opposed to English where i has a dot when lower-case and has no dot when upper-case. Problem arise when PostgreSQL, while running with tr_TR locale converts to lower-case an identifier as a table, an index or a column name. If it is written with capital I, tolower() with 'I' as argument will return Turkish specific character: 'i'-without-a-dot what I am afraid will not be shown correctly in your e-mail readers. Let me give some examples. initdb script runs apparently innocent script in file src/backend/utils/mb/conversion_procs/conversion_create.sql to create a couple of functions whose only fault was to declare it their return parameters as VOID. Backend returns error message that type vo d is not found and initdb fails. A nothing suspecting novice user was excited about SERIAL data type he was tail is present in PostgreSQL. It took us with Devrim a lot of time to explain why he need to type SERIAL as SERiAL for now till a workaround is developed. Another case happened with me when I wanted to restore a pg_dump dump. Restore failed because script was creating scripts that belong to PUBLIC. For the solution, after some research we found out that offender is tolower() call in src/backend/parser/scan.l in {identifier} section. tolower() works fine with any locale and with any character save for the Turkish locale and capital 'I' character. So, the obvious solution is to put a check for Turkish locale and 'I' character. Something like this: if( locale is Turkish ident[i] == 'I' ) ident[i] = 'i'; else ident[i] = tolower((unsigned char) ident[i]); Looks rather simple but the hard part was to figure out what is the current locale. To do this I added const char *get_locale_category(const char *category); to src/backend/utils/adt/pg_locale.c that would return locale identifier for the category specified or LC_ALL if category is NULL. I could not find any other function that will return what I need. Please help me to find one because I would hate to introduce a new function. I realize that {identifier} section is very performance critical so I introduced a global variable static int isturkishlocale = -1; at the beginning of src/backend/parser/scan.l It is set to -1 when not yet initialized, 0 if locale is not Turkish and 1 if locale is Turkish. It might not be the way it is usually done in PostgreSQL source code. Could you pleas advise if the name I chose is appropriate and whether there is a more appropriate place to put declaration and initialization. Best regards, Nicolai Tufar Devrim Gunduz trpatch.diff Description: Binary data ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] 7.4.1 release status - Turkish Locale
Nicolai Tufar [EMAIL PROTECTED] writes: We might think that the Turkish-locale problem Devrim Gunduz pointed out is a must-fix, too. But I'm not convinced yet what to do about it. Here is a first try to fix what Devrim Gunduz talked about. I still don't much like having a locale-specific wart in the parser (and the code you give could not work anyway --- for starters, the first argument of setlocale is not a pointer). A possible compromise is to apply ASCII downcasing (same as in keywords.c) for 7-bit-ASCII characters, and apply tolower() only for character codes above 127. In other words unsigned char ch = (unsigned char) ident[i]; if (ch = 'A' ch = 'Z') ch += 'a' - 'A'; else if (ch 127 isupper(ch)) ch = tolower(ch); ident[i] = (char) ch; In reasonably sane locales this will have the same effects as currently, while in unsane locales it will ensure that basic-ASCII identifiers are treated the way we want. Comments? regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match