Re: [HACKERS] Lexing with different charsets

2004-04-14 Thread Fabien COELHO

 My next question is about lexing. The spec says that one can use strings
 of different charsets in the queries, like:

   ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'

 I can see that the lexer either needs to be taught about all the
 different charsets or this is not going to work very well.

Sorry for this maybe stupid question about a must-be-obvious hidden
rationale behind this feature:

What editor or terminal is supposed to be able to generate text in
different encodings depending on the part of the sentence? I don't think I
have that in emacs. Or is it irrelevant??

I cannot see where I could use such a feature.

-- 
Fabien Coelho - [EMAIL PROTECTED]

---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [HACKERS] Lexing with different charsets

2004-04-14 Thread Dennis Bjorklund
On Wed, 14 Apr 2004, Fabien COELHO wrote:

   ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
 
  I can see that the lexer either needs to be taught about all the
  different charsets or this is not going to work very well.
 
 What editor or terminal is supposed to be able to generate text in
 different encodings depending on the part of the sentence? I don't think I
 have that in emacs. Or is it irrelevant??
 
 I cannot see where I could use such a feature.

Applications usually generate queries. So you can do things like

printf ("SELECT * FROM foo WHERE field1 = _latin1'%s';", my_latin1_data);

For use on the terminal, one would need some escaping/encoding, much like
what is done with bytea. For example, something like _latin1 H'0a660d' (but
that is not SQL-standard).
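A minimal sketch of how an application might build such an escaped literal from raw bytes. The H'...' notation is only the non-standard form suggested above, and the function name format_hex_literal is hypothetical; nothing here is an existing PostgreSQL feature.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Write "<introducer> H'<hex digits>'" into out.
 * Returns 0 on success, -1 if the buffer is too small.
 * The H'...' form is an assumption taken from the suggestion above.
 */
int format_hex_literal(char *out, size_t outlen, const char *introducer,
                       const unsigned char *data, size_t n)
{
    static const char digits[] = "0123456789abcdef";
    int used = snprintf(out, outlen, "%s H'", introducer);
    size_t pos;

    if (used < 0 || (size_t)used >= outlen)
        return -1;
    pos = (size_t)used;

    /* Two hex digits per byte, plus the closing quote and the NUL. */
    if (outlen - pos < 2 * n + 2)
        return -1;

    for (size_t i = 0; i < n; i++) {
        out[pos++] = digits[data[i] >> 4];
        out[pos++] = digits[data[i] & 0x0f];
    }
    out[pos++] = '\'';
    out[pos] = '\0';
    return 0;
}
```

Because the output contains only ASCII characters, it can be typed or pasted on any terminal regardless of the encoding of the underlying data.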

-- 
/Dennis Björklund




Re: [HACKERS] Lexing with different charsets

2004-04-14 Thread Fabien COELHO

  I cannot see where I could use such a feature.

 Applications usually generate queries.

Sure.

 So you can do things like

 printf ("SELECT * FROM foo WHERE field1 = _latin1'%s';", my_latin1_data);

Hmmm... I guess the following was too complicated. You need a library
for conversion. You need to take care of conversions.

printf("SELECT * FROM foo WHERE field1 = '%s'",
   latin1_to_database_encoding(...));
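As an illustration only, here is a sketch of what such a conversion helper could look like if the database encoding happened to be UTF-8. The name latin1_to_utf8 is hypothetical, and the sketch does no SQL quoting of the result.

```c
#include <stddef.h>

/*
 * Convert n Latin-1 bytes to NUL-terminated UTF-8.
 * Returns the number of bytes written (excluding the NUL),
 * or (size_t)-1 if out is too small.
 * Assumption: the target (database) encoding is UTF-8.
 */
size_t latin1_to_utf8(const unsigned char *in, size_t n,
                      char *out, size_t outlen)
{
    size_t pos = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        size_t need = (c < 0x80) ? 1 : 2;

        /* Always keep room for the terminating NUL. */
        if (outlen < pos + need + 1)
            return (size_t)-1;

        if (c < 0x80) {
            out[pos++] = (char)c;                   /* plain ASCII */
        } else {
            out[pos++] = (char)(0xC0 | (c >> 6));   /* lead byte */
            out[pos++] = (char)(0x80 | (c & 0x3F)); /* continuation byte */
        }
    }
    if (outlen < pos + 1)
        return (size_t)-1;
    out[pos] = '\0';
    return pos;
}
```

This direction always succeeds because every Latin-1 character has a Unicode code point; a conversion into a smaller target character set has no such guarantee.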


Well, so this is a great new useful feature indeed, that will help improve
the lexer code a lot;-)

Good luck,

-- 
Fabien Coelho - [EMAIL PROTECTED]



Re: [HACKERS] Lexing with different charsets

2004-04-14 Thread Dennis Bjorklund
On Wed, 14 Apr 2004, Fabien COELHO wrote:

 printf("SELECT * FROM foo WHERE field1 = '%s'",
    latin1_to_database_encoding(...));

And how do you do this if the database encoding is latin2? You can not 
convert latin1 to latin2.

The specification was written like this to handle things like latin1 
strings in latin2 databases, or latin1 in a database that otherwise 
only uses ascii.

The intention is good, but the specification is not perfect in any way.

-- 
/Dennis Björklund




[HACKERS] Lexing with different charsets

2004-04-13 Thread Dennis Bjorklund
I've spent some more time reading specs today. Together with Peter E's
explanation (Thanks!) I think I've got a fairly good understanding of the
parts talking about locales now.

My next question is about lexing. The spec says that one can use strings 
of different charsets in the queries, like:

  ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'

I can see that the lexer either needs to be taught about all the
different charsets or this is not going to work very well.

What if one wants to include a string in utf-16 in the query? The lexer
cannot handle that without understanding utf-16. The query can also be in
different charsets. If it's in utf-8 for example, then we can not embed
latin1 strings and still have a validating utf-8 query. With the above we
can not think of the query as being in a single charset anymore. That's 
strange but okay I guess.
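To make the validation point concrete, here is a small sketch of a UTF-8 well-formedness check. The function name is_valid_utf8 is hypothetical and not taken from any PostgreSQL source; it simply shows that raw latin1 bytes for 'Åäö' (0xC5 0xE4 0xF6) make an otherwise valid utf-8 query fail the check.

```c
#include <stddef.h>

/* Returns 1 if the n bytes at s are well-formed UTF-8, otherwise 0. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t len;

        if (c < 0x80) { i++; continue; }                 /* ASCII */
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return 0;        /* stray continuation or invalid lead byte */

        if (n - i < len)
            return 0;         /* sequence truncated at end of input */

        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;     /* expected a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates, and values past U+10FFFF. */
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000))
            return 0;
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;

        i += len;
    }
    return 1;
}
```

With the literal encoded as utf-8 the whole query passes; with the same three characters as raw latin1 bytes, 0xC5 looks like a two-byte lead and 0xE4 is not a continuation byte, so the query is rejected.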

The new wire protocol allows us to send data separately from the query
which is nice, but the standard talked about strings as above so it's not
a solution to the problem.

Maybe I should have addressed this to Peter directly :-)

-- 
/Dennis Björklund




Re: [HACKERS] Lexing with different charsets

2004-04-13 Thread Tom Lane
Dennis Bjorklund [EMAIL PROTECTED] writes:
 My next question is about lexing. The spec says that one can use strings 
 of different charsets in the queries, like:
   ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
 I can see that the lexer either needs to be taught about all the
 different charsets or this is not going to work very well.

Yeah.  I'm not sure that we're ever going to support that part of the
spec; doing so would break too many useful things without adding very
much useful functionality.

We could possibly do it if we restrict to ASCII-superset character sets
(not UTF-16 for instance), so that the string quoting boundaries can be
found without hardwired knowledge about every character set.
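A rough sketch of why the ASCII-superset restriction helps. It assumes the literal is delimited by the single byte 0x27 and that a doubled quote is the only escape; the function name find_literal_end is hypothetical and this is not the actual lexer code.

```c
#include <stddef.h>

/*
 * s[open] must be the opening apostrophe of a string literal.
 * Returns the index of the closing apostrophe, or (size_t)-1 if the
 * literal is unterminated. A doubled '' inside the literal is treated
 * as an escaped quote. The scan looks only at raw bytes and knows
 * nothing about the character set of the literal's contents.
 */
size_t find_literal_end(const unsigned char *s, size_t n, size_t open)
{
    size_t i = open + 1;

    while (i < n) {
        if (s[i] == '\'') {
            if (i + 1 < n && s[i + 1] == '\'') {
                i += 2;           /* escaped quote, keep scanning */
                continue;
            }
            return i;             /* closing quote */
        }
        i++;
    }
    return (size_t)-1;
}
```

For latin1 or utf-8 content this finds the right boundary, since the byte 0x27 never occurs as part of another character in those encodings. With UTF-16 content it can stop too early: U+0127, for instance, is encoded as the bytes 0x01 0x27 in big-endian UTF-16, and the second byte looks like a quote.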

regards, tom lane



Re: [HACKERS] Lexing with different charsets

2004-04-13 Thread Peter Eisentraut
Tom Lane wrote:
 Dennis Bjorklund [EMAIL PROTECTED] writes:
  My next question is about lexing. The spec says that one can use
  strings of different charsets in the queries, like:
... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'
  I can see that the lexer either needs to be taught about all the
  different charsets or this is not going to work very well.

 Yeah.  I'm not sure that we're ever going to support that part of the
 spec; doing so would break too many useful things without adding very
 much useful functionality.

Like what?  I think it could be fairly useful.  We would have to 
restrict ourselves to character sets that are supersets of ASCII, but 
there are boatloads of reasons to do that besides this issue.




Re: [HACKERS] Lexing with different charsets

2004-04-13 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Yeah.  I'm not sure that we're ever going to support that part of the
 spec; doing so would break too many useful things without adding very
 much useful functionality.

 Like what?

The first things that came to mind were losing psql's ability to tell
what's a literal, losing the existing capability for queries to be
translated from client-side to server-side character set, and losing the
capability to have character sets defined by plug-in extensions rather
than being hard-wired into the lexer.  (Before you claim that the last
is easily solved, consider that the lexer is not allowed to do database
accesses.)

 I think it could be fairly useful.  We would have to 
 restrict ourselves to character sets that are supersets of ASCII, but 
 there are boatloads of reasons to do that besides this issue.

If we do that then some of the problems go away, but I'm not sure they
all do.  Are you willing to drop support for non-ASCII-superset
character sets on the client side as well as the server?

regards, tom lane



Re: [HACKERS] Lexing with different charsets

2004-04-13 Thread Dennis Bjorklund
On Tue, 13 Apr 2004, Tom Lane wrote:

 We could possibly do it if we restrict to ASCII-superset character sets
 (not UTF-16 for instance), so that the string quoting boundaries can be
 found without hardwired knowledge about every character set.

It's a reasonable compromise I guess. One can still support utf-16 and
others using the new wire protocol and maybe with some escaping extension
like:

 _utf16 H'a42a1121311'

where H would be a way to form a string from hex-encoded bytes (or 
using the same as for bytea, or whatever). It's a problem for the future.
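A minimal sketch of the decoding side of such an H'...' form, assuming two hex digits per byte. The function names decode_hex and hex_val are hypothetical, and the notation itself is only the proposal above.

```c
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode hexlen hex digits into out.
 * Returns the number of bytes written, or -1 on an odd digit count,
 * a non-hex character, or an output buffer that is too small.
 */
long decode_hex(const char *hex, size_t hexlen,
                unsigned char *out, size_t outlen)
{
    if (hexlen % 2 != 0 || hexlen / 2 > outlen)
        return -1;

    for (size_t i = 0; i < hexlen; i += 2) {
        int hi = hex_val(hex[i]);
        int lo = hex_val(hex[i + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        out[i / 2] = (unsigned char)(hi * 16 + lo);
    }
    return (long)(hexlen / 2);
}
```

The lexer would only ever see ASCII hex digits between the quotes; the decoded bytes could then be interpreted in whatever character set the introducer names. A strict decoder like this one rejects an odd number of digits.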

-- 
/Dennis Björklund




Re: [HACKERS] Lexing with different charsets

2004-04-13 Thread Tatsuo Ishii
 I've spent some more time reading specs today. Together with Peter E's
 explanation (Thanks!) I think I've got a fairly good understanding of the
 parts talking about locales now.
 
 My next question is about lexing. The spec says that one can use strings 
 of different charsets in the queries, like:
 
   ... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'

In my understanding this was removed as of SQL:1999. I'm not sure
about SQL:2003 though.
--
Tatsuo Ishii

 I can see that the lexer either needs to be taught about all the
 different charsets or this is not going to work very well.
 
 What if one wants to include a string in utf-16 in the query? The lexer
 cannot handle that without understanding utf-16. The query can also be in
 different charsets. If it's in utf-8 for example, then we can not embed
 latin1 strings and still have a validating utf-8 query. With the above we
 can not think of the query as being in a single charset anymore. That's 
 strange but okay I guess.
 
 The new wire protocol allows us to send data separately from the query
 which is nice, but the standard talked about strings as above so it's not
 a solution to the problem.
 
 Maybe I should have addressed this to Peter directly :-)
 
 -- 
 /Dennis Björklund
 
 
 



Re: [HACKERS] Lexing with different charsets

2004-04-13 Thread Stephan Szabo

On Wed, 14 Apr 2004, Tatsuo Ishii wrote:

  I've spent some more time reading specs today. Together with Peter E's
  explanation (Thanks!) I think I've got a fairly good understanding of the
  parts talking about locales now.
 
  My next question is about lexing. The spec says that one can use strings
  of different charsets in the queries, like:
 
... WHERE field1 = _latin1'FooBar' and field2 = _utf8'Åäö'

 In my understanding this was removed as of SQL:1999. I'm not sure
 about SQL:2003 though.

AFAICS, it still basically has:
<character string literal> ::=
  [ <introducer><character set specification> ]
  <quote> [ <character representation>... ] <quote>
  [ { <separator> <quote> [ <character representation>... ] <quote> }... ]
