Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread John Hansen
Tatsuo Ishii wrote:
 Sent: Sunday, May 08, 2005 3:31 PM
 To: John Hansen
 Cc: [EMAIL PROTECTED]; pgsql-general@postgresql.org; 
 pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
 
  Tatsuo Ishii wrote:
   Sent: Sunday, May 08, 2005 12:01 PM
   To: [EMAIL PROTECTED]
   Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
   Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
   
   We have developed patches which relaxes the character 
 validation so 
   that PostgreSQL accepts invalid characters. It works like this:
  
  That is just plain 100% wrong!!
 
  Under no circumstances should there be invalid data in a database.
  And if you're trying to make a database of invalid data, 
 then at least 
  encode it using a valid encoding.
  
  In fact, I've proposed strengthening the validation 
 routines for UTF-8.
 
 Actually I myself thought as you are before. Later I found 
 that it was not so good idea. People already have invalid 
 encoded data in their precious database and have very hard 
 time to migrate to newer version of PostgreSQL because of 
 encoding validation.
 
 Think about this kind of situation:
 
 There is a table t1(member_id integer primary key, 
 member_name text, address text, phone text, email text). I 
 have to reach each member by either adress, phone or email. 
 Unfortunately some of address field have wrong encoded data. 
 In this case I will use phone or email to reach them. 
 
 Now I need to upgrade to newer PostgreSQL within 1 day. I 
 know I have to fix wrong encoded field but it will take more 
 than 1 day. So I would like to import the data first then fix 
 wrong encoded field on running database since I can reach 
 members by phone or email even with wrong encoded address field...

Actually would be very simple, create function isvalidutf8(text) in your
preferred language.
C source is available from unicode.org.
Create function converttoutf8(text) using whatever code is required to
transform the _wrong_ encoding (SQL_ASCII - UTF8 for instance) to
utf-8.
Update table set field=converttoutf8(field) where !isvalidutf8(field);

Now sit back and relax while your invalid data is converted to utf-8.

When done, pg_dump the database, upgrade, and reload.

This should take less than a day.

 I saw this kind of situation in the real world and that's why 
 we developed the patches.
 --
 Tatsuo Ishii
 
 

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread Tino Wildenhain
Am Sonntag, den 08.05.2005, 14:30 +0900 schrieb Tatsuo Ishii:
...
 Actually I myself thought as you are before. Later I found that it was
 not so good idea. People already have invalid encoded data in their
 precious database and have very hard time to migrate to newer version
 of PostgreSQL because of encoding validation.
...
 Now I need to upgrade to newer PostgreSQL within 1 day. I know I have
 to fix wrong encoded field but it will take more than 1 day. So I
 would like to import the data first then fix wrong encoded field on
 running database since I can reach members by phone or email even with
 wrong encoded address field...

Well, if you are so in a hurry you better not migrate. Postgres is
proud of validating the input and to have no invalid data. So if
you have invalid data, better fix it.

 I saw this kind of situation in the real world and that's why we
 developed the patches.

Why not developing a helper for contrib to help reencoding the
database instead?

Regards
Tino


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread Madison Kelly
John Hansen wrote:
Tatsuo Ishii wrote:
We have developed patches which relaxes the character 
validation so that PostgreSQL accepts invalid characters. It 
works like this:

That is just plain 100% wrong!!
Under no circumstances should there be invalid data in a database.
And if you're trying to make a database of invalid data, then at 
least encode it using a valid encoding.

In fact, I've proposed strengthening the validation routines for UTF-8.
... John
  Under most circumstances I would agree with you completely. In my 
case though I have to decide between risking a loss of a user's data or 
attempt to store the file name in some manner that would return the same 
name used by the file system.

  The user (or one of his/her users in the case of an admin) may be 
completely unaware of the file name being an invalid unicode name. The 
file itself though may still be quite valid and contain information 
worthy of backing up. I could notify the user/admin that the name is not 
valid but there is no way I could rely on the name being changed. Given 
the choices, I would prefer to attempt to store/use the file name with 
the invalid unicode character than simply ignore the file.

  Is there a way to store the name in raw binary? If so, would this not 
be safe because to postgresql it should no longer matter what data is or 
represents, right? Maybe there is a third option I am not yet concidering?

Madison
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread Madison Kelly
Thank you, I would!

  What versions have you tested the patch against? I am sorry but I am
not too familiar with applying patches against the main program, is
there documentation on how to apply the patch? Is there a way to roll
the patch back/remove it? Would I be able to script the installation of
the patch (I would expect so).

  The reason for the last question is that I expect (hope) many people
will use it and I want to make it as easy as possible for a user to
simply select or unselect the patch if it works well. If I can script
the install and removal of this patch then I can do just this and that
would be wonderful.

  Thank you again!

  どうも ありがとう ございます! (I hope that is right, my 
Japanese is
still elementary. :) )

  Madison


Tatsuo Ishii wrote:
 We have developed patches which relaxes the character validation so
 that PostgreSQL accepts invalid characters. It works like this:
 
 1) new postgresql.conf item "mbstr_check" added.
 2) if mbstr_check = 0 then invalid characters are not accepted
(same as current PostgreSQL behavior). This is the default.
 3) if mbstr_check = 1 then invalid characters are accepted with
WARNING
 4) if mbstr_check = 2 then invalid characters are accepted without any
warnings
 5) We have checked PostgreSQL source code if accepting invalid
characters makes some troubles. We have found that we need to fix a
place and the fix is included in the patches.
 
 Madison,
 If you are interested in the patches, I could send it to you.
 
 Hackers,
 Do you think the functionality something like above is worth to add to
 PostgreSQL?
 --
 Tatsuo Ishii
 
 
Hi all,

   I've been chasing down a bug and from what I have learned it may be 
because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles 
invalid unicode. I've been given some ideas on how to try to catch 
invalid unicode but it seems expensive so I am hoping there is a 
postgresql way to deal with this problem.

   I've run into a problem where a bulk postgres "COPY..." statement is 
dieing because one of the lines contains a file name with an invalid 
unicode character. In nautilus this file has '(invalid encoding)' and 
the postgres error is 'CONTEXT:  COPY file_info_3, line 228287, column 
file_name: "Femme Fatal\u.url"'.

   To actually look at the file from the shell (bash) shows what appears 
to be a whitespace but when I copy/paste the file name I get the 
'\u' you see above.

   I could, with the help of the TLUG people, use regex to match for an 
invalid character and skip the file but that is not ideal. The reason is 
that this is for my backup program and invalid unicode or not, the 
contents of the file may still be important and I would prefer to have 
it in the database so that it is later copied. I can copy and move the 
file in the shell so the file isn't apparently in an of itself corrupt.

   So then, is there a way I can tell postresql to accept the invalid 
unicode name? Here is a copy of my schema:

tle-bu= \d file_info_2
   Table "public.file_info_2"
 Column| Type |Modifiers
--+--+-
  file_group_name  | text |
  file_group_uid   | bigint   | not null
  file_mod_time| bigint   | not null
  file_name| text | not null
  file_parent_dir  | text | not null
  file_perm| text | not null
  file_size| bigint   | not null
  file_type| character varying(2) | not null default 
'f'::character varying
  file_user_name   | text |
  file_user_uid| bigint   | not null
  file_backup  | boolean  | not null default true
  file_display | boolean  | not null default false
  file_restore_display | boolean  | not null default false
  file_restore | boolean  | not null default false
Indexes:
 "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name)

   'file_name' and 'file_parent_dir' are the columns that could have 
entries with the invalid unicode characters. Maybe I could/should use 
something other than 'text'? These columns could contain anything that a 
file or directory name could be.

   Thanks!

Madison

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com

Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread Oliver Jowett
Madison Kelly wrote:
  Is there a way to store the name in raw binary? 
Yes: bytea.
-O
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Tatsuo Ishii
We have developed patches which relaxes the character validation so
that PostgreSQL accepts invalid characters. It works like this:

1) new postgresql.conf item mbstr_check added.
2) if mbstr_check = 0 then invalid characters are not accepted
   (same as current PostgreSQL behavior). This is the default.
3) if mbstr_check = 1 then invalid characters are accepted with
   WARNING
4) if mbstr_check = 2 then invalid characters are accepted without any
   warnings
5) We have checked PostgreSQL source code if accepting invalid
   characters makes some troubles. We have found that we need to fix a
   place and the fix is included in the patches.

Madison,
If you are interested in the patches, I could send it to you.

Hackers,
Do you think the functionality something like above is worth to add to
PostgreSQL?
--
Tatsuo Ishii

 Hi all,
 
I've been chasing down a bug and from what I have learned it may be 
 because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles 
 invalid unicode. I've been given some ideas on how to try to catch 
 invalid unicode but it seems expensive so I am hoping there is a 
 postgresql way to deal with this problem.
 
I've run into a problem where a bulk postgres COPY... statement is 
 dieing because one of the lines contains a file name with an invalid 
 unicode character. In nautilus this file has '(invalid encoding)' and 
 the postgres error is 'CONTEXT:  COPY file_info_3, line 228287, column 
 file_name: Femme Fatal\u.url'.
 
To actually look at the file from the shell (bash) shows what appears 
 to be a whitespace but when I copy/paste the file name I get the 
 '\u' you see above.
 
I could, with the help of the TLUG people, use regex to match for an 
 invalid character and skip the file but that is not ideal. The reason is 
 that this is for my backup program and invalid unicode or not, the 
 contents of the file may still be important and I would prefer to have 
 it in the database so that it is later copied. I can copy and move the 
 file in the shell so the file isn't apparently in an of itself corrupt.
 
So then, is there a way I can tell postresql to accept the invalid 
 unicode name? Here is a copy of my schema:
 
 tle-bu= \d file_info_2
Table public.file_info_2
  Column| Type |Modifiers
 --+--+-
   file_group_name  | text |
   file_group_uid   | bigint   | not null
   file_mod_time| bigint   | not null
   file_name| text | not null
   file_parent_dir  | text | not null
   file_perm| text | not null
   file_size| bigint   | not null
   file_type| character varying(2) | not null default 
 'f'::character varying
   file_user_name   | text |
   file_user_uid| bigint   | not null
   file_backup  | boolean  | not null default true
   file_display | boolean  | not null default false
   file_restore_display | boolean  | not null default false
   file_restore | boolean  | not null default false
 Indexes:
  file_info_2_display_idx btree (file_type, file_parent_dir, file_name)
 
'file_name' and 'file_parent_dir' are the columns that could have 
 entries with the invalid unicode characters. Maybe I could/should use 
 something other than 'text'? These columns could contain anything that a 
 file or directory name could be.
 
Thanks!
 
 Madison
 
 -- 
 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Madison Kelly (Digimer)
 TLE-BU, The Linux Experience; Back Up
 http://tle-bu.thelinuxexperience.com
 -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 
 ---(end of broadcast)---
 TIP 7: don't forget to increase your free space map settings
 

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread John Hansen
Tatsuo Ishii wrote:
 Sent: Sunday, May 08, 2005 12:01 PM
 To: [EMAIL PROTECTED]
 Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
 Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
 
 We have developed patches which relaxes the character 
 validation so that PostgreSQL accepts invalid characters. It 
 works like this:

That is just plain 100% wrong!!
Under no circumstances should there be invalid data in a database.
And if you're trying to make a database of invalid data, then at 
least encode it using a valid encoding.

In fact, I've proposed strengthening the validation routines for UTF-8.

... John

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread John Hansen
Madison Kelly wrote:
Under most circumstances I would agree with you completely. In my 
 case though I have to decide between risking a loss of a 
 user's data or 
 attempt to store the file name in some manner that would 
 return the same 
 name used by the file system.
 
The user (or one of his/her users in the case of an admin) may be 
 completely unaware of the file name being an invalid unicode 
 name. The 
 file itself though may still be quite valid and contain information 
 worthy of backing up. I could notify the user/admin that the 
 name is not 
 valid but there is no way I could rely on the name being 
 changed. Given 
 the choices, I would prefer to attempt to store/use the file 
 name with 
 the invalid unicode character than simply ignore the file.
 
Is there a way to store the name in raw binary? If so, 
 would this not 
 be safe because to postgresql it should no longer matter what 
 data is or 
 represents, right? Maybe there is a third option I am not yet 
 concidering?

Set the client_encoding to ascii when storing that name, and again when
retrieving it.
Or, use a bytea column.

 
 Madison

... John

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Tom Lane
John Hansen [EMAIL PROTECTED] writes:
 Tatsuo Ishii wrote:
 We have developed patches which relaxes the character 
 validation so that PostgreSQL accepts invalid characters.

 That is just plain 100% wrong!!

That was my first reaction too.  Why would this be a good idea?
If someone does want an encoding-agnostic database, they can
set it as SQL_ASCII.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send unregister YourEmailAddressHere to [EMAIL PROTECTED])


Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-07 Thread Tatsuo Ishii
 Tatsuo Ishii wrote:
  Sent: Sunday, May 08, 2005 12:01 PM
  To: [EMAIL PROTECTED]
  Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org
  Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
  
  We have developed patches which relaxes the character 
  validation so that PostgreSQL accepts invalid characters. It 
  works like this:
 
 That is just plain 100% wrong!!

 Under no circumstances should there be invalid data in a database.
 And if you're trying to make a database of invalid data, then at 
 least encode it using a valid encoding.
 
 In fact, I've proposed strengthening the validation routines for UTF-8.

Actually I myself thought as you are before. Later I found that it was
not so good idea. People already have invalid encoded data in their
precious database and have very hard time to migrate to newer version
of PostgreSQL because of encoding validation.

Think about this kind of situation:

There is a table t1(member_id integer primary key, member_name text,
address text, phone text, email text). I have to reach each member by
either adress, phone or email. Unfortunately some of address field
have wrong encoded data. In this case I will use phone or email to
reach them. 

Now I need to upgrade to newer PostgreSQL within 1 day. I know I have
to fix wrong encoded field but it will take more than 1 day. So I
would like to import the data first then fix wrong encoded field on
running database since I can reach members by phone or email even with
wrong encoded address field...

I saw this kind of situation in the real world and that's why we
developed the patches.
--
Tatsuo Ishii

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org