Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 3:31 PM To: John Hansen Cc: [EMAIL PROTECTED]; pgsql-general@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 12:01 PM To: [EMAIL PROTECTED] Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: That is just plain 100% wrong!! Under no circumstances should there be invalid data in a database. And if you're trying to make a database of invalid data, then at least encode it using a valid encoding. In fact, I've proposed strengthening the validation routines for UTF-8. Actually I myself thought as you are before. Later I found that it was not so good idea. People already have invalid encoded data in their precious database and have very hard time to migrate to newer version of PostgreSQL because of encoding validation. Think about this kind of situation: There is a table t1(member_id integer primary key, member_name text, address text, phone text, email text). I have to reach each member by either adress, phone or email. Unfortunately some of address field have wrong encoded data. In this case I will use phone or email to reach them. Now I need to upgrade to newer PostgreSQL within 1 day. I know I have to fix wrong encoded field but it will take more than 1 day. So I would like to import the data first then fix wrong encoded field on running database since I can reach members by phone or email even with wrong encoded address field... Actually would be very simple, create function isvalidutf8(text) in your preferred language. C source is available from unicode.org. Create function converttoutf8(text) using whatever code is required to transform the _wrong_ encoding (SQL_ASCII - UTF8 for instance) to utf-8. Update table set field=converttoutf8(field) where !isvalidutf8(field); Now sit back and relax while your invalid data is converted to utf-8. When done, pg_dump the database, upgrade, and reload. This should take less than a day. I saw this kind of situation in the real world and that's why we developed the patches. -- Tatsuo Ishii ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Am Sonntag, den 08.05.2005, 14:30 +0900 schrieb Tatsuo Ishii: ... Actually I myself thought as you are before. Later I found that it was not so good idea. People already have invalid encoded data in their precious database and have very hard time to migrate to newer version of PostgreSQL because of encoding validation. ... Now I need to upgrade to newer PostgreSQL within 1 day. I know I have to fix wrong encoded field but it will take more than 1 day. So I would like to import the data first then fix wrong encoded field on running database since I can reach members by phone or email even with wrong encoded address field... Well, if you are so in a hurry you better not migrate. Postgres is proud of validating the input and to have no invalid data. So if you have invalid data, better fix it. I saw this kind of situation in the real world and that's why we developed the patches. Why not developing a helper for contrib to help reencoding the database instead? Regards Tino ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
John Hansen wrote: Tatsuo Ishii wrote: We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: That is just plain 100% wrong!! Under no circumstances should there be invalid data in a database. And if you're trying to make a database of invalid data, then at least encode it using a valid encoding. In fact, I've proposed strengthening the validation routines for UTF-8. ... John Under most circumstances I would agree with you completely. In my case though I have to decide between risking a loss of a user's data or attempt to store the file name in some manner that would return the same name used by the file system. The user (or one of his/her users in the case of an admin) may be completely unaware of the file name being an invalid unicode name. The file itself though may still be quite valid and contain information worthy of backing up. I could notify the user/admin that the name is not valid but there is no way I could rely on the name being changed. Given the choices, I would prefer to attempt to store/use the file name with the invalid unicode character than simply ignore the file. Is there a way to store the name in raw binary? If so, would this not be safe because to postgresql it should no longer matter what data is or represents, right? Maybe there is a third option I am not yet concidering? Madison -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Madison Kelly (Digimer) TLE-BU, The Linux Experience; Back Up http://tle-bu.thelinuxexperience.com -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Thank you, I would! What versions have you tested the patch against? I am sorry but I am not too familiar with applying patches against the main program, is there documentation on how to apply the patch? Is there a way to roll the patch back/remove it? Would I be able to script the installation of the patch (I would expect so). The reason for the last question is that I expect (hope) many people will use it and I want to make it as easy as possible for a user to simply select or unselect the patch if it works well. If I can script the install and removal of this patch then I can do just this and that would be wonderful. Thank you again! どうも ありがとう ございます! (I hope that is right, my Japanese is still elementary. :) ) Madison Tatsuo Ishii wrote: We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: 1) new postgresql.conf item "mbstr_check" added. 2) if mbstr_check = 0 then invalid characters are not accepted (same as current PostgreSQL behavior). This is the default. 3) if mbstr_check = 1 then invalid characters are accepted with WARNING 4) if mbstr_check = 2 then invalid characters are accepted without any warnings 5) We have checked PostgreSQL source code if accepting invalid characters makes some troubles. We have found that we need to fix a place and the fix is included in the patches. Madison, If you are interested in the patches, I could send it to you. Hackers, Do you think the functionality something like above is worth to add to PostgreSQL? -- Tatsuo Ishii Hi all, I've been chasing down a bug and from what I have learned it may be because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles invalid unicode. I've been given some ideas on how to try to catch invalid unicode but it seems expensive so I am hoping there is a postgresql way to deal with this problem. I've run into a problem where a bulk postgres "COPY..." statement is dieing because one of the lines contains a file name with an invalid unicode character. In nautilus this file has '(invalid encoding)' and the postgres error is 'CONTEXT: COPY file_info_3, line 228287, column file_name: "Femme Fatal\u.url"'. To actually look at the file from the shell (bash) shows what appears to be a whitespace but when I copy/paste the file name I get the '\u' you see above. I could, with the help of the TLUG people, use regex to match for an invalid character and skip the file but that is not ideal. The reason is that this is for my backup program and invalid unicode or not, the contents of the file may still be important and I would prefer to have it in the database so that it is later copied. I can copy and move the file in the shell so the file isn't apparently in an of itself corrupt. So then, is there a way I can tell postresql to accept the invalid unicode name? Here is a copy of my schema: tle-bu= \d file_info_2 Table "public.file_info_2" Column| Type |Modifiers --+--+- file_group_name | text | file_group_uid | bigint | not null file_mod_time| bigint | not null file_name| text | not null file_parent_dir | text | not null file_perm| text | not null file_size| bigint | not null file_type| character varying(2) | not null default 'f'::character varying file_user_name | text | file_user_uid| bigint | not null file_backup | boolean | not null default true file_display | boolean | not null default false file_restore_display | boolean | not null default false file_restore | boolean | not null default false Indexes: "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name) 'file_name' and 'file_parent_dir' are the columns that could have entries with the invalid unicode characters. Maybe I could/should use something other than 'text'? These columns could contain anything that a file or directory name could be. Thanks! Madison -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Madison Kelly (Digimer) TLE-BU, The Linux Experience; Back Up http://tle-bu.thelinuxexperience.com
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Madison Kelly wrote: Is there a way to store the name in raw binary? Yes: bytea. -O ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: 1) new postgresql.conf item mbstr_check added. 2) if mbstr_check = 0 then invalid characters are not accepted (same as current PostgreSQL behavior). This is the default. 3) if mbstr_check = 1 then invalid characters are accepted with WARNING 4) if mbstr_check = 2 then invalid characters are accepted without any warnings 5) We have checked PostgreSQL source code if accepting invalid characters makes some troubles. We have found that we need to fix a place and the fix is included in the patches. Madison, If you are interested in the patches, I could send it to you. Hackers, Do you think the functionality something like above is worth to add to PostgreSQL? -- Tatsuo Ishii Hi all, I've been chasing down a bug and from what I have learned it may be because of how postgreSQL (8.0.2 on Fedora Core 4 test 2) handles invalid unicode. I've been given some ideas on how to try to catch invalid unicode but it seems expensive so I am hoping there is a postgresql way to deal with this problem. I've run into a problem where a bulk postgres COPY... statement is dieing because one of the lines contains a file name with an invalid unicode character. In nautilus this file has '(invalid encoding)' and the postgres error is 'CONTEXT: COPY file_info_3, line 228287, column file_name: Femme Fatal\u.url'. To actually look at the file from the shell (bash) shows what appears to be a whitespace but when I copy/paste the file name I get the '\u' you see above. I could, with the help of the TLUG people, use regex to match for an invalid character and skip the file but that is not ideal. The reason is that this is for my backup program and invalid unicode or not, the contents of the file may still be important and I would prefer to have it in the database so that it is later copied. I can copy and move the file in the shell so the file isn't apparently in an of itself corrupt. So then, is there a way I can tell postresql to accept the invalid unicode name? Here is a copy of my schema: tle-bu= \d file_info_2 Table public.file_info_2 Column| Type |Modifiers --+--+- file_group_name | text | file_group_uid | bigint | not null file_mod_time| bigint | not null file_name| text | not null file_parent_dir | text | not null file_perm| text | not null file_size| bigint | not null file_type| character varying(2) | not null default 'f'::character varying file_user_name | text | file_user_uid| bigint | not null file_backup | boolean | not null default true file_display | boolean | not null default false file_restore_display | boolean | not null default false file_restore | boolean | not null default false Indexes: file_info_2_display_idx btree (file_type, file_parent_dir, file_name) 'file_name' and 'file_parent_dir' are the columns that could have entries with the invalid unicode characters. Maybe I could/should use something other than 'text'? These columns could contain anything that a file or directory name could be. Thanks! Madison -- -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Madison Kelly (Digimer) TLE-BU, The Linux Experience; Back Up http://tle-bu.thelinuxexperience.com -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- ---(end of broadcast)--- TIP 7: don't forget to increase your free space map settings ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 12:01 PM To: [EMAIL PROTECTED] Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: That is just plain 100% wrong!! Under no circumstances should there be invalid data in a database. And if you're trying to make a database of invalid data, then at least encode it using a valid encoding. In fact, I've proposed strengthening the validation routines for UTF-8. ... John ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Madison Kelly wrote: Under most circumstances I would agree with you completely. In my case though I have to decide between risking a loss of a user's data or attempt to store the file name in some manner that would return the same name used by the file system. The user (or one of his/her users in the case of an admin) may be completely unaware of the file name being an invalid unicode name. The file itself though may still be quite valid and contain information worthy of backing up. I could notify the user/admin that the name is not valid but there is no way I could rely on the name being changed. Given the choices, I would prefer to attempt to store/use the file name with the invalid unicode character than simply ignore the file. Is there a way to store the name in raw binary? If so, would this not be safe because to postgresql it should no longer matter what data is or represents, right? Maybe there is a third option I am not yet concidering? Set the client_encoding to ascii when storing that name, and again when retrieving it. Or, use a bytea column. Madison ... John ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
John Hansen [EMAIL PROTECTED] writes: Tatsuo Ishii wrote: We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. That is just plain 100% wrong!! That was my first reaction too. Why would this be a good idea? If someone does want an encoding-agnostic database, they can set it as SQL_ASCII. regards, tom lane ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem
Tatsuo Ishii wrote: Sent: Sunday, May 08, 2005 12:01 PM To: [EMAIL PROTECTED] Cc: pgsql-general@postgresql.org; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem We have developed patches which relaxes the character validation so that PostgreSQL accepts invalid characters. It works like this: That is just plain 100% wrong!! Under no circumstances should there be invalid data in a database. And if you're trying to make a database of invalid data, then at least encode it using a valid encoding. In fact, I've proposed strengthening the validation routines for UTF-8. Actually I myself thought as you are before. Later I found that it was not so good idea. People already have invalid encoded data in their precious database and have very hard time to migrate to newer version of PostgreSQL because of encoding validation. Think about this kind of situation: There is a table t1(member_id integer primary key, member_name text, address text, phone text, email text). I have to reach each member by either adress, phone or email. Unfortunately some of address field have wrong encoded data. In this case I will use phone or email to reach them. Now I need to upgrade to newer PostgreSQL within 1 day. I know I have to fix wrong encoded field but it will take more than 1 day. So I would like to import the data first then fix wrong encoded field on running database since I can reach members by phone or email even with wrong encoded address field... I saw this kind of situation in the real world and that's why we developed the patches. -- Tatsuo Ishii ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org