
Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread Madison Kelly

John Hansen wrote:
> Tatsuo Ishii wrote:
>> We have developed patches which relaxes the character
>> validation so that PostgreSQL accepts invalid characters. It
>> works like this:
>
> That is just plain 100% wrong!!
> Under no circumstances should there be invalid data in a database.
> And if you're trying to make a database of invalid data, then at
> least encode it using a valid encoding.
>
> In fact, I've proposed strengthening the validation routines for UTF-8.
> ... John

  Under most circumstances I would agree with you completely. In my
case, though, I have to decide between risking the loss of a user's data
and attempting to store the file name in some manner that would return
the same name used by the file system.

  The user (or one of his/her users in the case of an admin) may be 
completely unaware of the file name being an invalid unicode name. The 
file itself though may still be quite valid and contain information 
worthy of backing up. I could notify the user/admin that the name is not 
valid but there is no way I could rely on the name being changed. Given 
the choices, I would prefer to attempt to store/use the file name with 
the invalid unicode character than simply ignore the file.

  Is there a way to store the name in raw binary? If so, would this not
be safe, since it should then no longer matter to PostgreSQL what the
data is or represents? Maybe there is a third option I am not yet
considering?
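
  To make the "raw binary" idea concrete, here is a rough sketch in Python. It assumes the two name columns were changed from 'text' to 'bytea', which nothing in this thread has settled, and it assumes the bytea escape format with the backslashes doubled for COPY text input, as I read the COPY documentation. The function name and the sample byte 0xE9 are made up for illustration; the real offending byte in the file name is not known here.

```python
def bytea_copy_field(raw: bytes) -> str:
    """Render raw bytes as one bytea field for COPY text input.

    Assumption: the target column is bytea and the server accepts the
    escape format (\\ooo octal), with every backslash doubled because
    COPY itself also interprets backslashes.
    """
    out = []
    for b in raw:
        if b == 0x5C:
            # A literal backslash byte: four backslashes in the COPY data.
            out.append("\\\\\\\\")
        elif 0x20 <= b <= 0x7E:
            # Printable ASCII passes through unchanged.
            out.append(chr(b))
        else:
            # Everything else (tabs, newlines, bytes >= 0x80) becomes
            # two backslashes followed by three octal digits.
            out.append("\\\\%03o" % b)
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical file name containing a byte that is not valid UTF-8.
    name = b"Femme Fatal\xe9.url"
    print(bytea_copy_field(name))
```

  Because the name is never decoded, the bytes read back from the database should be exactly the bytes the file system returned, whatever encoding they were meant to be in.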

Madison
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Madison Kelly (Digimer)
TLE-BU, The Linux Experience; Back Up
http://tle-bu.thelinuxexperience.com
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Re: [HACKERS] [GENERAL] Invalid unicode in COPY problem

2005-05-08 Thread Madison Kelly

Thank you, I would!

  What versions have you tested the patch against? I am sorry, but I am
not too familiar with applying patches against the main program. Is
there documentation on how to apply the patch? Is there a way to roll
the patch back/remove it? Would I be able to script the installation of
the patch? (I would expect so.)

  The reason for the last question is that I expect (hope) many people
will use it and I want to make it as easy as possible for a user to
simply select or unselect the patch if it works well. If I can script
the install and removal of this patch then I can do just this and that
would be wonderful.

  Thank you again!

  どうも ありがとう ございます! (I hope that is right, my Japanese is
still elementary. :) )

  Madison

Tatsuo Ishii wrote:
> We have developed patches which relaxes the character validation so
> that PostgreSQL accepts invalid characters. It works like this:
>
> 1) new postgresql.conf item "mbstr_check" added.
> 2) if mbstr_check = 0 then invalid characters are not accepted
>    (same as current PostgreSQL behavior). This is the default.
> 3) if mbstr_check = 1 then invalid characters are accepted with
>    WARNING
> 4) if mbstr_check = 2 then invalid characters are accepted without any
>    warnings
> 5) We have checked PostgreSQL source code if accepting invalid
>    characters makes some troubles. We have found that we need to fix a
>    place and the fix is included in the patches.
>
> Madison,
> If you are interested in the patches, I could send it to you.
>
> Hackers,
> Do you think the functionality something like above is worth to add to
> PostgreSQL?
> --
> Tatsuo Ishii

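  The three mbstr_check levels quoted above can be modelled roughly as follows. This is only an illustration in Python of the behaviour as described in the message, not the patch itself: the function name is made up, the database encoding is assumed to be UTF-8, and Python's UTF-8 decoder stands in for PostgreSQL's own validation routines, which may not agree with it on every byte sequence.

```python
import warnings


def check_mbstr(raw: bytes, mbstr_check: int = 0) -> bytes:
    """Accept or reject raw bytes according to the proposed levels.

    0 -> reject invalid input (described as the current behaviour)
    1 -> accept invalid input, but emit a warning
    2 -> accept invalid input silently
    """
    if mbstr_check not in (0, 1, 2):
        raise ValueError("mbstr_check must be 0, 1 or 2")
    try:
        raw.decode("utf-8")          # strict decoding is the validity test
    except UnicodeDecodeError as exc:
        if mbstr_check == 0:
            raise ValueError(
                f"invalid byte sequence at offset {exc.start}"
            ) from exc
        if mbstr_check == 1:
            warnings.warn(
                f"accepting invalid byte sequence at offset {exc.start}"
            )
        # Levels 1 and 2 fall through and keep the bytes unchanged.
    return raw
```

  Note that at levels 1 and 2 the bytes are stored exactly as given, so anything that later assumes valid UTF-8 has to cope with them.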
>> Hi all,
>>
>>   I've been chasing down a bug and from what I have learned it may be
>> because of how PostgreSQL (8.0.2 on Fedora Core 4 test 2) handles
>> invalid unicode. I've been given some ideas on how to try to catch
>> invalid unicode but it seems expensive so I am hoping there is a
>> PostgreSQL way to deal with this problem.
>>
>>   I've run into a problem where a bulk postgres "COPY..." statement is
>> dying because one of the lines contains a file name with an invalid
>> unicode character. In nautilus this file has '(invalid encoding)' and
>> the postgres error is 'CONTEXT:  COPY file_info_3, line 228287, column
>> file_name: "Femme Fatal\u.url"'.
>>
>>   Looking at the file from the shell (bash) shows what appears to be
>> a whitespace, but when I copy/paste the file name I get the '\u' you
>> see above.
>>
>>   I could, with the help of the TLUG people, use regex to match for an
>> invalid character and skip the file, but that is not ideal. The reason
>> is that this is for my backup program and, invalid unicode or not, the
>> contents of the file may still be important and I would prefer to have
>> it in the database so that it is later copied. I can copy and move the
>> file in the shell, so the file apparently isn't in and of itself
>> corrupt.
>>
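  For what it is worth, finding the offending names does not need a regex. A small Python sketch, assuming the names are meant to be UTF-8 and are read from the file system as raw bytes (both helper names are made up for illustration, and the byte 0xE9 mentioned below is a stand-in, since the real one is not known):

```python
import os


def first_invalid_offset(name: bytes):
    """Return the offset of the first byte that breaks UTF-8, or None."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def invalid_names(directory: bytes):
    """Yield (name, offset) for directory entries that are not valid UTF-8.

    Passing a bytes path makes os.listdir() return the names as bytes,
    so nothing is decoded (or silently altered) before the check.
    """
    for name in os.listdir(directory):
        offset = first_invalid_offset(name)
        if offset is not None:
            yield name, offset
```

  With a hypothetical name such as b"Femme Fatal\xe9.url", first_invalid_offset() would report offset 11, so the file can be flagged or handled specially instead of being skipped.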
>>   So then, is there a way I can tell PostgreSQL to accept the invalid
>> unicode name? Here is a copy of my schema:
>>
>> tle-bu= \d file_info_2
>>                   Table "public.file_info_2"
>>         Column        |         Type         |                Modifiers
>> ----------------------+----------------------+-----------------------------------------
>>  file_group_name      | text                 |
>>  file_group_uid       | bigint               | not null
>>  file_mod_time        | bigint               | not null
>>  file_name            | text                 | not null
>>  file_parent_dir      | text                 | not null
>>  file_perm            | text                 | not null
>>  file_size            | bigint               | not null
>>  file_type            | character varying(2) | not null default 'f'::character varying
>>  file_user_name       | text                 |
>>  file_user_uid        | bigint               | not null
>>  file_backup          | boolean              | not null default true
>>  file_display         | boolean              | not null default false
>>  file_restore_display | boolean              | not null default false
>>  file_restore         | boolean              | not null default false
>> Indexes:
>>     "file_info_2_display_idx" btree (file_type, file_parent_dir, file_name)
>>
>>   'file_name' and 'file_parent_dir' are the columns that could have
>> entries with the invalid unicode characters. Maybe I could/should use
>> something other than 'text'? These columns could contain anything that a
>> file or directory name could be.
>>
>>   Thanks!
>>
>> Madison
>>
>> --
>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>> Madison Kelly (Digimer)
>> TLE-BU, The Linux Experience; Back Up
>> http://tle-bu.thelinuxexperience.com
