Re: [HACKERS] Upcoming PG re-releases

2005-12-10 Thread Robert Treat
Was thinking if someone could summarize this all it would make a really good 
FAQ entry. 

Robert Treat

On Friday 09 December 2005 13:28, Martijn van Oosterhout wrote:
 On Fri, Dec 09, 2005 at 12:38:21PM -0500, Bruce Momjian wrote:
   This means someone who is planning on upgrading to 8.1 in two months
   can use this function now to weed out the bad data before the upgrade
   even starts.
 
  Oh, so you back-load it into the old database.  Interesting.  I assume
  to be useful you would have to write something that checked every column
  values in every table and database.

 Umm, yeah. I was thinking about how to do that. pl/pgsql is not the
 best language to do that in. In any case I found a bug in the version I
 posted and also added a function that does:

 test=# select * from db_utf8_verify();
  tab  | fld | location
 --+-+--
  tbl1 | foo | (12,3)
 (1 row)

 It gives the table, field and ctid of any values that failed. It skips
 pg_catalog. It's also *really* slow for long strings. Just executing it
 on the pg_rewrite in the default installation takes forever. If someone
 really wanted this for a large database maybe they should recode it in
 C.

 http://svana.org/kleptog/pgsql/utf8_verify.sql

 Have a nice day,

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Upcoming PG re-releases

2005-12-10 Thread Bruce Momjian

I don't see it asked very often, and I think our 8.1 releae note
addition (plus a mention in the 8.1.1 notes) will complete this.

---

Robert Treat wrote:
 Was thinking if someone could summarize this all it would make a really good 
 FAQ entry. 
 
 Robert Treat
 
 On Friday 09 December 2005 13:28, Martijn van Oosterhout wrote:
  On Fri, Dec 09, 2005 at 12:38:21PM -0500, Bruce Momjian wrote:
This means someone who is planning on upgrading to 8.1 in two months
can use this function now to weed out the bad data before the upgrade
even starts.
  
   Oh, so you back-load it into the old database.  Interesting.  I assume
   to be useful you would have to write something that checked every column
   values in every table and database.
 
  Umm, yeah. I was thinking about how to do that. pl/pgsql is not the
  best language to do that in. In any case I found a bug in the version I
  posted and also added a function that does:
 
  test=# select * from db_utf8_verify();
   tab  | fld | location
  --+-+--
   tbl1 | foo | (12,3)
  (1 row)
 
  It gives the table, field and ctid of any values that failed. It skips
  pg_catalog. It's also *really* slow for long strings. Just executing it
  on the pg_rewrite in the default installation takes forever. If someone
  really wanted this for a large database maybe they should recode it in
  C.
 
  http://svana.org/kleptog/pgsql/utf8_verify.sql
 
  Have a nice day,
 
 -- 
 Robert Treat
 Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Upcoming PG re-releases

2005-12-10 Thread Joshua D. Drake

Bruce Momjian wrote:

I don't see it asked very often, and I think our 8.1 releae note
addition (plus a mention in the 8.1.1 notes) will complete this.

  
Actually a upgrade FAQ is probably a good idea. Something that says 
what really happens

when foo changes in 8.1 or how foo is different then 8.0.

The idea that there is a practical (for those that have practical 
implications) resource for finding

out what it really means that the UTF-8 stuff changed .

Joshua D. Drake



---

Robert Treat wrote:
  
Was thinking if someone could summarize this all it would make a really good 
FAQ entry. 


Robert Treat

On Friday 09 December 2005 13:28, Martijn van Oosterhout wrote:


On Fri, Dec 09, 2005 at 12:38:21PM -0500, Bruce Momjian wrote:
  

This means someone who is planning on upgrading to 8.1 in two months
can use this function now to weed out the bad data before the upgrade
even starts.
  

Oh, so you back-load it into the old database.  Interesting.  I assume
to be useful you would have to write something that checked every column
values in every table and database.


Umm, yeah. I was thinking about how to do that. pl/pgsql is not the
best language to do that in. In any case I found a bug in the version I
posted and also added a function that does:

test=# select * from db_utf8_verify();
 tab  | fld | location
--+-+--
 tbl1 | foo | (12,3)
(1 row)

It gives the table, field and ctid of any values that failed. It skips
pg_catalog. It's also *really* slow for long strings. Just executing it
on the pg_rewrite in the default installation takes forever. If someone
really wanted this for a large database maybe they should recode it in
C.

http://svana.org/kleptog/pgsql/utf8_verify.sql

Have a nice day,
  

--
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL




  



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Upcoming PG re-releases

2005-12-09 Thread Martijn van Oosterhout
On Thu, Dec 08, 2005 at 05:54:35PM -0500, Gregory Maxwell wrote:
 No, what is needed for people who care about fixing their data is a
 loadable strip_invalid_utf8() that works in older versions.. then just
 select * from bar where foo != strip_invalid_utf8(foo);  The function
 would be useful in general, for example, if you have an application
 which doesn't already have much utf8 logic, you want to use a text
 field, and stripping is the behaviour you want. For example, lots of
 simple web applications.

Would something like the following work? It's written in pl/pgsql and
does (AFAICS) the same checking as the backend in recent releases.
Except the backend only supports up to 4-byte UTF-8 whereas this
function checks upto six byte. For a six byte UTF-8 character, who is
wrong?

In any case, people should be able to do something like:

SELECT field FROM table WHERE NOT utf8_verify(field,4);

To check conformance with PostgreSQL 8.1. Note, I don't have large
chunks of UTF-8 to test with but it works for the characters I tried
with. Tested with 7.4.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.
create or replace function utf8_verify(bytea,integer) returns bool as '
DECLARE
   str ALIAS FOR $1;
   maxlen ALIAS FOR $2;
   strlen INTEGER;
   i integer;
   j INTEGER;
   len integer;
   chr integer;
   wchr integer;
BEGIN
   i := 0;
   strlen := length(str);

   WHILE i  strlen LOOP

 -- Check leading byte
 chr := get_byte(str,i);

 IF chr  128 THEN -- 0x00 - 0x80   - single byte
len := 1;
wchr := chr;
 ELSIF chr  192 THEN  -- 0x80 - 0xC0   - illegal
RETURN false;
 ELSIF chr  224 THEN  -- 0xC0 - 0xE0   - two bytes
len := 2;
wchr := chr - 192;
 ELSIF chr  240 THEN  -- 0xE0 - 0xF0   - three bytes
len := 3;
wchr := chr - 224;
 ELSIF chr  248 THEN  -- 0xF0 - 0xF8   - four bytes
len := 4;
wchr := chr - 240;
 ELSIF chr  252 THEN  -- 0xF8 - 0xFC   - five bytes
len := 5;
wchr := chr - 248;
 ELSIF chr  254 THEN  -- 0xFC - 0xFE   - six bytes
len := 6;
wchr := chr - 252;
 ELSE
RETURN false;   -- FE and FF not currently defined
 END IF;

 IF i + len  strlen THEN
RETURN false;
 END IF;

 IF len  maxlen THEN
RETURN false;
 END IF;

 -- Check remaining characters
 j := 1;
 WHILE len  j LOOP
chr := get_byte(str, i+j);
IF chr  128 OR chr = 192 THEN
RETURN false;
END IF;
wchr := (wchr  6) + (chr - 192);
j := j+1;
 END LOOP;

 -- Verify shortest possible string
 IF len = 1 AND wchr = 128 THEN
RETURN false;
 ELSIF len = 2 AND (wchr  128 OR wchr = 2048) THEN
RETURN false;
 ELSIF len = 3 AND (wchr  2048 OR wchr = 65536) THEN
RETURN false;
 ELSIF len = 4 AND (wchr  65536 OR wchr = 2097152) THEN
RETURN false;
 ELSIF len = 5 AND (wchr  2097152 OR wchr = 67108864) THEN
RETURN false;
 ELSIF len = 6 AND (wchr  67108864 OR wchr = 2147483648) THEN
RETURN false;
 END IF;

-- RAISE NOTICE ''Checked char offset %, OK (wchr=%,len=%)'', i, wchr, len;

 i := i+len;
   END LOOP;

  RETURN true;
END;
' language plpgsql;


pgpBVYFeuKe6z.pgp
Description: PGP signature


Re: [HACKERS] Upcoming PG re-releases

2005-12-09 Thread Bruce Momjian
Martijn van Oosterhout wrote:
-- Start of PGP signed section.
 On Thu, Dec 08, 2005 at 05:54:35PM -0500, Gregory Maxwell wrote:
  No, what is needed for people who care about fixing their data is a
  loadable strip_invalid_utf8() that works in older versions.. then just
  select * from bar where foo != strip_invalid_utf8(foo);  The function
  would be useful in general, for example, if you have an application
  which doesn't already have much utf8 logic, you want to use a text
  field, and stripping is the behaviour you want. For example, lots of
  simple web applications.
 
 Would something like the following work? It's written in pl/pgsql and
 does (AFAICS) the same checking as the backend in recent releases.
 Except the backend only supports up to 4-byte UTF-8 whereas this
 function checks upto six byte. For a six byte UTF-8 character, who is
 wrong?
 
 In any case, people should be able to do something like:
 
 SELECT field FROM table WHERE NOT utf8_verify(field,4);
 
 To check conformance with PostgreSQL 8.1. Note, I don't have large
 chunks of UTF-8 to test with but it works for the characters I tried
 with. Tested with 7.4.

I think the problem with any kind of function-call detection is that the
data has to get into the database first, and it isn't clear how someone
loading a failed dump would do that aside from modifying the column to
bytea in the dump, loading it in, then fixing it.  The iconv idea has
the advantage that it can be fixed before loading into the database.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Upcoming PG re-releases

2005-12-09 Thread Martijn van Oosterhout
On Fri, Dec 09, 2005 at 11:34:22AM -0500, Bruce Momjian wrote:
 I think the problem with any kind of function-call detection is that the
 data has to get into the database first, and it isn't clear how someone
 loading a failed dump would do that aside from modifying the column to
 bytea in the dump, loading it in, then fixing it.  The iconv idea has
 the advantage that it can be fixed before loading into the database.

The point of this function is to test the data *before* you even create
the dump, while it is still running on 7.4 or 8.0.

This means someone who is planning on upgrading to 8.1 in two months
can use this function now to weed out the bad data before the upgrade
even starts.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


pgpjabhiDQ4L4.pgp
Description: PGP signature


Re: [HACKERS] Upcoming PG re-releases

2005-12-09 Thread Bruce Momjian
Martijn van Oosterhout wrote:
-- Start of PGP signed section.
 On Fri, Dec 09, 2005 at 11:34:22AM -0500, Bruce Momjian wrote:
  I think the problem with any kind of function-call detection is that the
  data has to get into the database first, and it isn't clear how someone
  loading a failed dump would do that aside from modifying the column to
  bytea in the dump, loading it in, then fixing it.  The iconv idea has
  the advantage that it can be fixed before loading into the database.
 
 The point of this function is to test the data *before* you even create
 the dump, while it is still running on 7.4 or 8.0.
 
 This means someone who is planning on upgrading to 8.1 in two months
 can use this function now to weed out the bad data before the upgrade
 even starts.

Oh, so you back-load it into the old database.  Interesting.  I assume
to be useful you would have to write something that checked every column
values in every table and database.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Upcoming PG re-releases

2005-12-09 Thread Martijn van Oosterhout
On Fri, Dec 09, 2005 at 12:38:21PM -0500, Bruce Momjian wrote:
  This means someone who is planning on upgrading to 8.1 in two months
  can use this function now to weed out the bad data before the upgrade
  even starts.
 
 Oh, so you back-load it into the old database.  Interesting.  I assume
 to be useful you would have to write something that checked every column
 values in every table and database.

Umm, yeah. I was thinking about how to do that. pl/pgsql is not the
best language to do that in. In any case I found a bug in the version I
posted and also added a function that does:

test=# select * from db_utf8_verify();
 tab  | fld | location 
--+-+--
 tbl1 | foo | (12,3)
(1 row)

It gives the table, field and ctid of any values that failed. It skips
pg_catalog. It's also *really* slow for long strings. Just executing it
on the pg_rewrite in the default installation takes forever. If someone
really wanted this for a large database maybe they should recode it in
C.

http://svana.org/kleptog/pgsql/utf8_verify.sql

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


pgpqR9Zm4nxcG.pgp
Description: PGP signature


Re: [HACKERS] Upcoming PG re-releases

2005-12-08 Thread Gavin Sherry
On Tue, 6 Dec 2005, Bruce Momjian wrote:


 Exactly what does vim do that iconv does not?  Fuzzy encoding sounds
 scary to me.


Right. It actually makes assumptions about the source encoding. People who
care about their data need, unfortunately, to spend a bit of time on this
problem. I've been discussing the same issue on the slony1 mailing list,
because the issue can affect people's ability upgrade using slony1.

http://gborg.postgresql.org/pipermail/slony1-general/2005-December/003430.html

It would be good if had the script I suggest in the email:

A script which identifies non-utf-8 characters and provides some
context, line numbers, etc, will greatly speed up the process of
remedying the situation.

Thoughts?

Gavin

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Upcoming PG re-releases

2005-12-08 Thread Bruce Momjian
Gavin Sherry wrote:
 On Tue, 6 Dec 2005, Bruce Momjian wrote:
 
 
  Exactly what does vim do that iconv does not?  Fuzzy encoding sounds
  scary to me.
 
 
 Right. It actually makes assumptions about the source encoding. People who
 care about their data need, unfortunately, to spend a bit of time on this
 problem. I've been discussing the same issue on the slony1 mailing list,
 because the issue can affect people's ability upgrade using slony1.
 
 http://gborg.postgresql.org/pipermail/slony1-general/2005-December/003430.html
 
 It would be good if had the script I suggest in the email:
 
   A script which identifies non-utf-8 characters and provides some
   context, line numbers, etc, will greatly speed up the process of
   remedying the situation.

I think the best we can do is the iconv -c with the diff idea, which
is already in the release notes.  I suppose we could merge the iconv and
diff into a single command, but I don't see a portable way to output the
iconv output to stdout., /dev/stdin not being portable.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Upcoming PG re-releases

2005-12-08 Thread Gregory Maxwell
On 12/8/05, Bruce Momjian pgman@candle.pha.pa.us wrote:
A script which identifies non-utf-8 characters and provides some
context, line numbers, etc, will greatly speed up the process of
remedying the situation.

 I think the best we can do is the iconv -c with the diff idea, which
 is already in the release notes.  I suppose we could merge the iconv and
 diff into a single command, but I don't see a portable way to output the
 iconv output to stdout., /dev/stdin not being portable.

No, what is needed for people who care about fixing their data is a
loadable strip_invalid_utf8() that works in older versions.. then just
select * from bar where foo != strip_invalid_utf8(foo);  The function
would be useful in general, for example, if you have an application
which doesn't already have much utf8 logic, you want to use a text
field, and stripping is the behaviour you want. For example, lots of
simple web applications.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Upcoming PG re-releases

2005-12-06 Thread Bruce Momjian

I have added your suggestions to the 8.1.X release notes.

---

Paul Lindner wrote:
-- Start of PGP signed section.
 On Sat, Dec 03, 2005 at 10:54:08AM -0500, Bruce Momjian wrote:
  Neil Conway wrote:
   On Wed, 2005-11-30 at 10:56 -0500, Tom Lane wrote:
It's been about a month since 8.1.0 was released, and we've found about
the usual number of bugs for a new release, so it seems like it's time
for 8.1.1.
   
   I think one fix that should be made in time for 8.1.1 is adding a note
   to the version migration section of the 8.1 release notes describing
   the invalid UTF-8 byte sequence problems that some people have run
   into when upgrading from prior versions. I'm not familiar enough with
   the problem or its remedies to add the note myself, though.
  
  Agreed, but I don't understand the problem well enough either.  Does
  anyone?
 
 There was a thread a couple of weeks back about this problem.  Here's
 my sample writeup -- I give my permission for anyone to use it as they
 see fit:
 
 
 Upgrading UNICODE databases to 8.1
 
 Postgres 8.1 includes a number of bug-fixes and improvements to
 Unicode and UTF-8 character handling.  Unfortunately previous releases
 would accept character sequences that were not valid UTF-8.  This
 may cause problems when upgrading your database using
 pg_dump/pg_restore resulting in an error message like this:
 
   Invalid UNICODE byte sequence detected near byte ...
 
 To convert your pre-8.1 database to 8.1 you may have to remove and/or
 fix the offending characters.  One simple way to fix the problem is to
 run your pg_dump output through the iconv command like this:
 
   iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql
 
 The -c flag tells iconv to omit invalid characters from output.
 
 There is one problem with this.  Most versions of iconv try to read
 the entire input file into memory.  If you dump is quite large you
 will need to split the dump into multiple files and convert each one
 individually.  You must use the -l flag for split to insure that the
 unicode byte sequences are not split.
 
split -l 1 dump.sql
 
 Another possible solution is to use the --inserts flag to pg_dump.
 When you load the resulting data dump in 8.1 this will result in the
 problem rows showing up in your error log.
 
 -- 
 Paul Lindner| | | | |  |  |  |   |   |
 [EMAIL PROTECTED]
-- End of PGP section, PGP failed!

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Upcoming PG re-releases

2005-12-06 Thread Tom Lane
Bruce Momjian pgman@candle.pha.pa.us writes:
 I have added your suggestions to the 8.1.X release notes.

Did you read the followup discussion?  Recommending -c without a large
warning seems a very bad idea.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Upcoming PG re-releases

2005-12-06 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian pgman@candle.pha.pa.us writes:
  I have added your suggestions to the 8.1.X release notes.
 
 Did you read the followup discussion?  Recommending -c without a large
 warning seems a very bad idea.

Well, I said it would remove invalid sequences.  What else should we
say?

This will remove invalid character sequences.

I saw no clear solution that allowed sequences to be corrected.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Upcoming PG re-releases

2005-12-06 Thread Bruce Momjian
Bruce Momjian wrote:
 Tom Lane wrote:
  Bruce Momjian pgman@candle.pha.pa.us writes:
   I have added your suggestions to the 8.1.X release notes.
  
  Did you read the followup discussion?  Recommending -c without a large
  warning seems a very bad idea.
 
 Well, I said it would remove invalid sequences.  What else should we
 say?
 
   This will remove invalid character sequences.
 
 I saw no clear solution that allowed sequences to be corrected.

The release note text is:

Some users are having problems loading literalUTF8/ data into 8.1.X.
 This is because previous versions allowed invalid literalUTF8/
sequences to be entered into the database, and this release properly
accepts only valid literalUTF8/ sequences.  One way to correct a
dumpfile is to use commandiconv -c -f UTF-8 -t UTF-8/. This will
remove invalid character sequences. commandiconv/ reads the entire
input file into memory so it might be necessary to commandsplit/ the
dump into multiple smaller files for processing.

One nice solution would be if iconv would report the lines with errors
and you could correct them, but I see no way to do that.  The only thing
you could do is to diff the old and new files to see the problems.  Is
that helpful?  Here is new text I have used:

Some users are having problems loading literalUTF8/ data into 8.1.X.
 This is because previous versions allowed invalid literalUTF8/
sequences to be entered into the database, and this release properly
accepts only valid literalUTF8/ sequences.  One way to correct a
dumpfile is to use commandiconv -c -f UTF-8 -t UTF-8 -o cleanfile.sql
dumpfile.sql/.  The literal-c/ option removes invalid character
sequences.  A diff of the two files will show the sequences that are
invalid.  commandiconv/ reads the entire input file into memory so
it might be necessary to commandsplit/ the dump into multiple
smaller files for processing.

It highlights the 'diff' idea.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Upcoming PG re-releases

2005-12-06 Thread Peter Eisentraut
Bruce Momjian wrote:
 One nice solution would be if iconv would report the lines with
 errors and you could correct them, but I see no way to do that.  The
 only thing you could do is to diff the old and new files to see the
 problems.  Is that helpful?  Here is new text I have used:

I think this is nice.  It users see a big mess, they will have to clean
it up by hand anyway.

How about this for better wording:

diff -u -3 -p -r1.400.2.4 release.sgml
--- doc/src/sgml/release.sgml   6 Dec 2005 20:26:02 -   1.400.2.4
+++ doc/src/sgml/release.sgml   6 Dec 2005 20:44:26 -
@@ -528,15 +528,16 @@ psql -t -f fixseq.sql db1 | psql -e db1

  listitem
   para
-   Some users are having problems loading literalUTF8/ data into
-   8.1.X.  This is because previous versions allowed invalid 
literalUTF8/
+   Some users are having problems loading UTF-8 data into
+   8.1.X.  This is because previous versions allowed invalid UTF-8 byte
sequences to be entered into the database, and this release
-   properly accepts only valid literalUTF8/ sequences.  One
-   way to correct a dumpfile is to use commandiconv -c -f UTF-8 -t UTF-8
+   properly accepts only valid UTF-8 sequences.  One
+   way to correct a dumpfile is to run the command commandiconv -c -f 
UTF-8 -t UTF-8
-o cleanfile.sql dumpfile.sql/.  The literal-c/ option removes
invalid character sequences.  A diff of the two files will show the
sequences that are invalid.  commandiconv/ reads the entire input
-   file into memory so it might be necessary to commandsplit/ the dump
+   file into memory so it might be necessary to use commandsplit/
+   to break up the dump
into multiple smaller files for processing.
   /para
  /listitem

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Upcoming PG re-releases

2005-12-06 Thread Bruce Momjian

Nice, updated.

---

Peter Eisentraut wrote:
 Bruce Momjian wrote:
  One nice solution would be if iconv would report the lines with
  errors and you could correct them, but I see no way to do that.  The
  only thing you could do is to diff the old and new files to see the
  problems.  Is that helpful?  Here is new text I have used:
 
 I think this is nice.  It users see a big mess, they will have to clean
 it up by hand anyway.
 
 How about this for better wording:
 
 diff -u -3 -p -r1.400.2.4 release.sgml
 --- doc/src/sgml/release.sgml   6 Dec 2005 20:26:02 -   1.400.2.4
 +++ doc/src/sgml/release.sgml   6 Dec 2005 20:44:26 -
 @@ -528,15 +528,16 @@ psql -t -f fixseq.sql db1 | psql -e db1
 
   listitem
para
 -   Some users are having problems loading literalUTF8/ data into
 -   8.1.X.  This is because previous versions allowed invalid 
 literalUTF8/
 +   Some users are having problems loading UTF-8 data into
 +   8.1.X.  This is because previous versions allowed invalid UTF-8 byte
 sequences to be entered into the database, and this release
 -   properly accepts only valid literalUTF8/ sequences.  One
 -   way to correct a dumpfile is to use commandiconv -c -f UTF-8 -t 
 UTF-8
 +   properly accepts only valid UTF-8 sequences.  One
 +   way to correct a dumpfile is to run the command commandiconv -c -f 
 UTF-8 -t UTF-8
 -o cleanfile.sql dumpfile.sql/.  The literal-c/ option removes
 invalid character sequences.  A diff of the two files will show the
 sequences that are invalid.  commandiconv/ reads the entire input
 -   file into memory so it might be necessary to commandsplit/ the 
 dump
 +   file into memory so it might be necessary to use commandsplit/
 +   to break up the dump
 into multiple smaller files for processing.
/para
   /listitem
 
 -- 
 Peter Eisentraut
 http://developer.postgresql.org/~petere/
 
 ---(end of broadcast)---
 TIP 5: don't forget to increase your free space map settings
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Upcoming PG re-releases

2005-12-06 Thread Gavin Sherry
Hi,

On Tue, 6 Dec 2005, Bruce Momjian wrote:


 Nice, updated.

 ---


I think my suggestion from the other day is useful also.

---

Omar Kilani and I have spent a few hours looking at the problem. For
situations where there is a lot of invalid encoding, manual fixing is just
not viable. The vim project has a kind of fuzzy encoding conversion which
accounts for a lot of the non-UTF8 sequences in UTF8 data. You can use vim
to modify your text dump as follows:

vim -c :wq! ++enc=utf8 fixed.dump original.dump

---

I think this is a viable option for people with a non-trivial amount of
data and don't see manual fixing or potentially losing data as a viable
option.

Thanks,

Gavin

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Upcoming PG re-releases

2005-12-06 Thread Bruce Momjian

Exactly what does vim do that iconv does not?  Fuzzy encoding sounds
scary to me.

---

Gavin Sherry wrote:
 Hi,
 
 On Tue, 6 Dec 2005, Bruce Momjian wrote:
 
 
  Nice, updated.
 
  ---
 
 
 I think my suggestion from the other day is useful also.
 
 ---
 
 Omar Kilani and I have spent a few hours looking at the problem. For
 situations where there is a lot of invalid encoding, manual fixing is just
 not viable. The vim project has a kind of fuzzy encoding conversion which
 accounts for a lot of the non-UTF8 sequences in UTF8 data. You can use vim
 to modify your text dump as follows:
 
 vim -c :wq! ++enc=utf8 fixed.dump original.dump
 
 ---
 
 I think this is a viable option for people with a non-trivial amount of
 data and don't see manual fixing or potentially losing data as a viable
 option.
 
 Thanks,
 
 Gavin
 
 ---(end of broadcast)---
 TIP 5: don't forget to increase your free space map settings
 

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Upcoming PG re-releases

2005-12-04 Thread Paul Lindner
On Sat, Dec 03, 2005 at 10:54:08AM -0500, Bruce Momjian wrote:
 Neil Conway wrote:
  On Wed, 2005-11-30 at 10:56 -0500, Tom Lane wrote:
   It's been about a month since 8.1.0 was released, and we've found about
   the usual number of bugs for a new release, so it seems like it's time
   for 8.1.1.
  
  I think one fix that should be made in time for 8.1.1 is adding a note
  to the version migration section of the 8.1 release notes describing
  the invalid UTF-8 byte sequence problems that some people have run
  into when upgrading from prior versions. I'm not familiar enough with
  the problem or its remedies to add the note myself, though.
 
 Agreed, but I don't understand the problem well enough either.  Does
 anyone?

There was a thread a couple of weeks back about this problem.  Here's
my sample writeup -- I give my permission for anyone to use it as they
see fit:


Upgrading UNICODE databases to 8.1

Postgres 8.1 includes a number of bug-fixes and improvements to
Unicode and UTF-8 character handling.  Unfortunately previous releases
would accept character sequences that were not valid UTF-8.  This
may cause problems when upgrading your database using
pg_dump/pg_restore resulting in an error message like this:

  Invalid UNICODE byte sequence detected near byte ...

To convert your pre-8.1 database to 8.1 you may have to remove and/or
fix the offending characters.  One simple way to fix the problem is to
run your pg_dump output through the iconv command like this:

  iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql

The -c flag tells iconv to omit invalid characters from output.

There is one problem with this.  Most versions of iconv try to read
the entire input file into memory.  If you dump is quite large you
will need to split the dump into multiple files and convert each one
individually.  You must use the -l flag for split to insure that the
unicode byte sequences are not split.

   split -l 1 dump.sql

Another possible solution is to use the --inserts flag to pg_dump.
When you load the resulting data dump in 8.1 this will result in the
problem rows showing up in your error log.

-- 
Paul Lindner| | | | |  |  |  |   |   |
[EMAIL PROTECTED]


pgpmSxHGkaqYp.pgp
Description: PGP signature


Re: [HACKERS] Upcoming PG re-releases

2005-12-04 Thread Tom Lane
Paul Lindner [EMAIL PROTECTED] writes:
 To convert your pre-8.1 database to 8.1 you may have to remove and/or
 fix the offending characters.  One simple way to fix the problem is to
 run your pg_dump output through the iconv command like this:

   iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql

Is that really a one-size-fits-all solution?  Especially with -c?

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Upcoming PG re-releases

2005-12-04 Thread Paul Lindner
On Sun, Dec 04, 2005 at 11:34:16AM -0500, Tom Lane wrote:
 Paul Lindner [EMAIL PROTECTED] writes:
  To convert your pre-8.1 database to 8.1 you may have to remove and/or
  fix the offending characters.  One simple way to fix the problem is to
  run your pg_dump output through the iconv command like this:
 
iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql
 
 Is that really a one-size-fits-all solution?  Especially with -c?
 

I'd say yes, and the -c flag is needed so iconv strips out the
invalid characters.  

This technique worked for some smaller databases I converted and
croaked with out-of-memory on the larger ones.

It certainly doesn't make the problem worse.

If one wanted to fix this in the general case one could duplicate the
iconv behavior in the Postgres code via some kind of special
flag/setting that is only used for imports..

  set strip_bad_utf8 = on


-- 
Paul Lindner| | | | |  |  |  |   |   |
[EMAIL PROTECTED]


pgpuI4FgL4cek.pgp
Description: PGP signature


Re: [HACKERS] Upcoming PG re-releases

2005-12-04 Thread Tom Lane
Paul Lindner [EMAIL PROTECTED] writes:
 On Sun, Dec 04, 2005 at 11:34:16AM -0500, Tom Lane wrote:
 Paul Lindner [EMAIL PROTECTED] writes:
 iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql

 Is that really a one-size-fits-all solution?  Especially with -c?

 I'd say yes, and the -c flag is needed so iconv strips out the
 invalid characters.

That's exactly what's bothering me about it.  If we recommend that
we had better put a large THIS WILL DESTROY YOUR DATA warning first.
The problem is that the data is not invalid from the user's point
of view --- more likely, it's in some non-UTF8 encoding --- and so
just throwing away some of the characters is unlikely to make people
happy.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Upcoming PG re-releases

2005-12-04 Thread Martijn van Oosterhout
On Sun, Dec 04, 2005 at 12:19:32PM -0500, Gregory Maxwell wrote:
  That's exactly what's bothering me about it.  If we recommend that
  we had better put a large THIS WILL DESTROY YOUR DATA warning first.
  The problem is that the data is not invalid from the user's point
  of view --- more likely, it's in some non-UTF8 encoding --- and so
  just throwing away some of the characters is unlikely to make people
  happy.
 
 Nor is it even guarenteed to make the data load: If the column is
 unique constrained and the removal of the non-UTF characters makes two
 rows have the same data where they didn't before...
 
 The way to preserve the data is to switch the column to be a bytea.

Additionally, it's hard to suggest anything better without specific
knowledge of the characters that are incorrect and how they got there.

The ideal solution would be a way for people to identify problem data
*before* they dump so they have an opportunity to fix it. Something
like a module they can load and say:

select val from table where not utf8_validate(val);

This would allow people to examine the data while the system is still
running and fix it. Maybe we can code something up in plpgsql? Slow as
molasses but you'll be able to run it anywhere.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
 tool for doing 5% of the work and then sitting around waiting for someone
 else to do the other 95% so you can sue them.


pgpsYRKpfIZLw.pgp
Description: PGP signature


Re: [HACKERS] Upcoming PG re-releases

2005-12-04 Thread Gavin Sherry
Hi all,

On Sun, 4 Dec 2005, Tom Lane wrote:

 Paul Lindner [EMAIL PROTECTED] writes:
  To convert your pre-8.1 database to 8.1 you may have to remove and/or
  fix the offending characters.  One simple way to fix the problem is to
  run your pg_dump output through the iconv command like this:

iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql

 Is that really a one-size-fits-all solution?  Especially with -c?


It's definately not a one size fits all. The reassuring thing is that
others have tried to deal with this problem before.

Omar Kilani and I have spent a few hours looking at the problem. For
situations where there is a lot of invalid encoding, manual fixing is just
not viable. The vim project has a kind of fuzzy encoding conversion which
accounts for a lot of the non-UTF8 sequences in UTF8 data. You can use vim
to modify your text dump as follows:

vim -c :wq! ++enc=utf8 fixed.dump original.dump

Now, our testing of this is far from exhaustive but it's a lot better than
just cutting the data from the original dump. Those suffering the problem
should definately check this out, particularly if you have a non-trivial
amount of data.

Thanks,

Gavin

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Upcoming PG re-releases

2005-12-03 Thread Neil Conway
On Wed, 2005-11-30 at 10:56 -0500, Tom Lane wrote:
 It's been about a month since 8.1.0 was released, and we've found about
 the usual number of bugs for a new release, so it seems like it's time
 for 8.1.1.

I think one fix that should be made in time for 8.1.1 is adding a note
to the version migration section of the 8.1 release notes describing
the invalid UTF-8 byte sequence problems that some people have run
into when upgrading from prior versions. I'm not familiar enough with
the problem or its remedies to add the note myself, though.

-Neil



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Upcoming PG re-releases

2005-12-03 Thread Bruce Momjian
Neil Conway wrote:
 On Wed, 2005-11-30 at 10:56 -0500, Tom Lane wrote:
  It's been about a month since 8.1.0 was released, and we've found about
  the usual number of bugs for a new release, so it seems like it's time
  for 8.1.1.
 
 I think one fix that should be made in time for 8.1.1 is adding a note
 to the version migration section of the 8.1 release notes describing
 the invalid UTF-8 byte sequence problems that some people have run
 into when upgrading from prior versions. I'm not familiar enough with
 the problem or its remedies to add the note myself, though.

Agreed, but I don't understand the problem well enough either.  Does
anyone?

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  pgman@candle.pha.pa.us   |  (610) 359-1001
  +  If your life is a hard drive, |  13 Roberts Road
  +  Christ can be your backup.|  Newtown Square, Pennsylvania 19073

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Upcoming PG re-releases

2005-11-30 Thread Andrew Dunstan



Tom Lane wrote:


We will
at the same time be making new dot-releases in the 7.3, 7.4, and 8.0
branches, principally to fix the SLRU race condition reported by Jim
Nasby and Robert Creager.


 



Was there a conclusion out of the recent discussion on EOL policy? The 
consensus seemed to be something like: We will maintain releases to the 
best of our ability for at least 2 years plus 1 release cycle. After 
that, support may be dropped at any time when maintenance becomes 
difficult.


Have we actually officially stopped supporting the 7.2 series?

All this needs some announcement from the core trsam, IMNSHO - there has 
been some confusion over it (e.g. I saw someone recently saying we had 
stopped supporting the 7.3 series, which the above would seem to 
indicate is not true).


cheers

andrew

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Upcoming PG re-releases

2005-11-30 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Have we actually officially stopped supporting the 7.2 series?

Yeah, we have.  It reached the too difficult to support point already
(the VACUUM/ctid bug back in August --- the patch used in the later
branches wouldn't apply at all, IIRC).

 All this needs some announcement from the core trsam, IMNSHO - there has 
 been some confusion over it (e.g. I saw someone recently saying we had 
 stopped supporting the 7.3 series, which the above would seem to 
 indicate is not true).

Personally I expect to keep supporting 7.3 for a long while, because Red
Hat pays me to ;-) ... and the EOL date for RHEL3 is a long way away yet.
The PG community may stop bothering with 7.3 releases before that.  But
I think Marc and Bruce figure as long as the patches are in our CVS we
may as well put out a release.

We hashed all this out in the pghackers list back in August, but I agree
there ought to be something about it on the website.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Upcoming PG re-releases

2005-11-30 Thread Robert Treat
On Wednesday 30 November 2005 11:40, Tom Lane wrote:
 Personally I expect to keep supporting 7.3 for a long while, because Red
 Hat pays me to ;-) ... and the EOL date for RHEL3 is a long way away yet.
 The PG community may stop bothering with 7.3 releases before that.  But
 I think Marc and Bruce figure as long as the patches are in our CVS we
 may as well put out a release.


Yeah, thats one of the reasons I am skeptical about having official policies 
on this type of thing.  If Sun decided they wanted to maintain 7.2 and were 
going to dedicate developers and testing for it, would we really turn that 
away?  OK, I don't really want to have this discussion again, but as of now I 
think we are all agreed that 7.2 is unsupported. 

 We hashed all this out in the pghackers list back in August, but I agree
 there ought to be something about it on the website.


We've been kicking it around but haven't moved much on this...

Marc, can you move the 7.2 branches in the FTP under the OLD directory?
http://www.postgresql.org/ftp/source/

We need to do the same with 7.2 documentation, moving them into the Manual 
Archive http://www.postgresql.org/docs/manuals/archive.html.  We can also 
change the caption on the main documentation page to note these are manuals 
for the current supported versions. 

-- 
Robert Treat
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Upcoming PG re-releases

2005-11-30 Thread Andrew Dunstan
Tom Lane said:


 We hashed all this out in the pghackers list back in August, but I
 agree there ought to be something about it on the website.


The reason I asked again is that, notwithstanding the recent discussion, I
have observed confusion about the matter (including Jan telling me he didn't
think there was any agreed policy).

cheers

andrew



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Upcoming PG re-releases

2005-11-30 Thread David Fetter
On Wed, Nov 30, 2005 at 01:23:38PM -0500, Robert Treat wrote:
 On Wednesday 30 November 2005 11:40, Tom Lane wrote:
  Personally I expect to keep supporting 7.3 for a long while,
  because Red Hat pays me to ;-) ... and the EOL date for RHEL3 is a
  long way away yet.  The PG community may stop bothering with 7.3
  releases before that.  But I think Marc and Bruce figure as long
  as the patches are in our CVS we may as well put out a release.
 
 Yeah, thats one of the reasons I am skeptical about having official
 policies on this type of thing.

I see this as an excellent reason to draw a bright, sharp line between
what vendors support and what the community as a whole does,
especially where individual community members wear another hat.

 If Sun decided they wanted to maintain 7.2 and were going to
 dedicate developers and testing for it, would we really turn that
 away?

If any company chooses to support versions that the community is no
longer supporting, that can be part of their value-add or more
properly, their headache.  Making commitments on behalf of the
community--which will be held responsible for them no matter what
happens--based on what some company says it's going to do this week is
*extremely* ill-advised.

 OK, I don't really want to have this discussion again, but as of now
 I think we are all agreed that 7.2 is unsupported. 

And it's good that we're making more definite moves to show that we no
longer support it :)

  We hashed all this out in the pghackers list back in August, but I agree
  there ought to be something about it on the website.
 
 
 We've been kicking it around but haven't moved much on this...
 
 Marc, can you move the 7.2 branches in the FTP under the OLD directory?
 http://www.postgresql.org/ftp/source/
 
 We need to do the same with 7.2 documentation, moving them into the Manual 
 Archive http://www.postgresql.org/docs/manuals/archive.html.  We can also 
 change the caption on the main documentation page to note these are manuals 
 for the current supported versions. 

Excellent :)

Cheers,
D
-- 
David Fetter [EMAIL PROTECTED] http://fetter.org/
phone: +1 415 235 3778

Remember to vote!

---(end of broadcast)---
TIP 6: explain analyze is your friend