Re: [GENERAL] Invalid byte sequence for encoding UTF8: 0xedbebf

2011-06-17 Thread Albe Laurenz
BRUSSER Michael wrote:
 Is there a way to find the records with the text field containing
Unicode bytes 0xedbebf?
 Unfortunately this is a very old version 7.3.10

 This should work on 7.3 (according to the documentation):
 SELECT id FROM nlsdata WHERE position('\360\235\204\236'::bytea IN
val::bytea) = 1;

 Albe, thanks for pointing this out!

 I made a minor change, added decode since text cannot be cast to bytea
and tried something like this:
  SELECT id FROM myTable WHERE position('\360\235\204\236'::bytea IN
decode(myTextField, 'escape')) !=0
  ERROR:  decode: Bad input string for type bytea

Hrm. I didn't know that there was no cast from text to bytea in 7.3.

 Maybe this explains why?
 testdb=# select decode('\360\235\204\236'::text, 'escape');
 ERROR:  Unicode = 0x1 is not supported

No, that is an error on my side. I gave you the wrong byte sequence.

For 0xedbebf you should actually write '\355\276\277'. But that's no
valid UTF-8 sequence.

 but I'm not ready to give up yet...

If you know the byte sequence that causes trouble, you could also use
something like sed
to search and replace it in the dump file.

Or (if there are not too many) you could search for the pattern and
identify the rows
in the database. Then you know which database rows to update.

Yours,
Laurenz Albe

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Invalid byte sequence for encoding UTF8: 0xedbebf

2011-06-16 Thread Albe Laurenz
BRUSSER Michael wrote:
 Is there a way to find the records with the text field containing
Unicode bytes 0xedbebf?

 Unfortunately this is a very old version 7.3.10

This should work on 7.3 (according to the documentation):

SELECT id FROM nlsdata WHERE position('\360\235\204\236'::bytea IN
val::bytea) = 1;

Here val is the column containing the values in question.

Yours,
Laurenz Albe

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Invalid byte sequence for encoding UTF8: 0xedbebf

2011-06-16 Thread BRUSSER Michael
-Original Message-
From: Albe Laurenz [mailto:laurenz.a...@wien.gv.at]
Sent: Thursday, June 16, 2011 5:16 AM
To: BRUSSER Michael; pgsql-general@postgresql.org
Subject: RE: [GENERAL] Invalid byte sequence for encoding UTF8: 0xedbebf

BRUSSER Michael wrote:
 Is there a way to find the records with the text field containing Unicode 
 bytes 0xedbebf?
 Unfortunately this is a very old version 7.3.10

This should work on 7.3 (according to the documentation):
SELECT id FROM nlsdata WHERE position('\360\235\204\236'::bytea IN val::bytea) 
= 1;



Albe, thanks for pointing this out!

I made a minor change, added decode since text cannot be cast to bytea and 
tried something like this:
  SELECT id FROM myTable WHERE position('\360\235\204\236'::bytea IN 
decode(myTextField, 'escape')) != 0
  ERROR:  decode: Bad input string for type bytea

If I limit query to some healthy records -  AND id between 100 and 110 it 
works and returns empty result.
So the problem now is that without decode myTextField cannot be converted to 
bytea, with decode it breaks on the first 'bad' value.
Maybe this explains why?
testdb=# select decode('\360\235\204\236'::text, 'escape');
ERROR:  Unicode = 0x1 is not supported

Another thought is that if I get this to work I may need to search for anything 
outside of the standard utf range,
rather than any specific sequence. I am beginning to understand why many people 
dealt with this in the dump file,
but I'm not ready to give up yet...

As usual, any ideas are appreciated!
Thanks.


This email and any attachments are intended solely for the use of the 
individual or entity to whom it is addressed and may be confidential and/or 
privileged.

If you are not one of the named recipients or have received this email in error,

(i) you should not read, disclose, or copy it,

(ii) please notify sender of your receipt by reply email and delete this email 
and all attachments,

(iii) Dassault Systemes does not accept or assume any liability or 
responsibility for any use of or reliance on this email.

For other languages, go to http://www.3ds.com/terms/email-disclaimer

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Invalid byte sequence for encoding UTF8: 0xedbebf

2011-06-15 Thread BRUSSER Michael
This is a follow-up on my previous message 
http://archives.postgresql.org/pgsql-general/2011-06/msg00054.php

I think I have now some understanding of what's causing the problem, but I 
don't have a good solution, instead more questions.
The release notes for v8.1 at 
http://www.postgresql.org/docs/current/interactive/release-8-1.html
make a good suggestion on using iconv to convert the plain-text dump file into 
utf8
On Linux this did not work, the input and output file were identical. The iconv 
on Solaris  refused to open the input file
(probably too big),  although it worked with a chunk of it and reported 
conversion error.

Unless there's no other options I don't want to use sed or break file into 
pieces, if possible, I would prefer to identify the bad records on the database.
I tried SELECT with everything  I could think of:  ~*, SIMILAR TO, and the 
likes of them, but I never got it right.

Is there a way to find the records with the text field containing Unicode bytes 
0xedbebf?
Unfortunately this is a very old version 7.3.10

Thank you.
Michael.

This email and any attachments are intended solely for the use of the 
individual or entity to whom it is addressed and may be confidential and/or 
privileged.

If you are not one of the named recipients or have received this email in error,

(i) you should not read, disclose, or copy it,

(ii) please notify sender of your receipt by reply email and delete this email 
and all attachments,

(iii) Dassault Systemes does not accept or assume any liability or 
responsibility for any use of or reliance on this email.

For other languages, go to http://www.3ds.com/terms/email-disclaimer


Re: [GENERAL] Invalid byte sequence for encoding UTF8: 0xedbebf

2011-06-15 Thread Alan Hodgson
On June 15, 2011 01:18:27 PM BRUSSER Michael wrote:
 Unless there's no other options I don't want to use sed or break file into
 pieces, if possible,

iconv loads everything into RAM. You can use split, convert the pieces, and 
then recombine, I did that when converting a large database to utf-8 and it 
worked.

-- 
Obama has now fired more cruise missiles than all other Nobel Peace prize 
winners combined.

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Invalid byte sequence for encoding UTF8: 0xedbebf

2011-06-15 Thread BRUSSER Michael
-Original Message-
From: pgsql-general-ow...@postgresql.org 
[mailto:pgsql-general-ow...@postgresql.org] On Behalf Of Alan Hodgson
Sent: Wednesday, June 15, 2011 5:37 PM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Invalid byte sequence for encoding UTF8: 0xedbebf

On June 15, 2011 01:18:27 PM BRUSSER Michael wrote:
 Unless there's no other options I don't want to use sed or break file into
 pieces, if possible,

--

 iconv loads everything into RAM. You can use split, convert the pieces, and
 then recombine, I did that when converting a large database to utf-8 and it
 worked.

-

- Thanks, but this is exactly what I am trying to avoid!
Using split is good if you have one database to upgrade and no external 
customers.
(Not to mention other problems with this approach)

This email and any attachments are intended solely for the use of the 
individual or entity to whom it is addressed and may be confidential and/or 
privileged.

If you are not one of the named recipients or have received this email in error,

(i) you should not read, disclose, or copy it,

(ii) please notify sender of your receipt by reply email and delete this email 
and all attachments,

(iii) Dassault Systemes does not accept or assume any liability or 
responsibility for any use of or reliance on this email.

For other languages, go to http://www.3ds.com/terms/email-disclaimer

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general