Re: [GENERAL] finding bogus UTF-8

2011-02-16 Thread Vick Khera
On Tue, Feb 15, 2011 at 5:06 PM, Geoffrey Myers li...@serioustechnology.com wrote: I toyed with tr for a bit, but could not get it to work.  The above did not work for me either.  Not exactly sure what it's doing, but here's a couple of diff lines: check your shell escaping. You may need \\

Re: [GENERAL] finding bogus UTF-8

2011-02-15 Thread Geoffrey Myers
Glenn Maynard wrote: On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe scott_r...@elevated-dev.com wrote: I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before

Re: [GENERAL] finding bogus UTF-8

2011-02-15 Thread Marko Kreen
On Thu, Feb 10, 2011 at 9:02 PM, Scott Ribe scott_r...@elevated-dev.com wrote: I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest

Re: [GENERAL] finding bogus UTF-8

2011-02-15 Thread Vick Khera
On Tue, Feb 15, 2011 at 11:09 AM, Geoffrey Myers li...@serioustechnology.com wrote: comments would be appreciated. If all you're doing is filtering stdin to stdout and deleting a range of characters, it seems that tr would be a faster tool: cat foo.txt | tr -d '\000-\010\013-\037\177-\377'
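
For readers who could not get the tr escaping to work, the same byte filter can be sketched in Python. This is only an illustration, not code from the thread: it assumes the intended ranges are every control byte except tab and newline, plus 0x7F through 0xFF, and the names DELETE and strip_bytes are made up for the example.

```python
# Bytes to delete (an assumption about what the tr ranges were meant to cover):
# 0x00-0x08, 0x0B-0x1F, and 0x7F-0xFF. Tab (0x09) and newline (0x0A) are kept.
DELETE = (
    bytes(range(0x00, 0x09))
    + bytes(range(0x0B, 0x20))
    + bytes(range(0x7F, 0x100))
)


def strip_bytes(data: bytes) -> bytes:
    """Return data with every byte listed in DELETE removed."""
    # bytes.translate(None, delete) drops the listed bytes and maps nothing.
    return data.translate(None, DELETE)


cleaned = strip_bytes(b"a\x00b\tc\nd\xe9e\x7f")
print(cleaned)
```

Note that a filter like this removes every byte above 0x7E, so it also destroys valid multi-byte UTF-8 characters. It suits data that is supposed to be plain ASCII, not data that should be repaired into UTF-8.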

Re: [GENERAL] finding bogus UTF-8

2011-02-15 Thread Geoffrey Myers
Vick Khera wrote: On Tue, Feb 15, 2011 at 11:09 AM, Geoffrey Myers li...@serioustechnology.com wrote: comments would be appreciated. If all you're doing is filtering stdin to stdout and deleting a range of characters, it seems that tr would be a faster tool: cat foo.txt | tr -d

[GENERAL] finding bogus UTF-8

2011-02-10 Thread Scott Ribe
I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a query to find such values? -- Scott Ribe scott_r...@elevated-dev.com
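
This is not the SQL query asked for, but one client-side way to approach the same question can be sketched in Python, assuming the values are available as raw bytes (for example, read from a dump file). The function name invalid_utf8_rows and the sample rows are hypothetical and only serve the illustration.

```python
def invalid_utf8_rows(rows):
    """Yield (row_id, raw_bytes) for each value that is not valid UTF-8.

    `rows` is any iterable of (row_id, bytes) pairs; where those bytes
    come from is left open here.
    """
    for row_id, raw in rows:
        try:
            # Strict decoding raises on any malformed byte sequence.
            raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            yield row_id, raw


# Made-up sample data: row 3 holds a Latin-1 encoded e-acute (0xE9),
# which is not a valid UTF-8 sequence on its own.
sample = [
    (1, b"plain ascii"),
    (2, "caf\u00e9".encode("utf-8")),
    (3, b"caf\xe9"),
]
bad = list(invalid_utf8_rows(sample))
print(bad)
```

The check relies only on Python's strict UTF-8 decoder, so it reports which values fail but makes no guess about what encoding they were originally in.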

Re: [GENERAL] finding bogus UTF-8

2011-02-10 Thread dennis jenkins
I'm working on a project to convert a large database from SQL_ASCII to UTF-8. I am using this procedure: 1) pg_dump the SQL_ASCII database to an SQL text file. 2) Run through a small (efficient) C program that logs each line that contains ANY unclean ASCII text. 3) Parse that log with a small
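
Step 2 of the procedure above can be sketched roughly as follows. This is not the C program mentioned in the message, which is not shown here. It is a Python illustration that assumes "unclean" means any byte outside printable ASCII other than tab, and the name unclean_lines is hypothetical.

```python
import io


def unclean_lines(stream):
    """Yield (line_number, line_bytes) for lines with non-clean ASCII bytes.

    Assumption: a byte is "clean" if it is printable ASCII (0x20-0x7E)
    or a tab (0x09). Line terminators are stripped before checking.
    """
    for lineno, line in enumerate(stream, start=1):
        body = line.rstrip(b"\r\n")
        if any(b > 0x7E or (b < 0x20 and b != 0x09) for b in body):
            yield lineno, body


# Stand-in for a dump file opened in binary mode.
dump = io.BytesIO(b"INSERT ok\nINSERT caf\xe9\nINSERT tab\there\n")
report = list(unclean_lines(dump))
print(report)
```

In real use the stream would be a dump file opened with mode "rb", or sys.stdin.buffer when reading from a pipe, so that bytes are inspected before any decoding takes place.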

Re: [GENERAL] finding bogus UTF-8

2011-02-10 Thread dennis jenkins
On Thu, Feb 10, 2011 at 1:02 PM, Scott Ribe scott_r...@elevated-dev.com wrote: I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest

Re: [GENERAL] finding bogus UTF-8

2011-02-10 Thread dennis jenkins
If you are interested, I can email to you the C and Perl source. It runs like this: # time pg_restore /db-dumps/some_ascii_pgdump.bin | ./ascii-tester | ./bad-ascii-report.pl unclean-ascii.rpt http://www.ecoligames.com/~djenkins/pgsql/ Disclaimer: I offer NO warranty. Use at your own

Re: [GENERAL] finding bogus UTF-8

2011-02-10 Thread Glenn Maynard
On Thu, Feb 10, 2011 at 2:02 PM, Scott Ribe scott_r...@elevated-dev.com wrote: I know that I have at least one instance of a varchar that is not valid UTF-8, imported from a source with errors (AMA CPT files, actually) before PG's checking was as stringent as it is today. Can anybody suggest a