Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade

2018-04-16 Thread Keith Fiske
On Mon, Apr 16, 2018 at 2:26 PM, Adrian Klaver 
wrote:

> On 04/16/2018 10:18 AM, Keith Fiske wrote:
>
>>
>>
>> On Mon, Apr 16, 2018 at 12:21 PM, Tom Lane  t...@sss.pgh.pa.us>> wrote:
>>
>>
>
>> So playing around with the "client_encoding" configuration option in
>> postgresql.conf. According to the docs, setting this should set the default
>> encoding for any client that connects, right?
>>
>> https://www.postgresql.org/docs/10/static/runtime-config-cli
>> ent.html#GUC-CLIENT-ENCODING > cs/10/static/runtime-config-client.html#GUC-CLIENT-ENCODING>
>>
>> That description also seems misleading in saying that the default client
>> encoding is what the database encoding is as well. At least as far as psql
>> is concerned, right? I've tried setting that value in postgresql.conf but
>> psql keeps setting it to UTF8 when I connect to the SQL_ASCII database
>> unless I then specifically run "set client_encoding = 'SQL_ASCII';"
>>
>
> Realized that while my previous suggestion of setting PGCLIENTENCODING to
> 'SQL_ASCII' would work it would also affect libpq programs beside psql.
>
> For a finer grained approach set \encoding 'SQL_ASCII' in a .psqlrc file.:
>
> https://www.postgresql.org/docs/10/static/app-psql.html
>
> "Files
>
> psqlrc and ~/.psqlrc
>
> Unless it is passed an -X option, psql attempts to read and execute
> commands from the system-wide startup file (psqlrc) and then the user's
> personal startup file (~/.psqlrc), after connecting to the database but
> before accepting normal commands. These files can be used to set up the
> client and/or the server to taste, typically with \set and SET commands.
>
> The system-wide startup file is named psqlrc and is sought in the
> installation's “system configuration” directory, which is most reliably
> identified by running pg_config --sysconfdir. By default this directory
> will be ../etc/ relative to the directory containing the PostgreSQL
> executables. The name of this directory can be set explicitly via the
> PGSYSCONFDIR environment variable.
>
> The user's personal startup file is named .psqlrc and is sought in the
> invoking user's home directory. On Windows, which lacks such a concept, the
> personal startup file is named %APPDATA%\postgresql\psqlrc.conf. The
> location of the user's startup file can be set explicitly via the PSQLRC
> environment variable.
>
> Both the system-wide startup file and the user's personal startup file
> can be made psql-version-specific by appending a dash and the PostgreSQL
> major or minor release number to the file name, for example ~/.psqlrc-9.2
> or ~/.psqlrc-9.2.5. The most specific version-matching file will be read in
> preference to a non-version-specific file.
>
> "
>
>
>> --
>> Keith Fiske
>> Senior Database Engineer
>> Crunchy Data - http://crunchydata.com
>>
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>

Adrian,

Thanks for all the info! This will definitely help a lot getting the new
environment set up until we can start working on migrating off SQL_ASCII.

-- 
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com


Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade

2018-04-16 Thread Adrian Klaver

On 04/16/2018 10:18 AM, Keith Fiske wrote:



On Mon, Apr 16, 2018 at 12:21 PM, Tom Lane > wrote:






So playing around with the "client_encoding" configuration option in 
postgresql.conf. According to the docs, setting this should set the 
default encoding for any client that connects, right?


https://www.postgresql.org/docs/10/static/runtime-config-client.html#GUC-CLIENT-ENCODING 



That description also seems misleading in saying that the default client 
encoding is what the database encoding is as well. At least as far as 
psql is concerned, right? I've tried setting that value in 
postgresql.conf but psql keeps setting it to UTF8 when I connect to the 
SQL_ASCII database unless I then specifically run "set client_encoding = 
'SQL_ASCII';"


Realized that while my previous suggestion of setting PGCLIENTENCODING 
to 'SQL_ASCII' would work it would also affect libpq programs beside psql.


For a finer grained approach set \encoding 'SQL_ASCII' in a .psqlrc file.:

https://www.postgresql.org/docs/10/static/app-psql.html

"Files

psqlrc and ~/.psqlrc

Unless it is passed an -X option, psql attempts to read and execute 
commands from the system-wide startup file (psqlrc) and then the user's 
personal startup file (~/.psqlrc), after connecting to the database but 
before accepting normal commands. These files can be used to set up the 
client and/or the server to taste, typically with \set and SET commands.


The system-wide startup file is named psqlrc and is sought in the 
installation's “system configuration” directory, which is most reliably 
identified by running pg_config --sysconfdir. By default this directory 
will be ../etc/ relative to the directory containing the PostgreSQL 
executables. The name of this directory can be set explicitly via the 
PGSYSCONFDIR environment variable.


The user's personal startup file is named .psqlrc and is sought in 
the invoking user's home directory. On Windows, which lacks such a 
concept, the personal startup file is named 
%APPDATA%\postgresql\psqlrc.conf. The location of the user's startup 
file can be set explicitly via the PSQLRC environment variable.


Both the system-wide startup file and the user's personal startup 
file can be made psql-version-specific by appending a dash and the 
PostgreSQL major or minor release number to the file name, for example 
~/.psqlrc-9.2 or ~/.psqlrc-9.2.5. The most specific version-matching 
file will be read in preference to a non-version-specific file.

"



--
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com



--
Adrian Klaver
adrian.kla...@aklaver.com



Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade

2018-04-16 Thread Adrian Klaver

On 04/16/2018 10:18 AM, Keith Fiske wrote:







So playing around with the "client_encoding" configuration option in 
postgresql.conf. According to the docs, setting this should set the 
default encoding for any client that connects, right?


https://www.postgresql.org/docs/10/static/runtime-config-client.html#GUC-CLIENT-ENCODING 



That description also seems misleading in saying that the default client 
encoding is what the database encoding is as well. At least as far as 
psql is concerned, right? I've tried setting that value in 
postgresql.conf but psql keeps setting it to UTF8 when I connect to the 
SQL_ASCII database unless I then specifically run "set client_encoding = 
'SQL_ASCII';"


The order that parameters are evaluated for libpq programs is:

https://www.postgresql.org/docs/10/static/libpq-connect.html

"PQconnectdbParams

...


If any parameter is NULL or an empty string, the corresponding 
environment variable (see Section 33.14) is checked. If the environment 
variable is not set either, then the indicated built-in defaults are used.


...

"



--
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com



--
Adrian Klaver
adrian.kla...@aklaver.com



Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade

2018-04-16 Thread Keith Fiske
On Mon, Apr 16, 2018 at 12:30 PM, Vick Khera  wrote:

> Hi Keith,
>
> Not sure if this will help but a couple of years ago I migrated from an
> SQL_ASCII encoding to UTF8. The data was primarily ASCII with some windows
> garbage, and a little bit of UTF8 from customers filling out forms that
> were not specifically encoded anything.
>
> I wrote a utility that in-place scans and updates the tables in your
> SQL_ASCII-encoded database and ensures that everything is 100% UTF8 NFC at
> the end. For us, there were some characters in some bizarre local
> encodings, and we had to either toss or make educated guesses for them.
>
> After the cleaning, you dump with client encoding UTF8, then restore into
> the final database with UTF8 encoding.
>
> You can find it on my github along with documentation and tests to verify
> it works: https://github.com/khera/utf8-inline-cleaner
>
> On Mon, Apr 16, 2018 at 11:16 AM, Keith Fiske  > wrote:
>
>> Running into an issue with helping a client upgrade from 8.3 to 10 (yes,
>> I know, please keep the out of support comments to a minimum, thanks :).
>>
>> The old database was in SQL_ASCII and it needs to stay that way for now
>> unfortunately. The dump and restore itself works fine, but we're now
>> running into issues with some data returning encoding errors unless we
>> specifically set the client_encoding value to SQL_ASCII.
>>
>> Looking at the 8.3 database, it has the client_encoding value set to UTF8
>> and queries seem to work fine. Is this just a bug in the old 8.3 not
>> enforcing encoding properly?
>>
>> The other thing I noticed on the 10 instance was that, while the LOCALE
>> was set to SQL_ASCII, the COLLATE and CTYPE values for the restored
>> databases were en_US.UTF-8. Could this be having an affect? Is there any
>> way to see what these values were on the old 8.3 database? The pg_database
>> catalog does not have these values stored back then.
>>
>> --
>> Keith Fiske
>> Senior Database Engineer
>> Crunchy Data - http://crunchydata.com
>>
>
>

Thanks Vick! We will hopefully be helping them to get off SQL_ASCII after
this upgrade. Was challenging enough to get the upgrade itself done, so
doing the encoding migration at the same time unfortunately wasn't
possible. It's more than just the database that needs fixing, it's an
entire data ingestion process as well.

-- 
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com


Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade

2018-04-16 Thread Keith Fiske
On Mon, Apr 16, 2018 at 12:21 PM, Tom Lane  wrote:

> Keith Fiske  writes:
> > On Mon, Apr 16, 2018 at 12:09 PM, Tom Lane  wrote:
> >> This is not a great idea, no.  You could be getting strange misbehaviors
> >> in e.g. string comparison, because strcoll() will expect UTF8 data and
> >> will likely not cope well with data that isn't valid in that encoding.
>
> > And pg_controldata was able to show that the CTYPE and COLLATE were UTF8
> on
> > the old system. If that's the case, do you still think it's a good idea
> to
> > set the COLLATE and CTYPE to "C"?
>
> Well, if the customer's been happy with the behavior of the system so far,
> maybe it's all right.  But this is sure the first thing I'd look at if
> there are any gripes about its behavior with non-UTF8 strings.  I'd be
> especially worried about this if you try to migrate the database to any
> new platform, as it's a bet about the behavior of libc not PG itself.
>
> regards, tom lane
>


This is going from RHEL 6.7 to RHEL 7.4

It is a dump and restore upgrade as well.

-- 
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com


Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade

2018-04-16 Thread Vick Khera
Hi Keith,

Not sure if this will help but a couple of years ago I migrated from an
SQL_ASCII encoding to UTF8. The data was primarily ASCII with some windows
garbage, and a little bit of UTF8 from customers filling out forms that
were not specifically encoded anything.

I wrote a utility that in-place scans and updates the tables in your
SQL_ASCII-encoded database and ensures that everything is 100% UTF8 NFC at
the end. For us, there were some characters in some bizarre local
encodings, and we had to either toss or make educated guesses for them.

After the cleaning, you dump with client encoding UTF8, then restore into
the final database with UTF8 encoding.

You can find it on my github along with documentation and tests to verify
it works: https://github.com/khera/utf8-inline-cleaner

On Mon, Apr 16, 2018 at 11:16 AM, Keith Fiske 
wrote:

> Running into an issue with helping a client upgrade from 8.3 to 10 (yes, I
> know, please keep the out of support comments to a minimum, thanks :).
>
> The old database was in SQL_ASCII and it needs to stay that way for now
> unfortunately. The dump and restore itself works fine, but we're now
> running into issues with some data returning encoding errors unless we
> specifically set the client_encoding value to SQL_ASCII.
>
> Looking at the 8.3 database, it has the client_encoding value set to UTF8
> and queries seem to work fine. Is this just a bug in the old 8.3 not
> enforcing encoding properly?
>
> The other thing I noticed on the 10 instance was that, while the LOCALE
> was set to SQL_ASCII, the COLLATE and CTYPE values for the restored
> databases were en_US.UTF-8. Could this be having an affect? Is there any
> way to see what these values were on the old 8.3 database? The pg_database
> catalog does not have these values stored back then.
>
> --
> Keith Fiske
> Senior Database Engineer
> Crunchy Data - http://crunchydata.com
>


Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade

2018-04-16 Thread Tom Lane
Keith Fiske  writes:
> On Mon, Apr 16, 2018 at 12:09 PM, Tom Lane  wrote:
>> This is not a great idea, no.  You could be getting strange misbehaviors
>> in e.g. string comparison, because strcoll() will expect UTF8 data and
>> will likely not cope well with data that isn't valid in that encoding.

> And pg_controldata was able to show that the CTYPE and COLLATE were UTF8 on
> the old system. If that's the case, do you still think it's a good idea to
> set the COLLATE and CTYPE to "C"?

Well, if the customer's been happy with the behavior of the system so far,
maybe it's all right.  But this is sure the first thing I'd look at if
there are any gripes about its behavior with non-UTF8 strings.  I'd be
especially worried about this if you try to migrate the database to any
new platform, as it's a bet about the behavior of libc not PG itself.

regards, tom lane



Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade

2018-04-16 Thread Keith Fiske
On Mon, Apr 16, 2018 at 12:09 PM, Tom Lane  wrote:

> Keith Fiske  writes:
> > Running into an issue with helping a client upgrade from 8.3 to 10 (yes,
> I
> > know, please keep the out of support comments to a minimum, thanks :).
>
> > The old database was in SQL_ASCII and it needs to stay that way for now
> > unfortunately. The dump and restore itself works fine, but we're now
> > running into issues with some data returning encoding errors unless we
> > specifically set the client_encoding value to SQL_ASCII.
>
> I'm guessing you might be hitting this 9.1 change:
>
> * Have psql set the client encoding from the operating system locale
>   by default (Heikki Linnakangas)
>
>   This only happens if the PGCLIENTENCODING environment variable is
>   not set.
>
> I think the previous default was to set client encoding equal to the
> server encoding.
>
> > Looking at the 8.3 database, it has the client_encoding value set to UTF8
> > and queries seem to work fine. Is this just a bug in the old 8.3 not
> > enforcing encoding properly?
>
> Somewhere along the line we made SQL_ASCII -> something else conversions
> check that the data was valid per the other encoding, even though no
> actual data change happens.
>
> > The other thing I noticed on the 10 instance was that, while the LOCALE
> was
> > set to SQL_ASCII,
>
> You mean encoding, I assume.
>
> > the COLLATE and CTYPE values for the restored databases
> > were en_US.UTF-8. Could this be having an affect?
>
> This is not a great idea, no.  You could be getting strange misbehaviors
> in e.g. string comparison, because strcoll() will expect UTF8 data and
> will likely not cope well with data that isn't valid in that encoding.
>
> If you can't sanitize the encoding of your data, I'd suggest running
> with lc_collate and lc_ctype set to "C".
>
> regards, tom lane
>


Thanks to both of you Adrian & Tom.

It is the 9.1 change to the psql client that seems to be causing this.

And pg_controldata was able to show that the CTYPE and COLLATE were UTF8 on
the old system. If that's the case, do you still think it's a good idea to
set the COLLATE and CTYPE to "C"?


-- 
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com


Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade

2018-04-16 Thread Tom Lane
Keith Fiske  writes:
> Running into an issue with helping a client upgrade from 8.3 to 10 (yes, I
> know, please keep the out of support comments to a minimum, thanks :).

> The old database was in SQL_ASCII and it needs to stay that way for now
> unfortunately. The dump and restore itself works fine, but we're now
> running into issues with some data returning encoding errors unless we
> specifically set the client_encoding value to SQL_ASCII.

I'm guessing you might be hitting this 9.1 change:

* Have psql set the client encoding from the operating system locale
  by default (Heikki Linnakangas)

  This only happens if the PGCLIENTENCODING environment variable is
  not set.

I think the previous default was to set client encoding equal to the
server encoding.

> Looking at the 8.3 database, it has the client_encoding value set to UTF8
> and queries seem to work fine. Is this just a bug in the old 8.3 not
> enforcing encoding properly?

Somewhere along the line we made SQL_ASCII -> something else conversions
check that the data was valid per the other encoding, even though no
actual data change happens.

> The other thing I noticed on the 10 instance was that, while the LOCALE was
> set to SQL_ASCII,

You mean encoding, I assume.

> the COLLATE and CTYPE values for the restored databases
> were en_US.UTF-8. Could this be having an affect?

This is not a great idea, no.  You could be getting strange misbehaviors
in e.g. string comparison, because strcoll() will expect UTF8 data and
will likely not cope well with data that isn't valid in that encoding.

If you can't sanitize the encoding of your data, I'd suggest running
with lc_collate and lc_ctype set to "C".

regards, tom lane



Re: client_encoding issue with SQL_ASCII on 8.3 to 10 upgrade

2018-04-16 Thread Adrian Klaver

On 04/16/2018 08:16 AM, Keith Fiske wrote:
Running into an issue with helping a client upgrade from 8.3 to 10 (yes, 
I know, please keep the out of support comments to a minimum, thanks :).


The old database was in SQL_ASCII and it needs to stay that way for now 
unfortunately. The dump and restore itself works fine, but we're now 
running into issues with some data returning encoding errors unless we 
specifically set the client_encoding value to SQL_ASCII.


Looking at the 8.3 database, it has the client_encoding value set to 
UTF8 and queries seem to work fine. Is this just a bug in the old 8.3 
not enforcing encoding properly?e


AFAIK, SQL_ASCII basically means no encoding:

https://www.postgresql.org/docs/10/static/multibyte.html

"The SQL_ASCII setting behaves considerably differently from the other 
settings. When the server character set is SQL_ASCII, the server 
interprets byte values 0-127 according to the ASCII standard, while byte 
values 128-255 are taken as uninterpreted characters. No encoding 
conversion will be done when the setting is SQL_ASCII. Thus, this 
setting is not so much a declaration that a specific encoding is in use, 
as a declaration of ignorance about the encoding. In most cases, if you 
are working with any non-ASCII data, it is unwise to use the SQL_ASCII 
setting because PostgreSQL will be unable to help you by converting or 
validating non-ASCII characters."



What client are you working with?

If psql then its behavior has changed between 8.3 and 10:

https://www.postgresql.org/docs/10/static/release-9-1.html#id-1.11.6.121.3

"

Have psql set the client encoding from the operating system locale by 
default (Heikki Linnakangas)


This only happens if the PGCLIENTENCODING environment variable is not set.
"

https://www.postgresql.org/docs/10/static/app-psql.html

"If both standard input and standard output are a terminal, then psql 
sets the client encoding to “auto”, which will detect the appropriate 
client encoding from the locale settings (LC_CTYPE environment variable 
on Unix systems). If this doesn't work out as expected, the client 
encoding can be overridden using the environment variable PGCLIENTENCODING."





The other thing I noticed on the 10 instance was that, while the LOCALE 
was set to SQL_ASCII, the COLLATE and CTYPE values for the restored 
databases were en_US.UTF-8. Could this be having an affect? Is there any 
way to see what these values were on the old 8.3 database? The 
pg_database catalog does not have these values stored back then.


--
Keith Fiske
Senior Database Engineer
Crunchy Data - http://crunchydata.com



--
Adrian Klaver
adrian.kla...@aklaver.com