Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2015-01-05 Thread Jonathan Vanasco
On Dec 29, 2014, at 5:36 PM, Mike Cardwell wrote: So the system I've settled with is storing both the originally supplied representation, *and* the lower cased punycode encoded version in a separate column for indexing/search. This seems really hackish to me though. I actually do the same

[GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell
I'd like to store hostnames in a postgres database and I want to fully support IDNs (Internationalised Domain Names) I want to be able to recover the original representation of the hostname, so I can't just encode it with punycode and then store the ascii result. For example, these two are the

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Andy Colson
On 12/29/2014 4:36 PM, Mike Cardwell wrote: I'd like to store hostnames in a postgres database and I want to fully support IDNs (Internationalised Domain Names) I want to be able to recover the original representation of the hostname, so I can't just encode it with punycode and then store the

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread David G Johnston
Andy Colson wrote On 12/29/2014 4:36 PM, Mike Cardwell wrote: I'd like to store hostnames in a postgres database and I want to fully support IDNs (Internationalised Domain Names) I want to be able to recover the original representation of the hostname, so I can't just encode it with

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell
* on the Mon, Dec 29, 2014 at 04:23:22PM -0700, David G Johnston wrote: WHERE hostname='nißan.com' _IF_ Postgres had a punycode function, then you could use: WHERE punycode(hostname) = punycode('nißan.com') If the OP wraps what he is doing up into a function that is what you end up

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Andrew Sullivan
On Mon, Dec 29, 2014 at 11:50:54PM +, Mike Cardwell wrote: CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname))); This wouldn't work to get the original back if oyu have any IDNA2003 data, because puncode-encoding the UTF-8 under IDNA2003 and then

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread David G Johnston
On Mon, Dec 29, 2014 at 4:51 PM, Mike Cardwell [via PostgreSQL] ml-node+s1045698n583236...@n5.nabble.com wrote: * on the Mon, Dec 29, 2014 at 04:23:22PM -0700, David G Johnston wrote: WHERE hostname='nißan.com http://nissan.com' _IF_ Postgres had a punycode function, then you could

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Tom Lane
David G Johnston david.g.johns...@gmail.com writes: It was also mentioned that using the Perl encoding function was non-performant; which is why caching the data into a memoization table has value. I find it hard to believe that the standards folk would have chosen a hostname encoding method

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell
* on the Mon, Dec 29, 2014 at 07:00:05PM -0500, Andrew Sullivan wrote: CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname))); This wouldn't work to get the original back if oyu have any IDNA200 data, because puncode-encoding the UTF-8 under IDNA2003 and

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Andrew Sullivan
[resending, because somehow this got routed through my work address the first time.] Hi, I didn't have time to write a short note, so I wrote a long one instead. Sorry. On Mon, Dec 29, 2014 at 10:36:42PM +, Mike Cardwell wrote: can't just encode it with punycode and then store the ascii

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Andrew Sullivan
On Tue, Dec 30, 2014 at 12:18:58AM +, Mike Cardwell wrote: This is exactly the same method that we commonly use for performing case insensitive text searches using lower() indexes. Hmm. How did you get the original, then? If you have the original Unicode version, why don't you switch to

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell
* on the Mon, Dec 29, 2014 at 05:01:44PM -0700, David G Johnston wrote: CREATE UNIQUE INDEX hostnames_hostname_key ON hostnames (lower(punycode_encode(hostname))); That would prevent adding more than one representation for the same hostname to the column. Except two different hostname

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell
* on the Mon, Dec 29, 2014 at 07:22:21PM -0500, Andrew Sullivan wrote: can't just encode it with punycode and then store the ascii result. For example, these two are the same hostnames thanks to unicode case folding [1]: tesst.ëxämplé.com teßt.ëxämplé.com Well, in IDNA2003 they're

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Mike Cardwell
* on the Mon, Dec 29, 2014 at 07:25:59PM -0500, Andrew Sullivan wrote: This is exactly the same method that we commonly use for performing case insensitive text searches using lower() indexes. Hmm. How did you get the original, then? The original in my case, is the hostname which the end

Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding

2014-12-29 Thread Andrew Sullivan
On Tue, Dec 30, 2014 at 12:53:42AM +, Mike Cardwell wrote: Hmm. How did you get the original, then? The original in my case, is the hostname which the end user supplied. Essentially, when I display it back to them, I want to make sure it is displayed the same way that it was when they