Re: [HACKERS] unaccent module - two params function should be immutable

2013-11-18 Thread Bruce Momjian
On Fri, Nov  8, 2013 at 06:00:53PM -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > [ mark unaccent functions immutable ]
> 
> > Applied.
> 
> This patch is flat out wrong and needs to be reverted.
> 
> The functions were correctly marked (by you!) in commit
> c0577c92a84cc477a88fe6868c16c4a7e3348b11 on the basis of the discussion of
> bug #5781,
> http://www.postgresql.org/message-id/201012021544.ob2fitn1041...@wwwmaster.postgresql.org
> which was a request exactly like this one and was denied for good and
> sufficient reasons.  There was absolutely no reasoning given in this
> thread that explained why we should ignore the previous objections.
> 
> In particular, marking the single-argument version of unaccent() as
> immutable is the height of folly because its behavior depends on the
> setting of search_path.  Changing the two-argument function is maybe
> a bit more debatable, but that's not what you did.
> 
> If we were going to change the behavior, this patch would still be wrong
> because it fails to provide an upgrade path.  The objections saying you
> needed a 1.1 migration script were completely correct.

Thanks, patch reverted.  If people still want this, it needs to be
resbumitted with this feedback in mind.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent module - two params function should be immutable

2013-11-08 Thread Tom Lane
Bruce Momjian  writes:
> [ mark unaccent functions immutable ]

> Applied.

This patch is flat out wrong and needs to be reverted.

The functions were correctly marked (by you!) in commit
c0577c92a84cc477a88fe6868c16c4a7e3348b11 on the basis of the discussion of
bug #5781,
http://www.postgresql.org/message-id/201012021544.ob2fitn1041...@wwwmaster.postgresql.org
which was a request exactly like this one and was denied for good and
sufficient reasons.  There was absolutely no reasoning given in this
thread that explained why we should ignore the previous objections.

In particular, marking the single-argument version of unaccent() as
immutable is the height of folly because its behavior depends on the
setting of search_path.  Changing the two-argument function is maybe
a bit more debatable, but that's not what you did.

If we were going to change the behavior, this patch would still be wrong
because it fails to provide an upgrade path.  The objections saying you
needed a 1.1 migration script were completely correct.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent module - two params function should be immutable

2013-10-08 Thread Alvaro Herrera
Bruce Momjian escribió:
> On Tue, Oct  8, 2013 at 02:25:25PM -0300, Alvaro Herrera wrote:
> > Bruce Momjian escribió:
> > 
> > > Do we need to update any version or anything?  I didn't think so.
> > 
> > I think there should be an 1.1 version here.  That way, if somebody is
> > using the existing definition from the 1.0 module, they can get the new
> > definition by doing an extension upgrade.
> 
> Uh, how would they get this new version?  By compiling 9.4 and
> installing it in 9.3?

Oh, is this only in 9.4?  Then there's no point.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent module - two params function should be immutable

2013-10-08 Thread Bruce Momjian
On Tue, Oct  8, 2013 at 02:25:25PM -0300, Alvaro Herrera wrote:
> Bruce Momjian escribió:
> 
> > Do we need to update any version or anything?  I didn't think so.
> 
> I think there should be an 1.1 version here.  That way, if somebody is
> using the existing definition from the 1.0 module, they can get the new
> definition by doing an extension upgrade.

Uh, how would they get this new version?  By compiling 9.4 and
installing it in 9.3?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent module - two params function should be immutable

2013-10-08 Thread Alvaro Herrera
Bruce Momjian escribió:

> Do we need to update any version or anything?  I didn't think so.

I think there should be an 1.1 version here.  That way, if somebody is
using the existing definition from the 1.0 module, they can get the new
definition by doing an extension upgrade.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent module - two params function should be immutable

2013-10-08 Thread Pavel Stehule
2013/10/8 Bruce Momjian 

> On Tue, Oct  8, 2013 at 06:38:30PM +0200, Pavel Stehule wrote:
> > I am not sure - does pg_upgrade change of flag after upgrade without
> increasing
> > version number?
>
> What happens in pg_upgrade is that the CREATE EXTENSION command is
> pg_dump'ed, and run by pg_uprade, and it then pulls from the SQL file to
> create the new function signature.
>

ok, then it is ok


>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
>   + Everyone has their own god. +
>


Re: [HACKERS] unaccent module - two params function should be immutable

2013-10-08 Thread Bruce Momjian
On Tue, Oct  8, 2013 at 06:38:30PM +0200, Pavel Stehule wrote:
> I am not sure - does pg_upgrade change of flag after upgrade without 
> increasing
> version number?

What happens in pg_upgrade is that the CREATE EXTENSION command is
pg_dump'ed, and run by pg_uprade, and it then pulls from the SQL file to
create the new function signature.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent module - two params function should be immutable

2013-10-08 Thread Pavel Stehule
2013/10/8 Bruce Momjian 

> On Tue, Oct  8, 2013 at 06:31:03PM +0200, Pavel Stehule wrote:
> >
> >
> >
> > 2013/10/8 Bruce Momjian 
> >
> > On Tue, Sep 24, 2013 at 05:36:58PM -0400, Bruce Momjian wrote:
> > > On Tue, Sep 17, 2013 at 10:15:47AM -0400, Robert Haas wrote:
> > > > On Sat, Sep 14, 2013 at 9:42 AM, Pavel Stehule <
> pavel.steh...@gmail.com
> > > wrote:
> > > > >> I have developed the attached patch based on your suggestion.
>  I did
> > not
> > > > >> see anything in the code that would make it STABLE, except a
> lookup
> > of a
> > > > >> dictionary library:
> > > > >>
> > > > >> dictOid = get_ts_dict_oid(stringToQualifiedNameList
> > ("unaccent"),
> > > > >> false);
> > > > >
> > > > > yes, it risk, but similar is with timezones, and other
> external data.
> > > >
> > > > That's a catalog lookup - not a reliance on external data.
> > >
> > > Sorry, I was wrong.  Only unaccent_dict() calls get_ts_dict_oid(),
> and
> > > we aren't changing the signature of that function.
> >
> > Applied.
> >
> >
> > nice
> >
> > thank you
>
> Do we need to update any version or anything?  I didn't think so.
>

I am not sure - does pg_upgrade change of flag after upgrade without
increasing version number?

Regards

Pavel


>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
>   + Everyone has their own god. +
>


Re: [HACKERS] unaccent module - two params function should be immutable

2013-10-08 Thread Bruce Momjian
On Tue, Oct  8, 2013 at 06:31:03PM +0200, Pavel Stehule wrote:
> 
> 
> 
> 2013/10/8 Bruce Momjian 
> 
> On Tue, Sep 24, 2013 at 05:36:58PM -0400, Bruce Momjian wrote:
> > On Tue, Sep 17, 2013 at 10:15:47AM -0400, Robert Haas wrote:
> > > On Sat, Sep 14, 2013 at 9:42 AM, Pavel Stehule 
>  > wrote:
> > > >> I have developed the attached patch based on your suggestion.  I 
> did
> not
> > > >> see anything in the code that would make it STABLE, except a lookup
> of a
> > > >> dictionary library:
> > > >>
> > > >>         dictOid = get_ts_dict_oid(stringToQualifiedNameList
> ("unaccent"),
> > > >> false);
> > > >
> > > > yes, it risk, but similar is with timezones, and other external 
> data.
> > >
> > > That's a catalog lookup - not a reliance on external data.
> >
> > Sorry, I was wrong.  Only unaccent_dict() calls get_ts_dict_oid(), and
> > we aren't changing the signature of that function.
> 
> Applied.
> 
> 
> nice
> 
> thank you

Do we need to update any version or anything?  I didn't think so.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent module - two params function should be immutable

2013-10-08 Thread Pavel Stehule
2013/10/8 Bruce Momjian 

> On Tue, Sep 24, 2013 at 05:36:58PM -0400, Bruce Momjian wrote:
> > On Tue, Sep 17, 2013 at 10:15:47AM -0400, Robert Haas wrote:
> > > On Sat, Sep 14, 2013 at 9:42 AM, Pavel Stehule <
> pavel.steh...@gmail.com> wrote:
> > > >> I have developed the attached patch based on your suggestion.  I
> did not
> > > >> see anything in the code that would make it STABLE, except a lookup
> of a
> > > >> dictionary library:
> > > >>
> > > >> dictOid =
> get_ts_dict_oid(stringToQualifiedNameList("unaccent"),
> > > >> false);
> > > >
> > > > yes, it risk, but similar is with timezones, and other external data.
> > >
> > > That's a catalog lookup - not a reliance on external data.
> >
> > Sorry, I was wrong.  Only unaccent_dict() calls get_ts_dict_oid(), and
> > we aren't changing the signature of that function.
>
> Applied.
>

nice

thank you

Regards

Pavel Stehule


>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
>   + It's impossible for everything to be true. +
>


Re: [HACKERS] unaccent module - two params function should be immutable

2013-10-08 Thread Bruce Momjian
On Tue, Sep 24, 2013 at 05:36:58PM -0400, Bruce Momjian wrote:
> On Tue, Sep 17, 2013 at 10:15:47AM -0400, Robert Haas wrote:
> > On Sat, Sep 14, 2013 at 9:42 AM, Pavel Stehule  
> > wrote:
> > >> I have developed the attached patch based on your suggestion.  I did not
> > >> see anything in the code that would make it STABLE, except a lookup of a
> > >> dictionary library:
> > >>
> > >> dictOid = get_ts_dict_oid(stringToQualifiedNameList("unaccent"),
> > >> false);
> > >
> > > yes, it risk, but similar is with timezones, and other external data.
> > 
> > That's a catalog lookup - not a reliance on external data.
> 
> Sorry, I was wrong.  Only unaccent_dict() calls get_ts_dict_oid(), and
> we aren't changing the signature of that function.

Applied.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent module - two params function should be immutable

2013-09-24 Thread Bruce Momjian
On Tue, Sep 17, 2013 at 10:15:47AM -0400, Robert Haas wrote:
> On Sat, Sep 14, 2013 at 9:42 AM, Pavel Stehule  
> wrote:
> >> I have developed the attached patch based on your suggestion.  I did not
> >> see anything in the code that would make it STABLE, except a lookup of a
> >> dictionary library:
> >>
> >> dictOid = get_ts_dict_oid(stringToQualifiedNameList("unaccent"),
> >> false);
> >
> > yes, it risk, but similar is with timezones, and other external data.
> 
> That's a catalog lookup - not a reliance on external data.

Sorry, I was wrong.  Only unaccent_dict() calls get_ts_dict_oid(), and
we aren't changing the signature of that function.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent module - two params function should be immutable

2013-09-17 Thread Robert Haas
On Sat, Sep 14, 2013 at 9:42 AM, Pavel Stehule  wrote:
>> I have developed the attached patch based on your suggestion.  I did not
>> see anything in the code that would make it STABLE, except a lookup of a
>> dictionary library:
>>
>> dictOid = get_ts_dict_oid(stringToQualifiedNameList("unaccent"),
>> false);
>
> yes, it risk, but similar is with timezones, and other external data.

That's a catalog lookup - not a reliance on external data.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent module - two params function should be immutable

2013-09-14 Thread Pavel Stehule
2013/9/11 Bruce Momjian 

> On Tue, Feb 19, 2013 at 08:30:29AM +0100, Pavel Stehule wrote:
> > Hello
> >
> > There was a proposal to change flag of function to immutable - should
> > be used in indexes
> >
> > CREATE FUNCTION unaccent(regdictionary, text)
> > RETURNS text
> > AS 'MODULE_PATHNAME', 'unaccent_dict'
> > LANGUAGE C STABLE STRICT;
> >
> >
> > is there any progress?
>
> I have developed the attached patch based on your suggestion.  I did not
> see anything in the code that would make it STABLE, except a lookup of a
> dictionary library:
>
> dictOid = get_ts_dict_oid(stringToQualifiedNameList("unaccent"),
> false);
>

yes, it risk, but similar is with timezones, and other external data.

Regards

Pavel


>
> --
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
>
>   + It's impossible for everything to be true. +
>


Re: [HACKERS] unaccent module - two params function should be immutable

2013-09-10 Thread Bruce Momjian
On Tue, Feb 19, 2013 at 08:30:29AM +0100, Pavel Stehule wrote:
> Hello
> 
> There was a proposal to change flag of function to immutable - should
> be used in indexes
> 
> CREATE FUNCTION unaccent(regdictionary, text)
> RETURNS text
> AS 'MODULE_PATHNAME', 'unaccent_dict'
> LANGUAGE C STABLE STRICT;
> 
> 
> is there any progress?

I have developed the attached patch based on your suggestion.  I did not
see anything in the code that would make it STABLE, except a lookup of a
dictionary library:

dictOid = get_ts_dict_oid(stringToQualifiedNameList("unaccent"), false);

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +
diff --git a/contrib/unaccent/unaccent--1.0.sql b/contrib/unaccent/unaccent--1.0.sql
new file mode 100644
index 9085ca4..072b749
*** a/contrib/unaccent/unaccent--1.0.sql
--- b/contrib/unaccent/unaccent--1.0.sql
***
*** 6,17 
  CREATE FUNCTION unaccent(regdictionary, text)
  	RETURNS text
  	AS 'MODULE_PATHNAME', 'unaccent_dict'
! 	LANGUAGE C STABLE STRICT;
  
  CREATE FUNCTION unaccent(text)
  	RETURNS text
  	AS 'MODULE_PATHNAME', 'unaccent_dict'
! 	LANGUAGE C STABLE STRICT;
  
  CREATE FUNCTION unaccent_init(internal)
  	RETURNS internal
--- 6,17 
  CREATE FUNCTION unaccent(regdictionary, text)
  	RETURNS text
  	AS 'MODULE_PATHNAME', 'unaccent_dict'
! 	LANGUAGE C IMMUTABLE STRICT;
  
  CREATE FUNCTION unaccent(text)
  	RETURNS text
  	AS 'MODULE_PATHNAME', 'unaccent_dict'
! 	LANGUAGE C IMMUTABLE STRICT;
  
  CREATE FUNCTION unaccent_init(internal)
  	RETURNS internal
diff --git a/contrib/unaccent/unaccent--unpackaged--1.0.sql b/contrib/unaccent/unaccent--unpackaged--1.0.sql
new file mode 100644
index abd0698..5bff74e
*** a/contrib/unaccent/unaccent--unpackaged--1.0.sql
--- b/contrib/unaccent/unaccent--unpackaged--1.0.sql
*** ALTER EXTENSION unaccent ADD function un
*** 10,16 
  ALTER EXTENSION unaccent ADD text search template unaccent;
  ALTER EXTENSION unaccent ADD text search dictionary unaccent;
  
! -- These functions are marked as stable in 9.1, were not before:
  
! ALTER FUNCTION unaccent(regdictionary, text) STABLE;
! ALTER FUNCTION unaccent(text) STABLE;
--- 10,16 
  ALTER EXTENSION unaccent ADD text search template unaccent;
  ALTER EXTENSION unaccent ADD text search dictionary unaccent;
  
! -- These functions were marked as stable in 9.1; they were now marked as immutable
  
! ALTER FUNCTION unaccent(regdictionary, text) IMMUTABLE;
! ALTER FUNCTION unaccent(text) IMMUTABLE;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unaccent performance

2013-06-22 Thread Andres Freund
On 2013-06-21 22:52:04 +0100, Thom Brown wrote:
> > CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
> >  RETURNS text
> >  LANGUAGE sql
> >  IMMUTABLE
> > AS $function$
> > SELECT
> > replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
> > ;
> > $function$
> >

> Another test passing in a string of 10 characters gives the following
> timings:
> 
> unaccent: 240619.395 ms
> myunaccent: 785.505 ms

The reason for that is that unaccent is 'stable' while your function is
'immutable', so the planner recognizes that and computes it only once
since you're always passing the same text string to it.

> Another test inserting long text strings into a text column of a table
> 100,000 times, then updating another column to have that unaccented value
> using both methods:
> 
> unaccent: 3867.306 ms
> myunaccent: 43611.732 ms

Whereas it cannot recognize that in this case.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unaccent performance

2013-06-21 Thread Thom Brown
On 21 June 2013 19:04, Thom Brown  wrote:

> Hi,
>
> The unaccent extension is great, especially with its customisability, but
> it's not always easy to recommend.  I witnessed a customer using no less
> than 56 nested replace functions in an SQL function.  I looked to see how
> much this can be mitigated by unaccent.  It turns out that not all the
> characters they were replacing can be replaced by unaccent, either because
> they replace more than 1 character at a time, or the character they're
> replacing, for some reason, isn't processed by unaccent, even with a custom
> rules file.
>
> So there were 20 characters I could identify that they were replacing.  I
> made a custom rules file and compared its performance to the
> difficult-to-manage set of nested replace calls.
>
> CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
>  RETURNS text
>  LANGUAGE sql
>  IMMUTABLE
> AS $function$
> SELECT
> replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
> ;
> $function$
>
> postgres=# SELECT myunaccent(sometext::text) FROM (SELECT
> 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,100)) x OFFSET
> 99 LIMIT 1;
>   myunaccent
> --
>  AAAaaaAaAaAa
> (1 row)
>
> Time: 726.282 ms
> postgres=# SELECT unaccent(sometext::text) FROM (SELECT
> 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,100)) x OFFSET
> 99 LIMIT 1;
>unaccent
> --
>  AAAaaaAaAaAa
> (1 row)
>
> Time: 3305.252 ms
>
> The timings are actually pretty much the same even if I introduce 187
> nested replace calls for every line in the unaccent.rules file for 187
> characters.  But the same character set with unaccent increases to 7418.526
> ms with the same type of query as above.  That's 10 times more expensive.
>
> Is there a way to boost the performance to make its adoption more
> palatable?
>

Another test passing in a string of 10 characters gives the following
timings:

unaccent: 240619.395 ms
myunaccent: 785.505 ms

I guess this must indicate that unaccent is processing all rows, and
myunaccent is only being run on the 1 select row?  I can't account for
myunaccent always being almost the same duration regardless of string
length otherwise.  This is probably an incorrect assessment of performance.

Another test inserting long text strings into a text column of a table
100,000 times, then updating another column to have that unaccented value
using both methods:

unaccent: 3867.306 ms
myunaccent: 43611.732 ms

So I guess this complaint about performance is all just noise.

However, pushing that pointless complaint to one side, I would like to have
the ability to have unaccent support more characters that it doesn't
currently seem to support, such as bullet points, ellipses etc., and also
more than 1 character being replaced.  Naturally these aren't appropriate
to fall under the unaccent function itself, but the rules file is good
starting point.  It would be a bit like translate, except it would use a
rules file instead of providing strings of single characters to convert.

So say we wanted "(trademark)" to be converted into "™" just as an example,
or ";" to ".".  We can't do that with unaccent, but in order to avoid a
huge list of replace functions, a function like unaccent, with a few
adaptations, would solve the problem.

e.g.:

SELECT transform(my_custom_dictionary, 'Commodore Amiga(trademark);')

would return

Commodore Amiga™.

This would ideally somehow cater for replacing tabs and spaces too.

-- 
Thom


Re: [HACKERS] unaccent extension missing some accents

2011-11-10 Thread Tom Lane
Bruce Momjian  writes:
> Tom Lane wrote:
>> However, the bigger picture is that OS X's UTF8 locales are broken
>> through-and-through, and most of their other problems are not feasible
>> to work around.

> If Apple's low-level code came from FreeBSD and NetBSD, how did they get
> so broken?

AFAIK, they're broken in the BSDen too, or at least were when Apple
branched off from whichever BSD they started from (which was years ago).
There may be a better solution available upstream by now, but it doesn't
appear to me that Apple has any interest in fixing this area.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-10 Thread Bruce Momjian
Tom Lane wrote:
> J Smith  writes:
> > I've attached a patch against master for unaccent.c that uses swscanf
> > along with char2wchar and wchar2char instead of sscanf directly to
> > initialize the unaccent extension and it appears to fix the problem in
> > both the master and 9.1 branches.
> 
> swscanf doesn't seem like an acceptable approach: it's a function that
> is relied on nowhere else in PG, so it adds new portability risks of its
> own.  It doesn't exist on some platforms that we support (like the one
> I'm typing this message on) and there's no real good reason to assume
> that it's not broken in its own ways on others.
> 
> If you really want to pursue this, I'd suggest parsing the line
> manually, perhaps via strchr searches for \t and \n.  It likely wouldn't
> be very many more lines than what you've got here.
> 
> However, the bigger picture is that OS X's UTF8 locales are broken
> through-and-through, and most of their other problems are not feasible
> to work around.  So basically you can't use them for anything
> interesting, and it's not clear that it's worth putting any time into
> solving individual problems.  In the particular case here, the issue
> presumably is that sscanf is relying on isspace() ... but we rely on
> isspace() directly, in quite a lot of places, so how much is it going
> to fix to dodge it right here?

If Apple's low-level code came from FreeBSD and NetBSD, how did they get
so broken?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread J Smith
On Mon, Nov 7, 2011 at 11:53 AM, Florian Pflug  wrote:
>
> Various issues with OSX and UTF-8 locales seems to come up quite often, yet
> we're not really in a position to do anything about them.
>
> Thus, I think we should warn about these issues and save people the trouble
> of finding out about this the hard way. The only question is, what would be
> a good place for such a warning?
>

Hmm, I suppose one place could be on initdb if initializing with a
UTF-8 locale, although that only really helps with fixed settings like
LC_COLLATE and LC_CTYPE and the defaults for the user-configurable
settings, right? For the configurable settings I guess logging as
warnings on server start up or when they're changed via SET. But then
there's all of the other places, like when using COLLATE and using
locale-aware functions and so on and so forth. It would be a lot to
cover, I'll bet.

I guess maybe initdb, start up and SET warnings and a note in the docs
would hopefully suffice?

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread Tom Lane
J Smith  writes:
> Would it even really be worth it to look into any of the other locale
> issues on OSX, given that PostgreSQL is now included in their default
> installs starting with 10.7, or would this really be more of a case of
> hoping Apple some day fixes the issue upstream?

To my mind, the killer issue is the incorrect sorting --- there's
basically no way we can work around that.  If Apple were to fix that,
we might be able to nibble at the margins of the other stuff.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread J Smith
On Mon, Nov 7, 2011 at 11:59 AM, Tom Lane  wrote:
>
> If you have time to check that the patch I just committed fixes your
> problem, it'd be worth doing.  I did not test it on OS X ...

Looks good to me, thanks.

Would it even really be worth it to look into any of the other locale
issues on OSX, given that PostgreSQL is now included in their default
installs starting with 10.7, or would this really be more of a case of
hoping Apple some day fixes the issue upstream? It doesn't seem like
that's going to happen any time soon, mind you, as apparently this is
a rather long-standing issue in their libc, but who knows. I know OSX
isn't exactly the most popular database server OS out there, but it
seems to be becoming more popular for developers these days at the
very least.

Anyways, cheers, and thanks again.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread Tom Lane
J Smith  writes:
> Anyways, lemme know if there's anything else I could help with or
> could test and whatnot. Cheers.

If you have time to check that the patch I just committed fixes your
problem, it'd be worth doing.  I did not test it on OS X ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread J Smith
On Mon, Nov 7, 2011 at 11:12 AM, Tom Lane  wrote:
> I looked at this a bit and realized that sscanf is actually doing a
> couple of critical things for us, which are lost in translation when
> doing it like this:
>
> 1. It ignores whitespace other than the dividing tab.  If we don't
> continue to do that, we'll likely break existing config files.
>
> 2. It ensures that src and trg each consist of at least one (nonblank)
> character.  placeChar() is critically dependent on the assumption that
> src is not empty.
>
> However, after looking around a bit at the other tsearch config-file-
> reading functions, I noted that they all use t_isspace() to identify
> whitespace ... and that function in fact should be okay on OS X,
> because it uses iswspace in multibyte encodings.
>
> So it's fairly simple to improve this code to reject whitespace that
> way.  I don't like the existing code anyway because of its potential
> vulnerability to buffer overrun.  I'll fix it up and commit.
>
>> As for the other problems with isspace and such on OSX, it might be
>> worth looking at the python portability fixes.
>
> If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
> work), I might be tempted to try to do it that way, but I still fail
> to see the point.  After reviewing the code I feel that unaccent needs
> to be fixed because it's not consistent with the other tsearch config
> file parsers, and not so much because it works or doesn't work on any
> specific platform.
>

Yeah, I never knew there was such a problem with OSX and UTF8 before
running into it here but it's good to know. When I noticed the
unnaccent extension in more recent PostgreSQL versions, I figured it
would perform better than our current plperl-based accent stripping
function (which it surely does) and just noticed the results on my
machine were a little off, but our linux-based servers were fine and
dandy and yadda yadda yadda.

Anyways, lemme know if there's anything else I could help with or
could test and whatnot. Cheers.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread Florian Pflug
On Nov7, 2011, at 17:46 , J Smith wrote:
> On Mon, Nov 7, 2011 at 11:12 AM, Tom Lane  wrote:
>> If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
>> work), I might be tempted to try to do it that way, but I still fail
>> to see the point.  After reviewing the code I feel that unaccent needs
>> to be fixed because it's not consistent with the other tsearch config
>> file parsers, and not so much because it works or doesn't work on any
>> specific platform.
> 
> Yeah, I never knew there was such a problem with OSX and UTF8 before
> running into it here but it's good to know.

Various issues with OSX and UTF-8 locales seems to come up quite often, yet
we're not really in a position to do anything about them.

Thus, I think we should warn about these issues and save people the trouble
of finding out about this the hard way. The only question is, what would be
a good place for such a warning?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-07 Thread Tom Lane
J Smith  writes:
> Alright, I wrote up another patch that uses strchr to parse out the
> lines of the unaccent.rules file, foregoing sscanf completely.
> Hopefully this looks a bit better than using swscanf.

I looked at this a bit and realized that sscanf is actually doing a
couple of critical things for us, which are lost in translation when
doing it like this:

1. It ignores whitespace other than the dividing tab.  If we don't
continue to do that, we'll likely break existing config files.

2. It ensures that src and trg each consist of at least one (nonblank)
character.  placeChar() is critically dependent on the assumption that
src is not empty.

However, after looking around a bit at the other tsearch config-file-
reading functions, I noted that they all use t_isspace() to identify
whitespace ... and that function in fact should be okay on OS X,
because it uses iswspace in multibyte encodings.

So it's fairly simple to improve this code to reject whitespace that
way.  I don't like the existing code anyway because of its potential
vulnerability to buffer overrun.  I'll fix it up and commit.

> As for the other problems with isspace and such on OSX, it might be
> worth looking at the python portability fixes.

If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
work), I might be tempted to try to do it that way, but I still fail
to see the point.  After reviewing the code I feel that unaccent needs
to be fixed because it's not consistent with the other tsearch config
file parsers, and not so much because it works or doesn't work on any
specific platform.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
Alright, I wrote up another patch that uses strchr to parse out the
lines of the unaccent.rules file, foregoing sscanf completely.
Hopefully this looks a bit better than using swscanf.

As for the other problems with isspace and such on OSX, it might be
worth looking at the python portability fixes. I played briefly with
the isspace and friends macros they have and they looked okay, but I
certainly can't speak for how well they'd work for the rest of the
PostgreSQL code base.

Cheers.


0001-Fix-weirdness-when-dealing-with-UTF-8-in-buggy-libc-.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
On 2011-11-06, at 7:15 PM, Tom Lane  wrote:
>
> swscanf doesn't seem like an acceptable approach: it's a function that
> is relied on nowhere else in PG, so it adds new portability risks of its
> own.  It doesn't exist on some platforms that we support (like the one
> I'm typing this message on) and there's no real good reason to assume
> that it's not broken in its own ways on others.
>
> If you really want to pursue this, I'd suggest parsing the line
> manually, perhaps via strchr searches for \t and \n.  It likely wouldn't
> be very many more lines than what you've got here.
>
> However, the bigger picture is that OS X's UTF8 locales are broken
> through-and-through, and most of their other problems are not feasible
> to work around.  So basically you can't use them for anything
> interesting, and it's not clear that it's worth putting any time into
> solving individual problems.  In the particular case here, the issue
> presumably is that sscanf is relying on isspace() ... but we rely on
> isspace() directly, in quite a lot of places, so how much is it going
> to fix to dodge it right here?
>
>regards, tom lane

There are some fixes for isspace and friend that I've seen python
using so perhaps in those cases a similar fix could be applied. For
instance, maybe something like the code around line 674 here:

http://svn.python.org/view/python/trunk/Include/pyport.h?revision=81029&view=markup

Perhaps that would be suitable on OSX at least in the case of isspace
et al. As far as I can tell scanf doesn't seem to use isspace on my
system so that would only be a partial fix for this an whatever other
situations isspace is used in. (on a mobile now so I can't check a the
moment.)

This isn't really a huge deal for me but I'll try to get a chance to
write up a little parser anyways just for kicks.

Cheers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread Tom Lane
J Smith  writes:
> I've attached a patch against master for unaccent.c that uses swscanf
> along with char2wchar and wchar2char instead of sscanf directly to
> initialize the unaccent extension and it appears to fix the problem in
> both the master and 9.1 branches.

swscanf doesn't seem like an acceptable approach: it's a function that
is relied on nowhere else in PG, so it adds new portability risks of its
own.  It doesn't exist on some platforms that we support (like the one
I'm typing this message on) and there's no real good reason to assume
that it's not broken in its own ways on others.

If you really want to pursue this, I'd suggest parsing the line
manually, perhaps via strchr searches for \t and \n.  It likely wouldn't
be very many more lines than what you've got here.

However, the bigger picture is that OS X's UTF8 locales are broken
through-and-through, and most of their other problems are not feasible
to work around.  So basically you can't use them for anything
interesting, and it's not clear that it's worth putting any time into
solving individual problems.  In the particular case here, the issue
presumably is that sscanf is relying on isspace() ... but we rely on
isspace() directly, in quite a lot of places, so how much is it going
to fix to dodge it right here?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
On Sun, Nov 6, 2011 at 1:18 PM, Florian Pflug  wrote:
>
> What's the locale of the database you're seeing this in, and which charset
> does it use?
>
> I think scanf() uses isspace() and friends, and last time I looked the
> locale definitions where all pretty bogus on OSX. So maybe scanf() somehow
> decides that 0xA0 is whitespace.
>

Ah, that does it: the locale I was using in the test code was just
plain ol' C locale, while in the database it was en_CA.UTF-8. Changing
the locale in my test code produced the wonky results. I should have
figured it was a locale problem. Sure enough, in a UTF-8 locale, it
believes that both 0xa0 and 0x85 are spaces. Pretty wonky behaviour
indeed.

Apparently this is a known OSX issue that has its roots in and older
version of FreeBSD's libc I guess, eh? I've found various bug reports
that allude to the problem and they all seem to point that way.

I've attached a patch against master for unaccent.c that uses swscanf
along with char2wchar and wchar2char instead of sscanf directly to
initialize the unaccent extension and it appears to fix the problem in
both the master and 9.1 branches.

I haven't added any tests in the expected output file 'cause I'm not
exactly sure what I should be testing against, but I could take a
crack at that, too, if the patch looks reasonable and is usable.

Cheers.


0001-Fix-weirdness-when-dealing-with-UTF-8-in-buggy-libc-.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread Florian Pflug
On Nov6, 2011, at 18:43 , J Smith wrote:
> I put some elog debugging lines into unaccent.c and found that sscanf
> sometimes reads the scanned line by finding only one byte for the for
> the source character rather than the two required for the complete
> UTF-8 code point. It appears that the following characters are causing
> the problem, along with the code points and such:
> 
> 'Å' => 'A' | c3,85 => 41
> 'à' => 'a' | c3,a0 => 61
> 'ą' => 'a' | c4,85 => 61
> 'Ġ' => 'G' | c4,a0 => 47
> 'Ņ' => 'N' | c5,85 => 4e
> 'Š' => 'S' | c5,a0 => 53
> 
> In each case, one byte was being read in the source string rather than
> two, leading to the "duplicate TO" warnings above. This later leads to
> the characters that produced the warning being ignored when unaccent
> is called and left in the output.

What's the locale of the database you're seeing this in, and which charset
does it use?

I think scanf() uses isspace() and friends, and last time I looked the
locale definitions where all pretty bogus on OSX. So maybe scanf() somehow
decides that 0xA0 is whitespace.

> I haven't been able to reproduce in a smaller example, and haven't
> been able to reproduce on a CentOS server, so at this point I'm at a
> loss as to the problem.

Have you tried to set the same locale as postgres (using setlocale()) in
your tests?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent extension missing some accents

2011-11-06 Thread J Smith
Gah! Accidentally hit Send. Let me finish that last message before
sending this time!


G'day list.

I've been messing around with the unaccent extension and I've noticed
that some of the characters listed in the unaccent.rules file aren't
actually being unaccented on my system.

Here are the system details and whatnot.

- OSX 10.7.2

- the server is compiled via macports. Tried using both gcc and llvm
4.2.1 compilers that come with the latest version of XCode.

- the same symptoms show up in both 9.0.5 and 9.1.1. I've also tried
building manually from the latest REL9_1_STABLE branch from git to
make sure macports wasn't the problem, but I'm getting the same
results with both compilers.

When I first do a CREATE EXTENSION for unaccent, I'm seeing the
following warnings in the log file:

===
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 8 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"à      a
       "
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 57 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"Ġ      G
       "
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 144 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"Š      S
       "
===

I've dug around through the unaccent.c code a bit and I've noticed
that the sscanf it does when reading the file is producing some odd
output. I've tried with a minimal example using the same sort of
sscanf code reading from the same unaccent.rules file, but the minimal
example doesn't produce the same output.

I put some elog debugging lines into unaccent.c and found that sscanf
sometimes reads the scanned line by finding only one byte for the for
the source character rather than the two required for the complete
UTF-8 code point. It appears that the following characters are causing
the problem, along with the code points and such:

'Å' => 'A' | c3,85 => 41
'à' => 'a' | c3,a0 => 61
'ą' => 'a' | c4,85 => 61
'Ġ' => 'G' | c4,a0 => 47
'Ņ' => 'N' | c5,85 => 4e
'Š' => 'S' | c5,a0 => 53

In each case, one byte was being read in the source string rather than
two, leading to the "duplicate TO" warnings above. This later leads to
the characters that produced the warning being ignored when unaccent
is called and left in the output.

I haven't been able to reproduce in a smaller example, and haven't
been able to reproduce on a CentOS server, so at this point I'm at a
loss as to the problem.

Anybody got any ideas?

Cheers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent contrib

2011-09-26 Thread Oleg Bartunov

On Wed, 21 Sep 2011, Tom Lane wrote:


Euler Taveira de Oliveira  writes:

On 21-09-2011 13:28, Daniel VАzquez wrote:

"unaccent" is compatible with postgresql 8.4 (but not is in their contrib
version distribution)



No, it is not. AFAICS it is necessary to add some backend code that is not in 
8.4.


[ pokes at it ]  Yeah, you are right.  The version of unaccent that is
in our source tree is a filtering dictionary, and therefore cannot
possibly work with backends older than 9.0 (when the filtering
dictionary feature was added).

So I'm wondering where the OP read that it was compatible with 8.4.
Our own documentation about it certainly does not say that.  It's
possible that Oleg and Teodor had some prototype version, different
from what got committed to our tree, that would work in 8.4.


AFAIR, the last version without filtering feature is 
http://www.sigaev.ru/misc/unaccent-0.2.tar.gz


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent contrib

2011-09-22 Thread Robert Haas
2011/9/22 Daniel Vázquez :
> Before 9.x, how do unaccent full text searches ?

It seems that Oleg has published something on his web site that
supposedly works with 8.4:

http://www.sai.msu.su/~megera/wiki/unaccent

But I'm not really sure how it works, or even where the source code
is.  I would suggest that you Google "postgresql 8.4 unaccent" or
something like that and click through the results.

Also, this mailing list is for discussions of PostgreSQL development,
so I think this discussion is quite a bit off-topic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent contrib

2011-09-22 Thread Euler Taveira de Oliveira

On 22-09-2011 12:39, Daniel Vázquez wrote:

Before 9.x, how do unaccent full text searches ?

Perform pre-processing (normalization) of the string *before* inserting and 
*before* searching.



--
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent contrib

2011-09-22 Thread Daniel Vázquez
Before 9.x, how do unaccent full text searches ?
i

2011/9/21 Tom Lane 

> Euler Taveira de Oliveira  writes:
> > On 21-09-2011 13:28, Daniel Vázquez wrote:
> >> "unaccent" is compatible with postgresql 8.4 (but not is in their
> contrib
> >> version distribution)
>
> > No, it is not. AFAICS it is necessary to add some backend code that is
> not in 8.4.
>
> [ pokes at it ]  Yeah, you are right.  The version of unaccent that is
> in our source tree is a filtering dictionary, and therefore cannot
> possibly work with backends older than 9.0 (when the filtering
> dictionary feature was added).
>
> So I'm wondering where the OP read that it was compatible with 8.4.
> Our own documentation about it certainly does not say that.  It's
> possible that Oleg and Teodor had some prototype version, different
> from what got committed to our tree, that would work in 8.4.
>
>regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
Daniel Vázquez
SICONET (A Bull Group Company)
Torre Agbar. Avda. Diagonal, 211 - planta 23
08018 - Barcelona
telf: + 34 93 2272727 (Ext. 2952)
fax: + 34 93 2272728
www.bull.es - www.siconet.es
daniel.vazq...@bull.es


Re: [HACKERS] unaccent contrib

2011-09-21 Thread Tom Lane
Euler Taveira de Oliveira  writes:
> On 21-09-2011 13:28, Daniel Vázquez wrote:
>> "unaccent" is compatible with postgresql 8.4 (but not is in their contrib
>> version distribution)

> No, it is not. AFAICS it is necessary to add some backend code that is not in 
> 8.4.

[ pokes at it ]  Yeah, you are right.  The version of unaccent that is
in our source tree is a filtering dictionary, and therefore cannot
possibly work with backends older than 9.0 (when the filtering
dictionary feature was added).

So I'm wondering where the OP read that it was compatible with 8.4.
Our own documentation about it certainly does not say that.  It's
possible that Oleg and Teodor had some prototype version, different
from what got committed to our tree, that would work in 8.4.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent contrib

2011-09-21 Thread Devrim GÜNDÜZ
On Wed, 2011-09-21 at 18:28 +0200, Daniel Vázquez wrote:

> Can Global Development Group, make some acumulative rpm  for contrib
> modules that are backward compatible???

No (as the RPM maintainer).
-- 
Devrim GÜNDÜZ
Principal Systems Engineer @ EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org  Twitter: http://twitter.com/devrimgunduz


signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] unaccent contrib

2011-09-21 Thread Euler Taveira de Oliveira

On 21-09-2011 15:23, Daniel Vázquez wrote:

No alternatives for unaccent on 8.4?


Not that I know of.


--
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent contrib

2011-09-21 Thread Daniel Vázquez
... ok
No alternatives for unaccent on 8.4?

2011/9/21 Euler Taveira de Oliveira 

> On 21-09-2011 13:28, Daniel Vázquez wrote:
>
>> "unaccent" is compatible with postgresql 8.4 (but not is in their contrib
>> version distribution)
>>
>>  No, it is not. AFAICS it is necessary to add some backend code that is
> not in 8.4.
>
>
> --
>   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
>   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/**mailpref/pgsql-hackers
>



-- 
Daniel Vázquez
SICONET (A Bull Group Company)
Torre Agbar. Avda. Diagonal, 211 - planta 23
08018 - Barcelona
telf: + 34 93 2272727 (Ext. 2952)
fax: + 34 93 2272728
www.bull.es - www.siconet.es
daniel.vazq...@bull.es


Re: [HACKERS] unaccent contrib

2011-09-21 Thread Euler Taveira de Oliveira

On 21-09-2011 13:28, Daniel Vázquez wrote:

"unaccent" is compatible with postgresql 8.4 (but not is in their contrib
version distribution)


No, it is not. AFAICS it is necessary to add some backend code that is not in 
8.4.


--
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] unaccent

2002-09-19 Thread Peter Eisentraut

[EMAIL PROTECTED] writes:

> Not "to_ascii", since there are so many extended UNICODE characters that
> doesn't have any accent and should not be converted to an ASCII character.

Really, the accent conversion should be part of the character set
conversion routines.  At least my local iconv does that.

In general, the determination of what is an accent and how to convert it
is both dependent on locale and the intended usage.  It's not clear how
that should be handled.

-- 
Peter Eisentraut   [EMAIL PROTECTED]


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] unaccent

2002-09-18 Thread nngodinh

The txt2txtidx function works fine with unac. The problem is with the trigger:

create trigger txtidxupdate before update or insert on titles for each row
execute procedure tsearch(titleidx, title);

As you know tsearch(titleidx, unac(title)) doesn't work.

>-- Messaggio Originale --
>Date: Wed, 18 Sep 2002 17:04:56 +0300 (GMT)
>From: Oleg Bartunov <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Cc: [EMAIL PROTECTED]
>Subject: Re: [HACKERS] unaccent
>
>
>On Wed, 18 Sep 2002 [EMAIL PROTECTED] wrote:
>
>> The best way to use it is quite simple. If you want to index the table
>"titles"
>> and "title" is the field containing the text to be indexed, you can create
>> another unaccented field, for instance "utitle".
>>
>> UPDATE titles SET utitle = unac(title);
>>
>> Of course you can set it up as a trigger function. Then you can use utitle
>> with txt2txtidx and tsearch.
>>
>> Another solution is to generate the txtidx field (i.e. titleidx) directly
>> using unac:
>>
>> UPDATE titles SET titleidx = txt2txtidx(unac(title));
>>
>> But the problem is that I've not succeeded using it with tsearch because
>> (of course) it doesn't allow functions as parameters. So my first idea
>was
>> to integrate unac in tsearch.
>
>what's exactly a problem ?
>UPDATE titles SET titleidx = txt2txtidx(unac(title));
>works fine. Perhaps, you have a problem with query ?
>
>>
>> Bye.
>>
>> >-- Messaggio Originale --
>> >Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
>> >From: Oleg Bartunov <[EMAIL PROTECTED]>
>> >To: [EMAIL PROTECTED]
>> >Cc: [EMAIL PROTECTED]
>> >Subject: Re: [HACKERS] unaccent
>> >
>> >
>> >On Wed, 18 Sep 2002 [EMAIL PROTECTED] wrote:
>> >
>> >> Greetings,
>> >>
>> >> As far as I use the txtidx data structure in conjunction with gist
indexing
>> >> to make a word indexing of a very large UNICODE db, I've implemented
>> a
>> >PostgreSQL
>> >> function that uses libunac to unaccent TEXT fileds.
>> >>
>> >> The resulting text is in UTF-8, but you can modify it in the sources
>> with
>> >> an appropriate value (using iconv charset names).
>> >>
>> >> Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
>> >>
>> >> Extract the archive, compile it (make). Move pg_unac.so to your postgresql
>> >> shared libraries dir.
>> >>
>> >> Link it in postgresql:
>> >>
>> >> CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
>> >> C;
>> >>
>> >> What about integrating unaccent libraries directly in tsearch? It
is
>> useful
>> >> for french search engines (for instance).
>> >
>> >I think better to have separate module contrib/unac and document using
>> >it with tsearch. Please write us a couple of lines about using
>> >your function and we'll add them into tsearch documentation.
>> >
>> >btw, use palloc instead of malloc in postgresql functions .
>> >
>> >>
>> >> Bye.
>> >>
>> >> Nhan NGO DINH
>> >>
>> >>
>> >> __
>> >> Tiscali Ricaricasa
>> >> la prima prepagata per navigare in Internet a meno di un'urbana e
>> >> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
>> >> nessun costo di attivazione n? di ricarica!
>> >> http://ricaricasaonline.tiscali.it/
>> >>
>> >>
>> >>
>> >>
>> >
>> >Regards,
>> >Oleg
>> >_
>> >Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
>> >Sternberg Astronomical Institute, Moscow University (Russia)
>> >Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
>> >phone: +007(095)939-16-83, +007(095)939-23-83
>> >
>> >
>> >---(end of broadcast)---
>> >TIP 3: if posting/reading through Usenet, please send an appropriate
>> >subscribe-nomail command to [EMAIL PROTECTED] so that your
>> >message can get through to the mailing list cleanly
>>
>>
>>
>> __
>> Tiscali Ricaricasa
>> la prim

Re: [HACKERS] unaccent

2002-09-18 Thread Oleg Bartunov

On Wed, 18 Sep 2002 [EMAIL PROTECTED] wrote:

> The best way to use it is quite simple. If you want to index the table "titles"
> and "title" is the field containing the text to be indexed, you can create
> another unaccented field, for instance "utitle".
>
> UPDATE titles SET utitle = unac(title);
>
> Of course you can set it up as a trigger function. Then you can use utitle
> with txt2txtidx and tsearch.
>
> Another solution is to generate the txtidx field (i.e. titleidx) directly
> using unac:
>
> UPDATE titles SET titleidx = txt2txtidx(unac(title));
>
> But the problem is that I've not succeeded using it with tsearch because
> (of course) it doesn't allow functions as parameters. So my first idea was
> to integrate unac in tsearch.

what's exactly a problem ?
UPDATE titles SET titleidx = txt2txtidx(unac(title));
works fine. Perhaps, you have a problem with query ?

>
> Bye.
>
> >-- Messaggio Originale --
> >Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
> >From: Oleg Bartunov <[EMAIL PROTECTED]>
> >To: [EMAIL PROTECTED]
> >Cc: [EMAIL PROTECTED]
> >Subject: Re: [HACKERS] unaccent
> >
> >
> >On Wed, 18 Sep 2002 [EMAIL PROTECTED] wrote:
> >
> >> Greetings,
> >>
> >> As far as I use the txtidx data structure in conjunction with gist indexing
> >> to make a word indexing of a very large UNICODE db, I've implemented
> a
> >PostgreSQL
> >> function that uses libunac to unaccent TEXT fileds.
> >>
> >> The resulting text is in UTF-8, but you can modify it in the sources
> with
> >> an appropriate value (using iconv charset names).
> >>
> >> Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
> >>
> >> Extract the archive, compile it (make). Move pg_unac.so to your postgresql
> >> shared libraries dir.
> >>
> >> Link it in postgresql:
> >>
> >> CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
> >> C;
> >>
> >> What about integrating unaccent libraries directly in tsearch? It is
> useful
> >> for french search engines (for instance).
> >
> >I think better to have separate module contrib/unac and document using
> >it with tsearch. Please write us a couple of lines about using
> >your function and we'll add them into tsearch documentation.
> >
> >btw, use palloc instead of malloc in postgresql functions .
> >
> >>
> >> Bye.
> >>
> >> Nhan NGO DINH
> >>
> >>
> >> __
> >> Tiscali Ricaricasa
> >> la prima prepagata per navigare in Internet a meno di un'urbana e
> >> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
> >> nessun costo di attivazione n? di ricarica!
> >> http://ricaricasaonline.tiscali.it/
> >>
> >>
> >>
> >>
> >
> > Regards,
> > Oleg
> >_
> >Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
> >Sternberg Astronomical Institute, Moscow University (Russia)
> >Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
> >phone: +007(095)939-16-83, +007(095)939-23-83
> >
> >
> >---(end of broadcast)---
> >TIP 3: if posting/reading through Usenet, please send an appropriate
> >subscribe-nomail command to [EMAIL PROTECTED] so that your
> >message can get through to the mailing list cleanly
>
>
>
> __
> Tiscali Ricaricasa
> la prima prepagata per navigare in Internet a meno di un'urbana e
> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
> nessun costo di attivazione né di ricarica!
> http://ricaricasaonline.tiscali.it/
>
>
>
>
> ---(end of broadcast)---
> TIP 4: Don't 'kill -9' the postmaster
>

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] unaccent

2002-09-18 Thread Oleg Bartunov

On Wed, 18 Sep 2002, Karel Zak wrote:

> On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:
> > On Wed, 18 Sep 2002 [EMAIL PROTECTED] wrote:
> > >
> > > Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
> > >
> > > Extract the archive, compile it (make). Move pg_unac.so to your postgresql
> > > shared libraries dir.
> > >
> > I think better to have separate module contrib/unac and document using
> > it with tsearch. Please write us a couple of lines about using
> > your function and we'll add them into tsearch documentation.
>
>  I think about --with-unaccent for PostgreSQL and to_ascii() in
>  main tree. Comment?

Hmm, it'd require linking yet another library. contrib module is
a standard way to test/develope possible future feature.

>
> Karel
>
>

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] unaccent

2002-09-18 Thread nngodinh

Not "to_ascii", since there are so many extended UNICODE characters that
doesn't have any accent and should not be converted to an ASCII character.

>-- Messaggio Originale --
>Date: Wed, 18 Sep 2002 14:24:26 +0200
>From: Karel Zak <[EMAIL PROTECTED]>
>To: Oleg Bartunov <[EMAIL PROTECTED]>
>Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
>Subject: Re: [HACKERS] unaccent
>
>
>On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:
>> On Wed, 18 Sep 2002 [EMAIL PROTECTED] wrote:
>> >
>> > Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
>> >
>> > Extract the archive, compile it (make). Move pg_unac.so to your postgresql
>> > shared libraries dir.
>> >
>> I think better to have separate module contrib/unac and document using
>> it with tsearch. Please write us a couple of lines about using
>> your function and we'll add them into tsearch documentation.
>
> I think about --with-unaccent for PostgreSQL and to_ascii() in
> main tree. Comment?
>
>Karel
>
>--
> Karel Zak  <[EMAIL PROTECTED]>
> http://home.zf.jcu.cz/~zakkr/
>
> C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
>
>---(end of broadcast)---
>TIP 6: Have you searched our list archives?
>
>http://archives.postgresql.org


__
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/




---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly



Re: [HACKERS] unaccent

2002-09-18 Thread nngodinh

The best way to use it is quite simple. If you want to index the table "titles"
and "title" is the field containing the text to be indexed, you can create
another unaccented field, for instance "utitle".

UPDATE titles SET utitle = unac(title);

Of course you can set it up as a trigger function. Then you can use utitle
with txt2txtidx and tsearch.

Another solution is to generate the txtidx field (i.e. titleidx) directly
using unac:

UPDATE titles SET titleidx = txt2txtidx(unac(title));

But the problem is that I've not succeeded using it with tsearch because
(of course) it doesn't allow functions as parameters. So my first idea was
to integrate unac in tsearch.

Bye.

>-- Messaggio Originale --
>Date: Wed, 18 Sep 2002 15:08:59 +0300 (GMT)
>From: Oleg Bartunov <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Cc: [EMAIL PROTECTED]
>Subject: Re: [HACKERS] unaccent
>
>
>On Wed, 18 Sep 2002 [EMAIL PROTECTED] wrote:
>
>> Greetings,
>>
>> As far as I use the txtidx data structure in conjunction with gist indexing
>> to make a word indexing of a very large UNICODE db, I've implemented
a
>PostgreSQL
>> function that uses libunac to unaccent TEXT fileds.
>>
>> The resulting text is in UTF-8, but you can modify it in the sources
with
>> an appropriate value (using iconv charset names).
>>
>> Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
>>
>> Extract the archive, compile it (make). Move pg_unac.so to your postgresql
>> shared libraries dir.
>>
>> Link it in postgresql:
>>
>> CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
>> C;
>>
>> What about integrating unaccent libraries directly in tsearch? It is
useful
>> for french search engines (for instance).
>
>I think better to have separate module contrib/unac and document using
>it with tsearch. Please write us a couple of lines about using
>your function and we'll add them into tsearch documentation.
>
>btw, use palloc instead of malloc in postgresql functions .
>
>>
>> Bye.
>>
>> Nhan NGO DINH
>>
>>
>> __
>> Tiscali Ricaricasa
>> la prima prepagata per navigare in Internet a meno di un'urbana e
>> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
>> nessun costo di attivazione n? di ricarica!
>> http://ricaricasaonline.tiscali.it/
>>
>>
>>
>>
>
>   Regards,
>   Oleg
>_
>Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
>Sternberg Astronomical Institute, Moscow University (Russia)
>Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
>phone: +007(095)939-16-83, +007(095)939-23-83
>
>
>---(end of broadcast)---
>TIP 3: if posting/reading through Usenet, please send an appropriate
>subscribe-nomail command to [EMAIL PROTECTED] so that your
>message can get through to the mailing list cleanly



__
Tiscali Ricaricasa
la prima prepagata per navigare in Internet a meno di un'urbana e
risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
nessun costo di attivazione né di ricarica!
http://ricaricasaonline.tiscali.it/




---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [HACKERS] unaccent

2002-09-18 Thread Karel Zak

On Wed, Sep 18, 2002 at 03:08:59PM +0300, Oleg Bartunov wrote:
> On Wed, 18 Sep 2002 [EMAIL PROTECTED] wrote:
> >
> > Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
> >
> > Extract the archive, compile it (make). Move pg_unac.so to your postgresql
> > shared libraries dir.
> >
> I think better to have separate module contrib/unac and document using
> it with tsearch. Please write us a couple of lines about using
> your function and we'll add them into tsearch documentation.

 I think about --with-unaccent for PostgreSQL and to_ascii() in
 main tree. Comment?

Karel

-- 
 Karel Zak  <[EMAIL PROTECTED]>
 http://home.zf.jcu.cz/~zakkr/
 
 C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz

---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://archives.postgresql.org



Re: [HACKERS] unaccent

2002-09-18 Thread Oleg Bartunov

On Wed, 18 Sep 2002 [EMAIL PROTECTED] wrote:

> Greetings,
>
> As far as I use the txtidx data structure in conjunction with gist indexing
> to make a word indexing of a very large UNICODE db, I've implemented a PostgreSQL
> function that uses libunac to unaccent TEXT fileds.
>
> The resulting text is in UTF-8, but you can modify it in the sources with
> an appropriate value (using iconv charset names).
>
> Get libunac from: http://www.nongnu.org/unac/ (it uses iconv)
>
> Extract the archive, compile it (make). Move pg_unac.so to your postgresql
> shared libraries dir.
>
> Link it in postgresql:
>
> CREATE FUNCTION unac(TEXT) RETURNS TEXT AS 'path_to_pg_unac.so' LANGUAGE
> C;
>
> What about integrating unaccent libraries directly in tsearch? It is useful
> for french search engines (for instance).

I think better to have separate module contrib/unac and document using
it with tsearch. Please write us a couple of lines about using
your function and we'll add them into tsearch documentation.

btw, use palloc instead of malloc in postgresql functions .

>
> Bye.
>
> Nhan NGO DINH
>
>
> __
> Tiscali Ricaricasa
> la prima prepagata per navigare in Internet a meno di un'urbana e
> risparmiare su tutte le tue telefonate. Acquistala on line e non avrai
> nessun costo di attivazione né di ricarica!
> http://ricaricasaonline.tiscali.it/
>
>
>
>

Regards,
Oleg
_
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to [EMAIL PROTECTED] so that your
message can get through to the mailing list cleanly