Re: [HACKERS] TS: Limited cover density ranking

2012-01-28 Thread Oleg Bartunov
I suggest you work on a more general approach, see
http://www.sai.msu.su/~megera/wiki/2009-08-12 for example.


btw, I don't like that you changed the ts_rank_cd arguments.

Oleg
On Fri, 27 Jan 2012, karave...@mail.bg wrote:


Hello,

I have developed a variation of the cover density ranking functions that counts
only covers that are no longer than a specified limit. It is useful for finding
combinations of terms that appear near one another. Here is an example of
usage:

-- normal cover density ranking : not changed
luben=> select ts_rank_cd(to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
 ts_rank_cd
------------
 0.033
(1 row)

-- limited to 2
luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
 ts_rank_cd
------------
 0
(1 row)

luben=> select ts_rank_cd(2, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a&d'));
 ts_rank_cd
------------
 0.1
(1 row)

-- limited to 3
luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k'), to_tsquery('a&d'));
 ts_rank_cd
------------
 0.033
(1 row)

luben=> select ts_rank_cd(3, to_tsvector('a b c d e g h i j k a d'), to_tsquery('a&d'));
 ts_rank_cd
------------
 0.13
(1 row)

Please find attached a patch against the 9.1.2 sources. I preferred to make a
patch rather than a separate extension because it is only a one-statement change
in the calc_rank_cd function. If I had to make an extension, a lot of code would
be duplicated between backend/utils/adt/tsrank.c and the extension.

I have some questions:

1. Is it interesting to develop it further (documentation, cleanup, etc.) for
inclusion in one of the next versions? If so, there are some further
questions:

- Should I overload ts_rank_cd (as in the examples above and the patch), or
should I define a new set of functions, for example ts_rank_lcd?
- Should I define these new SQL-level functions in core, or should I go
only with this two-line change in calc_rank_cd() and define the new functions
in an extension? If we prefer the latter, could I overload core functions with
functions defined in extensions?
- Finally, there is always the possibility of duplicating the code and making
an independent extension.

2. If I run the patched version on a cluster that was initialized with an
unpatched server, is there a way to register the new functions in the system
catalog without reinitializing the cluster?

Best regards
luben

--
Luben Karavelov


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TS: Limited cover density ranking

2012-01-28 Thread karavelov
- Quote from Oleg Bartunov (o...@sai.msu.su), on 28.01.2012 at 21:04 -

> I suggest you work on a more general approach, see
> http://www.sai.msu.su/~megera/wiki/2009-08-12 for example.
>
> btw, I don't like that you changed the ts_rank_cd arguments.

Hello Oleg, 

Thanks for the feedback. 

Is it OK to begin by adding an extra argument and a check in calc_rank_cd?

I could change the function names in order not to overload the ts_rank_cd
arguments. My proposal is:

at sql level: 
ts_rank_lcd([weights], tsvector, tsquery, limit, [method]) 

at C level: 
ts_ranklcd_wttlf 
ts_ranklcd_wttl 
ts_ranklcd_ttlf 
ts_ranklcd_ttl 

Adding the functions could be done as an extension but they are just 
trampolines into calc_rank_cd(). 

I agree that what you describe on the wiki page is a more general approach. So
this:

SELECT ts_rank_lcd(to_tsvector('a b c'), to_tsquery('a&c'), 2) > 0;

could be replaced with

SELECT to_tsvector('a b c') @@ to_tsquery('(a ?2 c)|(c ?2 a) ');

but if we need to look for 3 or more nearby terms in any order, the tsquery
with the '?' operator will become quite complicated. For example

SELECT tsvec @@
'(a ? b ? c) | (a ? c ? b) | (b ? a ? c) | (b ? c ? a) | (c ? a ? b) | (c ? b ? a)'::tsquery;

is the same as

SELECT ts_rank_lcd(tsvec, 'a&b&c'::tsquery, 2) > 0;

So I think the general approach does not exclude the usefulness of the
approach that I am proposing.
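
To make the growth concrete, here is a small standalone C sketch (not PostgreSQL code; ordering_count, append and permute are hypothetical helper names) that writes out the alternatives needed for n unordered terms with the proposed '?' operator. The number of alternatives is n!, so 3 terms need 6 of them and 5 terms would already need 120.

```c
#include <assert.h>
#include <string.h>

/* Number of orderings that must be spelled out for n unordered terms: n! */
static unsigned long
ordering_count(unsigned n)
{
	unsigned long r = 1;
	unsigned	i;

	for (i = 2; i <= n; i++)
		r *= i;
	return r;
}

/* Append s to out; the caller is responsible for sizing the buffer. */
static void
append(char *out, size_t outsz, const char *s)
{
	assert(strlen(out) + strlen(s) < outsz);
	strcat(out, s);
}

/*
 * Append one "(t0 ? t1 ? ...)" alternative, separated by " | ", for every
 * ordering of terms[k..n-1].  The terms array is restored before returning.
 */
static void
permute(const char **terms, unsigned k, unsigned n, char *out, size_t outsz)
{
	unsigned	i;

	if (k == n)
	{
		if (out[0] != '\0')
			append(out, outsz, " | ");
		append(out, outsz, "(");
		for (i = 0; i < n; i++)
		{
			if (i > 0)
				append(out, outsz, " ? ");
			append(out, outsz, terms[i]);
		}
		append(out, outsz, ")");
		return;
	}
	for (i = k; i < n; i++)
	{
		const char *tmp = terms[k];

		terms[k] = terms[i];
		terms[i] = tmp;
		permute(terms, k + 1, n, out, outsz);
		tmp = terms[k];			/* swap back */
		terms[k] = terms[i];
		terms[i] = tmp;
	}
}
```

Calling permute() with the terms a, b, c and an empty buffer produces the six alternatives of the query above (possibly in a different order), while the limited ranking call needs only the three terms and the limit.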

Best regards 

-- 
Luben Karavelov


Re: [HACKERS] TS: Limited cover density ranking

2012-01-27 Thread karavelov
And here is the patch that I forgot to attach.
> Hello,
>
> I have developed a variation of the cover density ranking functions that counts
> only covers that are no longer than a specified limit. It is useful for finding
> combinations of terms that appear near one another. Here is an example of
> usage:

...

>
> Please find attached a patch against the 9.1.2 sources. I preferred to make a
> patch rather than a separate extension because it is only a one-statement change
> in the calc_rank_cd function. If I had to make an extension, a lot of code would
> be duplicated between backend/utils/adt/tsrank.c and the extension.
>
 
--
Luben Karavelov

diff -pur postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c /usr/src/postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c
--- postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c	2011-12-01 23:47:20.0 +0200
+++ /usr/src/postgresql-9.1-9.1.2/src/backend/utils/adt/tsrank.c	2012-01-27 07:45:34.558028176 +0200
@@ -724,7 +724,7 @@ get_docrep(TSVector txt, QueryRepresenta
 }
 
 static float4
-calc_rank_cd(float4 *arrdata, TSVector txt, TSQuery query, int method)
+calc_rank_cd(int limit, float4 *arrdata, TSVector txt, TSQuery query, int method)
 {
 	DocRepresentation *doc;
 	int			len,
@@ -768,6 +768,9 @@ calc_rank_cd(float4 *arrdata, TSVector t
 		int			nNoise;
 		DocRepresentation *ptr = ext.begin;
 
+		if (limit > 0 && ext.end->pos - ext.begin->pos > limit)
+			continue;
+
 		while (ptr <= ext.end)
 		{
 			InvSum += invws[ptr->wclass];
@@ -834,7 +837,7 @@ ts_rankcd_wttf(PG_FUNCTION_ARGS)
 	int			method = PG_GETARG_INT32(3);
 	float		res;
 
-	res = calc_rank_cd(getWeights(win), txt, query, method);
+	res = calc_rank_cd(0, getWeights(win), txt, query, method);
 
 	PG_FREE_IF_COPY(win, 0);
 	PG_FREE_IF_COPY(txt, 1);
@@ -850,7 +853,7 @@ ts_rankcd_wtt(PG_FUNCTION_ARGS)
 	TSQuery		query = PG_GETARG_TSQUERY(2);
 	float		res;
 
-	res = calc_rank_cd(getWeights(win), txt, query, DEF_NORM_METHOD);
+	res = calc_rank_cd(0, getWeights(win), txt, query, DEF_NORM_METHOD);
 
 	PG_FREE_IF_COPY(win, 0);
 	PG_FREE_IF_COPY(txt, 1);
@@ -866,7 +869,7 @@ ts_rankcd_ttf(PG_FUNCTION_ARGS)
 	int			method = PG_GETARG_INT32(2);
 	float		res;
 
-	res = calc_rank_cd(getWeights(NULL), txt, query, method);
+	res = calc_rank_cd(0, getWeights(NULL), txt, query, method);
 
 	PG_FREE_IF_COPY(txt, 0);
 	PG_FREE_IF_COPY(query, 1);
@@ -880,9 +883,75 @@ ts_rankcd_tt(PG_FUNCTION_ARGS)
 	TSQuery		query = PG_GETARG_TSQUERY(1);
 	float		res;
 
-	res = calc_rank_cd(getWeights(NULL), txt, query, DEF_NORM_METHOD);
+	res = calc_rank_cd(0, getWeights(NULL), txt, query, DEF_NORM_METHOD);
 
 	PG_FREE_IF_COPY(txt, 0);
 	PG_FREE_IF_COPY(query, 1);
 	PG_RETURN_FLOAT4(res);
 }
+
+Datum
+ts_rankcd_lwttf(PG_FUNCTION_ARGS)
+{
+	int			limit = PG_GETARG_INT32(0);
+	ArrayType  *win = (ArrayType *) PG_DETOAST_DATUM(PG_GETARG_DATUM(1));
+	TSVector	txt = PG_GETARG_TSVECTOR(2);
+	TSQuery		query = PG_GETARG_TSQUERY(3);
+	int			method = PG_GETARG_INT32(4);
+	float		res;
+
+	res = calc_rank_cd(limit, getWeights(win), txt, query, method);
+
+	PG_FREE_IF_COPY(win, 1);
+	PG_FREE_IF_COPY(txt, 2);
+	PG_FREE_IF_COPY(query, 3);
+	PG_RETURN_FLOAT4(res);
+}
+
+Datum
+ts_rankcd_lwtt(PG_FUNCTION_ARGS)
+{
+	int			limit = PG_GETARG_INT32(0);
+	ArrayType  *win = (ArrayType *) PG_DETOAST_DATUM(PG_GETARG_DATUM(1));
+	TSVector	txt = PG_GETARG_TSVECTOR(2);
+	TSQuery		query = PG_GETARG_TSQUERY(3);
+	float		res;
+
+	res = calc_rank_cd(limit, getWeights(win), txt, query, DEF_NORM_METHOD);
+
+	PG_FREE_IF_COPY(win, 1);
+	PG_FREE_IF_COPY(txt, 2);
+	PG_FREE_IF_COPY(query, 3);
+	PG_RETURN_FLOAT4(res);
+}
+
+Datum
+ts_rankcd_lttf(PG_FUNCTION_ARGS)
+{
+	int			limit = PG_GETARG_INT32(0);
+	TSVector	txt = PG_GETARG_TSVECTOR(1);
+	TSQuery		query = PG_GETARG_TSQUERY(2);
+	int			method = PG_GETARG_INT32(3);
+	float		res;
+
+	res = calc_rank_cd(limit, getWeights(NULL), txt, query, method);
+
+	PG_FREE_IF_COPY(txt, 1);
+	PG_FREE_IF_COPY(query, 2);
+	PG_RETURN_FLOAT4(res);
+}
+
+Datum
+ts_rankcd_ltt(PG_FUNCTION_ARGS)
+{
+	int			limit = PG_GETARG_INT32(0);
+	TSVector	txt = PG_GETARG_TSVECTOR(1);
+	TSQuery		query = PG_GETARG_TSQUERY(2);
+	float		res;
+
+	res = calc_rank_cd(limit, getWeights(NULL), txt, query, DEF_NORM_METHOD);
+
+	PG_FREE_IF_COPY(txt, 1);
+	PG_FREE_IF_COPY(query, 2);
+	PG_RETURN_FLOAT4(res);
+}
diff -pur postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h /usr/src/postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h
--- postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h	2011-12-01 23:47:20.0 +0200
+++ /usr/src/postgresql-9.1-9.1.2/src/include/catalog/pg_proc.h	2012-01-27 05:45:53.944979678 +0200
@@ -4159,6 +4159,15 @@ DATA(insert OID = 3709 (  ts_rank_cd	PGN
 DESCR(relevance);
 DATA(insert OID = 3710 (  ts_rank_cd	PGNSP PGUID 12 1 0 0 f f f t f i 2 0 700 3614 3615 _null_ _null_ _null_ _null_ ts_rankcd_tt _null_ _null_ _null_ ));
 DESCR(relevance);
+DATA(insert OID = 3675 (  ts_rank_cd	PGNSP PGUID 12 1 0 0 f f f t f i 5 0 700 23 1021 3614 

Re: [HACKERS] TS: Limited cover density ranking

2012-01-27 Thread Sushant Sinha
The rank counts 1/coversize. So bigger covers will not have much impact
anyway. What is the need of the patch?

-Sushant.






Re: [HACKERS] TS: Limited cover density ranking

2012-01-27 Thread karavelov
- Quote from Sushant Sinha (sushant...@gmail.com), on 27.01.2012 at 18:32 -

> The rank counts 1/coversize. So bigger covers will not have much impact
> anyway. What is the need of the patch?
>
> -Sushant.


If you want to find only combinations of words that are close to one another,
with the patch you could use something like:

WITH a AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('a&d') AS query)
SELECT * FROM a WHERE vec @@ query AND ts_rank_cd(3, vec, query) > 0;

I could not find another way to write this type of query. If there is an
alternative, I am open to suggestions.
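
As a rough illustration of how the limit interacts with the scores shown earlier, here is a simplified standalone model in C. It is not the code in tsrank.c: cover_t, cover_score and limited_rank are hypothetical names, the cover lists are written by hand, and the scoring assumes the default weight of 0.1 for every lexeme and no normalization, so each cover contributes 0.1 divided by one plus the number of non-query words inside it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for one cover: the word positions of its first and
 * last lexeme and the number of query terms it contains.
 */
typedef struct
{
	int			begin_pos;
	int			end_pos;
	int			n_terms;
} cover_t;

/*
 * Assumption: all lexemes have the default weight 0.1, so the base score of
 * a cover is 0.1, reduced by the number of non-query words it spans.
 */
static double
cover_score(const cover_t *c)
{
	int			noise = (c->end_pos - c->begin_pos) - (c->n_terms - 1);

	assert(noise >= 0);
	return 0.1 / (1.0 + noise);
}

/* limit <= 0 means "no limit", mirroring calc_rank_cd(0, ...) in the patch. */
static double
limited_rank(int limit, const cover_t *covers, size_t n)
{
	double		rank = 0.0;
	size_t		i;

	for (i = 0; i < n; i++)
	{
		/* same test as in the patch: skip covers wider than the limit */
		if (limit > 0 && covers[i].end_pos - covers[i].begin_pos > limit)
			continue;
		rank += cover_score(&covers[i]);
	}
	return rank;
}
```

With the single cover a(1)..d(4) from 'a b c d e g h i j k', limited_rank(2, ...) gives 0 and limited_rank(3, ...) gives about 0.033. Adding the adjacent cover a(11)..d(12) gives about 0.1 and 0.13 respectively, in line with the examples at the start of the thread.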

Best regards
--
Luben Karavelov

Re: [HACKERS] TS: Limited cover density ranking

2012-01-27 Thread karavelov
- Quote from karave...@mail.bg, on 27.01.2012 at 18:48 -

> - Quote from Sushant Sinha (sushant...@gmail.com), on 27.01.2012 at 18:32 -
>
> > The rank counts 1/coversize. So bigger covers will not have much impact
> > anyway. What is the need of the patch?
> >
> > -Sushant.
>
> If you want to find only combinations of words that are close to one another,
> with the patch you could use something like:
>
> WITH a AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('a&d') AS query)
> SELECT * FROM a WHERE vec @@ query AND ts_rank_cd(3, vec, query) > 0;
 

Another example: if you want to match 'b c d' only, you could use:

WITH A AS (SELECT to_tsvector('a b c d e g h i j k') AS vec, to_tsquery('b&c&d') AS query)
SELECT * FROM A WHERE vec @@ query AND ts_rank_cd(2, vec, query) > 0;

The catch is that it will also match 'b d c', 'd c b', 'd b c', 'c d b' and
'c b d', so it is not a replacement for exact phrase matching, but it is
something that I find useful.

--
Luben Karavelov