Re: [HACKERS] Tsvector editing functions

2016-03-11 Thread Teodor Sigaev

I saw errors on windows, here is the fix:

Thank you, pushed

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-03-11 Thread Stas Kelvich
> 
> On 11 Mar 2016, at 16:13, Stas Kelvich  wrote:
> 
> 
>> On 10 Mar 2016, at 20:29, Teodor Sigaev  wrote:
>> 
>> I would like to suggest rename both functions to array_to_tsvector and 
>> tsvector_to_array to have consistent name. Later we could add 
>> to_tsvector([regconfig, ], text[]) with morphological processing.
>> 
>> Thoughts?
>> 

Hi, thanks for commit.

I saw errors on windows, here is the fix:



tsvector.fix.patch
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-03-11 Thread Stas Kelvich

> On 10 Mar 2016, at 20:29, Teodor Sigaev  wrote:
> 
> I would like to suggest rename both functions to array_to_tsvector and 
> tsvector_to_array to have consistent name. Later we could add 
> to_tsvector([regconfig, ], text[]) with morphological processing.
> 
> Thoughts?
> 

Seems reasonable, done.



tsvector_ops-v6.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-03-10 Thread Teodor Sigaev

Thanks! Fixed and added tests.

Thank you!

I did some patch cleanup/fix, but I have some doubt with function's names:

1 to_tsvector:
# \df to_tsvector
 List of functions
   Schema   |Name | Result data type | Argument data types |  Type
+-+--+-+
 pg_catalog | to_tsvector | tsvector | regconfig, text | normal
 pg_catalog | to_tsvector | tsvector | text| normal
 pg_catalog | to_tsvector | tsvector | text[]  | normal

First two variants of to_tsvector make a morphological processing, last one 
doesn't.

2 to_array
# \df *to_array
  List of functions
   Schema   | Name  | Result data type | Argument data types | 
 Type

+---+--+-+
 pg_catalog | regexp_split_to_array | text[]   | text, text  | 
normal
 pg_catalog | regexp_split_to_array | text[]   | text, text, text| 
normal
 pg_catalog | string_to_array   | text[]   | text, text  | 
normal
 pg_catalog | string_to_array   | text[]   | text, text, text| 
normal
 pg_catalog | to_array  | text[]   | tsvector| 
normal


Seems, to_array is not a right name compared to other *to_array.

I would like to suggest rename both functions to array_to_tsvector and 
tsvector_to_array to have consistent name. Later we could add 
to_tsvector([regconfig, ], text[]) with morphological processing.


Thoughts?



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4b5ee81..ed0b6be 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9211,16 +9211,29 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
  
   setweight
  
- setweight(tsvector, "char")
+ setweight(vector tsvector, weight "char")
 
 tsvector
-assign weight to each element of tsvector
+assign weight to each element of vector
 setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A')
 'cat':3A 'fat':2A,4A 'rat':5A


 
  
+  setweight
+  setweight by filter
+ 
+ setweight(vector tsvector, weight "char", lexemes "text"[])
+
+tsvector
+assign weight to elements of vector that are listed in lexemes array
+setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A', '{cat,rat}')
+'cat':3A 'fat':2,4 'rat':5A
+   
+   
+
+ 
   strip
  
  strip(tsvector)
@@ -9233,6 +9246,81 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple

 
  
+  delete
+  delete lemexeme
+ 
+ delete(vector tsvector, lexeme text)
+
+tsvector
+remove given lexeme from vector
+delete('fat:2,4 cat:3 rat:5A'::tsvector, 'fat')
+'cat':3 'rat':5A
+   
+   
+
+ 
+  delete
+  delete lemexemes array
+ 
+ delete(vector tsvector, lexemes text[])
+
+tsvector
+remove any occurrence of lexemes in lexemes array from vector
+delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat'])
+'cat':3
+   
+   
+
+ 
+  unnest
+ 
+ unnest(tsvector, OUT lexeme text, OUT positions smallint[], OUT weights text)
+
+setof record
+expand a tsvector to a set of rows
+unnest('fat:2,4 cat:3 rat:5A'::tsvector)
+(cat,{3},{D}) ...
+   
+   
+
+ 
+  to_array
+ 
+ to_array(tsvector)
+
+text[]
+convert tsvector to array of lexemes
+to_array('fat:2,4 cat:3 rat:5A'::tsvector)
+{cat,fat,rat}
+   
+   
+
+ 
+  to_tsvector
+  array to tsvector
+ 
+ to_tsvector(text[])
+
+tsvector
+convert array of lexemes to tsvector
+to_tsvector('{fat,cat,rat}'::text[])
+'fat' 'cat' 'rat'
+   
+   
+
+ 
+  filter
+ 
+ filter(vector tsvector, weights "char"[])
+
+tsvector
+Select only elements with given weights from vector
+filter('fat:2,4 cat:3b rat:5A'::tsvector, '{a,b}')
+'cat':3B 'rat':5A
+   
+   
+
+ 
   to_tsquery
  
  to_tsquery( config regconfig ,  query text)
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index ff99976..ea3abc9 100644
--- 

Re: [HACKERS] Tsvector editing functions

2016-02-17 Thread Stas Kelvich
> 
> On 02 Feb 2016, at 20:10, Teodor Sigaev  wrote:
> 
> Some notices:
> 
> 1 tsin in documentation doesn't look like a good name. Changed to vector 
> similar to other places.
> 
> 2 I did some editorization about freeing memory/forgotten names etc

Thanks.

> 
> 3 It seems to me that tsvector_unnest() could be seriously optimized for
> large tsvectors: with current coding it detoasts/decompresses tsvector value 
> on each call. Much better to do it once in
> multi_call_memory_ctx context at first call init

Done, moved detoasting to first SRF call.

> 4 It seems debatable returning empty array for position/weight if they are 
> absent:
> =# select * from unnest('a:1 b'::tsvector);
> lexeme | positions | weights
> +---+-
> a  | {1}   | {D}
> b  | {}| {}
> I think, it's better to return NULL in this case
> 

Okay, done.

> 5 
> array_to_tsvector/tsvector_setweight_by_filter/tsvector_delete_arr/tsvector_filter
>  doesn't check or pay attention to NULL elements in input arrays
> 

Thanks! Fixed and added tests.

> -- 
> Teodor Sigaev   E-mail: teo...@sigaev.ru
>   WWW: http://www.sigaev.ru/
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



tsvector_ops-v4.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-02-02 Thread Teodor Sigaev

Some notices:

1 tsin in documentation doesn't look like a good name. Changed to vector similar 
to other places.


2 I did some editorization about freeing memory/forgotten names etc

3 It seems to me that tsvector_unnest() could be seriously optimized for
large tsvectors: with current coding it detoasts/decompresses tsvector value on 
each call. Much better to do it once in

multi_call_memory_ctx context at first call init

4 It seems debatable returning empty array for position/weight if they are 
absent:
=# select * from unnest('a:1 b'::tsvector);
 lexeme | positions | weights
+---+-
 a  | {1}   | {D}
 b  | {}| {}
I think, it's better to return NULL in this case

5 
array_to_tsvector/tsvector_setweight_by_filter/tsvector_delete_arr/tsvector_filter 
doesn't check or pay attention to NULL elements in input arrays






--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-02-02 Thread Teodor Sigaev

Some notices:

1 tsin in documentation doesn't look like a good name. Changed to vector similar
to other places.

2 I did some editorization about freeing memory/forgotten names etc


Ooops, forgot to attach
--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 139aa2b..9c294e3 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9168,16 +9168,29 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple
  
   setweight
  
- setweight(tsvector, "char")
+ setweight(vector tsvector, weight "char")
 
 tsvector
-assign weight to each element of tsvector
+assign weight to each element of vector
 setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A')
 'cat':3A 'fat':2A,4A 'rat':5A


 
  
+  setweight
+  setweight by filter
+ 
+ setweight(vector tsvector, weight "char", lexemes "text"[])
+
+tsvector
+assign weight to elements of vector that are listed in lexemes array
+setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A', '{cat,rat}')
+'cat':3A 'fat':2,4 'rat':5A
+   
+   
+
+ 
   strip
  
  strip(tsvector)
@@ -9190,6 +9203,84 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple

 
  
+  delete
+  delete lemexeme
+ 
+ delete(vector tsvector, lexeme text)
+
+tsvector
+remove given lexeme from vector
+delete('fat:2,4 cat:3 rat:5A'::tsvector, 'fat')
+'cat':3 'rat':5A
+   
+   
+
+ 
+  delete
+  delete lemexemes array
+ 
+ delete(vector tsvector, lexemes text[])
+
+tsvector
+remove any occurrence of lexemes in lexemes array from vector
+delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat'])
+'cat':3
+   
+   
+
+ 
+  unnest
+ 
+ unnest(tsvector)
+
+setof anyelement
+expand a tsvector to a set of rows. Each row has following columns: lexeme, postings, weights.
+unnest('fat:2,4 cat:3 rat:5A'::tsvector)
+cat{3}{A}
+fat{2,4}  {D,D}
+rat{5}{A}
+(3 rows)
+   
+   
+
+ 
+  to_array
+ 
+ to_array(tsvector)
+
+text[]
+convert tsvector to array of lexemes
+to_array('fat:2,4 cat:3 rat:5A'::tsvector)
+{cat,fat,rat}
+   
+   
+
+ 
+  to_tsvector
+  array to tsvector
+ 
+ to_tsvector(text[])
+
+tsvector
+convert array of lexemes to tsvector
+to_tsvector('{fat,cat,rat}'::text[])
+'fat' 'cat' 'rat'
+   
+   
+
+ 
+  filter
+ 
+ filter(vector tsvector, weights "char"[])
+
+tsvector
+Select only elements with given weights from vector
+filter('fat:2,4 cat:3b rat:5A'::tsvector, '{a,b}')
+'cat':3B 'rat':5A
+   
+   
+
+ 
   to_tsquery
  
  to_tsquery( config regconfig ,  query text)
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index d66b4d5..32033fa 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -1326,6 +1326,10 @@ FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
 

 
+   
+Full list of tsvector-related functions available in .
+   
+
   
 
   
diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c
index a3f1c361..e7ea270 100644
--- a/src/backend/utils/adt/tsvector_op.c
+++ b/src/backend/utils/adt/tsvector_op.c
@@ -14,6 +14,7 @@
 
 #include "postgres.h"
 
+#include "access/htup_details.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
 #include "commands/trigger.h"
@@ -65,6 +66,7 @@ typedef struct
 #define STATHDRSIZE (offsetof(TSVectorStat, data))
 
 static Datum tsvector_update_trigger(PG_FUNCTION_ARGS, bool config_column);
+static int tsvector_bsearch(TSVector tsin, char *lexin, int lexin_len);
 
 /*
  * Order: haspos, len, word, for all positions (pos, weight)
@@ -251,6 +253,81 @@ tsvector_setweight(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(out);
 }
 
+/*
+ * setweight(tsin tsvector, char_weight "char", lexemes "text"[])
+ *
+ * Assign weight w to elements of tsin that are listed in lexemes.
+ */
+Datum
+tsvector_setweight_by_filter(PG_FUNCTION_ARGS)
+{
+	TSVector	tsin = PG_GETARG_TSVECTOR(0);
+	char		char_weight = PG_GETARG_CHAR(1);
+	ArrayType  *lexemes = PG_GETARG_ARRAYTYPE_P(2);
+
+	TSVector	tsout;
+	int			i,
+j,
+nlexemes,
+

Re: [HACKERS] Tsvector editing functions

2016-01-27 Thread Stas Kelvich
Hi

> On 22 Jan 2016, at 19:03, Tomas Vondra  wrote:
> OK, although I do recommend using more sensible variable names, i.e. why how 
> to use 'lexemes' instead of 'lexarr' for example? Similarly for the other 
> functions.


Changed. With old names I tried to follow conventions in surrounding code, but 
probably that is a good idea to switch to more meaningful names in new code.

>> 
>> 
>> delete(tsin tsvector, tsv_filter tsvector) — Delete lexemes and/or positions 
>> of tsv_filter from tsin. When lexeme in tsv_filter has no positions function 
>> will delete any occurrence of same lexeme in tsin. When tsv_filter lexeme 
>> have positions function will delete them from positions of matching lexeme 
>> in tsin. If after such removal resulting positions set is empty then 
>> function will delete that lexeme from resulting tsvector.
>> 
> 
> I can't really imagine situation in which I'd need this, but if you do have a 
> use case for it ... although in the initial paragraph you say "... but if 
> somebody wants to delete for example ..." which suggests you may not have 
> such use case.
> 
> Based on bad experience with extending API based on vague ideas, I recommend 
> only really adding functions with existing need. It's easy to add a function 
> later, much more difficult to remove it or change the signature.

I tried to create more or less self-contained api, e.g. have ability to negate 
effect of concatenation. But i’ve also asked people around what they think 
about extending API and everybody convinced that it is better to stick to 
smaller API. So let’s drop it. At least that functions exists in mail list in 
case if somebody will google for such kind of behaviour.

>> 
>> Also if we want some level of completeness of API and taking into account 
>> that concat() function shift positions on second argument I thought that it 
>> can be useful to also add function that can shift all positions of specific 
>> value. This helps to undo concatenation: delete one of concatenating 
>> tsvectors and then shift positions in resulting tsvector. So I also wrote 
>> one another small function:
>> 
>> shift(tsin tsvector,offset int16) — Shift all positions in tsin by given 
>> offset
> 
> That seems rather too low-level. Shouldn't it be really built into delete() 
> directly somehow?


I think it is ambiguous task on delete. But if we are dropping support of 
delete(tsvector, tsvector) I don’t see points in keeping that functions.

>>> 
>>> 7) Some of the functions use intexterm that does not match the function
>>>   name. I see two such cases - to_tsvector and setweight. Is there a
>>>   reason for that?
>>> 
>> 
>> Because sgml compiler wants unique indexterm. Both functions that
>> youmentioned use overloading of arguments and have non-unique name.
> 
> As Michael pointed out, that should probably be handled by using  
> and  tags.


Done.


> On 19 Jan 2016, at 00:21, Alvaro Herrera  wrote:
> 
> 
> It's a bit funny that you reintroduce the "unrecognized weight: %d"
> (instead of %c) in tsvector_setweight_by_filter.
> 


Ah, I was thinking about moving it to separate diff and messed. Fixed and 
attaching diff with same fix for old tsvector_setweight.




tsvector_ops-v2.1.diff
Description: Binary data


tsvector_ops-v2.2.diff
Description: Binary data



---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-01-22 Thread Tomas Vondra



On 12/30/2015 06:49 PM, Stas Kelvich wrote:

Hi, Tomáš! Thanks for comprehensive review.


On 15 Dec 2015, at 06:07, Tomas Vondra  wrote:

1) It's a bit difficult to judge the usefulness of the API, as I've
   always been a mere user of full-text search, and I never had a need
   (or courage) to mess with the tsvectors. OTOH I don't see a good
   reason no to have such API, when there's a need for it.

   The API seems to be reasonably complete, with one exception - when
   looking at editing function we have for 'hstore', we do have these
   variants for delete()

  delete(hstore,text)
  delete(hstore,text[])
  delete(hstore,hstore)

   while this patch only adds delete(tsvector,text). Would it make
   sense to add variants similar to hstore? It probably does not make
   much sense to add delete(tsvector,tsvector), right? But being able
   to delete a bunch of lexemes in one go seems like a good thing.

   What do you think?


That’s a good idea and actually deleting tsvector from tsvector makes perfect 
sense. In delete function I used exact string match between string and lexemes in 
tsvector, but if somebody wants to delete for example “Cats” from tsvector, then 
he should downcase and singularize this word. Easiest way to do it is to just use 
to_tsvector() function. Also we can use this function to delete specific 
positions: like delete('cat:3 fat:2,4'::tsvector, 'fat:2'::tsvector) -> 'cat:3 
fat:4'::tsvector.

So in attached patch I’ve implemented following:

delete(tsin tsvector, lexarrtext[]) — remove any occurence of lexemes inlexarr 
from tsin


OK, although I do recommend using more sensible variable names, i.e. why 
how to use 'lexemes' instead of 'lexarr' for example? Similarly for the 
other functions.




delete(tsin tsvector, tsv_filter tsvector) — Delete lexemes and/or positions of 
tsv_filter from tsin. When lexeme in tsv_filter has no positions function will 
delete any occurrence of same lexeme in tsin. When tsv_filter lexeme have 
positions function will delete them from positions of matching lexeme in tsin. 
If after such removal resulting positions set is empty then function will 
delete that lexeme from resulting tsvector.



I can't really imagine situation in which I'd need this, but if you do 
have a use case for it ... although in the initial paragraph you say 
"... but if somebody wants to delete for example ..." which suggests you 
may not have such use case.


Based on bad experience with extending API based on vague ideas, I 
recommend only really adding functions with existing need. It's easy to 
add a function later, much more difficult to remove it or change the 
signature.



Also if we want some level of completeness of API and taking into account that 
concat() function shift positions on second argument I thought that it can be 
useful to also add function that can shift all positions of specific value. 
This helps to undo concatenation: delete one of concatenating tsvectors and 
then shift positions in resulting tsvector. So I also wrote one another small 
function:

shift(tsin tsvector,offset int16) — Shift all positions in tsin by given offset


That seems rather too low-level. Shouldn't it be really built into 
delete() directly somehow?






4) I find it rather annoying that there are pretty much no comments in
   the code. Granted, there are pretty much no comments in the
   surrounding code, but I doubt that's a good reason for not having
   any comments in new code. It makes reviews unnecessarily difficult.



Fixed, I think.


Yep, much better now.





5) tsvector_concat() is not mentioned in docs at all



Concat mentioned in docs as an operator ||.


Ah, OK.





6) Docs don't mention names of the new parameters in function
   signatures, just data types. The functions with a single parameter
   probably don't need to do that, but multi-parameter ones should.



Fixed.


OK, but please let's use variable names clearly identifying the meaning. 
So not 'w' but 'weight' and so on.





7) Some of the functions use intexterm that does not match the function
   name. I see two such cases - to_tsvector and setweight. Is there a
   reason for that?



Because sgml compiler wants unique indexterm. Both functions that
youmentioned use overloading of arguments and have non-unique name.


As Michael pointed out, that should probably be handled by using 
 and  tags.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2016-01-18 Thread Alvaro Herrera
So, Tomas, Teodor, did you like this new version of the patch?

Stas Kelvich wrote:

> > 7) Some of the functions use intexterm that does not match the function
> >   name. I see two such cases - to_tsvector and setweight. Is there a
> >   reason for that?
> 
> Because sgml compiler wants unique indexterm. Both functions that you
> mentioned use overloading of arguments and have non-unique name.

This sounds wrong.  I think what you should really do is use


  foo
  bar


to distinguish the two entries.

It's a bit funny that you reintroduce the "unrecognized weight: %d"
(instead of %c) in tsvector_setweight_by_filter.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2015-12-30 Thread Stas Kelvich
Hi, Tomáš! Thanks for comprehensive review.

> On 15 Dec 2015, at 06:07, Tomas Vondra  wrote:
> 
> 1) It's a bit difficult to judge the usefulness of the API, as I've
>   always been a mere user of full-text search, and I never had a need
>   (or courage) to mess with the tsvectors. OTOH I don't see a good
>   reason no to have such API, when there's a need for it.
> 
>   The API seems to be reasonably complete, with one exception - when
>   looking at editing function we have for 'hstore', we do have these
>   variants for delete()
> 
>  delete(hstore,text)
>  delete(hstore,text[])
>  delete(hstore,hstore)
> 
>   while this patch only adds delete(tsvector,text). Would it make
>   sense to add variants similar to hstore? It probably does not make
>   much sense to add delete(tsvector,tsvector), right? But being able
>   to delete a bunch of lexemes in one go seems like a good thing.
> 
>   What do you think?

That’s a good idea and actually deleting tsvector from tsvector makes perfect 
sense. In delete function I used exact string match between string and lexemes 
in tsvector, but if somebody wants to delete for example “Cats” from tsvector, 
then he should downcase and singularize this word. Easiest way to do it is to 
just use to_tsvector() function. Also we can use this function to delete 
specific positions: like delete('cat:3 fat:2,4'::tsvector, 'fat:2'::tsvector) 
-> 'cat:3 fat:4'::tsvector.

So in attached patch I’ve implemented following:

delete(tsin tsvector, lexarrtext[]) — remove any occurence of lexemes inlexarr 
from tsin

delete(tsin tsvector, tsv_filter tsvector) — Delete lexemes and/or positions of 
tsv_filter from tsin. When lexeme in tsv_filter has no positions function will 
delete any occurrence of same lexeme in tsin. When tsv_filter lexeme have 
positions function will delete them from positions of matching lexeme in tsin. 
If after such removal resulting positions set is empty then function will 
delete that lexeme from resulting tsvector.

Also if we want some level of completeness of API and taking into account that 
concat() function shift positions on second argument I thought that it can be 
useful to also add function that can shift all positions of specific value. 
This helps to undo concatenation: delete one of concatenating tsvectors and 
then shift positions in resulting tsvector. So I also wrote one another small 
function:

shift(tsin tsvector,offset int16) — Shift all positions in tsin by given offset

> 
> 
> 2) tsvector_op.c needs a bit of love, to eliminate the two warnings it
>   currently triggers:
> 
>tsvector_op.c:211:2: warning: ISO C90 forbids mixed ...
>tsvector_op.c:635:9: warning: variable ‘lexeme2copy’ set but …
> 

fixed

> 3) the patch also touches tsvector_setweight(), only to do change:
> 
>  elog(ERROR, "unrecognized weight: %d", cw);
> 
>   to
> 
>  elog(ERROR, "unrecognized weight: %c", cw);
> 
>   That should probably go get committed separately, as a bugfix.
> 

Okay, i’ll submit that as a separate patch.

> 
> 4) I find it rather annoying that there are pretty much no comments in
>   the code. Granted, there are pretty much no comments in the
>   surrounding code, but I doubt that's a good reason for not having
>   any comments in new code. It makes reviews unnecessarily difficult.
> 

Fixed, I think.

> 
> 5) tsvector_concat() is not mentioned in docs at all
> 

Concat mentioned in docs as an operator ||.

> 
> 6) Docs don't mention names of the new parameters in function
>   signatures, just data types. The functions with a single parameter
>   probably don't need to do that, but multi-parameter ones should.
> 

Fixed.

> 7) Some of the functions use intexterm that does not match the function
>   name. I see two such cases - to_tsvector and setweight. Is there a
>   reason for that?
> 

Because sgml compiler wants unique indexterm. Both functions that you mentioned 
use overloading of arguments and have non-unique name.




tsvector_ops.diff
Description: Binary data


---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2015-12-23 Thread Michael Paquier
On Tue, Dec 15, 2015 at 12:07 PM, Tomas Vondra
 wrote:
> Hi,
>
> On 12/07/2015 03:05 PM, Stas Kelvich wrote:
>>
>> Hello.
>>
>> Done with the list of suggestions. Also heavily edit delete function.
>>
>
> I did a quick review of the updated patch. I'm not a tsvector-expert so
> hopefully my comments won't be entirely bogus.
>
> 1) It's a bit difficult to judge the usefulness of the API, as I've
>always been a mere user of full-text search, and I never had a need
>(or courage) to mess with the tsvectors. OTOH I don't see a good
>reason no to have such API, when there's a need for it.
>
>The API seems to be reasonably complete, with one exception - when
>looking at editing function we have for 'hstore', we do have these
>variants for delete()
>
>   delete(hstore,text)
>   delete(hstore,text[])
>   delete(hstore,hstore)
>
>while this patch only adds delete(tsvector,text). Would it make
>sense to add variants similar to hstore? It probably does not make
>much sense to add delete(tsvector,tsvector), right? But being able
>to delete a bunch of lexemes in one go seems like a good thing.
>
>What do you think?
>
>
> 2) tsvector_op.c needs a bit of love, to eliminate the two warnings it
>currently triggers:
>
> tsvector_op.c:211:2: warning: ISO C90 forbids mixed ...
> tsvector_op.c:635:9: warning: variable ‘lexeme2copy’ set but ...
>
> 3) the patch also touches tsvector_setweight(), only to do change:
>
>   elog(ERROR, "unrecognized weight: %d", cw);
>
>to
>
>   elog(ERROR, "unrecognized weight: %c", cw);
>
>That should probably go get committed separately, as a bugfix.
>
>
> 4) I find it rather annoying that there are pretty much no comments in
>the code. Granted, there are pretty much no comments in the
>surrounding code, but I doubt that's a good reason for not having
>any comments in new code. It makes reviews unnecessarily difficult.
>
>
> 5) tsvector_concat() is not mentioned in docs at all
>
>
> 6) Docs don't mention names of the new parameters in function
>signatures, just data types. The functions with a single parameter
>probably don't need to do that, but multi-parameter ones should.
>
> 7) Some of the functions use intexterm that does not match the function
>name. I see two such cases - to_tsvector and setweight. Is there a
>reason for that?

I have marked this patch as returned with feedback based on the
presence of a review and a lack of replies from the author. Stas, if
you are still working on the patch, please feel free to move it to the
next commit fest.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2015-12-14 Thread Tomas Vondra

Hi,

On 12/07/2015 03:05 PM, Stas Kelvich wrote:

Hello.

Done with the list of suggestions. Also heavily edit delete function.



I did a quick review of the updated patch. I'm not a tsvector-expert so 
hopefully my comments won't be entirely bogus.


1) It's a bit difficult to judge the usefulness of the API, as I've
   always been a mere user of full-text search, and I never had a need
   (or courage) to mess with the tsvectors. OTOH I don't see a good
   reason no to have such API, when there's a need for it.

   The API seems to be reasonably complete, with one exception - when
   looking at editing function we have for 'hstore', we do have these
   variants for delete()

  delete(hstore,text)
  delete(hstore,text[])
  delete(hstore,hstore)

   while this patch only adds delete(tsvector,text). Would it make
   sense to add variants similar to hstore? It probably does not make
   much sense to add delete(tsvector,tsvector), right? But being able
   to delete a bunch of lexemes in one go seems like a good thing.

   What do you think?


2) tsvector_op.c needs a bit of love, to eliminate the two warnings it
   currently triggers:

tsvector_op.c:211:2: warning: ISO C90 forbids mixed ...
tsvector_op.c:635:9: warning: variable ‘lexeme2copy’ set but ...

3) the patch also touches tsvector_setweight(), only to do change:

  elog(ERROR, "unrecognized weight: %d", cw);

   to

  elog(ERROR, "unrecognized weight: %c", cw);

   That should probably go get committed separately, as a bugfix.


4) I find it rather annoying that there are pretty much no comments in
   the code. Granted, there are pretty much no comments in the
   surrounding code, but I doubt that's a good reason for not having
   any comments in new code. It makes reviews unnecessarily difficult.


5) tsvector_concat() is not mentioned in docs at all


6) Docs don't mention names of the new parameters in function
   signatures, just data types. The functions with a single parameter
   probably don't need to do that, but multi-parameter ones should.

7) Some of the functions use intexterm that does not match the function
   name. I see two such cases - to_tsvector and setweight. Is there a
   reason for that?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2015-12-07 Thread Stas Kelvich
Hello.

Done with the list of suggestions. Also heavily edit delete function.



tsvector_ops.diff
Description: Binary data

---
Stas Kelvich
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

> On 27 Nov 2015, at 15:13, Teodor Sigaev  wrote:
> 
> Hmm, seems, it will be useful to add two fuctions:
> 
> tsvector filter(tsvector, array_of_weigths) - returns tsvector contains 
> lexemes with given weights
> 
> tsvector setweight(tsvector, weigth, array_of_lexemes) - sets given weight 
> for given lexemes
> 
> Stas Kelvich wrote:
>> Hello.
>> 
>> There is patch that adds some editing routines for tsvector type.
>> 
>> tsvector delete(tsvector, text)
>>  removes entry from tsvector by lexeme name
>> set unnest(tsvector)
>>  expands a tsvector to a set of rows. Each row has following columns: 
>> lexeme, postings, weights.
>> text[] to_array(tsvector)
>>  converts tsvector to array of lexemes
>> tsvector to_tsvector(text[])
>>  converts array of lexemes to tsvector
>> 
>> 
>> 
>> 
>> 
>> 
>> Stas Kelvich
>> Postgres Professional: http://www.postgrespro.com
>> The Russian Postgres Company
>> 
>> 
>> 
>> 
>> 
> 
> -- 
> Teodor Sigaev   E-mail: teo...@sigaev.ru
>   WWW: http://www.sigaev.ru/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2015-11-27 Thread Teodor Sigaev


1 Please, make patch compilable with current master.
cd ../../../src/include/catalog && '/usr/local/bin/perl' ./duplicate_oids
3315
3316

2 lexin = TextDatumGetCString(PG_GETARG_DATUM(1))
  lexin_len = strlen(lexin)

Why do you use C-string instead of just text? Suppose, much better:
   t = PG_GETARG_TEXT_P(1)
   lexin = VARDATA(t)
   lexin_len = VARSIZE_ANY_EXHDR(t)

3 Why do you use linear search in tsvector instead of binary search? It could 
produce a performance impact


4 Again, using BuildTupleFromCStrings() call is not very optimal

5 printing weights as numbers is not consistent with other usage of weigth's in 
FTS. Lexem's weight are mentioned as one of A,B,C,D and default weight is a D.


Teodor Sigaev wrote:

There is patch that adds some editing routines for tsvector type.

...

When submitting a patch, it's a good idea to explain why someone would
want the feature you are adding.  Maybe that's obvious to you, but it
isn't clear to me why we'd want this.



Some examples:
tsvector delete(tsvector, text)
 remove wronlgy indexed word (may, be a stop word)
text[] to_array(tsvector)
 In my practice, I needed it to work with smlar module.
tsvector to_tsvector(text[])
 Converts list of tags to tsvector, because search in tsvector is more
 flexible and fast than array's equivalents
set unnest(tsvector)
  Count some complicated statistics.

That functions mostly needed in utility processing rather in workflow.




--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2015-11-27 Thread Teodor Sigaev

Hmm, seems, it will be useful to add two fuctions:

tsvector filter(tsvector, array_of_weigths) - returns tsvector contains lexemes 
with given weights


tsvector setweight(tsvector, weigth, array_of_lexemes) - sets given weight for 
given lexemes


Stas Kelvich wrote:

Hello.

There is patch that adds some editing routines for tsvector type.

tsvector delete(tsvector, text)
removes entry from tsvector by lexeme name
set unnest(tsvector)
expands a tsvector to a set of rows. Each row has following columns: 
lexeme, postings, weights.
text[] to_array(tsvector)
converts tsvector to array of lexemes
tsvector to_tsvector(text[])
converts array of lexemes to tsvector






Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company







--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2015-10-08 Thread Teodor Sigaev

There is patch that adds some editing routines for tsvector type.

...

When submitting a patch, it's a good idea to explain why someone would
want the feature you are adding.  Maybe that's obvious to you, but it
isn't clear to me why we'd want this.



Some examples:
tsvector delete(tsvector, text)
remove wronlgy indexed word (may, be a stop word)
text[] to_array(tsvector)
In my practice, I needed it to work with smlar module.
tsvector to_tsvector(text[])
Converts list of tags to tsvector, because search in tsvector is more
flexible and fast than array's equivalents
set unnest(tsvector)
Count some complicated statistics.

That functions mostly needed in utility processing rather in workflow.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tsvector editing functions

2015-10-06 Thread Robert Haas
On Mon, Oct 5, 2015 at 1:29 PM, Stas Kelvich  wrote:
> Hello.
>
> There is patch that adds some editing routines for tsvector type.
>
> tsvector delete(tsvector, text)
> removes entry from tsvector by lexeme name
> set unnest(tsvector)
> expands a tsvector to a set of rows. Each row has following columns: 
> lexeme, postings, weights.
> text[] to_array(tsvector)
> converts tsvector to array of lexemes
> tsvector to_tsvector(text[])
> converts array of lexemes to tsvector

When submitting a patch, it's a good idea to explain why someone would
want the feature you are adding.  Maybe that's obvious to you, but it
isn't clear to me why we'd want this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Tsvector editing functions

2015-10-05 Thread Stas Kelvich
Hello.

There is patch that adds some editing routines for tsvector type.

tsvector delete(tsvector, text)
removes entry from tsvector by lexeme name
set unnest(tsvector)
expands a tsvector to a set of rows. Each row has following columns: 
lexeme, postings, weights.
text[] to_array(tsvector)
converts tsvector to array of lexemes
tsvector to_tsvector(text[])
converts array of lexemes to tsvector




tsvector_funcs.diff
Description: Binary data


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers