Re: [HACKERS] Tsvector editing functions
I saw errors on windows, here is the fix: Thank you, pushed -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
> > On 11 Mar 2016, at 16:13, Stas Kelvichwrote: > > >> On 10 Mar 2016, at 20:29, Teodor Sigaev wrote: >> >> I would like to suggest rename both functions to array_to_tsvector and >> tsvector_to_array to have consistent name. Later we could add >> to_tsvector([regconfig, ], text[]) with morphological processing. >> >> Thoughts? >> Hi, thanks for commit. I saw errors on windows, here is the fix: tsvector.fix.patch Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
> On 10 Mar 2016, at 20:29, Teodor Sigaevwrote: > > I would like to suggest rename both functions to array_to_tsvector and > tsvector_to_array to have consistent name. Later we could add > to_tsvector([regconfig, ], text[]) with morphological processing. > > Thoughts? > Seems reasonable, done. tsvector_ops-v6.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
Thanks! Fixed and added tests. Thank you! I did some patch cleanup/fix, but I have some doubt with function's names: 1 to_tsvector: # \df to_tsvector List of functions Schema |Name | Result data type | Argument data types | Type +-+--+-+ pg_catalog | to_tsvector | tsvector | regconfig, text | normal pg_catalog | to_tsvector | tsvector | text| normal pg_catalog | to_tsvector | tsvector | text[] | normal First two variants of to_tsvector make a morphological processing, last one doesn't. 2 to_array # \df *to_array List of functions Schema | Name | Result data type | Argument data types | Type +---+--+-+ pg_catalog | regexp_split_to_array | text[] | text, text | normal pg_catalog | regexp_split_to_array | text[] | text, text, text| normal pg_catalog | string_to_array | text[] | text, text | normal pg_catalog | string_to_array | text[] | text, text, text| normal pg_catalog | to_array | text[] | tsvector| normal Seems, to_array is not a right name compared to other *to_array. I would like to suggest rename both functions to array_to_tsvector and tsvector_to_array to have consistent name. Later we could add to_tsvector([regconfig, ], text[]) with morphological processing. Thoughts? -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 4b5ee81..ed0b6be 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -9211,16 +9211,29 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple setweight - setweight(tsvector, "char") + setweight(vector tsvector, weight "char") tsvector -assign weight to each element of tsvector +assign weight to each element of vector setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A') 'cat':3A 'fat':2A,4A 'rat':5A + setweight + setweight by filter + + setweight(vector tsvector, weight "char", lexemes "text"[]) + +tsvector +assign weight to elements of vector that are listed in lexemes array +setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A', '{cat,rat}') +'cat':3A 'fat':2,4 'rat':5A + + + + strip strip(tsvector) @@ -9233,6 +9246,81 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple + delete + delete lemexeme + + delete(vector tsvector, lexeme text) + +tsvector +remove given lexeme from vector +delete('fat:2,4 cat:3 rat:5A'::tsvector, 'fat') +'cat':3 'rat':5A + + + + + delete + delete lemexemes array + + delete(vector tsvector, lexemes text[]) + +tsvector +remove any occurrence of lexemes in lexemes array from vector +delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat']) +'cat':3 + + + + + unnest + + unnest(tsvector, OUT lexeme text, OUT positions smallint[], OUT weights text) + +setof record +expand a tsvector to a set of rows +unnest('fat:2,4 cat:3 rat:5A'::tsvector) +(cat,{3},{D}) ... + + + + + to_array + + to_array(tsvector) + +text[] +convert tsvector to array of lexemes +to_array('fat:2,4 cat:3 rat:5A'::tsvector) +{cat,fat,rat} + + + + + to_tsvector + array to tsvector + + to_tsvector(text[]) + +tsvector +convert array of lexemes to tsvector +to_tsvector('{fat,cat,rat}'::text[]) +'fat' 'cat' 'rat' + + + + + filter + + filter(vector tsvector, weights "char"[]) + +tsvector +Select only elements with given weights from vector +filter('fat:2,4 cat:3b rat:5A'::tsvector, '{a,b}') +'cat':3B 'rat':5A + + + + to_tsquery to_tsquery( config regconfig , query text) diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml index ff99976..ea3abc9 100644 ---
Re: [HACKERS] Tsvector editing functions
> > On 02 Feb 2016, at 20:10, Teodor Sigaevwrote: > > Some notices: > > 1 tsin in documentation doesn't look like a good name. Changed to vector > similar to other places. > > 2 I did some editorization about freeing memory/forgotten names etc Thanks. > > 3 It seems to me that tsvector_unnest() could be seriously optimized for > large tsvectors: with current coding it detoasts/decompresses tsvector value > on each call. Much better to do it once in > multi_call_memory_ctx context at first call init Done, moved detoasting to first SRF call. > 4 It seems debatable returning empty array for position/weight if they are > absent: > =# select * from unnest('a:1 b'::tsvector); > lexeme | positions | weights > +---+- > a | {1} | {D} > b | {}| {} > I think, it's better to return NULL in this case > Okay, done. > 5 > array_to_tsvector/tsvector_setweight_by_filter/tsvector_delete_arr/tsvector_filter > doesn't check or pay attention to NULL elements in input arrays > Thanks! Fixed and added tests. > -- > Teodor Sigaev E-mail: teo...@sigaev.ru > WWW: http://www.sigaev.ru/ > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers tsvector_ops-v4.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
Some notices: 1 tsin in documentation doesn't look like a good name. Changed to vector similar to other places. 2 I did some editorization about freeing memory/forgotten names etc 3 It seems to me that tsvector_unnest() could be seriously optimized for large tsvectors: with current coding it detoasts/decompresses tsvector value on each call. Much better to do it once in multi_call_memory_ctx context at first call init 4 It seems debatable returning empty array for position/weight if they are absent: =# select * from unnest('a:1 b'::tsvector); lexeme | positions | weights +---+- a | {1} | {D} b | {}| {} I think, it's better to return NULL in this case 5 array_to_tsvector/tsvector_setweight_by_filter/tsvector_delete_arr/tsvector_filter doesn't check or pay attention to NULL elements in input arrays -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
Some notices: 1 tsin in documentation doesn't look like a good name. Changed to vector similar to other places. 2 I did some editorization about freeing memory/forgotten names etc Ooops, forgot to attach -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 139aa2b..9c294e3 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -9168,16 +9168,29 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple setweight - setweight(tsvector, "char") + setweight(vector tsvector, weight "char") tsvector -assign weight to each element of tsvector +assign weight to each element of vector setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A') 'cat':3A 'fat':2A,4A 'rat':5A + setweight + setweight by filter + + setweight(vector tsvector, weight "char", lexemes "text"[]) + +tsvector +assign weight to elements of vector that are listed in lexemes array +setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A', '{cat,rat}') +'cat':3A 'fat':2,4 'rat':5A + + + + strip strip(tsvector) @@ -9190,6 +9203,84 @@ CREATE TYPE rainbow AS ENUM ('red', 'orange', 'yellow', 'green', 'blue', 'purple + delete + delete lemexeme + + delete(vector tsvector, lexeme text) + +tsvector +remove given lexeme from vector +delete('fat:2,4 cat:3 rat:5A'::tsvector, 'fat') +'cat':3 'rat':5A + + + + + delete + delete lemexemes array + + delete(vector tsvector, lexemes text[]) + +tsvector +remove any occurrence of lexemes in lexemes array from vector +delete('fat:2,4 cat:3 rat:5A'::tsvector, ARRAY['fat','rat']) +'cat':3 + + + + + unnest + + unnest(tsvector) + +setof anyelement +expand a tsvector to a set of rows. Each row has following columns: lexeme, postings, weights. +unnest('fat:2,4 cat:3 rat:5A'::tsvector) +cat{3}{A} +fat{2,4} {D,D} +rat{5}{A} +(3 rows) + + + + + to_array + + to_array(tsvector) + +text[] +convert tsvector to array of lexemes +to_array('fat:2,4 cat:3 rat:5A'::tsvector) +{cat,fat,rat} + + + + + to_tsvector + array to tsvector + + to_tsvector(text[]) + +tsvector +convert array of lexemes to tsvector +to_tsvector('{fat,cat,rat}'::text[]) +'fat' 'cat' 'rat' + + + + + filter + + filter(vector tsvector, weights "char"[]) + +tsvector +Select only elements with given weights from vector +filter('fat:2,4 cat:3b rat:5A'::tsvector, '{a,b}') +'cat':3B 'rat':5A + + + + to_tsquery to_tsquery( config regconfig , query text) diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml index d66b4d5..32033fa 100644 --- a/doc/src/sgml/textsearch.sgml +++ b/doc/src/sgml/textsearch.sgml @@ -1326,6 +1326,10 @@ FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank + +Full list of tsvector-related functions available in . + + diff --git a/src/backend/utils/adt/tsvector_op.c b/src/backend/utils/adt/tsvector_op.c index a3f1c361..e7ea270 100644 --- a/src/backend/utils/adt/tsvector_op.c +++ b/src/backend/utils/adt/tsvector_op.c @@ -14,6 +14,7 @@ #include "postgres.h" +#include "access/htup_details.h" #include "catalog/namespace.h" #include "catalog/pg_type.h" #include "commands/trigger.h" @@ -65,6 +66,7 @@ typedef struct #define STATHDRSIZE (offsetof(TSVectorStat, data)) static Datum tsvector_update_trigger(PG_FUNCTION_ARGS, bool config_column); +static int tsvector_bsearch(TSVector tsin, char *lexin, int lexin_len); /* * Order: haspos, len, word, for all positions (pos, weight) @@ -251,6 +253,81 @@ tsvector_setweight(PG_FUNCTION_ARGS) PG_RETURN_POINTER(out); } +/* + * setweight(tsin tsvector, char_weight "char", lexemes "text"[]) + * + * Assign weight w to elements of tsin that are listed in lexemes. + */ +Datum +tsvector_setweight_by_filter(PG_FUNCTION_ARGS) +{ + TSVector tsin = PG_GETARG_TSVECTOR(0); + char char_weight = PG_GETARG_CHAR(1); + ArrayType *lexemes = PG_GETARG_ARRAYTYPE_P(2); + + TSVector tsout; + int i, +j, +nlexemes, +
Re: [HACKERS] Tsvector editing functions
Hi > On 22 Jan 2016, at 19:03, Tomas Vondrawrote: > OK, although I do recommend using more sensible variable names, i.e. why how > to use 'lexemes' instead of 'lexarr' for example? Similarly for the other > functions. Changed. With old names I tried to follow conventions in surrounding code, but probably that is a good idea to switch to more meaningful names in new code. >> >> >> delete(tsin tsvector, tsv_filter tsvector) — Delete lexemes and/or positions >> of tsv_filter from tsin. When lexeme in tsv_filter has no positions function >> will delete any occurrence of same lexeme in tsin. When tsv_filter lexeme >> have positions function will delete them from positions of matching lexeme >> in tsin. If after such removal resulting positions set is empty then >> function will delete that lexeme from resulting tsvector. >> > > I can't really imagine situation in which I'd need this, but if you do have a > use case for it ... although in the initial paragraph you say "... but if > somebody wants to delete for example ..." which suggests you may not have > such use case. > > Based on bad experience with extending API based on vague ideas, I recommend > only really adding functions with existing need. It's easy to add a function > later, much more difficult to remove it or change the signature. I tried to create more or less self-contained api, e.g. have ability to negate effect of concatenation. But i’ve also asked people around what they think about extending API and everybody convinced that it is better to stick to smaller API. So let’s drop it. At least that functions exists in mail list in case if somebody will google for such kind of behaviour. >> >> Also if we want some level of completeness of API and taking into account >> that concat() function shift positions on second argument I thought that it >> can be useful to also add function that can shift all positions of specific >> value. This helps to undo concatenation: delete one of concatenating >> tsvectors and then shift positions in resulting tsvector. So I also wrote >> one another small function: >> >> shift(tsin tsvector,offset int16) — Shift all positions in tsin by given >> offset > > That seems rather too low-level. Shouldn't it be really built into delete() > directly somehow? I think it is ambiguous task on delete. But if we are dropping support of delete(tsvector, tsvector) I don’t see points in keeping that functions. >>> >>> 7) Some of the functions use intexterm that does not match the function >>> name. I see two such cases - to_tsvector and setweight. Is there a >>> reason for that? >>> >> >> Because sgml compiler wants unique indexterm. Both functions that >> youmentioned use overloading of arguments and have non-unique name. > > As Michael pointed out, that should probably be handled by using > and tags. Done. > On 19 Jan 2016, at 00:21, Alvaro Herrera wrote: > > > It's a bit funny that you reintroduce the "unrecognized weight: %d" > (instead of %c) in tsvector_setweight_by_filter. > Ah, I was thinking about moving it to separate diff and messed. Fixed and attaching diff with same fix for old tsvector_setweight. tsvector_ops-v2.1.diff Description: Binary data tsvector_ops-v2.2.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
On 12/30/2015 06:49 PM, Stas Kelvich wrote: Hi, Tomáš! Thanks for comprehensive review. On 15 Dec 2015, at 06:07, Tomas Vondrawrote: 1) It's a bit difficult to judge the usefulness of the API, as I've always been a mere user of full-text search, and I never had a need (or courage) to mess with the tsvectors. OTOH I don't see a good reason no to have such API, when there's a need for it. The API seems to be reasonably complete, with one exception - when looking at editing function we have for 'hstore', we do have these variants for delete() delete(hstore,text) delete(hstore,text[]) delete(hstore,hstore) while this patch only adds delete(tsvector,text). Would it make sense to add variants similar to hstore? It probably does not make much sense to add delete(tsvector,tsvector), right? But being able to delete a bunch of lexemes in one go seems like a good thing. What do you think? That’s a good idea and actually deleting tsvector from tsvector makes perfect sense. In delete function I used exact string match between string and lexemes in tsvector, but if somebody wants to delete for example “Cats” from tsvector, then he should downcase and singularize this word. Easiest way to do it is to just use to_tsvector() function. Also we can use this function to delete specific positions: like delete('cat:3 fat:2,4'::tsvector, 'fat:2'::tsvector) -> 'cat:3 fat:4'::tsvector. So in attached patch I’ve implemented following: delete(tsin tsvector, lexarrtext[]) — remove any occurence of lexemes inlexarr from tsin OK, although I do recommend using more sensible variable names, i.e. why how to use 'lexemes' instead of 'lexarr' for example? Similarly for the other functions. delete(tsin tsvector, tsv_filter tsvector) — Delete lexemes and/or positions of tsv_filter from tsin. When lexeme in tsv_filter has no positions function will delete any occurrence of same lexeme in tsin. When tsv_filter lexeme have positions function will delete them from positions of matching lexeme in tsin. If after such removal resulting positions set is empty then function will delete that lexeme from resulting tsvector. I can't really imagine situation in which I'd need this, but if you do have a use case for it ... although in the initial paragraph you say "... but if somebody wants to delete for example ..." which suggests you may not have such use case. Based on bad experience with extending API based on vague ideas, I recommend only really adding functions with existing need. It's easy to add a function later, much more difficult to remove it or change the signature. Also if we want some level of completeness of API and taking into account that concat() function shift positions on second argument I thought that it can be useful to also add function that can shift all positions of specific value. This helps to undo concatenation: delete one of concatenating tsvectors and then shift positions in resulting tsvector. So I also wrote one another small function: shift(tsin tsvector,offset int16) — Shift all positions in tsin by given offset That seems rather too low-level. Shouldn't it be really built into delete() directly somehow? 4) I find it rather annoying that there are pretty much no comments in the code. Granted, there are pretty much no comments in the surrounding code, but I doubt that's a good reason for not having any comments in new code. It makes reviews unnecessarily difficult. Fixed, I think. Yep, much better now. 5) tsvector_concat() is not mentioned in docs at all Concat mentioned in docs as an operator ||. Ah, OK. 6) Docs don't mention names of the new parameters in function signatures, just data types. The functions with a single parameter probably don't need to do that, but multi-parameter ones should. Fixed. OK, but please let's use variable names clearly identifying the meaning. So not 'w' but 'weight' and so on. 7) Some of the functions use intexterm that does not match the function name. I see two such cases - to_tsvector and setweight. Is there a reason for that? Because sgml compiler wants unique indexterm. Both functions that youmentioned use overloading of arguments and have non-unique name. As Michael pointed out, that should probably be handled by using and tags. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
So, Tomas, Teodor, did you like this new version of the patch? Stas Kelvich wrote: > > 7) Some of the functions use intexterm that does not match the function > > name. I see two such cases - to_tsvector and setweight. Is there a > > reason for that? > > Because sgml compiler wants unique indexterm. Both functions that you > mentioned use overloading of arguments and have non-unique name. This sounds wrong. I think what you should really do is use foo bar to distinguish the two entries. It's a bit funny that you reintroduce the "unrecognized weight: %d" (instead of %c) in tsvector_setweight_by_filter. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
Hi, Tomáš! Thanks for comprehensive review. > On 15 Dec 2015, at 06:07, Tomas Vondrawrote: > > 1) It's a bit difficult to judge the usefulness of the API, as I've > always been a mere user of full-text search, and I never had a need > (or courage) to mess with the tsvectors. OTOH I don't see a good > reason no to have such API, when there's a need for it. > > The API seems to be reasonably complete, with one exception - when > looking at editing function we have for 'hstore', we do have these > variants for delete() > > delete(hstore,text) > delete(hstore,text[]) > delete(hstore,hstore) > > while this patch only adds delete(tsvector,text). Would it make > sense to add variants similar to hstore? It probably does not make > much sense to add delete(tsvector,tsvector), right? But being able > to delete a bunch of lexemes in one go seems like a good thing. > > What do you think? That’s a good idea and actually deleting tsvector from tsvector makes perfect sense. In delete function I used exact string match between string and lexemes in tsvector, but if somebody wants to delete for example “Cats” from tsvector, then he should downcase and singularize this word. Easiest way to do it is to just use to_tsvector() function. Also we can use this function to delete specific positions: like delete('cat:3 fat:2,4'::tsvector, 'fat:2'::tsvector) -> 'cat:3 fat:4'::tsvector. So in attached patch I’ve implemented following: delete(tsin tsvector, lexarrtext[]) — remove any occurence of lexemes inlexarr from tsin delete(tsin tsvector, tsv_filter tsvector) — Delete lexemes and/or positions of tsv_filter from tsin. When lexeme in tsv_filter has no positions function will delete any occurrence of same lexeme in tsin. When tsv_filter lexeme have positions function will delete them from positions of matching lexeme in tsin. If after such removal resulting positions set is empty then function will delete that lexeme from resulting tsvector. Also if we want some level of completeness of API and taking into account that concat() function shift positions on second argument I thought that it can be useful to also add function that can shift all positions of specific value. This helps to undo concatenation: delete one of concatenating tsvectors and then shift positions in resulting tsvector. So I also wrote one another small function: shift(tsin tsvector,offset int16) — Shift all positions in tsin by given offset > > > 2) tsvector_op.c needs a bit of love, to eliminate the two warnings it > currently triggers: > >tsvector_op.c:211:2: warning: ISO C90 forbids mixed ... >tsvector_op.c:635:9: warning: variable ‘lexeme2copy’ set but … > fixed > 3) the patch also touches tsvector_setweight(), only to do change: > > elog(ERROR, "unrecognized weight: %d", cw); > > to > > elog(ERROR, "unrecognized weight: %c", cw); > > That should probably go get committed separately, as a bugfix. > Okay, i’ll submit that as a separate patch. > > 4) I find it rather annoying that there are pretty much no comments in > the code. Granted, there are pretty much no comments in the > surrounding code, but I doubt that's a good reason for not having > any comments in new code. It makes reviews unnecessarily difficult. > Fixed, I think. > > 5) tsvector_concat() is not mentioned in docs at all > Concat mentioned in docs as an operator ||. > > 6) Docs don't mention names of the new parameters in function > signatures, just data types. The functions with a single parameter > probably don't need to do that, but multi-parameter ones should. > Fixed. > 7) Some of the functions use intexterm that does not match the function > name. I see two such cases - to_tsvector and setweight. Is there a > reason for that? > Because sgml compiler wants unique indexterm. Both functions that you mentioned use overloading of arguments and have non-unique name. tsvector_ops.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
On Tue, Dec 15, 2015 at 12:07 PM, Tomas Vondrawrote: > Hi, > > On 12/07/2015 03:05 PM, Stas Kelvich wrote: >> >> Hello. >> >> Done with the list of suggestions. Also heavily edit delete function. >> > > I did a quick review of the updated patch. I'm not a tsvector-expert so > hopefully my comments won't be entirely bogus. > > 1) It's a bit difficult to judge the usefulness of the API, as I've >always been a mere user of full-text search, and I never had a need >(or courage) to mess with the tsvectors. OTOH I don't see a good >reason no to have such API, when there's a need for it. > >The API seems to be reasonably complete, with one exception - when >looking at editing function we have for 'hstore', we do have these >variants for delete() > > delete(hstore,text) > delete(hstore,text[]) > delete(hstore,hstore) > >while this patch only adds delete(tsvector,text). Would it make >sense to add variants similar to hstore? It probably does not make >much sense to add delete(tsvector,tsvector), right? But being able >to delete a bunch of lexemes in one go seems like a good thing. > >What do you think? > > > 2) tsvector_op.c needs a bit of love, to eliminate the two warnings it >currently triggers: > > tsvector_op.c:211:2: warning: ISO C90 forbids mixed ... > tsvector_op.c:635:9: warning: variable ‘lexeme2copy’ set but ... > > 3) the patch also touches tsvector_setweight(), only to do change: > > elog(ERROR, "unrecognized weight: %d", cw); > >to > > elog(ERROR, "unrecognized weight: %c", cw); > >That should probably go get committed separately, as a bugfix. > > > 4) I find it rather annoying that there are pretty much no comments in >the code. Granted, there are pretty much no comments in the >surrounding code, but I doubt that's a good reason for not having >any comments in new code. It makes reviews unnecessarily difficult. > > > 5) tsvector_concat() is not mentioned in docs at all > > > 6) Docs don't mention names of the new parameters in function >signatures, just data types. The functions with a single parameter >probably don't need to do that, but multi-parameter ones should. > > 7) Some of the functions use intexterm that does not match the function >name. I see two such cases - to_tsvector and setweight. Is there a >reason for that? I have marked this patch as returned with feedback based on the presence of a review and a lack of replies from the author. Stas, if you are still working on the patch, please feel free to move it to the next commit fest. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
Hi, On 12/07/2015 03:05 PM, Stas Kelvich wrote: Hello. Done with the list of suggestions. Also heavily edit delete function. I did a quick review of the updated patch. I'm not a tsvector-expert so hopefully my comments won't be entirely bogus. 1) It's a bit difficult to judge the usefulness of the API, as I've always been a mere user of full-text search, and I never had a need (or courage) to mess with the tsvectors. OTOH I don't see a good reason no to have such API, when there's a need for it. The API seems to be reasonably complete, with one exception - when looking at editing function we have for 'hstore', we do have these variants for delete() delete(hstore,text) delete(hstore,text[]) delete(hstore,hstore) while this patch only adds delete(tsvector,text). Would it make sense to add variants similar to hstore? It probably does not make much sense to add delete(tsvector,tsvector), right? But being able to delete a bunch of lexemes in one go seems like a good thing. What do you think? 2) tsvector_op.c needs a bit of love, to eliminate the two warnings it currently triggers: tsvector_op.c:211:2: warning: ISO C90 forbids mixed ... tsvector_op.c:635:9: warning: variable ‘lexeme2copy’ set but ... 3) the patch also touches tsvector_setweight(), only to do change: elog(ERROR, "unrecognized weight: %d", cw); to elog(ERROR, "unrecognized weight: %c", cw); That should probably go get committed separately, as a bugfix. 4) I find it rather annoying that there are pretty much no comments in the code. Granted, there are pretty much no comments in the surrounding code, but I doubt that's a good reason for not having any comments in new code. It makes reviews unnecessarily difficult. 5) tsvector_concat() is not mentioned in docs at all 6) Docs don't mention names of the new parameters in function signatures, just data types. The functions with a single parameter probably don't need to do that, but multi-parameter ones should. 7) Some of the functions use intexterm that does not match the function name. I see two such cases - to_tsvector and setweight. Is there a reason for that? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
Hello. Done with the list of suggestions. Also heavily edit delete function. tsvector_ops.diff Description: Binary data --- Stas Kelvich Postgres Professional: http://www.postgrespro.com Russian Postgres Company > On 27 Nov 2015, at 15:13, Teodor Sigaevwrote: > > Hmm, seems, it will be useful to add two fuctions: > > tsvector filter(tsvector, array_of_weigths) - returns tsvector contains > lexemes with given weights > > tsvector setweight(tsvector, weigth, array_of_lexemes) - sets given weight > for given lexemes > > Stas Kelvich wrote: >> Hello. >> >> There is patch that adds some editing routines for tsvector type. >> >> tsvector delete(tsvector, text) >> removes entry from tsvector by lexeme name >> set unnest(tsvector) >> expands a tsvector to a set of rows. Each row has following columns: >> lexeme, postings, weights. >> text[] to_array(tsvector) >> converts tsvector to array of lexemes >> tsvector to_tsvector(text[]) >> converts array of lexemes to tsvector >> >> >> >> >> >> >> Stas Kelvich >> Postgres Professional: http://www.postgrespro.com >> The Russian Postgres Company >> >> >> >> >> > > -- > Teodor Sigaev E-mail: teo...@sigaev.ru > WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
1 Please, make patch compilable with current master. cd ../../../src/include/catalog && '/usr/local/bin/perl' ./duplicate_oids 3315 3316 2 lexin = TextDatumGetCString(PG_GETARG_DATUM(1)) lexin_len = strlen(lexin) Why do you use C-string instead of just text? Suppose, much better: t = PG_GETARG_TEXT_P(1) lexin = VARDATA(t) lexin_len = VARSIZE_ANY_EXHDR(t) 3 Why do you use linear search in tsvector instead of binary search? It could produce a performance impact 4 Again, using BuildTupleFromCStrings() call is not very optimal 5 printing weights as numbers is not consistent with other usage of weigth's in FTS. Lexem's weight are mentioned as one of A,B,C,D and default weight is a D. Teodor Sigaev wrote: There is patch that adds some editing routines for tsvector type. ... When submitting a patch, it's a good idea to explain why someone would want the feature you are adding. Maybe that's obvious to you, but it isn't clear to me why we'd want this. Some examples: tsvector delete(tsvector, text) remove wronlgy indexed word (may, be a stop word) text[] to_array(tsvector) In my practice, I needed it to work with smlar module. tsvector to_tsvector(text[]) Converts list of tags to tsvector, because search in tsvector is more flexible and fast than array's equivalents set unnest(tsvector) Count some complicated statistics. That functions mostly needed in utility processing rather in workflow. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
Hmm, seems, it will be useful to add two fuctions: tsvector filter(tsvector, array_of_weigths) - returns tsvector contains lexemes with given weights tsvector setweight(tsvector, weigth, array_of_lexemes) - sets given weight for given lexemes Stas Kelvich wrote: Hello. There is patch that adds some editing routines for tsvector type. tsvector delete(tsvector, text) removes entry from tsvector by lexeme name set unnest(tsvector) expands a tsvector to a set of rows. Each row has following columns: lexeme, postings, weights. text[] to_array(tsvector) converts tsvector to array of lexemes tsvector to_tsvector(text[]) converts array of lexemes to tsvector Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
There is patch that adds some editing routines for tsvector type. ... When submitting a patch, it's a good idea to explain why someone would want the feature you are adding. Maybe that's obvious to you, but it isn't clear to me why we'd want this. Some examples: tsvector delete(tsvector, text) remove wronlgy indexed word (may, be a stop word) text[] to_array(tsvector) In my practice, I needed it to work with smlar module. tsvector to_tsvector(text[]) Converts list of tags to tsvector, because search in tsvector is more flexible and fast than array's equivalents set unnest(tsvector) Count some complicated statistics. That functions mostly needed in utility processing rather in workflow. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tsvector editing functions
On Mon, Oct 5, 2015 at 1:29 PM, Stas Kelvichwrote: > Hello. > > There is patch that adds some editing routines for tsvector type. > > tsvector delete(tsvector, text) > removes entry from tsvector by lexeme name > set unnest(tsvector) > expands a tsvector to a set of rows. Each row has following columns: > lexeme, postings, weights. > text[] to_array(tsvector) > converts tsvector to array of lexemes > tsvector to_tsvector(text[]) > converts array of lexemes to tsvector When submitting a patch, it's a good idea to explain why someone would want the feature you are adding. Maybe that's obvious to you, but it isn't clear to me why we'd want this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Tsvector editing functions
Hello. There is patch that adds some editing routines for tsvector type. tsvector delete(tsvector, text) removes entry from tsvector by lexeme name set unnest(tsvector) expands a tsvector to a set of rows. Each row has following columns: lexeme, postings, weights. text[] to_array(tsvector) converts tsvector to array of lexemes tsvector to_tsvector(text[]) converts array of lexemes to tsvector tsvector_funcs.diff Description: Binary data Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers