Re: [HACKERS] Flexible configuration for full-text search
On Tue, 31 Oct 2017 09:47:57 +0100 Emre Hasegeli wrote:

> > If we want to save this behavior, we should somehow pass a stopword
> > to tsvector composition function (parsetext in ts_parse.c) for
> > counter increment or increment it in another way. Currently, an
> > empty lexemes array is passed as a result of LexizeExec.
> >
> > One of possible way to do so is something like:
> >
> > CASE polish_stopword
> >     WHEN MATCH THEN KEEP -- stopword counting
> >     ELSE polish_isspell
> > END
>
> This would mean keeping the stopwords. What we want is
>
> CASE polish_stopword -- stopword counting
>     WHEN NO MATCH THEN polish_isspell
> END
>
> Do you think it is possible?

Hi Emre,

I thought about how it could be implemented. The way I see it, the word
counter should be incremented whenever any checked dictionary matches
the word, even if it returns no lexeme. The main drawback is that the
counter increment is implicit.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
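For readers following the thread, here is a minimal sketch of the implicit counter increment described in this message. It is written in Python rather than the C of ts_parse.c, and it is not the patch's code. The dictionaries, their word lists, and the name build_positions are hypothetical. The sketch also assumes that a word no dictionary recognizes does not advance the counter, which the thread does not specify.

```python
from typing import Callable, Dict, List, Optional

# A toy "dictionary" returns None when it does not recognize the word,
# an empty list when it recognizes the word but emits nothing (a
# stopword), or a list of lexemes otherwise. All data is hypothetical.
Dictionary = Callable[[str], Optional[List[str]]]


def stopword_dict(word: str) -> Optional[List[str]]:
    return [] if word in {"a", "the", "of"} else None


def toy_stem_dict(word: str) -> Optional[List[str]]:
    stems = {"message": "messag", "test": "test"}
    return [stems[word]] if word in stems else None


def build_positions(words: List[str],
                    dicts: List[Dictionary]) -> Dict[str, List[int]]:
    positions: Dict[str, List[int]] = {}
    counter = 0
    for word in words:
        for lookup in dicts:
            lexemes = lookup(word)
            if lexemes is None:
                continue  # not recognized, try the next dictionary
            counter += 1  # implicit increment: some dictionary matched
            for lexeme in lexemes:
                positions.setdefault(lexeme, []).append(counter)
            break  # the first matching dictionary wins
    return positions


result = build_positions(["a", "test", "message"],
                         [stopword_dict, toy_stem_dict])
print(result)
```

With these toy dictionaries the stopword 'a' consumes position 1 without appearing in the result, which matches the to_tsvector behavior quoted elsewhere in the thread.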
Re: [HACKERS] Flexible configuration for full-text search
On Mon, 6 Nov 2017 18:05:23 +1300 Thomas Munro wrote:

> On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov wrote:
> > In attachment updated patch with fixes of empty XML tags in
> > documentation.
>
> Hi Aleksandr,
>
> I'm not sure if this is expected at this stage, but just in case you
> aren't aware, with this version of the patch the binary upgrade test
> in src/bin/pg_dump/t/002_pg_dump.pl fails for me:
>
> # Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION dump_test.alt_ts_conf1 ...'
> # at t/002_pg_dump.pl line 6715.

Hi Thomas,

Thank you for noticing it. I will investigate it while working on the
next version of the patch.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] Flexible configuration for full-text search
On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov wrote:
> In attachment updated patch with fixes of empty XML tags in
> documentation.

Hi Aleksandr,

I'm not sure if this is expected at this stage, but just in case you
aren't aware, with this version of the patch the binary upgrade test
in src/bin/pg_dump/t/002_pg_dump.pl fails for me:

# Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION dump_test.alt_ts_conf1 ...'
# at t/002_pg_dump.pl line 6715.

-- 
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] Flexible configuration for full-text search
> I'm mostly happy with mentioned modifications, but I have few questions
> to clarify some points. I will send new patch in week or two.

I am glad you liked it. Though, I think we should get approval from
more senior community members or committers about the syntax before we
put more effort into the code.

> But configuration:
>
> CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END
>
> is not (as I understand ELSE can be used only with KEEP).
>
> I think we should decide to allow or disallow usage of different
> dictionaries for match checking (between CASE and WHEN) and a result
> (after THEN). If answer is 'allow', maybe we should allow the
> third example too for consistency in configurations.

I think you are right. We had better allow this too. Then the CASE
syntax becomes:

CASE config
    WHEN [ NO ] MATCH THEN { KEEP | config }
    [ ELSE config ]
END

> Based on formal definition it is possible to describe this example in
> following manner:
> CASE english_noun WHEN MATCH THEN english_hunspell END
>
> The question is same as in the previous example.

I couldn't understand the question.

> Currently, stopwords increment position, for example:
> SELECT to_tsvector('english','a test message');
> ---------------------
>  'messag':3 'test':2
>
> A stopword 'a' has a position 1 but it is not in the vector.

Does this problem apply only to stopwords, or to the whole thing we
are inventing? Shouldn't we preserve the positions through the
pipeline?

> If we want to save this behavior, we should somehow pass a stopword to
> tsvector composition function (parsetext in ts_parse.c) for counter
> increment or increment it in another way. Currently, an empty lexemes
> array is passed as a result of LexizeExec.
>
> One of possible way to do so is something like:
> CASE polish_stopword
>     WHEN MATCH THEN KEEP -- stopword counting
>     ELSE polish_isspell
> END

This would mean keeping the stopwords. What we want is

CASE polish_stopword -- stopword counting
    WHEN NO MATCH THEN polish_isspell
END

Do you think it is possible?
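To make the two stopword variants in this message easier to compare, here is a small Python model of the revised form CASE config WHEN [ NO ] MATCH THEN { KEEP | config } [ ELSE config ] END. It is only a sketch of one possible reading of the thread, not the patch's implementation. It assumes that "match" means the condition returned something other than None, that KEEP reuses the condition's own output, and that a CASE with no applicable branch yields None. The two dictionaries are hypothetical stand-ins.

```python
from typing import Callable, List, Optional

# A config maps a token to a list of lexemes, or None for "no match".
Config = Callable[[str], Optional[List[str]]]

KEEP = object()  # sentinel: reuse the output of the condition config


def case(condition: Config, then, otherwise: Optional[Config] = None,
         no_match: bool = False) -> Config:
    """Model CASE condition WHEN [NO] MATCH THEN then [ELSE otherwise] END."""
    def evaluate(token: str) -> Optional[List[str]]:
        found = condition(token)
        matched = found is not None
        if matched != no_match:  # the WHEN branch applies
            if then is KEEP:
                return found  # under NO MATCH there is nothing to keep
            return then(token)
        return otherwise(token) if otherwise is not None else None
    return evaluate


# Hypothetical stand-ins for the dictionaries named in the thread.
def polish_stopword(token: str) -> Optional[List[str]]:
    return [] if token in {"i", "w", "na"} else None


def polish_isspell(token: str) -> Optional[List[str]]:
    return [token.lower()]


# CASE polish_stopword WHEN NO MATCH THEN polish_isspell END
without_stops = case(polish_stopword, polish_isspell, no_match=True)
# CASE polish_stopword WHEN MATCH THEN KEEP ELSE polish_isspell END
keep_stops = case(polish_stopword, KEEP, otherwise=polish_isspell)

print(without_stops("na"), without_stops("Dom"))
print(keep_stops("na"), keep_stops("Dom"))
```

Under these assumptions the two forms differ only for stopwords: the KEEP form passes on the empty lexeme list, which a caller could still count as a position, while the NO MATCH form produces nothing at all.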
Re: [HACKERS] Flexible configuration for full-text search
I'm mostly happy with the mentioned modifications, but I have a few
questions to clarify some points. I will send a new patch in a week or
two.

On Thu, 26 Oct 2017 20:01:14 +0200 Emre Hasegeli wrote:

> To put it formally:
>
> ALTER TEXT SEARCH CONFIGURATION name
>     ADD MAPPING FOR token_type [, ... ] WITH config
>
> where config is one of:
>
>     dictionary_name
>     config { UNION | INTERSECT | EXCEPT } config
>     CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

According to the formal definition, the following configurations are
valid:

CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END
CASE english_noun WHEN MATCH THEN english_hunspell END

But the configuration:

CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END

is not (as I understand it, ELSE can be used only with KEEP).

I think we should decide whether to allow or disallow using different
dictionaries for match checking (between CASE and WHEN) and for the
result (after THEN). If the answer is 'allow', maybe we should allow
the third example too, for consistency in configurations.

> > 3) Using different dictionaries for recognizing and output
> > generation. As I mentioned before, in new syntax condition and
> > command are separate and we can use it for some more complex text
> > processing. Here an example for processing only nouns:
> >
> > ALTER TEXT SEARCH CONFIGURATION nouns_only
> >     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> >         word, hword, hword_part WITH CASE
> >     WHEN english_noun THEN english_hunspell
> > END
>
> This would also still work with the simpler syntax because
> "english_noun", still being a dictionary, would pass the tokens to the
> next one.

Based on the formal definition, it is possible to describe this example
in the following manner:

CASE english_noun WHEN MATCH THEN english_hunspell END

The question is the same as in the previous example.

> Instead of supporting old way of putting stopwords on dictionaries, we
> can make them dictionaries on their own. This would then become
> something like:
>
> CASE polish_stopword
>     WHEN NO MATCH THEN polish_isspell
> END

Currently, stopwords increment the position, for example:

SELECT to_tsvector('english','a test message');
---------------------
 'messag':3 'test':2

The stopword 'a' has position 1, but it is not in the vector.

If we want to preserve this behavior, we should somehow pass the
stopword to the tsvector composition function (parsetext in ts_parse.c)
so that the counter is incremented, or increment it in another way.
Currently, an empty lexemes array is passed as the result of
LexizeExec.

One possible way to do so is something like:

CASE polish_stopword
    WHEN MATCH THEN KEEP -- stopword counting
    ELSE polish_isspell
END

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] Flexible configuration for full-text search
> The patch introduces way to configure FTS based on CASE/WHEN/THEN/ELSE
> construction.

Interesting feature. I needed this flexibility before when I was
implementing text search for a Turkish private listing application.
Aleksandr and Arthur were kind enough to discuss it with me off-list
today.

> 1) Multilingual search. Can be used for FTS on a set of documents in
> different languages (example for German and English languages).
>
> ALTER TEXT SEARCH CONFIGURATION multi
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>     WHEN english_hunspell AND german_hunspell THEN
>         english_hunspell UNION german_hunspell
>     WHEN english_hunspell THEN english_hunspell
>     WHEN german_hunspell THEN german_hunspell
>     ELSE german_stem UNION english_stem
> END;

I understand the need to support branching, but this syntax is overly
complicated. I don't think there is any need to support different sets
of dictionaries as condition and action. Something like this might
work better:

ALTER TEXT SEARCH CONFIGURATION multi
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part WITH
    CASE english_hunspell UNION german_hunspell
        WHEN MATCH THEN KEEP
        ELSE german_stem UNION english_stem
    END;

To put it formally:

ALTER TEXT SEARCH CONFIGURATION name
    ADD MAPPING FOR token_type [, ... ] WITH config

where config is one of:

    dictionary_name
    config { UNION | INTERSECT | EXCEPT } config
    CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

> 2) Combination of exact search with morphological one. This patch not
> fully solve the problem but it is a step toward solution. Currently, we
> should split exact and morphological search in query manually and use
> separate index for each part. With new way to configure FTS we can use
> following configuration:
>
> ALTER TEXT SEARCH CONFIGURATION exact_and_morph
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>     WHEN english_hunspell THEN english_hunspell UNION simple
>     ELSE english_stem UNION simple
> END

This could be:

CASE english_hunspell THEN KEEP ELSE english_stem END UNION simple

> 3) Using different dictionaries for recognizing and output generation.
> As I mentioned before, in new syntax condition and command are separate
> and we can use it for some more complex text processing. Here an
> example for processing only nouns:
>
> ALTER TEXT SEARCH CONFIGURATION nouns_only
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>     WHEN english_noun THEN english_hunspell
> END

This would also still work with the simpler syntax because
"english_noun", still being a dictionary, would pass the tokens to the
next one.

> 4) Special stopword processing allows us to discard stopwords even if
> the main dictionary doesn't support such feature (in example pl_ispell
> dictionary keeps stopwords in text):
>
> ALTER TEXT SEARCH CONFIGURATION pl_without_stops
>     ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>         word, hword, hword_part WITH CASE
>     WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
> END

Instead of supporting the old way of putting stopwords on dictionaries,
we can make them dictionaries on their own. This would then become
something like:

CASE polish_stopword
    WHEN NO MATCH THEN polish_isspell
END
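As a reading aid for the config { UNION | INTERSECT | EXCEPT } config rule in this message, the following Python sketch models the three operators on lexeme lists. It is an illustration under assumptions, not the patch's behavior. It assumes a config returns None for "no match", that a combined config reports no match only when both sides do, and that lexeme order follows the left operand. The toy dictionaries and function names are hypothetical.

```python
from typing import Callable, List, Optional

# A config maps a token to a list of lexemes, or None for "no match".
Config = Callable[[str], Optional[List[str]]]


def _combine(left: Config, right: Config, op: str) -> Config:
    def evaluate(token: str) -> Optional[List[str]]:
        a, b = left(token), right(token)
        if a is None and b is None:
            return None  # assumption: no match only if both sides miss
        a, b = a or [], b or []
        if op == "UNION":
            return a + [x for x in b if x not in a]
        if op == "INTERSECT":
            return [x for x in a if x in b]
        return [x for x in a if x not in b]  # EXCEPT
    return evaluate


def union(left: Config, right: Config) -> Config:
    return _combine(left, right, "UNION")


def intersect(left: Config, right: Config) -> Config:
    return _combine(left, right, "INTERSECT")


def except_(left: Config, right: Config) -> Config:
    return _combine(left, right, "EXCEPT")


# Hypothetical stand-ins for the dictionaries named in the thread.
def english_hunspell(token: str) -> Optional[List[str]]:
    known = {"running": ["run"], "run": ["run"]}
    return known.get(token)


def simple(token: str) -> Optional[List[str]]:
    return [token.lower()]


# english_hunspell UNION simple: normalized form plus the exact form.
exact_and_morph = union(english_hunspell, simple)
print(exact_and_morph("running"))
print(exact_and_morph("xyz"))
```

In this model the union returns both the hunspell lexeme and the exact form, which is the effect the exact_and_morph example in the thread is after.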