Re: [HACKERS] Flexible configuration for full-text search

2017-11-07 Thread Aleksandr Parfenov
On Tue, 31 Oct 2017 09:47:57 +0100
Emre Hasegeli  wrote:

> > If we want to save this behavior, we should somehow pass a stopword
> > to tsvector composition function (parsetext in ts_parse.c) for
> > counter increment or increment it in another way. Currently, an
> > empty lexemes array is passed as a result of LexizeExec.
> >
> > One of possible way to do so is something like:
> > CASE polish_stopword
> > WHEN MATCH THEN KEEP -- stopword counting
> > ELSE polish_isspell
> > END  
> 
> This would mean keeping the stopwords.  What we want is
> 
> CASE polish_stopword -- stopword counting
> WHEN NO MATCH THEN polish_isspell
> END
> 
> Do you think it is possible?

Hi Emre,

I have thought about how this could be implemented. The way I see it is
to increment the word counter whenever any checked dictionary matches
the word, even if it returns no lexeme. The main drawback is that the
counter increment is implicit.
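
The idea above can be modelled outside the server. What follows is only
a rough Python sketch, not code from the patch: stopword_dict,
simple_dict and build_vector are hypothetical names, and the convention
that a dictionary returns None for "no match" and an empty list for a
recognized stopword is an assumption made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Assumed convention: None = word not recognized, [] = recognized stopword,
# non-empty list = lexemes produced for the word.
Dictionary = Callable[[str], Optional[List[str]]]


def stopword_dict(word: str) -> Optional[List[str]]:
    """Hypothetical stopword-only dictionary."""
    return [] if word.lower() in {"a", "the"} else None


def simple_dict(word: str) -> Optional[List[str]]:
    """Hypothetical catch-all dictionary that lowercases the word."""
    return [word.lower()]


def build_vector(words: List[str],
                 dictionaries: List[Dictionary]) -> List[Tuple[str, int]]:
    position = 0
    vector: List[Tuple[str, int]] = []
    for word in words:
        for lexize in dictionaries:
            lexemes = lexize(word)
            if lexemes is None:
                continue  # no match: try the next dictionary
            # Implicit increment: any match advances the counter,
            # even when the dictionary returned no lexemes.
            position += 1
            vector.extend((lexeme, position) for lexeme in lexemes)
            break
    return vector


print(build_vector("a test message".split(), [stopword_dict, simple_dict]))
# prints [('test', 2), ('message', 3)]
```

The stopword 'a' still consumes position 1 although it contributes
nothing to the vector, which is the behavior discussed above.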

--
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Flexible configuration for full-text search

2017-11-06 Thread Aleksandr Parfenov
On Mon, 6 Nov 2017 18:05:23 +1300
Thomas Munro  wrote:

> On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov
>  wrote:
> > In attachment updated patch with fixes of empty XML tags in
> > documentation.  
> 
> Hi Aleksandr,
> 
> I'm not sure if this is expected at this stage, but just in case you
> aren't aware, with this version of the patch the binary upgrade test
> in
> src/bin/pg_dump/t/002_pg_dump.pl fails for me:
> 
> #   Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION
> dump_test.alt_ts_conf1 ...'
> #   at t/002_pg_dump.pl line 6715.
> 

Hi Thomas,

Thank you for noticing it. I will investigate it while working on the
next version of the patch.

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Flexible configuration for full-text search

2017-11-05 Thread Thomas Munro
On Sat, Oct 21, 2017 at 1:39 AM, Aleksandr Parfenov
 wrote:
> In attachment updated patch with fixes of empty XML tags in
> documentation.

Hi Aleksandr,

I'm not sure if this is expected at this stage, but just in case you
aren't aware, with this version of the patch the binary upgrade test
in
src/bin/pg_dump/t/002_pg_dump.pl fails for me:

#   Failed test 'binary_upgrade: dumps ALTER TEXT SEARCH CONFIGURATION
dump_test.alt_ts_conf1 ...'
#   at t/002_pg_dump.pl line 6715.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Flexible configuration for full-text search

2017-10-31 Thread Emre Hasegeli
> I'm mostly happy with mentioned modifications, but I have few questions
> to clarify some points. I will send new patch in week or two.

I am glad you liked it.  However, I think we should get approval from
more senior community members or committers on the syntax before
we put more effort into the code.

> But configuration:
>
> CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END
>
> is not (as I understand ELSE can be used only with KEEP).
>
> I think we should decide to allow or disallow usage of different
> dictionaries for match checking (between CASE and WHEN) and a result
> (after THEN). If answer is 'allow', maybe we should allow the
> third example too for consistency in configurations.

I think you are right.  We had better allow this too.  Then the CASE syntax becomes:

CASE config
WHEN [ NO ] MATCH THEN { KEEP | config }
[ ELSE config ]
END
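
To make the intended semantics of this form concrete, here is a small
Python sketch of how such a CASE could be evaluated for one word. It is
an illustration under assumptions, not the patch's implementation: the
Case class, the KEEP sentinel, the evaluate function and the three toy
dictionaries are all hypothetical, and None is assumed to mean "no
match".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union

Lexemes = Optional[List[str]]            # None models "no match"
Dictionary = Callable[[str], Lexemes]

KEEP = object()                          # sentinel for THEN KEEP


@dataclass
class Case:
    condition: "Config"                  # config between CASE and WHEN
    no_match: bool                       # True for WHEN NO MATCH
    then: object                         # KEEP or a Config
    otherwise: Optional["Config"] = None # optional ELSE branch


Config = Union[Dictionary, Case]


def evaluate(config, word: str) -> Lexemes:
    if not isinstance(config, Case):
        return config(word)              # plain dictionary
    result = evaluate(config.condition, word)
    matched = result is not None
    if matched != config.no_match:       # the WHEN branch applies
        if config.then is KEEP:
            return result                # reuse the condition's output
        return evaluate(config.then, word)
    if config.otherwise is not None:
        return evaluate(config.otherwise, word)
    return None


# Hypothetical toy dictionaries, for illustration only.
def english_noun(word: str) -> Lexemes:
    return [word] if word in {"cat", "cats"} else None


def english_hunspell(word: str) -> Lexemes:
    return {"cats": ["cat"], "cat": ["cat"]}.get(word)


def simple(word: str) -> Lexemes:
    return [word.lower()]


# CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END
nouns = Case(english_noun, False, english_hunspell, simple)
print(evaluate(nouns, "cats"))   # prints ['cat']
print(evaluate(nouns, "Run"))    # prints ['run']
```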

> Based on formal definition it is possible to describe this example in
> following manner:
> CASE english_noun WHEN MATCH THEN english_hunspell END
>
> The question is same as in the previous example.

I couldn't understand the question.

> Currently, stopwords increment position, for example:
> SELECT to_tsvector('english','a test message');
> -
>  'messag':3 'test':2
>
> A stopword 'a' has a position 1 but it is not in the vector.

Does this problem apply only to stopwords, or to the whole thing we are
inventing?  Shouldn't we preserve the positions through the pipeline?

> If we want to save this behavior, we should somehow pass a stopword to
> tsvector composition function (parsetext in ts_parse.c) for counter
> increment or increment it in another way. Currently, an empty lexemes
> array is passed as a result of LexizeExec.
>
> One of possible way to do so is something like:
> CASE polish_stopword
> WHEN MATCH THEN KEEP -- stopword counting
> ELSE polish_isspell
> END

This would mean keeping the stopwords.  What we want is

CASE polish_stopword -- stopword counting
WHEN NO MATCH THEN polish_isspell
END

Do you think it is possible?




Re: [HACKERS] Flexible configuration for full-text search

2017-10-30 Thread Aleksandr Parfenov
I'm mostly happy with the mentioned modifications, but I have a few
questions to clarify some points. I will send a new patch in a week or two.

On Thu, 26 Oct 2017 20:01:14 +0200
Emre Hasegeli  wrote:
> To put it formally:
> 
> ALTER TEXT SEARCH CONFIGURATION name
> ADD MAPPING FOR token_type [, ... ] WITH config
> 
> where config is one of:
> 
> dictionary_name
> config { UNION | INTERSECT | EXCEPT } config
> CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

According to the formal definition, the following configurations are valid:

CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END
CASE english_noun WHEN MATCH THEN english_hunspell END

But the configuration:

CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END

is not (as I understand it, ELSE can be used only with KEEP).

I think we should decide whether to allow or disallow the use of
different dictionaries for match checking (between CASE and WHEN) and
for the result (after THEN). If the answer is 'allow', maybe we should
allow the third example too, for consistency in configurations.
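
The validity claims above can be checked mechanically. Below is a
minimal recursive-descent recognizer, written in Python purely as an
illustration, for the formal definition quoted at the top of this
message. The Parser class and is_valid function are hypothetical
helpers, keywords are assumed to be upper case, and nothing here
reflects how the patch's real grammar is written.

```python
import re

OPERATORS = {"UNION", "INTERSECT", "EXCEPT"}
KEYWORDS = OPERATORS | {"CASE", "WHEN", "NO", "MATCH",
                        "THEN", "KEEP", "ELSE", "END"}


class Parser:
    def __init__(self, text: str) -> None:
        self.tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, keyword: str) -> None:
        if self.peek() != keyword:
            raise SyntaxError(f"expected {keyword}, found {self.peek()}")
        self.pos += 1

    def config(self) -> None:
        # config { UNION | INTERSECT | EXCEPT } config
        self.term()
        while self.peek() in OPERATORS:
            self.pos += 1
            self.term()

    def term(self) -> None:
        token = self.peek()
        if token == "CASE":
            # CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END
            self.pos += 1
            self.config()
            self.expect("WHEN")
            if self.peek() == "NO":
                self.pos += 1
            self.expect("MATCH")
            self.expect("THEN")
            if self.peek() == "KEEP":
                self.pos += 1
                self.expect("ELSE")
            self.config()
            self.expect("END")
        elif token is None or token in KEYWORDS:
            raise SyntaxError(f"expected dictionary name, found {token}")
        else:
            self.pos += 1  # dictionary_name


def is_valid(text: str) -> bool:
    parser = Parser(text)
    try:
        parser.config()
    except SyntaxError:
        return False
    return parser.peek() is None  # reject trailing tokens


examples = [
    "CASE english_hunspell WHEN MATCH THEN KEEP ELSE simple END",
    "CASE english_noun WHEN MATCH THEN english_hunspell END",
    "CASE english_noun WHEN MATCH THEN english_hunspell ELSE simple END",
]
for example in examples:
    print(is_valid(example), example)
```

Under this reading of the definition, the first two examples are
accepted and the third is rejected, because ELSE is only reachable
directly after KEEP.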

> > 3) Using different dictionaries for recognizing and output
> > generation. As I mentioned before, in new syntax condition and
> > command are separate and we can use it for some more complex text
> > processing. Here an example for processing only nouns:
> >
> > ALTER TEXT SEARCH CONFIGURATION nouns_only
> >   ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> > word, hword, hword_part WITH CASE
> >   WHEN english_noun THEN english_hunspell
> > END  
> 
> This would also still work with the simpler syntax because
> "english_noun", still being a dictionary, would pass the tokens to the
> next one.

Based on the formal definition, it is possible to describe this example
in the following manner:
CASE english_noun WHEN MATCH THEN english_hunspell END

The question is the same as in the previous example.

> Instead of supporting old way of putting stopwords on dictionaries, we
> can make them dictionaries on their own.  This would then become
> something like:
> 
> CASE polish_stopword
> WHEN NO MATCH THEN polish_isspell
> END

Currently, stopwords increment the position counter, for example:
SELECT to_tsvector('english','a test message');
-
 'messag':3 'test':2

The stopword 'a' has position 1, but it is not in the vector.

If we want to preserve this behavior, we should somehow pass the stopword
to the tsvector composition function (parsetext in ts_parse.c) so that it
increments the counter, or increment it in another way. Currently, an
empty lexemes array is passed as the result of LexizeExec.

One possible way to do so is something like:
CASE polish_stopword
WHEN MATCH THEN KEEP -- stopword counting
ELSE polish_isspell
END

-- 
Aleksandr Parfenov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] Flexible configuration for full-text search

2017-10-26 Thread Emre Hasegeli
> The patch introduces way to configure FTS based on CASE/WHEN/THEN/ELSE
> construction.

Interesting feature.  I needed this flexibility before when I was
implementing text search for a Turkish private listing application.
Aleksandr and Arthur were kind enough to discuss it with me off-list
today.

> 1) Multilingual search. Can be used for FTS on a set of documents in
> different languages (example for German and English languages).
>
> ALTER TEXT SEARCH CONFIGURATION multi
>   ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> word, hword, hword_part WITH CASE
> WHEN english_hunspell AND german_hunspell THEN
>   english_hunspell UNION german_hunspell
> WHEN english_hunspell THEN english_hunspell
> WHEN german_hunspell THEN german_hunspell
> ELSE german_stem UNION english_stem
>   END;

I understand the need to support branching, but this syntax is overly
complicated.  I don't think there is any need to support different sets
of dictionaries as condition and action.  Something like this might
work better:

ALTER TEXT SEARCH CONFIGURATION multi
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
  word, hword, hword_part WITH
CASE english_hunspell UNION german_hunspell
WHEN MATCH THEN KEEP
ELSE german_stem UNION english_stem
END;

To put it formally:

ALTER TEXT SEARCH CONFIGURATION name
ADD MAPPING FOR token_type [, ... ] WITH config

where config is one of:

dictionary_name
config { UNION | INTERSECT | EXCEPT } config
CASE config WHEN [ NO ] MATCH THEN [ KEEP ELSE ] config END

> 2) Combination of exact search with morphological one. This patch not
> fully solve the problem but it is a step toward solution. Currently, we
> should split exact and morphological search in query manually and use
> separate index for each part. With new way to configure FTS we can use
> following configuration:
>
> ALTER TEXT SEARCH CONFIGURATION exact_and_morph
>   ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
>   word, hword, hword_part WITH CASE
> WHEN english_hunspell THEN english_hunspell UNION simple
> ELSE english_stem UNION simple
>   END

This could be:

CASE english_hunspell
WHEN MATCH THEN KEEP
ELSE english_stem
END
UNION
simple
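
As a rough illustration of what the set operators could mean for a
single word, the following Python sketch combines the outputs of two
dictionaries. The combine helper and both toy dictionaries are
hypothetical, and the treatment of "no match" (None is returned only
when neither side recognizes the word) is an assumption, since the
thread does not define it.

```python
from typing import Callable, List, Optional

Lexemes = Optional[List[str]]            # None models "no match"
Dictionary = Callable[[str], Lexemes]


def combine(left: Dictionary, right: Dictionary, operator: str) -> Dictionary:
    """Build a dictionary-like function for `left operator right`."""
    if operator not in {"UNION", "INTERSECT", "EXCEPT"}:
        raise ValueError(f"unknown operator: {operator}")

    def lexize(word: str) -> Lexemes:
        a, b = left(word), right(word)
        if a is None and b is None:
            return None                  # assumed: neither side matched
        a, b = a or [], b or []
        if operator == "UNION":
            return a + [lexeme for lexeme in b if lexeme not in a]
        if operator == "INTERSECT":
            return [lexeme for lexeme in a if lexeme in b]
        return [lexeme for lexeme in a if lexeme not in b]  # EXCEPT

    return lexize


# Hypothetical toy dictionaries, for illustration only.
def english_stem(word: str) -> Lexemes:
    lowered = word.lower()
    return [lowered[:-1] if lowered.endswith("s") else lowered]


def simple(word: str) -> Lexemes:
    return [word.lower()]


exact_and_morph = combine(english_stem, simple, "UNION")
print(exact_and_morph("Cats"))   # prints ['cat', 'cats']
```

With UNION, both the stemmed and the exact form end up in the output,
which is what the exact-plus-morphological use case asks for.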

> 3) Using different dictionaries for recognizing and output generation.
> As I mentioned before, in new syntax condition and command are separate
> and we can use it for some more complex text processing. Here an
> example for processing only nouns:
>
> ALTER TEXT SEARCH CONFIGURATION nouns_only
>   ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> word, hword, hword_part WITH CASE
>   WHEN english_noun THEN english_hunspell
> END

This would also still work with the simpler syntax because
"english_noun", still being a dictionary, would pass the tokens to the
next one.

> 4) Special stopword processing allows us to discard stopwords even if
> the main dictionary doesn't support such feature (in example pl_ispell
> dictionary keeps stopwords in text):
>
> ALTER TEXT SEARCH CONFIGURATION pl_without_stops
>   ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
> word, hword, hword_part WITH CASE
> WHEN simple_pl IS NOT STOPWORD THEN pl_ispell
>   END

Instead of supporting the old way of putting stopwords on dictionaries, we
can make them dictionaries on their own.  This would then become
something like:

CASE polish_stopword
WHEN NO MATCH THEN polish_isspell
END

