Hey Matthew,

Many thanks for this, and also many thanks for raising these on the list
first instead of quietly via the document author, and my apologies for a
slow response.

On Fri, 3 Jun 2022 at 10:51, Matthew Wild <[email protected]> wrote:

> Hi folks,
>
> Thanks to Guus's persistence, I finally took time to close a few
> issues I have with the current XEP-0431 (Full Text Search in MAM).
>
> The main issue is that the current version of the spec provides no
> guarantees about how the search string (generally input from a user)
> will be interpreted. Usually in such cases, I would say this is
> fine... an implementation that returns all messages containing "bar"
> when you submit a search for "foo" is obviously broken and nobody
> would want to use it, even if it's 100% permitted behaviour by the
> XEP.
>
>
Yes. The problem is I don't believe that one can usefully codify anything
here without going deep into stemming and stop words, and since nobody
(sane) will write their own free-text indexing system explicitly to match
this specification, it's very difficult to know what can be specified.


> But full-text search is actually a complex topic, and there are
> various backend implementations that servers are likely to lean on.
> Each of them has a different search syntax, and there is no way (in an
> open ecosystem) for a user to know which of these may be used.
>
> My proposal does two things to fix this situation:
>
>   1) Add a "simple" search type, which is recommended to be
> implemented as a baseline for interoperability. For simple searches,
> the server promises that no search terms or symbols will be
> interpreted as special syntax - what you search is what you get.
>
>
There was a reason I didn't include this in the original. :-)

Having implemented XEP-0431 server-side in a previous life, I'm not sure
this is universally possible - or rather, it is, but it's so poorly
performant I'm not sure I'd want anyone using it, and would probably
disable it.

Specifically, if we assume a PostgreSQL backend for the MAM archive, with
an extracted body field, the sensible way to handle FTS is to include a
computed index over that text (this isn't quite what I did, in fact,
because I encrypted the body text, but whatevs):

CREATE TABLE mam_archive (
  ...
  body TEXT,
  retracted BOOLEAN,
  ...
);
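-- The configuration has to be named explicitly: the one-argument
-- to_tsvector(body) depends on a session setting, is not immutable, and
-- so cannot be used in an index expression. 'english' is just an example.
CREATE INDEX fts_mam_archive ON mam_archive
  USING gist (to_tsvector('english', body))
  WHERE NOT retracted;

OK, so now I can implement a reasonably good full-text search:

SELECT * FROM mam_archive
  WHERE websearch_to_tsquery('english', ${search_string}) @@
        to_tsvector('english', body)
  AND NOT retracted;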

This will use an efficient index, and thus will be really quite fast.

But if I want to support a Dumb Substring Search, it'll be:

SELECT * FROM mam_archive WHERE body LIKE '%${search_string}%';

The full-text index can't help here, and therefore I'll be forcing a full
table scan, and there will be much howling and gnashing of teeth.
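
As an aside, a literal substring match also has to neutralise LIKE's own
wildcards, otherwise a '%' or '_' typed by the user becomes special syntax
after all. A minimal sketch in Python, assuming the query is sent with a
bound parameter using the "%s" placeholder style (as psycopg does); the
helper names are hypothetical and not taken from any real server:

```python
def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE metacharacters so that the term matches literally."""
    # Escape the escape character first, then the two LIKE wildcards.
    for ch in (escape, "%", "_"):
        term = term.replace(ch, escape + ch)
    return term


def simple_search_query(term: str) -> tuple[str, tuple[str, ...]]:
    """Return (sql, params) for a literal substring search.

    Backslash is PostgreSQL's default LIKE escape character, so no
    ESCAPE clause is needed. The "%s" placeholder is an assumption about
    the database driver in use.
    """
    sql = "SELECT * FROM mam_archive WHERE body LIKE %s AND NOT retracted"
    return sql, ("%" + escape_like(term) + "%",)
```

This only makes the match literal; it does nothing for the cost of the
scan itself.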

Your suggested change says that supporting this is a SHOULD - in other
words, careful consideration ought to be given to the circumstances where
this might not be offered. But I don't think it's ever reasonable to offer
this - it cannot scale in any meaningful manner - and while it's
functionally possible, any clients relying on it to the exclusion of the
advanced search are surely going to discover it's unsupported. So a SHOULD
becomes, in practical terms, a SHOULD NOT.


>   2) Extend the existing ("advanced") search field with a
> recommendation that the server includes a <desc> element (already
> defined in XEP-0004) to explain the supported syntax to the user, and
> an (entirely optional) machine-readable hint that can be used to
> indicate to a client that a commonly-used syntax is supported.
>
>
Yes, all sounds good.


> Finally, most full-text search engines are not language-agnostic. This
> is because they perform operations such as stemming, and utilize a
> "stop word" list while building the index to help improve the search
> results. Many default to English, and while searches in other
> languages generally work, they may be silently worse. I've added an
> optional tag through which the server can indicate the natural
> languages that the search is optimized for. I feel least strongly
> about this addition, since this information is usually going to be
> apparent to the user already based on the context.
>
>
So, funnily enough, I think this is very sensible. Moreover, it may be
useful to let users choose their own search language, and one could even
vary the indexing depending on the language of each stanza.

Note that this can still be performant:

CREATE TABLE mam_archive (
  ...
  body TEXT,
  pglang REGCONFIG,  -- text search configuration, e.g. 'english'
  retracted BOOLEAN,
  ...
);
CREATE INDEX fts_mam_archive ON mam_archive USING gist (to_tsvector(pglang,
body)) WHERE NOT retracted;
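
To fill the pglang column, the server needs some mapping from a stanza's
xml:lang to a text search configuration. A rough sketch in Python, under
the assumption that only PostgreSQL's built-in configurations are
installed; the table and function names are hypothetical, and the mapping
is deliberately incomplete:

```python
# Hypothetical mapping from the primary subtag of xml:lang to one of
# PostgreSQL's built-in text search configurations.
PG_CONFIGS = {
    "da": "danish",
    "de": "german",
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "it": "italian",
    "nl": "dutch",
    "pt": "portuguese",
    "ru": "russian",
    "sv": "swedish",
}

# Built-in configuration that does no language-specific stemming.
FALLBACK = "simple"


def pg_config_for(xml_lang):
    """Pick a text search configuration for a stanza's xml:lang value."""
    if not xml_lang:
        return FALLBACK
    # Keep only the primary language subtag: "en-GB" -> "en".
    primary = xml_lang.replace("_", "-").split("-", 1)[0].lower()
    return PG_CONFIGS.get(primary, FALLBACK)
```

Stanzas with no xml:lang, or with a language the server has no
configuration for, fall back to "simple" rather than being silently
stemmed as English.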

Dave.
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: [email protected]
_______________________________________________
