Hey Matthew,

Many thanks for this, and also many thanks for raising these on the list first instead of quietly via the document author, and my apologies for a slow response.

On Fri, 3 Jun 2022 at 10:51, Matthew Wild <[email protected]> wrote:

> Hi folks,
>
> Thanks to Guus's persistence, I finally took time to close a few
> issues I have with the current XEP-0431 (Full Text Search in MAM).
>
> The main issue is that the current version of the spec provides no
> guarantees about how the search string (generally input from a user)
> will be interpreted. Usually in such cases, I would say this is
> fine... an implementation that returns all messages containing "bar"
> when you submit a search for "foo" is obviously broken and nobody
> would want to use it, even if it's 100% permitted behaviour by the
> XEP.
>

Yes. The problem is I don't believe that one can usefully codify anything here without going deep into stemming and stop words, and since nobody (sane) will write their own free-text indexing system explicitly to match this specification, it's very difficult to know what can be specified.

> But full-text search is actually a complex topic, and there are
> various backend implementations that servers are likely to lean on.
> Each of them has a different search syntax, and there is no way (in an
> open ecosystem) for a user to know which of these may be used.
>
> My proposal does two things to fix this situation:
>
> 1) Add a "simple" search type, which is recommended to be
> implemented as a baseline for interoperability. For simple searches,
> the server promises that no search terms or symbols will be
> interpreted as special syntax - what you search is what you get.
>

There was a reason I didn't include this in the original. :-) Having implemented XEP-0431 server-side in a previous life, I'm not sure this is universally possible - or rather, it is, but it's so poorly performant I'm not sure I'd want anyone using it, and would probably disable it.

Specifically, if we assume a PostgreSQL backend for the MAM archive, with an extracted body field, the sensible way to handle FTS is to include a computed index over that text (this isn't quite what I did, in fact, because I encrypted the body text, but whatevs):

    CREATE TABLE mam_archive (
        ...
        body TEXT,
        retracted BOOLEAN,
        ...
    );
    CREATE INDEX fts_mam_archive ON mam_archive
        USING gist (to_tsvector(body))
        WHERE NOT retracted;

OK, so now I can implement a reasonably good full-text search:

    SELECT * FROM mam_archive
        WHERE websearch_to_tsquery(${search_string}) @@ to_tsvector(body)
        AND NOT retracted

This will use an efficient index, and thus will be really quite fast.

But if I want to support a Dumb Substring Search, it'll be:

    SELECT * FROM mam_archive WHERE body LIKE '%${search_string}%';

There's no index I could build to help here, and therefore I'll be forcing a full table scan, and there will be much howling and gnashing of teeth.

Your suggested change says that supporting this is a SHOULD - in other words, careful consideration ought to be given to the circumstances where this might not be offered. But I don't think it's ever reasonable to offer this - it cannot scale in any meaningful manner - and while it's functionally possible, any clients relying on it to the exclusion of the advanced search are surely going to discover it's unsupported. So a SHOULD becomes, in practical terms, a SHOULD NOT.

> 2) Extend the existing ("advanced") search field with a
> recommendation that the server includes a <desc> element (already
> defined in XEP-0004) to explain the supported syntax to the user, and
> an (entirely optional) machine-readable hint that can be used to
> indicate to a client that a commonly-used syntax is supported.
>

Yes, all sounds good.

> Finally, most full-text search engines are not language-agnostic.
> This is because they perform operations such as stemming, and utilize a
> "stop word" list while building the index to help improve the search
> results. Many default to English, and while searches in other
> languages generally work, they may be silently worse. I've added an
> optional tag through which the server can indicate the natural
> languages that the search is optimized for. I feel least strongly
> about this addition, since this information is usually going to be
> apparent to the user already based on the context.
>

So, funnily enough, I think this is very sensible. Moreover, it may be useful to offer users a choice about their own language, and one could even change the indexing depending on the language of the stanza. Note that this can still be performant:

    CREATE TABLE mam_archive (
        ...
        body TEXT,
        pglang TEXT,
        retracted BOOLEAN,
        ...
    );
    CREATE INDEX fts_mam_archive ON mam_archive
        USING gist (to_tsvector(pglang, body))
        WHERE NOT retracted;

Dave.
