Re: [HACKERS] text search changes vs. binary upgrade
On Tue, May 03, 2016 at 11:13:54PM -0400, Tom Lane wrote:
> Noah Misch writes:
> > Commit bb14050 said:
> >   - change order for tsquery, so, users, who has a btree index over tsquery,
> >     should reindex it
>
> We undid that in 1ec4c7c05, no?

Ah, looks that way.

> > Commit 61d66c4 may or may not warrant pg_upgrade treatment:
> >   Fix support of digits in email/hostnames.
>
> The general theory about changes in text search parser and dictionary
> behavior has always been that a reindex is not required, because that does
> not invalidate the derived data in the same sort of way that changing,
> say, btree sort order of a datatype would.  At worst, searches for the
> specifically affected words might fail to find relevant entries because
> to_tsvector now produces a different list of lexemes than before (and
> those new lexemes are not in the index, the old ones are).  If the
> affected set of words is sufficiently large and relevant to her use-case,
> a user might judge that rebuilding derived tsvector data is worth her
> trouble.  But I am dubious that pg_upgrade should issue guidance
> unconditionally telling people to do it.  Most people probably aren't
> going to have any noticeable amount of data that's affected by this change.
>
> If we did worry about this for 61d66c4, then for example the unaccent
> changes would also be problematic, and probably the ispell changes too.
> I'm inclined to just group all those things in the release notes and
> provide text counseling users to think about how much those changes affect
> their full-text data and whether rebuilding derived tsvectors would be
> worth it.

Fair.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] text search changes vs. binary upgrade
Noah Misch writes:
> Commit bb14050 said:
>   - change order for tsquery, so, users, who has a btree index over tsquery,
>     should reindex it

We undid that in 1ec4c7c05, no?  (Even if we didn't, the usefulness of a
btree index on tsquery seems negligibly small.)

> Commit 61d66c4 may or may not warrant pg_upgrade treatment:
>   Fix support of digits in email/hostnames.

The general theory about changes in text search parser and dictionary
behavior has always been that a reindex is not required, because that does
not invalidate the derived data in the same sort of way that changing,
say, btree sort order of a datatype would.  At worst, searches for the
specifically affected words might fail to find relevant entries because
to_tsvector now produces a different list of lexemes than before (and
those new lexemes are not in the index, the old ones are).  If the
affected set of words is sufficiently large and relevant to her use-case,
a user might judge that rebuilding derived tsvector data is worth her
trouble.  But I am dubious that pg_upgrade should issue guidance
unconditionally telling people to do it.  Most people probably aren't
going to have any noticeable amount of data that's affected by this change.

If we did worry about this for 61d66c4, then for example the unaccent
changes would also be problematic, and probably the ispell changes too.
I'm inclined to just group all those things in the release notes and
provide text counseling users to think about how much those changes affect
their full-text data and whether rebuilding derived tsvectors would be
worth it.

			regards, tom lane
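
[Editorial illustration, not part of the original message.] The failure mode described above, where new lexemes are looked up in an index that still holds the old ones, can be sketched with a toy inverted index. Everything here is an assumption made for illustration: `old_lexemes` and `new_lexemes` are hypothetical stand-ins for the parser before and after a rule change, and they do not reproduce the actual output of PostgreSQL's default parser or of to_tsvector.

```python
import re


def old_lexemes(text):
    # Hypothetical "old" parser: a hyphen ends a token,
    # so "123-x.com" is split into "123" and "x.com".
    return set(re.findall(r"[a-z0-9.]+", text.lower()))


def new_lexemes(text):
    # Hypothetical "new" parser: hyphens are kept,
    # so "123-x.com" stays a single lexeme.
    return set(re.findall(r"[a-z0-9.-]+", text.lower()))


def build_index(docs, lexize):
    # Toy inverted index: lexeme -> set of document ids.
    index = {}
    for doc_id, text in docs.items():
        for lex in lexize(text):
            index.setdefault(lex, set()).add(doc_id)
    return index


def search(index, query, lexize):
    # A document matches when it contains every lexeme of the query.
    hits = None
    for lex in lexize(query):
        found = index.get(lex, set())
        hits = found if hits is None else hits & found
    return hits or set()


docs = {1: "mail from 123-x.com arrived", 2: "plain words only"}

stale = build_index(docs, old_lexemes)  # derived data built before the change
fresh = build_index(docs, new_lexemes)  # derived data rebuilt after the change

# Queries are always parsed with the new rules after the upgrade.
affected = search(stale, "123-x.com", new_lexemes)
rebuilt = search(fresh, "123-x.com", new_lexemes)
unaffected = search(stale, "plain words", new_lexemes)
```

In this toy model only the query for the affected hostname misses against the stale index; the query whose words parse identically under both rule sets still finds its document, and rebuilding the derived data restores the affected match.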
[HACKERS] text search changes vs. binary upgrade
Commit bb14050 said:

  - change order for tsquery, so, users, who has a btree index over tsquery,
    should reindex it

The last change of this sort also modified pg_upgrade to issue REINDEX
guidance.  See old_8_3_invalidate_hash_gin_indexes() in the PostgreSQL 9.4
source.  PostgreSQL 9.6 pg_upgrade should do likewise.


Commit 61d66c4 may or may not warrant pg_upgrade treatment:

  Fix support of digits in email/hostnames.

  When tsearch was implemented I did several mistakes in hostname/email
  definition rules:
  1) allow underscore in hostname what prohibited by RFC
  2) forget to allow leading digits separated by hyphen (like 123-x.com)
     in hostname
  3) do no allow underscore/hyphen after leading digits in localpart of
     email
  ...

Any index (not just btree) that depends on a text search configuration
using parser pg_catalog.default may need a REINDEX after this change.
(Furthermore, any constraint having such a dependency would need a recheck.
That use case may be less important.)

I think the last changes to pg_catalog.default semantics were 2c265ad (URLs)
and 89b0095 (emails), both in 9.0.  For those, we didn't change pg_upgrade or
recommend REINDEX in the release notes.  We could call that a relevant
precedent and, for this 9.6 change, once again take no particular action.

On the other hand, binary upgrade has matured since its 9.0 birth.  Perhaps
standards have risen, and pg_upgrade should issue guidance to REINDEX affected
text search indexes.  (The guidance could mention the kind of queries that
will notice the difference.)

I lean toward having pg_upgrade address this incompatibility.  Other opinions?