How to query the underlying dictionary i.e. inverse of ts_lexize()

PG Doc comments form Sun, 07 Apr 2019 01:13:03 -0700

The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/11/textsearch-debugging.html
Description:


It would be helpful if there were some documentation on how to query the
dictionaries themselves, to get a canonical root word, either:

1. Directly, such as:
  SELECT words FROM english_stem WHERE stem = 'chlorin';
  -- should return e.g. "chlorine", "chlorination", "chlorinated"
  -- there isn't any documentation on how to actually do this.

2. Indirectly, such as:
  SELECT ts_unlexize('english_stem','chlorin');
  -- this is a function which doesn't yet seem to exist: the one-to-many
  -- inverse of ts_lexize().

3. Or, the canonical version of (2):
  SELECT ts_canonical('english_stem','chlorin');
  -- a one-to-one function to find the English root word (not the lexeme).
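
For comparison, the forward direction that does exist is ts_lexize(), which
takes a dictionary name and a token and returns an array of lexemes. A
minimal sketch using the words from this report; no output is shown because
the exact lexemes depend on the installed Snowball stemmer:

```sql
-- Forward direction only: word -> lexeme array.
-- The three input words are just the examples from this report.
SELECT word,
       ts_lexize('english_stem', word) AS lexemes
FROM (VALUES ('chlorine'), ('chlorination'), ('chlorinated')) AS t(word);
```

Nothing comparable is documented for going from a lexeme back to a word.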

An example of where this is useful: consider a list of documents containing
a large amount of English text.
For this example, consider that the following words are frequent: "the",
"kitten", "kittens", "chlorination", "chlorinated", "temperature" and
"something".

We wish to display a "tag cloud" of the most common terms, excluding
stopwords, by means of ts_stat().  
At the moment, it lists: 
  "kitten"      -- correctly treats "kitten" and "kittens" as the same.
  "chlorin"     -- correctly merges "chlorination" and "chlorinated",
                   but produces a non-word.
  "temperatur"  -- right stem, but not a word.
  "someth"      -- stemmer error: the "-ing" suffix has been removed.
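
A listing like the one above can be produced roughly as follows. This is
only a sketch: the table "documents" and its text column "body" are
hypothetical names, not something taken from a real schema.

```sql
-- Hypothetical table: documents(body text).
-- ts_stat() runs the inner query and returns one row per lexeme with
-- ndoc (documents containing it) and nentry (total occurrences).
SELECT word, ndoc, nentry
FROM ts_stat($$SELECT to_tsvector('english', body) FROM documents$$)
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
```

The "word" column holds lexemes, which is why stems such as "chlorin"
show up in the tag cloud instead of displayable words.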

So, given the array ["kitten","chlorin","temperatur","someth"], we wish to
un-stem each entry to find the first valid English word whose stem is in
that array, i.e.
  ["kitten", "chlorine", "temperature", "something"]
Note that it is intentional to retrieve "chlorine" even though the original
inputs were "chlorinated" and "chlorination", and did not necessarily
contain "chlorine".
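
Absent such a function, one partial workaround is to build the stem-to-word
mapping from the words that actually occur in the documents. The sketch
below assumes a hypothetical table documents(body text) and picks the
shortest surface form per stem; that choice is an assumption, not an
established method. It can only return words present in the corpus, so it
would not produce "chlorine" unless some document contains it.

```sql
-- Hypothetical table: documents(body text).
WITH tokens AS (
    SELECT lower(p.token) AS word
    FROM documents AS d,
         ts_parse('default', d.body) AS p
    WHERE p.tokid = 1                  -- 1 = asciiword in the default parser
),
stems AS (
    SELECT (ts_lexize('english_stem', word))[1] AS stem,
           word,
           count(*) AS uses
    FROM tokens
    GROUP BY 1, 2
)
SELECT DISTINCT ON (stem)
       stem,
       word AS display_word            -- shortest form, ties by frequency
FROM stems
WHERE stem IS NOT NULL                 -- stop words yield an empty array
ORDER BY stem, length(word), uses DESC;
```

Joining against a separate word list (for example a table loaded from a
system dictionary file) instead of the corpus tokens would be needed to get
words that never appear in the documents.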

There doesn't seem to be any process for doing this. Not sure whether this
is just something for the documentation, or an RFE for (2). Thanks very
much.
