Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.

The "Suggester" page has been changed by RobertMuir:
http://wiki.apache.org/solr/Suggester?action=diff&rev1=10&rev2=11

Comment:
add docs for wFST impl

        <str 
name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
        <!-- Alternatives to lookupImpl: 
             org.apache.solr.spelling.suggest.fst.FSTLookup   [finite state 
automaton]
+            org.apache.solr.spelling.suggest.fst.WFSTLookupFactory [weighted 
finite state automaton]
             org.apache.solr.spelling.suggest.jaspell.JaspellLookup [default, 
jaspell-based]
             org.apache.solr.spelling.suggest.tst.TSTLookup   [ternary trees]
        -->
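For reference, a complete suggester component using the new weighted lookup might be configured as follows. This is an illustrative sketch, not taken from the page: the component name, dictionary name, and the `name` field are assumptions you would adapt to your schema.

```
<!-- Sketch of a solrconfig.xml suggester using the weighted FST lookup.
     The component name ("suggest") and field name ("name") are
     illustrative assumptions, not prescribed by this page. -->
<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.WFSTLookupFactory</str>
    <str name="field">name</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```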
@@ -45, +46 @@

   * JaspellLookup - tree-based representation based on Jaspell,
   * TSTLookup - ternary tree based representation, capable of immediate data 
structure updates,
   * FSTLookup - automaton based representation; slower to build, but consumes 
far less memory at runtime (see performance notes below).
+  * WFSTLookup - weighted automaton representation; an alternative to 
FSTLookup with more fine-grained ranking of suggestions (Solr 3.6+).
  
  For practical purposes all of the above implementations will most likely run 
at similar speed when requests are made via the HTTP stack (which will
- become the bottleneck). Direct benchmarks of these classes indicate that 
FSTLookup provides better performance compared to the other two methods, at a 
much lower memory cost. JaspellLookup can provide "fuzzy" suggestions, though 
this functionality is not currently exposed (it's a one line change in 
JaspellLookup). Support for infix-suggestions is planned for FSTLookup (which 
would be the only structure to support these).
+ become the bottleneck). Direct benchmarks of these classes indicate that 
(W)FSTLookup provides better performance compared to the other two methods, at 
a much lower memory cost. JaspellLookup can provide "fuzzy" suggestions, though 
this functionality is not currently exposed (it's a one line change in 
JaspellLookup). Support for infix-suggestions is planned for FSTLookup (which 
would be the only structure to support these).
  
  An example of an autosuggest request:
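(The request itself is elided by this diff; as a hedged illustration, a typical autosuggest request using the standard spellcheck parameters might look like the following, with the handler path and query values assumed rather than taken from the page.)

```
http://localhost:8983/solr/suggest?q=ac&spellcheck=true&spellcheck.count=5&spellcheck.build=true
```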
  
@@ -89, +91 @@

    * `org.apache.solr.suggest.tst.TSTLookup` - a simple compact ternary trie 
based lookup
    * `org.apache.solr.suggest.jaspell.JaspellLookup` - a more complex lookup 
based on a ternary trie from the [[http://jaspell.sourceforge.net/|JaSpell]] 
project.
    * `org.apache.solr.suggest.fst.FSTLookup` - automaton-based lookup
+   * `org.apache.solr.spelling.suggest.fst.WFSTLookupFactory` - weighted 
automaton-based lookup
   * `buildOnCommit` - if set to true then the Lookup data structure will be 
rebuilt after commit. If false (default) then the Lookup data will be built 
only when requested (by URL parameter `spellcheck.build=true`). '''NOTE: 
currently implemented Lookup-s keep their data in memory, so unlike 
spellchecker data this data is discarded on core reload and not available until 
you invoke the build command, either explicitly or implicitly via commit.'''
   * `sourceLocation` - location of the dictionary file. If not empty then this 
is a path to a dictionary file (see below). If this value is empty then the 
main index will be used as a source of terms and weights.
   * `field` - if `sourceLocation` is empty then terms from this field in the 
index will be used when building the trie.
@@ -110, +113 @@

  
  Please note that the format of the file is not limited to single terms but 
can also contain phrases - which is an improvement over the TermsComponent that 
you could also use for a simple version of autocomplete functionality. 
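For illustration, such a file-based dictionary is plain text with one term or phrase per line, optionally followed by a tab and a weight. The entries below are made-up examples, not from the page:

```
acer
acer laptop	10
apple ipod touch	17.5
```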
  
- FSTLookup has a built-in mechism to discetize weights into a fixed set of 
buckets (to speed up suggestions). The number of buckets is configurable.
+ FSTLookup has a built-in mechanism to discretize weights into a fixed set of 
buckets (to speed up suggestions). The number of buckets is configurable.
+ 
+ WFSTLookup does not discretize weights into buckets; instead it finds the 
top suggestions with a shortest-path search over the exact weights. Note 
that it expects weights to be whole numbers.
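To make the difference concrete, here is a small standalone sketch (illustrative only, not Solr's actual implementation) of what discretizing weights into a fixed set of buckets means, and why exact weights allow finer ranking:

```java
// A sketch (not Solr source) of the bucketing idea behind FSTLookup:
// weights are discretized into a fixed number of coarse buckets, so
// nearby weights collapse to the same rank. WFSTLookup instead keeps
// exact integer weights and ranks suggestions by shortest-path search.
public class BucketSketch {
    // Map a weight in [min, max] onto one of `buckets` coarse levels.
    static int bucket(double w, double min, double max, int buckets) {
        if (max <= min) return 0;
        int b = (int) ((w - min) / (max - min) * buckets);
        return Math.min(b, buckets - 1); // clamp the top edge into range
    }

    public static void main(String[] args) {
        // With 10 buckets over [0, 100], weights 5 and 9 collapse together,
        // so bucket-based ranking cannot distinguish between them.
        System.out.println(bucket(5, 0, 100, 10));   // prints 0
        System.out.println(bucket(9, 0, 100, 10));   // prints 0
        System.out.println(bucket(95, 0, 100, 10));  // prints 9
    }
}
```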
  
  === Threshold parameter ===
  As mentioned above, if the `sourceLocation` parameter is empty then the terms 
from a field indicated by the `field` parameter are used. It's often the case 
that due to imperfect source data there are many uncommon or invalid terms that 
occur only once in the whole corpus (e.g. OCR errors, typos, etc). According to 
Zipf's law this actually forms the majority of terms, which means that the 
dictionary built indiscriminately from a real-life index would consist mostly 
of uncommon terms, and its size would be enormous. In order to avoid this and 
to reduce the size of in-memory structures it's best to set the `threshold` 
parameter to a value slightly above zero (0.5% in the example above). This 
already vastly reduces the size of the dictionary by skipping 
[[http://en.wikipedia.org/wiki/Hapax_legomenon|"hapax legomena"]] while still 
preserving most of the common terms. This parameter has no effect when using a 
file-based dictionary - it's assumed that only useful terms are found there. ;)
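In configuration terms, the 0.5% mentioned above corresponds to a fraction-of-documents value in the spellchecker block (a sketch; the parameter name matches the page, the placement is assumed):

```
<float name="threshold">0.005</float>
```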
@@ -125, +130 @@

   * `spellcheck.collate=true` - to provide a query collated with the first 
matching suggestion.
  
  = Tips and tricks =
- * Use FSTLookup to conserve memory (unless you need a more sophisticated 
matching, then look at JaspellLookup). There are some benchmarks of all three 
implementations: 
[[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12873599|SOLR-1316]]
 (outdated) and a bit newer here:
+ * Use (W)FSTLookup to conserve memory (unless you need more sophisticated 
matching; in that case look at JaspellLookup). There are some benchmarks of all four 
implementations: 
[[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12873599|SOLR-1316]]
 (outdated) and a bit newer here:
- [[https://issues.apache.org/jira/browse/SOLR-2378|SOLR-2378]]. The class to 
perform these benchmarks is in the source tree and is called 
LookupBenchmarkTest.
+ [[https://issues.apache.org/jira/browse/SOLR-2378|SOLR-2378]], and here: 
[[https://issues.apache.org/jira/browse/LUCENE-3714|LUCENE-3714]]. 
+ The class to perform these benchmarks is in the source tree and is called 
LookupBenchmarkTest.
  
  * Use `threshold` parameter to limit the size of the trie, to reduce the 
build time and to remove invalid/uncommon terms. Values below 0.01 should be 
sufficient, greater values can be used to limit the impact of terms that occur 
in a larger portion of documents. Values above 0.5 probably don't make much 
sense.
  
