On Thu, Nov 25, 2010 at 9:48 AM, Alexander Klimetschek <aklim...@adobe.com> wrote:
> On 24.11.10 22:29, "Ard Schrijvers" <a.schrijv...@onehippo.com> wrote:
>
>> On Wed, Nov 24, 2010 at 10:03 PM, Zhou Wu <zwu...@yahoo.com> wrote:
>>> I'm trying to do something like
>>> org.apache.jackrabbit.core.query.lucene.spell.SpellChecker for autocomplete:
>>> when the user types in the search input box, a list of words (phrases) pops
>>> up, like Google suggestions. I searched the web and found
>>> http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
>>> which looks helpful. But I don't know how to get it working with
>>> Jackrabbit. Could anyone give some tips? Thanks,
>>
>> Afaiu, SpellChecker wouldn't fit auto completion. Auto completion is
>> about suggesting existing terms in the index after you have typed, say,
>> 'jack'.
>
> Exactly, spellcheck is about getting from "jeck" to "jack", but
> autocompletion (in its hardest form) is about getting from typing a "j"
> to a list like "jack, jupiter, jelly, january, ...".
>
> Also, there are different use cases as to what to show in auto-completion
> (always showing all possibilities doesn't work ;-)) and it is language-
> and region-dependent.
>
> Since those few-letter inputs like "j" will be the most frequent ones, as
> people type words one by one, you want to look up those terms from a
> pre-built index as directly as possible. For this, you can have something
> like "j/ja/jac" in the repository. On each level there is a multi-value
> property containing the auto-completions/suggestions you want to show
> (10 is a good number, for example, as used by Google).
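(For illustration only: a minimal JCR sketch of a lookup against such a
prefix structure, assuming JCR 2.0, a tree rooted at /autocomplete and a
multi-valued "suggestions" property on each level. The root path, the
property name and the helper class are made-up examples, not anything
Jackrabbit ships with.)

import java.util.ArrayList;
import java.util.List;

import javax.jcr.Node;
import javax.jcr.PathNotFoundException;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.Value;

public class PrefixSuggestions {

    // Walk the pre-built prefix tree for the typed input, e.g. the input
    // "jac" maps to /autocomplete/j/ja/jac, and return the stored suggestions.
    public static List<String> suggest(Session session, String input)
            throws RepositoryException {
        List<String> result = new ArrayList<String>();
        StringBuilder path = new StringBuilder("/autocomplete");
        for (int i = 1; i <= input.length(); i++) {
            path.append('/').append(input.substring(0, i));
        }
        try {
            Node prefixNode = session.getNode(path.toString());
            for (Value value : prefixNode.getProperty("suggestions").getValues()) {
                result.add(value.getString());
            }
        } catch (PathNotFoundException e) {
            // nothing stored yet for this prefix
        }
        return result;
    }
}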
Ah, you suggest manually keeping track of the 'auto-suggest' list, right?
Just read them all in once, have an observer for changes, et voila. That
works, but I want to deliver the feature in a different way: expose the
Lucene term enum as a virtual hierarchical node tree, where every node is
a single letter (a rough sketch of such a prefix scan is at the bottom of
this mail). This is very efficient and easy to build once virtual layers
are up and running. The only thing I am still struggling with is Lucene
stemming: the term enum then contains stemmed words. OTOH, imo, the whole
stemming concept in Lucene has been broken from the start; I never advise
stemming. Removing diacritics is enough. (Lucene 4.0 won't need stemming
at all any more, as you can do everything with fuzzy searches thanks to a
new bleeding-edge automaton query... first upgrade Jackrabbit, however :-))

Regards Ard

> How this index is built in the first place depends on the use case. For
> example, Google search shows you terms that are currently popular, so
> they probably update that index based on query statistics, say once or
> twice a day. To start, you can use a dictionary, filter out stop words
> like "the", "and" etc. and build that index automatically. Then you only
> get single words - Google also shows full searches, like "jack wolfskin".
> And there are probably many other sources you can build such an index from.
>
> Hope that helps,
> Alex
>
> --
> Alexander Klimetschek
> Developer // Adobe (Day) // Berlin - Basel
>

--
Hippo Europe • Amsterdam Oosteinde 11 • 1017 WT Amsterdam • +31 (0)20 522 4466
USA • San Francisco 185 H Street Suite B • Petaluma CA 94952-5100 • +1 (707) 773 4646
Canada • Montréal 5369 Boulevard St-Laurent • Montréal QC H2T 1S5 • +1 (514) 316 8966
www.onehippo.com • www.onehippo.org • i...@onehippo.com
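(For completeness: a rough sketch of the term-enum prefix scan described
above, using the pre-4.0 Lucene TermEnum API. The field name, the cap of
10 results and the helper class are illustrative assumptions, not existing
Jackrabbit code.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TermEnumSuggestions {

    // Enumerate index terms of one field that start with the given prefix,
    // i.e. the same walk a virtual "one letter per node" tree would expose.
    public static List<String> suggest(IndexReader reader, String field, String prefix)
            throws IOException {
        List<String> result = new ArrayList<String>();
        // the enum returned by terms(Term) is already positioned at the
        // first term on or after the given prefix
        TermEnum terms = reader.terms(new Term(field, prefix));
        try {
            do {
                Term term = terms.term();
                if (term == null || !term.field().equals(field)
                        || !term.text().startsWith(prefix)) {
                    break;
                }
                result.add(term.text());
            } while (result.size() < 10 && terms.next());
        } finally {
            terms.close();
        }
        return result;
    }
}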