On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:

> Are you sure you really want to throw out stopwords for your use case? I
> don't think autocompletion will work how you want if you do.
In our case I think it makes sense. The content is targeting the electronic music / DJ scene, so we have a lot of words like "DJ" or "featuring" which make sense to throw out of the query. Also, searches for "the beastie boys" and "beastie boys" should both return a match in the autocompletion.

> And if you don't... then why use the WhitespaceTokenizer and then try to jam
> the tokens back together? Why not just NOT tokenize in the first place. Use
> the KeywordTokenizer, which really should be called the
> NonTokenizingTokenizer, because it doesn't tokenize at all, it just creates
> one token from the entire input string.

I started out with the KeywordTokenizer, which worked well except for the stopword problem. For now, I've come up with a quick-and-dirty custom "ConcatFilter", which does what I'm after:

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

    public class ConcatFilter extends TokenFilter {

        private final TokenStream tstream;

        protected ConcatFilter(TokenStream input) {
            super(input);
            this.tstream = input;
        }

        @Override
        public Token next() throws IOException {
            Token token = new Token();
            StringBuilder builder = new StringBuilder();

            TermAttribute termAttribute =
                (TermAttribute) tstream.getAttribute(TermAttribute.class);
            TypeAttribute typeAttribute =
                (TypeAttribute) tstream.getAttribute(TypeAttribute.class);

            // consume the entire upstream stream, appending every "word" token
            boolean incremented = false;
            while (tstream.incrementToken()) {
                if (typeAttribute.type().equals("word")) {
                    builder.append(termAttribute.term());
                }
                incremented = true;
            }

            token.setTermBuffer(builder.toString());
            if (incremented) {
                return token;
            }
            return null; // end of stream
        }
    }

I'm not sure if this is a safe way to do this, as I'm not familiar with the whole Solr/Lucene implementation after all.

best

-robert

> Then lowercase, remove whitespace (or not), do whatever else you want to do
> to your single token to normalize it, and then edgengram it.
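Independent of the Lucene filter APIs, the normalization this chain is aiming at (lowercase, drop stopwords, strip everything except a-z, concatenate) can be sketched in plain Java. This is only an illustration of the string logic, not the filter itself; the stopword list here is a made-up example standing in for stopwords.txt:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ConcatDemo {

    // hypothetical stopword list; in Solr this would come from stopwords.txt
    static final Set<String> STOPWORDS =
        new HashSet<String>(Arrays.asList("the", "a", "dj", "featuring"));

    // lowercase, split on whitespace, drop stopwords,
    // strip non a-z characters, and concatenate what remains
    static String normalize(String input) {
        StringBuilder builder = new StringBuilder();
        for (String token : input.toLowerCase().split("\\s+")) {
            if (STOPWORDS.contains(token)) {
                continue;
            }
            builder.append(token.replaceAll("[^a-z]", ""));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("The Beastie Boys")); // beastieboys
        System.out.println(normalize("Beastie Boys"));     // beastieboys
        System.out.println(normalize("DJ Shadow"));        // shadow
    }
}
```

With this, "the beastie boys" and "beastie boys" both normalize to the same single token, which is exactly what the autocompletion case needs.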
> If you include whitespace in the token, then when making your queries for
> auto-complete, be sure to use a query parser that doesn't do
> "pre-tokenization"; the 'field' query parser should work well for this.
>
> Jonathan
>
> ________________________________________
> From: Robert Gründler [rob...@dubture.com]
> Sent: Wednesday, November 10, 2010 6:39 PM
> To: solr-user@lucene.apache.org
> Subject: Concatenate multiple tokens into one
>
> Hi,
>
> i've created the following filterchain in a field type, the idea is to use it
> for autocompletion purposes:
>
> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens separated by whitespace -->
> <filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> <!-- throw out stopwords -->
> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/> <!-- throw out everything except a-z -->
>
> <!-- actually, here i would like to join multiple tokens together again, to provide one token for the EdgeNGramFilterFactory -->
>
> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/> <!-- create edgeNGram tokens for autocomplete matches -->
>
> With that kind of filterchain, the EdgeNGramFilterFactory will receive
> multiple tokens on input strings with whitespace in them. This leads to the
> following results:
>
> Input Query: "George Cloo"
>
> Matches:
> - "George Harrison"
> - "John Clooridge"
> - "George Smith"
> - "George Clooney"
> - etc.
>
> However, only "George Clooney" should match in the autocompletion use case.
> Therefore, i'd like to add a filter before the EdgeNGramFilterFactory which
> concatenates all the tokens generated by the WhitespaceTokenizerFactory.
> Are there filters which can do such a thing?
>
> If not, are there examples of how to implement a custom TokenFilter?
>
> thanks!
>
> -robert
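To see why concatenating into a single token fixes the "George Cloo" problem, here is a small sketch of edge-n-gram prefix matching in plain Java. It only mimics what EdgeNGramFilterFactory does to one token (prefixes from minGramSize to maxGramSize); it is not the Solr implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramDemo {

    // produce the edge n-grams (prefixes) of a single token,
    // analogous to EdgeNGramFilterFactory with front-edge grams
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        int max = Math.min(maxGram, token.length());
        for (int len = minGram; len <= max; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // the single concatenated token, as a ConcatFilter would emit it
        String indexed = "georgeclooney";
        // "George Cloo" run through the same normalization
        String query = "georgecloo";

        // the indexed grams include the whole query as one prefix
        System.out.println(edgeNGrams(indexed, 1, 25).contains(query));          // true
        // "georgeharrison" never yields the gram "georgecloo"
        System.out.println(edgeNGrams("georgeharrison", 1, 25).contains(query)); // false
    }
}
```

With whitespace-separated tokens, "george" and "clooney" each get their own grams, so "george" alone matches "George Harrison" and "George Smith"; with one concatenated token, only "georgeclooney" produces the gram "georgecloo".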