On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:

> Are you sure you really want to throw out stopwords for your use case?  I 
> don't think autocompletion will work how you want if you do. 

In our case I think it makes sense. The content is targeting the electronic
music / DJ scene, so we have a lot of words like "DJ" or "featuring" which
make sense to throw out of the query. Also, searches for "the beastie boys" and
"beastie boys" should both return a match in the autocompletion.
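The effect I'm after can be sketched outside of Solr. This is a hypothetical standalone demo (the class, method, and stopword list are made up for illustration, not Solr API): dropping stopwords and concatenating what remains makes both phrasings collapse to the same autocomplete key.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Standalone sketch: lowercase, strip non-letters, drop stopwords,
// then concatenate the surviving tokens into one key.
public class StopwordConcatDemo {

    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "dj", "featuring"));

    static String normalize(String input) {
        StringBuilder key = new StringBuilder();
        for (String token : input.toLowerCase().split("\\s+")) {
            token = token.replaceAll("[^a-z]", "");   // keep a-z only
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                key.append(token);                    // concatenate tokens
            }
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("The Beastie Boys"));  // beastieboys
        System.out.println(normalize("beastie boys"));      // beastieboys
    }
}
```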

> 
> And if you don't... then why use the WhitespaceTokenizer and then try to jam 
> the tokens back together? Why not just NOT tokenize in the first place. Use 
> the KeywordTokenizer, which really should be called the 
> NonTokenizingTokenizer, because it doesn't tokenize at all, it just creates 
> one token from the entire input string. 

I started out with the KeywordTokenizer, which worked well except for the
stopword problem.

For now, I've come up with a quick-and-dirty custom "ConcatFilter", which does 
what I'm after:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public final class ConcatFilter extends TokenFilter {

    private final TermAttribute termAttribute;
    private final TypeAttribute typeAttribute;
    private boolean exhausted = false;

    protected ConcatFilter(TokenStream input) {
        super(input);
        // addAttribute() returns the attribute instances already
        // registered on the stream, no cast needed
        this.termAttribute = addAttribute(TermAttribute.class);
        this.typeAttribute = addAttribute(TypeAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (exhausted) {
            return false;  // we emit exactly one token per stream
        }
        exhausted = true;

        StringBuilder builder = new StringBuilder();
        boolean sawToken = false;

        // consume the entire upstream token stream and join the terms
        while (input.incrementToken()) {
            if ("word".equals(typeAttribute.type())) {
                builder.append(termAttribute.term());
            }
            sawToken = true;
        }

        if (!sawToken) {
            return false;  // empty input: emit nothing
        }

        clearAttributes();
        termAttribute.setTermBuffer(builder.toString());
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        exhausted = false;
    }
}

I'm not sure if this is a safe way to do it, as I'm not that familiar with the
solr/lucene internals.
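As a sanity check on the overall idea, here is a hypothetical standalone sketch (not Solr API, all names made up) of what the EdgeNGramFilterFactory effectively does once it receives a single concatenated token: only prefixes of the whole phrase get indexed, so "georgecloo" matches "georgeclooney" but not "georgeharrison".

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of edge n-gramming a single concatenated token.
public class EdgeNGramDemo {

    static List<String> edgeNGrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(max, token.length());
        for (int len = min; len <= upper; len++) {
            grams.add(token.substring(0, len));  // prefixes only
        }
        return grams;
    }

    public static void main(String[] args) {
        // "george clooney" concatenated to one token before n-gramming:
        List<String> grams = edgeNGrams("georgeclooney", 1, 25);
        System.out.println(grams.contains("georgecloo")); // true
        // With whitespace tokenization, "cloo..." grams from "Clooridge"
        // would also be indexed, which is why "John Clooridge" matched.
    }
}
```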


best


-robert




> 
> Then lowercase, remove whitespace (or not), do whatever else you want to do 
> to your single token to normalize it, and then edgengram it. 
> 
> If you include whitespace in the token, then when making your queries for 
> auto-complete, be sure to use a query parser that doesn't do 
> "pre-tokenization", the 'field' query parser should work well for this. 
> 
> Jonathan
> 
> 
> 
> ________________________________________
> From: Robert Gründler [rob...@dubture.com]
> Sent: Wednesday, November 10, 2010 6:39 PM
> To: solr-user@lucene.apache.org
> Subject: Concatenate multiple tokens into one
> 
> Hi,
> 
> I've created the following filterchain in a field type, the idea is to use it 
> for autocompletion purposes:
> 
> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <!-- create tokens 
> separated by whitespace -->
> <filter class="solr.LowerCaseFilterFactory"/> <!-- lowercase everything -->
> <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" enablePositionIncrements="true" />  <!-- throw out 
> stopwords -->
> <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" 
> replacement="" replace="all" />  <!-- throw out everything except a-z -->
> 
> <!-- actually, here i would like to join multiple tokens together again, to 
> provide one token for the EdgeNGramFilterFactory -->
> 
> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" 
> /> <!-- create edgeNGram tokens for autocomplete matches -->
> 
> With that kind of filterchain, the EdgeNGramFilterFactory will receive 
> multiple tokens on input strings with whitespaces in it. This leads to the 
> following results:
> Input Query: "George Cloo"
> Matches:
> - "George Harrison"
> - "John Clooridge"
> - "George Smith"
> - "George Clooney"
> - etc.
> 
> However, only "George Clooney" should match in the autocompletion use case.
> Therefore, I'd like to add a filter before the EdgeNGramFilterFactory, which 
> concatenates all the tokens generated by the WhitespaceTokenizerFactory.
> Are there filters which can do such a thing?
> 
> If not, are there examples of how to implement a custom TokenFilter?
> 
> thanks!
> 
> -robert
> 
> 
> 
> 
