I'd like to use the filter factories in the org.apache.solr.analysis package to tokenize text in a separate application. I need to chain a tokenizer and several token filters together, the way Solr does during indexing and query parsing. I looked into the TokenizerChain class for this and have a working tokenization chain, but I essentially hacked together something that happened to work, so I'm wondering whether there is an established way to do this. Below is a code snippet; any advice would be appreciated.

Dependencies: solr-core-1.4.0, lucene-core-2.9.3, lucene-snowball-2.9.3. I am not tied to these and could use different versions.

P.S. Is this more of a question for the solr-dev mailing list?
<code>
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.solr.analysis.*;

// Tokenizer: split the input on whitespace.
TokenizerFactory tokenizer = new WhitespaceTokenizerFactory();

// Snowball stemmer; with no "language" arg the factory uses its default stemmer.
Map<String, String> args = new HashMap<String, String>();
SnowballPorterFilterFactory porterFilter = new SnowballPorterFilterFactory();
porterFilter.init(args);

// Word delimiter: split on intra-word delimiters and also emit catenated word/number parts.
args = new HashMap<String, String>();
args.put("generateWordParts", "1");
args.put("generateNumberParts", "1");
args.put("catenateWords", "1");
args.put("catenateNumbers", "1");
args.put("catenateAll", "0");
WordDelimiterFilterFactory wordFilter = new WordDelimiterFilterFactory();
wordFilter.init(args);

LowerCaseFilterFactory lowercaseFilter = new LowerCaseFilterFactory();

// Filters are applied in array order: word delimiter, then lowercase, then stemming.
TokenFilterFactory[] filters = new TokenFilterFactory[] { wordFilter, lowercaseFilter, porterFilter };
TokenizerChain chain = new TokenizerChain(tokenizer, filters);

// In my application "builder" holds the text to analyze; sample input shown here.
StringBuilder builder = new StringBuilder("Wi-Fi routers are CHEAP");
TokenStream stream = chain.tokenStream(null, new StringReader(builder.toString()));
TermAttribute tm = (TermAttribute) stream.getAttribute(TermAttribute.class);
while (stream.incrementToken()) { // incrementToken() throws IOException
    System.out.println(tm.term());
}
</code>
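In case it helps frame the question: from reading the Solr 1.4 source, TokenizerChain appears to itself be a Lucene Analyzer, and its tokenStream method just calls create() on the tokenizer factory and wraps the result with each filter factory in order. So the hand-rolled alternative I considered was a small Analyzer along these lines (untested sketch; ChainAnalyzer is just a name I made up, and as far as I can tell it duplicates what TokenizerChain does internally):

<code>
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.TokenFilterFactory;
import org.apache.solr.analysis.TokenizerFactory;

// Hypothetical class: packages an already-init()ed factory chain as a plain Analyzer.
public class ChainAnalyzer extends Analyzer {
    private final TokenizerFactory tokenizer;
    private final TokenFilterFactory[] filters;

    public ChainAnalyzer(TokenizerFactory tokenizer, TokenFilterFactory[] filters) {
        this.tokenizer = tokenizer;
        this.filters = filters;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Build the chain the way Solr does: tokenizer first, then each filter in order.
        TokenStream stream = tokenizer.create(reader);
        for (TokenFilterFactory filter : filters) {
            stream = filter.create(stream);
        }
        return stream;
    }
}
</code>

That would let me reuse the chain anywhere a plain Analyzer is expected (IndexWriter, QueryParser, etc.), but it re-implements exactly what TokenizerChain already provides, which is why I'm asking whether using TokenizerChain directly is the accepted approach.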