Removing whitespace
Hello,

I am having trouble finding how to remove/ignore whitespace when indexing. The only answer I have found suggested that it is necessary to write my own tokenizer. Is this true? I want to remove whitespace and special characters from the phrase and create N-grams from the result.

Ultimately, the effect I am after is that searching "bobdole" would match "Bob Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way... can anyone lend some assistance?

Thanks!
Dev B
Re: Removing whitespace
That sounds like a strange requirement, but I think you can use CharFilters instead of implementing your own Tokenizer. Take a look at this section; it may help:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories

On Mon, Dec 12, 2011 at 4:51 PM, Devon Baumgarten dbaumgar...@nationalcorp.com wrote:
> I want to remove whitespace and special characters from the phrase and create N-grams from the result.

--
Alireza Salimi
Java EE Developer
RE: Removing whitespace
Hi Devon,

Something like this should work for you (untested!):

    <analyzer>
      <!-- Remove non-word characters; only underscores, letters, and numbers allowed -->
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
    </analyzer>

Steve

-----Original Message-----
From: Devon Baumgarten [mailto:dbaumgar...@nationalcorp.com]
Sent: Monday, December 12, 2011 4:52 PM
To: 'solr-user@lucene.apache.org'
Subject: Removing whitespace

> I want to remove whitespace and special characters from the phrase and create N-grams from the result.
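To see why the analyzer chain suggested above would let "bobdole" match "Bob Dole", here is a minimal Python sketch of the same three steps: strip non-word characters, lowercase, and cut the result into 2-grams. This is not Solr code. The function name `analyze` is made up for illustration, and `re.ASCII` is an assumption meant to approximate how Java's `\W` behaves by default. The order in which Solr actually emits the grams may differ.

```python
import re

# Assumption: ASCII-only \W, to approximate Java's default regex behaviour.
NON_WORD = re.compile(r"\W+", re.ASCII)


def analyze(text, min_gram=2, max_gram=2):
    """Mimic the chain: char filter -> keyword tokenizer -> lowercase -> n-grams."""
    # Char filter step: delete every run of non-word characters.
    collapsed = NON_WORD.sub("", text)
    # Keyword tokenizer keeps the whole string as one token; then lowercase it.
    token = collapsed.lower()
    # N-gram step: emit every substring of each allowed size.
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


if __name__ == "__main__":
    for name in ("Bob Dole", "Bo B. Dole", "bobdole", "Bobdo"):
        print(name, "->", analyze(name))
```

In this sketch "Bob Dole", "Bo B. Dole", and "bobdole" all produce the same grams (bo, ob, bd, do, ol, le), and "Bobdo" produces a subset of them. That is the overlap the original question asks for, assuming the same analysis is applied at both index and query time.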
Re: Removing whitespace
(11/12/13 6:51), Devon Baumgarten wrote:
> I want to remove whitespace and special characters from the phrase and create N-grams from the result.

How about using one of the existing CharFilters?

https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html
https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/MappingCharFilterFactory.html

koji
--
Check out Query Log Visualizer for Apache Solr
http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html
http://www.rondhuit.com/en/
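The two CharFilters linked above differ mainly in how the characters to drop are described: one takes a regular expression, the other an explicit mapping table. As a rough sketch of the mapping idea only (plain Python, not Solr), the table below maps a few characters to the empty string. The table contents and the name `apply_mapping` are hypothetical, and the sketch handles single characters only, which simplifies what a real mapping filter can do.

```python
# Hypothetical mapping table: each listed character is replaced by "" (deleted).
MAPPING = {" ": "", ".": "", ",": "", "-": "", "'": ""}


def apply_mapping(text, mapping=MAPPING):
    """Replace each character found in the table; leave all others unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(apply_mapping("Bo B. Dole"))
```

The trade-off is that every unwanted character has to be listed explicitly, whereas a pattern such as `\W+` covers them all at once.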
RE: Removing whitespace
Thanks Alireza, Steven and Koji for the quick responses! I'll read up on those and give it a shot.

Devon Baumgarten

-----Original Message-----
From: Alireza Salimi [mailto:alireza.sal...@gmail.com]
Sent: Monday, December 12, 2011 4:08 PM
To: solr-user@lucene.apache.org
Subject: Re: Removing whitespace

> I think you can use CharFilters instead of implementing your own Tokenizer.