Removing whitespace

2011-12-12 Thread Devon Baumgarten
Hello,

I am having trouble finding how to remove/ignore whitespace when indexing. The 
only answer I have found suggested that it is necessary to write my own 
tokenizer. Is this true? I want to remove whitespace and special characters 
from the phrase and create N-grams from the result.

Ultimately, the effect I am after is that searching "bobdole" would match
"Bob Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way...
can anyone lend some assistance?

Thanks!

Dev B



Re: Removing whitespace

2011-12-12 Thread Alireza Salimi
That sounds like a strange requirement, but I think you can use CharFilters
instead of implementing your own Tokenizer.
Take a look at this section; it may help.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories




-- 
Alireza Salimi
Java EE Developer


RE: Removing whitespace

2011-12-12 Thread Steven A Rowe
Hi Devon,

Something like this should work for you (untested!):

<analyzer>
  <!-- Remove non-word characters; only underscores, letters and numbers allowed -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
</analyzer>
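
To see what such a chain would emit, here is a rough Python stand-in (not Solr
code, and equally untested against a real index). It assumes the chain behaves
as described above: non-word characters are stripped, the whole value is kept
as a single token, lowercased, and split into 2-grams. The function name
analyze is made up for this illustration, and Python's \W is Unicode-aware,
which may differ slightly from the Java regex the char filter uses.

```python
import re


def analyze(text, gram_size=2):
    """Hypothetical emulation of the analyzer chain; not Solr's implementation."""
    # charFilter: drop every run of non-word characters (whitespace, punctuation)
    stripped = re.sub(r"\W+", "", text)
    # KeywordTokenizer keeps the whole value as one token; then lowercase it
    token = stripped.lower()
    # NGramFilter with minGramSize == maxGramSize == gram_size
    return [token[i:i + gram_size] for i in range(len(token) - gram_size + 1)]


print(analyze("Bob Dole"))    # ['bo', 'ob', 'bd', 'do', 'ol', 'le']
print(analyze("Bo B. Dole"))  # ['bo', 'ob', 'bd', 'do', 'ol', 'le']
print(analyze("Bobdo"))       # ['bo', 'ob', 'bd', 'do']
```

With the same analysis applied at query time, "bobdole" produces the same
grams as "Bob Dole" and "Bo B. Dole", and "Bobdo" shares four of the six.
Whether a partial overlap like that counts as a match depends on how the
query is built, which this sketch does not model.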

Steve




Re: Removing whitespace

2011-12-12 Thread Koji Sekiguchi


How about using one of the existing CharFilters?

https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html

https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/MappingCharFilterFactory.html
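
The difference between the two, roughly: the pattern-based filter removes
whatever a regex matches, while the mapping-based one replaces only the
characters you list explicitly. A small Python stand-in for the mapping idea
(not Solr code; the table contents and the name strip_mapped are assumptions
made up for this illustration):

```python
# Hypothetical mapping table: every listed character maps to "nothing".
# In Solr this would live in a mapping file; here it is a translate table.
REMOVE = {ord(ch): None for ch in " \t.,-'"}


def strip_mapped(text):
    """Delete only the explicitly listed characters, leaving the rest as-is."""
    return text.translate(REMOVE)


print(strip_mapped("Bo B. Dole"))  # BoBDole
```

Anything not in the table passes through untouched, so this is the more
conservative option if only a known set of characters should be dropped.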

koji
--
Check out Query Log Visualizer for Apache Solr
http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html
http://www.rondhuit.com/en/


RE: Removing whitespace

2011-12-12 Thread Devon Baumgarten
Thanks Alireza, Steven and Koji for the quick responses!

I'll read up on those and give it a shot.

Devon Baumgarten
