WordDelimiterFilterFactory will _almost_ do what you want
by setting things like catenateWords=0 and catenateNumbers=1,
_except_ that the punctuation will be removed. So
12.34 -> 1234
ab,cd -> ab cd
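For reference, the setting I mean would look something like this in schema.xml (a sketch only; the fieldType name and tokenizer choice are just examples). generateWordParts/generateNumberParts/catenateWords/catenateNumbers are the standard WordDelimiterFilterFactory parameters:

```xml
<!-- Sketch: catenateNumbers=1 glues "12.34" back together as "1234",
     but the "." itself is lost. Names here are illustrative. -->
<fieldType name="text_num" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="0"
            catenateWords="0"
            catenateNumbers="1"/>
  </analyzer>
</fieldType>
```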

is that "close enough"?

Otherwise, writing a simple Filter is probably the way to go.
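The rule your custom Filter would need to implement — keep "." and "," only when flanked by digits on both sides, otherwise treat them as delimiters — can be sketched in a few lines of plain Python (illustrative logic only, not Solr/Lucene code; the regex and function name are my own):

```python
import re

# Keep "." and "," inside a token only when digits sit on both sides;
# everything else (punctuation, whitespace) acts as a delimiter.
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|[A-Za-z]+")

def tokenize(text):
    """Return tokens, preserving digit-internal '.' and ','."""
    return TOKEN.findall(text)

print(tokenize("12.34"))   # ['12.34']
print(tokenize("12,345"))  # ['12,345']
print(tokenize("ab,cd"))   # ['ab', 'cd']
```

Porting that logic into a Lucene TokenFilter (or a custom tokenizer) is the part that needs the plugin.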

Best
Erick

On Wed, Apr 11, 2012 at 1:59 PM, Jian Xu <joseph...@yahoo.com> wrote:
> Hello,
>
> I am new to solr/lucene. I am tasked to index a large number of documents. 
> Some of these documents contain decimal points. I am looking for a way to 
> index these documents so that adjacent numeric characters (such as [0-9.,]) 
> are treated as a single token. For example,
>
> 12.34 => "12.34"
> 12,345 => "12,345"
>
> However, "," and "." should be treated as usual when around non-digital 
> characters. For example,
>
> ab,cd => "ab" "cd".
>
> It is so that searching for "12.34" will match "12.34" not "12 34". Searching 
> for "ab.cd" should match both "ab.cd" and "ab cd".
>
> After doing some research on solr, it seems that there is a built-in filter 
> called solr.WordDelimiterFilter that supports a "types" attribute which maps 
> special characters to different delimiter types.  However, it isn't exactly 
> what I want. It doesn't provide a context check such as "," or "." must be 
> surrounded by digit characters, etc.
>
> Does anyone have any experience configuring solr to meet these requirements?  
> Is writing my own plugin necessary for this simple thing?
>
> Thanks in advance!
>
> -Jian
