Re: Trouble Configuring WordDelimiterFilterFactory

2009-11-30 Thread Erick Erickson
I think the problem here is that the tokenizer underlying
WordDelimiterFilterFactory is StandardTokenizer, at least that's what I
infer from here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

I think you want to use a different tokenizer, because StandardTokenizer
may be stripping the decimal from .355. But that's just a guess. You'll get
more info if you examine your index and see what's *really* indexed in
these fields.

Best
Erick

On Sun, Nov 29, 2009 at 10:31 AM, Rahul R wrote:

> Steve,
> My settings for both index and query are :
>  generateNumberParts="0" catenateWords="1" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
> preserveOriginal="1"/>
>
> Let me give an example. Suppose I have the following 2 documents:
> Document 1(Text Field): Bridge-Diode .355 Volts
> Document 2(Text Field): Bridge-Diode 355 Volts
>
> Requirement : Search for ".355" should retrieve only document 1 (Not
> happening now)
> Requirement: Search for "Bridge" should retrieve both documents (Works as
> expected)
>
> The reason why a search for ".355" is retrieving both documents is that
> term texts for .355 in the document are created as .355 and 355. Even if
> I set generateWordParts and catenateWords to "0", the way term texts are
> created for ".355" does not change.
>
> Thank you for your time.
>
> Regards
> Rahul


Re: Trouble Configuring WordDelimiterFilterFactory

2009-11-29 Thread Rahul R
Steve,
My settings for both index and query are :
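 generateNumberParts="0" catenateWords="1" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
preserveOriginal="1"/>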

Let me give an example. Suppose I have the following 2 documents:
Document 1(Text Field): Bridge-Diode .355 Volts
Document 2(Text Field): Bridge-Diode 355 Volts

Requirement : Search for ".355" should retrieve only document 1 (Not
happening now)
Requirement: Search for "Bridge" should retrieve both documents (Works as
expected)

The reason why a search for ".355" is retrieving both documents is that term
texts for .355 in the document are created as .355 and 355. Even if I set
generateWordParts and catenateWords to "0", the way term texts are created
for ".355" does not change.
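
To make the behaviour described above concrete, here is a minimal Python
sketch. It is a toy model that imitates the term texts reported in this
thread, not Solr's actual implementation. The function names
(toy_word_delimiter, analyze, search), the whitespace tokenization, and the
generateWordParts="1" setting are all assumptions made for illustration.

```python
import re


def toy_word_delimiter(token):
    """Toy stand-in for the reported filter behaviour; not Solr's code."""
    terms = {token}  # preserveOriginal="1": the unmodified token is kept
    # Trim leading/trailing non-alphanumerics, then split on the rest.
    core = re.sub(r"^[^A-Za-z0-9]+|[^A-Za-z0-9]+$", "", token)
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", core) if p]
    if len(parts) == 1:
        # Single part: the trimmed token is emitted too (".355" -> "355").
        terms.add(parts[0])
    elif parts and all(p.isalpha() for p in parts):
        # Assumed generateWordParts="1" together with catenateWords="1".
        terms.update(parts)
        terms.add("".join(parts))
    # Multi-part numeric tokens such as "3.55" keep only the original
    # (generateNumberParts="0", catenateNumbers="0").
    return terms


def analyze(text):
    """Apply the toy filter to every whitespace-separated token."""
    terms = set()
    for token in text.split():  # whitespace tokenization is an assumption
        terms |= toy_word_delimiter(token)
    return terms


def search(query, docs):
    """Return ids of documents sharing at least one term with the query."""
    query_terms = analyze(query)
    return [doc_id for doc_id, text in docs.items()
            if query_terms & analyze(text)]


docs = {1: "Bridge-Diode .355 Volts", 2: "Bridge-Diode 355 Volts"}
print(search(".355", docs))    # [1, 2]
print(search("Bridge", docs))  # [1, 2]
```

In this model the query ".355" is itself expanded to ".355" and "355", and
the bare "355" is the term that matches document 2.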

Thank you for your time.

Regards
Rahul

On Sun, Nov 29, 2009 at 1:07 AM, Steven A Rowe wrote:

> Hi Rahul,
>
> On 11/26/2009 at 12:53 AM, Rahul R wrote:
> > Is there a way by which I can prevent the WordDelimiterFilterFactory
> > from totally acting on numerical data ?
>
> "prevent ... from totally acting on" is pretty vague, and nowhere AFAICT do
> you say precisely what it is you want.
>
> It would help if you could give example text and the terms you think should
> be the result of analysis of the text.  If you want different index/query
> time behavior, please provide this info for both.
>
> Steve
>
>


RE: Trouble Configuring WordDelimiterFilterFactory

2009-11-28 Thread Steven A Rowe
Hi Rahul,

On 11/26/2009 at 12:53 AM, Rahul R wrote:
> Is there a way by which I can prevent the WordDelimiterFilterFactory
> from totally acting on numerical data ?

"prevent ... from totally acting on" is pretty vague, and nowhere AFAICT do you 
say precisely what it is you want.

It would help if you could give example text and the terms you think should be 
the result of analysis of the text.  If you want different index/query time 
behavior, please provide this info for both.

Steve



Re: Trouble Configuring WordDelimiterFilterFactory

2009-11-25 Thread Rahul R
Hello,
I would really appreciate any input or suggestions on this. Thank you.



On Tue, Nov 24, 2009 at 10:59 PM, Rahul R wrote:

> Hello,
> In our application we have a catch-all field (the 'text' field) which is
> configured as the default search field. This field holds a combination of
> numbers, letters, special characters, etc. I have a requirement that the
> WordDelimiterFilterFactory must not act on numbers, especially those with
> decimal points. Accuracy of results for numerical data is quite important.
> So if the text field of a document has data like "Bridge-Diode 3.55 Volts",
> I want to make sure that a search for "355" or "35.5" does not retrieve
> this document. I found that the following setting for the
> WordDelimiterFilterFactory works for me (for the most part):
>  generateNumberParts="0" catenateWords="1" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
> preserveOriginal="1"/>
>
> I am using the same setting for both index and query.
>
> The only remaining problem is with data like ".355". With the above
> setting, the analysis JSP shows me that WordDelimiterFilterFactory is
> creating term texts as both ".355" and "355". So a search for ".355"
> retrieves documents containing both ".355" and "355". A search for "355"
> also has the same effect. I noticed that when the entry for the
> WordDelimiterFilterFactory was completely removed (both index and query),
> the above problem was resolved. But this seems too harsh a measure.
>
> Is there a way by which I can prevent the WordDelimiterFilterFactory from
> totally acting on numerical data ?
>
> Regards
> Rahul
>