Re: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires

2013-04-30 Thread meghana
Thanks, Jack Krupansky. It's very helpful :)






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557p4060011.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires

2013-04-24 Thread meghana
I have configured WordDelimiterFilterFactory with custom types for '&'
and '-', and for a few delimiters (like . _ :) we need to split on boundaries
only.

e.g.
test.com (should be tokenized to test.com)
newyear.  (should be tokenized to newyear)
new_car (should be tokenized to new_car)
..
..

Below is the definition of the text field:

<fieldType name="text_general_preserved" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="0"
            protected="protwords_general.txt"
            types="wdfftypes_general.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="0"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="0"
            protected="protwords_general.txt"
            types="wdfftypes_general.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Below is the content of wdfftypes_general.txt:

& => ALPHA
- => ALPHA
_ => SUBWORD_DELIM
: => SUBWORD_DELIM
. => SUBWORD_DELIM

The types that can be used in the word delimiter filter are LOWER, UPPER, ALPHA,
DIGIT, ALPHANUM, and SUBWORD_DELIM. There is no description available for the
use of each type. Going by the name, I thought SUBWORD_DELIM might fulfill my
need, but it doesn't seem to work.

Can anybody suggest how I can configure the word delimiter filter factory
to fulfill my requirement?

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-WordDelimiterFactory-with-Custom-Tokenizer-to-split-only-on-Boundires-tp4058557.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires

2013-04-24 Thread Jack Krupansky
The WDF types will treat a character the same regardless of where it 
appears.


For something conditional, like a dot between letters vs. a dot not preceded
and followed by a letter, you need either a custom tokenizer or a
character filter.
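
To illustrate why a per-character type cannot express a positional rule, here is a toy Python model. It is only a sketch and not the real WordDelimiterFilter: the function `split_with_types` and the two mappings are made up for illustration, and the only behavior modeled is "a character typed SUBWORD_DELIM always splits".

```python
def split_with_types(token, types):
    """Toy model: a character typed SUBWORD_DELIM splits wherever it occurs."""
    parts, current = [], []
    for ch in token:
        if types.get(ch) == "SUBWORD_DELIM":
            # Position is never consulted, so an embedded dot and a
            # trailing dot are handled identically.
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


# Two hypothetical type mappings for the dot.
as_delim = {".": "SUBWORD_DELIM"}
as_alpha = {".": "ALPHA"}

for token in ("test.com", "newyear."):
    print(token, split_with_types(token, as_delim), split_with_types(token, as_alpha))
```

Under this model neither mapping produces both test.com and newyear: typing the dot as a delimiter splits test.com, and typing it as ALPHA keeps the trailing dot on newyear.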


Interesting that although the standard tokenizer messes up embedded hyphens, 
it does handle the embedded dot vs. trailing dot case as you wish (but 
messes up U.S.A. by stripping the trailing dot) - but that doesn't help 
your case.


A character filter like the following might help your case:

<fieldType name="text_ws_dot" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(^|[^\w\d])[\._&amp;]+($|[^\w\d])" replacement="$1 $2"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(^|[^\w\d])[\._&amp;]+([\w\d])" replacement="$1 $2"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I'm not a regular expression expert, so I'm not sure whether/how those 
patterns could be combined.


Also, that doesn't allow the case of a single ".", "&", or "_" as a word -
but you didn't specify how that case should be handled.
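
On the question of combining the patterns, here is a rough Python sketch, not a tested Solr configuration. It reads the delimiter class as ".", "_" and "&", replays the three passes with Python's `re` module (which only approximates the Java regex engine Solr uses), and compares them with one possible single-pass pattern built on lookarounds. The names `THREE_PASSES`, `COMBINED`, `tokens_three_passes` and `tokens_combined` are invented for this example, and `str.split()` stands in for the whitespace tokenizer.

```python
import re

# The three passes from the message, in Python syntax ($1 becomes \1).
# Note that \w already covers digits and "_", so [\w\d] is the same as \w.
THREE_PASSES = [
    re.compile(r"([\w\d])[._&]+($|[^\w\d])"),     # run right after a word character
    re.compile(r"(^|[^\w\d])[._&]+($|[^\w\d])"),  # run standing on its own
    re.compile(r"(^|[^\w\d])[._&]+([\w\d])"),     # run right before a word character
]

# Hypothetical single-pass alternative: blank out delimiters at the left or
# right edge of a token, leaving runs that sit between word characters.
COMBINED = re.compile(r"(?<![\w.&])[._&]+|[._&]+(?![\w.&])")


def tokens_three_passes(text):
    for pattern in THREE_PASSES:
        text = pattern.sub(r"\1 \2", text)
    return text.split()  # stand-in for WhitespaceTokenizerFactory


def tokens_combined(text):
    return COMBINED.sub(" ", text).split()


sample = "test.com newyear. new_car"
print(tokens_three_passes(sample))
print(tokens_combined(sample))
```

Both variants agree on the examples from the original question. They are not guaranteed to agree on every input (for instance, runs of several embedded delimiters), so any combined pattern should be checked on the analysis page before use.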




-- Jack Krupansky



Re: WordDelimiterFactory

2013-04-19 Thread Erick Erickson
Ashok:

You really, _really_ need to dive into the admin/analysis page.
That'll show you exactly what WDFF (and all the other elements of your
chain) do to input tokens. Understanding the index and query-time
implications of all the settings in WDFF takes a while.

But from what you're describing, WDFF may not be what you're looking
for anyway. Some of the regex filters could split, for instance, on
all non-alphanum characters.

Best
Erick




Re: WordDelimiterFactory

2013-04-19 Thread Ashok
Yes, thank you Erick. The analysis/document handlers hold the key to deciding
the type & order of the filters to employ given one's document set &
subject matter at hand. The finalized terms they produce for SOLR search,
mlt etc. are crucial to the quality of the results.

- ashok



--
View this message in context: 
http://lucene.472066.n3.nabble.com/WordDelimiterFactory-tp4056529p4057349.html
Sent from the Solr - User mailing list archive at Nabble.com.


WordDelimiterFactory

2013-04-16 Thread Ashok
Hi,

Why does WDF swallow all 'words' that start with a 'digit'?

My config is:

<filter class="solr.WordDelimiterFilterFactory" generateNumberParts="0"
        splitOnNumerics="0" splitOnCaseChange="0" preserveOriginal="0"
        protected="protwords.txt"/>

For some text like

20x-30y

I am expecting (& want) '20x' & '30y' to be returned & retained as the
tokens after WDF is done with it. But I get nothing as per the analysis
page.

Any idea why? I am using 4.1

Thanks

- ashok



--
View this message in context: 
http://lucene.472066.n3.nabble.com/WordDelimiterFactory-tp4056529.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: WordDelimiterFactory

2013-04-16 Thread Jack Krupansky

Because you told it to!!! With: generateNumberParts=0

WDF is tricky... tell us exactly what rules you want it to follow and then 
we can tell you how to set the options.


Maybe more to the point: why exactly do you think you want to use WDF? Not
that there aren't good reasons, but what specifically are yours?


Generally, see the schema in the Solr example for suggested best practices. 
Copy and paste from there, or, better yet, use exactly the types that are 
there.


-- Jack Krupansky




Re: WordDelimiterFactory

2013-04-16 Thread Ashok
Thank you Jack, yes it is tricky.

If my text is

x20-y30

I get two nice tokens x20 & y30 that I need to keep.

But the text 20x-30y is treated differently and I get nothing.

20x-y30 gives me just 'y30'

The docs on LucidWorks say

generateNumberParts: (integer, default 1) If non-zero, splits numeric
strings at delimiters: "1947-32" -> "1947", "32"

It looks like any 'word' that starts with a digit is treated as a numeric
string.

Setting generateNumberParts=1 instead of 0 seems to generate the right
tokens in this case, but I need to see if it has any other impact on the
finalized token list...
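
The observations in this message can be reproduced with a small Python toy model. This is a sketch under stated assumptions, not the actual WordDelimiterFilter code: it assumes splitting happens only at non-alphanumeric characters (as with splitOnNumerics=0 and splitOnCaseChange=0) and that each part counts as a number part or a word part according to its first character. The function name `wdf_parts` and its keyword arguments are invented for the example.

```python
import re


def wdf_parts(text, generate_word_parts=True, generate_number_parts=True):
    """Toy model: classify each part by its first character, then filter."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", text) if p]
    kept = []
    for part in parts:
        starts_with_digit = part[0].isdigit()
        if starts_with_digit and generate_number_parts:
            kept.append(part)  # treated as a number part
        elif not starts_with_digit and generate_word_parts:
            kept.append(part)  # treated as a word part
    return kept


for text in ("x20-y30", "20x-30y", "20x-y30"):
    print(text,
          wdf_parts(text, generate_number_parts=False),
          wdf_parts(text, generate_number_parts=True))
```

With number parts disabled, the model keeps x20 and y30, drops 20x and 30y, and keeps only y30 from 20x-y30, which matches what the analysis page showed. Whether the real filter classifies parts this way in every Solr version is an assumption that should be confirmed on the analysis page.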

Thanks

- ashok





--
View this message in context: 
http://lucene.472066.n3.nabble.com/WordDelimiterFactory-tp4056529p4056544.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: WordDelimiterFactory

2013-04-16 Thread Shawn Heisey
On 4/16/2013 8:12 PM, Ashok wrote:
> It looks like any 'word' that starts with a digit is treated as a numeric
> string.
>
> Setting generateNumberParts=1 in stead of 0 seems to generate the right
> tokens in this case but need to see if it has any other impacts on the
> finalized token list...

I have a fieldType that is using WDF with the following settings on the
index side.  Both index and query analysis show it behaving correctly
with terms that start with numbers, on versions 4.2.1 and 3.5.0:

<filter class="solr.WordDelimiterFilterFactory"
  splitOnCaseChange="1"
  splitOnNumerics="1"
  stemEnglishPossessive="1"
  generateWordParts="1"
  generateNumberParts="1"
  catenateWords="1"
  catenateNumbers="1"
  catenateAll="0"
  preserveOriginal="1"
/>

It has different settings on the query side, but generateNumberParts is
1 for both:

<filter class="solr.WordDelimiterFilterFactory"
  splitOnCaseChange="1"
  splitOnNumerics="1"
  stemEnglishPossessive="1"
  generateWordParts="1"
  generateNumberParts="1"
  catenateWords="0"
  catenateNumbers="0"
  catenateAll="0"
  preserveOriginal="0"
/>

I haven't tried it with generateNumberParts set to 0.

Thanks,
Shawn



Regarding WordDelimiterFactory

2010-09-09 Thread Sandhya Agarwal
Hello,

I have a file with the input string "91{40}9490949090", and I want this file
returned when I search for the query string "+91?40?9*". The problem is that
the input string is getting indexed as three terms: "91", "40", and
"9490949090". Is there a way to treat { and } as part of the string itself?
Can we configure WordDelimiterFilterFactory *not to consider* curly braces as
delimiters?

Thanks,
Sandhya



Re: Regarding WordDelimiterFactory

2010-09-09 Thread Robert Muir
On Thu, Sep 9, 2010 at 3:57 AM, Sandhya Agarwal sagar...@opentext.com wrote:

> Hello,
>
> I have a file with the input string 91{40}9490949090, and I wanted to
> return this file when I search for the query string +91?40?9*.  The
> problem is that, the input string is getting indexed as 3 terms 91, 40,
> 9490949090.   Is there a way to consider { and } as part of the string
> itself.  Can we configure WordDelimiterFilterFactory *not to consider* curly
> braces as delimiters?


See: https://issues.apache.org/jira/browse/SOLR-2059
As a workaround, if you don't want to use trunk, you could also turn on
preserveOriginal.
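
A small Python sketch of why preserveOriginal helps with the wildcard query. It is a toy model rather than Solr itself: `index_terms` and `wildcard_hits` are made-up names, splitting is assumed to happen at every non-alphanumeric character, and `fnmatchcase` stands in for Lucene wildcard matching against a single term (the leading "+" of the query is the required-clause operator and is left out of the pattern).

```python
import re
from fnmatch import fnmatchcase


def index_terms(token, preserve_original=False):
    """Toy model: split at non-alphanumerics, optionally keep the whole token."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if preserve_original and parts != [token]:
        return [token] + parts
    return parts


def wildcard_hits(terms, pattern):
    """Return the indexed terms that a ?/* wildcard pattern matches."""
    return [t for t in terms if fnmatchcase(t, pattern)]


token = "91{40}9490949090"
print(wildcard_hits(index_terms(token), "91?40?9*"))
print(wildcard_hits(index_terms(token, preserve_original=True), "91?40?9*"))
```

Without the original token in the index, none of the three parts is long enough to match the pattern; with it preserved, the unsplit term matches.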

-- 
Robert Muir
rcm...@gmail.com


Re: Regarding WordDelimiterFactory

2010-09-09 Thread Grijesh.singh

set splitWordsPart=0,splitNumberPart=0 

-
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Regarding-WordDelimiterFactory-tp1444694p1444742.html
Sent from the Solr - User mailing list archive at Nabble.com.