Re: Strip special chars like -

2011-08-12 Thread roySolr
Erick, you're right. It's working now. My schema looks like this:

<fieldType name="name_type" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="0"
            splitOnNumerics="0" stemEnglishPossessive="0"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="0" catenateNumbers="0"
            splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"/>
  </analyzer>
</fieldType>

Thanks for helping me!!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Strip-special-chars-like-tp3238942p3248545.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Strip special chars like -

2011-08-09 Thread roySolr
Yes, I understand the difference between generateWordParts and catenateWords,
but I can't fix my problem with these options. They don't cover all the
possibilities.


Re: Strip special chars like -

2011-08-09 Thread Erick Erickson
OK, what are the other possibilities that it doesn't fix? Just saying
it won't work without some examples doesn't leave much to
go on...

Best
Erick


Re: Strip special chars like -

2011-08-09 Thread roySolr
OK, there are three query possibilities:

Manchester-united
Manchester united
Manchesterunited

The original name of the club is manchester-united. 


generateWordParts fixes two of these possibilities:

Manchester-united = manchester,united

I can search for Manchester-united and manchester united. When I
search for manchesterunited I get no results.

To fix this I could use catenateWords:

Manchester-united = manchesterunited 

In this situation I can search for Manchester-united and
manchesterunited. When I search for manchester united I get no results.
So the catenateWords option also fixes only two of the three situations.




Re: Strip special chars like -

2011-08-09 Thread lee carroll
Hi, I might be wrong, as I've not tried it out to be sure, but from the wiki docs:

These parameters may be combined in any way.

Example of generateWordParts=1 and catenateWords=1:
"PowerShot" -> 0:"Power", 1:"Shot" 1:"PowerShot"
(where 0,1,1 are token positions)
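
To make the positions in that example concrete, here is a minimal Python sketch of how the two options might combine. It is not Solr code: the function name word_delimiter and its splitting rules (ASCII letters only, split on non-letters and on lower-to-upper case changes) are assumptions for illustration, and the real WordDelimiterFilter has many more options.

```python
import re

def word_delimiter(token, generate_word_parts=True, catenate_words=True):
    """Return (position, term) pairs for one whitespace-delimited token.

    Simplified model, ASCII letters only: a new part starts after any
    non-letter character and at a lower-to-upper case change.
    """
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+", token)
    if len(parts) <= 1:
        # Nothing to split, so the token passes through unchanged.
        return [(0, token)]
    out = []
    if generate_word_parts:
        out.extend(enumerate(parts))
    if catenate_words:
        # The catenated form shares the position of the last emitted part.
        pos = len(parts) - 1 if generate_word_parts else 0
        out.append((pos, "".join(parts)))
    return out

print(word_delimiter("PowerShot"))
```

With both options on, the split parts and the joined form are all emitted, which is why combining them can cover more query variants than either option alone.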

Does that fit the bill?


Re: Strip special chars like -

2011-08-09 Thread Sujit Pal
I have done this using a custom TokenFilter that (among other things)
detects hyphenated words and converts them to the three variations, using a
regex match on the incoming token:
(\w+)-(\w+)

that runs the following regex transform:

s/(\w+)-(\w+)/$1$2__$1 $2/

and then splits by __ and passes the original token, the one word and
two word versions through a SynonymFilter further down the chain (see
Lucene in Action, 2nd Edition for code).
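
The string handling described above can be sketched in a few lines of Python. This is only an illustration of the three variations, not the custom token filter itself: the function name hyphen_variants is hypothetical, the pattern is anchored to the whole token (an assumption), and the variations are built straight from the capture groups rather than by substituting and then splitting on "__".

```python
import re

# Same pattern as above, anchored so it must cover the whole token.
HYPHENATED = re.compile(r"^(\w+)-(\w+)$")

def hyphen_variants(token):
    """Return the original token plus its one-word and two-word forms."""
    m = HYPHENATED.match(token)
    if m is None:
        # Not a simple hyphenated word: pass it through untouched.
        return [token]
    first, second = m.group(1), m.group(2)
    # Same strings the transform $1$2__$1 $2 yields after splitting on "__".
    return [token, first + second, first + " " + second]

print(hyphen_variants("manchester-united"))
```

In a real analysis chain the three strings would be emitted as tokens rather than returned as a list.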

-sujit

On Tue, 2011-08-09 at 06:27 -0700, roySolr wrote:
> Hello,
>
> I have some terms in my index with special characters. An example is
> manchester-united. I want a user to be able to search for
> manchester-united, manchester united and manchesterunited. What's the
> best way to fix this? I have used the patternReplaceFilter and some
> tokenizers, but they couldn't fix the last case (manchesterunited). Can
> someone help me?


Re: Strip special chars like -

2011-08-09 Thread Erick Erickson
That's not what I get. This is for Solr 3.3, but there's no
reason that I know of that other versions should give
different results.


Here's the field def from the 3.3 example; this is just
the standard implementation.

<fieldType name="text_en_splitting" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_en.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

At index time, it produces these tokens for manchester-united:

  pos 1        pos 2
  manchester   united
               manchesterunited

At query time, manchesterunited isn't transformed and matches on the
second row. manchester united and manchester-united both parse to
manchester united and match the first row.
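
The matching logic above can be checked with a small Python model. It is a deliberately crude sketch, not Solr: the helper names analyze and matches are made up, only hyphens are treated as delimiters, stemming and stop words are left out, and a hit just means every query term is in the index (positions and phrase queries are ignored).

```python
def analyze(text, catenate_words):
    """Whitespace-tokenize, lowercase, and split each token on hyphens."""
    terms = []
    for token in text.lower().split():
        parts = [p for p in token.split("-") if p]
        terms.extend(parts)
        if catenate_words and len(parts) > 1:
            # Index-time only in this model: also keep the joined form.
            terms.append("".join(parts))
    return terms

# Index side: catenateWords=1, as in the index analyzer.
indexed = set(analyze("manchester-united", catenate_words=True))

def matches(query):
    # Query side: catenateWords=0, as in the query analyzer.
    query_terms = analyze(query, catenate_words=False)
    return bool(query_terms) and all(t in indexed for t in query_terms)

for q in ("Manchester-united", "Manchester united", "Manchesterunited"):
    print(q, matches(q))
```

In this model all three query forms find the document, because the joined form is indexed alongside the split parts.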


So somehow we're not doing the same thing. Try
attaching debugQuery=on to your query and post the results.
Also try looking at the admin/analysis page and see what
that tells you.
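
For reference, a debug request URL could be built like this in Python. The host, port and path are assumptions (a local example install), and the field name "name" is hypothetical; substitute your own.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust to your setup.
BASE_URL = "http://localhost:8983/solr/select"

def debug_url(query, field="name"):
    """Build a select URL with debugQuery=on for a single-field query."""
    # "name" is a placeholder field name, not taken from the thread.
    params = {"q": f"{field}:{query}", "debugQuery": "on"}
    return f"{BASE_URL}?{urlencode(params)}"

print(debug_url("manchesterunited"))
```

The debug section of the response shows the parsed query, which makes it easy to see whether the query analyzer produced the terms you expected.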

Best
Erick

P.S. Did you re-index after your schema changes?

