Re: Tokenising on Each Letter

2010-08-23 Thread Scottie

Probably a good idea to post the relevant information! I guess I thought it
would be a really obvious answer, but it seems it's a bit more complex ;)

<field name="productsModel" type="textTight" indexed="true" stored="true"
       omitNorms="true"/>

<!-- Less flexible matching, but fewer false matches. Probably not ideal
     for product names, but may be good for SKUs. Can insert dashes in the
     wrong place and still match. -->
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <!-- this filter can remove any duplicate tokens that appear at the
         same position - sometimes possible with WordDelimiterFilter in
         conjunction with stemming. -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
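
If I'm reading those WordDelimiter settings right (worked through with a
made-up SKU): a value like "AB-CD12" gets split into the parts AB, CD and
12, and with catenateWords="1" / catenateNumbers="1" it should be indexed
as the tokens

  abcd, 12

so "ABCD12", "AB-CD12" and even "A-BCD12" all analyse to the same tokens
and match each other - which I take to be what the "dashes in the wrong
place" comment means.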

It seems you may be correct about the catenateAll option, but I'm not sure
that appending a wildcard to every search would be a great idea. This is
meant to be applied to a general search box, while still retaining
flexibility for model numbers. Right now we are using MySQL LIKE '%...%'
wildcards, so a query matches pretty much anywhere in the model number,
whether you cut off the start or the end, and I wanted to retain that.
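
(In Solr query syntax the equivalent of that would be a double wildcard,
something like productsModel:*ds12* - using the field from my schema above -
but as far as I understand it, leading wildcards can be slow and may not
even be allowed depending on the Solr version, so I'm wary of that route.)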

Could you elaborate on the n-gram approach for me, based on my schema?

The main reason I picked textTight was for model numbers like
EQW-500DBE-1AVER and the like; I thought it would produce better results
for those.

Thanks a lot for the detailed reply.

Scott


Re: Tokenising on Each Letter

2010-08-23 Thread Scottie

Nikolas, thanks a lot for that. I've just given it a quick test and it
definitely seems to work for the examples I gave.

Thanks again,

Scott


From: Nikolas Tautenhahn [via Lucene] 
Sent: Monday, August 23, 2010 3:14 PM
To: Scottie 
Subject: Re: Tokenising on Each Letter


Hi Scottie, 

> Could you elaborate on the n-gram approach for me, based on my schema?

just a quick reply: 


<fieldType name="textNGram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="0" catenateWords="1" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" side="front"
            minGramSize="2" maxGramSize="30"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="0" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

This will produce all front-anchored n-grams (prefixes) from 2 up to 30
characters; for more info, see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
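
For example (a worked illustration using the index analyzer above): with
splitOnNumerics="0" and preserveOriginal="1", a model number like ADS12P2
should pass through the WordDelimiterFilter as a single token, be lowercased
to ads12p2, and the EdgeNGramFilter will then index the grams

  ad, ads, ads1, ads12, ads12p, ads12p2

so a query for ADS1 (analysed to ads1 at query time) matches one of those
grams directly, no wildcard needed.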

Be sure to adjust those sizes (minGramSize/maxGramSize) so that 
maxGramSize is big enough to keep the whole original serial number/model 
number and minGramSize is not so small that you fill your index with 
useless information. 

Best regards, 
Nikolas Tautenhahn 




Tokenising on Each Letter

2010-08-20 Thread Scottie

Just getting ready to launch Solr on one of our websites.

Unfortunately, we can't work out one little issue: how do I configure Solr
so that it can search our model numbers easily? For example:

ADS12P2

If somebody searched for ADS it would match, because the value is currently
split into tokens wherever letters and numbers meet; if somebody searched
for ADS12 it would also work, etc.

But if somebody searches for ADS1, there are currently no results.

Does anybody know how I should configure Solr so that it will split a
certain field on each letter, or handle wildcards, etc.?

Kind Regards

Scott