Re: Tokenising on Each Letter
Probably a good idea to post the relevant information! I guessed it would be a really obvious answer, but it seems it's a bit more complex ;)

    <field name="productsModel" type="textTight" indexed="true" stored="true" omitNorms="true"/>

    <!-- Less flexible matching, but fewer false matches. Probably not ideal for
         product names, but may be good for SKUs. Can insert dashes in the wrong
         place and still match. -->
    <fieldType name="textTight" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <!-- this filter can remove any duplicate tokens that appear at the same
             position - sometimes possible with WordDelimiterFilter in conjunction
             with stemming. -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

It seems you may be correct about the catenateAll option, but I'm not sure adding a wildcard to the end of every search would be a great idea. This is meant to be applied to a general search box, but still retain flexibility for model numbers. Right now we are using MySQL LIKE with leading and trailing % wildcards, so it matches pretty much anything on the model number, whether you cut off the start or the end etc., and I wanted to retain that. Could you elaborate on N-grams for me, based on my schema? The main reason I picked textTight was for model numbers like EQW-500DBE-1AVER etc.; I thought it would produce better results. Thanks a lot for the detailed reply.
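For reference, the MySQL behaviour being kept here is plain substring matching. A few lines of Python can sketch it (the model numbers are just the examples from this thread, and the function name is made up for illustration):

```python
# Sketch of the MySQL LIKE '%term%' behaviour described above:
# a query matches if it appears anywhere inside the stored model number.
models = ["EQW-500DBE-1AVER", "ADS12P2"]

def like_wildcard(query, rows):
    """Emulate WHERE model LIKE CONCAT('%', query, '%'), case-insensitive."""
    q = query.lower()
    return [m for m in rows if q in m.lower()]

print(like_wildcard("500DBE", models))  # matches in the middle of the string
print(like_wildcard("ads1", models))    # matches a partial prefix
```

This is the flexibility a token-based Solr field loses unless the analysis chain (n-grams, wildcards, etc.) puts it back.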
Scott

--
View this message in context: http://lucene.472066.n3.nabble.com/Tokenising-on-Each-Letter-tp1247113p1291984.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenising on Each Letter
Nikolas, thanks a lot for that. I've just given it a quick test and it definitely seems to work for the examples I gave. Thanks again, Scott

From: Nikolas Tautenhahn [via Lucene]
Sent: Monday, August 23, 2010 3:14 PM
To: Scottie
Subject: Re: Tokenising on Each Letter

Hi Scottie,

> Could you elaborate about N gram for me, based on my schema?

Just a quick reply:

    <fieldType name="textNGram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="1" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" side="front" minGramSize="2" maxGramSize="30"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

This will produce n-grams from 2 up to 30 characters; for info, check http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory

Be sure to adjust those sizes (minGramSize/maxGramSize) so that maxGramSize is big enough to keep the whole original serial number/model number, and minGramSize is not so small that you fill your index with useless information.
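To make the effect concrete, here is a small Python sketch (not Solr code) of what a front-side edge n-gram filter with minGramSize=2 and maxGramSize=30 emits for one lower-cased token:

```python
def edge_ngrams(token, min_gram=2, max_gram=30):
    """Front-side edge n-grams: every prefix of the token between
    min_gram and max_gram characters long (the token itself included)."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("ads12p2"))
# ['ad', 'ads', 'ads1', 'ads12', 'ads12p', 'ads12p2']
```

Because "ads1" is now one of the indexed terms, a query for ADS1 matches as an ordinary term lookup, with no wildcard needed.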
Best regards,
Nikolas Tautenhahn
Tokenising on Each Letter
Just getting ready to launch Solr on one of our websites. Unfortunately, we can't work out one little issue: how do I configure Solr so that it can search our model numbers easily? For example: ADS12P2. If somebody searched for ADS it would match, because currently the field is split into tokens where letters meet numbers, and if somebody searched for ADS12 it would also work. But if somebody searches for ADS1, currently there are no results. Does anybody know how I should configure Solr so that it will split a certain field on each letter, or use a wildcard, etc.?

Kind Regards
Scott
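A rough Python sketch of the letter/number splitting being described (an approximation of the splitting behaviour, not Solr itself) shows why ADS1 finds nothing when every query token has to match an indexed token:

```python
import re

def split_model(text):
    """Split at letter<->digit boundaries, e.g. 'ADS12P2' -> ['ads', '12', 'p', '2']."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+|[0-9]+", text)]

indexed = set(split_model("ADS12P2"))  # {'ads', '12', 'p', '2'}

def matches(query):
    # assume every query token must be present among the indexed tokens
    return all(tok in indexed for tok in split_model(query))

print(matches("ADS"))    # True  -- 'ads' is an indexed token
print(matches("ADS12"))  # True  -- 'ads' and '12' are both indexed
print(matches("ADS1"))   # False -- '1' was never produced as a token
```

The partial fragment "ads1" straddles a token boundary, so no whole indexed token equals it; that is the gap the n-gram approach discussed later in the thread fills.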