Re: Indexing tweet and searching @keyword OR #keyword
I don't see an easy way to do that with the standard set of filters. You'll probably need to write something custom (note, this is actually pretty easy). I suspect you'll need to do something like Synonyms: when you get a token like #ipod, you essentially make it a synonym for ipod and insert both in the document...

This assumes you can't create a list of all the terms you want treated this way, because if you could, you could just use synonyms.

Best
Erick

On Thu, Aug 11, 2011 at 1:37 AM, Mohammad Shariq shariqn...@gmail.com wrote:
> > Do you really want a search on ipad to *fail* to match input of #ipad? Or vice-versa?
>
> My requirement is: I want to search both '#ipad' and 'ipad' for q='ipad', BUT for q='#ipad' I want to search ONLY '#ipad', excluding 'ipad'.
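The synonym-like idea in this reply (index both #ipod and ipod, but leave the query untouched) can be modelled in a few lines. What follows is a plain-Python sketch of the token logic only, not Solr or Lucene code. The function names are hypothetical, tokens are split on whitespace, and punctuation handling and stemming are ignored. In Solr itself this logic would presumably live in a custom token filter applied on the index side only.

```python
# Plain-Python model of a synonym-style index-time expansion.
# This is NOT Solr/Lucene code; all names here are hypothetical.

PREFIXES = ("#", "@")


def index_tokens(text):
    """Index-time analysis: keep each token, and for '#ipad' or '@ipad' also emit bare 'ipad'."""
    tokens = []
    for raw in text.lower().split():
        tokens.append(raw)
        if raw.startswith(PREFIXES) and len(raw) > 1:
            # The bare word is indexed next to the tagged form, like a synonym.
            tokens.append(raw[1:])
    return tokens


def query_tokens(query):
    """Query-time analysis: lowercase and split on whitespace, with no expansion."""
    return query.lower().split()


def matches(query, document):
    """A document matches when every query token is among its index tokens."""
    indexed = set(index_tokens(document))
    return all(token in indexed for token in query_tokens(query))
```

In this model, `matches("ipad", "I love my #iPad")` is true while `matches("#ipad", "new ipad out")` is false, which is the behaviour asked for in the thread. The asymmetry comes from expanding at index time only.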
Re: Indexing tweet and searching @keyword OR #keyword
I tried tweaking WordDelimiterFilterFactory, but it won't accept the # or @ symbols; they are ignored entirely. I need a solution, please suggest one.

On 4 August 2011 21:08, Jonathan Rochkind rochk...@jhu.edu wrote:
> It's the WordDelimiterFilterFactory in your filter chain that's removing the punctuation entirely from your index, I think.
>
> Read up on what the WordDelimiter filter does and what its settings are; decide how you want things to be tokenized in your index to get the behavior you want; then either get WordDelimiter to do it that way by passing it different arguments, or stop using WordDelimiter. Come back with any questions after trying that!

--
Thanks and Regards
Mohammad Shariq
Re: Indexing tweet and searching @keyword OR #keyword
Please look more carefully at the documentation for WDDF, specifically: "split on intra-word delimiters (all non alpha-numeric characters)".

WordDelimiterFilterFactory will always throw away non alpha-numeric characters; you can't tell it to do otherwise. Try some of the other tokenizers/analyzers to get what you want, and also look at the admin/analysis page to see exactly what effect your fieldType definitions have.

Here's a great place to start:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

You probably want something like WhitespaceTokenizerFactory followed by LowerCaseFilterFactory or some such...

But I really question whether this is what you want either. Do you really want a search on ipad to *fail* to match input of #ipad? Or vice-versa?

KeywordTokenizerFactory is probably not the place you want to start. Its tokenization process doesn't break anything up; you happen to be getting separate tokens because of WDDF, which, as you see, can't process things the way you want.

Best
Erick

On Wed, Aug 10, 2011 at 3:09 AM, Mohammad Shariq shariqn...@gmail.com wrote:
> I tried tweaking WordDelimiterFilterFactory, but it won't accept the # or @ symbols; they are ignored entirely. I need a solution, please suggest one.
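To make the difference between the two analysis chains concrete, here is a rough plain-Python model of them. It is a sketch only and not Solr code: the function names are hypothetical, and the delimiter model simply splits on every non-alphanumeric character, ignoring the case-change splitting, catenation, and stemming that the real chain in this thread also applies.

```python
import re


def delimiter_style_tokens(text):
    """Rough model of splitting on all non-alphanumeric characters: '#' and '@' are discarded."""
    return [part.lower() for part in re.split(r"[^A-Za-z0-9]+", text) if part]


def whitespace_lowercase_tokens(text):
    """Rough model of a whitespace tokenizer followed by a lowercase filter: '#' and '@' survive."""
    return [part.lower() for part in text.split()]


tweet = "Got my #iPad from @Apple"
stripped = delimiter_style_tokens(tweet)
preserved = whitespace_lowercase_tokens(tweet)
```

Here `stripped` contains `ipad` and `apple` with the symbols gone, while `preserved` contains `#ipad` and `@apple`. Note the caveat raised in the reply: in the second model a plain `ipad` query token is no longer equal to the indexed `#ipad` token.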
Re: Indexing tweet and searching @keyword OR #keyword
On 10 August 2011 19:49, Erick Erickson erickerick...@gmail.com wrote:
> Do you really want a search on ipad to *fail* to match input of #ipad? Or vice-versa?

My requirement is: I want to search both '#ipad' and 'ipad' for q='ipad', BUT for q='#ipad' I want to search ONLY '#ipad', excluding 'ipad'.

--
Thanks and Regards
Mohammad Shariq
Indexing tweet and searching @keyword OR #keyword
I have indexed around 1 million tweets (using the text dataType). When I search the tweets with # or @, I don't get exact results. E.g., when I search for #ipad OR @ipad, I get results where ipad is mentioned, with the # and @ skipped. Please suggest how to tune this, or which filter factories to use, to get the desired result.

I am indexing the tweets as text; below is the text fieldType from my schema.xml.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt" language="English"/>
  </analyzer>
</fieldType>

--
Thanks and Regards
Mohammad Shariq
Re: Indexing tweet and searching @keyword OR #keyword
It's the WordDelimiterFilterFactory in your filter chain that's removing the punctuation entirely from your index, I think.

Read up on what the WordDelimiter filter does and what its settings are; decide how you want things to be tokenized in your index to get the behavior you want; then either get WordDelimiter to do it that way by passing it different arguments, or stop using WordDelimiter. Come back with any questions after trying that!

On 8/4/2011 11:22 AM, Mohammad Shariq wrote:
> I have indexed around 1 million tweets (using the text dataType). When I search the tweets with # or @, I don't get exact results. E.g., when I search for #ipad OR @ipad, I get results where ipad is mentioned, with the # and @ skipped. Please suggest how to tune this, or which filter factories to use, to get the desired result.