Re: Indexing tweet and searching @keyword OR #keyword

2011-08-13 Thread Erick Erickson
I don't see an easy way to do that with the standard set of
filters. You'll probably need to write something custom (note,
this is actually pretty easy). I suspect you'll
need to do something like Synonyms, where when you
get a token like #ipod, you essentially make it a synonym
for ipod and insert both in the document...

This assumes you can't create a list of all the terms you want
treated this way, because if you could, you could just use synonyms.
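
A minimal sketch of that idea, written as a plain Python simulation rather
than as a real Solr filter. The function names (index_tokens, query_tokens,
matches) and the sample tweets are made up for illustration. It assumes the
expansion happens only at index time, so a bare query term finds both forms
while a prefixed query term finds only the prefixed form:

```python
import re

# Hypothetical helpers: this simulates the idea, it is not Solr code.
PREFIXED = re.compile(r"^[#@](\w+)$")

def index_tokens(text):
    """Index-time analysis: keep '#ipad' and also emit the bare 'ipad'."""
    tokens = set()
    for raw in text.lower().split():
        tokens.add(raw)
        m = PREFIXED.match(raw)
        if m:
            tokens.add(m.group(1))  # synonym-style extra token
    return tokens

def query_tokens(query):
    """Query-time analysis: no expansion, so '#ipad' stays '#ipad'."""
    return set(query.lower().split())

def matches(query, doc):
    """A document matches when every query token was indexed for it."""
    return query_tokens(query) <= index_tokens(doc)

docs = ["Loving my new #ipad", "The ipad is on sale", "Thanks @ipad support"]
hits_plain = [d for d in docs if matches("ipad", d)]   # all three tweets
hits_hash = [d for d in docs if matches("#ipad", d)]   # only the #ipad tweet
```

In Solr terms this would correspond to using different index and query
analyzers for the field, with the extra token added only on the index side.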


Best
Erick

On Thu, Aug 11, 2011 at 1:37 AM, Mohammad Shariq shariqn...@gmail.com wrote:
 Do you really want a search on ipad to *fail* to match input of #ipad?
 Or vice-versa?

 My requirement is: for q='ipad' I want to match both '#ipad' and 'ipad',
 BUT for q='#ipad' I want to match ONLY '#ipad', excluding 'ipad'.





Re: Indexing tweet and searching @keyword OR #keyword

2011-08-10 Thread Mohammad Shariq
I tried tweaking WordDelimiterFilterFactory, but it won't accept the # or @
symbols and ignores them entirely.
I need a solution; please suggest one.

On 4 August 2011 21:08, Jonathan Rochkind rochk...@jhu.edu wrote:

 It's the WordDelimiterFilterFactory in your filter chain that's removing
 the punctuation entirely from your index, I think.

 Read up on what the WordDelimiter filter does and what its settings are;
 decide how you want things to be tokenized in your index to get the behavior
 you want; then either get WordDelimiter to do it that way by passing it
 different arguments, or stop using WordDelimiter. Come back with any
 questions after trying that!




-- 
Thanks and Regards
Mohammad Shariq


Re: Indexing tweet and searching @keyword OR #keyword

2011-08-10 Thread Erick Erickson
Please look more carefully at the documentation for WDDF
(WordDelimiterFilterFactory), specifically:

split on intra-word delimiters (all non alpha-numeric characters).

WordDelimiterFilterFactory will always throw away non-alphanumeric
characters; you can't tell it to do otherwise. Try some of the other
tokenizers/analyzers to get what you want, and also look at the
admin/analysis page to see the exact effects of your
fieldType definitions.

Here's a great place to start:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

You probably want something like WhitespaceTokenizerFactory
followed by LowerCaseFilterFactory or some such...
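
To illustrate what such a chain would keep, here is a rough Python
simulation (not Solr code; whitespace_lowercase is a hypothetical name) of
splitting on whitespace only and then lowercasing each token:

```python
def whitespace_lowercase(text):
    """Split on whitespace only, then lowercase each token."""
    return [token.lower() for token in text.split()]

# '#' and '@' survive because nothing here splits on punctuation.
tokens = whitespace_lowercase("New #iPad from @Apple today")
print(tokens)
```

Because nothing splits on punctuation, trailing punctuation such as the "!"
in "#ipad!" would also stay attached in this sketch, so a real chain may
need some further cleanup.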

But I really question whether this is what you want either. Do you
really want a search on ipad to *fail* to match input of #ipad? Or
vice-versa?

KeywordTokenizerFactory is probably not the place you want to start:
that tokenizer doesn't break anything up. You happen to be
getting separate tokens because of WDDF, which, as you see, can't
process things the way you want.


Best
Erick

On Wed, Aug 10, 2011 at 3:09 AM, Mohammad Shariq shariqn...@gmail.com wrote:
 I tried tweaking WordDelimiterFilterFactory, but it won't accept the # or @
 symbols and ignores them entirely.
 I need a solution; please suggest one.




Re: Indexing tweet and searching @keyword OR #keyword

2011-08-10 Thread Mohammad Shariq
Do you really want a search on ipad to *fail* to match input of #ipad?
Or vice-versa?

My requirement is: for q='ipad' I want to match both '#ipad' and 'ipad',
BUT for q='#ipad' I want to match ONLY '#ipad', excluding 'ipad'.


 




-- 
Thanks and Regards
Mohammad Shariq


Indexing tweet and searching @keyword OR #keyword

2011-08-04 Thread Mohammad Shariq
I have indexed around 1 million tweets (using the text fieldType).
When I search the tweets for # or @ terms, I don't get exact results.
E.g., when I search for #ipad OR @ipad, I get results where ipad is
mentioned, ignoring the # and @.
Please suggest how to tune this, or which filter factories to use, to get the
desired result.
I am indexing the tweets as text; below is the text fieldType from my
schema.xml.


<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
            minShingleSize="3" maxShingleSize="3" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            protected="protwords.txt" language="English"/>
  </analyzer>
</fieldType>

-- 
Thanks and Regards
Mohammad Shariq


Re: Indexing tweet and searching @keyword OR #keyword

2011-08-04 Thread Jonathan Rochkind
It's the WordDelimiterFilterFactory in your filter chain that's removing the
punctuation entirely from your index, I think.


Read up on what the WordDelimiter filter does and what its settings
are; decide how you want things to be tokenized in your index to get the
behavior you want; then either get WordDelimiter to do it that way by
passing it different arguments, or stop using WordDelimiter. Come back
with any questions after trying that!
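
A rough way to picture the effect, as a Python simulation only. The real
filter has many more options; split_on_non_alphanumerics is a hypothetical
stand-in that assumes every non-alphanumeric character acts as a delimiter
and is dropped:

```python
import re

def split_on_non_alphanumerics(token):
    """Crude stand-in for delimiter splitting: keep only alphanumeric runs."""
    return re.findall(r"[A-Za-z0-9]+", token)

# All three inputs end up as the same indexed term, so the prefix is lost.
for raw in ("#ipad", "@ipad", "ipad"):
    print(raw, "->", split_on_non_alphanumerics(raw))
```

Under that assumption, '#ipad', '@ipad' and 'ipad' are indistinguishable
once indexed, which matches the behavior described in the original question.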


