Re: StopWords coming in Top 10 terms despite using StopFilterFactory

2011-09-23 Thread Shawn Heisey

On 9/23/2011 1:45 AM, Pranav Prakash wrote:
> Maybe I am wrong. But my intention in using both of them is this: first,
> I want to use phrase queries, so I used CommonGramsFilterFactory. Second,
> I don't want those stopwords in my index, so I have used
> StopFilterFactory to remove them.


CommonGrams is not necessary for phrase queries.  If you have a 
super-dense index with very large documents, it will reduce the amount 
of memory used by Solr, which can make phrase queries faster.  It comes 
at a large expense in disk space because your index gets considerably 
larger.  The trade-off between index size and memory usage may not be 
worth it.  For an index like the Hathi Trust's, the trade-off is 
worthwhile.



> term    frequency
> to      26164
> and     25804
> the     25566
> of      25022
> a       24918
> in      24590
> for     23646
> n       23588
> with    23055
> is      22510


Is this typical of your production index size, or just a test?  With 
numbers this low, neither commongrams nor stopfilter is really 
necessary.  I suspect that these are probably test numbers, though.





>> Did you delete and do a full reindex after you changed your schema?
>
> Yup, I did that a couple of times.


I don't know what's going on here, but it sounds like your config might 
not be saying what you think it's saying.  It might be a good idea to 
include your entire schema.xml and the name of the field that you are 
looking at for term frequency.
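
When reporting the field name, it can also help to state exactly which request produced the term counts. The following is only a sketch, assuming the numbers come from Solr's Luke request handler (as far as I know, the handler behind the Schema Browser page). The base URL and the field name `text` are placeholders, not values taken from this thread.

```python
from urllib.parse import urlencode


def luke_top_terms_url(base_url, field, num_terms=10):
    """Build a Luke request handler URL asking for the top terms of one field.

    base_url and field are placeholders; adjust them to your own core and schema.
    """
    params = {
        "fl": field,            # restrict the report to this field
        "numTerms": num_terms,  # how many top terms to list per field
        "wt": "json",           # response format
    }
    return base_url.rstrip("/") + "/admin/luke?" + urlencode(params)


# Hypothetical local instance and field name.
print(luke_top_terms_url("http://localhost:8983/solr", "text"))
```

Pasting the resulting URL together with its output removes any doubt about which field the top 10 list refers to.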


Thanks,
Shawn



Re: StopWords coming in Top 10 terms despite using StopFilterFactory

2011-09-23 Thread Pranav Prakash
> You've got CommonGramsFilterFactory and StopFilterFactory both using
> stopwords.txt, which is a confusing configuration.  Normally you'd want one
> or the other, not both ... but if you did legitimately have both, you'd want
> them to each use a different wordlist.
>

Maybe I am wrong. But my intention in using both of them is this: first, I
want to use phrase queries, so I used CommonGramsFilterFactory. Second, I
don't want those stopwords in my index, so I have used StopFilterFactory to
remove them.



>
> The commongrams filter turns each found occurrence of a word in the file
> into two tokens - one prepended with the token before it, one appended with
> the token after it.  If it's the first or last term in a field, it only
> produces one token.  When it gets to the stopfilter, the combined terms no
> longer match what's in stopwords.txt, so no action is taken.
>
> If I had to guess, what you are seeing in the top 10 terms is the
> concatenation of your most common stopword with another word.  If it were
> English, I would guess that to be "of_the" or something similar.  If my
> guess is wrong, then I'm not sure what's going on, and some cut/paste of
> what you're actually seeing might be in order.


term    frequency
to      26164
and     25804
the     25566
of      25022
a       24918
in      24590
for     23646
n       23588
with    23055
is      22510



>  Did you delete and do a full reindex after you changed your schema?
>

Yup, I did that a couple of times.


>
> Thanks,
> Shawn
>
>
*Pranav Prakash*

"temet nosce"



Re: StopWords coming in Top 10 terms despite using StopFilterFactory

2011-09-22 Thread Shawn Heisey

On 9/22/2011 3:54 AM, Pranav Prakash wrote:

> Hi List,
>
> I included StopFilterFactory and I can see it taking action in the Analyzer
> Interface. However, when I go to Schema Analyzer, I see those stop words in
> the top 10 terms. Is this normal?

















You've got CommonGramsFilterFactory and StopFilterFactory both using 
stopwords.txt, which is a confusing configuration.  Normally you'd want 
one or the other, not both ... but if you did legitimately have both, 
you'd want them to each use a different wordlist.


The commongrams filter turns each found occurrence of a word in the file 
into two tokens - one prepended with the token before it, one appended 
with the token after it.  If it's the first or last term in a field, it 
only produces one token.  When it gets to the stopfilter, the combined 
terms no longer match what's in stopwords.txt, so no action is taken.
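
To make the interaction concrete, here is a small Python sketch. It is a simplified model of the behaviour as described in this message, not Lucene's actual filter code, and it may not match the real filter's output in every detail. The `_` separator is taken from the "of_the" example; the function names, the word list and the sample phrase are made up for illustration.

```python
def common_grams(tokens, common_words, sep="_"):
    """Simplified model: a common word is only emitted joined to its neighbours."""
    out = []
    for i, tok in enumerate(tokens):
        if tok not in common_words:
            out.append(tok)  # ordinary words pass through unchanged
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common_words or nxt in common_words:
                out.append(tok + sep + nxt)  # bigram that involves a common word
    return out


def stop_filter(tokens, stopwords):
    """Drop tokens that exactly match an entry in the stopword list."""
    return [t for t in tokens if t not in stopwords]


WORDS = {"of", "the"}                # hypothetical contents of stopwords.txt
tokens = "king of the hill".split()  # made-up sample input

grams = common_grams(tokens, WORDS)
print(grams)                      # tokens after the commongrams step
print(stop_filter(grams, WORDS))  # stop filter applied to the combined terms
print(stop_filter(tokens, WORDS)) # stop filter alone, for comparison
```

In this model the stop filter removes nothing when it runs after the commongrams step, because "of_the" is not an entry in the word list, while the same filter applied directly to the raw tokens drops "of" and "the".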


If I had to guess, what you are seeing in the top 10 terms is the 
concatenation of your most common stopword with another word.  If it 
were English, I would guess that to be "of_the" or something similar.  
If my guess is wrong, then I'm not sure what's going on, and some 
cut/paste of what you're actually seeing might be in order.  Did you 
delete and do a full reindex after you changed your schema?


Thanks,
Shawn



StopWords coming in Top 10 terms despite using StopFilterFactory

2011-09-22 Thread Pranav Prakash
Hi List,

I included StopFilterFactory and I can see it taking action in the Analyzer
Interface. However, when I go to Schema Analyzer, I see those stop words in
the top 10 terms. Is this normal?
















*Pranav Prakash*

"temet nosce"
