Dear Lucene Users:

What is the best way to get the most common terms for a subset of the total
documents in your index?

I know how to get the most common terms for a field for the entire index,
but what is the most efficient way to do this for a subset of documents?

Here is the code I am using to get the top "numberOfTerms" common terms for
the field "fieldName":

        public TermInfo[] mostCommonTerms(String fieldName, int numberOfTerms)
        {
            // Treat a non-positive request as "as many as allowed", then cap at 50.
            if (numberOfTerms < 1)
            {
                numberOfTerms = Integer.MAX_VALUE;
            }
            numberOfTerms = Math.min(numberOfTerms, 50);

            IndexReader reader = null;
            try
            {
                reader = IndexReader.open(indexPath);
                // One spare slot, so the queue can briefly hold
                // numberOfTerms + 1 entries before the lowest is popped.
                TermInfoQueue tiq = new TermInfoQueue(numberOfTerms + 1);
                TermEnum terms = reader.terms();

                int minFreq = 0;
                while (terms.next())
                {
                    // Skip terms that belong to other fields.
                    if (!fieldName.equalsIgnoreCase(terms.term().field()))
                    {
                        continue;
                    }
                    if (terms.docFreq() > minFreq)
                    {
                        tiq.put(new TermInfo(terms.term(), terms.docFreq()));
                        if (tiq.size() > numberOfTerms) // queue is over-full
                        {
                            tiq.pop(); // remove the lowest entry
                            // the new lowest entry becomes the threshold
                            minFreq = ((TermInfo) tiq.top()).docFreq;
                        }
                    }
                }

                // The queue pops lowest first, so fill the array from the
                // back to return the most frequent term first.
                TermInfo[] res = new TermInfo[tiq.size()];
                for (int i = 0; i < res.length; i++)
                {
                    res[res.length - i - 1] = (TermInfo) tiq.pop();
                }
                return res;
            }
            catch (IOException ioe)
            {
                logger.error("IOException: " + ioe.getMessage());
            }
            finally
            {
                // Close the reader even when an exception was thrown.
                if (reader != null)
                {
                    try
                    {
                        reader.close();
                    }
                    catch (IOException ioe)
                    {
                        logger.error("IOException: " + ioe.getMessage());
                    }
                }
            }
            return null;
        }
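
To make the subset question concrete, here is a minimal, Lucene-free sketch
of the result I am after. It is only an illustration under stated
assumptions, not a Lucene API: the index is stood in for by an in-memory map
from document id to the distinct terms of one field, and the names
SubsetTopTerms, TermCount and termsByDoc are hypothetical. It uses
java.util.PriorityQueue and Java 8 syntax rather than the Lucene classes
above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

/** Sketch: top-N terms by document frequency within a subset of documents. */
class SubsetTopTerms {

    /** A term paired with the number of subset documents that contain it. */
    static final class TermCount {
        final String term;
        final int docFreq;

        TermCount(String term, int docFreq) {
            this.term = term;
            this.docFreq = docFreq;
        }
    }

    /**
     * @param termsByDoc hypothetical stand-in for the index:
     *                   document id -> distinct terms of one field
     * @param subset     ids of the documents to consider
     * @param n          maximum number of terms to return
     * @return up to n terms, most frequent first (ties broken alphabetically)
     */
    static List<TermCount> mostCommonTerms(Map<Integer, Set<String>> termsByDoc,
                                           Set<Integer> subset, int n) {
        // Pass 1: count how many subset documents contain each term.
        Map<String, Integer> docFreq = new HashMap<>();
        for (Integer docId : subset) {
            Set<String> terms = termsByDoc.get(docId);
            if (terms == null) {
                continue; // id not present in the stand-in index
            }
            for (String term : terms) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        // Pass 2: bounded min-heap. The head is always the weakest candidate:
        // lowest docFreq first, and among equals the alphabetically last term.
        PriorityQueue<TermCount> heap = new PriorityQueue<TermCount>(
            (TermCount a, TermCount b) -> a.docFreq != b.docFreq
                ? Integer.compare(a.docFreq, b.docFreq)
                : b.term.compareTo(a.term));
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            heap.add(new TermCount(e.getKey(), e.getValue()));
            if (heap.size() > n) {
                heap.poll(); // drop the weakest once more than n are held
            }
        }

        // The heap yields weakest first, so fill the result from the back.
        TermCount[] res = new TermCount[heap.size()];
        for (int i = res.length - 1; i >= 0; i--) {
            res[i] = heap.poll();
        }
        return Arrays.asList(res);
    }
}
```

Counting per document like this costs time proportional to the total number
of terms in the subset. Whether Lucene offers a cheaper route than that is
exactly what I am asking.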
