Re: raw hit count
Kent, Erik,

On Saturday 29 November 2003 17:20, Erik Hatcher wrote:
> I enjoy at least attempting to answer questions here, even if I'm half
> wrong, so by all means correct me if I misspeak.

Me too. :)

> On Saturday, November 29, 2003, at 06:37 PM, Kent Gibson wrote:
>> All I would like to know is how many times a query was found in a
>> particular document. I have no problem getting the score from
>> hits.score(). hits.length() is the number of times in total that the
>> query was found, but I want the number of times the query was found on
>> a document-by-document basis. Is this possible?

Could you be a bit more precise about what you mean by 'the number of times the query was found'? For a single query term it is straightforward, but what about, e.g., a query for three optional terms?

The 'coord' factor used in computing the score is exactly this. See the javadoc for it:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#coord(int,%20int)

AFAIK, this overlap is the number of terms the document and the query have in common. For a query consisting of a single term the overlap is always one, and the number of times the query occurs in a document is the term frequency in the document. You could implement a custom Similarity to capture the overlap, or adjust the coord factor, depending on what you're trying to accomplish.

>> The only idea I have is to rerun the search, but I can't even see how
>> to run a search on only one document!

You could always rerun the search with a Filter that has only one bit enabled and see whether zero or one document is returned; that would be quite trivial and fast.

You could also implement a Similarity that ignores the total number of terms in the searched document field; see lengthNorm() in
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
As lengthNorm() is applied at indexing time, you will have to reindex for this to work for you.
At query time you can then use a tf() implementation that is linear, instead of the default square root in DefaultSimilarity, and a constant idf(), instead of the default log of the inverse document frequency. You should then get a document score that is proportional to the number of query terms in the document.

Kind regards,
Ype

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
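Ype's recipe (constant lengthNorm() and idf(), linear tf()) can be sketched without any Lucene dependency; the class name, method bodies, and numbers below are illustrative only, with signatures mirroring the Similarity javadoc referenced above:

```java
// Sketch of the scoring arithmetic Ype describes, with no Lucene dependency.
// With lengthNorm() and idf() held constant at 1.0 and tf() linear, the score
// for a single-term query reduces to the raw term frequency in the document.
public class LinearSimilaritySketch {
    // indexing-time factor: ignore field length (default is 1/sqrt(numTerms))
    public float lengthNorm(String fieldName, int numTerms) { return 1.0f; }

    // query-time factor: linear in frequency (default is sqrt(freq))
    public float tf(float freq) { return freq; }

    // query-time factor: constant (default is log(numDocs/(docFreq+1)) + 1)
    public float idf(int docFreq, int numDocs) { return 1.0f; }

    // For a one-term query the score is then proportional to term frequency.
    public float score(int freqInDoc, int fieldLength, int docFreq, int numDocs) {
        return tf(freqInDoc) * idf(docFreq, numDocs)
                * lengthNorm("body", fieldLength);
    }

    public static void main(String[] args) {
        LinearSimilaritySketch s = new LinearSimilaritySketch();
        // a term occurring 3 times scores exactly three times a single occurrence
        System.out.println(s.score(3, 100, 5, 1000)); // 3.0
        System.out.println(s.score(1, 100, 5, 1000)); // 1.0
    }
}
```

The same three overrides dropped into a Similarity subclass (and a reindex, for lengthNorm) give the proportional scores Ype describes.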
WebLucene 0.3 release: CJK support, SAX-based indexing, docID-based result sorting, and XML output with highlighting
http://sourceforge.net/projects/weblucene/

WebLucene is an XML interface to the Lucene search engine, providing SAX-based indexing, indexing-sequence-based result sorting, and XML output with highlighting support. The CJKTokenizer supports Chinese, Japanese, and Korean alongside Western languages simultaneously.

The key features:
1. Bi-gram based CJK support: org/apache/lucene/analysis/cjk/CJKTokenizer
2. docID-based result sorting: org/apache/lucene/search/IndexOrderSearcher
3. XML output: com/chedong/weblucene/search/DOMSearcher
4. SAX-based indexing: com/chedong/weblucene/index/SAXIndexer
5. Token-based highlighter (a reverse StopTokenizer): org/apache/lucene/analysis/HighlightAnalyzer.java and HighlightFilter.java, with abstracts: com/chedong/weblucene/search/WebluceneHighlighter
6. A simplified query parser, Google-like syntax with a term limit: org/apache/lucene/queryParser/SimpleQueryParser (modified from an early version of Lucene :)

Regards,
Che Dong
Re: raw hit count
Thanks for the help guys, but unfortunately I am still stuck. Let me reiterate what I would like to do and then explain what I have tried.

I would like to know that in document x the query y appeared n times. For example: query = Bank: Bank found in doc number 1, 3 times. Understandably this is a bit tricky when query y is composed of more than one word, but for the moment I would be satisfied if I knew how many times query y appeared in its entirety. In the end, though, it would be great if I could get a result as follows: query = Hells Bells: Hells found in doc number 2, 3 times, and Bells found 0 times.

As per Erik's idea I tried with the BitSet as follows:

QueryFilter qf = new QueryFilter(query);
IndexReader ir = IndexReader.open(indexPath);
Searcher searcher2 = new IndexSearcher(ir);
// get the bit set for the query
BitSet bits = qf.bits(ir);
last = bits.nextSetBit(offset);
offset = last + 1;
System.out.println("First bit is: " + last);
System.out.println("Bits " + bits.toString());
// clear all the bits
bits.clear();
System.out.println("Bits after " + bits.toString());
bits.set(last); /* just to see the effect */
BitSet bits2 = qf.bits(ir);
System.out.println("Bits now " + bits2.toString());
Hits hits2 = searcher2.search(query, qf);
/* this value is always one */
System.out.println("raw hits: " + hits2.length());

However I always get a result of 1, which I suppose has to do with this overlap thingy.

As per Ype's idea I tried to implement a Similarity object, but I believe two things are wrong: a) I am doing something fundamentally wrong with the maths, and b) I have a sneaky idea that this is the wrong way to go about it. Is there not a simple way to just get some word statistics out of a file? Once again thanks for the input, and I look forward to a long fight.

public float lengthNorm(String fieldName, int numTerms) {
    return (float) 1.0;
}

/** Implemented as <code>sqrt(freq)</code>. */
public float tf(float freq) {
    return (float) (freq);
}

/** Implemented as <code>log(numDocs/(docFreq+1)) + 1</code>. */
public float idf(int docFreq, int numDocs) {
    return (float) 1.0;
}

--- Ype Kingma [EMAIL PROTECTED] wrote:
> [...]
Re: raw hit count
On Sunday, November 30, 2003, at 11:13 AM, Kent Gibson wrote:
> as per Erik's idea I tried with the BitSet as follows:
> QueryFilter qf = new QueryFilter(query);
> IndexReader ir = IndexReader.open(indexPath);
> Searcher searcher2 = new IndexSearcher(ir);
> // get the bit set for the query
> BitSet bits = qf.bits(ir);

I did not mean to imply that you should call the bits method in this manner. In fact, you should not call it at all; the IndexSearcher calls it under the covers. I was implying that you could write your own Filter subclass that lights up a single bit corresponding to the document you're interested in.

> However I always get a result of 1, which I suppose has to do with this
> overlap thingy.

No, not related; a filter and the coord overlap are two different concepts.

> Is there not a simple way to just get some word statistics out of a file?

Look at the Lucene index format (linked from Lucene's main web page). Term frequencies are part of the statistics gathered, of course. You can get at the values there using IndexReader. This may be a lot lower-level than you desire, but what Lucene stores is there for you.

Erik
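The Filter subclass Erik describes might look like the sketch below, assuming the Lucene 1.x Filter API (a bits(IndexReader) method returning a java.util.BitSet); the class name is made up for illustration:

```java
import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

// Hypothetical filter that lets exactly one document through, so rerunning a
// search with it returns either zero or one hit for that document.
public class SingleDocFilter extends Filter {
    private final int targetDoc; // internal Lucene document number

    public SingleDocFilter(int targetDoc) {
        this.targetDoc = targetDoc;
    }

    // IndexSearcher calls this under the covers; only the target bit is lit,
    // so every other document is filtered out before scoring.
    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        bits.set(targetDoc);
        return bits;
    }
}
```

Used as searcher.search(query, new SingleDocFilter(docId)): a hit count of one means the query matches that one document, zero means it does not.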
Re: raw hit count
Thanks a mil Erik. I tried to make my own filter class with a modified bits method as per below:

if (doc == interestingDoc) {
    bits.set(doc); // set bit for hit
}

but this baby continues to always return 1! So then I looked at IndexReader, like you said, and ended up with something like this. It's probably a messy way of doing it, but I am happy:

Term term = new Term("body", "mercedes");
IndexReader ir = IndexReader.open(indexPath);
TermDocs termdocs = ir.termDocs(term);
int id = hits.id(i);
while (termdocs.next()) {
    if (termdocs.doc() == id) {
        System.out.println("Document number " + termdocs.doc()
                + " Freq: " + termdocs.freq());
    }
}

It only works for single words, but I reckon I can just split up the query and then make multiple scans.

cheers
kent

--- Erik Hatcher [EMAIL PROTECTED] wrote:
> [...]
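Kent's single-word scan generalizes to several terms by iterating, one termDocs() pass per word. A sketch, assuming the same Lucene 1.x IndexReader/TermDocs API as Kent's snippet; the field name and query words are made up, and the loops are Java 1.4 style (no generics or for-each in 2003):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Per-term frequencies for one document: one termDocs() scan per query word,
// printing "word found in doc N, M times" for each word (0 if absent).
public class PerDocTermCounts {
    public static void printCounts(String indexPath, int docId,
                                   String field, String[] words)
            throws IOException {
        IndexReader ir = IndexReader.open(indexPath);
        try {
            for (int i = 0; i < words.length; i++) {
                int freq = 0; // stays 0 if the term never appears in this doc
                TermDocs termDocs = ir.termDocs(new Term(field, words[i]));
                while (termDocs.next()) {
                    if (termDocs.doc() == docId) {
                        freq = termDocs.freq();
                        break; // docs come back in ascending order; done here
                    }
                }
                termDocs.close();
                System.out.println(words[i] + " found in doc " + docId
                        + ", " + freq + " times");
            }
        } finally {
            ir.close();
        }
    }
}
```

Called as, e.g., printCounts(indexPath, hits.id(i), "body", new String[] {"hells", "bells"}), this produces exactly the per-word report Kent asked for at the top of the thread.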
RE: WebLucene 0.3 release: CJK support, SAX-based indexing, docID-based result sorting, and XML output with highlighting
Hi,

Do you have an install.txt for a Windows XP setup of WebLucene? It seems that the install.txt covers only the UNIX setup. Thanks.

-----Original Message-----
From: Che Dong [mailto:[EMAIL PROTECTED]
Sent: Sunday, November 30, 2003 9:57 PM
To: Lucene Developers List; Lucene Users List
Subject: WebLucene 0.3 release: CJK support, SAX-based indexing, docID-based result sorting, and XML output with highlighting

> [...]