Re: raw hit count

2003-11-30 Thread Ype Kingma
Kent, Erik,

On Saturday 29 November 2003 17:20, Erik Hatcher wrote:
 I enjoy at least attempting to answer questions here, even if I'm half
 wrong, so by all means correct me if I misspeak

Me too, :)

 On Saturday, November 29, 2003, at 06:37  PM, Kent Gibson wrote:
  All I would like to know is how many times a query was
  found in a particular document. I have no problems
  getting the score from hits.score(). hits.length is
  the number of times in total that the query was found,
  however I want the number of times the query was
  found on a document by document basis. Is this
  possible?

Could you be a bit more precise about what you mean
by 'the number of times the query was found'? For a single
query term it is straightforward, but what about e.g. a query for three
optional terms?


 The 'coord' factor used in computing the score is exactly this.  See
 the javadoc for it:

   http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#coord(int,%20int)

AFAIK, this overlap is the number of terms the document and the query
have in common.
For a query consisting of a single term, the overlap is always one,
and the number of times the query occurs in a document is the term frequency
in the document.

 You could implement a custom Similarity to capture the overlap or
 adjust the factor depending on what you're trying to accomplish.

   The only idea I have is to rerun the search,
  but I can't even see how to run a search on only one
  document!

 You could always rerun a search with a Filter with only one bit enabled
 and see if zero or one document is returned - that would be quite
 trivial and fast.

You could also implement a Similarity that ignores the total number
of terms in the searched document field, see lengthNorm() in
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
As lengthNorm() is applied at indexing time, you will have to reindex
for this to work for you.
At query time you can then use a tf() implementation that is linear, instead
of the default square root in DefaultSimilarity, and a constant idf(),
instead of the default log of the inverse document frequency.
You should then get a document score that is proportional
to the number of query terms in the document.
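
For example, a rough, untested sketch of such a Similarity, subclassing the
DefaultSimilarity mentioned above and overriding just these three methods
(the class name is arbitrary):

import org.apache.lucene.search.DefaultSimilarity;

public class TermCountSimilarity extends DefaultSimilarity
{
    // ignore the field length instead of the default 1/sqrt(numTerms)
    public float lengthNorm(String fieldName, int numTerms)
    {
        return 1.0f;
    }

    // linear in the term frequency instead of the default sqrt(freq)
    public float tf(float freq)
    {
        return freq;
    }

    // constant instead of the default log(numDocs/(docFreq+1)) + 1
    public float idf(int docFreq, int numDocs)
    {
        return 1.0f;
    }
}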

Kind regards,
Ype




WebLucene 0.3 release: supports CJK, SAX based indexing, docID based result sorting and XML format output with highlighting support.

2003-11-30 Thread Che Dong
http://sourceforge.net/projects/weblucene/

WebLucene:
An XML interface to the Lucene search engine, providing SAX based indexing, indexing-sequence
based result sorting, and XML output with highlighting support. The CJKTokenizer supports
Chinese, Japanese and Korean together with Western languages simultaneously.

The key features:
1. Bi-gram based CJK support: org/apache/lucene/analysis/cjk/CJKTokenizer

2. docID based result sorting: org/apache/lucene/search/IndexOrderSearcher

3. XML output: com/chedong/weblucene/search/DOMSearcher

4. SAX based indexing: com/chedong/weblucene/index/SAXIndexer

5. Token based highlighter:
   reverse StopTokenzier:
   org/apache/lucene/anlysis/HighlightAnalyzer.java
   HighlightFilter.java
   with abstract:
   com/chedong/weblucene/search/WebluceneHighlighter

6. A simplified query parser:
   Google-like syntax with a term limit
   org/apache/lucene/queryParser/SimpleQueryParser
   modified from an early version of Lucene's :)

Regards

Che, Dong

Re: raw hit count

2003-11-30 Thread Kent Gibson
Thanks for the help guys, but unfortunately I am still
stuck. Let me reiterate what I would like to do and
then explain what I have tried.

I would like to know how many times query y
appeared in document x.

For example:

query = Bank : Bank found in doc number 1, 3 times

Understandably this is a bit tricky when query y is
composed of more than one word, but for the moment I
would be satisfied if I knew how many times query y
appeared in its entirety.

However, in the end it would be great if I could get a
result as follows:
query = Hells Bells; Hells found in doc number 2, 3
times and Bells found 0 times

As per Erik's idea I tried with the BitSet as follows:

QueryFilter qf = new QueryFilter(query);
IndexReader ir = IndexReader.open(indexPath);
Searcher searcher2 = new IndexSearcher(ir);

// get the bit set for the query
BitSet bits = qf.bits(ir);
last = bits.nextSetBit(offset);
offset = last + 1;

System.out.println("First bit is: " + last);
System.out.println("Bits " + bits.toString());

// clear all the bits
bits.clear();
System.out.println("Bits after " + bits.toString());
bits.set(last);

/* just to see the effect */
BitSet bits2 = qf.bits(ir);
System.out.println("Bits now " + bits2.toString());

Hits hits2 = searcher2.search(query, qf);
/* this value is always one */
System.out.println("raw hits: " + hits2.length());

However I always get a result of 1, which I suppose
has to do with this overlap thingy.

As per Ype's idea I tried to implement a Similarity
object, but I believe two things are wrong: a) I am
doing something fundamentally wrong with the maths, and b)
I have a sneaking suspicion that I am going about this the wrong way.

Is there not a simple way to just get some word
statistics out of a file?

Once again thanks for the inputs and I look forward to
a long fight.

public float lengthNorm(String fieldName, int numTerms)
{
    return 1.0f;
}

/** Linear in freq instead of the default sqrt(freq). */
public float tf(float freq)
{
    return freq;
}

/** Constant instead of the default log(numDocs/(docFreq+1)) + 1. */
public float idf(int docFreq, int numDocs)
{
    return 1.0f;
}
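
Do I also have to install this Similarity somewhere, both before reindexing
(since lengthNorm() is applied at indexing time) and at search time? I am
guessing at something like the following (assuming Similarity.setDefault() and
Searcher.setSimilarity() exist in this version - I have not checked), with
MySimilarity being a class that contains the methods above:

// before (re)indexing and before searching:
Similarity.setDefault(new MySimilarity());
// or only for one searcher:
searcher.setSimilarity(new MySimilarity());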

Re: raw hit count

2003-11-30 Thread Erik Hatcher
On Sunday, November 30, 2003, at 11:13  AM, Kent Gibson wrote:
as per Erik's idea I tried with the BitSet as follows:

QueryFilter qf = new QueryFilter(query);
IndexReader ir = IndexReader.open(indexPath);
Searcher searcher2 = new IndexSearcher(ir);
// get the bit set for the query
BitSet bits = qf.bits(ir);
I did not mean to imply that you should call the bits method in this manner.
In fact, you should not call it at all - the IndexSearcher calls it under the
covers. I was implying that you could write your own Filter subclass
that lights up a single bit corresponding to the document you're
interested in.
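
A rough, untested sketch of such a Filter subclass (the class name is made up):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;

/** Lets only one document number through to the searcher. */
public class SingleDocFilter extends Filter
{
    private int docNum;

    public SingleDocFilter(int docNum)
    {
        this.docNum = docNum;
    }

    public BitSet bits(IndexReader reader) throws IOException
    {
        BitSet bits = new BitSet(reader.maxDoc());
        bits.set(docNum); // only this document may match
        return bits;
    }
}

With your existing searcher, query and document id, hits.length() is then
either 0 or 1:

Hits hits = searcher.search(query, new SingleDocFilter(docId));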

However I always get a result of 1, which I suppose
has to do with this overlap thingy.
No, that is not related to the filter - they are two different concepts.

Is there not a simple way to just get some word
statistics out of a file?
Look at the Lucene index format (from Lucene's main web page).  Term 
frequencies are part of the statistics gathered, of course.  You can 
get at the values there using IndexReader.  This may be a lot 
lower-level than you desire, but what Lucene stores is there for you.

	Erik



Re: raw hit count

2003-11-30 Thread Kent Gibson
Thanks a mil Erik, I tried to make my own Filter class
with a modified bits() method as per below:

if (doc == interestingDoc)
{
    bits.set(doc); // set bit for hit
}

but this baby continues to always return 1!

So then I looked at IndexReader, like you said, and
ended up with something like this; it's probably a
messy way of doing it, but I am happy.

Term term = new Term("body", "mercedes");
IndexReader ir = IndexReader.open(indexPath);
TermDocs termdocs = ir.termDocs(term);

int id = hits.id(i);

while (termdocs.next())
{
    if (termdocs.doc() == id)
    {
        System.out.println(
            "Document number "
            + termdocs.doc()
            + " Freq: "
            + termdocs.freq());
    }
}

It only works for single words, but I reckon I can just
split the query up into its individual terms and then make multiple scans,
something like the sketch below.
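
For instance, something like this (untested), reusing the same IndexReader (ir)
and Hits (hits) as above; the field name and terms are only examples:

String[] words = { "hells", "bells" }; // the query terms, after analysis
int id = hits.id(i); // the document we are interested in

for (int w = 0; w < words.length; w++)
{
    int freq = 0;
    TermDocs termdocs = ir.termDocs(new Term("body", words[w]));
    while (termdocs.next())
    {
        if (termdocs.doc() == id)
        {
            freq = termdocs.freq(); // occurrences of this term in doc id
            break;
        }
    }
    termdocs.close();
    System.out.println(words[w] + " found " + freq + " times in doc " + id);
}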

cheers

kent




RE: WebLucene 0.3 release: supports CJK, SAX based indexing, docID based result sorting and XML format output with highlighting support.

2003-11-30 Thread Tun Lin
Hi,

Do you have an install.txt for setting up WebLucene on Windows XP? It seems that
the install.txt covers only UNIX setup.

Thanks.  
