RE: New Highlighter features

2005-01-03 Thread Bruce Ritchie
 The Highlighter package in CVS has been updated with the following new
 features:

Good stuff. Will this work against the 1.4 release, or only against CVS head?


Regards,

Bruce Ritchie

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
 
  You can also see 'Books like this' example from here 
  
 https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
 
 Well done, uses a term vector, instead of reparsing the orig 
 doc, to form the similarity query. Also I like the way you 
 exclude the source doc in the query, I didn't think of doing 
 that in my code.

I agree, it's a good way to exclude the source doc.
 
 I don't trust calling vector.size() and vector.getTerms() 
 within the loop but I haven't looked at the code to see if it 
 calculates  the results each time or caches them...

From the code I looked at, those calls don't recalculate on every call. 


Regards,

Bruce Ritchie




RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
  From the code I looked at, those calls don't recalculate on 
 every call. 
 
 I was referring to this fragment below from BooksLikeThis.docsLike(), 
 and was mentioning it as the javadoc 
 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html 
 does not say that the values returned by size() and getTerms() are 
 cached, and while the impl may cache them (haven't checked) it's not 
 guaranteed, thus it's safer to put the size() and getTerms() calls 
 outside the loop.
 
   for (int j = 0; j < vector.size(); j++) {
     TermQuery tq = new TermQuery(
         new Term("subject", vector.getTerms()[j]));

I agree with your overall point that it's probably best to put those calls 
outside of the loop. I was just saying that I did look at the implementation, 
and the calls do not recalculate anything. I'm sorry I didn't explain myself 
clearly enough.
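
For anyone following along, here is a rough sketch of what hoisting those two calls looks like. It is only an illustration: TermVectorLike is a hypothetical stand-in for the two TermFreqVector methods discussed above (not the Lucene interface itself), and the "subject:" strings stand in for the TermQuery objects built in BooksLikeThis.docsLike().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two TermFreqVector methods discussed in this
// thread; it is NOT the Lucene interface.
interface TermVectorLike {
    int size();
    String[] getTerms();
}

class HoistedLoop {
    // Calls size() and getTerms() exactly once, before the loop, so the cost
    // is the same whether or not the implementation caches its results.
    static List<String> subjectClauses(TermVectorLike vector) {
        final String[] terms = vector.getTerms();
        final int size = vector.size();
        List<String> clauses = new ArrayList<>(size);
        for (int j = 0; j < size; j++) {
            // Stand-in for: new TermQuery(new Term("subject", terms[j]))
            clauses.add("subject:" + terms[j]);
        }
        return clauses;
    }
}
```

With this shape the javadoc no longer matters: even an implementation that rebuilt the array on every getTerms() call would only pay for it once.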


Regards,

Bruce Ritchie




RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
Christoph,

I'm not entirely certain if this is what you want, but a while back David 
Spencer did code up a 'More Like This' class which can be used for generating 
similarities between documents. I can't seem to find this class in the sandbox 
so I've attached it here. Just repackage and test.
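
In case it helps to see the idea in isolation, below is a minimal sketch of tf-idf weighting with cosine similarity between two documents. It uses plain Java collections rather than Lucene, the class and method names (TfIdfSketch and so on) are made up for illustration, and the idf variant ln(N/df) + 1 is just one common choice, not necessarily the weighting that DefaultSimilarity applies.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class TfIdfSketch {
    // Raw term counts for one tokenized document.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // idf(t) = ln(N / df(t)) + 1, where df(t) is the number of documents
    // containing t. This variant is an assumption made for the sketch.
    static Map<String, Double> inverseDocFrequencies(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String t : new HashSet<>(doc)) {
                df.merge(t, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log((double) corpus.size() / e.getValue()) + 1.0);
        }
        return idf;
    }

    // tf * idf weight for every term of one document.
    static Map<String, Double> weights(List<String> doc, Map<String, Double> idf) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequencies(doc).entrySet()) {
            w.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        return w;
    }

    // Cosine of the angle between two sparse weight vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }
}
```

A term-vector based approach inside Lucene would read the per-document terms and frequencies from the index instead of re-tokenizing text, but the arithmetic is the same.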


Regards,

Bruce Ritchie
http://www.jivesoftware.com/   

 -----Original Message-----
 From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
 Sent: December 14, 2004 11:45 AM
 To: Lucene Users List
 Subject: TFIDF Implementation
 
 Hi,
 My current task/problem is the following: I need to implement 
 TFIDF document term ranking using Jakarta Lucene to compute a 
 similarity rank between arbitrary documents in the constructed index.
 I saw from the API that there are similar functions already 
 implemented in the classes Similarity and DefaultSimilarity, but 
 I don't know exactly how to use them. At the moment my index 
 has about 25000 (small) documents and there are about 75000 
 terms stored in total.
 Now, my question is simple: has anybody done this before, 
 or could anyone point me to another location for help?
 
 Thanks for any help in advance.
 Christoph 
 
 --
 Christoph Kiefer
 
 Department of Informatics, University of Zurich
 
 Office: Uni Irchel 27-K-32
 Phone:  +41 (0) 44 / 635 67 26
 Email:  [EMAIL PROTECTED]
 Web:    http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
 
 
 
 


RE: Faster highlighting with TermPositionVectors

2004-10-28 Thread Bruce Ritchie
Mark,

 Thanks to the recent changes (see CVS) in TermFreqVector 
 support we can now make use of term offset information held 
 in the Lucene index rather than incurring the cost of 
 re-analyzing text to highlight it.
 
 I have created a class (see 
 http://www.inperspective.com/lucene/TokenSources.java) which 
 handles creating a TokenStream from the TermPositionVector 
 stored in the database which can then be passed to the highlighter.
 This approach is significantly faster than re-parsing the 
 original text.
 If people are happy with this class I'll add it to the 
 Highlighter sandbox but it may sit better elsewhere in the 
 Lucene code base as a more general purpose utility.
 
 BTW, as part of putting this together I found that the 
 TermFreq code throws a NullPointerException when indexing fields 
 that produce no tokens (i.e. empty or all stopwords). Otherwise 
 things work very well.

This is great news! While I probably won't have time to test this until 
mid-November, I do look forward to the speed improvements, as the current 
highlighting mechanism (re-parsing the text) is just not performant enough 
under heavy load.
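
To make the idea concrete, here is a rough sketch of highlighting from stored character offsets instead of re-analyzing the text. It is not the Highlighter or TokenSources API: OffsetHighlighter and Span are hypothetical names, and the sketch assumes the offsets of the matching terms have already been read from the index, sorted by start and non-overlapping.

```java
import java.util.List;

class OffsetHighlighter {
    // Hypothetical holder for one stored character span [start, end).
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Wraps each span in <b>...</b>. Assumes spans are sorted by start and do
    // not overlap; no analyzer is run and the text is not re-tokenized.
    static String highlight(String text, List<Span> spans) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Span s : spans) {
            out.append(text, pos, s.start);            // untouched text before the hit
            out.append("<b>").append(text, s.start, s.end).append("</b>");
            pos = s.end;
        }
        out.append(text, pos, text.length());          // remainder after the last hit
        return out.toString();
    }
}
```

The work per document is a single pass over the hits, which is where the saving over re-parsing would come from. A real implementation would also need to escape markup in the text and select fragments.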


Regards,

Bruce Ritchie
http://www.jivesoftware.com




RE: Highlighting PDF file after the search

2004-09-20 Thread Bruce Ritchie
 From: [EMAIL PROTECTED] 

 I can successfully index and search the PDF documents, 
 however I am not able to highlight the searched text in my 
 original PDF file (i.e. like dtSearch highlights on the original file).
 
 I took a look at the highlighter in sandbox, compiled it and 
 have it ready.  I am wondering if this highlighter is for 
 highlighting indexed documents or can it be used for PDF 
 files as is?  Please enlighten!

The highlighter code in sandbox can highlight text *extracted* from the
PDF, but it does nothing to highlight search terms *inside* the PDF
itself. For that you will need some sort of tool that can modify the PDF
on the fly as the user views it. I know of no quick and dirty tool that
does this, though there are quite a few projects and products for
manipulating PDF files which can likely be used to obtain the behavior
you are looking for (with some effort on your part).


Regards,

Bruce Ritchie




RE: org.apache.lucene.search.highlight.Highlighter

2004-05-19 Thread Bruce Ritchie

 Thanks for highlighting the problem with the Javadocs...

Groan. :)


Regards,

Bruce Ritchie




RE: clustering results

2004-04-12 Thread Bruce Ritchie
 -----Original Message-----
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
 Sent: April 11, 2004 1:03 PM
 To: Lucene Users List
 Subject: Re: clustering results
 
 I got all excited reading the subject line 'clustering 
 results', but this isn't really clustering, is it?  This is 
 more sorting.  Does anyone know of any work within Lucene (or 
 another indexer) to do actual subject clustering (i.e. like 
 Vivisimo @ http://vivisimo.com/ or Kartoo @ 
 http://www.kartoo.com/)?  It would be pretty awesome if 
 Lucene had such ability, I know there aren't a whole lot of 
 clustering options, and the commercial products are very expensive.  
 Anyhow, just curious.

The one I know about is Carrot - http://www.cs.put.poznan.pl/dweiss/carrot/


Regards,

Bruce Ritchie
http://www.jivesoftware.com/




Re: Performance of hit highlighting and finding term positions for a specific document

2004-03-30 Thread Bruce Ritchie
Kevin A. Burton wrote:

I'm playing with this package:

http://home.clara.net/markharwood/lucene/highlight.htm

Trying to do hit highlighting.  This implementation uses another 
Analyzer to find the positions for the result terms.
This seems very inefficient, since Lucene already knows the 
frequency and position of given terms in the index.

My question is whether it's hard to find a TermPosition for a given term 
in a given document rather than the whole index.

IndexReader.termPositions( Term term ) is term specific not term and 
document specific.

As far as I know it's not currently possible to get this information from a standard Lucene index.

Also it seems that after all this time that Lucene should have efficient 
hit highlighting as a standard package.  Is there any interest in seeing 
a contribution in the sandbox for this if it uses the index positions?

I've been meaning to look into good ways to store token offset information to allow for very 
efficient highlighting, and I believe Mark may also be looking into improving the highlighter via 
other means such as temporary RAM indexes. Search the archives to get some background on the 
ideas we've tossed around ('Dmitry's Term Vector stuff, plus some' and 'Demoting results' come to 
mind as threads that touch on this topic).

Regards,

Bruce Ritchie
http://www.jivesoftware.com/




Re: MoreLikeThis Query generator - Re: code for more like this query expansion - was - Re: setMaxClauseCount ??

2004-02-25 Thread Bruce Ritchie
David Spencer wrote:

Code rewritten, automagically chooses lots of defaults, lets you override
the defs thru the static vars at the bottom or the non-static vars also 
at the bottom.

I've taken the liberty of updating this code to handle multiple fields and to use the new term 
vector support in CVS, so that retokenizing a document's text isn't necessary if you have a 
document ID whose fields are indexed with term vectors. I've added the Apache 2.0 license to the 
top; however, if that isn't the license you want this code to be released under, let me know and 
I'll change it immediately.

Regards,

Bruce Ritchie
http://www.jivesoftware.com/


MoreLikeThis.java
Description: application/httpd-cgi




Re: MoreLikeThis Query generator - Re: code for more like this query expansion - was - Re: setMaxClauseCount ??

2004-02-18 Thread Bruce Ritchie
David Spencer wrote:

[c] interesting words - uses code from MoreLikeThis to give a table of 
all interesting words in the current source doc ordered by score.
Remember score is idf*tf as per Doug's mail (and as per my
hopefully correct understanding of these things). This page is of course 
more of a debugging tool than something a normal user would see.  One 
possible area of improvement that jumped out at me after reviewing this 
table is using stemming, say, allowing more words in the generated query 
when 2 words have the same stem.

Actually, the analyzer should do that, shouldn't it? For example, I have stemming analyzers for a 
variety of languages that both stem and remove stop words; it seems silly to me to duplicate that 
functionality when it's so easily provided by the analyzer. Given that, I would suggest removing the 
stop word functionality from this class, as it is not needed and only confuses things.

Regards,

Bruce Ritchie
http://www.jivesoftware.com/




Re: MoreLikeThis Query generator - Re: code for more like this query expansion - was - Re: setMaxClauseCount ??

2004-02-18 Thread Bruce Ritchie
David Spencer wrote:

I'd appreciate if someone could proofread MoreLikeThis.like(Reader) and 
mlt(Reader).

At a glance it seems to return reasonable results on my site.

One thing that I would find extremely useful is updating the code to handle multiple fields, since 
many (most?) indexes do not use just one field. I'm in the process of doing just that, as well as 
making some other changes to the code, and will contribute it back if someone doesn't beat me to it 
first.

Regards,

Bruce Ritchie
http://www.jivesoftware.com/




Re: fuzzy searches

2003-11-11 Thread Bruce Ritchie
Thomas Krämer wrote:

now that the topic is clustering methods: has there been any effort in 
implementing Latent semantic indexing in Lucene? Google only indicates 
someone else asking this in February.

Just a note that LSI is encumbered by US patents 4,839,853 and 5,301,109. It would be wise to make 
sure that any implementation is either blessed by the patent holders or does not infringe on the 
patents.

Regards,

Bruce Ritchie





Re: French texts

2003-09-25 Thread Bruce Ritchie
Yes, you can use Lucene to search French documents. The snowball stemmers contribution contains a 
French stemmer. You'll find it at http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/

Regards,

Bruce Ritchie

Gayo Diallo wrote:

Hi,
I just want to know If It's possible to use Lucene for french documents.
Is there any analyser for this language ?
Best regards,
Gayo Diallo




Re: cant rename segments.new to segment

2003-09-19 Thread Bruce Ritchie
Wilton, Reece wrote:

Are people having this same issue on Linux or is this just a Windows
issue?

I've only heard of the issue on Windows. I believe a patch from Matt Tucker was actually 
incorporated into Lucene that made some attempt to work around it.

Regards,

Bruce Ritchie




Re: Find Documents 'Similar' to Another

2003-05-30 Thread Bruce Ritchie
David Medinets wrote:

But I don't understand. Do you have any insight into the product pricing?

No, but I'm sure to find out as I get further along in my testing. I would suggest contacting them 
directly for an answer to that question.

Regards,

Bruce Ritchie

