RE: New Highlighter features
> The Highlighter package in CVS has been updated with the following new features:

Good stuff. Will this work against the 1.4 release or only against CVS head?

Regards,
Bruce Ritchie

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: TFIDF Implementation
>> You can also see the 'Books like this' example here: https://secure.manning.com/catalog/view.php?book=hatcher2&item=source

> Well done, uses a term vector, instead of reparsing the orig doc, to form the similarity query. Also I like the way you exclude the source doc in the query, I didn't think of doing that in my code.

I agree, it's a good way to exclude the source doc.

> I don't trust calling vector.size() and vector.getTerms() within the loop but I haven't looked at the code to see if it calculates the results each time or caches them...

From the code I looked at, those calls don't recalculate on every call.

Regards,
Bruce Ritchie
RE: TFIDF Implementation
>> From the code I looked at, those calls don't recalculate on every call.

> I was referring to this fragment below from BooksLikeThis.docsLike(), and was mentioning it as the javadoc http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html does not say that the values returned by size() and getTerms() are cached, and while the impl may cache them (haven't checked) it's not guaranteed, thus it's safer to put the size() and getTerms() calls outside the loop.
>
>     for (int j = 0; j < vector.size(); j++) {
>         TermQuery tq = new TermQuery( new Term(subject, vector.getTerms()[j]));

I agree with your overall point that it's probably best to put those calls outside of the loop. I was just saying that I did look at the implementation, and the calls do not recalculate anything. I'm sorry I didn't explain myself clearly enough.

Regards,
Bruce Ritchie
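The hoisting discussed in this message can be sketched as follows. This is only an illustration under stated assumptions: `TermVectorLike` is a hypothetical stand-in for Lucene's TermFreqVector that exposes just the two calls in question, and the method returns plain "field:term" strings rather than TermQuery objects so that the example runs without Lucene on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for TermFreqVector; only the two calls discussed above. */
interface TermVectorLike {
    int size();
    String[] getTerms();
}

final class HoistedLoopSketch {
    /** Builds one "field:term" label per term, reading size and terms once. */
    static List<String> clauses(String field, TermVectorLike vector) {
        String[] terms = vector.getTerms(); // fetched once, before the loop
        int size = vector.size();           // likewise evaluated once
        List<String> out = new ArrayList<>(size);
        for (int j = 0; j < size; j++) {
            out.add(field + ":" + terms[j]);
        }
        return out;
    }

    public static void main(String[] args) {
        TermVectorLike vector = new TermVectorLike() {
            public int size() { return 2; }
            public String[] getTerms() { return new String[] {"lucene", "search"}; }
        };
        System.out.println(clauses("subject", vector));
    }
}
```

Written this way, the loop stays correct and cheap whether or not a given implementation caches the results of size() and getTerms().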
RE: TFIDF Implementation
Christoph,

I'm not entirely certain this is what you want, but a while back David Spencer coded up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox, so I've attached it here. Just repackage and test.

Regards,
Bruce Ritchie
http://www.jivesoftware.com/

-----Original Message-----
From: Christoph Kiefer [mailto:[EMAIL PROTECTED]]
Sent: December 14, 2004 11:45 AM
To: Lucene Users List
Subject: TFIDF Implementation

Hi,

My current task is the following: I need to implement TFIDF document term ranking using Jakarta Lucene to compute a similarity rank between arbitrary documents in the constructed index. I saw from the API that similar functions are already implemented in the classes Similarity and DefaultSimilarity, but I don't know exactly how to use them. At the moment my index has about 25000 (small) documents, and about 75000 terms are stored in total.

Now, my question is simple. Has anybody done this before, or could anyone point me to another location for help?

Thanks in advance for any help.

Christoph

--
Christoph Kiefer
Department of Informatics, University of Zurich
Office: Uni Irchel 27-K-32
Phone: +41 (0) 44 / 635 67 26
Email: [EMAIL PROTECTED]
Web: http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
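For readers with the same question, here is a minimal sketch of one common way to turn TFIDF weights into a similarity rank between two documents: weight each term, then take the cosine of the two weight vectors. It is not the attached 'More Like This' class and not what Lucene's Similarity classes compute internally. The textbook weighting tf * ln(numDocs / docFreq), the class name, and the plain maps of term counts are all assumptions made for illustration; in practice the counts would come from the index.

```java
import java.util.HashMap;
import java.util.Map;

final class TfIdfSimilaritySketch {
    /** Weights each term as tf * ln(numDocs / docFreq); terms without a docFreq are skipped. */
    static Map<String, Double> weigh(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs, int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df == null || df == 0) continue;
            weights.put(e.getKey(), e.getValue() * Math.log((double) numDocs / df));
        }
        return weights;
    }

    /** Cosine of the angle between two sparse weight vectors; 0.0 if either is all zeros. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("lucene", 10);
        docFreqs.put("index", 100);
        Map<String, Integer> doc = new HashMap<>();
        doc.put("lucene", 2);
        doc.put("index", 1);
        Map<String, Double> w = weigh(doc, docFreqs, 25000);
        System.out.println(cosine(w, w));
    }
}
```

Documents that share no weighted terms score 0.0, and a document compared with itself scores 1 up to rounding.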
RE: Faster highlighting with TermPositionVectors
Mark,

> Thanks to the recent changes (see CVS) in TermFreqVector support we can now make use of term offset information held in the Lucene index rather than incurring the cost of re-analyzing text to highlight it. I have created a class (see http://www.inperspective.com/lucene/TokenSources.java) which handles creating a TokenStream from the TermPositionVector stored in the database which can then be passed to the highlighter. This approach is significantly faster than re-parsing the original text. If people are happy with this class I'll add it to the Highlighter sandbox but it may sit better elsewhere in the Lucene code base as a more general purpose utility. BTW as part of putting this together I found that the TermFreq code throws a null pointer when indexing fields that produce no tokens (i.e. empty or all stopwords). Otherwise things work very well.

This is great news! While I won't have the time to test this until probably mid-November, I do look forward to the speed improvements, as the current highlighting mechanism (reparsing the text) was just not performant enough under heavy loads.

Regards,
Bruce Ritchie
http://www.jivesoftware.com
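The idea behind building a token stream from stored offsets can be sketched roughly as follows. This is not the TokenSources class linked above. The class name, the `Token` type, and the map from each term to its (start, end) character offsets are assumptions for illustration; a real implementation would read terms and offsets from the TermPositionVector. Because a term vector groups occurrences by term, the main step is to flatten them and sort by start offset to restore text order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OffsetTokenSketch {
    /** One term occurrence with its character offsets in the original text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Flattens per-term offset lists into a single list ordered by start offset. */
    static List<Token> tokensInTextOrder(Map<String, int[][]> offsetsByTerm) {
        List<Token> tokens = new ArrayList<>();
        for (Map.Entry<String, int[][]> e : offsetsByTerm.entrySet()) {
            for (int[] span : e.getValue()) {
                tokens.add(new Token(e.getKey(), span[0], span[1]));
            }
        }
        tokens.sort(Comparator.comparingInt((Token t) -> t.start));
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical offsets for the text "the fox the"
        Map<String, int[][]> offsets = new LinkedHashMap<>();
        offsets.put("fox", new int[][] {{4, 7}});
        offsets.put("the", new int[][] {{0, 3}, {8, 11}});
        for (Token t : tokensInTextOrder(offsets)) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

No analyzer runs here, which is where the speedup over re-parsing the original text comes from.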
RE: Highlighting PDF file after the search
From: [EMAIL PROTECTED]

> I can successfully index and search the PDF documents, however I am not able to highlight the searched text in my original PDF file (i.e. like dtSearch highlights on the original file). I took a look at the highlighter in the sandbox, compiled it and have it ready. I am wondering if this highlighter is for highlighting indexed documents or can it be used for PDF files as is! Please enlighten!

The highlighter code in the sandbox can facilitate highlighting of text *extracted* from the PDF; however, it does nothing to highlight search terms *inside* the PDF. For that you will need some sort of tool that can modify the PDF on the fly as the user views it. I know of no quick and dirty tool that allows you to do this, though there are quite a few projects and products that allow you to manipulate PDF files and can likely be used to obtain the behavior you are looking for (with some effort on your part).

Regards,
Bruce Ritchie

smime.p7s Description: S/MIME cryptographic signature
RE: org.apache.lucene.search.highlight.Highlighter
> Thanks for highlighting the problem with the Javadocs...

Groan. :)

Regards,
Bruce Ritchie
RE: clustering results
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: April 11, 2004 1:03 PM
To: Lucene Users List
Subject: Re: clustering results

> I got all excited reading the subject line 'clustering results' but this isn't really clustering, is it? This is more sorting. Does anyone know of any work within Lucene (or another indexer) to do actual subject clustering (i.e. like Vivisimo @ http://vivisimo.com/ or Kartoo @ http://www.kartoo.com/)? It would be pretty awesome if Lucene had such an ability. I know there aren't a whole lot of clustering options, and the commercial products are very expensive. Anyhow, just curious.

The one I know about is Carrot - http://www.cs.put.poznan.pl/dweiss/carrot/

Regards,
Bruce Ritchie
http://www.jivesoftware.com/
Re: Performance of hit highlighting and finding term positions for a specific document
Kevin A. Burton wrote:

> I'm playing with this package: http://home.clara.net/markharwood/lucene/highlight.htm Trying to do hit highlighting. This implementation uses another Analyzer to find the positions for the result terms. This seems very inefficient, since Lucene already knows the frequency and position of given terms in the index. My question is whether it's hard to find a TermPosition for a given term in a given document rather than the whole index. IndexReader.termPositions( Term term ) is term specific, not term and document specific.

As far as I know it's not currently possible to get this information from a standard Lucene index.

> Also it seems that after all this time Lucene should have efficient hit highlighting as a standard package. Is there any interest in seeing a contribution in the sandbox for this if it uses the index positions?

I've been meaning to look into good ways to store token offset information to allow for very efficient highlighting, and I believe Mark may also be looking into improving the highlighter via other means such as temporary RAM indexes. Search the archives for background on some of the ideas we've tossed around ('Dmitry's Term Vector stuff, plus some' and 'Demoting results' come to mind as threads that touch this topic).

Regards,
Bruce Ritchie
http://www.jivesoftware.com/
Re: MoreLikeThis Query generator - Re: code for more like this query expansion - was - Re: setMaxClauseCount ??
David Spencer wrote:

> Code rewritten, automagically chooses lots of defaults, lets you override the defs thru the static vars at the bottom or the non-static vars also at the bottom.

I've taken the liberty of updating this code to handle multiple fields and to use the new term vector support in CVS, so that retokenizing a document's text isn't necessary if you have a document ID whose fields are indexed with term vector support. I've added the Apache 2.0 license to the top; however, if that isn't the license you want this code to be released under, let me know and I'll change it immediately.

Regards,
Bruce Ritchie
http://www.jivesoftware.com/

MoreLikeThis.java Description: application/httpd-cgi
Re: MoreLikeThis Query generator - Re: code for more like this query expansion - was - Re: setMaxClauseCount ??
David Spencer wrote:

> [c] interesting words - uses code from MoreLikeThis to give a table of all interesting words in the current source doc ordered by score. Remember score is idf*tf as per Doug's mail (and as per my hopefully correct understanding of these things). This page is of course more of a debugging tool than something a normal user would see. One possible area of improvement that jumped out at me after reviewing this table is using stemming, say, allowing more words in the generated query when 2 words have the same stem.

Actually, the analyzer should do that, shouldn't it? For example, I have stemming analyzers for a variety of languages that both stem and remove stop words - it seems silly to me to duplicate that functionality when it's so easily provided by the analyzer. Given that, I would suggest removing the stop word functionality from this class, as it is not needed and only confuses things.

Regards,
Bruce Ritchie
http://www.jivesoftware.com/
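The 'interesting words' ordering described in this message can be sketched as below. This is not the MoreLikeThis code itself. The class name, the plain maps of counts, and the particular idf form ln(numDocs / docFreq) are assumptions for illustration; the thread only states that the score is idf*tf.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InterestingWordsSketch {
    /** Returns up to max terms from the source doc, highest tf * idf first. */
    static List<String> topTerms(Map<String, Integer> termFreqs,
                                 Map<String, Integer> docFreqs,
                                 int numDocs, int max) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df == null || df == 0) continue; // term unknown to the index
            double idf = Math.log((double) numDocs / df);
            scored.add(new AbstractMap.SimpleEntry<>(
                    e.getKey(), Double.valueOf(e.getValue() * idf)));
        }
        // Sort by score, highest first
        scored.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(max, scored.size()); i++) {
            out.add(scored.get(i).getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new HashMap<>();
        tf.put("the", 10);
        tf.put("lucene", 3);
        tf.put("index", 2);
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 1000);
        df.put("lucene", 10);
        df.put("index", 100);
        System.out.println(topTerms(tf, df, 1000, 2));
    }
}
```

With this weighting, a word that appears in every document gets an idf of zero and drops to the bottom of the table regardless of stop word handling. Analyzer-level stemming would simply change which strings arrive here as terms.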
Re: MoreLikeThis Query generator - Re: code for more like this query expansion - was - Re: setMaxClauseCount ??
David Spencer wrote:

> I'd appreciate if someone could proofread MoreLikeThis.like(Reader) and mlt(Reader). At a glance it seems to return reasonable results on my site.

One thing that I would find extremely useful is updating the code to handle multiple fields, since many (most?) indexes do not use just 1 field. I'm in the process of doing just that, as well as making some other changes to the code, and will contribute it back if someone doesn't beat me to it first.

Regards,
Bruce Ritchie
http://www.jivesoftware.com/
Re: fuzzy searches
Thomas Krämer wrote:

> now that the topic is clustering methods: has there been any effort in implementing Latent semantic indexing in Lucene? Google only indicates someone else asking this in February.

Just a note that LSI is encumbered by US patents 4,839,853 and 5,301,109. It would be wise to make sure that any implementation is either blessed by the patent holders or does not infringe on the patents.

Regards,
Bruce Ritchie
Re: French texts
Yes, you can use Lucene to search French documents. The snowball stemmers contribution contains a French stemmer - you'll find it at http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/

Regards,
Bruce Ritchie

Gayo Diallo wrote:

> Hi, I just want to know if it's possible to use Lucene for French documents. Is there any analyser for this language?
>
> Best regards,
> Gayo Diallo
Re: cant rename segments.new to segment
Wilton, Reece wrote:

> Are people having this same issue on Linux or is this just a Windows issue?

I've only heard of the issue on Windows - I believe a patch from Matt Tucker was actually incorporated into Lucene that made some attempt to work around this issue.

Regards,
Bruce Ritchie
Re: Find Documents 'Similar' to Another
David Medinets wrote:

> But I don't understand. Do you have any insight into the product pricing?

No, but I'm sure to find out as I get further along in my testing. I would suggest contacting them directly for an answer to that question.

Regards,
Bruce Ritchie