Re: finalize delete without optimize

2004-12-14 Thread Otis Gospodnetic
Hello John, Once you make your change locally, use 'cvs diff -u IndexWriter.java > indexwriter.patch' to make a patch. Then open a new Bugzilla entry. Finally, attach your patch to that entry. Note that Document deletion is actually done from IndexReader, so your patch may have to be on

Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Akmal Sarhan
That sounds very interesting, but how do you handle queries like select * from MY_TABLE where MY_NUMERIC_FIELD 80? As far as I know, you have only the range query, so you will have to say my_numeric_field:[80 TO ??], but this would not work in the above-mentioned example. Or am I missing something? regards

Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Praveen Peddi
Hmm. So far all our fields are just strings. But I would guess you should be able to use Integer.MAX_VALUE or something on the upper bound. Or there might be a better way of doing it. Praveen - Original Message - From: Akmal Sarhan [EMAIL PROTECTED] To: Lucene Users List [EMAIL
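A minimal sketch of the upper-bound idea above, assuming terms are compared as strings, so numbers have to be zero-padded to a fixed width at both index time and query time. The class and method names are hypothetical, and no Lucene classes are used; only the field:[low TO high] syntax comes from the thread.

```java
import java.util.Locale;

// Sketch only: builds the query text, it does not run a search.
final class PaddedRange {
    // Number of decimal digits in Integer.MAX_VALUE (2147483647).
    static final int WIDTH = 10;

    // Left-pad a non-negative int with zeros so string order matches numeric order.
    static String pad(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        return String.format(Locale.ROOT, "%0" + WIDTH + "d", value);
    }

    // Inclusive range clause with Integer.MAX_VALUE standing in for "no upper bound".
    static String atLeast(String field, int low) {
        return field + ":[" + pad(low) + " TO " + pad(Integer.MAX_VALUE) + "]";
    }

    public static void main(String[] args) {
        System.out.println(atLeast("my_numeric_field", 80));
    }
}
```

Without the padding, the string "9" would sort after "80", so an unpadded range would match the wrong values.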

Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread petite_abeille
On Dec 14, 2004, at 15:40, Kevin L. Cobb wrote: Was wondering if anyone out there was doing the same or if there are any dissenting opinions on using Lucene for this purpose. ZOE [1] [2] takes the same approach and uses Lucene as a relational engine of sorts. However, for both practical and

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote: Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox Ot oh, sorry, I'll try to get this

Re: [RFE] IndexWriter.updateDocument()

2004-12-14 Thread David Spencer
petite_abeille wrote: Well, the subject says it all... If there is one thing which is overly cumbersome in Lucene, it's updating documents, therefore this Request For Enhancement: Please consider enhancing the IndexWriter API to include an updateDocument(...) method to take care of all the gory

RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Monsur Hossain
My concern is that this just shifts the scaling issue to Lucene, and I haven't found much info on how to scale Lucene vertically. By vertically, of course, I meant horizontally. Basically scaling it across servers as one might do with a relational database.

Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Chris Hostetter
: select * from MY_TABLE where MY_NUMERIC_FIELD 80 : : as far as I know you have only the range query so you will have to say : : my_numeric_field:[80 TO ??] : but this would not work in the a/m example or am I missing something? RangeQuery allows you to use an open-ended range -- you can tell the

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Otis Gospodnetic wrote: You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2item=source Well done, uses a term vector, instead of reparsing the orig doc, to form the similarity query. Also I like the way you exclude the source doc in the

[RFE] IndexWriter.updateDocument()

2004-12-14 Thread petite_abeille
Well, the subject says it all... If there is one thing which is overly cumbersome in Lucene, it's updating documents, therefore this Request For Enhancement: Please consider enhancing the IndexWriter API to include an updateDocument(...) method to take care of all the gory details involved in

RE: TFIDF Implementation

2004-12-14 Thread Otis Gospodnetic
You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2item=source Otis --- Bruce Ritchie [EMAIL PROTECTED] wrote: Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like

RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2item=source Well done, uses a term vector, instead of reparsing the orig doc, to form the similarity query. Also I like the way you exclude the source doc in the query, I

RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Otis Gospodnetic
Well, one could always partition an index, distribute pieces of it horizontally across multiple 'search servers' and use the built-in RMI-based and Parallel search feature. Nutch uses something similar for search scaling. Otis --- Monsur Hossain [EMAIL PROTECTED] wrote: My concern is that
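As a rough illustration of the partitioning idea, the sketch below merges the hit lists returned by several index partitions into one ranked list. It is plain Java with hypothetical names (ShardMerge, Hit, topK), not the RMI-based or parallel search classes mentioned above, and it assumes the scores from different partitions are directly comparable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch only: each inner list stands for the hits one search server returned.
final class ShardMerge {
    // One scored hit; 'id' and 'score' are illustrative field names.
    static final class Hit {
        final String id;
        final double score;
        Hit(String id, double score) { this.id = id; this.score = score; }
    }

    // Pool the hits from every partition and keep the k highest-scoring ones.
    static List<Hit> topK(List<List<Hit>> perShard, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> shardA = Arrays.asList(new Hit("a1", 0.9), new Hit("a2", 0.2));
        List<Hit> shardB = Arrays.asList(new Hit("b1", 0.5));
        for (Hit h : topK(Arrays.asList(shardA, shardB), 2)) {
            System.out.println(h.id + " " + h.score);
        }
    }
}
```

Pooling and re-sorting is the simplest possible merge; a real deployment would also have to decide how term statistics are shared so that scores mean the same thing on every server.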

RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
From the code I looked at, those calls don't recalculate on every call. I was referring to this fragment below from BooksLikeThis.docsLike(), and was mentioning it as the javadoc http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html does not say that

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote: From the code I looked at, those calls don't recalculate on every call. I was referring to this fragment below from BooksLikeThis.docsLike(), and was mentioning it as the javadoc http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html does

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote: You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2item=source Well done, uses a term vector, instead of reparsing the orig doc, to form the similarity query. Also I like the way you exclude the source doc in

Indexing a large number of DB records

2004-12-14 Thread Homam S.A.
I'm trying to index a large number of records from the DB (a few million). Each record will be stored as a document with about 30 fields, most of them are UnStored and represent small strings or numbers. No huge DB Text fields. But I'm running out of memory very fast, and the indexing is slowing

A question about scoring function in Lucene

2004-12-14 Thread Nhan Nguyen Dang
Hi all, Lucene scores a document based on the correlation between the query q and document d: (this is the raw function, I don't pay attention to the boost_t, coord_q_d factor) score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t) (*) Could anybody explain it in detail? Or are there any
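The formula (*) can be typeset as follows. This is only a restatement of the expression in the message, reading q as the query, d as the document, and t as the term being summed over; the grouping into two fractions follows ordinary left-to-right evaluation of the original.

```latex
\mathrm{score}_d \;=\; \sum_{t}
  \frac{tf_q \cdot idf_t}{norm_q}
  \cdot
  \frac{tf_d \cdot idf_t}{norm_{d,t}}
```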

LUCENE1.4.1 - LUCENE1.4.2 - LUCENE1.4.3 Exception

2004-12-14 Thread Karthik N S
Hi Guys, Could somebody please tell me what this exception I am getting is? System specifications: O/S Linux Gentoo, Appserver Apache Tomcat/4.1.24, JDK build 1.4.2_03-b02, Lucene 1.4.1, 1.4.2, 1.4.3. Note: This exception is displayed on every 2nd query after Tomcat is started. java.io.IOException: Stale NFS

RE: A question about scoring function in Lucene

2004-12-14 Thread Chuck Williams
Nhan, Re. your two differences: 1 is not a difference. Norm_d and Norm_q are both independent of t, so summing over t has no effect on them. I.e., Norm_d * Norm_q is constant wrt the summation, so it doesn't matter if the sum is over just the numerator or over the entire fraction, the
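The point about the norms can be written as a one-line identity. Here a_t is a hypothetical name for the per-term numerator, and Norm_d and Norm_q are the norms named in the message; because they do not depend on t, they can be moved outside the sum.

```latex
\sum_{t} \frac{a_t}{Norm_d \cdot Norm_q}
\;=\;
\frac{1}{Norm_d \cdot Norm_q} \sum_{t} a_t
```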

Re: finalize delete without optimize

2004-12-14 Thread Otis Gospodnetic
Hello John, I believe you didn't get any replies to this. What you are describing cannot be done using the public, but maaay (no source code on this machine, so I can't double-check that) be doable if you use some of the 'internal' methods. I don't have the need for this, but others might, so

Re: finalize delete without optimize

2004-12-14 Thread John Wang
Hi Otis: Thanks for your reply. I am looking for more of an API call than a tool. e.g. IndexWriter.finalizeDelete() If I implement this, how would I go about submitting a patch? thanks -John On Mon, 13 Dec 2004 22:24:12 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote:

Re: Hit 2 has score less than Hit 3

2004-12-14 Thread Erik Hatcher
On Dec 14, 2004, at 4:53 AM, Vikas Gupta wrote: I have come across a scenario where the hits returned are not sorted. Or maybe they are sorted but the explanation is not correct. Take a look at http://cofferdam.cs.utexas.edu:8080/search.jsp? query=space+odysseyhitsPerPage=10hitsPerSite=0 This

Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Praveen Peddi
We also use Lucene for a similar purpose, except that we index and store quite a few fields. In fact, I also update partial documents as people suggested. I store all the indexed fields so I don't have to build the whole document again while updating a partial document. The reason we do this is due to

Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Nader Henein
How big do you expect it to get, and how often do you expect to update it? We've been using Lucene for about 1 M records (19 fields each) with incremental updates every 10 minutes. The performance during updates wasn't wonderful, so it took some seriously intense code to sort that out, as you

RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox so I've attached it here. Just repackage and test.

TFIDF Implementation

2004-12-14 Thread Christoph Kiefer
Hi, My current task/problem is the following: I need to implement TFIDF document term ranking using Jakarta Lucene to compute a similarity rank between arbitrary documents in the constructed index. I saw from the API that there are similar functions already implemented in the class Similarity and
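To make the task concrete, here is a small sketch of TF-IDF weighting with cosine similarity between two tokenized documents. It is plain Java with hypothetical names and does not use the Similarity class; the smoothed idf, ln(N / df) + 1, is one common variant chosen for illustration, not necessarily the formula Lucene applies.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: documents are pre-tokenized lists of terms.
final class TfIdfSketch {
    // Raw term counts for one document.
    static Map<String, Integer> termFreq(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // Smoothed idf: a term found in every document still gets weight 1.
    static double idf(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) df++;
        }
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df) + 1.0;
    }

    // TF-IDF weight vector for one document, keyed by term.
    static Map<String, Double> weights(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreq(doc).entrySet()) {
            w.put(e.getKey(), e.getValue() * idf(e.getKey(), corpus));
        }
        return w;
    }

    // Cosine of the angle between two sparse weight vectors; 0 if either is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("lucene", "index", "search"),
            Arrays.asList("lucene", "query", "search"),
            Arrays.asList("cooking", "recipe"));
        System.out.println(cosine(weights(corpus.get(0), corpus), weights(corpus.get(1), corpus)));
    }
}
```

In an index with stored term vectors, the per-document counts would come from the index rather than from re-tokenizing the text.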

RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Otis Gospodnetic
You can see a Flickr-like tag (lookup) system at my Simpy site ( http://www.simpy.com ). It uses Lucene as the backend for lookups, but still uses an RDBMS as the primary storage. I find that keeping the RDBMS and Lucene indices is a bit of a pain and error prone, so _thin_ storage layer with

Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Daniel Naber
On Tuesday 14 December 2004 20:13, Monsur Hossain wrote: My concern is that this just shifts the scaling issue to Lucene, and I haven't found much info on how to scale Lucene vertically. You can easily use MultiSearcher to search over several indices. If you want the distribution to be more

Re: Indexing a large number of DB records

2004-12-14 Thread Otis Gospodnetic
Hello, There are a few things you can do: 1) Don't just pull all rows from the DB at once. Do that in batches. 2) If you can get a Reader from your SqlDataReader, consider this:
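The batching advice in point 1 can be sketched as splitting the row range into fixed-size windows and handling one window at a time, so only one batch of rows is held in memory. This is a plain Java illustration with hypothetical names (BatchSketch, windows); how each window maps onto an actual database query is left open.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: computes the [start, end) row windows, it does not touch a database.
final class BatchSketch {
    // Split the half-open range [0, total) into consecutive windows of at most batchSize rows.
    static List<int[]> windows(int total, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> out = new ArrayList<>();
        for (int start = 0; start < total; start += batchSize) {
            out.add(new int[] { start, Math.min(start + batchSize, total) });
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] w : windows(2500, 1000)) {
            // A real indexer would fetch rows w[0]..w[1]-1 here and add them to the index.
            System.out.println(w[0] + " to " + w[1]);
        }
    }
}
```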

Re: A question about scoring function in Lucene

2004-12-14 Thread Vikas Gupta
Lucene uses the vector space model. To understand that: -Read section 2.1 of Space optimizations for Total Ranking paper (Linked here http://lucene.sourceforge.net/publications.html) -Read section 6 to 6.4 of http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf -Read section 1 of

Re: Indexing a large number of DB records

2004-12-14 Thread Homam S.A.
Thanks Otis! What do you mean by building it in batches? Does it mean I should close the IndexWriter every 1000 rows and reopen it? Does that release references to the document objects so that they can be garbage-collected? I'm calling optimize() only at the end. I agree that 1500 documents is