Re: TFIDF Implementation

2004-12-15 Thread Christoph Kiefer
David, Bruce, Otis, Thank you all for the quick replies. I looked through the BooksLikeThis example. I also agree, it's a very good and effective way to find similar docs in the index. Nevertheless, what I need is really a similarity matrix holding all TF*IDF values. For illustration I quick and
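Computing the full TF*IDF similarity matrix Christoph asks for can be sketched outside Lucene in a few lines of Python (toy corpus and the classic tf * log(N/df) weighting; illustrative only, not the Lucene API):

```python
import math
from collections import Counter

# Toy corpus standing in for the documents in the index.
docs = [
    "lucene index search",
    "lucene scoring model",
    "database index batch",
]

tfs = [Counter(d.split()) for d in docs]
df = Counter(t for tf in tfs for t in tf)   # document frequency per term
n = len(docs)

def tfidf(tf):
    # Classic tf * log(N / df) weighting.
    return {t: c * math.log(n / df[t]) for t, c in tf.items()}

weights = [tfidf(tf) for tf in tfs]

# Similarity matrix: dot products of the TF*IDF vectors.
sim = [[sum(wi[t] * wj.get(t, 0.0) for t in wi) for wj in weights]
       for wi in weights]

for row in sim:
    print(["%.3f" % v for v in row])
```

Docs that share a term get a positive entry; disjoint docs get zero. Normalizing each row by vector length would turn the entries into cosines.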

Re: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Hello Homam, The batches I was referring to were batches of DB rows. Instead of SELECT * FROM table... do SELECT * FROM table ... LIMIT Y OFFSET X. Don't close the IndexWriter - use the single instance. There is no MakeStable()-like method in Lucene, but you can control the number of in-memory

RE: Indexing a large number of DB records

2004-12-15 Thread Garrett Heaver
Hi Homam, I had a similar problem as you in that I was indexing A LOT of data. Essentially, how I got round it was to batch the index. What I was doing was to add 10,000 documents to a temporary index, use addIndexes() to merge the temporary index into the live index (which also optimizes the live

RE: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Note that this really includes some extra steps. You don't need a temp index. Add everything to a single index using a single IndexWriter instance. No need to call addIndexes nor optimize until the end. Adding Documents to an index takes a constant amount of time, regardless of the index size,
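The approach the thread converges on - page through the table in fixed-size batches and feed every row to one writer - can be sketched in Python, with sqlite3 standing in for the real database and a plain list standing in for the single IndexWriter instance (illustrative only):

```python
import sqlite3

# In-memory table standing in for the real DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs (body) VALUES (?)",
                 [("row %d" % i,) for i in range(25)])

BATCH = 10
indexed = []          # stands in for the single IndexWriter instance

offset = 0
while True:
    # Fetch one batch of rows rather than the whole table at once.
    rows = conn.execute(
        "SELECT id, body FROM docs ORDER BY id LIMIT ? OFFSET ?",
        (BATCH, offset)).fetchall()
    if not rows:
        break
    for _id, body in rows:
        indexed.append(body)   # IndexWriter.addDocument(...) in Lucene
    offset += BATCH

print(len(indexed))
```

Note that on very large tables OFFSET pagination re-scans skipped rows on each query; keyset pagination (WHERE id > last_seen_id) avoids that, but the batching structure is the same.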

Re: C# Ports

2004-12-15 Thread Ben Litchfield
I have created a DLL from the Lucene jars for use in the PDFBox project. It uses IKVM (http://www.ikvm.net) to create a DLL from a jar. The binary version can be found here http://www.csh.rit.edu/~ben/projects/pdfbox/nightly-release/PDFBox-.NET-0.7.0-dev.zip This includes the ant script used to

Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chuck Williams wrote: I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize. The pure vector space model implements a cosine in the strictly positive sector of the coordinate space. This is guaranteed intrinsically
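The pure vector-space cosine Doug is contrasting with Lucene's formula can be sketched in Python; for non-negative term weights the cosine is intrinsically bounded to [0, 1], which is exactly the normalization property under discussion (toy vectors, illustrative only):

```python
import math

def cosine(q, d):
    """Cosine of two sparse term-weight vectors. For non-negative
    weights the result is always in [0, 1]."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"doug": 1.0, "cutting": 1.0}
d1 = {"doug": 2.0, "cutting": 3.0}    # matches both query terms
d2 = {"doug": 1.0, "lucene": 5.0}     # one query term, long document

print(cosine(q, d1), cosine(q, d2))
```

Dividing by the query norm (norm_q) only rescales every document by the same constant; it is the document norm (norm_d) that makes the scores comparable across documents, which is the crux of the thread.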

Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Daniel Naber
On Wednesday 15 December 2004 19:29, Mike Snare wrote: In my case, the words are keywords that must remain as is, searchable with the hyphen in place. It was easy enough to modify the tokenizer to do what I need, so I'm not really asking for help there. I'm really just curious as to why it is

Re: A question about scoring function in Lucene

2004-12-15 Thread Chris Hostetter
: I question whether such scores are more meaningful. Yes, such scores : would be guaranteed to be between zero and one, but would 0.8 really be : meaningful? I don't think so. Do you have pointers to research which : demonstrates this? E.g., when such a scoring method is used, that :

Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Mike Snare
a-1 is considered a typical product name that needs to be unchanged (there's a comment in the source that mentions this). Indexing hyphen-word as two tokens has the advantage that it can then be found with the following queries: hyphen-word (will be turned into a phrase query internally)

Re: A question about scoring function in Lucene

2004-12-15 Thread Otis Gospodnetic
There is one case that I can think of where this 'constant' scoring would be useful, and I think Chuck already mentioned this 1-2 months ago. For instance, having such scores would allow one to create alert applications where queries run by some scheduler would trigger an alert whenever the score

Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Erik Hatcher
On Dec 15, 2004, at 3:14 PM, Mike Snare wrote: [...] In addition, why do we assume that a-1 is a typical product name but a-b isn't? I am in no way second-guessing or suggesting a change; it just doesn't make sense to me, and I'm trying to understand. It is very likely, as is oft the case, that

RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
I'll try to address all the comments here. The normalization I proposed a while back on lucene-dev is specified. Its properties can be analyzed, so there is no reason to guess about them. Re. Hoss's example and analysis, yes, I believe it can be demonstrated that the proposed normalization would

Re: LUCENE1.4.1 - LUCENE1.4.2 - LUCENE1.4.3 Exception

2004-12-15 Thread Nader Henein
This is an OS file system error, not a Lucene issue (not for this board). Google it for Gentoo specifically and you get a whole bunch of results, one of which is this thread on the Gentoo Forums: http://forums.gentoo.org/viewtopic.php?t=9620 Good Luck Nader Henein Karthik N S wrote: Hi

RE: A question about scoring function in Lucene

2004-12-15 Thread Nhan Nguyen Dang
Thanks for your answer. In Lucene's scoring function, only norm_q is used, but for a given query, norm_q is the same for all documents, so norm_q does not actually affect the score. norm_d is different: each document has its own norm_d, and it affects the score of document d for query q. If you drop

C# Ports

2004-12-15 Thread Garrett Heaver
I was just wondering what tools (JLCA?) people are using to port Lucene to C#, as I'd be well interested in converting things like the snowball stemmers, WordNet, etc. Thanks Garrett

RE: C# Ports

2004-12-15 Thread George Aroush
Hi Garrett, If you are referring to dotLucene (http://sourceforge.net/projects/dotlucene/) then I can tell you how -- not too long ago I posted on this list how I ported 1.4 and 1.4.3 to C#, please search the list for the answer -- you can't just use JLCA. As for the snowball, I have already

RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
Nhan, You are correct that dropping the document norm does cause Lucene's scoring model to deviate from the pure vector space model. However, including norm_d would cause other problems -- e.g., with short queries, as are typical in reality, the resulting scores with norm_d would all be

Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Mike Snare
I am writing a tool that uses Lucene, and I immediately ran into a problem searching for words that contain internal hyphens (dashes). After looking at the StandardTokenizer, I saw that it was because there is no rule that will match ALPHA P ALPHA or ALPHANUM P ALPHANUM. Based on what I can tell
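The behavior Mike describes - a rule matching ALPHANUM ("-" ALPHANUM)* so internal hyphens survive as one token - can be sketched with a regex in Python (StandardTokenizer's real grammar is JFlex, so this regex is only an approximation of the modified rule):

```python
import re

# Allow an alphanumeric run followed by any number of "-run" groups,
# so "wi-fi" and "a-1" come out as single tokens.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("the wi-fi spec and the a-1 sauce"))
```

The trade-off raised later in the thread still applies: keeping "wi-fi" whole means a query for just "wi" or "fi" no longer matches, whereas splitting lets a phrase query find both variants.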

Re: TFIDF Implementation

2004-12-15 Thread David Spencer
Christoph Kiefer wrote: David, Bruce, Otis, Thank you all for the quick replies. I looked through the BooksLikeThis example. I also agree, it's a very good and effective way to find similar docs in the index. Nevertheless, what I need is really a similarity matrix holding all TF*IDF values. For

Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Otis Gospodnetic wrote: There is one case that I can think of where this 'constant' scoring would be useful, and I think Chuck already mentioned this 1-2 months ago. For instance, having such scores would allow one to create alert applications where queries run by some scheduler would trigger an

Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chris Hostetter wrote: For example, using the current scoring equation, if i do a search for Doug Cutting and the results/scores i get back are... 1: 0.9 2: 0.3 3: 0.21 4: 0.21 5: 0.1 ...then there are at least two meaningful pieces of data I can glean:

Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Daniel Naber
On Wednesday 15 December 2004 21:14, Mike Snare wrote: Also, the phrase query would place the same value on a doc that simply had the two words as a doc that had the hyphenated version, wouldn't it? This seems odd. Not if these words are spelling variations of the same concept, which doesn't

File locking using java.nio.channels.FileLock

2004-12-15 Thread John Wang
Hi: When is Lucene planning on moving toward Java 1.4+? I see there are some problems caused by the current lock file implementation, e.g. Bug# 32171. The problems would be easily fixed by using the java.nio.channels.FileLock object. Thanks -John
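The semantics John is after - an advisory lock that a second opener fails to acquire while the first still holds it - can be sketched in Python with POSIX flock; java.nio.channels.FileLock, available since Java 1.4, gives the JVM a comparable primitive (POSIX-only sketch, illustrative):

```python
import fcntl
import tempfile

# Two handles on the same lock file; flock(2) locks are advisory and
# attached to the open file description, so the handles contend.
lock_file = tempfile.NamedTemporaryFile(suffix=".lock")
first = open(lock_file.name, "w")
second = open(lock_file.name, "w")

fcntl.flock(first, fcntl.LOCK_EX | fcntl.LOCK_NB)       # first locker wins

try:
    fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking try
    contended = False
except BlockingIOError:
    contended = True                                    # lock already held

fcntl.flock(first, fcntl.LOCK_UN)                       # release
fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)      # now succeeds

print("second locker blocked while first held the lock:", contended)
```

The appeal over Lucene's 1.3-era lock files is the same in both languages: the OS releases the lock automatically if the process dies, so no stale write.lock is left behind.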