David, Bruce, Otis,
Thank you all for the quick replies. I looked through the BooksLikeThis
example. I also agree, it's a very good and effective way to find
similar docs in the index. Nevertheless, what I need is really a
similarity matrix holding all TF*IDF values. For illustration I quick
and
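The full TF*IDF similarity matrix Christoph asks about can be sketched in a few lines of plain Python (this is an illustration only, not Lucene code; all function and variable names here are made up):

```python
# Sketch: build a TF*IDF vector per document and a full
# document-by-document cosine similarity matrix.
import math

def tfidf_matrix(docs):
    """Return (vocabulary, TF*IDF vectors, doc-doc similarity matrix)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    # document frequency of each term
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    # one TF*IDF vector per document (smoothed IDF)
    vectors = []
    for toks in tokenized:
        vectors.append([toks.count(t) * (math.log(n / df[t]) + 1.0)
                        for t in vocab])
    # cosine similarity for every document pair
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    sim = [[cos(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    return vocab, vectors, sim

vocab, vecs, sim = tfidf_matrix(["lucene index search",
                                 "lucene query search",
                                 "cooking recipes"])
```

In a real index one would read the term frequencies out of Lucene's TermDocs/TermEnum rather than re-tokenizing, but the shape of the matrix is the same.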
Hello Homam,
The batches I was referring to were batches of DB rows.
Instead of SELECT * FROM table ..., do SELECT * FROM table ... LIMIT Y
OFFSET X.
Don't close IndexWriter - use the single instance.
There is no MakeStable()-like method in Lucene, but you can control the
number of in-memory
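The batching pattern described above can be sketched like this (a Python/sqlite3 stand-in for the real database; table and column names are invented for illustration):

```python
# Sketch: page through the table in batches instead of one huge
# SELECT *, feeding each row to the single, long-lived writer.
import sqlite3

def fetch_in_batches(conn, batch_size):
    """Yield rows page by page using LIMIT/OFFSET."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, body FROM docs ORDER BY id LIMIT ? OFFSET ?",
            (batch_size, offset),
        ).fetchall()
        if not rows:
            break
        for row in rows:
            yield row  # hand each row to the (single) IndexWriter
        offset += batch_size

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO docs (id, body) VALUES (?, ?)",
                 [(i, "document %d" % i) for i in range(5)])
all_rows = list(fetch_in_batches(conn, batch_size=2))
```

The key point from the advice above is that the writer on the receiving end stays open across all batches; only the SELECT is chunked.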
Hi Homam,
I had a similar problem to yours in that I was indexing A LOT of data.
Essentially how I got round it was to batch the index.
What I was doing was to add 10,000 documents to a temporary index, use
addIndexes() to merge the temporary index into the live index (which also
optimizes the live
Note that this really includes some extra steps.
You don't need a temp index. Add everything to a single index using a
single IndexWriter instance. No need to call addIndexes nor optimize
until the end. Adding Documents to an index takes a constant amount of
time, regardless of the index size,
I have created a DLL from the lucene jars for use in the PDFBox project.
It uses IKVM(http://www.ikvm.net) to create a DLL from a jar.
The binary version can be found here
http://www.csh.rit.edu/~ben/projects/pdfbox/nightly-release/PDFBox-.NET-0.7.0-dev.zip
This includes the ant script used to
Chuck Williams wrote:
I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize. The pure vector space model implements a cosine in the strictly positive sector of the coordinate space. This is guaranteed intrinsically
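The property Chuck is appealing to can be checked with a toy computation (plain Python, nothing Lucene-specific): for vectors whose components are all non-negative, as TF*IDF weights are, the cosine always falls in [0, 1].

```python
# Sketch: cosine of two non-negative TF*IDF weight vectors is
# intrinsically bounded between 0 and 1.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

q = [1.0, 0.5, 0.0]   # hypothetical query weights
d = [0.2, 0.9, 3.0]   # hypothetical document weights
score = cosine(q, d)
```

This is the "strictly positive sector" argument: with no negative components, the angle between the vectors cannot exceed 90 degrees.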
On Wednesday 15 December 2004 19:29, Mike Snare wrote:
In my case, the words are keywords that must remain as is, searchable
with the hyphen in place. It was easy enough to modify the tokenizer
to do what I need, so I'm not really asking for help there. I'm
really just curious as to why it is
: I question whether such scores are more meaningful. Yes, such scores
: would be guaranteed to be between zero and one, but would 0.8 really be
: meaningful? I don't think so. Do you have pointers to research which
: demonstrates this? E.g., when such a scoring method is used, that
:
a-1 is considered a typical product name that needs to remain unchanged
(there's a comment in the source that mentions this). Indexing
hyphen-word as two tokens has the advantage that it can then be found
with the following queries:
hyphen-word (will be turned into a phrase query internally)
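A toy sketch (not StandardTokenizer itself; the helper names are made up) of why splitting "hyphen-word" into two adjacent tokens still lets the internally generated phrase query match it:

```python
# Sketch: index-time and query-time both split on the hyphen, so the
# query tokens line up consecutively with the document tokens.
import re

def tokenize(text):
    """Split on anything that is not a letter or digit, so a hyphen
    yields two adjacent tokens."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

def phrase_match(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur consecutively in doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

doc = tokenize("the hyphen-word appears here")
query = tokenize("hyphen-word")   # becomes ["hyphen", "word"]
```

Note this also illustrates Mike's later objection in the thread: a document containing the unhyphenated phrase "hyphen word" matches exactly the same way.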
There is one case that I can think of where this 'constant' scoring
would be useful, and I think Chuck already mentioned this 1-2 months
ago. For instance, having such scores would allow one to create alert
applications where queries run by some scheduler would trigger an alert
whenever the score
On Dec 15, 2004, at 3:14 PM, Mike Snare wrote:
[...]
In addition, why do we assume that a-1 is a typical product name but
a-b isn't?
I am in no way second-guessing or suggesting a change; it just doesn't
make sense to me, and I'm trying to understand. It is very likely, as
is oft the case, that
I'll try to address all the comments here.
The normalization I proposed a while back on lucene-dev is specified.
Its properties can be analyzed, so there is no reason to guess about
them.
Re. Hoss's example and analysis, yes, I believe it can be demonstrated
that the proposed normalization would
This is an OS file system error, not a Lucene issue (not for this board).
Google it for Gentoo specifically and you get a whole bunch of results,
one of which is this thread on the Gentoo Forums:
http://forums.gentoo.org/viewtopic.php?t=9620
Good Luck
Nader Henein
Karthik N S wrote:
Hi
Thanks for your answer,
In the Lucene scoring function, only norm_q is used, but for a given
query, norm_q is the same for all documents, so norm_q does not actually
affect the ranking. norm_d is different: each document has its own
norm_d, and it affects the score of document d for query q.
If you drop
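Nhan's point can be put in symbols (the notation follows the thread, not any particular Lucene release, so treat this as a sketch):

```latex
\mathrm{score}(q,d) \;=\; \frac{1}{\mathrm{norm}_q}
  \sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^2,
\qquad
\mathrm{score}_{\cos}(q,d) \;=\; \frac{1}{\mathrm{norm}_q\,\mathrm{norm}_d}
  \sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^2
```

Dividing by norm_q rescales every document's score by the same constant for a fixed query, so it cannot change the ranking; norm_d, as in the pure cosine form on the right, differs per document and therefore does change the relative scores when included or dropped.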
I was just wondering what tools (JLCA?) people are using to port Lucene to
C# as I'd be well interested in converting things like snowball stemmers,
wordnet etc.
Thanks
Garrett
Hi Garrett,
If you are referring to dotLucene
(http://sourceforge.net/projects/dotlucene/) then I can tell you how -- not
too long ago I posted on this list how I ported 1.4 and 1.4.3 to C#, please
search the list for the answer -- you can't just use JLCA.
As for the snowball, I have already
Nhan,
You are correct that dropping the document norm does cause Lucene's scoring
model to deviate from the pure vector space model. However, including norm_d
would cause other problems -- e.g., with short queries, as are typical in
reality, the resulting scores with norm_d would all be
I am writing a tool that uses lucene, and I immediately ran into a
problem searching for words that contain internal hyphens (dashes).
After looking at the StandardTokenizer, I saw that it was because
there is no rule that will match ALPHA P ALPHA or ALPHANUM P
ALPHANUM. Based on what I can tell
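The grammar gap Mike describes can be mimicked with two regex "rule sets" (a toy illustration in Python, not the actual JFlex grammar):

```python
# Sketch: a tokenizer whose only rule is "runs of alphanumerics"
# splits word-P-word at the punctuation P; adding an
# ALPHANUM (P ALPHANUM)* style rule keeps internal hyphens intact.
import re

# Without an ALPHANUM P ALPHANUM rule: the hyphen always splits.
split_rule = re.compile(r"[A-Za-z0-9]+")

# With such a rule: internal hyphens are kept inside one token.
keep_rule = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

text = "searching for hyphen-word here"
without = split_rule.findall(text)
with_rule = keep_rule.findall(text)
```

In real StandardTokenizer the change would go into the tokenizer grammar rather than a regex, but the effect on the token stream is the same.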
Christoph Kiefer wrote:
David, Bruce, Otis,
Thank you all for the quick replies. I looked through the BooksLikeThis
example. I also agree, it's a very good and effective way to find
similar docs in the index. Nevertheless, what I need is really a
similarity matrix holding all TF*IDF values. For
Otis Gospodnetic wrote:
There is one case that I can think of where this 'constant' scoring
would be useful, and I think Chuck already mentioned this 1-2 months
ago. For instance, having such scores would allow one to create alert
applications where queries run by some scheduler would trigger an
Chris Hostetter wrote:
For example, using the current scoring equation, if i do a search for
Doug Cutting and the results/scores i get back are...
1: 0.9
2: 0.3
3: 0.21
4: 0.21
5: 0.1
...then there are at least two meaningful pieces of data I can glean:
On Wednesday 15 December 2004 21:14, Mike Snare wrote:
Also, the phrase query
would place the same value on a doc that simply had the two words as a
doc that had the hyphenated version, wouldn't it? This seems odd.
Not if these words are spelling variations of the same concept, which
doesn't
Hi:
When is Lucene planning on moving toward java 1.4+?
I see there are some problems caused from the current lock file
implementation, e.g. Bug# 32171. The problems would be easily fixed by
using the java.nio.channels.FileLock object.
Thanks
-John