I think the 'create' flag really indicates whether it's ok
to *overwrite* the *possibly existing* index.
Despite the tricky nuance it works great.
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#IndexWriter(org.apache.lucene.store.Directory,%20org.apache.lucene.
I think if you call IndexReader.close() then the deleted item really goes away.
"Serge A. Redchuk" wrote:
> Hello All !
> I see delete method in IndexReader, but when I delete item from
> reader - this item will not be deleted physically.
> So I must rewrite all index after e
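The behavior described in this thread can be modeled in a few lines of plain Java (a toy stand-in, not the Lucene API; the class and field names are invented for illustration): delete() only marks a document, and the mark takes effect when close() flushes it.

```java
import java.util.*;

public class BufferedDeletes {
    // Toy model of the reader behavior described above: delete() only
    // marks a document; the mark is not applied to the "stored" index
    // until close() flushes it.
    private final Set<Integer> pendingDeletes = new HashSet<>();
    private final Set<Integer> stored;          // the "on-disk" index

    public BufferedDeletes(Set<Integer> docs) { this.stored = docs; }

    public void delete(int docNum) { pendingDeletes.add(docNum); }

    public void close() {                       // flush the deletions
        stored.removeAll(pendingDeletes);
        pendingDeletes.clear();
    }

    public static void main(String[] args) {
        Set<Integer> index = new HashSet<>(Arrays.asList(1, 2, 3));
        BufferedDeletes reader = new BufferedDeletes(index);
        reader.delete(2);
        System.out.println(index.contains(2)); // true: not yet applied
        reader.close();
        System.out.println(index.contains(2)); // false: gone after close
    }
}
```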
I don't think there's a book, but if you want to read a good book
on the theory of search engines, data structures to support them
that handle large amounts of data, avoiding i/o and so on I thought
"Managing Gigabytes" was a great book. This is my Associates
link to Amazon:
http://www.amazon.com
There has been some discussion of ZipDirectory:
http://www.google.com/search?q=lucene+ZipDirectory
"Shah, Lokesh" wrote:
> Hi,
> I am a new user, so please be patient with me.
> Is there a way to read index from a jar file, instead of a directory?
> Regards,
> Lokesh
>
--
http://www.tro
Samir Satam wrote:
Thanks for your reply.
Maybe I asked the wrong question.
Let's say,
just:
the number of documents indexed (no. of Document objects in the index)
AND
the maximum index size one has had so far, regardless of the no. of Document
objects (to determine the max index size one is working with).
I've written what I'd like to donate as example code to the project.
I'm not on the list to have CVS write permissions, so if one of the power
users agrees then please put this into the sandbox.
This code indexes the mail in an IMAP message store.
By default it reads all email from an IMAP server a
Is it possible that there's some combo of:
- the index of your data set being small relative to the Solaris disk
cache/RAM
- stringA being rare
such that it would explain some of your results?
Harry Foxwell wrote:
I have a project for which I want to characterize Lucene query
performance
on di
FYI I tried the textmining.org/poi combo, and on a collection of 350 Word
docs people have developed here over the years, and it failed on 33% of them
with exceptions being thrown about the formats being invalid.
I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free
*.exe, and
it wo
ntiText(
file_name_of_word_file;
Your assistance is greatly appreciated.
Eric Anderson
815-505-6132
Quoting David Spencer <[EMAIL PROTECTED]>:
FYI I tried the textmining.org/poi combo, and on a collection of 350 Word
docs people have developed here over the years, and it failed on 3
u use the library at http://textmining.org.
contrary to what David Spencer says, it should work on all documents created
with Word 97 or above. I have literally indexed 100,000s of unique documents
using my library.
Ryan Ackley
- Original Message -
From: "Eric Anderson" <[EM
(most likely Word 6.0) or its not a word document. If
this isn't the case you need to email me so I can fix it and make it better
for the benefit of everyone. I plan on adding support for Word 6 in the
future.
Ryan Ackley
- Original Message -
From: "David Spencer" <[EM
Rob Outar wrote:
What happens if I add the same name/value pair to a Lucene Document? Does
it override it? Does it append it so you have duplicates?
I believe it 'appends' in the sense that if you add 2 fields with the same
name then the Document has the union of the content of both fields
added
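That append/union behavior can be modeled in a few lines (a toy stand-in, not the real org.apache.lucene.document.Document; the class name here is invented):

```java
import java.util.*;

public class MultiValueDoc {
    // Toy stand-in for Lucene's Document: add() never overwrites,
    // it appends, so one field name can carry several values.
    private final List<Map.Entry<String, String>> fields = new ArrayList<>();

    public void add(String name, String value) {
        fields.add(new AbstractMap.SimpleEntry<>(name, value));
    }

    // All values stored under a given field name, in insertion order.
    public List<String> getValues(String name) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> f : fields)
            if (f.getKey().equals(name)) out.add(f.getValue());
        return out;
    }

    public static void main(String[] args) {
        MultiValueDoc doc = new MultiValueDoc();
        doc.add("contents", "first chunk");
        doc.add("contents", "second chunk");
        System.out.println(doc.getValues("contents")); // both survive
    }
}
```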
jcrowell wrote:
Thanks for responding. Are you referring to the solution under the title:
"How do I retrieve all the values of a particular field that exists within
an index, across all documents" ?
Here's some code that might do what you want.
It shows the frequency of each term also.
Args ar
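In the same spirit, a standalone sketch of tallying term frequencies (plain Java with naive whitespace tokenization, no Lucene; it only mirrors the shape of walking the terms and printing a count for each):

```java
import java.util.*;

public class TermFreq {
    // Count how often each token occurs, returned in sorted term order,
    // analogous in spirit to walking a TermEnum over an index.
    public static SortedMap<String, Integer> count(String text) {
        SortedMap<String, Integer> freq = new TreeMap<>();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (tok.isEmpty()) continue;   // skip artifacts of leading spaces
            freq.merge(tok, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(count("the quick fox the lazy fox"));
    }
}
```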
John Cwikla wrote:
Depends what "similar" means.
If by similar, you mean they contain a lot of the same words/phrases, then
you can probably use
a query (although the number you can have is limited to 32 or 64 I think)
and get documents
using lucene.
I have a demo site that does this.
I thought I
I've seen discussions about using the double metaphone algorithm with
Lucene (basically: like soundex, used
to find words that sound similar in English at least) but couldn't find
an implementation, so I spent
a few minutes and wrote a Query and TermEnum object for this. I may have
missed the pr
so in the next
few days hopefully.
Erik
On Friday, December 19, 2003, at 02:51 PM, David Spencer wrote:
I've seen discussions about using the double metaphone algorithm
with Lucene (basically: like soundex, used
to find words that sound similar in English at least) but couldn'
Vipul Sagare wrote:
Lucene docs, FAQs and other research indicates
Note: Leading wildcards (e.g. *ook) are not supported.
Is there any work around for implementation of such feature (if one has
to implement)?
I've written a PrefixQuery and it's not hard to do -I can post it too.
Proble
Doug Cutting wrote:
Karl Koch wrote:
Do you know good papers about strategies of how
to select keywords effectively beyond the scope of stopword lists and
stemming?
Using term frequencies of the document is not really possible since
Lucene
is not providing access to a document vector, is it?
Dubious that they do..
in Integer I don't know for sure.
Otis
--- David Spencer <[EMAIL PROTECTED]> wrote:
Doug Cutting wrote:
Karl Koch wrote:
Do you know good papers about strategies of how
to sele
Kristian Hermsdorf wrote:
Hi
I've written a PrefixQuery and it's not hard to do -I can post it too.
Problem is that it is not integrated into the query parser (.jj) so odds
are no one will use it, and the general sentiment on this list (and
lucene-dev)
is that prefix queries are evil because it'
I have code that does just this.
The calls to "DFields.*" should be replaced with the appropriate String e.g.
"title", "url" etc.
A bit of boosting is done too under the heuristic that a title match is
better
than a body match.
Only hassle is this is not integrated into the query parser but it's easy
Eric Jain wrote:
- Support for PowerPoint documents
May I ask how you extract text from PowerPoint documents? Any open
source tool, or your own code?
FYI I recently discovered "ppthtml" in this package:
http://chicago.sourceforge.net/xlhtml/
Also "antiword" seems to work well for word do
Doug Cutting wrote:
David Spencer wrote:
Code rewritten, automagically chooses lots of defaults, lets you
override
the defs thru the static vars at the bottom or the non-static vars
also at the bottom.
Has anyone used this? Was it useful?
I've put it up on my "demo" site (
[EMAIL PROTECTED] wrote:
Here's the results of some tests using David's "more like.." class.
http://home.clara.net/markharwood/lucene/mlt.htm
Looks useful.
Thanks for testing.
I have a couple of suggestions in the review.
Your text copied here and my comments:
> Overall, a pretty useful cl
Doug Cutting wrote:
David Spencer wrote:
2 files attached, SubstringQuery (which you'll use) and
SubstringTermEnum ( used by the former to be
consistent w/ other Query code).
I find this kind of query useful to have and think that the query
parser should allow it in spite of the percepti
Bruce Ritchie wrote:
David Spencer wrote:
[c] "interesting words" - uses code from MoreLikeThis to give a table
of all interesting
words in the current "source" doc ordered by score.
Remember score is idf*tf as per Doug's mail (and as per my
hopefully correct understanding of
Bruce Ritchie wrote:
David Spencer wrote:
Code rewritten, automagically chooses lots of defaults, lets you
override
the defs thru the static vars at the bottom or the non-static vars
also at the bottom.
I've taken the liberty to update this code to handle multiple fields
and use th
Matt Quail wrote:
Hi all,
Is there any way to iterate through a TermEnum backwards? Okay, I know
that there isn't a way to do this via the TermEnum class, but is it
"implementable" on top of the underlying Lucene datastore?
My particular problem is this:
I have an index of documents, each docume
Parminder Singh wrote:
I've a CMS application that deploys metadata to a database. Is it possible to use lucene to search this database instead of its (Lucene's) index? If you could tell me the steps that would be involved in doing this, it'd be great help. I'm new to Lucene.
I've done this e
Out of curiosity - does anyone use a Filter based on string (token)
length. Use case is, say, you're indexing email msgs and if an
attachment is uuencoded into lines of 60 or whatever characters then you
don't want to index tokens that are so long as they can't possibly be
of use later and jus
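The rule described above can be written as a simple predicate (plain Java; the cutoffs are arbitrary examples, and in Lucene proper this logic would sit inside a TokenFilter):

```java
public class LengthFilterRule {
    // Tokens shorter than min or longer than max are unlikely to be
    // useful query terms (e.g. lines of a uuencoded attachment), so
    // the filter would simply drop them during analysis.
    public static boolean keep(String token, int min, int max) {
        return token.length() >= min && token.length() <= max;
    }

    public static void main(String[] args) {
        // A uuencoded-style line of ~60 junk characters.
        String uuLine = "MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM";
        System.out.println(keep("lucene", 2, 50)); // true
        System.out.println(keep(uuLine, 2, 50));   // false: too long
        System.out.println(keep("a", 2, 50));      // false: too short
    }
}
```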
Maybe I missed something but I always thought the stop list should be a
Set, not a Map (or Hashtable/Dictionary). After all, all you need to
know is existence and that's what a Set does.
Doug Cutting wrote:
Erik Hatcher wrote:
Well, one issue you didn't consider is changing a public method
si
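A minimal illustration of the point (plain Java; the stop words chosen here are arbitrary examples): membership is all a stop list needs, which is exactly what Set.contains() provides.

```java
import java.util.*;

public class StopSet {
    // A stop list only needs to answer "is this word present?",
    // i.e. existence, so a Set fits better than a Map/Hashtable.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of"));

    public static boolean isStopWord(String token) {
        return STOP_WORDS.contains(token.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("The"));    // true
        System.out.println(isStopWord("lucene")); // false
    }
}
```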
SubstringQuery, my humble contribution.
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06388.html
Tomcat Programmer wrote:
I have a situation where I need to be able to find
incomplete word matches, for example a search for the
string 'ape' would return matches for 'grapes'
'naples' 'staples'
Karl Koch wrote:
If I create a standard index, what does Lucene store in this index?
What should be stored in an index at least? Just a link to the file and
keywords? Or also wordnumbers? What else?
Does somebody know a paper which discusses this problem of "what to put in
a good universal IR i
Otis Gospodnetic wrote:
Sure.
On click, get document Id (not internal docId, but something you use as
a surrogate primary key) of the clicked document. Retrieve the
document. Pull out the value of 'clickCount' field. +1 it. Delete
the document, and re-add it (there is no 'update(Document)' met
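Those steps can be modeled on a toy in-memory store (none of this is Lucene API; it just shows the fetch, delete, +1, re-add shape, since there is no in-place update):

```java
import java.util.*;

public class ClickCountUpdate {
    // Toy store keyed by a surrogate primary key. Like the index
    // described above, it has no update: "update" = delete + re-add.
    static Map<String, Map<String, String>> index = new HashMap<>();

    static void recordClick(String docId) {
        Map<String, String> doc = index.remove(docId);      // delete
        int clicks = Integer.parseInt(doc.get("clickCount"));
        doc.put("clickCount", String.valueOf(clicks + 1));  // +1 it
        index.put(docId, doc);                              // re-add
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("clickCount", "0");
        index.put("doc-1", doc);
        recordClick("doc-1");
        System.out.println(index.get("doc-1").get("clickCount")); // 1
    }
}
```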
This reminds me - if you have a search engine that indexes a mail store
and you present results in a web page to a browser, you want to (of
course...well I think this is obvious) send back a URL that would cause
the users native mail client to pull up the msg.
IMAP has a URL format, and I use M
Haven't seen this discussed here.
See 7a at the link below:
http://www.asktog.com/columns/062top10ReasonsToNotShop.html
7a talks about searching on a camera site for the "Lowepro 100 AW".
He says this query works: "Lowepro 100 AW"
and this query does not work: "Lowepro 100AW"
Cross checking
Scott Sayles wrote:
Is there anyone out there that has page ranking implemented on top of
Lucene?
I recently discovered JUNG which has 2 impls of PageRank:
http://jung.sourceforge.net/api/1.4.1/edu/uci/ics/jung/algorithms/importance/PageRank.html
I did a test of hooking it up to my spider and ca
xuemei li wrote:
Hi,all,
see this:
http://wiki.apache.org/jakarta-lucene/UpdatingAnIndex
Can we do search and update one index simultaneously? Does anyone know
anything about it? I have done some experiments. Now the search is blocked
when the index is being updated. The error on the search node is like
Erik Hatcher wrote:
On Jun 1, 2004, at 4:41 PM, uddam chukmol wrote:
Well, a question again, how does Lucene compute the score between a
document and a query?
And I might add, thus, this approach to similarity gives more weight to
rare terms that match, which one might want for this kind of sim
Terry Steichen wrote:
Erik,
Could you expand on this just a wee bit, perhaps with an example of how to
compute this vector angle?
I'm tempted to write the code to see how it works, but FYI this doc
seems to nicely explain the concepts:
http://www.la2600.org/talks/files/20040102/Vector_Space_Searc
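A minimal sketch of the vector-angle computation over term-frequency maps (plain Java; this is the textbook cosine formula from the linked doc, not Lucene's actual Similarity):

```java
import java.util.*;

public class CosineSim {
    // Cosine of the angle between two term-frequency vectors:
    // dot(a, b) / (|a| * |b|). 1.0 = same direction, 0.0 = no overlap.
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Integer bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * bv;   // shared terms only
        }
        for (int v : b.values()) nb += v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = Map.of("lucene", 2, "query", 1);
        Map<String, Integer> d2 = Map.of("lucene", 1, "index", 3);
        System.out.println(cosine(d1, d2));
    }
}
```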
org.apache.lucene.analysis.Token t;
while ((t = ts.next()) != null)
{
    sb.append(t.termText() + " ");
}
return QueryParser.parse(sb.toString(), DFields.CONTENTS, a);
}
David Spencer <[EMAIL PROTECTED
Does anyone have any experiences with giving a bonus for exactly
matching case in queries?
One use case is in the java world maybe I want to see references to
"Map" (java.util.Map) but am not interested in a (geographical) "map".
I believe, in the context of Lucene, one way is to have an Analy
Using 1.4rc3.
Running an app that indexes 50k documents (thus it just uses an
IndexWriter).
One field has that boolean set for it to have a term vector stored for
it, while the other 11 fields don't.
On stdout I see "No tvx file" 13 times.
Glancing thru the src it seems this comes from TermVectorRea
Does it ever make sense to set the Similarity obj in either (only one
of..) IndexWriter or IndexSearcher? i.e. If I set it in IndexWriter can
I avoid setting it in IndexSearcher? Also, can I avoid setting it in
IndexWriter and only set it in IndexSearcher? I noticed Nutch sets it in
both place
Erik Hatcher wrote:
On Jun 9, 2004, at 8:53 AM, Terry Steichen wrote:
3) Is there a plan for adding QueryParser support for the SpanQuery
family?
Another important facet to Terry's question here is what syntax to use
to express all various types of queries? I suspect that Google stats
And other
Erik Hatcher wrote:
On Jun 9, 2004, at 12:21 PM, David Spencer wrote:
show us that most folks query with 1 - 3 words and do not use
any of the advanced features.
But with automagic query expansion these things might be done behind
the scenes. Nutch, for one, expands simple queries to check
I've run across an amusing interaction between advanced
Analyzers/TokenStreams and the very useful "term highlighter":
http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/highlighter/
I have a custom Analyzer I'm using to index javadoc-generated web pages.
The Analyzer in turn has
[EMAIL PROTECTED] wrote:
Yes, this issue has come up before with other choices of analyzers.
I think it should be fixable without changing any of the highlighter APIs
- can you email me or post here the source to your analyzer?
Code attached - don't make fun of it please :) - very prelim. I thi
Erik Hatcher wrote:
On Jun 19, 2004, at 2:29 AM, David Spencer wrote:
A naive analyzer would turn something like "SyncThreadPool" into one
token. Mine uses the great Lucene capability of Tokens being able to
have a "0" position increment to turn it into the token strea
[EMAIL PROTECTED] wrote:
I think this version of the highlighter should provide a fix: http://www.inperspective.com/lucene/hilite2beta.zip
Before I update the version of the highlighter in the sandbox I'd appreciate feedback from those troubled
with the issues to do with overlapping tokens in toke
Otis Gospodnetic wrote:
Hello William,
Lucene does not have a categorization engine, but you may want to look
at Carrot2 (http://sourceforge.net/projects/carrot2/)
May be getting off topic - but maybe not..I can't find an example of how
to use Carrot2. It builds easily enough, but there's no obvious
Engine
and com.dawidweiss.carrot.filter.stc.Processor is a class that drives this.
Lucene hook - hey - I'm trying to integrate the two. I think this is how
it would be done, get search results from Lucene then set up STCEngine a
la how Processor does.
Thx,
william.
From: David Spencer <[EMAIL PROTECTED]&
Hetan Shah wrote:
Hello,
You guys have been great! I read lots of threads and am learning a lot
about Lucene.
Can any one point me to right direction or show me a code sample where
I can build queries for 'any word', 'all words' and 'phrase'. I tried
to look on the Lucene FAQ but I did not under
I've put together a kind of experimental site which indexes the javadoc
of OSS java projects (well, plus the JDK).
http://www.searchmorph.com/
This is meant to solve the problem where a java developer knows
something has been done before, but where, in what project - source
forge? jakarta? ecli
Alex McManus wrote:
Hi,
we are at the initial design stages of a public-facing web-based search
application for a U.S. Federal Agency. We have proposed a clustered Lucene
architecture as the best technical solution, as we feel their current system
(based on Oracle) won't give the best performance
Inspired by these guys who put results from Google into a treemap...
http://google.hivegroup.com/
I did up my own version running against my index of OSS/javadoc trees.
This query for "thread pool" shows it off nicely:
http://www.searchmorph.com/kat/tsearch.jsp?s=thread%20pool&side=300&goal=500
Thi
Stefan Groschupf wrote:
Possibly a silly question - but how would I go about searching multiple
indexes using lucene? Do I need to basically repeat the code I use to
search one index for each one, or is there a better way to do it?
Take a look at the nutch.org source code. It does what you are sea
Stefan Groschupf wrote:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/
MultiSearcher.html
100% Right.
I personally found code samples more interesting than just the javadoc.
Good point.
That's why my hint; here's the code snippet from Nutch:
But - warning - in normal use of Lucene you
-
for my site I do want to convert the custom spider/cache to use Nutch...
Do you know:
http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html ?
Interesting - is there any code avail to draw the maps?
thx,
Dave
Cheers,
Stefan
Am 01.07.2004 um 23:28 schrieb David Spencer:
Inspired by
This in theory should not help, but anyway, just in case, the idea is to
call gc() periodically to "force" gc - this is the code I use which
tries to force it...
public static long gc()
{
    long bef = mem();
    System.gc();
    sleep( 100);
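For reference, a self-contained version of that idiom; the mem() and sleep() helpers in the snippet above are assumed to be small wrappers like the ones below.

```java
public class ForceGc {
    // How many bytes the JVM currently has in use.
    static long mem() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // "Force" a collection: System.gc() is only a hint, so sleep
    // briefly and repeat a few times to give the collector a chance
    // to actually run.
    public static long gc() {
        long before = mem();
        for (int i = 0; i < 3; i++) {
            System.gc();
            try { Thread.sleep(100); } catch (InterruptedException e) { break; }
        }
        return before - mem(); // approximate bytes reclaimed (may be <= 0)
    }

    public static void main(String[] args) {
        System.out.println("freed ~" + gc() + " bytes");
    }
}
```

As the message says, in theory this should not help; it is only a diagnostic nudge, not something to rely on.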
Hetan Shah wrote:
My search results are only displaying the top portion of the indexed
documents. It does match the query in the later part of the document.
Where should I look to change the code in demo3 of default 1.3 final
distribution. In general if I want to show the block of document that
Wermus Fernando wrote:
Luceners,
My app is creating, updating and deleting from the index, and searching
too. I need some information about sorting by a field. Could anyone
send me a link related to sorting?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Sort.html
Thank
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#close()
What is the intent of IndexSearcher.close()?
I want to know how, in a web app, one can stop a search that's in
progress - use case is a user is limited to one search at a time, and
when one (expensive)
Anne Y. Zhang wrote:
Hi, I am assisting a professor with an IR course.
We need to provide the students with a full-functioned
search engine package, and the professor prefers it
being powered by Lucene. Since I am new to Lucene,
can anyone provide me some information on where
I can get the packag
Aad Nales wrote:
Hi All,
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to be to create
Andrzej Bialecki wrote:
David Spencer wrote:
I can/should send the code out. The logic is that for any terms in a
query that have zero matches, go thru all the terms(!) and calculate
the Levenshtein string distance, and return the best matches. A more
intelligent way of doing this is to instead
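A sketch of that brute-force logic (plain Java: the classic dynamic-programming Levenshtein distance plus a linear scan over the term list; the method names are invented for illustration):

```java
import java.util.*;

public class SpellSuggest {
    // Classic dynamic-programming Levenshtein edit distance.
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Brute force over all terms, as the message describes: return the
    // known term closest to the misspelled input.
    public static String suggest(String misspelled, Collection<String> terms) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String t : terms) {
            int dist = distance(misspelled, t);
            if (dist < bestDist) { bestDist = dist; best = t; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("recursive", "descent", "parser");
        System.out.println(suggest("recursize", terms)); // recursive
    }
}
```

Scanning every term is exactly the cost the message flags with "(!)"; the n-gram lookup index discussed later in the thread is the smarter alternative.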
Honey George wrote:
Hi,
I know some of them.
1. PDF
+ http://www.pdfbox.org/
+ http://www.foolabs.com/xpdf/download.html
- I am using this and found it good. It even supports
My dated experience from 2 years ago was that (the evil, native code)
foolabs pdf parser was the best, but obviously t
Doug Cutting wrote:
Aad Nales wrote:
Before I start reinventing wheels I would like to do a short check to
see if anybody else has already tried this. A customer has requested us
to look into the possibility to perform a spell check on queries. So far
the most promising way of doing this seems to b
eks dev wrote:
Hi Doug,
Perhaps. Are folks really better at spelling the
beginning of words?
Yes they are. There were some comprehensive empirical
studies on this topic. Winkler modification on Jaro
string distance is based on this assumption (boosting
similarity if first n, I think 4, chars mat
Doug Cutting wrote:
David Spencer wrote:
Doug Cutting wrote:
And one should not try correction at all for terms which occur in a
large proportion of the collection.
I keep thinking over this one and I don't understand it. If a user
misspells a word and the "did you mean" spel
Jiří Kuhn wrote:
Thanks for the bug's id, it seems like my problem, and I have stand-alone code with
main().
What about a slow garbage collector? That looks to me like a wrong suggestion.
I've seen this written up before (javaworld?) as a way to probably
"force" GC instead of just a System.gc() call
ion of my code. I believe that the code should
run endlessly (I have said it before: in version 1.4 final it does).
Jiri.
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Monday, September 13, 2004 5:34 PM
To: Lucene Users List
Subject: force gc idiom - Re: OutOfMemor
David Spencer wrote:
Just noticed something else suspicious.
FieldSortedHitQueue has a field called Comparators and it seems like
things are never removed from it
Replying to my own post... this could be the problem.
If I put in a print statement here in FieldSortedHitQueue, recompile,
and
ay be causing this leak.
Daniel Naber wrote:
On Monday 13 September 2004 15:06, Jiří Kuhn wrote:
I think I can reproduce memory leaking problem while reopening
an index. Lucene version tested is 1.4.1, version 1.4 final works OK. My
JVM is:
Could you try with the latest Lucene version from CVS? I cannot reproduce
Honey George wrote:
Hi,
This might be more of a question related to the
PorterStemmer algorithm rather than to Lucene, but
if anyone has the knowledge please share.
You might want to also try the Snowball stemmer:
http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/
And KStem:
http://c
Tate Avery wrote:
I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp
How embarrassing!
Sorry!
Fixed!
T
-Original Message-
From: David Spencer [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 14, 2004 3:23 PM
To: Lucene Users
Andrzej Bialecki wrote:
David Spencer wrote:
...or prepare in advance a fast lookup index - split all existing
terms to bi- or trigrams, create a separate lookup index, and then
simply for each term ask a phrase query (phrase = all n-grams from
an input term), with a slop > 0, to get simi
Doug Cutting wrote:
David Spencer wrote:
[1] The user enters a query like:
recursize descent parser
[2] The search code parses this and sees that the 1st word is not a
term in the index, but the next 2 are. So it ignores the last 2 terms
("recursive" and "descent") and s
Aad Nales wrote:
By trying: if you type const you will find that it returns 216 hits. The
third sports 'const' as a term (space separated and all). I would expect
'conts' to return with const as well. But again I might be mistaken. I
am now trying to figure what the problem might be:
1. my expect
Andrzej Bialecki wrote:
Aad Nales wrote:
David,
Perhaps I misunderstand something so please correct me if I do. I used
http://www.searchmorph.com/kat/spell.jsp to look for conts without
changing any of the default values. What I got as results did not
include 'const' which has quite a high frequenc
Andrzej Bialecki wrote:
David Spencer wrote:
To restate the question for a second.
The misspelled word is: "conts".
The suggestion expected is "const", which seems reasonable enough as
it's just a transposition away, thus the string distance is low.
But - I guess the pr
Crump, Michael wrote:
You have to close the IndexReader after doing the delete, before opening the
IndexWriter for the addition. See information at this link:
http://wiki.apache.org/jakarta-lucene/UpdatingAnIndex
Recently I thought I observed that if I use this batch update idiom (1st
delete the
Morus Walter wrote:
Hi David,
Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2
phases. First you build a "fast lookup index" as mentioned above. Then
to correct a word you do a query in this index based on the ngrams in
the misspelled word.
Let's see.
[1] Source is attached
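Phase 1's decomposition of a term into its letter n-grams can be sketched as follows (plain Java; the method name is invented, not taken from the attached source):

```java
import java.util.*;

public class NGrams {
    // Phase 1 of the n-gram speller: split a term into its
    // overlapping letter n-grams, which become the indexed tokens.
    public static List<String> ngrams(String term, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++)
            out.add(term.substring(i, i + n));
        return out;
    }

    public static void main(String[] args) {
        // A misspelling shares many n-grams with the intended word,
        // so an OR query over these tokens finds close candidates.
        System.out.println(ngrams("const", 3)); // [con, ons, nst]
        System.out.println(ngrams("conts", 3)); // [con, ont, nts]
    }
}
```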
[EMAIL PROTECTED] wrote:
Hello,
I can successfully index and search the PDF documents, however I am not
able to highlight the searched text in my original PDF file (ie: like
dtSearch
highlights on original file)
I took a look at the highlighter in sandbox, compiled it and have it
ready. I am wond
sam s wrote:
Hi Folks,
Is there any place where I can do a better search on lucene mailing
archives?
I tried JGuru and it looks like their search is paid.
The Apache-maintained archives lack efficient searching.
Of course one of the ironies is, shouldn't we be able to use Lucene to
search the mailing li
Erik Hatcher wrote:
Have a look at the WordNet contribution in the Lucene sandbox
repository. It could be leveraged for part of a solution.
It's something I contributed.
Relevant links are:
http://jakarta.apache.org/lucene/docs/lucene-sandbox/
http://www.tropo.com/techno/java/lucene/wordnet.html
Suggestions
[a]
Try invoking the VM w/ an option like "-XX:CompileThreshold=100" or even
a smaller number. This encourages the hotspot VM to compile methods
sooner, thus the app will take less time to "warm up".
http://java.sun.com/docs/hotspot/VMOptions.html#additional
You might want to sea
Otis Gospodnetic wrote:
Hm, if you can index 11, you should be able to index 8 as well. In any
case, you most likely want to make sure that your Analyzer is not just
In theory you could have a "length" filter tossing out tokens that are
too short or too long, and maybe you're getting rid of all
ore frequent in
my index than "map" and "tree"...I'm sure "hash java" occurs more
frequently than "hash map" - or any other freq, non-stop word, and it's
dubious that "hash java" is a useful suggestion...
So
if you type fast, it doe
Bruce Ritchie wrote:
Christoph,
I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox
Uh oh, sorry,
petite_abeille wrote:
Well, the subject says it all...
If there is one thing which is overly cumbersome in Lucene, it's
updating documents, therefore this Request For Enhancement:
Please consider enhancing the IndexWriter API to include an
updateDocument(...) method to take care of all the gory
TED]> wrote:
Christoph,
I'm not entirely certain if this is what you want, but a while back
David Spencer did code up a 'More Like This' class which can be used
for generating similarities between documents. I can't seem to find
this class in the sandbox so I've attached i
Bruce Ritchie wrote:
From the code I looked at, those calls don't recalculate on
every call.
I was referring to this fragment below from BooksLikeThis.docsLike(),
and was mentioning it as the javadoc
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/in
dex/TermFreqVector.html
does n
Bruce Ritchie wrote:
You can also see 'Books like this' example from here
https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
Well done, uses a term vector, instead of reparsing the orig
doc, to form the similarity query. Also I like the way you
exclude the source doc in th