Re: existing or not existing

2001-12-04 Thread David Spencer
I think the 'create' flag really indicates whether it's ok to *overwrite* the *possibly*existing* index. Despite the tricky nuance it works great. http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexWriter.html#IndexWriter(org.apache.lucene.store.Directory,%20org.apache.lucene.

Re: continue - delete/update

2001-12-10 Thread David Spencer
I think if you call IndexReader.close() then the deleted item really goes away. "Serge A. Redchuk" wrote: > Hello All ! > I see delete method in IndexReader, but when I delete item from > reader - this item will not be deleted physically. > So I must rewrite all index after e

Re: Lucene Book

2002-05-04 Thread David Spencer
I don't think there's a book, but if you want to read a good book on the theory of search engines, data structures to support them that handle large amounts of data, avoiding i/o and so on I thought "Managing Gigabytes" was a great book. This is my associates link to Amazon: http://www.amazon.com

Re: Reading index from a jar file

2002-06-18 Thread David Spencer
There has been some discussion of ZipDirectory: http://www.google.com/search?q=lucene+ZipDirectory "Shah, Lokesh" wrote: > Hi, > I am a new user, so pleas be patient with me. > Is there a way to read index from a jar file, instead of a directory? > Regards, > Lokesh > -- http://www.tro

Re: MAX Index Size POLL

2003-02-27 Thread David Spencer
Samir Satam wrote: Thanks for ur reply. Maybe i asked the wrong question. Lets Say Just, Number of documents indexed. (No. of Document objects in the index) AND The index size one has had yet. Regardless of the no of document objects. (To determine one the max index size one is working with.

[ANN] Sample code to index an IMAP message store

2003-02-28 Thread David Spencer
I've written what I'd like to donate as example code to the project. I'm not on the list to have CVS write permissions, so if one of the power users agrees then please put this into the sandbox. This code indexes the mail in an IMAP message store. By default it reads all email from an IMAP server a

Re: lucene performance question

2003-03-03 Thread David Spencer
Is it possible that there's some combo of: - the index of your data set being small relative to the Solaris disk cache/RAM - stringA being rare such that it would explain some of your results? Harry Foxwell wrote: I have a project for which I want to characterize Lucene query performance on di

my experiences - Re: Parsing Word Docs

2003-03-05 Thread David Spencer
FYI I tried the textmining.org/poi combo and on a collection of 350 word docs people have developed here over the years, and it failed on 33% of them with exceptions being thrown about the formats being invalid. I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free *.exe, and it wo

Re: my experiences - Re: Parsing Word Docs

2003-03-06 Thread David Spencer
ntiText( file_name_of_word_file; Your assistance is greatly appreciated. Eric Anderson 815-505-6132 Quoting David Spencer <[EMAIL PROTECTED]>: FYI I tried the textmining.org/poi combo and on a collection of 350 word docs people have developed here over the years, and it failed on 3

Re: my experiences - Re: Parsing Word Docs

2003-03-06 Thread David Spencer
u use the library at http://textmining.org. contrary to what David Spencer says, it should work on all documents created with Word 97 or above. I have literally indexed 100,000s of unique documents using my library. Ryan Ackley - Original Message - From: "Eric Anderson" <[EM

Re: my experiences - Re: Parsing Word Docs

2003-03-06 Thread David Spencer
(most likely Word 6.0) or its not a word document. If this isn't the case you need to email me so I can fix it and make it better for the benefit of everyone. I plan on adding support for Word 6 in the future. Ryan Ackley - Original Message - From: "David Spencer" <[EM

Re: Quick Question On Adding Fields

2003-03-20 Thread David Spencer
Rob Outar wrote: What happens if I add the same name/value pair to a Lucene Document? Does it override it? Does it append it so you have duplicates? I believe it 'appends' in the sense that if you add 2 fields with the same name then the Document has the union of the content of both fields added

Re: I need a list of the indexed words

2003-04-01 Thread David Spencer
jcrowell wrote: Thanks for responding. Are you referring to the solution under the title: "How do I retrieve all the values of a particular field that exists within an index, across all documents" ? Here's some code that might do what you want. It shows the frequency of each term also. Args ar

Re: Find Documents 'Similar' to Another

2003-05-30 Thread David Spencer
John Cwikla wrote: Depends what "similar" means. If by similar, you mean they contain alot of the same words/phrases, then you can probably use a query (although the number you can have is limited to 32 or 64 I think) and get documents using lucene. I have a demo site that does this. I thought I

DoubleMetaphoneQuery

2003-12-19 Thread David Spencer
I've seen discussions about using the double metaphone algorithm with Lucene (basically: like soundex, used to find words that sound similar in English at least) but couldn't find an implementation, so I spent a few minutes and wrote a Query and TermEnum object for this. I may have missed the pr

Re: DoubleMetaphoneQuery

2003-12-21 Thread David Spencer
so in the next few days hopefully. Erik On Friday, December 19, 2003, at 02:51 PM, David Spencer wrote: I've seen discussions about using the double metaphone algorithm with Lucene (basically: like soundex, used to find words that sound similar in English at least) but couldn'

Re: Leading Wild Card Search

2004-02-11 Thread David Spencer
Vipul Sagare wrote: Lucene docs, FAQs and other research indicates Note: Leading wildcards (e.g. *ook) are not supported. Is there any work around for implementation of such feature (if one has to implement)? I've written a PrefixQuery and it's not hard to do -I can post it too. Proble

code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

2004-02-11 Thread David Spencer
Doug Cutting wrote: Karl Koch wrote: Do you know good papers about strategies of how to select keywords effectivly beyond the scope of stopword lists and stemming? Using term frequencies of the document is not really possible since lucene is not providing access to a document vector, isn't it?

MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

2004-02-12 Thread David Spencer
Dubious that they do.. in Integer I don't know for sure. Otis --- David Spencer <[EMAIL PROTECTED]> wrote: Doug Cutting wrote: Karl Koch wrote: Do you know good papers about strategies of how to sele

SubstringQuery -- Re: Leading Wild Card Search

2004-02-12 Thread David Spencer
Kristian Hermsdorf wrote: Hi I've written a PrefixQuery and it's not hard to do -I can post it too. Problem is that it is not integrated into the query parser (.jj) so odds are noone will use it, and the general sentiment on this list (and lucene-dev) is that prefix queries are evil because it'

Re: a search like Google

2004-02-12 Thread David Spencer
I have code that does just this. The calls to "DFields.*" should be replaced with the approp String e.g. "title", "url" etc. A bit of boosting is done too under the heuristic that a title match is better than a body match. Only hassle is this is not integrated into the query parser but it's easy

ppt text extraction - Re: SearchBlox J2EE Search Component Version 1.2 released

2004-02-17 Thread David Spencer
Eric Jain wrote: - Support for PowerPoint documents May I ask how you extract text from PowerPoint documents? Any open source tool, or your own code? FYI I recently discovered "ppthtml" in this package: http://chicago.sourceforge.net/xlhtml/ Also "antiword" seems to work well for word do

Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

2004-02-18 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: Code rewritten, automagically chooses lots of defaults, lets you override the defs thru the static vars at the bottom or the non-static vars also at the bottom. Has anyone used this? Was it useful? I've put it up on my "demo" site (

Re: MoreLikeThis Query generator - Re: code for "more like this"

2004-02-18 Thread David Spencer
[EMAIL PROTECTED] wrote: Here's the results of some tests using David's "more like.." class. http://home.clara.net/markharwood/lucene/mlt.htm Looks useful. Thanks for testing. I have a couple of suggestions in the review. Your text copied here and my comments: > Overall, a pretty useful cl

Re: SubstringQuery -- Re: Leading Wild Card Search

2004-02-18 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: 2 files attached, SubstringQuery (which you'll use) and SubstringTermEnum ( used by the former to be consistent w/ other Query code). I find this kind of query useful to have and think that the query parser should allow it in spite of the percepti

Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

2004-02-18 Thread David Spencer
Bruce Ritchie wrote: David Spencer wrote: [c] "interesting words" - uses code from MoreLikeThis to give a table of all interesting words in the current "source" doc ordered by score. Remember score is idf*tf as per Dougs mail (and as per my hopefully correct understanding of
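
The "score is idf*tf" remark can be illustrated with a toy calculation. This is a minimal sketch and not the MoreLikeThis code from the thread: the class name, the method names, and the exact idf formula (log(numDocs / (docFreq + 1)) + 1) are assumptions chosen for illustration.

```java
/** Toy tf*idf weighting; an illustration only, not code from the thread. */
final class TfIdfSketch {

    /** Assumed idf: grows as the term appears in fewer documents. */
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** Term score: raw frequency in the source document times idf. */
    static double score(int termFreqInDoc, int docFreq, int numDocs) {
        return termFreqInDoc * idf(docFreq, numDocs);
    }

    public static void main(String[] args) {
        int numDocs = 1000; // hypothetical collection size
        // Same term frequency, different document frequencies.
        System.out.println("rare term:   " + score(3, 5, numDocs));
        System.out.println("common term: " + score(3, 900, numDocs));
    }
}
```

With equal term frequency, the rarer term gets the higher score, which is why such terms tend to rank as "interesting words".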

Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

2004-02-25 Thread David Spencer
Bruce Ritchie wrote: David Spencer wrote: Code rewritten, automagically chooses lots of defaults, lets you override the defs thru the static vars at the bottom or the non-static vars also at the bottom. I've taken the liberty to update this code to handle multiple fields and use th

Re: Iterating TermEnum backwards

2004-02-26 Thread David Spencer
Matt Quail wrote: Hi all, Is there any way to iterate through a TermEnum backwards? Okay, I know that there isn't a way to do this via the TermEnum class, but is it "implementable" on top of the underlying Lucene datastore? My particular problem is this: I have an index of documents, each docume

Re: Database

2004-02-26 Thread David Spencer
Parminder Singh wrote: I've a CMS application that deploys metadata to a database. Is it possible to use lucene to search this database instead of it's (lucene's) index. If you could tell me the steps that would be involved in doing this, it'd be great help. I'm new to Lucene. I've done this e

StrlenFilter contribution and discussion

2004-03-01 Thread David Spencer
Out of curiosity - does anyone use a Filter based on string (token) length. Use case is, say, you're indexing email msgs and if an attachment is uuencoded into lines of 60 or whatever characters then you don't want to index tokens that are so long as they can't possibly be of use later and jus
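
The idea of dropping overly long tokens can be sketched without any Lucene classes. This is not the StrlenFilter from the post; the class name, method name, and the cutoff of 20 characters are hypothetical, and a real version would wrap a Lucene TokenStream instead of a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of a length-based token filter over plain strings (hypothetical names). */
final class TokenLengthFilterSketch {

    /** Keeps only tokens whose length is at most maxLen. */
    static List<String> dropLongTokens(List<String> tokens, int maxLen) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLen) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The second token stands in for a line of uuencoded attachment data.
        List<String> tokens = Arrays.asList(
                "meeting", "M5V5E<W1I;VYS(&%B;W5T('1H92!A='1A8VAM96YT", "agenda");
        System.out.println(dropLongTokens(tokens, 20));
    }
}
```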

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread David Spencer
Maybe I missed something but I always thought the stop list should be a Set, not a Map (or Hashtable/Dictionary). After all, all you need to know is existence and that's what a Set does. Doug Cutting wrote: Erik Hatcher wrote: Well, one issue you didn't consider is changing a public method si
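
The point about a Set being enough for a stop list can be sketched as follows. The class name and the word list are hypothetical; this is not the patch under discussion.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Stop list held in a Set: only membership matters, so no values are stored. */
final class StopSetSketch {

    // Hypothetical stop words, for illustration.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    /** True if the token (case-insensitively) is in the stop list. */
    static boolean isStopWord(String token) {
        return STOP_WORDS.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("The"));    // a stop word
        System.out.println(isStopWord("lucene")); // not a stop word
    }
}
```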

Re: incomplete word match

2004-03-11 Thread David Spencer
SubstringQuery, my humble contribution. http://www.mail-archive.com/[EMAIL PROTECTED]/msg06388.html Tomcat Programmer wrote: I have a situation where I need to be able to find incomplete word matches, for example a search for the string 'ape' would return matches for 'grapes' 'naples' 'staples'

Re: Lucene index - information

2004-03-19 Thread David Spencer
Karl Koch wrote: If I create an standard index, what does Lucene store in this index? What should be stored in an index at least? Just a link to the file and keywords? Or also wordnumbers? What else? Does somebody know a paper which discusses this problem of "what to put in an good universal IR i

Re: Scoring documents by Click Count

2004-05-06 Thread David Spencer
Otis Gospodnetic wrote: Sure. On click, get document Id (not internal docId, but something you use as s surrogate primary key) of the clicked document. Retrieve the document. Pull out the value of 'clickCount' field. +1 it. Delete the document, and re-add it (there is no 'update(Document)' met

now maybe Mozilla/IMAP URLs - Re: StandardTokenizer and e-mail

2004-05-21 Thread David Spencer
This reminds me - if you have a search engine that indexes a mail store and you present results in a web page to a browser, you want to (of course...well I think this is obvious) send back a URL that would cause the users native mail client to pull up the msg. IMAP has a URL format, and I use M

asktog on search problems

2004-05-21 Thread David Spencer
Haven't seen this discussed here. See 7a at the link below: http://www.asktog.com/columns/062top10ReasonsToNotShop.html 7a talks about searching on a camera site for the "Lowepro 100 AW". He says this query works:"Lowepro 100 AW" and this query does not work: "Lowepro 100AW" Cross checking

Re: Page ranking

2004-06-01 Thread David Spencer
Scott Sayles wrote: Is there anyone out there that has page ranking implemented on top of Lucene? I recently discovered JUNG which has 2 impls of PageRank: http://jung.sourceforge.net/api/1.4.1/edu/uci/ics/jung/algorithms/importance/PageRank.html I did a test of hooking it up to my spider and ca

Re: about search and update one index simultaneously

2004-06-01 Thread David Spencer
xuemei li wrote: Hi,all, see this: http://wiki.apache.org/jakarta-lucene/UpdatingAnIndex Can we do search and update one index simultaneously?Is someone know sth about it? I had done some experiments.Now the search will be blocked when the index is being updated.The error in search node is like

Re: similarity of two texts - another question

2004-06-01 Thread David Spencer
Erik Hatcher wrote: On Jun 1, 2004, at 4:41 PM, uddam chukmol wrote: Well, a question again, how does Lucene compute the score between a document and a query? And I might add, thus, this approach to similarity gives more weight to rare terms that match, which one might want for this kind of sim

Re: similarity of two texts

2004-06-02 Thread David Spencer
Terry Steichen wrote: Erik, Could you expand on this just a wee bit, perhaps with an example of how to compute this vector angle? I'm tempted to write the code to see how it works, but FYI this doc seems to nicely explain the concepts: http://www.la2600.org/talks/files/20040102/Vector_Space_Searc

Re: similarity of two texts - another question

2004-06-02 Thread David Spencer
org.apache.lucene.analysis.Token t; while ( (t = ts.next()) != null) { sb.append( t.termText() + " "); } return QueryParser.parse( sb.toString(),DFields.CONTENTS, a); } David Spencer <[EMAIL PROTECTED

bonus for exact case match

2004-06-03 Thread David Spencer
Does anyone have any experiences with giving a bonus for exactly matching case in queries? One use case is in the java world maybe I want to see references to "Map" (java.util.Map) but am not interested in a (geographical) "map". I believe, in the context of Lucene, one way is to have an Analy

"No tvx reader"

2004-06-05 Thread David Spencer
Using 1.4rc3. Running an app that indexes 50k documents (thus it just uses an IndexWriter). One field has that boolean set for it to have a term vector stored for it, while other 11 fields don't. On stdout I see "No tvx file" 13 times. Glancing thru the src it seems this comes from TermVectorRea

Setting Similarity in IndexWriter and IndexSearcher

2004-06-07 Thread David Spencer
Does it ever make sense to set the Similarity obj in either (only one of..) IndexWriter or IndexSearcher? i.e. If I set it in IndexWriter can I avoid setting it in IndexSearcher? Also, can I avoid setting it in IndexWriter and only set it in IndexSearcher? I noticed Nutch sets it in both place

Re: Proximity Searches behavior

2004-06-09 Thread David Spencer
Erik Hatcher wrote: On Jun 9, 2004, at 8:53 AM, Terry Steichen wrote: 3) Is there a plan for adding QueryParser support for the SpanQuery family? Another important facet to Terry's question here is what syntax to use to express all various types of queries? I suspect that Google stats And other

extensible query parser - Re: Proximity Searches behavior

2004-06-09 Thread David Spencer
Erik Hatcher wrote: On Jun 9, 2004, at 12:21 PM, David Spencer wrote: show us that most folks query with 1 - 3 words and do not use the any of the advanced features. But with automagic query expansion these things might be done behind the scenes. Nutch, for one, expands simple queries to check

amusing interaction between advanced tokenizers and highlighter package

2004-06-18 Thread David Spencer
I've run across an amusing interaction between advanced Analyzers/TokenStreams and the very useful "term highlighter": http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/highlighter/ I have a custom Analyzer I'm using to index javadoc-generated web pages. The Analyzer in turn has

Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
[EMAIL PROTECTED] wrote: Yes, this issue has come up before with other choices of analyzers. I think it should be fixable without changing any of the highlighter APIs - can you email me or post here the source to your analyzer? Code attached - don't make fun of it please :) - very prelim. I thi

Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
Erik Hatcher wrote: On Jun 19, 2004, at 2:29 AM, David Spencer wrote: A naive analyzer would turn something like "SyncThreadPool" into one token. Mine uses the great Lucene capability of Tokens being able to have a "0" position increment to turn it into the token strea
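
Since the message is cut off before it shows the resulting token stream, the following is only a guess at what such an expansion could look like. It emits the whole identifier with position increment 1, then each camel-case part with increment 0 so the parts share the identifier's position. The class name and the "text+increment" output format are assumptions, and no Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: expand a camel-case identifier into overlapping tokens (hypothetical format). */
final class CamelCaseExpandSketch {

    /** Returns "text+increment" strings: the full identifier first, then its parts. */
    static List<String> expand(String identifier) {
        List<String> out = new ArrayList<>();
        out.add(identifier + "+1"); // advances to a new position
        // Split at each boundary between a lowercase letter or digit and an uppercase letter.
        String[] parts = identifier.split("(?<=[a-z0-9])(?=[A-Z])");
        if (parts.length > 1) {
            for (String part : parts) {
                out.add(part + "+0"); // increment 0: stacked on the same position
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("SyncThreadPool"));
    }
}
```

Stacked tokens like these overlap in the original text, which is the kind of input a highlighter has to handle carefully.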

Re: Fix for "advanced tokenizers and highlighter" problem

2004-06-22 Thread David Spencer
[EMAIL PROTECTED] wrote: I think this version of the highlighter should provide a fix: http://www.inperspective.com/lucene/hilite2beta.zip Before I update the version of the highlighter in the sandbox I'd appreciate feedback from those troubled with the issues to do with overlapping tokens in toke

carrot2 - Re: Categorization

2004-06-23 Thread David Spencer
Otis Gospodnetic wrote: Hello William, Lucene does not have a categorization engine, but you may want to look at Carrot2 (http://sourceforge.net/projects/carrot2/) May be getting off topic - but maybe not..I can't find an example of how to use Carrot2. It builds easy enough, but there's no obvious

Re: carrot2 - Re: Categorization

2004-06-23 Thread David Spencer
Engine and com.dawidweiss.carrot.filter.stc.Processor is a class that drives this. Lucene hook - hey - I'm trying to integrate the two. I think this is how it would be done, get search results from Lucene then set up STCEngine a la how Processor does. Thx, william. From: David Spencer <[EMAIL PROTECTED]&

Re: Various kind of queries

2004-06-24 Thread David Spencer
Hetan Shah wrote: Hello, You guys have been great! I read lots of threads and am learning a lot about Lucene. Can any one point me to right direction or show me a code sample where I can build queries for 'any word' 'all words' and 'phrase. I tried to look on the Lucene FAQ but I did not under

ANN: Experimental site for searching javadoc of OSS projects

2004-06-25 Thread David Spencer
I've put together a kind of experimental site which indexes the javadoc of OSS java projects (well, plus the JDK). http://www.searchmorph.com/ This is meant to solve the problem where a java developer knows something has been done before, but where, in what project - source forge? jakarta? ecli

Re: Making a case for Lucene

2004-06-30 Thread David Spencer
Alex McManus wrote: Hi, we are at the initial design stages of a public-facing web-based search application for a U.S. Federal Agency. We have proposed a clustered Lucene architecture as the best technical solution, as we feel their current system (based on Oracle) won't give the best performance

Visualization of Lucene search results with a treemap

2004-07-01 Thread David Spencer
Inspired by these guys who put results from Google into a treemap... http://google.hivegroup.com/ I did up my own version running against my index of OSS/javadoc trees. This query for "thread pool" shows it off nicely: http://www.searchmorph.com/kat/tsearch.jsp?s=thread%20pool&side=300&goal=500 Thi

Re: search multiple indexes

2004-07-01 Thread David Spencer
Stefan Groschupf wrote: Possibly a silly question - but how would I go about searching multiple indexes using lucene? Do I need to basically repeat the code I use to search one index for each one, or is there a better way to do it? Take a look to the nutch.org sourcecode. It does what you are sea

Re: search multiple indexes

2004-07-01 Thread David Spencer
Stefan Groschupf wrote: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ MultiSearcher.html 100% Right. I personal found code samples more interesting then just java doc. Good point. That why my hint, here the code snippet from nutch: But - warning - in normal use of Lucene you

Re: Visualization of Lucene search results with a treemap

2004-07-01 Thread David Spencer
- for my site I do want to convert the custom spider/cache to use Nutch... Do you know: http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html ? Interesting - is there any code avail to draw the maps? thx, Dave Cheers, Stefan Am 01.07.2004 um 23:28 schrieb David Spencer: Inspired by

Re: Running OutOfMemory while optimizing and searching

2004-07-02 Thread David Spencer
This in theory should not help, but anyway, just in case, the idea is to call gc() periodically to "force" gc - this is the code I use which tries to force it... public static long gc() { long bef = mem(); System.gc(); sleep( 100);
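
The snippet above is cut off, so here is one possible shape for a "force gc" helper, written from scratch. It is not the author's full method: the retry cap of 5, the stop condition, and the return value (estimated bytes freed) are all assumptions. System.gc() remains only a request to the JVM.

```java
/** Sketch of a "force gc" helper; retry count and return value are assumptions. */
final class ForceGcSketch {

    /** Approximate heap currently in use, in bytes. */
    static long usedMemory() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Requests GC repeatedly until used memory stops shrinking; returns estimated bytes freed. */
    static long gc() {
        long before = usedMemory();
        long last = before;
        for (int i = 0; i < 5; i++) {
            System.gc();
            try {
                Thread.sleep(100); // give the collector a moment to run
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop early
                break;
            }
            long now = usedMemory();
            if (now >= last) {
                break; // no further progress
            }
            last = now;
        }
        return before - last;
    }

    public static void main(String[] args) {
        System.out.println("estimated bytes freed: " + gc());
    }
}
```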

Re: Search Result

2004-07-02 Thread David Spencer
Hetan Shah wrote: My search results are only displaying the top portion of the indexed documents. It does match the query in the later part of the document. Where should I look to change the code in demo3 of default 1.3 final distribution. In general if I want to show the block of document that

Re: about search sorting

2004-09-03 Thread David Spencer
Wermus Fernando wrote: Luceners, My app is creating, updating and deleting from the index and searching too. I need some information about sorting by a field. Does any one could send me a link related to sorting? http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Sort.html Thank

IndexSearcher.close() and aborting searches in progress

2004-09-08 Thread David Spencer
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#close() What is the intent of IndexSearcher.close()? I want to know how, in a web app, one can stop a search that's in progress - use case is a user is limited to one search at a time, and when one (expensive)

Re: Full web search engine package using Lucene

2004-09-08 Thread David Spencer
Anne Y. Zhang wrote: Hi, I am assistanting a professor for a IR course. We need to provide the student with a full-fuctioned search engine package, and the professor prefers it being powered by lucene. Since I am new to lucene, can anyone provide me some information that where can I get the packag

Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Aad Nales wrote: Hi All, Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create

Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead
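
The brute-force approach described here (compare the unmatched query term against every index term by Levenshtein distance and keep the best) can be sketched as below. It is not the code referred to in the message: the class and method names are hypothetical, and a plain list stands in for the index's term enumeration.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: pick the closest term by Levenshtein distance (hypothetical names). */
final class LevenshteinSketch {

    /** Classic dynamic-programming edit distance using two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** Scans every candidate term and returns the one with the smallest distance. */
    static String closest(String word, List<String> terms) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String term : terms) {
            int d = distance(word, term);
            if (d < bestDistance) {
                bestDistance = d;
                best = term;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("descent", "parser", "recursive");
        System.out.println(closest("recursize", terms));
    }
}
```

The scan is linear in the number of terms, which is why the replies in this thread look at a precomputed n-gram index instead.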

Re: Existing Parsers

2004-09-09 Thread David Spencer
Honey George wrote: Hi, I know some of them. 1. PDF + http://www.pdfbox.org/ + http://www.foolabs.com/xpdf/download.html - I am using this and found good. It even supports My dated experience from 2 years ago was that (the evil, native code) foolabs pdf parser was the best, but obviously t

Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Doug Cutting wrote: Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to b

Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
eks dev wrote: Hi Doug, Perhaps. Are folks really better at spelling the beginning of words? Yes they are. There were some comprehensive empirical studies on this topic. Winkler modification on Jaro string distance is based on this assumption (boosting similarity if first n, I think 4, chars mat
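
The Winkler adjustment mentioned here boosts a Jaro similarity according to the length of the shared prefix, capped at 4 characters. The sketch below covers only that adjustment and takes the Jaro score as an input; the scaling factor 0.1 is the commonly used default, and the class and method names are hypothetical.

```java
/** Sketch of the Winkler prefix boost applied to a precomputed Jaro similarity. */
final class WinklerSketch {

    /** Length of the common prefix of a and b, capped at cap. */
    static int commonPrefixLength(String a, String b, int cap) {
        int limit = Math.min(cap, Math.min(a.length(), b.length()));
        int i = 0;
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    /** jw = jaro + prefixLen * 0.1 * (1 - jaro), with prefixLen capped at 4. */
    static double winkler(double jaro, String a, String b) {
        int prefixLen = commonPrefixLength(a, b, 4);
        return jaro + prefixLen * 0.1 * (1.0 - jaro);
    }

    public static void main(String[] args) {
        // 0.8 is an arbitrary input score, not the real Jaro value for this pair.
        System.out.println(winkler(0.8, "martha", "marhta"));
    }
}
```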

frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote: Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to b

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection. I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spel

force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Jiří Kuhn wrote: Thanks for the bug's id, it seems like my problem and I have a stand-alone code with main(). What about slow garbage collector? This looks for me as wrong suggestion. I've seen this written up before (javaworld?) as a way to probably "force" GC instead of just a System.gc() call

OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
ion of my code. I believe that the code should run endlesly (I have said it before: in version 1.4 final it does). Jiri. -Original Message- From: David Spencer [mailto:[EMAIL PROTECTED] Sent: Monday, September 13, 2004 5:34 PM To: Lucene Users List Subject: force gc idiom - Re: OutOfMemor

FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
said it before: in version 1.4 final it does). Jiri. -Original Message- From: David Spencer [mailto:[EMAIL PROTECTED] Sent: Monday, September 13, 2004 5:34 PM To: Lucene Users List Subject: force gc idiom - Re: OutOfMemory example Jiří Kuhn wrote: Thanks for the bug's id, it see

Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
David Spencer wrote: Just noticed something else suspicious. FieldSortedHitQueue has a field called Comparators and it seems like things are never removed from it Replying to my own post... this could be the problem. If I put in a print statement here in FieldSortedHitQueue, recompile, and

SegmentReader - Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
ay be causing this leak. David Spencer wrote: David Spencer wrote: Just noticed something else suspicious. FieldSortedHitQueue has a field called Comparators and it seems like things are never removed from it Replying to my own post... this could be the problem. If I put in a print statement

Re: OutOfMemory example

2004-09-13 Thread David Spencer
Daniel Naber wrote: On Monday 13 September 2004 15:06, Jiří Kuhn wrote: I think I can reproduce memory leaking problem while reopening an index. Lucene version tested is 1.4.1, version 1.4 final works OK. My JVM is: Could you try with the latest Lucene version from CVS? I cannot reproduce

Re: PorterStemfilter

2004-09-14 Thread David Spencer
Honey George wrote: Hi, This might be more of a questing related to the PorterStemmer algorithm rather than with lucene, but if anyone has the knowledge please share. You might want to also try the Snowball stemmer: http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/ And KStem: http://c

NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Tate Avery wrote: I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp How embarrassing! Sorry! Fixed! T -Original Message- From: David Spencer [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 3:23 PM To: Lucene Users

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop > 0, to get simi
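
The first step of the lookup-index idea, splitting each existing term into bi- or trigrams, can be sketched as below. Only the splitting is shown; building the separate index and running the sloppy phrase query are left out. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: split a term into overlapping character n-grams (hypothetical names). */
final class NgramSplitSketch {

    /** Returns the n-grams of term in order; empty if the term is shorter than n. */
    static List<String> ngrams(String term, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            out.add(term.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 3)); // [luc, uce, cen, ene]
        System.out.println(ngrams("lucene", 2)); // [lu, uc, ce, en, ne]
    }
}
```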

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("recursive" and "descent") and s

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Aad Nales wrote: By trying: if you type const you will find that it returns 216 hits. The third sports 'const' as a term (space separated and all). I would expect 'conts' to return with const as well. But again I might be mistaken. I am now trying to figure what the problem might be: 1. my expect

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote: Aad Nales wrote: David, Perhaps I misunderstand something so please correct me if I do. I used http://www.searchmorph.com/kat/spell.jsp to look for conts without changing any of the default values. What I got as results did not include 'const' which has quite a high frequenc

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Aad Nales wrote: By trying: if you type const you will find that it returns 216 hits. The third sports 'const' as a term (space separated and all). I would expect 'conts' to return with const as well. But again I might be mistaken. I am now trying to figure what the problem might be: 1. my expect

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: To restate the question for a second. The misspelled word is: "conts". The suggestion expected is "const", which seems reasonable enough as it's just a transposition away, thus the string distance is low. But - I guess the pr
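
One way to look at the "conts" / "const" pair is to count how many character n-grams the two words share. The sketch below does only that counting. It makes no claim about what NGramSpeller actually does or about the cause discussed in the thread, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch: count distinct n-grams shared by two words (hypothetical names). */
final class NgramOverlapSketch {

    /** Distinct character n-grams of s. */
    static Set<String> ngrams(String s, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    /** Number of distinct n-grams that occur in both a and b. */
    static int shared(String a, String b, int n) {
        Set<String> common = ngrams(a, n);
        common.retainAll(ngrams(b, n));
        return common.size();
    }

    public static void main(String[] args) {
        System.out.println(shared("conts", "const", 2)); // 2 (co, on)
        System.out.println(shared("conts", "const", 3)); // 1 (con)
    }
}
```

A transposition near the end of a short word changes most of its n-grams, so the two words overlap less than their low edit distance might suggest.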

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("recursive" and "descent") and s

IndexReader.close() semantics and optimize -- Re: problem with locks when updating the data of a previous stored document

2004-09-16 Thread David Spencer
Crump, Michael wrote: You have to close the IndexReader after doing the delete, before opening the IndexWriter for the addition. See information at this link: http://wiki.apache.org/jakarta-lucene/UpdatingAnIndex Recently I thought I observed that if I use this batch update idiom (1st delete the

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-16 Thread David Spencer
Morus Walter wrote: Hi David, Based on this mail I wrote a "ngram speller" for Lucene. It runs in 2 phases. First you build a "fast lookup index" as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. Let's see. [1] Source is attached
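
The two phases described above can be sketched roughly as follows. This is only an illustration: an in-memory map stands in for the Lucene "fast lookup index", candidates are ranked by the number of shared n-grams, and every name here (NGramLookupSketch, buildIndex, suggest) is hypothetical rather than taken from the attached source.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class NGramLookupSketch {

    /** Splits a word into its overlapping character n-grams. */
    static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    /** Phase 1: map every n-gram to the known words that contain it. */
    static Map<String, Set<String>> buildIndex(Collection<String> words, int n) {
        Map<String, Set<String>> index = new HashMap<>();
        for (String w : words) {
            for (String g : ngrams(w, n)) {
                index.computeIfAbsent(g, k -> new TreeSet<>()).add(w);
            }
        }
        return index;
    }

    /** Phase 2: rank known words by how many distinct n-grams they share with the input. */
    static List<String> suggest(Map<String, Set<String>> index, String misspelled, int n) {
        Map<String, Integer> hits = new TreeMap<>();
        for (String g : new HashSet<>(ngrams(misspelled, n))) {
            for (String w : index.getOrDefault(g, Collections.emptySet())) {
                hits.merge(w, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(hits.keySet());
        // Stable sort: words with equal counts keep their alphabetical order.
        ranked.sort((x, y) -> hits.get(y) - hits.get(x));
        return ranked;
    }

    public static void main(String[] args) {
        List<String> words = List.of("const", "count", "parser", "recursive");
        Map<String, Set<String>> index = buildIndex(words, 2);
        System.out.println(suggest(index, "conts", 2));
        System.out.println(suggest(index, "recursize", 2));
    }
}
```

Shared n-gram counts only narrow the field of candidates; a real speller would presumably rescore them, for example by string distance or term frequency, before suggesting anything.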

Re: Highlighting PDF file after the search

2004-09-20 Thread David Spencer
[EMAIL PROTECTED] wrote: Hello, I can successfully index and search the PDF documents, however I am not able to highlight the searched text in my original PDF file (ie: like dtSearch highlights on original file) I took a look at the highlighter in sandbox, compiled it and have it ready. I am wond

Re: Efficient search on lucene mailing archives

2004-10-14 Thread David Spencer
sam s wrote: Hi Folks, Is there any place where I can do a better search on lucene mailing archives? I tried JGuru and looks like their search is paid. Apache maintained archives lack efficient searching. Of course one of the ironies is, shouldn't we be able to use Lucene to search the mailing li

Re: Thesaurus ...

2004-10-19 Thread David Spencer
Erik Hatcher wrote: Have a look at the WordNet contribution in the Lucene sandbox repository. It could be leveraged for part of a solution. It's something I contributed. Relevant links are: http://jakarta.apache.org/lucene/docs/lucene-sandbox/ http://www.tropo.com/techno/java/lucene/wordnet.html

Re: Looking for consulting help on project

2004-10-27 Thread David Spencer
Suggestions [a] Try invoking the VM w/ an option like "-XX:CompileThreshold=100" or even a smaller number. This encourages the hotspot VM to compile methods sooner, thus the app will take less time to "warm up". http://java.sun.com/docs/hotspot/VMOptions.html#additional You might want to sea

Re: Single Digit Indexing

2004-12-06 Thread David Spencer
Otis Gospodnetic wrote: Hm, if you can index 11, you should be able to index 8 as well. In any case, you most likely want to make sure that your Analyzer is not just In theory you could have a "length" filter tossing out tokens that are too short or too long, and maybe you're getting rid of all
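
As a rough illustration of how a "length" filter could silently drop a single-digit token, here is a minimal sketch that works on plain strings rather than on a Lucene TokenStream. The class name, method name, and bounds are assumptions made up for this example and are not part of any Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LengthFilterSketch {

    /** Keeps only tokens whose length lies within [min, max], both inclusive. */
    static List<String> filterByLength(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() >= min && t.length() <= max) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("8", "11", "lucene");
        // With a minimum length of 2, "11" survives but "8" is discarded.
        System.out.println(filterByLength(tokens, 2, 10));
        // Lowering the minimum to 1 keeps the single digit.
        System.out.println(filterByLength(tokens, 1, 10));
    }
}
```

If "11" is searchable but "8" is not, checking the analyzer chain for a minimum-length rule like this is one plausible place to start.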

Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-11 Thread David Spencer
more frequent in my index than "map" and "tree"...I'm sure "hash java" occurs more frequently than "hash map" - or any other freq, non-stop word, and it's dubious that "hash java" is a useful suggestion... So if you type fast, it doe

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote: Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox Ot oh, sorry,

Re: [RFE] IndexWriter.updateDocument()

2004-12-14 Thread David Spencer
petite_abeille wrote: Well, the subject says it all... If there is one thing which is overly cumbersome in Lucene, it's updating documents, therefore this Request For Enhancement: Please consider enhancing the IndexWriter API to include an updateDocument(...) method to take care of all the gory

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
TED]> wrote: Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox so I've attached i

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote: From the code I looked at, those calls don't recalculate on every call. I was referring to this fragment below from BooksLikeThis.docsLike(), and was mentioning it as the javadoc http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/in dex/TermFreqVector.html does n

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote: You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2&item=source Well done, uses a term vector, instead of reparsing the orig doc, to form the similarity query. Also I like the way you exclude the source doc in th
