Re: knowing which field contributed the search result

2005-02-22 Thread David Spencer
into the src if necessary. getValue() is the score, so all that's missing is the name of the field and I'm not sure if that's directly returned or not. Thanks -John On Mon, 21 Feb 2005 12:20:15 -0800, David Spencer [EMAIL PROTECTED] wrote: John Wang wrote: Anyone has any thoughts on this? Does

Re: Handling Synonyms

2005-02-21 Thread David Spencer
Luke Shannon wrote: Hello; Does anyone see a problem with the following approach? No, no problem with it and it's in fact what my Wordnet Query Expansion sandbox module does. The nice thing about Lucene is you at least have the option of doing things the other way - you can write a custom

Re: knowing which field contributed the search result

2005-02-21 Thread David Spencer
John Wang wrote: Anyone has any thoughts on this? Does this help? http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Searchable.html#explain(org.apache.lucene.search.Query,%20int) Thanks -John On Wed, 16 Feb 2005 14:39:52 -0800, John Wang [EMAIL PROTECTED] wrote: Hi: Is there way

Re: Search Performance

2005-02-18 Thread David Spencer
No one has mentioned JVM options yet. [a] -server [b] -XX:CompileThreshold=1000 [c] Raise the -Xms value if you haven't done so (-Xms...) I think by default the VM runs with -client but -server makes more sense for web containers (Tomcat etc). [b] tells the hotspot compiler to compile methods

Re: Search Performance

2005-02-18 Thread David Spencer
Are you using the highlighter or doing anything non-trivial in displaying the results? Are the pages being compressed (mod_gzip or some servlet equivalent)? This definitely helps, though to see the effect you may have to make sure your simulated users are remote. Also consider caching search

Re: Search Performance

2005-02-18 Thread David Spencer
Michael Celona wrote: Just tried that... works like a charm... thanks... Could you clarify what the problem was - just the overhead of opening IndexSearchers? Michael -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Friday, February 18, 2005 4:42 PM To: Lucene

Re: Document comparison

2005-02-18 Thread David Spencer
Otis Gospodnetic wrote: Matt, Erik and I have some code for this in Lucene in Action, but David Spencer did this since the book was published: http://www.lucenebook.com/blog/announcements/more_like_this.html If you want an informal way of doing it you're right, just feed the words

Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread David Spencer
Otis Gospodnetic wrote: The most obvious answer is that the full-text indexing features of RDBMS's are not as good (as fast) as Lucene. MySQL, PostgreSQL, Oracle, MS SQL Server etc. all have full-text indexing/searching features, but I always hear people complaining about the speed. Yeah, but

Re: Lucene vs. in-DB-full-text-searching

2005-02-18 Thread David Spencer
markharw00d wrote: But this brings up - has anyone run Lucene off a database trigger or are triggers known to be slow and bad for this use? I suspect the tricky bit would be knowing how to balance the calls to Reader/Writer closes, opens and optimizes. Record updates are the usual fun and

Re: Document Clustering

2005-02-08 Thread David Spencer
Owen Densmore wrote: I would like to be able to analyze my document collection (~1200 documents) and discover good buckets of categories for them. I'm pretty sure this is termed Document Clustering .. finding the emergent clumps the documents fall naturally into judging from their term

Configurable indexing of an RDBMS, has it been done before?

2005-02-07 Thread David Spencer
Many times I've written ad-hoc code that pulls in data from an RDBMS and builds a Lucene index. The use case is a typical database-driven dynamic website which would be a hassle to spider (say, due to tricky authentication). I had a feeling this had been done in a general manner but didn't see

Re: Configurable indexing of an RDBMS, has it been done before?

2005-02-07 Thread David Spencer
then for the 'normally' stored documents. For this latter situation the search logic assumes that the query is appropriately configured by the application. I am not sure if this is the kind of solution that you are looking for, but everything we produce is 100% open source. Cheers, Aad David Spencer wrote: Many

competition - Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-02-01 Thread David Spencer
) http://www.indexengines.com/ -- Also, out of curiosity, do people have appliance h/w vendors they like? These guys seem like they have nice options for pretty colors: http://www.mbx.com/oem/index.cfm http://www.mbx.com/oem/options/ David Spencer wrote: This reminds me, has

Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-31 Thread David Spencer
Otis Gospodnetic wrote: Adam, Dawid posted some code that lets you use Carrot2 locally with Lucene, see embedded zip url here for carrot2/lucene code - it may also be in the carrot2 cvs tree too - this is what I used in the wikipedia/cluster stuff as the basis

Re: query term frequency

2005-01-27 Thread David Spencer
Jonathan Lasko wrote: What do I call to get the term frequencies for terms in the Query? I can't seem to find it in the Javadoc... Do you mean the # of docs that have a term? http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#docFreq(org.apache.lucene.index.Term)

rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
This reminds me, has anyone ever discussed something similar: - rackmount server ( or for coolness factor, that mini mac) - web i/f for config/control - of course the server would have the following s/w: -- web server -- lucene / nutch Part of the work here I think is having a decent web i/f to

Re: rackmount lucene/nutch - Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
: google mini? who needs it when Lucene is there I discuss this with myself a lot inside my head... :) Seriously, I agree with Erik. I think this is a business opportunity. How many people are hating me now and going shh? Raise your hands! Otis --- David Spencer [EMAIL PROTECTED] wrote

Re: google mini? who needs it when Lucene is there

2005-01-27 Thread David Spencer
Xiaohong Yang (Sharon) wrote: Hi, I agree that Google mini is quite expensive. It might be similar to the desktop version in quality. Anyone knows google's ratio of index to text? Is it true that Lucene's index is about 500 times the original text size (not including image size)? I don't

lucenebook.com -- Re: Search on heterogenous index

2005-01-26 Thread David Spencer
Erik Hatcher wrote: On Jan 26, 2005, at 5:44 AM, Simeon Koptelov wrote: Heterogenous Documents/indices are OK - check out the second hit: http://www.lucenebook.com/search?query=heterogenous+different Thanks, I'll consider buying Lucene in Action. Our master plan is working! :) Just

Re: WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE

2005-01-24 Thread David Spencer
Pierrick Brihaye wrote: Hi, David Spencer wrote: One example of expansion with the synonym boost set to 0.9 is the query big dog expands to: Interesting. Do you plan to add expansion on other Wordnet relationships ? Hypernyms and hyponyms would be a good start point for thesaurus-like search

Re: carrot2 question too - Re: Fun with the Wikipedia

2005-01-17 Thread David Spencer
Dawid Weiss wrote: Hi David, I apologize about the delay in answering this one, Lucene is a busy mailing list and I had a hectic last week... Again, sorry for belated answer, hope you still find it useful. Oh no problem, and yes carrot2 is useful and fun. It's a rich package so it takes a

Re: Question about Analyzer and words spelled in different languages

2005-01-17 Thread David Spencer
Mariella Di Giacomo wrote: Hi ALL, We are trying to index scientific articles written in English, but whose authors can be spelled in any language (depending on the author's nationality) E.g. Schäffer In the XML document that we provide to Lucene the author name is written in the following way

MoreLikeThis and other similarity query generators checked in + online demo

2005-01-17 Thread David Spencer
Based on mail from Doug I wrote a more like this query generator, named, well, MoreLikeThis. Bruce Ritchie and Mark Harwood made changes to it (esp term vector support) and bug fixes. Thanks to everyone. I've checked in the code to the sandbox under contributions/similarity. The package it ends

stop words and index size

2005-01-13 Thread David Spencer
Does anyone know how much stop words are supposed to affect the index size? I did an experiment of building an index once with, and once without, stop words. The corpus is the English Wikipedia, and I indexed the title and body of the articles. I used a list of 525 stop words. With stopwords

Re: full text as input ?

2005-01-13 Thread David Spencer
Hunter Peress wrote: is it efficient and feasible to use lucene to do full text comparisons. eg : take an entire text that's reasonably large ( eg more than 10 words) and find the result set within the lucene search index that is statistically similar with all the text. I do this kind of stuff

Re: How do you handle dynamic html pages?

2005-01-10 Thread David Spencer
Kevin L. Cobb wrote: I don't like to periodically re-index everything because 1) you can't be confident that your searches are as up to date as they could be, and 2) you are wasting cycles either checking for documents that may or may not need to be updated, or re-indexing documents that don't

Re: SYNONYM + GOOGLE

2005-01-10 Thread David Spencer
Erik Hatcher wrote: Karthik, Thanks for that info. I knew I was behind the times with WordNet using the sandbox code, but it was good enough for my purposes at the time. I will definitely try out the latest WordNet offerings in the future Hi...I wrote the WordNet sandbox code - but I'm not

Re: Quick question about highlighting.

2005-01-07 Thread David Spencer
Jim Lynch wrote: I've read as much as I could find on the highlighting that is now in the sandbox. I didn't find the javadocs. I have a copy here: http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/highlighter/build/docs/api/overview-summary.html I found a link to them, but it

google suggest / incremental search - Re: Lucene appreciation

2004-12-17 Thread David Spencer
Rony Kahan wrote: Thanks for feedback. PA - Since rss readers usually visit at least once per day, we only show jobs from past few days. This allows us to use a smaller, faster index for traffic intensive rss searching. Ben Praveen - Thanks for the UI suggestions. Hope to have that %3A %22

Re: TFIDF Implementation

2004-12-15 Thread David Spencer
Christoph Kiefer wrote: David, Bruce, Otis, Thank you all for the quick replies. I looked through the BooksLikeThis example. I also agree, it's a very good and effective way to find similar docs in the index. Nevertheless, what I need is really a similarity matrix holding all TF*IDF values. For

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote: Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox Uh oh, sorry, I'll try to get

Re: [RFE] IndexWriter.updateDocument()

2004-12-14 Thread David Spencer
petite_abeille wrote: Well, the subject says it all... If there is one thing which is overly cumbersome in Lucene, it's updating documents, therefore this Request For Enhancement: Please consider enhancing the IndexWriter API to include an updateDocument(...) method to take care of all the gory

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox so I've attached it here. Just repackage and test. Regards, Bruce Ritchie http

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote: From the code I looked at, those calls don't recalculate on every call. I was referring to this fragment below from BooksLikeThis.docsLike(), and was mentioning it as the javadoc http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html does

Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote: You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2&item=source Well done, uses a term vector, instead of reparsing the orig doc, to form the similarity query. Also I like the way you exclude the source doc in

Re: LIMO problems

2004-12-13 Thread David Spencer
Daniel Cortes wrote: Hi, I want to know what library do you use for search in PPT files? I use this (native code): http://chicago.sourceforge.net/xlhtml POI support this? thanks

Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-12 Thread David Spencer
about how it works. On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer [EMAIL PROTECTED] wrote: Google just came out with a page that gives you feedback as to how many pages will match your query and variations on it: http://www.google.com/webhp?complete=1&hl=en I had an unexposed experiment I had

Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-11 Thread David Spencer
other freq, non-stop word, and it's dubious that hash java is a useful suggestion... So if you type fast, it doesn't hit the server until you pause. There are some more detailed postings on slashdot about how it works. On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer [EMAIL PROTECTED] wrote: Google

Incremental Search experiment with Lucene, sort of like the new Google Suggestion page

2004-12-10 Thread David Spencer
Google just came out with a page that gives you feedback as to how many pages will match your query and variations on it: http://www.google.com/webhp?complete=1&hl=en I had an unexposed experiment I had done with Lucene a few months ago that this has inspired me to expose - it's not the same,

Re: Single Digit Indexing

2004-12-06 Thread David Spencer
Otis Gospodnetic wrote: Hm, if you can index 11, you should be able to index 8 as well. In any case, you most likely want to make sure that your Analyzer is not just In theory you could have a length filter tossing out tokens that are too short or too long, and maybe you're getting rid of all
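
A minimal sketch of the length-filter idea mentioned above, in plain Java rather than Lucene's TokenFilter API. The class name, method name and the min/max values are assumptions made for illustration; the point is only that a minimum length of 2 would keep 11 but silently drop 8.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: keeps only tokens whose
// length lies within [min, max].
final class TokenLengthFilter {
    static List<String> filter(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int len = token.length();
            if (len >= min && len <= max) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // With min = 2 the single-digit token is tossed out.
        System.out.println(filter(Arrays.asList("8", "11", "lucene"), 2, 10));
    }
}
```

If single digits go missing from an index, checking the analyzer chain for a filter of this kind is one place to start.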

Re: Looking for consulting help on project

2004-10-27 Thread David Spencer
Suggestions [a] Try invoking the VM w/ an option like -XX:CompileThreshold=100 or even a smaller number. This encourages the hotspot VM to compile methods sooner, thus the app will take less time to warm up. http://java.sun.com/docs/hotspot/VMOptions.html#additional You might want to search

Re: Thesaurus ...

2004-10-19 Thread David Spencer
Erik Hatcher wrote: Have a look at the WordNet contribution in the Lucene sandbox repository. It could be leveraged for part of a solution. It's something I contributed. Relevant links are: http://jakarta.apache.org/lucene/docs/lucene-sandbox/ http://www.tropo.com/techno/java/lucene/wordnet.html

Re: Efficient search on lucene mailing archives

2004-10-14 Thread David Spencer
sam s wrote: Hi Folks, Is there any place where I can do a better search on lucene mailing archives? I tried JGuru and looks like their search is paid. Apache maintained archives lack efficient searching. Of course one of the ironies is, shouldn't we be able to use Lucene to search the mailing

Re: Highlighting PDF file after the search

2004-09-20 Thread David Spencer
[EMAIL PROTECTED] wrote: Hello, I can successfully index and search the PDF documents, however I am not able to highlight the searched text in my original PDF file (ie: like dtSearch highlights on original file) I took a look at the highlighter in sandbox, compiled it and have it ready. I am

IndexReader.close() semantics and optimize -- Re: problem with locks when updating the data of a previous stored document

2004-09-16 Thread David Spencer
Crump, Michael wrote: You have to close the IndexReader after doing the delete, before opening the IndexWriter for the addition. See information at this link: http://wiki.apache.org/jakarta-lucene/UpdatingAnIndex Recently I thought I observed that if I use this batch update idiom (1st delete

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-16 Thread David Spencer
Morus Walter wrote: Hi David, Based on this mail I wrote a ngram speller for Lucene. It runs in 2 phases. First you build a fast lookup index as mentioned above. Then to correct a word you do a query in this index based on the ngrams in the misspelled word. Let's see. [1] Source is attached

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Aad Nales wrote: By trying: if you type const you will find that it returns 216 hits. The third sports 'const' as a term (space separated and all). I would expect 'conts' to return with const as well. But again I might be mistaken. I am now trying to figure what the problem might be: 1. my

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote: Aad Nales wrote: David, Perhaps I misunderstand something so please correct me if I do. I used http://www.searchmorph.com/kat/spell.jsp to look for conts without changing any of the default values. What I got as results did not include 'const' which has quite a high

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Aad Nales wrote: By trying: if you type const you will find that it returns 216 hits. The third sports 'const' as a term (space separated and all). I would expect 'conts' to return with const as well. But again I might be mistaken. I am now trying to figure what the problem might be: 1. my

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: To restate the question for a second. The misspelled word is: conts. The suggestion expected is const, which seems reasonable enough as it's just a transposition away, thus the string distance is low. But - I guess the problem w/ the algorithm

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-15 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms (recursive and descent) and suggests alternatives

Re: PorterStemfilter

2004-09-14 Thread David Spencer
Honey George wrote: Hi, This might be more of a question related to the PorterStemmer algorithm rather than with lucene, but if anyone has the knowledge please share. You might want to also try the Snowball stemmer: http://jakarta.apache.org/lucene/docs/lucene-sandbox/snowball/ And KStem:

NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead
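
A minimal sketch of the string-distance step described above, assuming the textbook Levenshtein definition (insert, delete and substitute each cost 1). This is not the posted code; the class and method names are made up for illustration. In the brute-force scheme, a query term with zero matches would be compared against every index term with a function like this and the lowest-scoring terms returned.

```java
// Hypothetical helper: classic two-row dynamic-programming Levenshtein distance.
final class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("recursize", "recursive")); // prints 1
        System.out.println(levenshtein("conts", "const"));         // prints 2
    }
}
```

Note that plain Levenshtein scores a transposition such as conts/const as 2, not 1, because it has no swap operation.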

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Tate Avery wrote: I get a NullPointerException shown (via Apache) when I try to access http://www.searchmorph.com/kat/spell.jsp How embarrassing! Sorry! Fixed! T -Original Message- From: David Spencer [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 3:23 PM To: Lucene Users

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: ...or prepare in advance a fast lookup index - split all existing terms to bi- or trigrams, create a separate lookup index, and then simply for each term ask a phrase query (phrase = all n-grams from an input term), with a slop 0, to get similar
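
A minimal sketch of the term-splitting step for the lookup index described above, assuming simple character n-grams with no padding at the word boundaries. The class and method names are hypothetical; building and querying the actual Lucene lookup index is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a term into overlapping character n-grams.
final class NGrams {
    static List<String> ngrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("const", 3)); // prints [con, ons, nst]
        System.out.println(ngrams("conts", 3)); // prints [con, ont, nts]
        System.out.println(ngrams("const", 2)); // prints [co, on, ns, st]
    }
}
```

Each indexed term would be stored with its n-grams, and a misspelled word would be looked up by querying on its own n-grams. A transposition near the end of a short word leaves few n-grams in common, as the two trigram lists show.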

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms (recursive and descent) and suggests alternatives

force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
Ji Kuhn wrote: Thanks for the bug's id, it seems like my problem and I have a stand-alone code with main(). What about slow garbage collector? This looks for me as wrong suggestion. I've seen this written up before (javaworld?) as a way to probably force GC instead of just a System.gc() call. I

OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
that the code should run endlessly (I have said it before: in version 1.4 final it does). Jiri. -Original Message- From: David Spencer [mailto:[EMAIL PROTECTED] Sent: Monday, September 13, 2004 5:34 PM To: Lucene Users List Subject: force gc idiom - Re: OutOfMemory example Ji Kuhn wrote

FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
it before: in version 1.4 final it does). Jiri. -Original Message- From: David Spencer [mailto:[EMAIL PROTECTED] Sent: Monday, September 13, 2004 5:34 PM To: Lucene Users List Subject: force gc idiom - Re: OutOfMemory example Ji Kuhn wrote: Thanks for the bug's id, it seems like my problem

Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
David Spencer wrote: Just noticed something else suspicious. FieldSortedHitQueue has a field called Comparators and it seems like things are never removed from it Replying to my own post... this could be the problem. If I put in a print statement here in FieldSortedHitQueue, recompile

SegmentReader - Re: FieldSortedHitQueue.Comparators -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread David Spencer
be causing this leak. David Spencer wrote: David Spencer wrote: Just noticed something else suspicious. FieldSortedHitQueue has a field called Comparators and it seems like things are never removed from it Replying to my own post... this could be the problem. If I put in a print statement here

Re: OutOfMemory example

2004-09-13 Thread David Spencer
Daniel Naber wrote: On Monday 13 September 2004 15:06, Ji Kuhn wrote: I think I can reproduce memory leaking problem while reopening an index. Lucene version tested is 1.4.1, version 1.4 final works OK. My JVM is: Could you try with the latest Lucene version from CVS? I cannot reproduce

Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
eks dev wrote: Hi Doug, Perhaps. Are folks really better at spelling the beginning of words? Yes they are. There were some comprehensive empirical studies on this topic. Winkler modification on Jaro string distance is based on this assumption (boosting similarity if first n, I think 4, chars

frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote: Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread David Spencer
Doug Cutting wrote: David Spencer wrote: Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection. I keep thinking over this one and I don't understand it. If a user misspells a word and the did you mean spelling correction

Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Aad Nales wrote: Hi All, Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create

Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Andrzej Bialecki wrote: David Spencer wrote: I can/should send the code out. The logic is that for any terms in a query that have zero matches, go thru all the terms(!) and calculate the Levenshtein string distance, and return the best matches. A more intelligent way of doing this is to instead

Re: Existing Parsers

2004-09-09 Thread David Spencer
Honey George wrote: Hi, I know some of them. 1. PDF + http://www.pdfbox.org/ + http://www.foolabs.com/xpdf/download.html - I am using this and found good. It even supports My dated experience from 2 years ago was that (the evil, native code) foolabs pdf parser was the best, but obviously

Re: combining open office spellchecker with Lucene

2004-09-09 Thread David Spencer
Doug Cutting wrote: Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to

IndexSearcher.close() and aborting searches in progress

2004-09-08 Thread David Spencer
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#close() What is the intent of IndexSearcher.close()? I want to know how, in a web app, one can stop a search that's in progress - use case is a user is limited to one search at a time, and when one (expensive)

Re: about search sorting

2004-09-03 Thread David Spencer
Wermus Fernando wrote: Luceners, My app is creating, updating and deleting from the index and searching too. I need some information about sorting by a field. Could anyone send me a link related to sorting? http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Sort.html

Re: Running OutOfMemory while optimizing and searching

2004-07-02 Thread David Spencer
This in theory should not help, but anyway, just in case, the idea is to call gc() periodically to force gc - this is the code I use which tries to force it... public static long gc() { long bef = mem(); System.gc(); sleep( 100);
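
The fragment above is cut off, and the bodies of its mem() and sleep() helpers are not shown. The following is a guess at the general shape of such an idiom, not the original code: the class name, the usedMemory() helper, the three passes and the 100 ms pause are all assumptions. System.gc() is only a hint to the VM, so the returned figure is approximate and can even be negative.

```java
// Hypothetical reconstruction of a "try to force GC" helper.
final class GcUtil {
    /** Heap bytes currently in use, as reported by the runtime. */
    static long usedMemory() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Requests GC a few times with short pauses; returns approximate bytes freed. */
    static long gc() {
        long before = usedMemory();
        for (int pass = 0; pass < 3; pass++) {
            System.gc();
            try {
                Thread.sleep(100); // give the collector a moment to run
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
        }
        return before - usedMemory();
    }

    public static void main(String[] args) {
        System.out.println("approx bytes freed: " + gc());
    }
}
```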

Re: Search Result

2004-07-02 Thread David Spencer
Hetan Shah wrote: My search results are only displaying the top portion of the indexed documents. It does match the query in the later part of the document. Where should I look to change the code in demo3 of default 1.3 final distribution. In general if I want to show the block of document that

Visualization of Lucene search results with a treemap

2004-07-01 Thread David Spencer
Inspired by these guys who put results from Google into a treemap... http://google.hivegroup.com/ I did up my own version running against my index of OSS/javadoc trees. This query for thread pool shows it off nicely: http://www.searchmorph.com/kat/tsearch.jsp?s=thread%20pool&side=300&goal=500 This

Re: search multiple indexes

2004-07-01 Thread David Spencer
Stefan Groschupf wrote: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/MultiSearcher.html 100% Right. I personally found code samples more interesting than just java doc. Good point. That's why my hint, here is the code snippet from nutch: But - warning - in normal use of Lucene

Re: Visualization of Lucene search results with a treemap

2004-07-01 Thread David Spencer
- but - for my site I do want to convert the custom spider/cache to use Nutch... Do you know: http://websom.hut.fi/websom/comp.ai.neural-nets-new/html/root.html ? Interesting - is there any code avail to draw the maps? thx, Dave Cheers, Stefan On 01.07.2004 at 23:28, David Spencer wrote: Inspired

ANN: Experimental site for searching javadoc of OSS projects

2004-06-25 Thread David Spencer
I've put together a kind of experimental site which indexes the javadoc of OSS java projects (well, plus the JDK). http://www.searchmorph.com/ This is meant to solve the problem where a java developer knows something has been done before, but where, in what project - source forge? jakarta?

carrot2 - Re: Categorization

2004-06-23 Thread David Spencer
Otis Gospodnetic wrote: Hello William, Lucene does not have a categorization engine, but you may want to look at Carrot2 (http://sourceforge.net/projects/carrot2/) May be getting off topic - but maybe not..I can't find an example of how to use Carrot2. It builds easy enough, but there's no

Re: carrot2 - Re: Categorization

2004-06-23 Thread David Spencer
and com.dawidweiss.carrot.filter.stc.Processor is a class that drives this. Lucene hook - hey - I'm trying to integrate the two. I think this is how it would be done, get search results from Lucene then set up STCEngine a la how Processor does. Thx, william. From: David Spencer [EMAIL PROTECTED] Reply-To: Lucene Users

Re: Fix for advanced tokenizers and highlighter problem

2004-06-22 Thread David Spencer
[EMAIL PROTECTED] wrote: I think this version of the highlighter should provide a fix: http://www.inperspective.com/lucene/hilite2beta.zip Before I update the version of the highlighter in the sandbox I'd appreciate feedback from those troubled with the issues to do with overlapping tokens in

amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
I've run across an amusing interaction between advanced Analyzers/TokenStreams and the very useful term highlighter: http://cvs.apache.org/viewcvs/jakarta-lucene-sandbox/contributions/highlighter/ I have a custom Analyzer I'm using to index javadoc-generated web pages. The Analyzer in turn has

Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
[EMAIL PROTECTED] wrote: Yes, this issue has come up before with other choices of analyzers. I think it should be fixable without changing any of the highlighter APIs - can you email me or post here the source to your analyzer? Code attached - don't make fun of it please :) - very prelim. I

Re: amusing interaction between advanced tokenizers and highlighter package

2004-06-19 Thread David Spencer
Erik Hatcher wrote: On Jun 19, 2004, at 2:29 AM, David Spencer wrote: A naive analyzer would turn something like SyncThreadPool into one token. Mine uses the great Lucene capability of Tokens being able to have a 0 position increment to turn it into the token stream: Sync (incr = 0) Thread
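
A minimal sketch of the identifier-splitting step only, assuming a split wherever a lowercase letter is followed by an uppercase one. The class and method names are hypothetical, and this is not the analyzer from the thread. In an analyzer of the kind described, the sub-tokens and the whole identifier would then be emitted with suitable position increments; that part is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a camel-case identifier at lower-to-upper boundaries.
final class CamelCaseSplitter {
    static List<String> split(String identifier) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < identifier.length(); i++) {
            boolean boundary = Character.isUpperCase(identifier.charAt(i))
                    && Character.isLowerCase(identifier.charAt(i - 1));
            if (boundary) {
                parts.add(identifier.substring(start, i));
                start = i;
            }
        }
        if (start < identifier.length()) {
            parts.add(identifier.substring(start));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("SyncThreadPool")); // prints [Sync, Thread, Pool]
    }
}
```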

Re: Proximity Searches behavior

2004-06-09 Thread David Spencer
Erik Hatcher wrote: On Jun 9, 2004, at 8:53 AM, Terry Steichen wrote: 3) Is there a plan for adding QueryParser support for the SpanQuery family? Another important facet to Terry's question here is what syntax to use to express all various types of queries? I suspect that Google stats And

extensible query parser - Re: Proximity Searches behavior

2004-06-09 Thread David Spencer
Erik Hatcher wrote: On Jun 9, 2004, at 12:21 PM, David Spencer wrote: show us that most folks query with 1 - 3 words and do not use any of the advanced features. But with automagic query expansion these things might be done behind the scenes. Nutch, for one, expands simple queries to check

Setting Similarity in IndexWriter and IndexSearcher

2004-06-07 Thread David Spencer
Does it ever make sense to set the Similarity object in only one of IndexWriter or IndexSearcher? I.e., if I set it in IndexWriter, can I avoid setting it in IndexSearcher? Also, can I avoid setting it in IndexWriter and only set it in IndexSearcher? I noticed Nutch sets it in both

No tvx reader

2004-06-05 Thread David Spencer
Using 1.4rc3. Running an app that indexes 50k documents (thus it just uses an IndexWriter). One field has the boolean set so that a term vector is stored for it, while the other 11 fields don't. On stdout I see No tvx file 13 times. Glancing thru the src it seems this comes from

bonus for exact case match

2004-06-03 Thread David Spencer
Does anyone have any experience with giving a bonus for exactly matching case in queries? One use case: in the Java world, maybe I want to see references to Map (java.util.Map) but am not interested in a (geographical) map. I believe, in the context of Lucene, one way is to have an Analyzer

Re: similarity of two texts

2004-06-02 Thread David Spencer
Terry Steichen wrote: Erik, Could you expand on this just a wee bit, perhaps with an example of how to compute this vector angle? I'm tempted to write the code to see how it works, but FYI this doc seems to nicely explain the concepts:
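
For the "vector angle" asked about above, a minimal sketch in plain Java, assuming the simplest possible setup: each text becomes a vector of raw term counts and the similarity is the cosine of the angle between the two vectors. This is not Lucene's scoring formula, and the class and method names (`CosineSimilarity`, `termFrequencies`, `cosine`) are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: cosine similarity over raw term counts (no Lucene involved).
final class CosineSimilarity {

    /** Lower-cases the text, splits on non-word characters, and counts each term. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                tf.merge(word, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** cos(angle) = (a . b) / (|a| * |b|); returns 0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0.0;
        for (int count : v.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

Raw counts treat every term equally; weighting rare terms more heavily (as discussed elsewhere in this thread) would mean scaling each count by some inverse document frequency before taking the cosine.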

Re: similarity of two texts - another question

2004-06-02 Thread David Spencer
; while ( (t = ts.next()) != null) { sb.append( t.termText() + " "); } return QueryParser.parse( sb.toString(), DFields.CONTENTS, a); } David Spencer [EMAIL PROTECTED] 06/01/04 08:25PM Erik Hatcher wrote: On Jun 1, 2004, at 4

Re: Page ranking

2004-06-01 Thread David Spencer
Scott Sayles wrote: Is there anyone out there that has page ranking implemented on top of Lucene? I recently discovered JUNG which has 2 impls of PageRank: http://jung.sourceforge.net/api/1.4.1/edu/uci/ics/jung/algorithms/importance/PageRank.html I did a test of hooking it up to my spider and

Re: about search and update one index simultaneously

2004-06-01 Thread David Spencer
xuemei li wrote: Hi all, see this: http://wiki.apache.org/jakarta-lucene/UpdatingAnIndex Can we search and update one index simultaneously? Does anyone know something about it? I have done some experiments. Currently the search is blocked while the index is being updated. The error in the search node is like

Re: similarity of two texts - another question

2004-06-01 Thread David Spencer
Erik Hatcher wrote: On Jun 1, 2004, at 4:41 PM, uddam chukmol wrote: Well, a question again, how does Lucene compute the score between a document and a query? And I might add, thus, this approach to similarity gives more weight to rare terms that match, which one might want for this kind of

now maybe Mozilla/IMAP URLs - Re: StandardTokenizer and e-mail

2004-05-21 Thread David Spencer
This reminds me - if you have a search engine that indexes a mail store and you present results in a web page to a browser, you want to (of course...well I think this is obvious) send back a URL that would cause the user's native mail client to pull up the msg. IMAP has a URL format, and I use

asktog on search problems

2004-05-21 Thread David Spencer
Haven't seen this discussed here. See 7a at the link below: http://www.asktog.com/columns/062top10ReasonsToNotShop.html 7a talks about searching on a camera site for the Lowepro 100 AW. He says this query works: Lowepro 100 AW and this query does not work: Lowepro 100AW Cross checking with
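
One possible way to make both spellings of the model number match, sketched in plain Java: normalize the query (and, by assumption, the indexed text) by inserting a space at every letter/digit boundary before tokenizing. The class `ModelNumberNormalizer` is hypothetical and this is not how any particular site or Lucene analyzer is known to behave.

```java
import java.util.Locale;

// Hypothetical sketch: split at letter/digit boundaries so "100AW" and "100 AW" normalize alike.
final class ModelNumberNormalizer {

    static String normalize(String query) {
        String spaced = query
                .replaceAll("(?<=\\d)(?=\\p{L})", " ")   // digit followed by letter: "100AW" -> "100 AW"
                .replaceAll("(?<=\\p{L})(?=\\d)", " ");  // letter followed by digit: "AW100" -> "AW 100"
        // Collapse runs of whitespace and lower-case for comparison.
        return spaced.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }
}
```

The trade-off is that splitting at every boundary also breaks up terms where the letters and digits belong together, so a real analyzer might keep the joined form as an additional token rather than replace it.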

Re: Scoring documents by Click Count

2004-05-06 Thread David Spencer
Otis Gospodnetic wrote: Sure. On click, get document Id (not internal docId, but something you use as a surrogate primary key) of the clicked document. Retrieve the document. Pull out the value of 'clickCount' field. +1 it. Delete the document, and re-add it (there is no 'update(Document)'

Re: Lucene index - information

2004-03-19 Thread David Spencer
Karl Koch wrote: If I create a standard index, what does Lucene store in this index? What should be stored in an index at least? Just a link to the file and keywords? Or also word numbers? What else? Does somebody know a paper which discusses this problem of what to put in a good universal IR

Re: incomplete word match

2004-03-11 Thread David Spencer
SubstringQuery, my humble contribution. http://www.mail-archive.com/[EMAIL PROTECTED]/msg06388.html Tomcat Programmer wrote: I have a situation where I need to be able to find incomplete word matches, for example a search for the string 'ape' would return matches for 'grapes' 'naples' 'staples'

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread David Spencer
Maybe I missed something but I always thought the stop list should be a Set, not a Map (or Hashtable/Dictionary). After all, all you need to know is existence and that's what a Set does. Doug Cutting wrote: Erik Hatcher wrote: Well, one issue you didn't consider is changing a public method
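
To illustrate the point about a stop list only needing an existence check, a minimal sketch in plain Java using `java.util.HashSet`. The class `StopListDemo`, its word list, and `removeStopWords` are invented for this example and are not the Lucene StopFilter code under discussion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a stop list as a Set, since only membership matters.
final class StopListDemo {

    // Example stop words only; not Lucene's default list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    /** Returns the tokens that are not in the stop set, preserving order. */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

A Map or Hashtable would work too, but every entry would carry a value that nothing ever reads; `contains` on a Set says exactly what the filter needs to know.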

Re: Database

2004-02-26 Thread David Spencer
Parminder Singh wrote: I've a CMS application that deploys metadata to a database. Is it possible to use Lucene to search this database instead of its (Lucene's) index. If you could tell me the steps that would be involved in doing this, it'd be a great help. I'm new to Lucene. I've done this
