How to fire a query?
Hi guys, how do I fire a query for "digital camera" when someone searches for "digital cam"? Do I need to make a manual list of such items and look it up at search time, or is there a better way to do this... -Bhavin Pandya
Re: lucene link database
if you search the archive for database you'll bet a bunch of threads This was a hybrid implementation I did which worked with HSQLDB and Derby: http://www.mail-archive.com/java-user@lucene.apache.org/msg02953.html Cheers Mark - Original Message From: Erick Erickson [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Sunday, 8 October, 2006 8:33:59 PM Subject: Re: lucene link database A quick word of caution about doc IDs. Lucene assigns a document id at index time, but that ID is *not* guaranteed to remain the same for a given document. For instance... you index docs A, B, and C. They get Lucene IDs 1, 2, 3. Then you remove doc B and optimize the index. As I understand it, doc C will get re-assigned ID 2, and ID 3 won't exist. In reality, I don't think that the algorithm is quite as simplistic as that, but that's the idea. So be sure to assign your own unique identifiers that you add to your docs as a field value. Others on this list have talked abouta hybrid solution. That is, have *both* lucene and a database, each doing what they do best. It's more complicated, especially keeping the two in synch. some tools have been mentioned, I think if you search the archive for database you'll bet a bunch of threads. But I thought I'd mention it.. Best of luck Erick On 10/8/06, Cam Bazz [EMAIL PROTECTED] wrote: Dear Erick; Thank you for your detailed insight. I have been trying to code a graph object database for sometime. I have prototyped on relational as well as object oriented databases, including opensource and commercial implementations. (so far, I have tried hibernate, objectivity/db, db4o) while object databases excel in traversing links, they are poor when searching. lucene so far solves the problem of solving. I am thinking of a document as a list of tuples. (sequence of fields) and I can do searches with lucene, it is really nice. now I have to solve the problem of linking. if I keep the nodes with a lucene index, and I can fetch documents with a doc_id, or some sort of surrogate identifier, and use those identifiers as node_id in an object graph, that will be what I want. but in order to do that I need to be able to query the lucene index by document_id. I was referring to the link db of the nutch. They do have some sort of link db implementation, that runs with hadoop, but I have not understood the full code. I am trying to understand the structure of this link database. I was thinking of using documents with src and dst fields, that have document id's as values. (one idea, I will try it tomorrow) Again thanks a bunch. Best Regards, C.B. Erick Erickson wrote: Aproach it in whatever way you want as long as it solves your problem G. My first question is why use lucene? Would a database suit your needs better? Of course, I can't say. Lucene shines at full-text searching, so it's a closer call if you aren't searching on parts of text. By that I mean that if you're not searching on *parts* of your links, you may want to consider a DB solution. That said, and if I understand your requirement, you have a pretty simple design. Each document has two fields, incominglinks and outgoing links. But see the note below. Lucene indexes what you give it, so the fact that some of the links aren't hypertext links is immaterial to Lucene. Since you control both the indexer and searcher, these confrom to whatever your requirements are. It's up to you to map semantics onto these entities. 
One common trap DB-savvy people have is that they think of documents as entries in a table, all with the same fields. There is nothing requiring you to have the *same* fields in each document in an index. You could have an index for which no two documents shared *any* common field if you choose. So, if you want to find out what, say, which documents have link X as an incoming link, just search on incominglinks:X. If you wanted to find the documents that had any incoming links X, Y, Z that matched an outgoing link in another document, just search the OR of these in outgoinglinks. If you want some kind of map of the whole web of links, you'll have to write some iterative loop and keep track. There's nothing built in that I know of that lets you answer Given link X, show me all the documents no more than 3 hops away. Lucene is an *engine*, designed to have apps built on top of it. Lucene doesn't deal with relations between documents, just searching what you've indexed. It's easy enough to store a variable number of links in your incominglinks or outgoinglinks field. Just be sure they're tokenized appropriately. You can add them any way you choose, either concatenate them all into a big string and index that, or index them into the same field, e.g. Document doc = new Document(); doc.add(incoming, link1); doc.add(incoming, link2); . . . writer.add(doc); According to a
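To make Erick's field-per-link suggestion concrete, here is a small self-contained sketch of indexing one node document with a repeated outgoing field and then querying it. The field names, values, and index path are illustrative only, and the store/tokenize options should be adjusted to the real requirements:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class LinkIndexSketch {
  public static void main(String[] args) throws Exception {
    // Index one node document; the same field name can be added more than once.
    IndexWriter writer = new IndexWriter("/tmp/linkindex", new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("nodeId", "A", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("outgoing", "B", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("outgoing", "C", Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.addDocument(doc);
    writer.close();

    // Which documents have B as an outgoing link?
    IndexSearcher searcher = new IndexSearcher("/tmp/linkindex");
    Hits hits = searcher.search(new TermQuery(new Term("outgoing", "B")));
    System.out.println("hits: " + hits.length());
    searcher.close();
  }
}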
Incremental updates / slow searches.
Hi, we are using a search system based on Lucene and have recently tried to add incremental updating of the index instead of building a new index every now and then. However, we now run into problems as our searches start to take a very long time to complete. Our index is about 8-9GB large and we are sending lots of updates per second (we are probably merging in 200-300 in a few seconds). Today we buffer a bunch of updates and then merge them into the existing index as a batch, first doing deletes and then inserts. We are currently not using any special tuning of Lucene. Does anyone have any similar experiences with Lucene, or advice on how to reduce the amount of time it takes to perform a search? In particular, what would be an optimal combination of update size, merge factor, and max buffered docs? /Rickard
Re: Performing a like query
Hi Steve Thanks for your response. I was just wondering whether there is a difference between the regular expression you sent me i.e. (i) \s*(?:\b|(?=\S)(?=\s)|(?=\s)(?=\S))\s* and (ii) \\b as they lead to the same output. For example, the string search testing a-new string=3/4 results in the same output : Item is : Item is : testing Item is : Item is : a Item is : - Item is : new Item is : Item is : string Item is : = Item is : 3 Item is : / Item is : 4 What Id like to do though is remove the split over space characters so that the output is such : Item is : testing Item is : a Item is : - Item is : new Item is : string Item is : = Item is : 3 Item is : / Item is : 4 Im not great at regular expressions so would really appreciate if you could provide me with some insight into expression (i) . Thanks for all your help Rahil Steven Rowe wrote: Hi Rahil, Rahil wrote: I couldnt figure out a valid regular expression to write a valid Pattern.compile(String regex) which can tokenise a string into O/E - visual acuity R-eye=6/24 into O,/,E, -, visual, acuity, R, -, eye, =, 6, /, 24. The following regular expression should match boundaries between word and non-word, or between space and non-space, in either order, and includes contiguous whitespace: \s*(?:\b|(?=\S)(?=\s)|(?=\s)(?=\S))\s* Note that with the above regex, the (%$#!) in some (%$#!) text will be tokenized as a single token. Hope it helps, Steve Erick Erickson wrote: Well, I'm not the greatest expert, but a quick look doesn't show me anything obvious. But I have to ask, wouldn't WhiteSpaceAnalyzer work for you? Although I don't remember whether WhiteSpaceAnalyzer lowercases or not. It sure looks like you're getting reasonable results given how you're tokenizing. If not that, you might want to think about PatternAnalyzer. It's in the memory contribution section, see import org.apache.lucene.index.memory.PatternAnalyzer. One note of caution, the regex identifies what is NOT a token, rather than what is. This threw me for a bit. I still claim that you could break the tokens up like 6, /, 12, and make SpanNearQuery work with a span of 0 (or 1, I don't remember right now), but that may well be more trouble than it's worth, it's up to you of course. What you get out of this is, essentially, is a query that's only satisfied if the terms you specify are right next to each other. So you'd find both your documents in your example, since you would have tokenized 6, /, 12 in, say positions 0, 1, 2 in doc1 and 4, 5, 6 in the second doc. But since they're tokens that are next to each other in each doc, searching with a SpanNearQuery for 6, /, and 12 that are right next to each other, which you specify with a slop of 0 as I remember you should get both. Alternatively, if you tokenize it this way, a PhraseQuery might work as well, Thus, searching for 6 / 12 (as a phrase query and note the spaces) might be just what you want. You'd have to tokenize the query, but that's relatively easy. This is probably much simpler than a SpanNearQuery now that I think about it. Be aware that if you use the *TermEnums we've been talking about, you'll probably wind up wrapping them in a ConstantScoreQuery. And if you have no *other* terms, you won't get any relevancy out of your search. This may be important. Anyway, that's as creative as I can be Sunday night G. Best of luck Erick On 10/1/06, Rahil [EMAIL PROTECTED] wrote: Hi Erick Thanks for your response. There's a lot to chew on in your reply and Im looking at the suggestions you've made. 
Yeah I have Luke installed and have queried my index but there isn't any great explanation Im getting out of it. A query for 6/12 is sent as TERM:6/12 which is quite straight-forward. I did an explanation of the query in my code though and got some more information but that too wasn't of much help either. -- Explanation explain = searcher.explain(query,0); OUTPUT: query: +TERM:6/12 explain.getDescription() : weight(TERM:6/12 in 0), product of: Detail 0 : 0.9994 = queryWeight(TERM:6/12), product of: 2.0986123 = idf(docFreq=1) 0.47650534 = queryNorm Detail 1 : 0.0 = fieldWeight(TERM:6/12 in 0), product of: 0.0 = tf(termFreq(TERM:6/12)=0) 2.0986123 = idf(docFreq=1) 0.5 = fieldNorm(field=TERM, doc=0) Number of results returned: 1 SampleLucene.displayIndexResults SCOREDESCRIPTIONSTATUSCONCEPTIDTERM 1.002602780076/12 (finding) -- My tokeniser called BaseAnalyzer extends Analyzer. Since I wanted to retain all non whitespace characters and not just letters and digits, I introduced the following block of code in the overridden tokenStream( ) -- public TokenStream tokenStream(String fieldName, Reader reader) { return new CharTokenizer(reader) { protected char normalize(char c) { return Character.toLowerCase(c); } protected boolean isTokenChar(char c) {
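For reference, the complete analyzer Rahil describes would look roughly like this; the isTokenChar body is my assumption of how the truncated snippet ends, based on the stated goal of keeping every non-whitespace character:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class BaseAnalyzerSketch extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new CharTokenizer(reader) {
      protected char normalize(char c) {
        return Character.toLowerCase(c);   // lowercase every token character
      }
      protected boolean isTokenChar(char c) {
        return !Character.isWhitespace(c); // keep everything that is not whitespace
      }
    };
  }
}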
Re: highlight optimization
The fastest way to see if opening/closing your searcher is a problem would be to write a tiny little program that opened the index, fired off a few queries and timed each one. The queries can be canned, of course. I'm thinking this is, say, less than 20 lines (including imports). If you're familiar with JUnit, think of it in those terms. Once you've proved that this is a bottleneck you want to work on, you probably need a search server. We've used XmlRpc, which has server code built-in, and it has worked like a charm for us. It'll add a bit of complexity, but it'll keep your searchers open. That said, I suspect that someone will chime in with another solution, since this is already implemented G Erick On 10/9/06, Stelios Eliakis [EMAIL PROTECTED] wrote: Hi, I have a collection of 500 txt documents and I am implementing a web application (JSP) for searching these documents. In addition, the application shows the BestFragment of each result and highlights the query terms. My application is quite slow (about 2.5-3 seconds for each query) even when I run it from my own computer (it's not published yet). Can you suggest anything to improve speed? I have read that you have to keep the IndexSearcher open. Is that right? And how could I do that (when must I close it)? In highlighting I use the following: String result = highlighter.getBestFragment(tokenStream, text) The text parameter must be a String, so I open the document and convert it to a String. Of course it is time consuming. Is there a different way? Thanks in advance, -- Stelios Eliakis
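For illustration, a tiny throwaway timing program along the lines Erick suggests might look like this. The index path and the canned queries are placeholders; the key point is that the IndexSearcher is opened once and reused:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchTimer {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("/path/to/index");   // open once, reuse
    String[] canned = { "lucene", "highlight", "best fragment" };
    for (int i = 0; i < canned.length; i++) {
      long start = System.currentTimeMillis();
      Query q = new QueryParser("contents", new StandardAnalyzer()).parse(canned[i]);
      Hits hits = searcher.search(q);
      System.out.println(canned[i] + ": " + hits.length() + " hits in "
          + (System.currentTimeMillis() - start) + " ms");
    }
    searcher.close();
  }
}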
How to search with empty content
I want to search without giving any input: when I search leaving the search text box blank, it should give me all the documents present in the index. Please give me some solutions or pointers. Regards, Santhosh
Re: Performing a like query
Hi Rahil, Rahil wrote: I was just wondering whether there is a difference between the regular expression you sent me i.e. (i) \s*(?:\b|(?=\S)(?=\s)|(?=\s)(?=\S))\s* and (ii) \\b as they lead to the same output. For example, the string search testing a-new string=3/4 results in the same output : [...] There is a difference for strings like testing a- -new string=3/4 -- with (ii), you will get: ..., a, - -, new, ... but with (i), you will get: ..., a, -, -, new, ... What Id like to do though is remove the split over space characters [...] From my reading of org.apache.lucene.index.memory.PatternAnalyzer (assuming you're using this class), I don't think this is necessary, since it just throws away zero-length tokens. Actually, given the below-discussed algorithm for PatternAnalyzer, I don't think it's even possible to do what you want. Here's the PatternAnalyzer.next() method definition (from http://svn.apache.org/viewvc/lucene/java/trunk/contrib/memory/src/java/org/apache/lucene/index/memory/PatternAnalyzer.java?revision=450725view=markup): public Token next() { if (matcher == null) return null; while (true) { // loop takes care of leading and trailing boundary cases int start = pos; int end; boolean isMatch = matcher.find(); if (isMatch) { end = matcher.start(); pos = matcher.end(); } else { end = str.length(); matcher = null; // we're finished } if (start != end) { // non-empty match (header/trailer) String text = str.substring(start, end); if (toLowerCase) text = text.toLowerCase(locale); return new Token(text, start, end); } if (!isMatch) return null; } } This method finds token breakpoints, remembering the end of the previous breakpoint (in instance field pos), then compares the beginning of the current breakpoint with the end of the previous breakpoint (if (start != end)), creating a Token *only* if the text between breakpoints has longer than zero length. If you're familiar with Perl, this class emulates a Perl regex idiom: (iii) @tokens = grep { length 0 } split /my-regex/, $text; That is, return a list of tokens generated by breaking text on a regex, filtering out zero-length tokens. Actually, the way I usually write this in Perl is: (iv) @tokens = grep { /\S/ } split /my-regex/, $text; In the above version, tokens are kept only if they contain at least one non-space character (this also filters out zero-length tokens). PatternAnalyzer, OTOH, *will* emit whitespace-only tokens - it implements (iii), not (iv). Hope it helps, Steve - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
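As a small aside, here is a plain-JDK sketch of splitting on regex (i) and dropping zero-length tokens the way PatternAnalyzer does. Note that the archived copy of the regex appears to have lost its lookbehind markers; the (?<=...) groups below are an assumed reconstruction of the original expression:

import java.util.regex.Pattern;

public class RegexSplitDemo {
  public static void main(String[] args) {
    // Regex (i): word/non-word boundaries, or space/non-space transitions,
    // with surrounding whitespace consumed by \s* on both sides.
    Pattern boundary =
        Pattern.compile("\\s*(?:\\b|(?<=\\S)(?=\\s)|(?<=\\s)(?=\\S))\\s*");
    String text = "testing a- -new string=3/4";
    String[] parts = boundary.split(text);
    for (int i = 0; i < parts.length; i++) {
      if (parts[i].length() > 0) {     // PatternAnalyzer-style zero-length filtering
        System.out.println("Item is : " + parts[i]);
      }
    }
  }
}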
Re: How to search with empty content
You can get all documents by using MatchAllDocsQuery. Kumar, Samala Santhosh (TPKM) wrote: I want to search without giving any input, when I search leaving blank the search text box it should give me all the documents present in the index. please give me some solution or pointers. regards Santhosh -- Scott - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
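A minimal sketch of Scott's suggestion (the index path and field name are placeholders): if the search box is empty, fall back to MatchAllDocsQuery instead of parsing the input.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;

public class EmptySearchSketch {
  public static void main(String[] args) throws Exception {
    String userInput = args.length > 0 ? args[0] : "";      // empty search box
    Query query;
    if (userInput.trim().length() == 0) {
      query = new MatchAllDocsQuery();                      // match every document
    } else {
      query = new QueryParser("contents", new StandardAnalyzer()).parse(userInput);
    }
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    Hits hits = searcher.search(query);
    System.out.println(hits.length() + " documents found");
    searcher.close();
  }
}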
Re: TermQuery and PhraseQuery..problem with word with space
I am using StandardAnalyzer while indexing the field.. I am also a creatign a field called full_text in which i am adding all these individual fields as TOKENIZED. here is the code while(choiceIt.hasNext()){ PersonProfileAnswer pa=(PersonProfileAnswer)choiceIt.next(); if(pa.getPersonProfileChoice()!=null) { doc.add(new Field(FULL_TEXT, pa.getPersonProfileChoice().getChoice(),Field.Store.NO,Field.Index.TOKENIZED )); LuceneProfileQuestion lpf=this.getLuceneProfileQuestion( pa.getPersonProfileChoice().getPersonProfileQuestion().getId()); doc.add(new Field(lpf.getLuceneFieldName(), pa.getPersonProfileChoice().getChoice(),Field.Store.NO, Field.Index.UN_TOKENIZED)); } } when i use luke i can see the term is there.. e.g. for a lucence field called fav_stores UN_TOKENIZED terms Ann Taylor and Banana Republic are there.. If i make a search on full_text.. and type banana or republic or banana republic i get the doucment as result.. In my java class i am using phrasequery for full_text and termquery for each individual filed.. e.g. TermQuery subjectQuery=new TermQuery(new Term(fav_stores,favStores)); In luke i do not see any option to select query type but when I make search on fav_stores with term Banana Republic there is no result. On 10/9/06, Doron Cohen [EMAIL PROTECTED] wrote: I am trying to index a field which has more than one word with space e.g . My Word i am indexng it UN_TOKENIZED .. but when i use TermQuery to query My Word its not yielding any result.. Seems that it should work. Few things to check: - make sure you are indexing with UN_TOKENIZED. - check that either both field and query text are lower-cased or both are not lower-cased. - use Luke to examine the content of the index (when adding as un-tokenized); print the query (toString); - do they match each other? match your expectation? Is term qurey limited to one word? i mean if we index a word with space and index it UN_TOKENIZED.. shouldnt TermQuery yeild result to My Word. Ismail There is no such limitation. Hope this helps, Doron - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
deleteDocuments being ignored
Hello, I'm brand new to this, so hopefully you can help me. I'm attempting to use the IndexReader object in lucene v2 to delete and readd documents. I very easily set up an index and my documents are added. Now I'm trying to update the same index by deleting the document before readding. The problem is that it appears that my deleteDocument() instruction is being ignored. I've tried using the IndexModifier object and the IndexReader and both have the same behavior. If anyone can point out my error, or help me debug this I'll be forever in your debt. Here is the jist of the code. This is the main section: IndexWriter writer = new IndexWriter(indexDir,new StandardAnalyzer(), false); writer.setUseCompoundFile(false); indexDirectory(writer, dataDir); int numIndexed = writer.docCount(); writer.optimize(); writer.close(); Down at the point just before readding my document I have the following code (i know batch is better, just doing it this for now): IndexReader ir = IndexReader.open(indexDir); System.out.println( + ir.numDocs()); ir.delete(new Term(filename,f.getAbsolutePath())); System.out.println(deletes? + ir.hasDeletions()); ir.close(); if (deleted 0) { System.out.println(deleted old index of + f.getAbsolutePath()); } System.out.println(Indexing + f.getAbsolutePath()); Document doc = new Document(); doc.add(new Field(contents,loadContents (doc),Field.Store.NO,Field.Index.TOKENIZED)); doc.add(new Field(filename, f.getAbsolutePath(),Field.Store.YES,Field.Index.TOKENIZED)); writer.addDocument(doc); Thanks, Chris
Re: deleteDocuments being ignored
My apologies, the IndexReader code I included was a commented-out trial. Here is the active version. Sorry for the error:
IndexReader ir = IndexReader.open(indexDir);
System.out.println("" + ir.numDocs());
int deleted = ir.deleteDocuments(new Term("filename", f.getAbsolutePath()));
System.out.println("deletes? " + ir.hasDeletions());
ir.close();
if (deleted > 0) {
    System.out.println("deleted old index of " + f.getAbsolutePath());
}
Re: deleteDocuments being ignored
System.out.println("Indexing " + f.getAbsolutePath());
Document doc = new Document();
doc.add(new Field("contents", loadContents(doc), Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("filename", f.getAbsolutePath(), Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);
Hi Chris, Do you open the writer to add your update before you close the IndexReader that deletes the outdated index document? Another question that comes to mind: do you open a new IndexReader for your searches after the update was written to the index? You have to follow these steps:
1. add your document
2. close the writer
3. open a reader
4. delete the outdated stuff
5. close the reader
6. open the writer
7. add the update
8. close the writer
9. release a new searcher / reader
hope that gives you a little help. best regards simon
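A rough sketch of that sequence, assuming the filename field is the unique key; the paths, analyzer, and the choice to index the key field untokenized are mine, not necessarily Chris's. The main point is that the reader-based delete finishes before the writer-based add begins:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateSketch {
  public static void update(String indexDir, String path, String contents) throws Exception {
    // 1. open a reader and delete the outdated document(s) by key
    IndexReader reader = IndexReader.open(indexDir);
    int deleted = reader.deleteDocuments(new Term("filename", path));
    reader.close();                                  // 2. close the reader first

    // 3. only now open the writer and add the fresh version
    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
    Document doc = new Document();
    doc.add(new Field("filename", path, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("contents", contents, Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);
    writer.close();                                  // 4. close so a new searcher sees the change

    System.out.println("deleted " + deleted + ", re-added " + path);
  }
}

Indexing the key field UN_TOKENIZED keeps the whole path as a single term, so the Term passed to deleteDocuments matches it exactly; if the path is tokenized at index time, a delete on the full path will silently match nothing.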
Re: TermQuery and PhraseQuery..problem with word with space
OK, when you look in the fav_stores field in Luke, what do you see? And, are you searching on Banana Republic with the capitals? If so, and your index has the letters in lower case, that's your problem. Erick On 10/9/06, Ismail Siddiqui [EMAIL PROTECTED] wrote: I am using StandardAnalyzer while indexing the field.. I am also a creatign a field called full_text in which i am adding all these individual fields as TOKENIZED. here is the code while(choiceIt.hasNext()){ PersonProfileAnswer pa=(PersonProfileAnswer)choiceIt.next(); if(pa.getPersonProfileChoice()!=null) { doc.add(new Field(FULL_TEXT, pa.getPersonProfileChoice().getChoice(),Field.Store.NO, Field.Index.TOKENIZED )); LuceneProfileQuestion lpf=this.getLuceneProfileQuestion( pa.getPersonProfileChoice().getPersonProfileQuestion().getId()); doc.add(new Field(lpf.getLuceneFieldName(), pa.getPersonProfileChoice().getChoice(),Field.Store.NO, Field.Index.UN_TOKENIZED)); } } when i use luke i can see the term is there.. e.g. for a lucence field called fav_stores UN_TOKENIZED terms Ann Taylor and Banana Republic are there.. If i make a search on full_text.. and type banana or republic or banana republic i get the doucment as result.. In my java class i am using phrasequery for full_text and termquery for each individual filed.. e.g. TermQuery subjectQuery=new TermQuery(new Term(fav_stores,favStores)); In luke i do not see any option to select query type but when I make search on fav_stores with term Banana Republic there is no result. On 10/9/06, Doron Cohen [EMAIL PROTECTED] wrote: I am trying to index a field which has more than one word with space e.g . My Word i am indexng it UN_TOKENIZED .. but when i use TermQuery to query My Word its not yielding any result.. Seems that it should work. Few things to check: - make sure you are indexing with UN_TOKENIZED. - check that either both field and query text are lower-cased or both are not lower-cased. - use Luke to examine the content of the index (when adding as un-tokenized); print the query (toString); - do they match each other? match your expectation? Is term qurey limited to one word? i mean if we index a word with space and index it UN_TOKENIZED.. shouldnt TermQuery yeild result to My Word. Ismail There is no such limitation. Hope this helps, Doron - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: TermQuery and PhraseQuery..problem with word with space
I would guess that one of your assumptions is wrong... The assumptions to check are: At indexing: - lpf.getLuceneFieldName() == fav_stores - pa.getPersonProfileChoice().getChoice() == Banana Republic At search: - the query is created like this: new TermQuery(new Term(fav_stores,Banana Republic)) - the searcher is opened after closing the writes that added that doc. Best to check this by writing a tiny stand-alone program that demonstrates this behavior. Ismail Siddiqui [EMAIL PROTECTED] wrote on 09/10/2006 08:59:39: I am using StandardAnalyzer while indexing the field.. I am also a creatign a field called full_text in which i am adding all these individual fields as TOKENIZED. here is the code while(choiceIt.hasNext()){ PersonProfileAnswer pa=(PersonProfileAnswer)choiceIt.next(); if(pa.getPersonProfileChoice()!=null) { doc.add(new Field(FULL_TEXT, pa.getPersonProfileChoice().getChoice(),Field.Store.NO,Field.Index.TOKENIZED )); LuceneProfileQuestion lpf=this.getLuceneProfileQuestion( pa.getPersonProfileChoice().getPersonProfileQuestion().getId()); doc.add(new Field(lpf.getLuceneFieldName(), pa.getPersonProfileChoice().getChoice(),Field.Store.NO, Field.Index.UN_TOKENIZED)); } } when i use luke i can see the term is there.. e.g. for a lucence field called fav_stores UN_TOKENIZED terms Ann Taylor and Banana Republic are there.. If i make a search on full_text.. and type banana or republic or banana republic i get the doucment as result.. In my java class i am using phrasequery for full_text and termquery for each individual filed.. e.g. TermQuery subjectQuery=new TermQuery(new Term(fav_stores,favStores)); In luke i do not see any option to select query type but when I make search on fav_stores with term Banana Republic there is no result. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
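For what it's worth, a tiny stand-alone program of the kind Doron suggests might look like this. The field name and value come from the thread; the RAMDirectory and everything else are placeholders. The untokenized term must match the query term exactly, capitals included:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class UnTokenizedTermTest {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("fav_stores", "Banana Republic", Field.Store.NO, Field.Index.UN_TOKENIZED));
    writer.addDocument(doc);
    writer.close();                                  // searcher is opened after the writer closes

    IndexSearcher searcher = new IndexSearcher(dir);
    Hits hits = searcher.search(new TermQuery(new Term("fav_stores", "Banana Republic")));
    System.out.println("hits: " + hits.length());    // expect 1 if the assumptions hold
    searcher.close();
  }
}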
Re: TermQuery and PhraseQuery..problem with word with space
In fav_stores I see Banana Republic and Ann Taylor there, and I am searching with the capitals. On 10/9/06, Erick Erickson [EMAIL PROTECTED] wrote: OK, when you look in the fav_stores field in Luke, what do you see? And, are you searching on Banana Republic with the capitals? If so, and your index has the letters in lower case, that's your problem. Erick
Re: Incremental updates / slow searches.
The biggest thing would be to limit how often you open a new IndexSearcher, and when you do, warm up the new searcher in the background while you continue serving searches with the existing searcher. This is the strategy that Solr uses. There is also the issue of if you are analyzing/merging docs on the same servers that you are executing searches on. You can use a separate box to build the index and distribute changes to boxes used for searching. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server On 10/9/06, Rickard Bäckman [EMAIL PROTECTED] wrote: Hi, we are using a search system based on Lucene and have recently tried to add incremental updating of the index instead of building a new index every now and then. However we now run into problems as our searches starts to take very long time to complete. Our index is about 8-9GB large and we are sending lots of updates / second (we are probably merging in 200 - 300 in a few seconds). Today we buffer a bunch of updates and then merge them into the existing index like a batch, first doing deletes and then inserts. We are currently not using any special tuning of Lucene. Does anyone have any similiar experiences from Lucene or advices on how to reduce the amount of times it takes to perform a search? In particular what would be an optimal combination of update size, merge factor, max buffered docs? /Rickard - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
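A very rough sketch of the reopen-and-warm pattern Yonik describes. The class name, warm-up query, and swap policy are all illustrative; in particular, closing the old searcher is only safe once no request is still using it:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;

public class SearcherManagerSketch {
  private volatile IndexSearcher current;            // searcher used by query threads
  private final String indexDir;

  public SearcherManagerSketch(String indexDir) throws Exception {
    this.indexDir = indexDir;
    this.current = new IndexSearcher(indexDir);
  }

  public IndexSearcher getSearcher() {
    return current;                                  // readers always get the warmed searcher
  }

  // Call this periodically (not once per update) after the writer has committed changes.
  public void reopen() throws Exception {
    IndexSearcher fresh = new IndexSearcher(indexDir);
    // warm up the new searcher with a few typical queries before exposing it
    fresh.search(new QueryParser("contents", new StandardAnalyzer()).parse("warmup"));
    IndexSearcher old = current;
    current = fresh;                                 // swap: new requests use the warm searcher
    old.close();                                     // NOTE: unsafe if requests still hold the old one
  }
}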
Re: threadsafe QueryParser?
On 10/9/06, Stanislav Jordanov [EMAIL PROTECTED] wrote: The method static public Query parse(String query, String field, Analyzer analyzer) in class QueryParser is deprecated in 1.9.1 and the suggestion is: Use an instance of QueryParser and the parse(String) method instead. My question is: in the context of a multi-threaded app, is it safe for distinct threads to use the same instance of QueryParser for parsing their queries? ps. After writing this letter, I incidentally ran into the answer at the end of the class comment of QueryParser: Note that QueryParser is not thread-safe. So, is this it? Yes. A single QueryParser object should not be used from multiple threads. It's unclear why one would want to do so anyway. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
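Put differently, creating a parser per parse is cheap, so something like this (field name and analyzer are placeholders) avoids any sharing between threads:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PerThreadParsing {
  // Each thread calls this with its own query string; no QueryParser instance is shared.
  public static Query parseQuery(String userQuery) throws Exception {
    QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
    return parser.parse(userQuery);
  }
}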
Re: Lucene searching algorithm
Hi Michael, I think there are a number of good resources on this: 1. http://lucene.apache.org/java/scoring.html covers the basics of searching. The bottom has some pseudo code as well. 2. Lucene In Action 3. Search this list and other places for information on the Vector Space Model. The Wiki also has a number of links, etc. that may prove useful, including a variety of talks and articles. 4. Last of all, and probably best of all, the code! Have a look at how TermQuery and BooleanQuery work, as well as the Searchers, etc. Hope this helps, Grant On Oct 8, 2006, at 6:57 AM, Michael Chan wrote: Hi, Does anyone know where I can find descriptions of Lucene's searching algorithm, besides the lecture at University of Pisa 2004? Has it been published? I'm trying to find a reference to the algorithm. Thanks, Michael - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Grant Ingersoll Sr. Software Engineer Center for Natural Language Processing Syracuse University 335 Hinds Hall Syracuse, NY 13244 http://www.cnlp.org - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: wildcard and span queries
OK, I'm using the surround code, and it seems to be working...with the following questions (always, more questions)... I'm gettng an exception sometimes of TooManyBasicQueries. I can control this by initializing BasicQueryFactory with a larger number. Do you have any cautions about upping this number? There's a hard-coded value minimumPrefixLength set to 3 down in the code Surround query parser (allowedSuffix). I see no method to change this. I assume that this is to prevent using up too much memory/time. What should I know about this value? I'm mostly interested in a justification for the product manager why allowing, say, two character (or one character) prefixes is a bad idea G. I'm a bit confused. It appears that TooManyBooleanClauses is orthogonal to Surround queries. That is, trying RegexSpanQuery doesn't want to work at all with the same search clause, as it runs out of memory pretty quickly.. However, working with three-letter prefixes is blazingly fast. Thanks again... Erick On 10/6/06, Paul Elschot [EMAIL PROTECTED] wrote: Mark, On Friday 06 October 2006 22:46, Mark Miller wrote: Paul's parser is beyond my feeble comprehension...but I would start by looking at SrndTruncQuery. It looks to me like this enumerates each possible match just like a SpanRegexQuery does...I am too lazy to figure out what the visitor pattern is doing so I don't know if they then get added to a boolean query, but I don't know what else would happen. If They can also be added to a SpanOrQuery as SpanTermQuery, this depends on the context of the query (distance query or not). The visitor pattern is used to have the same code for distance queries and other queries as far as possible. this is the case, I am wondering if it is any more efficient than the SpanRegex implementation...which could be changed to a SpanWildcard I don't think the surround implementation of expanding terms is more efficient that the Lucene implementation. Surround does have the functionality of a SpanWildCard, but the implementation of the expansion is shared, see above. implementation. How exactly is this better at avoiding a toomanyclauses exception or ram fillup. Is it just the fact that the (lets say) three wildcard terms are anded so this should dramatically reduce the matches? The limitation in BasicQueryFactory works for a complete surround query, which can be nested. In Lucene only the max nr of clauses for a single level BooleanQuery can be controlled. ... Regards, Paul Elschot - Mark Erick Erickson wrote: Paul: Splendid! Now if I just understood a single thing about the SrndQuery family G. I followed your link, and took a look at the text file. That should give me enough to get started. But if you wanted to e-mail me any sample code or long explanations of what this all does, I would forever be your lackey G I should also fairly easily be able to run a few of these against the partial index I already have to get some sense of now it'll all work out in my problem space. I suspect that the actual number of distinct terms won't grow too much after the first 4,000 books, so it'll probably be pretty safe to get this running in the worst case, find out if/where things blow up, and put in some safeguards. Or perhaps discover that it's completely and entirely perfect G. Thanks again Erick On 10/6/06, Paul Elschot [EMAIL PROTECTED] wrote: On Friday 06 October 2006 14:37, Erick Erickson wrote: ... Fortunately, the PM agrees that it's silly to think about span queries involving OR or NOT for this app. 
So I'm left with something like Jo*n AND sm*th AND jon?es WITHIN 6. OR works much the same as term expansion for wildcards. The only approach that's occurred to me is to create a filter on for the terms, giving me a subset of my docs that have any terms satisfying the above. For each doc in the filter, get creative with TermPositionVector for determining whether the document matches. It seems that this would involve creating a list of all positions in each doc in my filter that match jo*n, another for sm*th, and another for jon?es and seeing if the distance (however I define that) between any triple of terms (one from each list) is less than 6. My gut feel is that this explodes time-wise based upon the number of terms that match. In this particular application, we are indexing 20K books. Based on indexing 4K of them, this amounts to about a 4G index (although I acutally expect this to be somewhat larger since I haven't indexed all the fields, just the text so far). I can't imagine that comparing the expanded terms for, say, 10,000 docs will be fast. I'm putting together an experiment to test this though. But someone could save me a lot of work by telling me that this is solved already. This is your chance G.. It's solved :) here:
Re: wildcard and span queries
OK, forget the stuff about TooManyBooleanClauses. I finally figured out that if I specify the surround to have the same semantics as a SpanRegex ( i.e, and(eri*, mal*)) it blows up with TooManyBooleanClauses. So that makes more sense to me now. Specifying 20w(eri*, mal*) is what I was using before. Erick On 10/9/06, Erick Erickson [EMAIL PROTECTED] wrote: OK, I'm using the surround code, and it seems to be working...with the following questions (always, more questions)... I'm gettng an exception sometimes of TooManyBasicQueries. I can control this by initializing BasicQueryFactory with a larger number. Do you have any cautions about upping this number? There's a hard-coded value minimumPrefixLength set to 3 down in the code Surround query parser (allowedSuffix). I see no method to change this. I assume that this is to prevent using up too much memory/time. What should I know about this value? I'm mostly interested in a justification for the product manager why allowing, say, two character (or one character) prefixes is a bad idea G. I'm a bit confused. It appears that TooManyBooleanClauses is orthogonal to Surround queries. That is, trying RegexSpanQuery doesn't want to work at all with the same search clause, as it runs out of memory pretty quickly.. However, working with three-letter prefixes is blazingly fast. Thanks again... Erick On 10/6/06, Paul Elschot [EMAIL PROTECTED] wrote: Mark, On Friday 06 October 2006 22:46, Mark Miller wrote: Paul's parser is beyond my feeble comprehension...but I would start by looking at SrndTruncQuery. It looks to me like this enumerates each possible match just like a SpanRegexQuery does...I am too lazy to figure out what the visitor pattern is doing so I don't know if they then get added to a boolean query, but I don't know what else would happen. If They can also be added to a SpanOrQuery as SpanTermQuery, this depends on the context of the query (distance query or not). The visitor pattern is used to have the same code for distance queries and other queries as far as possible. this is the case, I am wondering if it is any more efficient than the SpanRegex implementation...which could be changed to a SpanWildcard I don't think the surround implementation of expanding terms is more efficient that the Lucene implementation. Surround does have the functionality of a SpanWildCard, but the implementation of the expansion is shared, see above. implementation. How exactly is this better at avoiding a toomanyclauses exception or ram fillup. Is it just the fact that the (lets say) three wildcard terms are anded so this should dramatically reduce the matches? The limitation in BasicQueryFactory works for a complete surround query, which can be nested. In Lucene only the max nr of clauses for a single level BooleanQuery can be controlled. ... Regards, Paul Elschot - Mark Erick Erickson wrote: Paul: Splendid! Now if I just understood a single thing about the SrndQuery family G. I followed your link, and took a look at the text file. That should give me enough to get started. But if you wanted to e-mail me any sample code or long explanations of what this all does, I would forever be your lackey G I should also fairly easily be able to run a few of these against the partial index I already have to get some sense of now it'll all work out in my problem space. 
I suspect that the actual number of distinct terms won't grow too much after the first 4,000 books, so it'll probably be pretty safe to get this running in the worst case, find out if/where things blow up, and put in some safeguards. Or perhaps discover that it's completely and entirely perfect G. Thanks again Erick On 10/6/06, Paul Elschot [EMAIL PROTECTED] wrote: On Friday 06 October 2006 14:37, Erick Erickson wrote: ... Fortunately, the PM agrees that it's silly to think about span queries involving OR or NOT for this app. So I'm left with something like Jo*n AND sm*th AND jon?es WITHIN 6. OR works much the same as term expansion for wildcards. The only approach that's occurred to me is to create a filter on for the terms, giving me a subset of my docs that have any terms satisfying the above. For each doc in the filter, get creative with TermPositionVector for determining whether the document matches. It seems that this would involve creating a list of all positions in each doc in my filter that match jo*n, another for sm*th, and another for jon?es and seeing if the distance (however I define that) between any triple of terms (one from each list) is less than 6. My gut feel is that this explodes time-wise based upon the number of terms that match. In this particular application, we are indexing 20K books. Based on
Re: wildcard and span queries
Erick, On Monday 09 October 2006 21:20, Erick Erickson wrote: OK, forget the stuff about TooManyBooleanClauses. I finally figured out that if I specify the surround to have the same semantics as a SpanRegex ( i.e, and(eri*, mal*)) it blows up with TooManyBooleanClauses. So that makes more sense to me now. Specifying 20w(eri*, mal*) is what I was using before. Erick On 10/9/06, Erick Erickson [EMAIL PROTECTED] wrote: OK, I'm using the surround code, and it seems to be working...with the following questions (always, more questions)... I'm gettng an exception sometimes of TooManyBasicQueries. I can control this by initializing BasicQueryFactory with a larger number. Do you have any cautions about upping this number? There's a hard-coded value minimumPrefixLength set to 3 down in the code Surround query parser (allowedSuffix). I see no method to change this. I assume that this is to prevent using up too much memory/time. What should I know about this value? I'm mostly interested in a justification for the product manager why allowing, say, two character (or one character) prefixes is a bad idea G. Once BasicQueryFactory has a satisfactory limitation, that is one that a user can understand when the exception for too many basic queries is thrown, there is no need to keep this minimim prefix length at 3, 1 or 2 will also do. When using many thousands as the max. basic queries, the term expansion itself might take some time to reach that maximum. You might want to ask the PM for a reasonable query involving such short prefixes, though. In most western languages, they do not make much sense. I'm a bit confused. It appears that TooManyBooleanClauses is orthogonal to Surround queries. That is, trying RegexSpanQuery doesn't want to work at all with the same search clause, as it runs out of memory pretty quickly.. However, working with three-letter prefixes is blazingly fast. Your index is probably not very large (yet). Make sure to reevaluate the max. number of basic queries as it grows. Did you try nesting like this: 20d( 4w(lucene, action), 5d(hatch*, gospod*)) ? Could you tell a bit more about the target grammar? Regards, Paul Elschot Thanks again... Erick On 10/6/06, Paul Elschot [EMAIL PROTECTED] wrote: Mark, On Friday 06 October 2006 22:46, Mark Miller wrote: Paul's parser is beyond my feeble comprehension...but I would start by looking at SrndTruncQuery. It looks to me like this enumerates each possible match just like a SpanRegexQuery does...I am too lazy to figure out what the visitor pattern is doing so I don't know if they then get added to a boolean query, but I don't know what else would happen. If They can also be added to a SpanOrQuery as SpanTermQuery, this depends on the context of the query (distance query or not). The visitor pattern is used to have the same code for distance queries and other queries as far as possible. this is the case, I am wondering if it is any more efficient than the SpanRegex implementation...which could be changed to a SpanWildcard I don't think the surround implementation of expanding terms is more efficient that the Lucene implementation. Surround does have the functionality of a SpanWildCard, but the implementation of the expansion is shared, see above. implementation. How exactly is this better at avoiding a toomanyclauses exception or ram fillup. Is it just the fact that the (lets say) three wildcard terms are anded so this should dramatically reduce the matches? 
The limitation in BasicQueryFactory works for a complete surround query, which can be nested. In Lucene only the max nr of clauses for a single level BooleanQuery can be controlled. ... Regards, Paul Elschot - Mark Erick Erickson wrote: Paul: Splendid! Now if I just understood a single thing about the SrndQuery family G. I followed your link, and took a look at the text file. That should give me enough to get started. But if you wanted to e-mail me any sample code or long explanations of what this all does, I would forever be your lackey G I should also fairly easily be able to run a few of these against the partial index I already have to get some sense of now it'll all work out in my problem space. I suspect that the actual number of distinct terms won't grow too much after the first 4,000 books, so it'll probably be pretty safe to get this running in the worst case, find out if/where things blow up, and put in some safeguards. Or perhaps discover that it's completely and entirely perfect G. Thanks again Erick On 10/6/06, Paul Elschot [EMAIL PROTECTED] wrote: On Friday 06 October 2006
Re: wildcard and span queries
I've already started that conversation with the PM, I'm just trying to get a better idea of what's possible. I'll whimper tooth and nail to keep from having to do a lot of work to add a feature to a product that nobody in their right mind would ever use G. As far as the grammar, we don't actually have one yet. That's part of what this exploration is all about. The kicker is that what we are indexing is OCR data, some of which is pretty trashy. So you wind up with interesting words in your index, things like rtyHrS. So the whole question of allowing very specific queries on detailed wildcards (combined with spans) is under discussion. It's not at all clear to me that there's any value to the end users in the capability of, say, two character prefixes. And, it's an easy rule that prefix queries must specify at least 3 non-wildcard characters Thanks for your advice. You're quite correct that the index isn't very large yet. My task tonight is to index about 4K books. I suspect that the number of terms won't increase dramatically after that many books, but that's an assumption on my part. Thanks again Erick On 10/9/06, Paul Elschot [EMAIL PROTECTED] wrote: Erick, On Monday 09 October 2006 21:20, Erick Erickson wrote: OK, forget the stuff about TooManyBooleanClauses. I finally figured out that if I specify the surround to have the same semantics as a SpanRegex ( i.e, and(eri*, mal*)) it blows up with TooManyBooleanClauses. So that makes more sense to me now. Specifying 20w(eri*, mal*) is what I was using before. Erick On 10/9/06, Erick Erickson [EMAIL PROTECTED] wrote: OK, I'm using the surround code, and it seems to be working...with the following questions (always, more questions)... I'm gettng an exception sometimes of TooManyBasicQueries. I can control this by initializing BasicQueryFactory with a larger number. Do you have any cautions about upping this number? There's a hard-coded value minimumPrefixLength set to 3 down in the code Surround query parser (allowedSuffix). I see no method to change this. I assume that this is to prevent using up too much memory/time. What should I know about this value? I'm mostly interested in a justification for the product manager why allowing, say, two character (or one character) prefixes is a bad idea G. Once BasicQueryFactory has a satisfactory limitation, that is one that a user can understand when the exception for too many basic queries is thrown, there is no need to keep this minimim prefix length at 3, 1 or 2 will also do. When using many thousands as the max. basic queries, the term expansion itself might take some time to reach that maximum. You might want to ask the PM for a reasonable query involving such short prefixes, though. In most western languages, they do not make much sense. I'm a bit confused. It appears that TooManyBooleanClauses is orthogonal to Surround queries. That is, trying RegexSpanQuery doesn't want to work at all with the same search clause, as it runs out of memory pretty quickly.. However, working with three-letter prefixes is blazingly fast. Your index is probably not very large (yet). Make sure to reevaluate the max. number of basic queries as it grows. Did you try nesting like this: 20d( 4w(lucene, action), 5d(hatch*, gospod*)) ? Could you tell a bit more about the target grammar? Regards, Paul Elschot Thanks again... 
Erick On 10/6/06, Paul Elschot [EMAIL PROTECTED] wrote: Mark, On Friday 06 October 2006 22:46, Mark Miller wrote: Paul's parser is beyond my feeble comprehension...but I would start by looking at SrndTruncQuery. It looks to me like this enumerates each possible match just like a SpanRegexQuery does...I am too lazy to figure out what the visitor pattern is doing so I don't know if they then get added to a boolean query, but I don't know what else would happen. If They can also be added to a SpanOrQuery as SpanTermQuery, this depends on the context of the query (distance query or not). The visitor pattern is used to have the same code for distance queries and other queries as far as possible. this is the case, I am wondering if it is any more efficient than the SpanRegex implementation...which could be changed to a SpanWildcard I don't think the surround implementation of expanding terms is more efficient that the Lucene implementation. Surround does have the functionality of a SpanWildCard, but the implementation of the expansion is shared, see above. implementation. How exactly is this better at avoiding a toomanyclauses exception or ram fillup. Is it just the fact that the (lets say) three wildcard terms are anded so this should dramatically reduce the matches? The limitation in BasicQueryFactory works for a complete surround query, which can be nested. In Lucene only the max nr of clauses for a single level
Re: wildcard and span queries
Erick Erickson [EMAIL PROTECTED] wrote on 09/10/2006 13:09:21: ... The kicker is that what we are indexing is OCR data, some of which is pretty trashy. So you wind up with interesting words in your index, things like rtyHrS. So the whole question of allowing very specific queries on detailed wildcards (combined with spans) is under discussion. It's not at all clear to me that there's any value to the end users in the capability of, say, two character prefixes. And, it's an easy rule that prefix queries must specify at least 3 non-wildcard characters Erick, I may be out of course here, but, fwiw, have you considered n-gram indexing/search for a degree of fuzziness to compensate for OCR errors..? For a four words query you would probably get ~20 tokens (bigrams?) - no matter what the index size is. You would then probably want to score higher by LA (lexical affinity - query terms appear close to each other in the document) - and I am not sure to what degree a span query (made of n-gram terms) would serve that, because (1) all terms in the span need to be there (well, I think:-); and, (2) you would like to increase doc score for close-by terms only for close-by query n-grams. So there might not be a ready to use solution in Lucene for this, but perhaps this is a more robust direction to try than the wild card approach - I mean, if users want to type a wild card query, it is their right to do so, but for an application logic this does not seem the best choice. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
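To make the n-gram idea a bit more concrete, here is a tiny plain-Java helper that breaks a word into character bigrams. This only illustrates the tokenization side; as Doron notes, the scoring side may not have a ready-to-use solution in Lucene:

import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
  // e.g. "lucene" -> [lu, uc, ce, en, ne]
  public static List bigrams(String word) {
    List grams = new ArrayList();
    for (int i = 0; i + 2 <= word.length(); i++) {
      grams.add(word.substring(i, i + 2));
    }
    return grams;
  }
}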
Re: FieldSelectorResult instance descriptions?
: If you read the entire source as I did, I becomes clear ! :) : The interesting code is in FieldsReader. Not neccessarily. There can be differneces between how constants are used and how they are suppose to be used (depending on wether or not the code using them has any bugs in it) : NO_LOAD : skip the field, it's value won't be available Should the client expecation for NO_LOAD fileds be that the Document.getField/getFieldable will return will null, and that the List returned by getFields() will not contain anything for these fields, or should clients assume there may be an empty Fieldable object returned by any of these methods (or included in the list) : LAZY_LOAD : do not load the field value, but if you request it later, it will : be loaded on request. Note that it can be lazy-loaded only if the reader is : still opened. What should clicents expected to happen if the reader has already been closed? : LOAD_FOR_MERGE : internal use when merging segments: it avoids uncompressing : and recompressing data, the data is merged binarily. this seems like a second-class citizen then correct? not intende for client code to use in their FieldSelector ? ... so what if the do use it? ... can they expect the data n the Field object to be uncompressed on the fly if they attempt to access it later? -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: QueryParser syntax French Operator
Hi, I was thinking of something along those lines. Last week, I was able to take time to understand the JavaCC syntax and possiblities. I have some cleaning up, testing and documentation to do, but basically, I was able to expand the AND / OR / NOT patterns at runtime using the ResourceBundle paradigm. I'll keep you posted. Patrick -Message d'origine- De : karl wettin [mailto:[EMAIL PROTECTED] Envoyé : 8 octobre, 2006 10:14 À : java-user@lucene.apache.org Objet : Re: QueryParser syntax French Operator On 10/8/06, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi Patrick, If I were trying to do this, I'd modify QueryParser.jj to construct the grammar for boolean operators based on something like Locale (or LANG env. variable?). I'd try adding code a la: en_AND = AND en_OR = OR en_NOT = NOT fr_AND = ET fr_OR = OU fr_NOT = SAUF And then: if (locale is 'fr') // construct the grammar with fr_* ... Something like that. It is a good thought, but as number of locales grows with similar languages you'll get deterministic errors in the lexer. So I would absolutely recommend one grammar file per language. Not sure if JavaCC allows inheritance, but with ANTlr this would be a very simple and effective way to solve the problem. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: FieldSelectorResult instance descriptions?
See http://www.gossamer-threads.com/lists/lucene/java-dev/33964? search_string=Lazy%20Field%20Loading;#33964 for the discussion on Java Dev from wayback if you want more background info. To some extent, I still think Lazy Fields are in the early adopter stage, since they haven't officially been released, so these questions are good for vetting them. And there is still the question of how to handle Document.getField() versus Document.getFieldable ()... but that is a discussion for the dev list. See below for more... HTH, Grant On Oct 9, 2006, at 5:22 PM, Chris Hostetter wrote: : If you read the entire source as I did, I becomes clear ! :) : The interesting code is in FieldsReader. Not neccessarily. There can be differneces between how constants are used and how they are suppose to be used (depending on wether or not the code using them has any bugs in it) I will put some javadocs on these (or if someone wants to add a patch...) : NO_LOAD : skip the field, it's value won't be available Should the client expecation for NO_LOAD fileds be that the Document.getField/getFieldable will return will null, and that the List returned by getFields() will not contain anything for these fields, or should clients assume there may be an empty Fieldable object returned by any of these methods (or included in the list) My understanding is in the NO_LOAD case, doc.add(Field) is not called, so Document.getField() will return null. Again, I will try to get some javadocs on this part. : LAZY_LOAD : do not load the field value, but if you request it later, it will : be loaded on request. Note that it can be lazy-loaded only if the reader is : still opened. What should clicents expected to happen if the reader has already been closed? Search the dev list for Semantics of a closed IndexInput for some discussion on this between Doug and I. Unfortunately, the answer isn't all that satisfying, since it is undefined. I would prefer better treatment than that, but it isn't obvious. I originally thought there would be an exception to catch or something (in fact, my original test cases had expected it to be handled), but ended up putting the handling on the application, since the app should know when it has been closed. : LOAD_FOR_MERGE : internal use when merging segments: it avoids uncompressing : and recompressing data, the data is merged binarily. this seems like a second-class citizen then correct? not intende for client code to use in their FieldSelector ? ... so what if the do use it? ... can they expect the data n the Field object to be uncompressed on the fly if they attempt to access it later? I would agree it is a second-class citizen, but maybe Otis can add his thoughts, as I think he added this feature. I am unsure of the results of using it outside of the merge scope. -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] -- Grant Ingersoll Sr. Software Engineer Center for Natural Language Processing Syracuse University 335 Hinds Hall Syracuse, NY 13244 http://www.cnlp.org - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
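For context, a sketch of how a FieldSelector is used, based on the trunk API being discussed here (which had not been released at the time, so treat the exact names as an assumption; the field names are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.index.IndexReader;

public class LazyLoadSketch {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("/path/to/index");
    FieldSelector selector = new FieldSelector() {
      public FieldSelectorResult accept(String fieldName) {
        if ("title".equals(fieldName)) return FieldSelectorResult.LOAD;      // load eagerly
        if ("body".equals(fieldName))  return FieldSelectorResult.LAZY_LOAD; // load on request
        return FieldSelectorResult.NO_LOAD;                                  // skip everything else
      }
    };
    Document doc = reader.document(0, selector);
    // NO_LOAD fields were never added to the Document, so getField()/getFieldable() return null.
    // The lazy field is only materialized here, and only while the reader is still open.
    String body = doc.getFieldable("body").stringValue();
    System.out.println(body);
    reader.close();
  }
}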
Re: wildcard and span queries
Doron: Thanks for the suggestion, I'll certainly put it on my list, depending upon what the PM decides. This app is geneaology reasearch, and users *can* put in their own wildcards... This is why I love this list... lots of smart people giving me suggestions I never would have thought of G... Thanks Erick On 10/9/06, Doron Cohen [EMAIL PROTECTED] wrote: Erick Erickson [EMAIL PROTECTED] wrote on 09/10/2006 13:09:21: ... The kicker is that what we are indexing is OCR data, some of which is pretty trashy. So you wind up with interesting words in your index, things like rtyHrS. So the whole question of allowing very specific queries on detailed wildcards (combined with spans) is under discussion. It's not at all clear to me that there's any value to the end users in the capability of, say, two character prefixes. And, it's an easy rule that prefix queries must specify at least 3 non-wildcard characters Erick, I may be out of course here, but, fwiw, have you considered n-gram indexing/search for a degree of fuzziness to compensate for OCR errors..? For a four words query you would probably get ~20 tokens (bigrams?) - no matter what the index size is. You would then probably want to score higher by LA (lexical affinity - query terms appear close to each other in the document) - and I am not sure to what degree a span query (made of n-gram terms) would serve that, because (1) all terms in the span need to be there (well, I think:-); and, (2) you would like to increase doc score for close-by terms only for close-by query n-grams. So there might not be a ready to use solution in Lucene for this, but perhaps this is a more robust direction to try than the wild card approach - I mean, if users want to type a wild card query, it is their right to do so, but for an application logic this does not seem the best choice. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Incremental updates / slow searches.
don't forget to optimize your index every now and then as well... deleting a document just marks it as deleted it still gets inspectected by every query during scoring at least once to see that it can skip it, optimizing is the only thing that truely removes the deleted documents. : Date: Mon, 9 Oct 2006 13:49:34 -0400 : From: Yonik Seeley [EMAIL PROTECTED] : Reply-To: java-user@lucene.apache.org : To: java-user@lucene.apache.org : Subject: Re: Incremental updates / slow searches. : : The biggest thing would be to limit how often you open a new : IndexSearcher, and when you do, warm up the new searcher in the : background while you continue serving searches with the existing : searcher. This is the strategy that Solr uses. : : There is also the issue of if you are analyzing/merging docs on the : same servers that you are executing searches on. You can use a : separate box to build the index and distribute changes to boxes used : for searching. : : -Yonik : http://incubator.apache.org/solr Solr, the open-source Lucene search server : : On 10/9/06, Rickard Bäckman [EMAIL PROTECTED] wrote: : Hi, : : we are using a search system based on Lucene and have recently tried to add : incremental updating of the index instead of building a new index every now : and then. However we now run into problems as our searches starts to take : very long time to complete. : : Our index is about 8-9GB large and we are sending lots of updates / second : (we are probably merging in 200 - 300 in a few seconds). Today we buffer a : bunch of updates and then merge them into the existing index like a batch, : first doing deletes and then inserts. : : We are currently not using any special tuning of Lucene. : : Does anyone have any similiar experiences from Lucene or advices on how to : reduce the amount of times it takes to perform a search? In particular what : would be an optimal combination of update size, merge factor, max buffered : docs? : : /Rickard : : : : - : To unsubscribe, e-mail: [EMAIL PROTECTED] : For additional commands, e-mail: [EMAIL PROTECTED] : -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Incremental updates / slow searches.
On 10/9/06, Chris Hostetter [EMAIL PROTECTED] wrote: don't forget to optimize your index every now and then as well... deleting a document just marks it as deleted; it still gets inspected by every query during scoring at least once to see that it can skip it; optimizing is the only thing that truly removes the deleted documents. I'd refine that statement to optimizing is the easiest way to remove any deleted documents that still exist in the index. Deleted documents are removed from segments that are merged, so it depends on things like the mergeFactor, maxBufferedDocs, and where the deleted docs are in the index (in the smallest or largest segments). Some deleted docs will be removed quickly, but some won't. Optimizing an index also has a beneficial effect on search speed even beyond removing all of the deleted docs. Each index segment is actually a complete index on its own... so if search is generally O(log(N)), searching across M segments of size N will take O(M * log(N)). If those segments are optimized into a single segment, the search will be O(log(M*N)). -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
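A small sketch of the knobs mentioned in this thread; the values are only illustrative starting points, not recommendations:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class BatchTuningSketch {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
    writer.setMergeFactor(10);       // how many segments accumulate before they are merged
    writer.setMaxBufferedDocs(100);  // docs buffered in RAM before a new segment is flushed
    // ... writer.addDocument(...) calls for the batch of updates go here ...
    writer.optimize();               // merge down to one segment; also drops deleted docs
    writer.close();
  }
}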