can we do partial optimization?
Hello, I am very new to Lucene and I am facing one problem.

I have one very large index which is constantly getting updated (adds and deletes) at a regular interval, after which I am optimizing the whole index (otherwise searches will be slow), but optimization takes time. So I was thinking to merge only the segments of lesser size (I guess it will be a good compromise between search time and optimization time). For example, suppose I have 10 segments:

1 of 10,000,000 docs
4 of 100,000 docs
4 of 10,000 docs
and 1 of 5 docs.

I want to merge the 9 segments of lesser size into one (I believe this would not take much time and searching will improve a lot). But I don't know how to do partial merging. Does Lucene allow it or not? Or can I extend IndexWriter and add an optimize method of my own where I can specify which cfs files to choose for optimization?

Thanks and Regards,
Nizam
Re: can we do partial optimization?
The current trunk of Lucene (unreleased 2.3-dev) has a new method on IndexWriter: optimize(int maxNumSegments). This method should do what you want: you tell it how many segments to optimize down to, and it will try to pick the least-cost merges to get the index to that point. It's very new (only committed a few days ago), plus the trunk may have bugs, so tread carefully!

If that doesn't seem to do the right merges for your index, it's also very simple to create your own MergePolicy. You can subclass the default LogByteSizeMergePolicy and override the "findMergesForOptimize" method. This feature (a separate MergePolicy) is also only available in 2.3-dev (trunk).

Mike
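A minimal sketch of the new call, assuming a 2.3-dev (trunk) build; the index path and analyzer are placeholders, not from the original message:

    // Assumes Lucene 2.3-dev; path and analyzer are placeholders.
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer());
    // Merge down to at most 2 segments rather than a full optimize to 1.
    writer.optimize(2);
    writer.close();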
FieldCache Implementations
Does anyone out there using Lucene implement their own version of FieldCache.java? We are proposing to make it an abstract class, which violates our general rule about back-compatibility (see https://issues.apache.org/jira/browse/LUCENE-1045).

-Grant

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: Applying SpellChecker to a phrase
Have you actually tried this and done a query.toString() to see how this is actually expanded? Not that I'm all that familiar with SpellChecker, but before presuming how things work, you would get answers faster if you ran a test.

And, why do you care about performance? I know that's a silly question, but you haven't supplied any parameters about your index and usage to give us a clue whether this matters. If your index is 3M, you'll never see the difference between the two ways of expanding the query. If your index is distributed over 10 machines and is 1T, you really, really, really care.

And under any circumstances, you can always generate your own query of the second form with a bit of pre-processing.

More info please.

Best
Erick

On Dec 2, 2007 10:14 PM, smokey <[EMAIL PROTECTED]> wrote:
> Suppose I have an index containing the terms impostor, imposter, fraud,
> and fruad, then presumably regardless of whether I spell impostor and
> fraud correctly, Lucene SpellChecker will offer the improperly spelled
> versions as corrections. This means that the phrase "The login fraud
> involves an impostor" would need to expand to:
>
> "The login fraud involves an impostor" OR "The login fruad involves an
> impostor" OR "The login fraud involves an imposter" OR "The login fruad
> involves an imposter" to cover all cases and thus find all possible
> matches.
>
> However, that feels like an awful lot of matches to perform on the
> index. A more efficient approach would be to expand the query to "The
> login (fraud OR fruad) involves an (impostor OR imposter)", which should
> be logically equivalent to the first (longer) query.
>
> So my question is
> (1) if others have generated the "The login (fraud OR fruad) involves an
> (impostor OR imposter)" types of queries when applying SpellChecker to a
> phrase, and agree that this indeed performs better than the first one.
> (2) if others have observed any problems in doing so in terms of
> performance or anything else
>
> Any information would be appreciated.
Re: BooleanQuery TooManyClauses in wildcard search
First time I tried this I made it WAY more complex than it is.

WARNING: this is from an older code base so you may have to tweak it. Might be 1.9 code.

    import java.io.IOException;
    import java.util.BitSet;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.WildcardTermEnum;

    public class WildcardTermFilter extends Filter {
        private static final long serialVersionUID = 1L;

        private String field;
        private String value;

        public WildcardTermFilter(String field, String value) {
            this.field = field;
            this.value = value;
        }

        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            TermDocs termDocs = reader.termDocs();
            WildcardTermEnum wildEnum =
                new WildcardTermEnum(reader, new Term(field, value));
            try {
                // Walk every term matching the wildcard pattern and
                // mark all documents containing that term.
                for (Term term = null; (term = wildEnum.term()) != null; wildEnum.next()) {
                    termDocs.seek(term);
                    while (termDocs.next()) {
                        bits.set(termDocs.doc());
                    }
                }
            } finally {
                termDocs.close();
                wildEnum.close();
            }
            return bits;
        }
    }

On Dec 2, 2007 8:34 AM, Ruchi Thakur <[EMAIL PROTECTED]> wrote:
> Erick can you please point me to some example of creating a filtered
> wildcard query. I have not used filters anytime before. Tried reading
> but still am really not able to understand how filters actually work and
> will help me get rid of the MaxClause exception.
>
> Regards,
> Ruchika
>
> Erick Erickson <[EMAIL PROTECTED]> wrote: See below:
>
> On Dec 1, 2007 1:16 AM, Ruchi Thakur wrote:
>
> > Erick/John, thank you so much for the reply. I have gone through the
> > mailing list u have redirected me to. I know i need to read more, but
> > some quick questions. Please bear with me if they appear to be too
> > simple. Below is the code snippet of my current search. Also i need
> > the score info of each of my documents returned in search, as i
> > display the search result in the order of scoring.
> > {
> >   Directory fsDir = FSDirectory.getDirectory(aIndexDir, false);
> >   IndexSearcher is = new IndexSearcher(fsDir);
> >   ELSAnalyser elsAnalyser = new ELSStopAnalyser();
> >   Analyzer analyzer = elsAnalyser.getAnalyzer();
> >   QueryParser parser = new QueryParser(aIndexField, analyzer);
> >   Query query = parser.parse(aSearchStr);
> >   hits = is.search(query);
> > }
>
> EOE: Minor point that you probably already know, but opening a searcher
> is expensive. I'm assuming you put it in here for clarity, but in case
> not, be aware you should open a reader and re-use it as much as possible.
>
> Also, it looks like you're using an older version of Lucene, since
> getDirectory(dir, bool) is deprecated.
>
> > Now as i have understood, through the mail archives you have
> > suggested, below is what we need to do.
> > 1) The second was to build a *Filter* that uses WildcardTermEnum --
> > not a Query. Because it's a filter, the scoring aspects of each
> > document are taken out of the equation (I am worried abt it, as i need
> > scoring info).
>
> This is true *for the wildcard clause*. It's a legitimate question to
> ask what scoring means for a wildcard clause. Rather, it's legitimate to
> ask whether that adds much value. I managed to convince my product
> manager that the end user experience didn't suffer enough to matter, but
> it can be argued either way.
>
> That said, I'm pretty sure that if you make this a sub-clause of a
> boolean query, you still get scoring for the *other* parts of the query.
> That is,
>
>   BooleanQuery bq = new BooleanQuery();
>   bq.add(regular query);
>   bq.add(filtered wildcard query);
>   search(bq);
>
> (note, really sloppy pseudo code there) will give you scoring for the
> "regular query" part of the bq. Of course that requires you to break up
> the incoming query into the wildcard parts and the non-wildcard parts...
>
> > 2) Once you have a "WildcardFilter", wrapping it in a
> > ConstantScoreQuery would give you a drop-in replacement for
> > WildcardQuery that would sacrifice the TF/IDF scoring factors for
> > speed and guaranteed execution on any pattern in any index regardless
> > of size. (Does that mean it will solve my scoring issue and i will get
> > scoring info?)
>
> I'm pretty sure that you don't get scoring here. ConstantScoreQuery is
> named that way on purpose.
>
> > Also it suggests "SpanNearQuery on a wildcard". I am kinda confused
> > which is the approach that should actually be used. Please suggest. At
> > the same time i am studying more abt it. Thanks a lot for ur help on
> > this.
>
> I think I was looking at this for a method of highlighting, but span
> queries won't fix up wildcard queries.
>
> Handling arbitrary wildcard queries, that is queries with, say, only one
> or two leading letters, is an area of Lucene that requires that one
> really dig into the guts of querying and do some custom work. We've had
> quite reasonable results by imposing the restriction that wildcard
> queries MUST have some minimum number of leading characters.
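A rough sketch of the boolean sub-clause idea above, using the WildcardTermFilter posted at the top of this message; the field names and terms are illustrative only, and an already-open IndexSearcher is assumed:

    // Illustrative only: "title"/"body" and the terms are assumptions.
    BooleanQuery bq = new BooleanQuery();
    // Scored normally (TF/IDF applies to this clause).
    bq.add(new TermQuery(new Term("title", "lucene")), BooleanClause.Occur.MUST);
    // Wildcard clause via the filter: constant score, and no risk of
    // BooleanQuery.TooManyClauses since no term clauses are expanded.
    bq.add(new ConstantScoreQuery(new WildcardTermFilter("body", "ind*")),
           BooleanClause.Occur.MUST);
    Hits hits = searcher.search(bq);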
SpellChecker performance and usage
My question is for anyone who has experience with Lucene's SpellChecker, especially around its performance characteristics/ramifications.

1. Given the fact that SpellChecker expands a query by adding all the permutations of a potentially misspelled word, how does it perform in general?

2. How are others handling the case where SpellChecker would NOT perform well if you expand the query by adding all the permutations? In other words, what kind of techniques are people using to get around or alleviate the performance hit, if any?

Any sharing of information or pointers would be appreciated.
Re: Applying SpellChecker to a phrase
I have not tried this yet. I am trying to understand the best practices from others who have experience with SpellChecker before actually implementing it.

If I understand it correctly, the spell check class suggests alternate but similar words for a single input term. So I believe I will have to parse the phrase string and apply the spell checker to each member term to construct the final expanded query. I don't think there is higher-level support that lets me apply spell check to a phrase and do query.toString() to examine how it internally expanded the query (although it would have been nice to have something like that - has anyone written or found such a class?).

As for performance, we're dealing with hundreds of indexes where each index typically grows well above 1G in size, so performance is the single most important factor to consider.
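The kind of per-term expansion I have in mind, as a rough sketch; the "body" field, the whitespace split, and the suggestion count are assumptions, and a SpellChecker already built over the index's terms is assumed:

    // Expand each phrase term to (term OR suggestions); all names here
    // are illustrative assumptions.
    SpellChecker checker =
        new SpellChecker(FSDirectory.getDirectory("/path/to/spellindex"));
    BooleanQuery expanded = new BooleanQuery();
    for (String word : "login fraud involves impostor".split("\\s+")) {
        BooleanQuery alts = new BooleanQuery();
        alts.add(new TermQuery(new Term("body", word)), BooleanClause.Occur.SHOULD);
        for (String suggestion : checker.suggestSimilar(word, 2)) {
            alts.add(new TermQuery(new Term("body", suggestion)),
                     BooleanClause.Occur.SHOULD);
        }
        // e.g. (fraud OR fruad) AND (impostor OR imposter) AND ...
        expanded.add(alts, BooleanClause.Occur.MUST);
    }

Note this drops the word-order/proximity constraint of a real phrase; the span-query suggestion later in this thread keeps the phrase semantics.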
Re: FieldCache Implementations
I have implemented a custom version of FieldCache to handle multi-valued fields, but this requires an interface change, so it isn't applicable to what you're suggesting. However, it would be great to have a standard solution for handling multiple values.
Re: SpellChecker performance and usage
I didn't have performance issues when using the spell checker. Can you describe what you tried and how long it took, so people can relate to that?

AFAIK the spell checker in o.a.l.search.spell does not "expand a query by adding all the permutations of a potentially misspelled word". It is based on building an auxiliary index whose *documents* are the *words* of the main index, run through n-gram tokenization. A checked word is tokenized that way too, and used as a query on the auxiliary index. There's more wisdom in the query tokenization, but a simplified example can help to see how it works:

- a misspelled word 'helo' is tokenized as 'he el lo',
- the auxiliary index contains a document for the correct word 'hello' that was tokenized as 'he el ll lo',
- the score of the document 'hello' would be high when searching the auxiliary index for 'he el lo'.

The only performance hit is when refreshing/rebuilding the auxiliary index after the lexicon of the actual index has changed a lot. But this can be done in the background, when adequate for the application using Lucene and the spell checker.

Doron
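A minimal sketch of that auxiliary-index workflow, assuming the contrib SpellChecker; the paths and the "body" field are placeholders:

    // Build the n-gram "word" index from the main index's lexicon,
    // then ask for suggestions. Paths and field name are placeholders.
    IndexReader reader = IndexReader.open("/path/to/main/index");
    SpellChecker checker =
        new SpellChecker(FSDirectory.getDirectory("/path/to/spell/index"));
    // The expensive step: (re)build the dictionary index; run it in the
    // background when the main index's lexicon has changed a lot.
    checker.indexDictionary(new LuceneDictionary(reader, "body"));
    String[] suggestions = checker.suggestSimilar("helo", 5); // hopefully {"hello", ...}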
Re: can we do partial optimization?
It doesn't make sense to optimize() after every document add. Lucene in fact implements logic in the spirit of what you describe below when it decides to merge segments on the fly. There are various ways to tell Lucene how often to flush recently added/updated documents and what to merge.

But it will pay to check the simple things first, like: Are you closing and opening the index writer after each document add (don't)? Are you deleting using IndexReader or IndexWriter (use IndexWriter if you can)? Etc.

It's a good start to go again through the Lucene FAQ, http://wiki.apache.org/lucene-java/LuceneFAQ and in addition see this wiki page on performance: http://wiki.apache.org/lucene-java/BasicsOfPerformance

Good luck, and let us know how it went!

Doron
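A minimal sketch of that advice (one long-lived writer, deletes routed through IndexWriter); the path, analyzer, and "id" field are placeholders:

    // Keep a single IndexWriter open across updates instead of
    // opening/closing it per document; all names are placeholders.
    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer());
    Document doc = new Document();
    doc.add(new Field("id", "doc-42", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("body", "updated text", Field.Store.NO, Field.Index.TOKENIZED));
    // updateDocument = delete-by-term + add, all through the writer,
    // so no separate IndexReader is needed for deletes.
    writer.updateDocument(new Term("id", "doc-42"), doc);
    // ... many more adds/updates; Lucene merges segments as it goes ...
    writer.close(); // close once at shutdown, not after every add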
Re: Applying SpellChecker to a phrase
Lucene's phrase query does not support 'sub-parts'. But you may want to look at o.a.l.search.spans. It seems that a span-near query made of span-term queries and span-or queries, setting the slop to roughly the length of your phrase and setting in-order=true, would get pretty close.

About performance I hope others can comment, because I never compared this to a phrase query. When you do try this, please tell us of any interesting performance results!

Regards,
Doron
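A hedged sketch of that span construction for the fragment "login (fraud OR fruad)"; the field name and slop value are assumptions:

    // Span-based "phrase with alternatives"; field and slop are
    // illustrative assumptions.
    SpanQuery login = new SpanTermQuery(new Term("body", "login"));
    SpanQuery fraud = new SpanOrQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("body", "fraud")),
        new SpanTermQuery(new Term("body", "fruad")) });
    // Slop of about the phrase length, in-order = true to keep word order.
    SpanNearQuery near = new SpanNearQuery(
        new SpanQuery[] { login, fraud }, 2, true);
    Hits hits = searcher.search(near);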