can we do partial optimization?

2007-12-03 Thread Nizamul
Hello,
I am very new to Lucene and I am facing one problem.
I have one very large index which is constantly getting updated (adds and
deletes) at regular intervals, after which I optimize the whole index
(otherwise searches will be slow), but optimization takes time. So I was
thinking of merging only the segments of lesser size (I guess it would be a
good compromise between search time and optimization time). For example,
suppose I have 10 segments:
1 of 10,000,000 docs
4 of 100,000 docs
4 of 10,000 docs
and 1 of 5 docs.

I want to merge the 9 smaller segments into one (I believe this would not
take much time, and searching will improve a lot), but I don't know how to
do partial merging. Does Lucene allow it? Or can I extend IndexWriter and
add an optimize method of my own where I can specify which .cfs files to
choose for optimization?

Thanks and Regards,
Nizam

Re: can we do partial optimization?

2007-12-03 Thread Michael McCandless

The current trunk of Lucene (unreleased 2.3-dev) has a new method on
IndexWriter: optimize(int maxNumSegments).  This method should do what
you want: you tell it how many segments to optimize down to, and it
will try to pick the least cost merges to get the index to that
point.  It's very new (only committed a few days ago), plus the trunk
may have bugs, so tread carefully!

If that doesn't seem to do the right merges for your index, it's also
very simple to create your own MergePolicy.  You can subclass the
default LogByteSizeMergePolicy and override the
"findMergesForOptimize" method.  This feature (separate MergePolicy)
is also only available in 2.3-dev (trunk).

Mike
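As a toy illustration of the "pick the least cost merges" idea (this is NOT Lucene's actual merge policy, just a sketch with made-up names, reusing Nizamul's segment sizes): repeatedly merging the smallest segments until the target count is reached naturally folds the nine small segments together while leaving the 10M-doc segment untouched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy sketch (NOT Lucene's actual algorithm): given segment sizes,
// repeatedly merge the two smallest segments until at most
// maxNumSegments remain, mimicking what a least-cost partial
// optimize aims for.
public class PartialOptimizeSketch {

    static List<Long> optimizeDownTo(List<Long> segmentSizes, int maxNumSegments) {
        List<Long> sizes = new ArrayList<Long>(segmentSizes);
        Collections.sort(sizes); // ascending: cheapest merges first
        while (sizes.size() > maxNumSegments) {
            // Merging the two smallest segments costs the least I/O.
            long merged = sizes.remove(0) + sizes.remove(0);
            sizes.add(merged);
            Collections.sort(sizes);
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<Long> sizes = new ArrayList<Long>();
        sizes.add(10000000L);
        for (int i = 0; i < 4; i++) sizes.add(100000L);
        for (int i = 0; i < 4; i++) sizes.add(10000L);
        sizes.add(5L);
        // Nizamul's example: the 9 small segments merge, the big one is untouched.
        System.out.println(optimizeDownTo(sizes, 2)); // prints [440005, 10000000]
    }
}
```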


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



FieldCache Implementations

2007-12-03 Thread Grant Ingersoll
Does anyone out there using Lucene implement their own version of
FieldCache.java?  We are proposing to make it an abstract class, which
violates our general rule about back-compatibility (see https://issues.apache.org/jira/browse/LUCENE-1045).


-Grant

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







Re: Applying SpellChecker to a phrase

2007-12-03 Thread Erick Erickson
Have you actually tried this and done a query.toString() to see
how it is actually expanded? I'm not all that familiar with
SpellChecker, but rather than presuming how things work,
you would get answers faster if you ran a test.

And, why do you care about performance? I know that's
a silly question, but you haven't supplied any parameters
about your index and usage to give us a clue whether this
matters. If your index is 3M, you'll never see the difference
between the two ways of expanding the query. If your
index is distributed over 10 machines and is 1T, you really,
really, really care.

And under any circumstances, you can always generate
your own query of the second form by a bit of pre-processing.

More info please.

Best
Erick

On Dec 2, 2007 10:14 PM, smokey <[EMAIL PROTECTED]> wrote:

> Suppose I have an index containing the terms impostor, imposter, fraud,
> and
> fruad, then presumably regardless of whether I spell impostor and fraud
> correctly, Lucene SpellChecker will offer the improperly spelled versions
> as
> corrections. This means that the phrase "The login fraud involves an
> impostor" would need to expand to:
>
> "The login fraud involves an impostor" OR "The login fruad involves an
> impostor" OR "The login fraud involves an imposter" OR "The login fruad
> involves an imposter" to cover all cases and thus find all possible
> matches.
>
> However, that feels like an awful lot of matches to perform on the
> index.
> A more efficient approach would be to expand the query to "The login
> (fraud
> OR fruad) involves an (impostor OR imposter)", which should be logically
> equivalent to the first (longer) query.
>
> So my question is
> (1) if others have generated the "The login (fraud OR fruad) involves an
> (impostor OR imposter)" types of queries when applying SpellChecker to a
> phrase, and agreed that this indeed performs better than the first one.
> (2) if others have observed any problems in doing so in terms of
> performance
> or anything else
>
> Any information would be appreciated.
>
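The combinatorial blow-up described above can be sketched in a few lines (a toy illustration with made-up names, not SpellChecker's API): whole-phrase expansion multiplies the alternative counts per position, while the "(fraud OR fruad) ... (impostor OR imposter)" form only adds them.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of why whole-phrase expansion blows up: the number of
// phrase variants is the product of the per-position alternative
// counts, whereas per-position OR groups stay linear in size.
public class PhraseExpansionSketch {

    // alternatives.get(i) holds the candidate spellings for position i.
    static List<String> allPhraseVariants(List<List<String>> alternatives) {
        List<String> variants = new ArrayList<String>();
        variants.add("");
        for (List<String> alts : alternatives) {
            List<String> next = new ArrayList<String>();
            for (String prefix : variants) {
                for (String alt : alts) {
                    next.add(prefix.isEmpty() ? alt : prefix + " " + alt);
                }
            }
            variants = next;
        }
        return variants;
    }
}
```

With two misspellable terms of two spellings each, this yields the four whole-phrase variants from the example; with k such terms it is 2^k variants, versus k two-term OR groups in the second form.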


Re: BooleanQuery TooManyClauses in wildcard search

2007-12-03 Thread Erick Erickson
First time I tried this I made it WAY more complex than it is 

WARNING: this is from an older code base so you may have to tweak
it. Might be 1.9 code

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.WildcardTermEnum;

public class WildcardTermFilter extends Filter {

    private static final long serialVersionUID = 1L;

    private final String field;
    private final String value;

    public WildcardTermFilter(String field, String value) {
        this.field = field;
        this.value = value;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());

        // Enumerate every term matching the wildcard pattern and mark
        // each document containing it. Because this is a filter, no
        // BooleanQuery clauses are created, so TooManyClauses cannot
        // be thrown no matter how many terms match.
        TermDocs termDocs = reader.termDocs();
        WildcardTermEnum wildEnum =
            new WildcardTermEnum(reader, new Term(field, value));
        try {
            for (Term term = null; (term = wildEnum.term()) != null;
                    wildEnum.next()) {
                termDocs.seek(term);
                while (termDocs.next()) {
                    bits.set(termDocs.doc());
                }
            }
        } finally {
            wildEnum.close();
            termDocs.close();
        }
        return bits;
    }
}


On Dec 2, 2007 8:34 AM, Ruchi Thakur <[EMAIL PROTECTED]> wrote:

> Erick, can you please point me to some example of creating a filtered
> wildcard query? I have not used filters before. I tried reading, but I am
> still not really able to understand how filters actually work and how they
> will help me get rid of the MaxClause exception.
>
>  Regards,
>  Ruchika
>
> Erick Erickson <[EMAIL PROTECTED]> wrote:
>  See below:
>
> On Dec 1, 2007 1:16 AM, Ruchi Thakur wrote:
>
> >
> > Erick/John, thank you so much for the reply. I have gone through the
> > mailing list you have redirected me to. I know I need to read more, but
> > some quick questions. Please bear with me if they appear to be too
> > simple. Below is the code snippet of my current search. Also, I need the
> > score of each document returned in the search, as I display the search
> > results in order of scoring.
> > {
> > Directory fsDir = FSDirectory.getDirectory(aIndexDir, false);
> > IndexSearcher is = new IndexSearcher(fsDir);
> > ELSAnalyser elsAnalyser = new ELSStopAnalyser();
> > Analyzer analyzer = elsAnalyser.getAnalyzer();
> > QueryParser parser = new QueryParser(aIndexField, analyzer);
> > Query query = parser.parse(aSearchStr);
> > hits = is.search(query);
> > }
> >
>
> EOE: Minor point that you probably already know, but opening a searcher is
> expensive. I'm assuming you put it in here for clarity, but in case not,
> be aware that you should open a reader and re-use it as much as possible.
>
> Also, it looks like you're using an older version of Lucene, since
> getDirectory(dir, bool) is deprecated.
>
>
> >
> > Now, as I have understood through the mail archives you have suggested,
> > below is what we need to do.
> > 1) The second was to build a *Filter* that uses WildcardTermEnum -- not a
> > Query. Because it's a filter, the scoring aspects of each document are
> > taken out of the equation. (I am worried about this, as I need scoring
> > info.)
> >
>
> This is true *for the wildcard clause*. It's a legitimate question to ask
> what
> scoring means for a wildcard clause. Rather, it's legitimate to ask
> whether
> that adds much value. I managed to convince my product manager that
> the end user experience didn't suffer enough to matter, but it can be
> argued
> either way.
>
> That said, I'm pretty sure that if you make this a sub-clause of a boolean
> query,
> you still get scoring for the *other* parts of the query. That is,
> BooleanQuery bq = 
> bq.add(regular query);
> bq.add(filtered wildcard query);
>
> search (bq);
>
> (note, really sloppy pseudo code there) will give you scoring for
> the "regular query" part of the bq. Of course that requires you to
> break up the incoming query to the wildcard parts and the not
> wildcard parts...
>
>
> >
> > 2) Once you have a "WildcardFilter", wrapping it in a ConstantScoreQuery
> > would give you a drop-in replacement for WildcardQuery that would
> > sacrifice the TF/IDF scoring factors for speed and guaranteed execution
> > on any pattern in any index regardless of size. (Does that mean it will
> > solve my scoring issue and I will get scoring info?)
> >
>
> I'm pretty sure that you don't get scoring here. ConstantScoreQuery is
> named that way on purpose.
>
>
> >
> > Also, it suggests "SpanNearQuery on a wildcard". I am kind of confused
> > about which approach should actually be used. Please suggest. At the
> > same time I am studying more about it. Thanks a lot for your help on
> > this.
> >
>
> I think I was looking at this for a method of highlighting, but span
> queries
> won't fix up wildcard queries.
>
> Handling arbitrary wildcard queries, that is, queries with, say, only
> one or two leading letters, is an area of Lucene that requires that
> one really dig into the guts of querying and do some custom work.
> We've had quite reasonable results by imposing the restriction that
> wildcard queries MUST hav
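Erick's earlier suggestion of breaking the incoming query into its wildcard and non-wildcard parts, so the plain terms keep normal scoring while the wildcard terms go through a constant-score filter, might start with a split like this (a toy sketch with made-up names, not a Lucene API; real query syntax with phrases, fields, and escapes needs a proper parser):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: partition a whitespace-separated query into plain terms
// (candidates for normal, scored clauses) and wildcard terms
// (candidates for a filtered, constant-score clause).
public class QuerySplitter {

    static List<String> plainTerms(String query) {
        return partition(query, false);
    }

    static List<String> wildcardTerms(String query) {
        return partition(query, true);
    }

    private static List<String> partition(String query, boolean wantWildcards) {
        List<String> out = new ArrayList<String>();
        for (String tok : query.trim().split("\\s+")) {
            boolean isWildcard = tok.indexOf('*') >= 0 || tok.indexOf('?') >= 0;
            if (isWildcard == wantWildcards) {
                out.add(tok);
            }
        }
        return out;
    }
}
```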

SpellChecker performance and usage

2007-12-03 Thread smokey
My question is for anyone who has experience with Lucene's SpellChecker,
especially around its performance characteristics/ramifications.

1. Given that SpellChecker expands a query by adding all the permutations
of a potentially misspelled word, how does it perform in general?

2. How are others handling the case where SpellChecker would NOT perform
well if you expand the query by adding all the permutations? In other
words, what kind of techniques are people using to get around or alleviate
the performance hit, if any?

Any sharing of information or pointers would be appreciated.


Re: Applying SpellChecker to a phrase

2007-12-03 Thread smokey
I have not tried this yet. I am trying to understand the best practices from
others who have experience with SpellChecker before actually implementing
it.

If I understand it correctly, the spell check class suggests alternate but
similar words for a single input term. So I believe I will have to parse the
phrase string and apply the spell checker to each member term to construct
the final expanded query. I don't think there is higher-level support that
lets me apply spell check to a phrase and do query.toString() to examine how
it internally expanded the query (although it would have been nice to have
something like that -- has anyone written or found such a class?)

As for performance, we're dealing with hundreds of indexes where each index
typically grows well above 1G in size, so performance is the single most
important factor to consider.



Re: FieldCache Implementations

2007-12-03 Thread Thom Nelson
I have implemented a custom version of FieldCache to handle multi-valued
fields, but this requires an interface change, so it isn't applicable to
what you're suggesting.  However, it would be great to have a standard
solution for handling multiple values.









Re: SpellChecker performance and usage

2007-12-03 Thread Doron Cohen
I didn't have performance issues when using the spell checker.
Can you describe what you tried and how long it took, so that
people can relate to it?

AFAIK the spell checker in o.a.l.search.spell does not "expand
a query by adding all the permutations of a potentially misspelled
word". It is based on building an auxiliary index whose *documents*
are the *words* of the main index, run through n-gram tokenization.
A checked word is tokenized that way too, and used as a query on
the auxiliary index.

There's more wisdom in the query tokenization,
but a simplified example can help to see how it works:
- a misspelled word 'helo' is tokenized as 'he el lo',
- the auxiliary index contains a document for the correct
  word "hello" that was tokenized as 'he el ll lo'
- the score of the document 'hello' would be high when searching
  the auxiliary index for 'he el lo'.
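Doron's 'helo'/'hello' example can be sketched in a few lines (a toy illustration with made-up names, not the actual o.a.l.search.spell code): decompose words into 2-grams and score candidates by how many grams they share with the misspelled word.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy sketch of n-gram based spell checking: a candidate word scores
// higher the more 2-grams it shares with the checked word.
public class NGramSketch {

    static List<String> bigrams(String word) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    static int sharedGrams(String a, String b) {
        Set<String> gramsA = new HashSet<String>(bigrams(a));
        int shared = 0;
        for (String g : new HashSet<String>(bigrams(b))) {
            if (gramsA.contains(g)) shared++;
        }
        return shared;
    }
}
```

Here bigrams("helo") gives the 'he el lo' tokens from the example, and 'hello' shares all three of them, so it would score high as a suggestion.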

The only performance hit is when refreshing/rebuilding the
auxiliary index after the lexicon of the actual index
has changed a lot. But this can be done in the background when
adequate for the application using Lucene and the spell checker.

Doron

smokey <[EMAIL PROTECTED]> wrote on 03/12/2007 17:23:21:

> My question is for anyone who has experience with Lucene's SpellChecker,
> especially around its performance characteristics/ramifications.
>
> 1. Given the fact that SpellChecker expands a query by adding all the
> permutations of potentially misspelled word, how does it
> perform in general?
>
> 2. How are others handling the case where SpellChecker would NOT perform
> well if you expand the query adding all the permutations? In other words,
> what kind of techniques are people using to get around or alleviate the
> performance hit if any?
>
> Any sharing of information or pointers would be appreciated.





Re: can we do partial optimization?

2007-12-03 Thread Doron Cohen
It doesn't make sense to optimize() after every document add.
Lucene in fact implements logic in the spirit of what you
describe below when it decides to merge segments on the fly.

There are various ways to tell Lucene how often to flush
recently added/updated documents and what to merge.

But it will pay to check the simple things first, like: are
you closing and opening the index writer after each document
add (don't)? Are you deleting using IndexReader or
IndexWriter (use IndexWriter if you can)? Etc.

It's a good idea to start by going through the Lucene FAQ again:
http://wiki.apache.org/lucene-java/LuceneFAQ
and, in addition, see this wiki page on performance:
http://wiki.apache.org/lucene-java/BasicsOfPerformance

Good luck, and let us know how it went!
Doron






Re: Applying SpellChecker to a phrase

2007-12-03 Thread Doron Cohen
See below -


Lucene's phrase query does not support 'sub-parts', but you may
want to look at o.a.l.search.spans. It seems that a span-near query
made of span-term queries and span-or queries, setting the (max) slop
to roughly the length of your phrase and setting in-order=true, would
get pretty close.

About performance I hope others can comment, because I never compared
this to a phrase query. When you do try this, please tell us of any
interesting performance results!

Regards,
Doron

