SegmentReader using too much memory?

2006-12-11 Thread Eric Jain
I've noticed that after stress-testing my application (uses Lucene 2.0) for a while, I have almost 200mb of byte[]s hanging around, the top two culprits being: 24 x SegmentReader.Norm.bytes = 112mb; 2 x SegmentReader.ones = 16mb. The second one isn't a big deal, but I wonder what's the e

Re: SegmentReader using too much memory?

2006-12-11 Thread Eric Jain
Yonik Seeley wrote: On 12/11/06, Eric Jain <[EMAIL PROTECTED]> wrote: I've noticed that after stress-testing my application (uses Lucene 2.0) for a while, I have almost 200mb of byte[]s hanging around, the top two culprits being: 24 x SegmentReader.Norm.bytes = 112mb; 2 x Segment

Re: SegmentReader using too much memory?

2006-12-11 Thread Eric Jain
Yonik Seeley wrote: There is no real document boost at the index level... it is simply multiplied into the boost for every field of that document. So it comes down to what fields you want that index-time boost to take effect on (as well as length normalization). Come to think of it, I do have

Re: SegmentReader using too much memory?

2006-12-11 Thread Eric Jain
Yonik Seeley wrote: It's read on demand, per indexed field. So assuming your index is optimized (a single segment), then it increases by one byte[] each time you search on a new field. OK, makes sense then. Thanks! - To unsubs
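
To put rough numbers on the explanation above (norms are read on demand, one byte[] per indexed field per segment), here is a minimal sketch. It assumes each norms array holds one byte per document, so it is as long as the segment's maxDoc. The class and method names are hypothetical and not part of Lucene.

```java
// Back-of-the-envelope estimate only; not Lucene API.
final class NormsMemoryEstimate {

    /**
     * Estimated heap held by norms for one segment:
     * one byte per document for each field that has been searched.
     */
    static long normsBytes(long maxDoc, int fieldsSearched) {
        return maxDoc * fieldsSearched;
    }

    public static void main(String[] args) {
        // Example figures (assumed, not taken from the thread).
        long bytes = normsBytes(5_000_000L, 3);
        System.out.println(bytes + " bytes held by norms");
    }
}
```

Under that assumption the footprint grows with the number of distinct fields searched, not with the number of queries run.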

Scoring by number of terms in field

2006-01-09 Thread Eric Jain
Lucene seems to prefer matches in shorter documents. Is it possible to influence the scoring mechanism to have matches in shorter fields score higher instead? For example, a query for "europe" should rank: 1. title:"Europe" 2. title:"History of Europe" 3. title:"Travel in Europe, Middle East a

Re: Scoring by number of terms in field

2006-01-09 Thread Eric Jain
Paul Elschot wrote: For example, a query for "europe" should rank: 1. title:"Europe" 2. title:"History of Europe" 3. title:"Travel in Europe, Middle East and Africa" 4. subtitle:"Fairy Tales from Europe" Perhaps with this query (assuming the default implicit OR): title:europe subtitle:europe^

Re: Scoring by number of terms in field

2006-01-10 Thread Eric Jain
Paul Elschot wrote: In case you prefer to use the maximum score over the clauses you can use the DisjunctionMaxQuery from the development version. Yes, that may help! I'll need to have a look...

Generating phrase queries from term queries

2006-01-10 Thread Eric Jain
Is there an efficient way to determine if two or more terms frequently appear next to each other in sequence? For a query like: a b c one or more of the following suggestions could be generated: "a b c" "a b" c a "b c" I could of course just run a search with all possible combinations, but perh
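
The candidate suggestions listed above can be enumerated mechanically. The sketch below only generates the groupings (every way of joining adjacent terms into phrases, excluding the original all-single-terms query). Deciding which candidates are worth suggesting would still need frequency information from the index, which is not shown. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not Lucene API.
final class PhraseSuggestions {

    /** Returns every grouping of adjacent terms that contains at least one phrase. */
    static List<String> candidates(List<String> terms) {
        List<String> out = new ArrayList<>();
        int n = terms.size();
        if (n < 2) {
            return out; // a single term cannot form a phrase
        }
        int gaps = n - 1;
        int allBreaks = (1 << gaps) - 1; // every gap is a break: the original query
        // Bit i of mask set means "break between term i and term i+1".
        for (int mask = 0; mask < allBreaks; mask++) {
            StringBuilder sb = new StringBuilder();
            int start = 0;
            for (int i = 0; i < n; i++) {
                boolean breakAfter = (i == n - 1) || ((mask >> i) & 1) == 1;
                if (!breakAfter) {
                    continue;
                }
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                if (i > start) {
                    // Two or more adjacent terms: emit as a quoted phrase.
                    sb.append('"').append(String.join(" ", terms.subList(start, i + 1))).append('"');
                } else {
                    sb.append(terms.get(i));
                }
                start = i + 1;
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : candidates(List.of("a", "b", "c"))) {
            System.out.println(s);
        }
    }
}
```

The number of candidates is 2^(n-1) - 1 for n terms, so this is only practical for short queries.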

Re: Generating phrase queries from term queries

2006-01-11 Thread Eric Jain
Paul Elschot wrote: One way that might be better is to provide your own Scorer that works on the term positions of the three or more terms. This would be better for performance because it only uses one term positions object per query term (a, b, and c here). I'm trying to extract the actual phr

Re: Generating phrase queries from term queries

2006-01-12 Thread Eric Jain
Chris Hostetter wrote: (Assuming *I* understand it) what he's talking about is the ability for his search GUI to display suggested phrase searches you may want to try which consist of the words you just typed in grouped into phrases. Yes, that's precisely what I am talking about. Sorry for bei

QueryPrinter?

2006-02-18 Thread Eric Jain
I need to parse a query string, modify it a bit, and then output the modified query string. This works quite well with query.toString(), except that when I parse the query I set DEFAULT_OPERATOR_AND, and the output of BooleanQuery.toString() assumes DEFAULT_OPERATOR_OR... Would be great if this

Boolean Precedence

2006-02-21 Thread Eric Jain
I was wondering: Is there any good reason why x AND y OR z is interpreted as +(+x y z) rather than +(+(+x +y) z) ? If yes, any suggestions how this could be accomplished most easily? Searched the mailing list, found something about a "PrecedenceQueryParser", but this seems to have

Re: Boolean Precedence

2006-02-21 Thread Eric Jain
Daniel Noll wrote: http://tinyurl.com/hzsna Thanks! There is some mention of "open issues" with this parser. Anyone know what these are, and if anyone is still working on this?

Re: Boolean Precedence

2006-02-22 Thread Eric Jain
Erik Hatcher wrote: I worked on it to a point, but I don't recall what open issues there were when I left it though they were fiddly. The test case may point you in the right direction:

Frequency of phrase

2006-02-23 Thread Eric Jain
This is somewhat related to a question sent to this list a while ago: Is there an efficient way to count the number of occurrences of a phrase (not term) in an index?

Re: Frequency of phrase

2006-02-24 Thread Eric Jain
Dave Kor wrote: Not sure if this is what you want, but what I have done is to issue exact phrase queries to Lucene and counted the number of hits found. This gives you the number of documents containing the phrase, rather than the number of occurrences of the phrase itself, but that may in fac

Indexing performance with Lucene 1.9

2006-02-25 Thread Eric Jain
After upgrading to Lucene 1.9, an index that used to take about 9h to build now requires 13h. Has anyone else noticed a decrease in performance? This is how I configure the IndexWriter: writer = new IndexWriter(dir, analyzer, false); writer.mergeFactor = 100; writer.minMergeDocs = 100; writ

Re: Frequency of phrase

2006-02-25 Thread Eric Jain
Doug Cutting wrote: If you use a span query then you can get the actual number of phrase instances. Thanks, good to know! In this case (need to suggest phrase queries to the user) I've now settled with dividing the number of hits for a potential phrase by the number of documents that contain

Re: Indexing performance with Lucene 1.9

2006-02-28 Thread Eric Jain
Daniel Naber wrote: A fix has now been committed to trunk in SVN, it should be part of the next 1.9 release. Performance seems to have recovered, more or less, thanks!

Re: Indexing performance with Lucene 1.9

2006-02-28 Thread Eric Jain
Otis Gospodnetic wrote: Regarding performance fix - if you can be more precise (is it really just more or less or is it as good as before), that would be great for those of us itching to use 1.9. Yes, I can confirm that performance differs by no more than 3.1 fraggles. ;-)

Re: Indexing performance with Lucene 1.9

2006-02-28 Thread Eric Jain
Otis Gospodnetic wrote: Regarding performance fix - if you can be more precise (is it really just more or less or is it as good as before), that would be great for those of us itching to use 1.9. To be more precise: The patch reduced the time required to build one large index from 13 to 11 ho

Re: Indexing performance with Lucene 1.9

2006-03-01 Thread Eric Jain
Eric Jain wrote: I'll rerun the indexing procedure with the old version overnight, just to be sure. Just to confirm: There no longer seems to be any difference in indexing performance between the nightly build and

Re: Solr, the Lucene based Search Server

2006-03-01 Thread Eric Jain
Yonik Seeley wrote: Solr is a new open-source search server that's based on Lucene, and has XML/HTTP interfaces for updating and querying, declarative specification of analyzers and field types via a schema, extensive caching, replication, and a web admin interface. Just had a look, quite impre

MultiPhraseQuery

2006-03-05 Thread Eric Jain
I need to write a function that copies a MultiPhraseQuery and changes the field the query applies to. Unfortunately the API allows access to neither the contained terms nor the field! The other query classes I have so far dealt with all seem to allow access to the contained query terms...

QueryParser dropping constraints?

2006-03-05 Thread Eric Jain
I've noticed that while the QueryParser (both the default QueryParser and the PrecedenceQueryParser) refuse to parse foo bar) baz they both seem to interpret foo bar( baz as foo bar Bug or feature? In any case, would be great if there was a "strict" mode, and a more lenient mode whe

Re: Help interpreting explanation

2006-03-05 Thread Eric Jain
Eugene wrote: Any good links on extending the similarity class? A lot of posts discuss David Spencer's "More Like This" but I can't find this anywhere. The "More Like This" code can be found here: http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/similarity/

Re: MultiPhraseQuery

2006-03-06 Thread Eric Jain
Daniel Naber wrote: Please try to add this to MultiPhraseQuery and let us know if it helps: public List getTerms() { return termArrays; } That is indeed all I need (the list wouldn't have to be mutable though). Any chance this could be committed? Incidentally, would be helpful if th

Re: Get only count

2006-03-07 Thread Eric Jain
Anton Potehin wrote: Currently I run a new search just to get the number of results. For example: IndexSearcher is = ... Query q = ... numberOfResults = is.search(q).length(); Can I speed this up, and how? Perhaps something like: class CountingHitCollector extends HitCollector { pu
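
The snippet above is cut off, so the following is only a sketch of what a counting collector could look like, not the original message's code. To stay self-contained it uses a hypothetical stand-in class, `HitCollectorStandIn`, that mirrors the shape of Lucene's abstract `HitCollector` (a single `collect(int doc, float score)` callback). Against the real library the collector would extend `org.apache.lucene.search.HitCollector` and be passed to the searcher.

```java
// Hypothetical stand-in for org.apache.lucene.search.HitCollector,
// included only so this sketch compiles without Lucene on the classpath.
abstract class HitCollectorStandIn {
    abstract void collect(int doc, float score);
}

// Counts matching documents without keeping doc ids or scores around.
final class CountingHitCollector extends HitCollectorStandIn {
    private int count;

    @Override
    void collect(int doc, float score) {
        count++; // called once per matching document
    }

    int getCount() {
        return count;
    }
}
```

The appeal of this approach is that nothing per hit is retained; only the counter is updated.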

Re: sub search

2006-03-07 Thread Eric Jain
Anton Potehin wrote: After that I do not want to run a new search; I want to search within the results already found... Perhaps something like this would work: final BitSet results = toBitSet(Hits); searcher.search(newQuery, new Filter() { public BitSet bits(IndexReader reader) { return results;
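
The idea behind the suggestion above is that a filter restricts a new search to documents whose bit is set. The sketch below illustrates just that intersection with `java.util.BitSet`, leaving Lucene out entirely. The `toBitSet` helper here takes an array of document ids and is hypothetical; the original `toBitSet(Hits)` is not shown in the message.

```java
import java.util.BitSet;

// Illustrative only; models "search within previous results" as a bit-set intersection.
final class SubSearch {

    /** Builds a bit set with one bit set per document id. */
    static BitSet toBitSet(int[] docIds) {
        BitSet bits = new BitSet();
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    /** Documents matched by the new query that were also in the previous results. */
    static BitSet refine(BitSet previousResults, BitSet newMatches) {
        BitSet refined = (BitSet) previousResults.clone(); // leave the inputs untouched
        refined.and(newMatches);
        return refined;
    }

    public static void main(String[] args) {
        BitSet previous = toBitSet(new int[] {1, 3, 5});
        BitSet current = toBitSet(new int[] {3, 4, 5});
        System.out.println(refine(previous, current));
    }
}
```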

Re: speed

2006-03-10 Thread Eric Jain
[EMAIL PROTECTED] wrote: When I run the search I get count = 37. Maybe I am doing something incorrectly? I assume you ran both variants repeatedly, in the same process (start up costs etc)?

Re: Appending * to each search term

2006-03-17 Thread Eric Jain
Florian Hanke wrote: I'd like to append an * (create a WildcardQuery) to each search term in a query, such that a query that is entered as e.g. "term1 AND term2" is modified (effectively) to "term1* AND term2*". Parsing the search string is not very elegant (of course). I'm thinking that overri

Re: Lucene Performance Issues

2006-03-28 Thread Eric Jain
thomasg wrote: 1) By default, Lucene only indexes the first 10,000 words from each document. When increasing this default out-of-memory errors can occur. This implies that documents, or large sections thereof, are loaded into memory. ISYS has a very small memory footprint which is not affected by

Preventing phrase queries from matching across lines

2006-04-28 Thread Eric Jain
What is the best way to prevent a phrase query such as "eggs white" matching "fried eggs\nwhite snow"? Two possibilities I have thought about: 1. Replace all line breaks with a special string, e.g. "newline". 2. Have an analyzer somehow increment the position of a term for each line break it e
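
Option 2 above can be illustrated without Lucene. In the sketch below, each token gets a position, and the first token of every new line is pushed forward by an extra gap, so an exact phrase (adjacent positions) can no longer match across a line break. The gap size and all names are assumptions for illustration. In a real analyzer the same effect would come from raising the position increment of the first token after a line break.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of position gaps at line breaks; not Lucene API.
final class LineGapPositions {

    static final int LINE_GAP = 10; // assumed extra positions inserted at each line break

    static final class Token {
        final String text;
        final int position;

        Token(String text, int position) {
            this.text = text;
            this.position = position;
        }
    }

    /** Splits on whitespace and assigns positions, adding LINE_GAP at each new line. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = -1;
        String[] lines = text.split("\n");
        for (int l = 0; l < lines.length; l++) {
            boolean firstInLine = true;
            for (String word : lines[l].trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                position += (firstInLine && l > 0) ? 1 + LINE_GAP : 1;
                firstInLine = false;
                tokens.add(new Token(word, position));
            }
        }
        return tokens;
    }

    /** Exact phrase match: the words must occupy consecutive positions. */
    static boolean matchesPhrase(List<Token> tokens, String... phrase) {
        if (phrase.length == 0) {
            return false;
        }
        for (int i = 0; i + phrase.length <= tokens.size(); i++) {
            boolean ok = true;
            for (int k = 0; k < phrase.length && ok; k++) {
                Token t = tokens.get(i + k);
                ok = t.text.equals(phrase[k]) && t.position == tokens.get(i).position + k;
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

Compared with option 1, this keeps the index free of artificial "newline" terms, at the cost of needing a custom analyzer.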

Re: Preventing phrase queries from matching across lines

2006-04-29 Thread Eric Jain
Erik Hatcher wrote: On Apr 28, 2006, at 5:35 AM, Eric Jain wrote: What is the best way to prevent a phrase query such as "eggs white" matching "fried eggs\nwhite snow"? Two possibilities I have thought about: 1. Replace all line breaks with a special string, e.g. "

Re: IndexUpdateListener

2006-05-15 Thread Eric Jain
Chris Hostetter wrote: The only useful callback/listener abstractions I can think of are when you want to know if someone has finished with a set of changes -- whether that change is adding one document, deleting one document, or adding/deleting a whole bunch of documents isn't really relevant, yo

Re: Avoiding ParseExceptions

2006-06-06 Thread Eric Jain
Chris Nokleberg wrote: I am using the QueryParser with a StandardAnalyzer. I would like to avoid or auto-correct anything that would lead to a ParseException. For example, I don't think you can get a parse exception from Google--even if you omit a closing quote it looks like it just closes it for

Precedence in PrecedenceQueryParser

2006-07-25 Thread Eric Jain
The query "foo NOT bar AND baz" seems to be interpreted as "+foo -(+bar +baz)" (using default operator AND). Is this a bug, or a feature?

Precedence in PrecedenceQueryParser (2)

2006-07-25 Thread Eric Jain
The query "foo bar OR baz" seems to be interpreted as "+foo bar baz", even when using default operator AND! "foo AND bar OR baz" on the other hand is interpreted as "(+foo +bar) baz", as expected.

Re: Matching accented with non-accented characters

2006-07-25 Thread Eric Jain
Rajan, Renuka wrote: I am trying to match accented characters with non-accented characters in French/Spanish and other Western European languages. ISOLatin1AccentFilter should do the job, though it works with single characters only, so "a umlaut" will match "a" but not "ae".
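
To show the single-character behaviour described above, here is a small sketch that folds accents using only the JDK. Note that it uses a different technique from ISOLatin1AccentFilter: it decomposes the text with `java.text.Normalizer` and strips the combining marks. Like the filter, it maps "a umlaut" to "a", never to "ae". The class name is hypothetical.

```java
import java.text.Normalizer;

// JDK-only illustration of accent folding; not the Lucene filter itself.
final class AccentFolding {

    /** Removes diacritics by decomposing to NFD and dropping combining marks. */
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("Z\u00fcrich")); // u umlaut folds to plain u
    }
}
```

For matching to work, the same folding has to be applied both when indexing and when analyzing the query.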