Re: DateTools oddity....

2006-10-18 Thread Doug Cutting
Michael J. Prichard wrote: I get this output: Tue Aug 01 21:15:45 EDT 2006 That's August 2, 2006 at 01:15:45 GMT. 20060802 Huh?! Should it be: 20060801 DateTools uses GMT. Doug - To unsubscribe, e-mail: [EMAIL
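
The day rollover in this thread can be reproduced without Lucene. The following is a minimal sketch using only java.time, not DateTools itself; the class name GmtDayExample and the helper dayIn are made up for illustration. It formats the same instant at day resolution in US Eastern time and in GMT/UTC.

```java
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Illustration only: shows why a day-resolution string differs between zones.
final class GmtDayExample {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    /** Formats the given instant at day resolution as seen in the given zone. */
    static String dayIn(ZonedDateTime when, ZoneId zone) {
        return when.withZoneSameInstant(zone).format(DAY);
    }

    public static void main(String[] args) {
        ZoneId eastern = ZoneId.of("America/New_York");
        // Tue Aug 01 21:15:45 EDT 2006, the timestamp from the question
        ZonedDateTime local = ZonedDateTime.of(2006, 8, 1, 21, 15, 45, 0, eastern);
        System.out.println(dayIn(local, eastern));        // local calendar day
        System.out.println(dayIn(local, ZoneOffset.UTC)); // GMT calendar day
    }
}
```

EDT is four hours behind GMT, so 21:15:45 local time is already 01:15:45 on the next day in GMT, which is the day a GMT-based formatter reports.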

Re: Searching by bit masks

2006-11-10 Thread Doug Cutting
Erick Erickson wrote: Something like Document doc = new Document(); doc.add("flag1", "Y"); doc.add("flag2", "Y"); IndexWriter.add(doc); Fields have overheads. It would be more efficient to implement this as a single field with a different value for each boolean flag (as others have suggested
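
One way to read the single-field suggestion is sketched below, under assumptions: the flag names and the class FlagTokens are hypothetical, and the code only builds the token values. How those values are added to a document depends on the Lucene version in use and is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: turns a bit mask into one token per set flag, so that
// all flags can live as values of a single field instead of one field each.
final class FlagTokens {
    /** Hypothetical names for bits 0..3. */
    static final String[] NAMES = {"flag1", "flag2", "flag3", "flag4"};

    /** Returns one token for every bit that is set in the mask. */
    static List<String> tokensFor(int mask) {
        List<String> tokens = new ArrayList<>();
        for (int bit = 0; bit < NAMES.length; bit++) {
            if ((mask & (1 << bit)) != 0) {
                tokens.add(NAMES[bit]);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Bits 0 and 2 are set, so two tokens are produced.
        System.out.println(String.join(" ", tokensFor(0b0101)));
    }
}
```

Searching for documents with a given flag then becomes a term lookup on that one field rather than a lookup on a per-flag field.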

Re: Oracle and Lucene Integration

2006-11-22 Thread Doug Cutting
Marcelo Ochoa wrote: Then I'll move the code outside the lucene-2.0 code tree to be packed as a subdirectory of the contrib area, for example. Another alternative is to make a small zip file and send it to the list as an attachment, as a preliminary (alpha-alpha) version ;) This sounds like great potenti

Re: Lucene 2.0.1 release date

2006-12-19 Thread Doug Cutting
Steven Rowe wrote: "2.1" is much more likely to be the label used for the next release than "2.0.1". The roadmap in Jira shows 21 issues scheduled for 2.0.1. If there is in fact no intent to merge these into the 2.0 branch, these should probably be retargetted for 2.1.0, and the 2.0.1 versio

Re: Lucene scoring: coord_q_d factor

2006-12-19 Thread Doug Cutting
Karl Koch wrote: Are there any other papers that regard the combination of coordination level matching and TFxIDF as advantageous? We independently developed coordination-level matching combined with TFxIDF when I worked at Apple. This is documented in: http://www.informatik.uni-trier.de/~
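
For readers unfamiliar with the term, a simplified sketch of coordination-level matching combined with per-term weights follows. It is not Lucene's full scoring formula: the per-term weights stand in for TFxIDF values, normalization and boosts are left out, and the class name CoordSketch is made up.

```java
// Illustration only: a coordination factor applied to a sum of term weights.
final class CoordSketch {
    /** Fraction of the query's terms that the document matches. */
    static double coord(int matchedTerms, int queryTerms) {
        return (double) matchedTerms / queryTerms;
    }

    /** Sum of the matched terms' weights, scaled by the coordination factor. */
    static double score(double[] matchedTermWeights, int queryTerms) {
        double sum = 0.0;
        for (double weight : matchedTermWeights) {
            sum += weight;
        }
        return coord(matchedTermWeights.length, queryTerms) * sum;
    }

    public static void main(String[] args) {
        // Two of four query terms match, so the summed weight is halved.
        System.out.println(score(new double[] {1.0, 2.0}, 4));
    }
}
```

The effect is that a document matching more of the query's terms is rewarded beyond what the summed term weights alone would give.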

Re: trying to boost a phrase higher than its individual words

2005-10-30 Thread Doug Cutting
Erik Hatcher wrote: On 28 Oct 2005, at 22:31, Andy Lee wrote: You know what, I was confusing Nutch and Lucene classes (as I've done before), in this case the IndexSearcher classes. Sorry. The Nutch names are bad. I'm continually amazed at Doug's ability to build these using only emacs - h

Re: Sentence boundary storage

2005-10-30 Thread Doug Cutting
Chris Hostetter wrote: : One thing that I know has bogged me is when matching a phrase where I : would expect mathematical formula (which is "just a subphrase"). I : would have liked the phrase-query to extend as far as it wishes but not : passed a given token... would this be possible ? : Presum

Re: Memory Usage

2005-11-14 Thread Doug Cutting
Marvin Humphrey wrote: You *can't* set it on the reader end. If you could set it, the reader would get out of sync and break. The value is set per-segment at write time, and the reader has to be able to adapt on the fly. It would actually not be too hard to change things so that there was

Re: Memory Usage

2005-11-16 Thread Doug Cutting
Daniel Noll wrote: Timings were obtained by performing the same search 1,000 times and averaging the total time. This was then performed five times in a row to get the range that's displayed below. Memory usage was obtained using a 20-second sleep after loading the index, and then using the Win

Re: Filtering on a SpanQuery without losing spans

2005-11-16 Thread Doug Cutting
Greg K wrote: Now, however, I'd like to be able to restrict the search to certain documents in the index, so I don't have to stream through a couple of thousand spans to produce the 10 excerpts on a subset of the documents. I've tried adding a term to the SpanNearQueries that targets a keyword field

Re: Memory Usage

2005-11-17 Thread Doug Cutting
Daniel Noll wrote: I actually did throw a lot of terms in, and eventually chose "one" for the tests because it was the slowest query to complete of them all (hence I figured it was already spending some fairly long time in I/O, and would be penalised the most.) Every other query was around 7ms

Re: Memory Usage

2005-11-17 Thread Doug Cutting
Daniel Noll wrote: Doug Cutting wrote: Daniel Noll wrote: I actually did throw a lot of terms in, and eventually chose "one" for the tests because it was the slowest query to complete of them all (hence I figured it was already spending some fairly long time in I/O, and would be

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Doug Cutting
Jay Booth wrote: I had a similar problem with threading, the problem turned out to be that in the back end of the FSDirectory class I believe it was, there was a synchronized block on the actual RandomAccessFile resource when reading a block of data from it... high-concurrency situations caused t

Re: IndexReader locking

2005-11-28 Thread Doug Cutting
IndexReader locks the index while opening it to prohibit an IndexWriter from deleting any of the files in that index until all are opened. Lock files are not stored in the index directory since write access to an index should not be required to lock it while opening an IndexReader. Doug Dani

Re: Lucene performance bottlenecks

2005-12-02 Thread Doug Cutting
Andrzej Bialecki wrote: For a simple TermQuery, if the DF(term) is above 10%, the response time from IndexSearcher.search() is around 400ms (repeatable, after warm-up). For such complex phrase queries the response time is around 1 sec or more (again, after warm-up). Are you specifying -server

Re: Lucene performance bottlenecks

2005-12-07 Thread Doug Cutting
Paul Elschot wrote: Querying the host field like this in a web page index can be dangerous business. For example when term1 is "wikipedia" and term2 is "org", the query will match at least all pages from wikipedia.org. Note that if you search for wikipedia.org in Nutch this is interpreted as a

Re: Lucene performance bottlenecks

2005-12-07 Thread Doug Cutting
Andrzej Bialecki wrote: It's nice to have these couple percent... however, it doesn't solve the main problem; I need 50 or more percent increase... :-) and I suspect this can be achieved only by some radical changes in the way Nutch uses Lucene. It seems the default query structure is too compl

Re: Merging with IndexWriter.addIndexes(...)

2005-12-08 Thread Doug Cutting
J.J. Larrea wrote: So... I notice that both IndexWriter.addIndexes(...) merge methods start and end with calls to optimize() on the target index. I'm not sure whether that is causing the unpacking and repacking I observe, but it does wonder whether they truly need to be there: I don't recall

Re: IndexReader.open crashes JVM

2005-12-15 Thread Doug Cutting
chandler burgess wrote: I'm using Lucene 1.4.3 on an XP machine with JDK 1.5. Any help is appreciated. Try typing control-break to get some stack dumps. I also recommend building the current Lucene code from subversion and trying that. There have been lots of improvements since 1.4.3. It woul

Re: AW: Boolean Query

2006-01-12 Thread Doug Cutting
Klaus wrote: I have tried to study the Lucene scoring in the default similarity. Can anyone explain to me how this similarity was designed? I have read a lot of IR literature, but I have never seen an equation like the one used in Lucene. Why is this better than the normal cosine measure? It degen

Re: BTree

2006-01-12 Thread Doug Cutting
B-trees are best for random, incremental updates. They require log_b(N) disk accesses for inserts, deletes and accesses, where b is the number of entries per page, and N is the total number of entries in the tree. But that's too slow for text indexing. Rather, Lucene uses a combination of fi
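
To put numbers on the log_b(N) figure, here is a small worked sketch. The values of b and N are illustrative assumptions, not taken from the message, and the class name BTreeCost is made up. It computes the smallest number of levels h such that b^h >= N, which is the ceiling of log_b(N).

```java
// Illustration only: levels of a B-tree holding n entries at b entries per page.
final class BTreeCost {
    /**
     * Returns the smallest h with b^h >= n, i.e. ceil(log_b(n)).
     * Assumes n >= 1, b >= 2, and values small enough that b^h fits in a long.
     */
    static int levels(long n, long b) {
        int h = 0;
        long capacity = 1; // entries addressable with h levels
        while (capacity < n) {
            capacity *= b;
            h++;
        }
        return h;
    }

    public static void main(String[] args) {
        // One billion entries at 100 entries per page.
        System.out.println(levels(1_000_000_000L, 100));
    }
}
```

With these assumed values each insert touches about five pages. If none of them are cached, that is several seeks per inserted entry.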

Re: Lucene Logo? (high resolution)

2006-01-19 Thread Doug Cutting
Daniel Rabus wrote: I've created a Semantic Desktop application using Lucene. For a presentation I'd like to create a poster. Unfortunately I haven't found any high resolution version (or vector graphic) of the Lucene logo. At http://svn.apache.org/repos/asf/lucene/java/trunk/docs/images/ only

Re: Throughput doesn't increase when using more concurrent threads

2006-01-25 Thread Doug Cutting
Peter Keegan wrote: This is just fyi - in my stress tests on a 8-cpu box (that's 8 real cpus), the maximum throughput occurred with just 4 query threads. The query throughput decreased with fewer than 4 or greater than 4 query threads. The entire index was most likely in the file system cache, t

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Doug Cutting
Peter Keegan wrote: The throughput is worse with NioFSDirectory than with the FSDirectory (patched and unpatched). The bottleneck still seems to be synchronization, this time in NioFile.getChannel (7 of the 8 threads were blocked there during one snapshot). I tried this with 4 and 8 channels.

Re: Throughput doesn't increase when using more concurrent threads

2006-01-26 Thread Doug Cutting
Doug Cutting wrote: A 64-bit JVM with NioDirectory would really be optimal for this. Oops. I meant MMapDirectory, not NioDirectory. Doug

Re: Performance tips?

2006-01-27 Thread Doug Cutting
Daniel Pfeifer wrote: We are sporting Solaris 10 on a Sun Fire-machine with four cores and 12GB of RAM and mirrored Ultra 320-disks. I guess I could try switching to FSDirectory and hope for the best. Or, since you're on a 64-bit platform, try MMapDirectory, which supports greater parallelism

Re: [SPAM] - Re: Performance tips? - Sending mail server found on bl.spamcop.net

2006-01-27 Thread Doug Cutting
Daniel Pfeifer wrote: Are we both talking about Lucene? I am using Lucene 1.4.3 and can't find a class called MapDirectory or MMapDirectory. It is post-1.4. You can download a nightly build of the current trunk at: http://cvs.apache.org/dist/lucene/java/nightly/ Doug

Re: CompoundFileReader question/'leaking' file descriptors ?

2006-02-13 Thread Doug Cutting
Paul Smith wrote: We're using Lucene 1.4.3, and after hunting around in the source code just to see what I might be missing, I came across this, and I'd just like some comments. Please try using a 1.9 build to see if this is something that's perhaps already been fixed. CompoundFileReader

Re: Boosting

2006-02-13 Thread Doug Cutting
Sebastian Menge wrote: Or, to put it more simple, what does a boost of "2" or "10" _mean_ in contrast to a boost of "0.5" or "0.1" !? Boosts are simply multiplied into scores. So they only mean something in the context of the rest of the scoring mechanism. http://lucene.apache.org/java/docs
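
A tiny sketch of what "multiplied into scores" means in practice follows. The raw scores are invented numbers and BoostSketch is a made-up class, so this shows only the arithmetic, not Lucene's scoring code.

```java
// Illustration only: a boost is a multiplier, so its effect is relative.
final class BoostSketch {
    /** Applies a boost to a raw score by plain multiplication. */
    static double boosted(double rawScore, double boost) {
        return rawScore * boost;
    }

    public static void main(String[] args) {
        double weaker = 0.4;   // invented raw score of the weaker match
        double stronger = 0.6; // invented raw score of the stronger match
        // A boost of 2 lets the weaker match overtake the unboosted stronger one.
        System.out.println(boosted(weaker, 2.0) > boosted(stronger, 1.0));
        // A boost of 0.5 pushes the stronger match below the unboosted weaker one.
        System.out.println(boosted(stronger, 0.5) < boosted(weaker, 1.0));
    }
}
```

Whether a boost of 2 or 10 changes the ranking therefore depends on how far apart the raw scores already are.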

Re: CompoundFileReader question/'leaking' file descriptors ?

2006-02-13 Thread Doug Cutting
Paul Smith wrote: is 1.9 binary backward compatible? (both source code and index format). That is the intent. Try a nightly build: http://cvs.apache.org/dist/lucene/java/nightly/ Doug

Re: BM25 Similarity implementation

2006-02-16 Thread Doug Cutting
Trieschnigg, R.B. (Dolf) wrote: I would like to implement the Okapi BM25 weighting function using my own Similarity implementation. Unfortunately BM25 requires the document length in the score calculation, which is not provided by the Scorer. How do you want to measure document length? If th

Lucene 1.9 RC1 release available

2006-02-22 Thread Doug Cutting
Release 1.9 RC1 of Lucene is now available from: http://www.apache.org/dyn/closer.cgi/lucene/java/ This release candidate has many improvements since release 1.4.3, including new features, performance improvements, bug fixes, etc. For details, see: http://svn.apache.org/viewcvs.cgi/*checkout*/

Re: Indexing speed

2006-02-24 Thread Doug Cutting
revati joshi wrote: Hi all, I just wanted to know how to increase the speed of indexing files. I tried a multithreading approach but couldn't get much better performance. It was the same as with the usual sequential indexing. Is there any other approach to get better Inde

Re: Frequency of phrase

2006-02-24 Thread Doug Cutting
Eric Jain wrote: This gives you the number of documents containing the phrase, rather than the number of occurrences of the phrase itself, but that may in fact be good enough... If you use a span query then you can get the actual number of phrase instances. Doug

Re: Hacking proximity search: looking for feedback

2006-03-01 Thread Doug Cutting
Jeff Rodenburg wrote: Following on the Range Query approach, how is performance? I found the range approach (albeit with the exact values) to be slower than the parsed-string approach I posited. Note that Hoss suggested RangeFilter, not RangeQuery. Or perhaps ConstantScoreRangeQuery, which i

Lucene 1.9-final release available

2006-03-01 Thread Doug Cutting
Release 1.9-final of Lucene is now available from: http://www.apache.org/dyn/closer.cgi/lucene/java/ This release has many improvements since release 1.4.3, including new features, performance improvements, bug fixes, etc. For details, see: http://svn.apache.org/viewcvs.cgi/*checkout*/lucene/j

Lucene 1.9.1 release available

2006-03-03 Thread Doug Cutting
Release 1.9.1 of Lucene is now available from: http://www.apache.org/dyn/closer.cgi/lucene/java/ This fixes a serious bug in 1.9-final. It is strongly recommended that all 1.9-final users upgrade to 1.9.1. For details see: http://svn.apache.org/repos/asf/lucene/java/tags/lucene_1_9_1/CHANGES.

Re: Lucene version 1.9

2006-03-07 Thread Doug Cutting
WATHELET Thomas wrote: I've created an index with Lucene version 1.9, and when I try to open this index I always get this error message: java.lang.ArrayIndexOutOfBoundsException. If I use an index built with Lucene version 1.4.3, it works. What's wrong? Are you perhaps trying to open

Re: Throughput doesn't increase when using more concurrent threads

2006-03-07 Thread Doug Cutting
Peter Keegan wrote: I ran a query performance tester against 8-cpu and 16-cpu Xeon servers (16/32 cpu hyperthreaded). on Linux. Here are the results: 8-cpu: 275 qps 16-cpu: 305 qps (the dual-core Opteron servers are still faster) Here is the stack trace of 8 of the 16 query threads during the

Re: Can Lucene load more then 2GB into RAM memory?

2006-03-13 Thread Doug Cutting
RAMDirectory is indeed currently limited to 2GB. This would not be too hard to fix. Please file a bug report. Better yet, attach a patch. I assume you're running a 64bit JVM. If so, then MMapDirectory might also work well for you. Doug z shalev wrote: this is in continuation of a pr

Re: PhraseQuery and edit distance slightly confusing.

2006-03-15 Thread Doug Cutting
Dawid Weiss wrote: I get the concept implemented in PhraseQuery but isn't calling it an edit distance a little bit far fetched? Yes, it should probably be called "edit-distance-like" or something. Only the marginal elements (minimum and maximum distance from their respective query positions)

Re: Can Lucene load more then 2GB into RAM memory?

2006-03-16 Thread Doug Cutting
and it seems like performance is basically the same if not better!!! if anyone is interested let me know Doug Cutting <[EMAIL PROTECTED]> wrote: RAMDirectory is indeed currently limited to 2GB. This would not be too hard to fix. Please file a bug report. Better yet, attach a patch.

Re: TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Doug Cutting
Erick Erickson wrote: Could you point me to any explanation of *why* range queries expand this way? It's just what they do. They were contributed a long time ago, before things like RangeFilter or ConstantScoreRangeQuery were written. The latter are relatively recent additions to Lucene and

Re: Lucene and Tomcat, too many open files

2006-03-16 Thread Doug Cutting
Are you changing the default mergeFactor or other settings? If so, how? Large mergeFactors are generally a bad idea: they don't make things faster in the long run and they chew up file handles. Are all searches reusing a single IndexReader? They should. This is the other most common reason

Re: Throughput doesn't increase when using more concurrent threads

2006-03-17 Thread Doug Cutting
Peter Keegan wrote: I did some additional testing with Chris's patch and mine (based on Doug's note) vs. no patch and found that all 3 produced the same throughput - about 330 qps - over a longer period. Was CPU utilization 100%? If not, where do you think the bottleneck now is? Network? Or

Re: Lucene job

2006-03-17 Thread Doug Cutting
Michael Wechner wrote: Maybe it would make sense to sort it alphabetically [ ... ] +1 This should be sorted alphabetically by business name or last name. That's what it says on the page, although a few entries are out of place. Please feel free to fix this. Doug

Re: Lookup Issues

2006-03-22 Thread Doug Cutting
The Hits-based search API is optimized for returning earlier hits. If you want the lowest-scoring matches, then you could reverse-sort the hits, so that these are returned first. Or you could use the TopDocs-based API to retrieve hits up to your "toHits". (Hits-based search is implemented us

Re: Multiple threads in Lucene

2006-03-23 Thread Doug Cutting
Olivier Jaquemet wrote: IndexReader.unlock(indexDir); // unlock directory in case of unproper shutdown This should be used very carefully. In particular, you should only call it when you are certain that no other applications are accessing the index. Doug

Re: lucene NFS support

2006-03-23 Thread Doug Cutting
Dai, Chunhe wrote: Does anyone know whether Lucene plans to support NFS in later release(2.0)? We are planning to integrate Lucene into our products and cluster support is definitely needed. We want to check whether NFS support is in the plan or not before implementing a new file locking ourselve

Re: span query scoring vs boolean query scoring

2006-03-27 Thread Doug Cutting
Vincent Le Maout wrote: Am I missing something? Is it intended or is it a bug? Looks like a bug. Can you submit a patch? Doug

Re: span query scoring vs boolean query scoring

2006-03-27 Thread Doug Cutting
Vincent Le Maout wrote: Am I missing something? Is it intended or is it a bug? Looks like a bug. Can you please submit a bug report, and, ideally, attach a patch? Thanks, Doug

Re: Lucene indexing on Hadoop distributed file system

2006-03-27 Thread Doug Cutting
Igor Bolotin wrote: If somebody is interested - I can post our changes in TermInfosWriter and SegmentTermEnum code, although they are pretty trivial. Please submit this as a patch attached to a bug report. I contemplated making this change to Lucene myself, when writing Nutch's FsDirectory, b

Re: Lucene indexing on Hadoop distributed file system

2006-03-27 Thread Doug Cutting
Igor Bolotin wrote: Does it make sense to change TermInfosWriter.FORMAT in the patch? Yes. This should be updated for any change to the format of the file, and this certainly constitutes a format change. This discussion should move to [EMAIL PROTECTED] Doug --

Re: Lucene Performance Issues

2006-03-28 Thread Doug Cutting
thomasg wrote: Hi, we are currently intending to implement a document storage / search tool using Jackrabbit and Lucene. We have been approached by a commercial search and indexing organisation called ISYS who are suggesting the following problems with using Lucene. We do have a requirement to st

Re: Data structure of a Lucene Index

2006-03-30 Thread Doug Cutting
I talked about this a bit in a presentation at Haifa last year: http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf See the section on "Seek versus Transfer". Doug Prasenjit Mukherjee wrote: It seems to me that lucene doesn't use B-tree for its indexing storage. Any paper

Re: Lucene Document order not being maintained?

2006-04-05 Thread Doug Cutting
Dan Armbrust wrote: My indexing process works as follows (and some of this is hold-over from the time before lucene had a compound file format - so bear with me) I open up a File based index - using a merge factor of 90, and in my current test, the compound index format. When I have added 100

Re: Distributed Lucene.. - clustering as a requirement

2006-04-10 Thread Doug Cutting
Dmitry Goldenberg wrote: For an enterprise-level application, Lucene appears too file-system and too byte-sequence-centric a technology. Just my opinion. The Directory API is just too low-level. There are good reasons why Lucene is not built on top of a RDBMS. An inverted index is not effi

Re: MultiReader and MultiSearcher

2006-04-11 Thread Doug Cutting
Peter Keegan wrote: Oops. I meant to say: Does this mean that an IndexSearcher constructed from a MultiReader doesn't merge the search results and sort the results as if there was only one index? It doesn't have to, since a MultiReader *is* a single index. A quick test indicates that it does

Re: Using Lucene for searching tokens, not storing them.

2006-04-14 Thread Doug Cutting
karl wettin wrote: I would like to store all in my application rather than using the Lucene persistency mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter as that will be a natural part of my application. I only want to use the Searchable.

Re: Using Lucene for searching tokens, not storing them.

2006-04-14 Thread Doug Cutting
karl wettin wrote: Do I have to worry about passing a null Directory to the default constructor? A null Directory should not cause you problems. Doug

Re: RAM Directory / querying Performance issue

2006-04-26 Thread Doug Cutting
Is this markedly faster than using an MMapDirectory? Copying all this data into the Java heap (as RAMDirectory does) puts a tremendous burden on the garbage collector. MMapDirectory should be nearly as fast, but keeps the index out of the Java heap. Doug z shalev wrote: I've rewritten
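
The heap-versus-mapping distinction can be seen with plain java.nio, independent of Lucene. The sketch below is not MMapDirectory's implementation; the class MappedReadSketch, its methods and the sample bytes are made up for illustration. It maps a small temporary file read-only and reads one byte through the mapping, so the file's contents are served from the operating system's page cache rather than copied into a Java byte array.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustration only: reading file contents through a memory mapping.
final class MappedReadSketch {
    /** Maps the whole file read-only and returns the byte at the given offset. */
    static byte byteAt(Path file, int offset) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return mapped.get(offset); // served by the OS mapping, not a heap copy
        }
    }

    /** Writes a small temporary file, reads one character back via a mapping. */
    static char demo() {
        try {
            Path file = Files.createTempFile("mapped-sketch", ".bin");
            try {
                Files.write(file, "lucene".getBytes(StandardCharsets.US_ASCII));
                return (char) byteAt(file, 2);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the mapped pages are managed by the operating system, they add nothing to the set of objects the garbage collector has to trace, which is the burden the message attributes to copying the data into the heap.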

Re: Lucene search benchmark/stress test tool

2006-04-27 Thread Doug Cutting
Sunil Kumar PK wrote: I want to know is there any possibility or method to merge the weight calculation of index 1 and its search in a single RPC instead of doing the both function in separate steps. To score correctly, weights from all indexes must be created before any can be searched. This

Re: Ask for a better solution for the case

2006-04-28 Thread Doug Cutting
hu andy wrote: Hi, I have an application that needs to mark the retrieved documents that have been read, so that next time I needn't read the marked documents again. You could mark the documents as deleted, then later clear deletions. So long as you don't close the IndexReader, the deletions wil

Re: How are results merged from a multisearcher?

2006-05-22 Thread Doug Cutting
Tom Emerson wrote: Thanks for the clarification. What then is the difference between a MultiSearcher and using an IndexSearcher on a MultiReader? The results should be identical. A MultiSearcher permits use of ParallelMultiSearcher and RemoteSearchable, for parallel and/or distributed operat

Re: Changing the scoring (newest doc date first)

2006-05-22 Thread Doug Cutting
Marcus Falck wrote: There is however one LARGE problem that we have run into. All search result should be displayed sorted with the newest document at top. We tried to accomplish this using Lucene's sort capabilites but quickly ran into large performance bottlenecks. So i figured since the default

Re: Seeing what's occupying all the space in the index

2006-05-26 Thread Doug Cutting
Rob Staveley (Tom) wrote: Is there a tool I can use to see how much of the index is occupied by the different fields I am indexing? Note that IndexReader has a main() that will list the contents of compound index files. Doug

Re: warm up lucene, especially sort by cache

2005-03-02 Thread Doug Cutting
Morus Walter wrote: So if you use sort, doing one sort after creating the index might be useful. Yes, this is a good way to pre-load lots of things. For reading relevant parts of the index into OS caches, I'd rather use the most commonly searched terms, than the most frequent ones. If the index was

Re: help with boolean expression

2005-03-03 Thread Doug Cutting
Daniel Naber wrote: On Wednesday 02 March 2005 12:25, Erik Hatcher wrote: I agree that the current behavior is awkward. Is it worth breaking backwards compatibility to correct this with the patch applied? I'd vote for fixing this as long as the current QueryParser is still available in Lucene cor

Re: QueryParser refactoring

2005-03-08 Thread Doug Cutting
sergiu gordea wrote: So .. here is an example of how I parse a simple query string provided by a user ... the user checks a few flags and writes "test ko AND NOT bo" and the resulting query.toString() is saved in the database: +(+(subject:test description:test keywordsTerms:test koProperties:test

Re: fresh indexing bug?

2005-03-08 Thread Doug Cutting
eks dev wrote: When I reindex with the lucene from the latest svn snapshot, a lot of .tii files that are deletable appear (checked with luke). This is a bug I introduced yesterday. Thanks for catching it! The term index (.tii) was not closed, and on Windows this makes it undeleteable. I just com

Re: Find version of Lucene library

2005-03-09 Thread Doug Cutting
Andrzej Bialecki wrote: Hmmm... would not java.lang.Package various methods do the job? I'm not sure... I just tried to do Package.getPackage("org.apache.lucene") and got null, even though the manifest is present in the JAR. I looked into this. The package name in the manifest is "org/apache/l

Re: large indexes

2005-03-09 Thread Doug Cutting
Scott Smith wrote: I have the need to create an index which will potentially have a million+ documents. I know Lucene can accomplish this. However, the other requirement is that I need to be continually updating it during the date (adding 1-30 documents/minute). Have a look at this thread: http:/

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-03-09 Thread Doug Cutting
Yonik Seeley wrote: This strategy looks very promising. One drawback is that documents must be added directly to the main index for this to be efficient. This is a bit of a problem if there is a document uniqueness requirement (a unique id field). This is easy to do with a single index. Here's th

Re: Best Practices for Distributing Lucene Indexing and Searching

2005-03-09 Thread Doug Cutting
Yonik Seeley wrote: I'm trying to support an interface where documents can be added one at a time at a high rate (via HTTP POST). You don't know all of the documents ahead of time, so you can't delete them all ahead of time. A simple solution is to queue documents as they're posted. When either

Re: NumberTools

2005-03-22 Thread Doug Cutting
John Patterson wrote: It would be great if this could be incorporated into Lucene as it will make numeric searches much more efficient. I'd like to see benchmarks that demonstrate the improvement before we consider including such a patch. You're making a lot of assumptions about where time is sp

Re: NumberTools

2005-03-22 Thread Doug Cutting
Chuck Williams wrote: If there is going to be any generalization to built-in sorting representations, I'd like to suggest two things be included: 1. Fix issue 34028 (delete the one word "final") Done. 2. Include a provision for query-time parameters Can you provide a proposal? Doug

Re: Problem with memory utilisation during Lucene search

2005-03-23 Thread Doug Cutting
Daniel Naber wrote: If that doesn't help: are you sure you're using Lucene the right way, e.g. having only one IndexReader/Searcher and using it for all searches? That's my first suggestion too. Memory consumption should not primarily grow per query, rather per IndexSearcher. You're seeing 80M

Re: Seeking advice on index parameter settings for large index

2005-03-30 Thread Doug Cutting
Chuck Williams wrote: index.setMaxBufferedDocs(10); // Buffer 10 documents at a time in memory (they could be big) You might use a larger value here for the index with the small documents. I've successfully used values as high as 1000 when indexing documents that average a few kilobyte

Re: pre computing possible search results narrowing and hit counts on those

2005-03-30 Thread Doug Cutting
Antony Sequeira wrote: A user does a search for say "condominium", and i show him the 50,000 properties that meet that description. I need two other pieces of information for display - 1. I want to show a "select" box on the UI, which contains all the cities that appear in those 50,000 documents 2.

Re: searcher question

2005-03-30 Thread Doug Cutting
Omar Didi wrote: I am having a large index (100GB) and when i run the following code : String indexLocation = servlet.getServletContext().getInitParameter( "com.lucene.index" ); logger.log( Level.INFO, "got the index location from: " + indexLocation ); searcher = new IndexSearcher(indexLocation);

Re: scalability w/ number of fields

2005-04-04 Thread Doug Cutting
Yonik Seeley wrote: I know Lucene is very scalable in many ways, but how about number of fieldnames? We have an index using around 6000 unique fieldnames, How many of these fields are indexed? At this point I would recommend against having more than a handful of indexed fields. If the fields are

Re: scalability w/ number of fields

2005-04-06 Thread Doug Cutting
Yonik Seeley wrote: They are all indexed (and they all need to be under the current design). As I mentioned before, Lucene will not perform well with a large number of indexed fields. If these are not tokenized fields, then a simple way to reduce the number of indexed fields is to move the field

Re: Lucene Search Result with Line Numbers?

2005-04-11 Thread Doug Cutting
cerberus yao wrote: Does anyone know how to show Lucene search results with line numbers from the original source content? When you display each hit, first scan the text and build an array containing the positions of each newline. Then use the highlighter (in contrib/highlighter) to find fragment
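
The first step described here, the array of newline positions, can be sketched in plain Java. Two parts are assumptions rather than anything stated in the message: the class name LineIndex, and the use of a binary search to turn a character offset (such as the start of a highlighted fragment) into a line number. The highlighter call itself is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: maps character offsets in a text to 1-based line numbers.
final class LineIndex {
    /** Offsets of every '\n' in the text, in ascending order. */
    private final int[] newlineOffsets;

    LineIndex(String text) {
        List<Integer> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                found.add(i);
            }
        }
        newlineOffsets = new int[found.size()];
        for (int i = 0; i < newlineOffsets.length; i++) {
            newlineOffsets[i] = found.get(i);
        }
    }

    /** Returns the 1-based line number of the character at the given offset. */
    int lineOf(int offset) {
        int pos = Arrays.binarySearch(newlineOffsets, offset);
        // When the offset is not itself a newline, binarySearch returns
        // -(insertionPoint) - 1, and the insertion point is the number of
        // newlines before the offset. A newline counts as part of the line it ends.
        int newlinesBefore = pos >= 0 ? pos : -pos - 1;
        return newlinesBefore + 1;
    }

    public static void main(String[] args) {
        LineIndex index = new LineIndex("alpha\nbeta\ngamma");
        System.out.println(index.lineOf(6)); // offset of 'b' in "beta"
    }
}
```

Building the array costs one pass over the text per displayed hit, and each lookup afterwards is logarithmic in the number of lines.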

Re: Corrupted index

2005-04-11 Thread Doug Cutting
Daniel Naber wrote: Yes, the *.cfs shows that this is a compound index which has *.fnm files only when it's being modified. When creating a compound segment, a "segments" file is never written that refers to the segment until the .cfs file is created and the .fnm files are removed. The real pro

Re: Corrupted index

2005-04-11 Thread Doug Cutting
Bill Tschumy wrote: So, did this happen because he copied the data while in an inconsistent state? I'm a bit surprised that an inconsistent index is ever left on disk (except for temporarily while something is being written). Would this happen if there was a Writer that was not closed? An inde

Re: Hungarian notation analyzer and phrase queries

2005-04-13 Thread Doug Cutting
Paul Smith wrote: I have written a custom analyzer to tokenize PowerQuery into 'power', 'query', and 'powerquery' and change the position increment to 0, but I don't quite get the desired behavior. The phrase query "use power query for advanced searches" does not match, but "use query for advanced

Re: Hungarian notation analyzer and phrase queries

2005-04-14 Thread Doug Cutting
Paul Smith wrote: So it sounds like there isn't a perfect solution, but I think the best tradeoff for me is to put them all in the same position unless anyone has more input on the subject? If they're all at the same position you can still use slop to match the phrase. So if 'power', 'query'

Re: Update performance/indexwriter.delete()?

2005-04-14 Thread Doug Cutting
Roy Klein wrote: So one thing I've been wondering: Why do you need to do deletes from an indexreader? Is this not in the FAQ? It should be... IndexWriter can only append documents to an index. An IndexReader is required to, given a term, find the document number to mark deleted. Also, in the cu

Re: Reverting QueryParser ?

2005-04-14 Thread Doug Cutting
Paul Libbrecht wrote: I am currently evaluating the need for an elaborate query data-structure (to be exchanged over XML-RPC) as opposed to working with plain strings. I'd opt for both. For example: "java based" -coffee site apache.org d

Re: Update performance/indexwriter.delete()?

2005-04-14 Thread Doug Cutting
Yonik Seeley wrote: There are times, however, when it would be nice for deletes to be able to be concurrent with adds. It would also be nice if good coffee was free. Q: can docids change after an add() (with merging segments going on behind the scenes) or is optimize() the only call that ends up ch

Re: Update performance/indexwriter.delete()?

2005-04-14 Thread Doug Cutting
Roy Klein wrote: I think this is a better way of asking my original questions: "Why was this designed this way?" In order to optimize updates. "Can it be changed to optimize updates?" Updates are fastest when additions and deletions are separately batched. That is the design. Doug

Re: Fields with same name boosting

2005-04-15 Thread Doug Cutting
Peter Veentjer - Anchor Men wrote: I have a question about field boosting. If I have 2 (or more) fields with the same fieldname in a single document, and I boost one of those, then will only that one be boosted? Or will all fields with the same name be boosted? I guess only one field is boosted, bu

Re: CVS Lucene 2.0

2005-04-25 Thread Doug Cutting
George Aroush wrote: I would like to see a source release of 1.9, a packaged source release as ZIP/TAR. Is that possible? There is no 1.9 release. It is a *planned* release at this point. When a release is actually made, then you will be able to download it. Doug

Re: CVS Lucene 2.0

2005-04-26 Thread Doug Cutting
Yonik Seeley wrote: I don't think at this point anything structural has been proposed as different between 1.9 and 2.0. Are any of Paul Elschot's query and scorer changes being considered for 2.0? 1.9 and 2.0 will be what's in the SVN trunk. Many of Paul's changes have already been committed. Ar

Re: Indexing of virtual "made up" documents

2005-04-27 Thread Doug Cutting
Morus Walter wrote: Alternatively it should be able to write a query that does such a scoring directly (without the document start anchor) by the same means proximity query uses. Proximity query uses positional information so it should be possible to use that information for scoring based on docum

Re: Results ranking on filtered multi-field query

2005-05-02 Thread Doug Cutting
Chuck Williams wrote: I found this to be a problem as well and created alternative classes, DistributedMultiFieldQueryParser and MaxDisjunctionQuery, which are available here: http://issues.apache.org/bugzilla/show_bug.cgi?id=32674 You might check these out and see if they provide the ranking y

Re: PerFieldSimilarity

2005-05-04 Thread Doug Cutting
Robichaud, Jean-Philippe wrote: Again, I can change the similarity of the reader at run-time and issue specific queries, summing the score myself, but that is pretty inefficient. You can also specify a Similarity implementation per Query node in a complex query, e.g.: BooleanQuery query = new Boo

Re: PerFieldSimilarity

2005-05-04 Thread Doug Cutting
Robichaud, Jean-Philippe wrote: How cool, I did not know that... that may help me... If I understand you correctly, I can create a boolean query where each "clause" uses a different similarity? Yes. That would look something like: BooleanQuery booleanQuery = new BooleanQuery(); TermQuery clause1

Re: Deletes and Hits

2005-05-04 Thread Doug Cutting
Scott Smith wrote: Any other solutions or comments? Use a different IndexReader for searching than you use for deletions? Doug

Re: Distribution Strategies?

2005-05-10 Thread Doug Cutting
Steven J. Owens wrote: A friend just asked me for advice about synchronizing lucene indexes across a very large number of servers. I haven't really delved that deeply into this sort of stuff, but I've seen a variety of comments here about similar topics. Are there any well-known approach
