Re: order of Field objects within Document

2004-03-18 Thread Doug Cutting
Sam Hough wrote: Can anybody confirm that no guarantee is given that Fields retain their order within a Document? Version 1.3 seems to (although reversing the order on occasion). In 1.3 they're reversed as added, then reversed as read, so that hits have fields in their added order. In 1.4 I've

Re: Demoting results

2004-03-18 Thread Doug Cutting
Have you tried assigning these very small boosts (0 < boost < 1) and assigning other query clauses relatively large boosts (boost > 1)? Boris Goldowsky wrote: Is there any way to build a query where the occurrence of a particular Term (in a Keyword field) causes the rank of the document to be

Re: Demoting results

2004-03-19 Thread Doug Cutting
Boris Goldowsky wrote: On Thu, 2004-03-18 at 13:32, Doug Cutting wrote: Have you tried assigning these very small boosts (0 < boost < 1) and assigning other query clauses relatively large boosts (boost > 1)? I was trying to formulate a query like, say +(title: asparagus) (doctype:bad)^-3 which

Re: Demoting results

2004-03-19 Thread Doug Cutting
Doug Cutting wrote: On Thu, 2004-03-18 at 13:32, Doug Cutting wrote: Have you tried assigning these very small boosts (0 < boost < 1) and assigning other query clauses relatively large boosts (boost > 1)? I don't think you understood my proposal. You should try boosting the documents when you add
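
A minimal sketch of the index-time document boost Doug suggests, against the Lucene 1.4-era API; the index path, field names, values and boost factor below are made up for illustration:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class DemoteByBoost {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
      Document doc = new Document();
      doc.add(Field.Text("title", "asparagus recipes"));
      doc.add(Field.Keyword("doctype", "bad"));
      doc.setBoost(0.1f);  // a boost below 1.0 ranks this document under otherwise-equal matches
      writer.addDocument(doc);
      writer.close();
    }
  }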

Re: Cover density ranking?

2004-03-23 Thread Doug Cutting
Boris Goldowsky wrote: How difficult would it be to implement something like Cover Density ranking for Lucene? Has anyone tried it? Cover density is described at http://citeseer.ist.psu.edu/558750.html , and is supposed to be particularly good for short queries of the type that you get in many

Re: How to order search results by Field value?

2004-03-25 Thread Doug Cutting
Eric Jain wrote: Just to clarify things: Does the current solution require all fields that can be used for sorting to be loaded and kept in memory? (I guess you can answer this question faster than I can figure it out by myself :-) Field values are loaded into memory. But values are kept in an

Re: How to order search results by Field value?

2004-03-25 Thread Doug Cutting
Eric Jain wrote: That's reasonable. What I didn't quite understand yet: If I sort on a string field, will Lucene need to keep all values in memory all the time, or only during startup? It will cache one instance of each unique value. So if you have a million documents and string sort results on
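
For reference, a minimal sort-by-field sketch against the Lucene 1.4 search API; the index path, query text and field names are assumptions, not taken from the thread:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Sort;
  import org.apache.lucene.search.SortField;

  public class SortedSearch {
    public static void main(String[] args) throws Exception {
      IndexSearcher searcher = new IndexSearcher("index");
      Query query = QueryParser.parse("asparagus", "contents", new StandardAnalyzer());
      // A STRING sort caches one entry per document plus each unique value;
      // an INT sort field is cheaper (four bytes per document).
      Hits hits = searcher.search(query, new Sort(new SortField("title", SortField.STRING)));
      for (int i = 0; i < hits.length(); i++)
        System.out.println(hits.doc(i).get("title"));
      searcher.close();
    }
  }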

Re: How to order search results by Field value?

2004-03-25 Thread Doug Cutting
Eric Jain wrote: I will need to have a look at the code, but I assume that in principle it should be possible to replace the strings with sequential integers once the sorting is done? I don't understand the question. Doug

Re: Lucene 1.4 - lobby for final release

2004-03-26 Thread Doug Cutting
Chad Small wrote: Thanks, Erik. OK, this is my official lobby effort for the release of 1.4 to final status. Anyone else need/want a 1.4 release? Does anyone have any information on 1.4 release plans? I'd like to make an RC once I manage to fix bug #27799, which will hopefully be soon. Doug

Re: Demoting results

2004-03-29 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I have not been able to work out how to get custom coordination going to demote results based on a specific term [ ... ] Yeah, it's a little more complicated than perhaps it should be. I've attached a class which does this. I think it's faster and more effective than

Re: Lucene 1.4 - lobby for final release

2004-03-29 Thread Doug Cutting
Charlie Smith wrote: I'll vote yes please release new version with too many files open fixed. There is no "too many files open" bug, except perhaps in your application. It is, however, an easy problem to encounter if you don't close indexes or if you change Lucene's default parameters. It will be
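
The usual fix is simply to close searchers and readers once you are done with them. A minimal sketch (index path, field and term are illustrative only):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.TermQuery;

  public class SearchAndClose {
    public static void main(String[] args) throws Exception {
      IndexSearcher searcher = new IndexSearcher("index");
      try {
        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
        System.out.println(hits.length() + " hits");
      } finally {
        searcher.close();  // always release the underlying file handles
      }
    }
  }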

Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Doug Cutting
Kevin A. Burton wrote: We're using lucene with one large target index which right now is 5G. Every night we take sub-indexes which are about 500M and merge them into this main index. This merge (done via IndexWriter.addIndexes(Directory[])) is taking way too much time. Looking at the stats
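
For context, the merge Kevin describes boils down to something like the sketch below (paths are placeholders); addIndexes(Directory[]) also optimizes the target index as part of the operation, which is a large part of its cost:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class NightlyMerge {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter("main-index", new StandardAnalyzer(), false);
      Directory[] subIndexes = {
        FSDirectory.getDirectory("sub-index-1", false),
        FSDirectory.getDirectory("sub-index-2", false)
      };
      writer.addIndexes(subIndexes);  // merges the sub-indexes into the target
      writer.close();
    }
  }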

Re: Overriding coordination

2004-03-29 Thread Doug Cutting
Boris Goldowsky wrote: I have a situation where I'm querying for something in several fields, with a clause similar to this: (title:(two words)^20 keywords:(two words)^10 body:(two words)) Some good documents are being scored too low if the query terms do not occur in the body field. I

Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Doug Cutting
Kevin A. Burton wrote: One way to force larger read-aheads might be to pump up Lucene's input buffer size. As an experiment, try increasing InputStream.BUFFER_SIZE to 1024*1024 or larger. You'll want to do this just for the merge process and not for searching and indexing. That should help

Re: Javadocs lucene 1.4

2004-03-29 Thread Doug Cutting
Lucene 1.4 has not been released. Until it is released, you need to check out the sources from CVS and build them, including javadoc. Doug Stephane James Vaucher wrote: Are the javadocs available on the site? I'd like to see the javadocs for lucene-1.4 (specifically SpanQuery) somewhere on

Re: Lucene optimization with one large index and numerous small indexes.

2004-03-30 Thread Doug Cutting
Esmond Pitt wrote: Don't want to start a buffer size war, but these have always seemed too small to me. I'd recommend upping both InputStream and OutputStream buffer sizes to at least 4k, as this is the cluster size on most disks these days, and also a common VM page size. Okay. Reading and

Re: Performance of hit highlighting and finding term positions for

2004-03-31 Thread Doug Cutting
[EMAIL PROTECTED] wrote: As a note of warning: I did find StandardTokenizer to be the major culprit in my tokenizing benchmarks (avg 75ms for 16k sized docs). I have found I can live without StandardTokenizer in my apps. FYI, the message with Mark's timings can be found at:

Re: Wierd Search Behavior

2004-04-01 Thread Doug Cutting
Terry, Can you please try to develop a reproducible test case? Otherwise it's impossible to verify and debug this. For something like this it would suffice to provide: 1. The initial index, which satisfies the test queries; 2. The new index you add; 3. Your merge and test code, as a

Re: Iterernal Document Numbers

2004-04-01 Thread Doug Cutting
Joe Rayguy wrote: So, assuming that sort as implemented in 1.4 doesn't work for me, my original question still stands. Do I have to worry about merges that occur as documents are added, or do I only have to rebuild my array after optimizations? Or, alternatively, how did everyone sort before

Re: starts with query functionality

2004-04-06 Thread Doug Cutting
Chad Small wrote: We have a requirement to return documents with a title field that starts with a certain letter. Is there a way to do something like this? We're using the StandardAnalyzer Example title fields: This is the title of a document. And this is a title of a different document.
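
One way to do this is a PrefixQuery on the title field. Note that with StandardAnalyzer the title is tokenized and lower-cased, so this matches any title word starting with the prefix; restricting it to the first word of the title usually means indexing a separate keyword field holding just the leading letter. A sketch (index path and field name assumed):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.PrefixQuery;

  public class TitleStartsWith {
    public static void main(String[] args) throws Exception {
      IndexSearcher searcher = new IndexSearcher("index");
      Hits hits = searcher.search(new PrefixQuery(new Term("title", "t")));
      System.out.println(hits.length() + " titles containing a word that starts with 't'");
      searcher.close();
    }
  }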

Re: Index partitioning

2004-04-06 Thread Doug Cutting
Magnus Mellin wrote: I would like to partition an index over X number of remote searchers. Any ideas, or suggestions, on how to use the same term dictionary (one that represents the terms and frequencies for the whole document collection) over all my indices? Try using a ParallelMultiSearcher
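
A minimal ParallelMultiSearcher sketch over two local index partitions (paths and query are placeholders); in the remote case the Searchable array would hold RemoteSearchable stubs instead of local IndexSearchers:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.ParallelMultiSearcher;
  import org.apache.lucene.search.Searchable;

  public class PartitionedSearch {
    public static void main(String[] args) throws Exception {
      Searchable[] shards = {
        new IndexSearcher("index-part-1"),
        new IndexSearcher("index-part-2")
      };
      ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
      Hits hits = searcher.search(
          QueryParser.parse("lucene", "contents", new StandardAnalyzer()));
      System.out.println(hits.length() + " hits across both partitions");
      searcher.close();
    }
  }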

Re: verifying index integrity

2004-04-06 Thread Doug Cutting
Weir, Michael wrote: I assume that it is possible to corrupt an index by crashing at just the right time. It should not be possible to corrupt an index this way. I notice that there's a method IndexReader.unlock(). Does this method ensure that the index has not been corrupted? If you use this

Re: verifying index integrity

2004-04-08 Thread Doug Cutting
Weir, Michael wrote: So if our server is the only process that ever opens the index, I should be able to run through the indexes at startup and simply unlock them? Yes. Doug
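
A minimal startup sketch of that cleanup, assuming the server really is the only process that ever touches the index (the path is a placeholder):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class ClearStaleLocks {
    public static void main(String[] args) throws Exception {
      Directory dir = FSDirectory.getDirectory("index", false);
      if (IndexReader.isLocked(dir)) {
        IndexReader.unlock(dir);  // removes lock files left behind by a crashed process
      }
    }
  }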

Re: Locking problems with NFS

2004-04-20 Thread Doug Cutting
Francesco Bellomi wrote: we are experiencing some difficulties in using Lucene with a NFS filesystem. Basically, locking seems not to work properly, since it appears that attempted concurrent writes to the index (from different VMs) are not blocked, and this often causes the index to be

Re: Locking problems with NFS

2004-04-20 Thread Doug Cutting
Francesco Bellomi wrote: The only problem is that, as of Lucene 1.4rc2, FSDirectory is 'final'. Please submit a patch to lucene-dev to make FSDirectory non-final. In fact, a third architectural approach would be to define an API for pluggable lock implementations: IMHO that would be more robust to

new Lucene release: 1.4 RC3

2004-05-11 Thread Doug Cutting
Version 1.4 RC3 of Lucene is available for download from: http://cvs.apache.org/dist/jakarta/lucene/v1.4-rc3/ Changes are described at: http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.85 Doug

Re: BooleanQuery.add()

2004-05-13 Thread Doug Cutting
Leonid Portnoy wrote: Am I misunderstanding something here, or is the documentation unclear? The documentation is unclear. Can you propose an improvement? Doug

Re: SQLDirectory implementation

2004-05-11 Thread Doug Cutting
. ( see test code ) 2.) The first search is always really slow as everything initializes and the cache fills ;) so don't let that discourage you. -vito On Mon, 2004-04-26 at 14:59, Doug Cutting wrote: Anthony Vito wrote: I noticed some talk on SQLDirectory a month or so ago. . Did you ever

Re: exact the same score from different documents

2004-05-14 Thread Doug Cutting
hui wrote: I am getting exactly the same score, like 0.04809519, for documents of different sizes for some queries, and this happens quite frequently. Based on the score formula, it seems this should rarely happen. Or do I misunderstand the formula? Normalization factors (document boosts) are represented

Re: Not deleting temp files after updating/optimising.

2004-04-26 Thread Doug Cutting
Win32 seems to sometimes not permit one to delete a file immediately after it has been closed. Because of this, Lucene keeps a list of files that need to be deleted in the 'deleteable' file. Are your files listed in this file? If so, Lucene will again try to delete these files the next time

Re: SQLDirectory implementation

2004-04-26 Thread Doug Cutting
Anthony Vito wrote: I noticed some talk on SQLDirectory a month or so ago. ( I just joined the list :) ) I have a JDBC implementation that stores the files in a couple of tables and stores the data for the files as blocks (BLOBs) of a certain size ( 16k by default ). It also has an LRU cache for

Re: need info for database based Lucene but not flat file

2004-04-27 Thread Doug Cutting
Yukun Song wrote: As is known, Lucene currently uses flat files to store its index. Does anyone have ideas or resources for combining a database (like MySQL or PostgreSQL) with Lucene instead of the current flat index file formats? A few folks have implemented an SQL-based Lucene Directory, but

Re: need info for database based Lucene but not flat file

2004-04-27 Thread Doug Cutting
Incze Lajos wrote: Could anybody summarize what would be the technical pros/cons of a DB-based directory over the flat files? (What I see at the moment is that for some - significant? - performance penalty you'll get an index available over the network for multiple lucene engines -- if I'm right.)

Re: Understanding Boolean Queries

2004-04-29 Thread Doug Cutting
Please don't crosspost to lucene-user and lucene-dev! Tate Avery wrote: 3) The maxClauseCount threshold appears not to care whether or not my clauses are 'required' or 'prohibited'... only how many of them there are in total. That's correct. It is an attempt to stop out-of-memory errors which

Re: Help with scoring, coordination factor?

2004-04-29 Thread Doug Cutting
Matthew W. Bilotti wrote: We suspect the coordination term is driving down these documents' ranks and we would like to bring those documents back up to where they should be. That sounds right to me. Is there a relatively easy way to implement what we want using Lucene? Would it be better to

Re: Memory usage

2004-05-26 Thread Doug Cutting
James Dunn wrote: Also I search across about 50 fields but I don't use wildcard or range queries. Lucene uses one byte of RAM per document per searched field, to hold the normalization values. So if you search a 10M document collection with 50 fields, then you'll end up using 500MB of RAM. If
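
A back-of-the-envelope check of that figure (one byte per document per searched field):

  public class NormMemoryEstimate {
    public static void main(String[] args) {
      long docs = 10 * 1000 * 1000L;  // 10M documents
      int searchedFields = 50;
      long bytes = docs * searchedFields;                          // 500,000,000 bytes
      System.out.println(bytes / (1000 * 1000) + " MB of norms");  // ~500 MB
    }
  }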

Re: Memory usage

2004-05-26 Thread Doug Cutting
requirements for a search. Does this memory get used only during the search operation itself, or is it referenced by the Hits object or anything else after the actual search completes? Thanks again, Jim --- Doug Cutting [EMAIL PROTECTED] wrote: James Dunn wrote: Also I search across about 50 fields

Re: problems with lucene in multithreaded environment

2004-06-02 Thread Doug Cutting
Jayant Kumar wrote: We recently tested lucene with an index size of 2 GB which has about 1,500,000 documents, each document having about 25 fields. The frequency of search was about 20 queries per second. This resulted in an average response time of about 20 seconds per search. That sounds

Re: problems with lucene in multithreaded environment

2004-06-04 Thread Doug Cutting
Jayant Kumar wrote: Please find enclosed jvmdump.txt which contains a dump of our search program after about 20 seconds of starting the program. Also enclosed is the file queries.txt which contains few sample search queries. Thanks for the data. This is exactly what I was looking for. Thread-14

Re: problems with lucene in multithreaded environment

2004-06-04 Thread Doug Cutting
Doug Cutting wrote: Please tell me if you are able to simplify your queries and if that speeds things. I'll look into a ThreadLocal-based solution too. I've attached a patch that should help with the thread contention, although I've not tested it extensively. I still don't fully understand why

Re: problems with lucene in multithreaded environment

2004-06-07 Thread Doug Cutting
Jayant Kumar wrote: Thanks for the patch. It helped in increasing the search speed to a good extent. Good. I'll commit it. Thanks for testing it. But when we tried to give about 100 queries in 10 seconds, then again we found that after about 15 seconds, the response time per query increased.

Re: Running OutOfMemory while optimizing and searching

2004-07-01 Thread Doug Cutting
What do your queries look like? The memory required for a query can be computed by the following equation: 1 Byte * Number of fields in your query * Number of docs in your index So if your query searches on all 50 fields of your 3.5 Million document index then each search would take

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Doug Cutting
A mergeFactor of 5000 is a bad idea. If you want to index faster, try increasing minMergeDocs instead. If you have lots of memory this can probably be 5000 or higher. Also, why do you optimize before you're done? That only slows things. Perhaps you have to do it because you've set
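
A sketch of the suggested settings (mergeFactor and minMergeDocs are public fields on IndexWriter in 1.4; the path and values here are illustrative):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;

  public class FastBatchIndexing {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
      writer.mergeFactor = 10;     // keep the default; huge values exhaust file handles
      writer.minMergeDocs = 5000;  // buffer more documents in RAM before writing a segment
      // ... writer.addDocument(doc) for the whole batch ...
      writer.optimize();           // optimize once, at the very end
      writer.close();
    }
  }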

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Doug Cutting
Julien, Thanks for the excellent explanation. I think this thread points to a documentation problem. We should improve the javadoc for these parameters to make this easier for folks to get right. In particular, the javadoc for mergeFactor should mention that very large values (>100) are not recommended,

Re: indexing help

2004-07-07 Thread Doug Cutting
John Wang wrote: While lucene tokenizes the words in the document, it counts the frequency and figures out the position, we are trying to bypass this stage: For each document, I have a set of words with a known frequency, e.g. java (5), lucene (6) etc. (I don't care about the position, so it

Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: So is it possible to fix this index now? Can I just delete the most recent segment that was created? I can find this by ls -alt Sorry, I forgot to answer your question: this should work fine. I don't think you should even have to delete that segment. Also, to elaborate

Re: problem running lucene 1.4 demo on a solaris machine (permission denied)

2004-07-08 Thread Doug Cutting
MATL (Mats Lindberg) wrote: When I copied the Lucene jar file to the Solaris machine from the Windows machine, I used an FTP program. FTP probably mangled the file. You need to use FTP's binary mode. Doug

Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: No... I changed the mergeFactor back to 10 as you suggested. Then I am confused about why it should take so long. Did you by chance set the IndexWriter.infoStream to something, so that it logs merges? If so, it would be interesting to see that output, especially the last

Re: Lucene shouldn't use java.io.tmpdir

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: This is why I think it makes more sense to use our own java.io.tmpdir to be on the safe side. I think the bug is that Tomcat changes java.io.tmpdir. I thought that the point of the system property java.io.tmpdir was to have a portable name for /tmp on unix,

Re: indexing help

2004-07-08 Thread Doug Cutting
John Wang wrote: Just for my education, can you maybe elaborate on using the implement an IndexReader that delivers a synthetic index approach? IndexReader is an abstract class. It has few data fields, and few non-static methods that are not implemented in terms of abstract methods. So, in

Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: During an optimize I assume Lucene starts writing to a new segment and leaves all others in place until everything is done and THEN deletes them? That's correct. The only settings I use are: targetIndex.mergeFactor=10; targetIndex.minMergeDocs=1000; the resulting index has

Re: Way to repair an index broking during 1/2 optimize?

2004-07-09 Thread Doug Cutting
Kevin A. Burton wrote: With the typical handful of fields, one should never see more than hundreds of files. We only have 13 fields... Though to be honest I'm worried that even if I COULD do the optimize it would run out of file handles. Optimization doesn't open all files at once. The

Re: Lucene shouldn't use java.io.tmpdir

2004-07-09 Thread Doug Cutting
Armbrust, Daniel C. wrote: The problem I ran into the other day with the new lock location is that Person A had started an index, ran into problems, erased the index and asked me to look at it. I tried to rebuild the index (in the same place on a Solaris machine) and found out that A) - her locks

Re: Why is Field.java final?

2004-07-11 Thread Doug Cutting
Kevin A. Burton wrote: I was going to create a new IDField class which just calls super( name, value, false, true, false) but noticed I was prevented because Field.java is final? You don't need to subclass to do this, just a static method somewhere. Why is this? I can't see any harm in making

Re: Field.java - STORED, NOT_STORED, etc...

2004-07-11 Thread Doug Cutting
Kevin A. Burton wrote: So I added a few constants to my class: new Field( name, value, NOT_STORED, INDEXED, NOT_TOKENIZED ); which IMO is a lot easier to maintain. Why not add these constants to Field.java: public static final boolean STORED = true; public static final boolean NOT_STORED =

Re: Field.java - STORED, NOT_STORED, etc...

2004-07-11 Thread Doug Cutting
Doug Cutting wrote: The calls would look like: new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES); Stored could be implemented as the nested class: public final class Stored { private Stored() {} public static final Stored YES = new Stored(); public static final Stored NO = new
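
Completing the truncated snippet, the typesafe-enum pattern Doug sketches would look roughly like this (a reconstruction for illustration, not the code that was eventually committed):

  public final class Stored {
    private Stored() {}
    public static final Stored YES = new Stored();
    public static final Stored NO = new Stored();
  }

  // Indexed and Tokenized would follow the same pattern, so calls read:
  //   new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES);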

Re: AW: Understanding TooManyClauses-Exception and Query-RAM-size

2004-07-12 Thread Doug Cutting
[EMAIL PROTECTED] wrote: What I really would like to see are some best practices or some advice from some users who are working with really large indices how they handle this situation, or why they don't have to care about it or maybe why I am completely missing the point ;-)) Many folks with

Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote: First let me explain what I found out. I'm running Lucene on a 4 CPU server. While doing some stress tests I've noticed (by doing full thread dump) that searching threads are blocked on the method: public FieldInfo fieldInfo(int fieldNumber) This causes a significant cpu idle

Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote: I use Lucene 1.4 final Here is the thread dump for one blocked thread (If you want a full thread dump for all threads I can do that too) Thanks. I think I get the point. I recently removed a synchronization point higher in the stack, so that now this one shows up! Whether or not

Re: Scoring without normalization!

2004-07-15 Thread Doug Cutting
Have you looked at: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html in particular, at: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int)

Re: Token or not Token, PerFieldAnalyzer

2004-07-15 Thread Doug Cutting
Florian Sauvin wrote: Everywhere in the documentation (and it seems logical) you say to use the same analyzer for indexing and querying... how is this handled on not tokenized fields? Imperfectly. The QueryParser knows nothing about the index, so it does not know which fields were tokenized and

Re: release migration plan

2004-07-15 Thread Doug Cutting
fp235-5 wrote: I am looking at the code to implement setIndexInterval() in IndexWriter. I'd like to have your opinion on the best way to do it. Currently the creation of an instance of TermInfosWriter requires the following steps: ... IndexWriter.addDocument(Document)

Re: Post-sorted inverted index?

2004-07-20 Thread Doug Cutting
You can define a subclass of FilterIndexReader that re-sorts documents in TermPositions(Term) and document(int), then use IndexWriter.addIndexes() to write this in Lucene's standard format. I have done this in Nutch, with the (as yet unused) IndexOptimizer.

Re: Sort: 1.4-rc3 vs. 1.4-final

2004-07-21 Thread Doug Cutting
The key in the WeakHashMap should be the IndexReader, not the Entry. I think this should become a two-level cache, a WeakHashMap of HashMaps, the WeakHashMap keyed by IndexReader, the HashMap keyed by Entry. I think the Entry class can also be changed to not include an IndexReader field.
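
A rough sketch of the two-level structure described above (not the actual FieldCache code): the outer map is weakly keyed by IndexReader, so cached sort data disappears when a reader is garbage collected; the inner map is keyed by the Entry (field name plus sort type).

  import java.util.HashMap;
  import java.util.Map;
  import java.util.WeakHashMap;
  import org.apache.lucene.index.IndexReader;

  public class TwoLevelCacheSketch {
    private final Map readerCache = new WeakHashMap();  // IndexReader -> HashMap

    public synchronized Object lookup(IndexReader reader, Object entry) {
      Map inner = (Map) readerCache.get(reader);
      return (inner == null) ? null : inner.get(entry);
    }

    public synchronized void store(IndexReader reader, Object entry, Object value) {
      Map inner = (Map) readerCache.get(reader);
      if (inner == null) {
        inner = new HashMap();
        readerCache.put(reader, inner);
      }
      inner.put(entry, value);
    }
  }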

Re: Weighting database fields

2004-07-21 Thread Doug Cutting
Ernesto De Santis wrote: If a field has a boost value set at index time, and at search time the query has another boost value for that field, what happens? Which value is used for the boost? The two boosts are both multiplied into the score. Doug

Re: Logic of score method in hits class

2004-07-26 Thread Doug Cutting
Lucene scores are not percentages. They really only make sense compared to other scores for the same query. If you like percentages, you can divide all scores by the first score and multiply by 100. Doug lingaraju wrote: Dear All How the score method works(logic) in Hits class For 100% match
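
A small sketch of the divide-by-top-score trick (the "title" field name is an assumption):

  import java.io.IOException;
  import org.apache.lucene.search.Hits;

  public class RelativeScores {
    public static void print(Hits hits) throws IOException {
      if (hits.length() == 0) return;
      float top = hits.score(0);  // best score for this query
      for (int i = 0; i < hits.length(); i++) {
        System.out.println(hits.doc(i).get("title") + ": "
            + (int) (hits.score(i) / top * 100) + "%");
      }
    }
  }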

Re: Boosting documents

2004-07-26 Thread Doug Cutting
Rob Clews wrote: I want to do the same, set a boost for a field containing a date that lowers as the date is further from now, is there any way I could do this? You could implement Similarity.idf(Term, Searcher) to, when Term.field().equals(date), return a value that is greater for more recent

Re: over 300 GB to index: feasability and performance issue

2004-07-26 Thread Doug Cutting
Vincent Le Maout wrote: I have to index a huge, huge amount of data: about 10 million documents making up about 300 GB. Is there any technical limitation in Lucene that could prevent me from processing such an amount (I mean, of course, apart from the external limits induced by the hardware: RAM,

Re: Caching of TermDocs

2004-07-27 Thread Doug Cutting
John Patterson wrote: I would like to hold a significant amount of the index in memory but use the disk index as a spill over. Obviously the best situation is to hold in memory only the information that is likely to be used again soon. It seems that caching TermDocs would allow popular search

Re: Hit Score [ Between ]

2004-08-04 Thread Doug Cutting
You could instead use a HitCollector to gather only documents with scores in that range. Doug Karthik N S wrote: Hi Apologies If I want to get all the hits for Scores between 0.5f to 0.8f, I usually use query = QueryParser.parse(srchkey, Fields, analyzer); int tothits =
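
A sketch of such a HitCollector (index path, field and query are placeholders; note that the collector sees raw scores, which may differ from the normalized scores reported by Hits):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.HitCollector;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  public class ScoreRangeSearch {
    public static void main(String[] args) throws Exception {
      IndexSearcher searcher = new IndexSearcher("index");
      Query query = QueryParser.parse("lucene", "contents", new StandardAnalyzer());
      searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
          if (score >= 0.5f && score <= 0.8f) {
            System.out.println("doc " + doc + " score " + score);
          }
        }
      });
      searcher.close();
    }
  }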

Re: Split an existing index into smaller segments without a re-index?

2004-08-04 Thread Doug Cutting
Kevin A. Burton wrote: Is it possible to take an existing index (say 1G) and break it up into a number of smaller indexes (say 10 100M indexes)... I don't think there's currently an API for this but it's certainly possible (I think). Yes, it is theoretically possible but not yet implemented. An

Re: NegativeArraySizeException when creating a new IndexSearcher

2004-08-20 Thread Doug Cutting
Looks to me like you're using an older version of Lucene on your Linux box. The code is back-compatible, it will read old indexes, but Lucene 1.3 cannot read indexes created by Lucene 1.4, and will fail in the way you describe. Doug Sven wrote: Hi! I have a problem to port a Lucene based

Re: Debian build problem with 1.4.1

2004-08-20 Thread Doug Cutting
I can successfully use gcc 3.4.0 with Lucene as follows: ant jar jar-demo gcj -O3 build/lucene-1.5-rc1-dev.jar build/lucene-demos-1.5-rc1-dev.jar -o indexer --main=org.apache.lucene.demo.IndexHTML ./indexer -create docs It runs pretty snappy too! However I don't know if there's much mileage in

Re: speeding up queries (MySQL faster)

2004-08-22 Thread Doug Cutting
Yonik Seeley wrote: Setup info Stats: - 4.3M documents, 12 keyword fields per document, 11 [ ... ] field1:4 AND field2:188453 AND field3:1 field1:4 done alone selects around 4.2M records field2:188453 done alone selects around 1.6M records field3:1 done alone selects around 1K records

Re: telling one version of the index from another?

2004-09-07 Thread Doug Cutting
Bill Janssen wrote: Hi. Hey, Bill. It's been a long time! I've got a Lucene application that's been in use for about two years. Some users are using Lucene 1.2, some 1.3, and some are moving to 1.4. The indices seem to behave differently under each version. I'd like to add code to my application

Re: Possible to remove duplicate documents in sort API?

2004-09-07 Thread Doug Cutting
Kevin A. Burton wrote: My problem is that I have two machines... one for searching, one for indexing. The searcher has an existing index. The indexer found an UPDATED document and then adds it to a new index and pushes that new index over to the searcher. The searcher then reloads and when

Re: Why doesn't Document use a HashSet instead of a LinkedList (DocumentFieldList)

2004-09-07 Thread Doug Cutting
Kevin A. Burton wrote: It looks like Document.java uses its own implementation of a LinkedList.. Why not use a HashMap to enable O(1) lookup... right now field lookup is O(N) which is certainly no fun. Was this benchmarked? Perhaps there's the assumption that since documents often have few

Re: maximum index size

2004-09-08 Thread Doug Cutting
Chris Fraschetti wrote: I've seen throughout the list mentions of millions of documents.. 8 million, 20 million, etc etc.. but can lucene potentially handle billions of documents and still efficiently search through them? Lucene can currently handle up to 2^31 documents in a single index. To a

Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Doug Cutting
Bill Janssen wrote: I'd think that if a user specified a query cutting lucene, with an implicit AND and the default fields title and author, they'd expect to see a match in which both cutting and lucene appears. That is, (title:cutting OR author:cutting) AND (title:lucene OR author:lucene) Your

Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an

Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
David Spencer wrote: Good heuristics but are there any more precise, standard guidelines as to how to balance or combine what I think are the following possible criteria in suggesting a better choice: Not that I know of. - ignore(penalize?) terms that are rare I think this one is easy to

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Doug Cutting
It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should

Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-10 Thread Doug Cutting
Daniel Naber wrote: On Thursday 09 September 2004 18:52, Doug Cutting wrote: I have not been able to construct a two-word query that returns a page without both words in either the content, the title, the url or in a single anchor. Can you? Like this one? konvens leitseite Leitseite is only

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread Doug Cutting
David Spencer wrote: Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection. I keep thinking over this one and I don't understand it. If a user misspells a word and the did you mean spelling correction algorithm determines

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
Andrzej Bialecki wrote: I was wondering about the way you build the n-gram queries. You basically don't care about their position in the input term. Originally I thought about using PhraseQuery with a slop - however, after checking the source of PhraseQuery I realized that this probably

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms (recursive and descent) and suggests alternatives to recursize...thus if

Re: Running OutOfMemory while optimizing and searching

2004-09-17 Thread Doug Cutting
John Z wrote: We have indexes of around 1 million docs and around 25 searchable fields. We noticed that without any searches performed on the indexes, on startup, the memory taken up by the searcher is roughly 7 times the .tii file size. The .tii file is read into memory as per the code. Our .tii

Re: problem with get/setBoost of document fields

2004-09-23 Thread Doug Cutting
You can change field boosts without re-indexing. http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#setNorm(int,%20java.lang.String,%20byte) Doug Bastian Grimm [Eastbeam GmbH] wrote: thanks for your reply, eric. so i am right that its not possible to change the
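
A minimal sketch of adjusting a stored norm through that API (document id, field name and value are made up; the norm encodes boost times length normalization, so rewriting it changes the field's effective boost without re-indexing):

  import org.apache.lucene.index.IndexReader;

  public class AdjustFieldBoost {
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open("index");
      reader.setNorm(42, "title", 2.0f);  // the float overload encodes the value into the norm byte
      reader.close();
    }
  }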

Re: demo HTML parser question

2004-09-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote: We were originally attempting to use the demo html parser (Lucene 1.2), but as you know, it's for a demo. I think it's threaded to optimize on time, to allow the calling thread to grab the title or top message even though it's not done parsing the entire html document.

Re: Document contents split among different Fields

2004-09-23 Thread Doug Cutting
Greg Langmead wrote: Am I right in saying that the design of Token's support for highlighting really only supports having the entire document stored as one monolithic contents Field? No, I don't think so. Has anyone tackled indexing multiple content Fields before that could shed some light? Do you

Re: sorting and score ordering

2004-10-13 Thread Doug Cutting
Paul Elschot wrote: Along with that, is there a simple way to assign a new scorer to the searcher? So I can use the same lucene algorithm for my hits, but tweak it a little to fit my needs? There is no one-to-one relationship between a searcher and a scorer. But you can use a different Similarity

Re: Shouldnt IndexWriter.flushRamSegments() be public? or at least protected?

2004-09-28 Thread Doug Cutting
Christian Rodriguez wrote: Now the problem I have is that I don't have a way to force a flush of the IndexWriter without closing it, and I need to do that before committing a transaction or I would get random errors. Shouldn't that function be public, in case the user wants to force a flush at some

Re: problem with get/setBoost of document fields

2004-09-29 Thread Doug Cutting
Bastian Grimm [Eastbeam GmbH] wrote: That works... but I have to do this setNorm() for each document that has been indexed up to now, right? There are roughly 1 million docs in the index... I don't think it's a good idea to perform a search and do it for every doc (and every field of the

Re: removing duplicate Documents from Hits

2004-10-01 Thread Doug Cutting
Timm, Andy (ETW) wrote: Hello, I've searched previous posts on this topic but couldn't find an answer. I want to query my index (which is built from a number of 'flattened' Oracle tables) for some criteria, then return Hits such that there are no Documents that duplicate a particular field. In the case

new release: 1.4.2

2004-10-01 Thread Doug Cutting
There's a new release of Lucene, 1.4.2, which mostly fixes bugs in 1.4.1. Details are at http://jakarta.apache.org/lucene/. Doug

Re: multifield-boolean vs singlefield-enum query performance

2004-10-07 Thread Doug Cutting
Tea Yu wrote: For the following implementations: 1) storing boolean strings in fields X and Y separately 2) storing the same info in a field XY as 3 enums: X, Y, B, N meaning only X is True, only Y is True, both are True or both are False Is there significant performance gain when we substitute

Re: Sort regeneration in multithreaded server

2004-10-08 Thread Doug Cutting
Stephen Halsey wrote: I was wondering if anyone could help with a problem (or should that be challenge?) I'm having using Sort in Lucene over a large number of records in a multi-threaded server program on a continually updated index. I am using lucene-1.4-rc3. A number of bugs with the sorting code

Re: locking problems

2004-10-08 Thread Doug Cutting
Aad Nales wrote: 1. can I have one or multiple searchers open when I open a writer? 2. can I have one or multiple readers open when I open a writer? Yes, with one caveat: if you've called the IndexReader methods delete(), undelete() or setNorm() then you may not open an IndexWriter until you've

Re: Search speed

2004-11-02 Thread Doug Cutting
Jeff Munson wrote: Single word searches return pretty fast, but when I try phrases, searching seems to slow considerably. [ ... ] However, if I use this query, contents:all parts including picture tube guaranteed, it returns hits in 2890 milliseconds. Other phrases take longer as well. You could

Re: Backup strategies

2004-11-16 Thread Doug Cutting
Christoph Kiehl wrote: I'm curious about your strategy to backup indexes based on FSDirectory. If I do a file based copy I suspect I will get corrupted data because of concurrent write access. My current favorite is to create an empty index and use IndexWriter.addIndexes() to copy the current
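
A sketch of the addIndexes()-based backup Christoph describes, with placeholder paths; the copy is made through Lucene itself, under its own locking, rather than by raw file copies that can race with concurrent writes:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class BackupIndex {
    public static void main(String[] args) throws Exception {
      IndexWriter backup = new IndexWriter("backup-index", new StandardAnalyzer(), true);
      backup.addIndexes(new Directory[] { FSDirectory.getDirectory("live-index", false) });
      backup.close();
    }
  }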
