Re: How to order search results by Field value?

2004-03-25 Thread Doug Cutting
Eric Jain wrote: Just to clarify things: Does the current solution require all fields that can be used for sorting to be loaded and kept in memory? (I guess you can answer this question faster than I can figure it out by myself :-) Field values are loaded into memory. But values are kept in an arr

Re: How to order search results by Field value?

2004-03-25 Thread Doug Cutting
Eric Jain wrote: That's reasonable. What I didn't quite understand yet: If I sort on a string field, will Lucene need to keep all values in memory all the time, or only during startup? It will cache one instance of each unique value. So if you have a million documents and string sort results on a

Re: How to order search results by Field value?

2004-03-25 Thread Doug Cutting
Eric Jain wrote: I will need to have a look at the code, but I assume that in principle it should be possible to replace the strings with sequential integers once the sorting is done? I don't understand the question. Doug - To un

Re: Lucene 1.4 - lobby for final release

2004-03-26 Thread Doug Cutting
Chad Small wrote: thanks Erik. Ok this is my official lobby effort for the release of 1.4 to final status. Anyone else need/want a 1.4 release? Does anyone have any information on 1.4 release plans? I'd like to make an RC once I manage to fix bug #27799, which will hopefully be soon. Doug

Re: Demoting results

2004-03-29 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I have not been able to work out how to get custom coordination going to demote results based on a specific term [ ... ] Yeah, it's a little more complicated than perhaps it should be. I've attached a class which does this. I think it's faster and more effective than wh

Re: Lucene 1.4 - lobby for final release

2004-03-29 Thread Doug Cutting
Charlie Smith wrote: I'll vote yes please release new version with "too many files open" fixed. There is no "too many files open bug", except perhaps in your application. It is however an easy to encounter problem if you don't close indexes or if you change Lucene's default parameters. It will

Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Doug Cutting
Kevin A. Burton wrote: We're using Lucene with one large target index which right now is 5G. Every night we take sub-indexes which are about 500M and merge them into this main index. This merge (done via IndexWriter.addIndexes(Directory[])) is taking way too much time. Looking at the stats f

Re: Overriding coordination

2004-03-29 Thread Doug Cutting
Boris Goldowsky wrote: I have a situation where I'm querying for something in several fields, with a clause similar to this: (title:(two words)^20 keywords:(two words)^10 body:(two words)) Some good documents are being scored too low if the query terms do not occur in the "body" field. I naive

Re: Demoting results

2004-03-29 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally useful than my implementation :-) Great! Glad to hear it was useful. BTW, I've had a thought about your suggestion for making the highlighter use some form of RAMindex of sentence fragments

Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Doug Cutting
Kevin A. Burton wrote: One way to force larger read-aheads might be to pump up Lucene's input buffer size. As an experiment, try increasing InputStream.BUFFER_SIZE to 1024*1024 or larger. You'll want to do this just for the merge process and not for searching and indexing. That should help yo

Re: Javadocs lucene 1.4

2004-03-29 Thread Doug Cutting
Lucene 1.4 has not been released. Until it is released, you need to check out the sources from CVS and build them, including javadoc. Doug Stephane James Vaucher wrote: Are the javadocs available on the site? I'd like to see the javadocs for lucene-1.4 (specifically SpanQuery) somewhere on the

Re: Lucene optimization with one large index and numerous small indexes.

2004-03-30 Thread Doug Cutting
Esmond Pitt wrote: Don't want to start a buffer size war, but these have always seemed too small to me. I'd recommend upping both InputStream and OutputStream buffer sizes to at least 4k, as this is the cluster size on most disks these days, and also a common VM page size. Okay. Reading and writin

Re: Performance of hit highlighting and finding term positions for a specific document

2004-03-31 Thread Doug Cutting
Kevin A. Burton wrote: I'm playing with this package: http://home.clara.net/markharwood/lucene/highlight.htm Trying to do hit highlighting. This implementation uses another Analyzer to find the positions for the result terms. It seems that this is very inefficient. Does it just seem inefficient,

Re: Performance of hit highlighting and finding term positions for

2004-03-31 Thread Doug Cutting
[EMAIL PROTECTED] wrote: As a note of warning: I did find StandardTokenizer to be the major culprit in my tokenizing benchmarks (avg 75ms for 16k sized docs). I have found I can live without StandardTokenizer in my apps. FYI, the message with Mark's timings can be found at: http://nagoya.apache.o

Re: Performance of hit highlighting and finding term positions for

2004-03-31 Thread Doug Cutting
Doug Cutting wrote: According to these, if your documents average 16k, then a 10-hit result page would require just 66ms to generate highlights using SimpleAnalyzer. Oops. That should be 110ms. Doug

Re: Performance of hit highlighting and finding term positions for

2004-03-31 Thread Doug Cutting
Kevin A. Burton wrote: Doug Cutting wrote: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1413989 According to these, if your documents average 16k, then a 10-hit result page would require just 66ms to generate highlights using SimpleAnalyzer. The whole search takes only 3

Re: Wierd Search Behavior

2004-04-01 Thread Doug Cutting
Terry, Can you please try to develop a reproducible test case? Otherwise it's impossible to verify and debug this. For something like this it would suffice to provide: 1. The initial index, which satisfies the test queries; 2. The new index you add; 3. Your merge and test code, as a s

Re: Iterernal Document Numbers

2004-04-01 Thread Doug Cutting
Joe Rayguy wrote: So, assuming that sort as implemented in 1.4 doesn't work for me, my original question still stands. Do I have to worry about merges that occur as documents are added, or do I only have to rebuild my array after optimizations? Or, alternatively, how did everyone sort before 1.4?

Re: Find all Words in a Document

2004-04-06 Thread Doug Cutting
peters marcus wrote: is there a way to get all words stored in the index for a given document Yes, in the 1.4 release: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getTermFreqVectors(int) Doug

Re: "starts with" query functionality

2004-04-06 Thread Doug Cutting
Chad Small wrote: We have a requirement to return documents with a "title" field that starts with a certain letter. Is there a way to do something like this? We're using the StandardAnalyzer. Example title fields: This is the title of a document. And this is a title of a different document.

Re: Index partitioning

2004-04-06 Thread Doug Cutting
Magnus Mellin wrote: I would like to partition an index over X number of remote searchers. Any ideas, or suggestions, on how to use the same term dictionary (one that represents the terms and frequencies for the whole document collection) over all my indices? Try using a ParallelMultiSearcher com

Re: verifying index integrity

2004-04-06 Thread Doug Cutting
Weir, Michael wrote: I assume that it is possible to corrupt an index by crashing at just the right time. It should not be possible to corrupt an index this way. I notice that there's a method IndexReader.unlock(). Does this method ensure that the index has not been corrupted? If you use this met

Re: verifying index integrity

2004-04-08 Thread Doug Cutting
Weir, Michael wrote: So if our server is the only process that ever opens the index, I should be able to run through the indexes at startup and simply unlock them? Yes. Doug

Re: Locking problems with NFS

2004-04-20 Thread Doug Cutting
Francesco Bellomi wrote: we are experiencing some difficulties in using Lucene with an NFS filesystem. Basically, locking seems not to work properly, since it appears that attempted concurrent writes to the index (from different VMs) are not blocked, and this often causes the index to be corrupted.

Re: Locking problems with NFS

2004-04-20 Thread Doug Cutting
Francesco Bellomi wrote: The only problem is that, as of Lucene 1.4rc2, FSDirectory is 'final'. Please submit a patch to lucene-dev to make FSDirectory non-final. In fact, a third architectural approach would be to define an API for "pluggable" lock implementations: IMHO that would be more robust to

new Lucene release: 1.4 RC3

2004-05-11 Thread Doug Cutting
Version 1.4 RC3 of Lucene is available for download from: http://cvs.apache.org/dist/jakarta/lucene/v1.4-rc3/ Changes are described at: http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.85 Doug

Re: BooleanQuery.add()

2004-05-13 Thread Doug Cutting
Leonid Portnoy wrote: Am I misunderstanding something here, or is the documentation unclear? The documentation is unclear. Can you propose an improvement? Doug

Re: SQLDirectory implementation

2004-05-11 Thread Doug Cutting
code. ( see test code ) 2.) The first search is always really slow as everything initializes and the cache fills ;) so don't let that discourage you. -vito On Mon, 2004-04-26 at 14:59, Doug Cutting wrote: Anthony Vito wrote: I noticed some talk on SQLDirectory a month or so ago. . Di

Re: exact the same score from different documents

2004-05-14 Thread Doug Cutting
hui wrote: I am getting exactly the same score, like 0.04809519, for different size documents for some queries and this happens quite frequently. Based on the score formula, it seems this should rarely happen. Or do I misunderstand the formula? Normalization factors (& document boosts) are represented

Re: Not deleting temp files after updating/optimising.

2004-04-26 Thread Doug Cutting
Win32 seems to sometimes not permit one to delete a file immediately after it has been closed. Because of this, Lucene keeps a list of files that need to be deleted in the 'deleteable' file. Are your files listed in this file? If so, Lucene will again try to delete these files the next time

Re: SQLDirectory implementation

2004-04-26 Thread Doug Cutting
Anthony Vito wrote: I noticed some talk on SQLDirectory a month or so ago. ( I just joined the list :) ) I have a JDBC implementation that stores the "files" in a couple of tables and stores the data for the files as blocks (BLOBs) of a certain size ( 16k by default ). It also has an LRU cache fo

Re: need info for database based Lucene but not flat file

2004-04-27 Thread Doug Cutting
Yukun Song wrote: As is known, Lucene currently uses flat files to store information for indexing. Does anyone have ideas or resources for combining a database (like MySQL or PostgreSQL) and Lucene instead of the current flat index file formats? A few folks have implemented an SQL-based Lucene Directory, but n

Re: "phrase search" AND term

2004-04-27 Thread Doug Cutting
Ioan Miftode wrote: I recently upgraded to lucene 1.4 RC2 because I needed some sorting capabilities. However some phrase searches don't work anymore (the hits don't even have the terms I'm searching on). Try the latest CVS. There were some bugs in 1.4RC2 that have been fixed. (We'll probably do

Re: need info for database based Lucene but not flat file

2004-04-27 Thread Doug Cutting
Incze Lajos wrote: Could anybody summarize what would be the technical pros/cons of a DB-based directory over the flat files? (What I see at the moment is that for some - significant? - performance penalty you'll get an index available over the network for multiple lucene engines -- if I'm right.) h

Re: Understanding Boolean Queries

2004-04-29 Thread Doug Cutting
Please don't crosspost to lucene-user and lucene-dev! Tate Avery wrote: 3) The maxClauseCount threshold appears not to care whether or not my clauses are 'required' or 'prohibited'... only how many of them there are in total. That's correct. It is an attempt to stop out-of-memory errors which can

Re: Help with scoring, coordination factor?

2004-04-29 Thread Doug Cutting
Matthew W. Bilotti wrote: We suspect the coordination term is driving down these documents' ranks and we would like to bring those documents back up to where they should be. That sounds right to me. Is there a relatively easy way to implement what we want using Lucene? Would it be better to t

Re: Memory usage

2004-05-26 Thread Doug Cutting
James Dunn wrote: Also I search across about 50 fields but I don't use wildcard or range queries. Lucene uses one byte of RAM per document per searched field, to hold the normalization values. So if you search a 10M document collection with 50 fields, then you'll end up using 500MB of RAM. If
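The arithmetic in this reply can be checked with a few lines of Java. This is only a sketch of the stated rule (one byte per document per searched field); the class and method names are made up for illustration and are not part of Lucene, and "MB" is taken here as 10^6 bytes.

```java
// Hypothetical helper, not a Lucene API: applies the rule of thumb
// "one byte of RAM per document per searched field" for norms.
final class NormMemoryEstimate {

    // Returns the estimated number of bytes held for normalization values.
    static long normBytes(long numDocs, int searchedFields) {
        return numDocs * searchedFields;
    }

    public static void main(String[] args) {
        // The example from the message: 10M documents, 50 searched fields.
        long bytes = normBytes(10_000_000L, 50);
        System.out.println(bytes / 1_000_000 + " MB"); // prints "500 MB"
    }
}
```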

Re: Memory usage

2004-05-26 Thread Doug Cutting
requirements for a search. Does this memory get used only during the search operation itself, or is it referenced by the Hits object or anything else after the actual search completes? Thanks again, Jim --- Doug Cutting <[EMAIL PROTECTED]> wrote: James Dunn wrote: Also I search across ab

Re: problems with lucene in multithreaded environment

2004-06-02 Thread Doug Cutting
Jayant Kumar wrote: We recently tested lucene with an index size of 2 GB which has about 1,500,000 documents, each document having about 25 fields. The frequency of search was about 20 queries per second. This resulted in an average response time of about 20 seconds approx per search. That sounds s

Re: problems with lucene in multithreaded environment

2004-06-04 Thread Doug Cutting
Jayant Kumar wrote: Please find enclosed jvmdump.txt which contains a dump of our search program after about 20 seconds of starting the program. Also enclosed is the file queries.txt which contains a few sample search queries. Thanks for the data. This is exactly what I was looking for. "Thread-14"

Re: problems with lucene in multithreaded environment

2004-06-04 Thread Doug Cutting
Doug Cutting wrote: Please tell me if you are able to simplify your queries and if that speeds things. I'll look into a ThreadLocal-based solution too. I've attached a patch that should help with the thread contention, although I've not tested it extensively. I still don't

Re: problems with lucene in multithreaded environment

2004-06-07 Thread Doug Cutting
Jayant Kumar wrote: Thanks for the patch. It helped in increasing the search speed to a good extent. Good. I'll commit it. Thanks for testing it. But when we tried to give about 100 queries in 10 seconds, then again we found that after about 15 seconds, the response time per query increased. This

Re: Setting Similarity in IndexWriter and IndexSearcher

2004-06-08 Thread Doug Cutting
David Spencer wrote: Does it ever make sense to set the Similarity object in either (only one of..) IndexWriter or IndexSearcher? i.e. If I set it in IndexWriter can I avoid setting it in IndexSearcher? Also, can I avoid setting it in IndexWriter and only set it in IndexSearcher? I noticed Nutch s

Re: Performance: compound vs. multi-file index, indexing and searching

2004-06-08 Thread Doug Cutting
Otis Gospodnetic wrote: Can anyone comment on performance differences? I'd expect multi-threaded performance to be a bit worse with the compound format, but single-threaded performance should be nearly identical. Doug

Re: Proximity Searches behavior

2004-06-10 Thread Doug Cutting
Erik Hatcher wrote: If you want something that does "quick fox*" where "quick" must be followed by something starting with "fox", you'll have to do this through the API, perhaps using the awkwardly named PhrasePrefixQuery, which does support slop also. It would be up to you to do the term expa

Re: Making a case for Lucene

2004-07-01 Thread Doug Cutting
> The best example that I've been able to find is the Yahoo research > lab - as I understand it, this is a Nutch (i.e. Lucene) > implementation that's providing impressive performance over a > 100 million document repository. This demo runs on a handful of boxes. It was originally running on thre

Re: Running OutOfMemory while optimizing and searching

2004-07-01 Thread Doug Cutting
> What do your queries look like? The memory required > for a query can be computed by the following equation: > > 1 Byte * Number of fields in your query * Number of > docs in your index > > So if your query searches on all 50 fields of your 3.5 > Million document index then each search would tak

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Doug Cutting
A mergeFactor of 5000 is a bad idea. If you want to index faster, try increasing minMergeDocs instead. If you have lots of memory this can probably be 5000 or higher. Also, why do you optimize before you're done? That only slows things. Perhaps you have to do it because you've set mergeFacto

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Doug Cutting
Julien, Thanks for the excellent explanation. I think this thread points to a documentation problem. We should improve the javadoc for these parameters to make it easier for folks to In particular, the javadoc for mergeFactor should mention that very large values (>100) are not recommended, sin

Re: indexing help

2004-07-07 Thread Doug Cutting
John Wang wrote: While lucene tokenizes the words in the document, it counts the frequency and figures out the position, we are trying to bypass this stage: For each document, I have a set of words with a known frequency, e.g. java (5), lucene (6) etc. (I don't care about the position, so it ca

Re: indexing help

2004-07-08 Thread Doug Cutting
John Wang wrote: The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6), VectorTokenStream would create a total of 11 Tokens whereas only 2 are necessary. That's easy to fix. We just need to reuse the token: public cl

Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: Also... what can I do to speed up this optimize? Ideally it wouldn't take 6 hours. Was this the index with the mergeFactor of 5000? If so, that's why it's so slow: you've delayed all of the work until the end. Indexing on a ramfs will make things faster in general, howe

Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: So is it possible to fix this index now? Can I just delete the most recent segment that was created? I can find this by ls -alt Sorry, I forgot to answer your question: this should work fine. I don't think you should even have to delete that segment. Also, to elaborate

Re: problem running lucene 1.4 demo on a solaris machine (permission denied)

2004-07-08 Thread Doug Cutting
MATL (Mats Lindberg) wrote: When I copied the lucene jar file to the solaris machine from the windows machine I used an FTP program. FTP probably mangled the file. You need to use FTP's binary mode. Doug

Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: No... I changed the mergeFactor back to 10 as you suggested. Then I am confused about why it should take so long. Did you by chance set the IndexWriter.infoStream to something, so that it logs merges? If so, it would be interesting to see that output, especially the last e

Re: Lucene shouldn't use java.io.tmpdir

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: This is why I think it makes more sense to use our own java.io.tmpdir to be on the safe side. I think the bug is that Tomcat changes java.io.tmpdir. I thought that the point of the system property java.io.tmpdir was to have a portable name for /tmp on unix, c:\windows\tmp

Re: indexing help

2004-07-08 Thread Doug Cutting
John Wang wrote: Just for my education, can you maybe elaborate on using the "implement an IndexReader that delivers a synthetic index" approach? IndexReader is an abstract class. It has few data fields, and few non-static methods that are not implemented in terms of abstract methods. So, in ef

Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: During an optimize I assume Lucene starts writing to a new segment and leaves all others in place until everything is done and THEN deletes them? That's correct. The only settings I use are: targetIndex.mergeFactor=10; targetIndex.minMergeDocs=1000; the resulting index has

Re: Way to repair an index broking during 1/2 optimize?

2004-07-09 Thread Doug Cutting
Kevin A. Burton wrote: With the typical handful of fields, one should never see more than hundreds of files. We only have 13 fields... Though to be honest I'm worried that even if I COULD do the optimize that it would run out of file handles. Optimization doesn't open all files at once. The mos

Re: Lucene shouldn't use java.io.tmpdir

2004-07-09 Thread Doug Cutting
Armbrust, Daniel C. wrote: The problem I ran into the other day with the new lock location is that Person A had started an index, ran into problems, erased the index and asked me to look at it. I tried to rebuild the index (in the same place on a Solaris machine) and found out that A) - her locks

Re: Why is Field.java final?

2004-07-11 Thread Doug Cutting
Kevin A. Burton wrote: I was going to create a new IDField class which just calls super( name, value, false, true, false) but noticed I was prevented because Field.java is final? You don't need to subclass to do this, just a static method somewhere. Why is this? I can't see any harm in making it

Re: Field.java -> STORED, NOT_STORED, etc...

2004-07-11 Thread Doug Cutting
Kevin A. Burton wrote: So I added a few constants to my class: new Field( "name", "value", NOT_STORED, INDEXED, NOT_TOKENIZED ); which IMO is a lot easier to maintain. Why not add these constants to Field.java: public static final boolean STORED = true; public static final boolean NOT_STORED

Re: Field.java -> STORED, NOT_STORED, etc...

2004-07-11 Thread Doug Cutting
Doug Cutting wrote: The calls would look like: new Field("name", "value", Stored.YES, Indexed.NO, Tokenized.YES); Stored could be implemented as the nested class: public final class Stored { private Stored() {} public static final Stored YES = new Stored(); public st
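The typesafe-constant idea in this message can be sketched in a self-contained form. Everything below is illustrative: `FieldFlags` and `describe` are hypothetical names, only two of the three flag types are shown, and this is not Lucene's actual `Field` API. The `NO` constants are assumed by analogy with the `Indexed.NO` call quoted above.

```java
// Hypothetical sketch of the typesafe-constant pattern; not Lucene code.
final class FieldFlags {

    // Each flag is its own type with exactly two instances.
    static final class Stored {
        static final Stored YES = new Stored();
        static final Stored NO = new Stored();
        private Stored() {} // private: no instances beyond YES and NO
    }

    static final class Indexed {
        static final Indexed YES = new Indexed();
        static final Indexed NO = new Indexed();
        private Indexed() {}
    }

    // Because the parameter types differ, passing the flags in the wrong
    // order is a compile-time error, unlike a run of three booleans.
    static String describe(Stored stored, Indexed indexed) {
        return "stored=" + (stored == Stored.YES)
             + ", indexed=" + (indexed == Indexed.YES);
    }

    public static void main(String[] args) {
        System.out.println(describe(Stored.YES, Indexed.NO));
        // describe(Indexed.NO, Stored.YES) would not compile.
    }
}
```

Compared with boolean constants such as STORED and NOT_STORED, this form lets the compiler reject arguments given in the wrong position.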

Re: AW: Understanding TooManyClauses-Exception and Query-RAM-size

2004-07-12 Thread Doug Cutting
[EMAIL PROTECTED] wrote: What I really would like to see are some best practices or some advice from some users who are working with really large indices how they handle this situation, or why they don't have to care about it or maybe why I am completely missing the point ;-)) Many folks with re

Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote: First let me explain what I found out. I'm running Lucene on a 4 CPU server. While doing some stress tests I've noticed (by doing full thread dump) that searching threads are blocked on the method: public FieldInfo fieldInfo(int fieldNumber) This causes significant CPU idle time

Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Doug Cutting
Aviran wrote: I use Lucene 1.4 final Here is the thread dump for one blocked thread (If you want a full thread dump for all threads I can do that too) Thanks. I think I get the point. I recently removed a synchronization point higher in the stack, so that now this one shows up! Whether or not y

Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-13 Thread Doug Cutting
Aviran wrote: I changed the Lucene 1.4 final source code and yes this is the source version I changed. Note that this patch won't produce a speedup on earlier releases, since there was another multi-thread bottleneck higher up the stack that was only recently removed, revealing this lower-lev

Re: Why is Field.java final?

2004-07-13 Thread Doug Cutting
Kevin A. Burton wrote: Doug Cutting wrote: Field and Document are not designed to be extensible. They are persisted in such a way that added methods are not available when the field is restored. In other words, when a field is read, it always constructs an instance of Field, not a subclass

Re: Why is Field.java final?

2004-07-13 Thread Doug Cutting
John Wang wrote: On the same thought, how about the org.apache.lucene.analysis.Token class. Can we make it non-final? Sure, if you make a case for why it should be non-final. What would your subclasses do? Which methods would you override? Doug

Re: Pool of IndexReaders or Pool of Searchers?

2004-07-13 Thread Doug Cutting
Whether this will make a difference depends on the size of the index. If your index is relatively small, then this patch will help more. If your index is large, it will help less. Aviran wrote: Try to compile these code changes into lucene http://www.mail-archive.com/[EMAIL PROTECTED]/msg06116.h

Re: Scoring without normalization!

2004-07-15 Thread Doug Cutting
Have you looked at: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html in particular, at: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#lengthNorm(java.lang.String,%20int) http://jakarta.apache.org/lucene/docs/api/org/apache/lucen

Re: Token or not Token, PerFieldAnalyzer

2004-07-15 Thread Doug Cutting
Florian Sauvin wrote: Everywhere in the documentation (and it seems logical) you say to use the same analyzer for indexing and querying... how is this handled on not tokenized fields? Imperfectly. The QueryParser knows nothing about the index, so it does not know which fields were tokenized and wh

Re: release & migration plan

2004-07-15 Thread Doug Cutting
fp235-5 wrote: I am looking at the code to implement setIndexInterval() in IndexWriter. I'd like to have your opinion on the best way to do it. Currently the creation of an instance of TermInfosWriter requires the following steps: ... IndexWriter.addDocument(Document) IndexWriter.addDocument(Docume

Re: Post-sorted inverted index?

2004-07-20 Thread Doug Cutting
You can define a subclass of FilterIndexReader that re-sorts documents in TermPositions(Term) and document(int), then use IndexWriter.addIndexes() to write this in Lucene's standard format. I have done this in Nutch, with the (as yet unused) IndexOptimizer. http://cvs.sourceforge.net/viewcvs.p

Re: Very slow IndexReader.open() performance

2004-07-20 Thread Doug Cutting
Optimization should not require huge amounts of memory. Can you tell a bit more about your configuration: What JVM? What OS? How many fields? What mergeFactor have you used? Also, please attach the output of 'ls -l' of your index directory, as well as the stack trace you see when OutOfMemo

Re: Sort: 1.4-rc3 vs. 1.4-final

2004-07-21 Thread Doug Cutting
The key in the WeakHashMap should be the IndexReader, not the Entry. I think this should become a two-level cache, a WeakHashMap of HashMaps, the WeakHashMap keyed by IndexReader, the HashMap keyed by Entry. I think the Entry class can also be changed to not include an IndexReader field. Doe
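The two-level layout described here might look roughly like the following. This is a generic sketch, not the Lucene sort-cache code: `TwoLevelCache` and its type parameters are placeholders, with `R` standing in for the IndexReader and `K` for the Entry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch of a WeakHashMap of HashMaps.
// R plays the role of the reader, K the role of the per-field entry.
final class TwoLevelCache<R, K, V> {

    // Outer map holds readers weakly, so a reader's whole inner map can be
    // dropped once nothing else references that reader.
    private final Map<R, Map<K, V>> byReader = new WeakHashMap<>();

    synchronized V get(R reader, K key) {
        Map<K, V> inner = byReader.get(reader);
        return inner == null ? null : inner.get(key);
    }

    synchronized void put(R reader, K key, V value) {
        byReader.computeIfAbsent(reader, r -> new HashMap<>()).put(key, value);
    }
}
```

For the weak outer key to be collectable, neither the inner keys nor the cached values may hold a strong reference back to the reader, which fits the suggestion that Entry drop its IndexReader field.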

Re: Weighting database fields

2004-07-21 Thread Doug Cutting
Ernesto De Santis wrote: If some field has a boost value set at index time, and at search time the query has another boost value for this field, what happens? Which value is used for the boost? The two boosts are both multiplied into the score. Doug

Re: Logic of score method in hits class

2004-07-26 Thread Doug Cutting
Lucene scores are not percentages. They really only make sense compared to other scores for the same query. If you like percentages, you can divide all scores by the first score and multiply by 100. Doug lingaraju wrote: Dear All How the score method works(logic) in Hits class For 100% match
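The suggested conversion (divide every score by the first score, multiply by 100) can be sketched as below. The helper is hypothetical and works on a plain array; it assumes the scores are already ordered best-first, as search results normally are.

```java
// Hypothetical helper, not part of Lucene: rescales scores relative to
// the top hit so that the best match reads as 100.
final class ScorePercent {

    // scores is assumed to be sorted in descending order.
    static float[] toPercent(float[] scores) {
        float[] out = new float[scores.length];
        if (scores.length == 0 || scores[0] <= 0f) {
            return out; // nothing to scale against; leave zeros
        }
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / scores[0] * 100f;
        }
        return out;
    }
}
```

The resulting numbers are still only meaningful relative to other hits for the same query; a "100" here means "best hit", not a perfect match.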

Re: Boosting documents

2004-07-26 Thread Doug Cutting
Rob Clews wrote: I want to do the same, set a boost for a field containing a date that lowers as the date is further from now, is there any way I could do this? You could implement Similarity.idf(Term, Searcher) to, when Term.field().equals("date"), return a value that is greater for more recent

Re: over 300 GB to index: feasability and performance issue

2004-07-26 Thread Doug Cutting
Vincent Le Maout wrote: I have to index a huge, huge amount of data: about 10 million documents making up about 300 GB. Is there any technical limitation in Lucene that could prevent me from processing such amount (I mean, of course, apart from the external limits induced by the hardware: RAM, disks

Re: Caching of TermDocs

2004-07-27 Thread Doug Cutting
John Patterson wrote: I would like to hold a significant amount of the index in memory but use the disk index as a spill over. Obviously the best situation is to hold in memory only the information that is likely to be used again soon. It seems that caching TermDocs would allow popular search ter

Re: Hit & Score [ Between ]

2004-08-04 Thread Doug Cutting
You could instead use a HitCollector to gather only documents with scores in that range. Doug Karthik N S wrote: Hi Apologies If I want to get all the hits for Scores between 0.5f to 0.8f, I usually use query = QueryParser.parse(srchkey,Fields, analyzer); int tothits = searcher.search(q
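A collector that keeps only hits in a score range could be sketched as follows. To stay self-contained this uses a stand-in abstract class, `HitCollectorSketch`, with a `collect(int doc, float score)` callback modelled on Lucene's HitCollector; the stand-in and `ScoreRangeCollector` are assumptions for illustration, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a collector callback; hypothetical, defined here only so
// the sketch compiles without Lucene on the classpath.
abstract class HitCollectorSketch {
    abstract void collect(int doc, float score);
}

// Keeps the ids of documents whose score lies in [min, max].
final class ScoreRangeCollector extends HitCollectorSketch {
    private final float min;
    private final float max;
    final List<Integer> docs = new ArrayList<>();

    ScoreRangeCollector(float min, float max) {
        this.min = min;
        this.max = max;
    }

    @Override
    void collect(int doc, float score) {
        if (score >= min && score <= max) {
            docs.add(doc);
        }
    }
}
```

In a real application the collector would be handed to the searcher instead of iterating over Hits. Scores delivered to a collector may not be normalized the same way as those reported by Hits, so the thresholds are worth verifying against actual output.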

Re: Negative Boost

2004-08-04 Thread Doug Cutting
Terry Steichen wrote: But if, in the future, I or someone else took on this task of enhancing QueryParser, I'd like to be assured that the underlying Lucene engine will accept and support negative boosting. Is that the case? Lucene will multiply negative boosts into scores just like positive ones

Re: Split an existing index into smaller segments without a re-index?

2004-08-04 Thread Doug Cutting
Kevin A. Burton wrote: Is it possible to take an existing index (say 1G) and break it up into a number of smaller indexes (say 10 100M indexes)... I don't think there's currently an API for this but it's certainly possible (I think). Yes, it is theoretically possible but not yet implemented. An ea

Re: NegativeArraySizeException when creating a new IndexSearcher

2004-08-20 Thread Doug Cutting
Looks to me like you're using an older version of Lucene on your Linux box. The code is back-compatible, it will read old indexes, but Lucene 1.3 cannot read indexes created by Lucene 1.4, and will fail in the way you describe. Doug Sven wrote: Hi! I have a problem to port a Lucene based knowl

Re: Debian build problem with 1.4.1

2004-08-20 Thread Doug Cutting
I can successfully use gcc 3.4.0 with Lucene as follows: ant jar jar-demo gcj -O3 build/lucene-1.5-rc1-dev.jar build/lucene-demos-1.5-rc1-dev.jar -o indexer --main=org.apache.lucene.demo.IndexHTML ./indexer -create docs It runs pretty snappy too! However I don't know if there's much mileage in p

Re: speeding up queries (MySQL faster)

2004-08-22 Thread Doug Cutting
Yonik Seeley wrote: Setup info & Stats: - 4.3M documents, 12 keyword fields per document, 11 [ ... ] "field1:4 AND field2:188453 AND field3:1" field1:4 done alone selects around 4.2M records field2:188453 done alone selects around 1.6M records field3:1 done alone selects around 1K record

Re: telling one version of the index from another?

2004-09-07 Thread Doug Cutting
Bill Janssen wrote: Hi. Hey, Bill. It's been a long time! I've got a Lucene application that's been in use for about two years. Some users are using Lucene 1.2, some 1.3, and some are moving to 1.4. The indices seem to behave differently under each version. I'd like to add code to my application

Re: Possible to remove duplicate documents in sort API?

2004-09-07 Thread Doug Cutting
Kevin A. Burton wrote: My problem is that I have two machines... one for searching, one for indexing. The searcher has an existing index. The indexer found an UPDATED document and then adds it to a new index and pushes that new index over to the searcher. The searcher then reloads and when some

Re: Why doesn't Document use a HashSet instead of a LinkedList (DocumentFieldList)

2004-09-07 Thread Doug Cutting
Kevin A. Burton wrote: It looks like Document.java uses its own implementation of a LinkedList.. Why not use a HashMap to enable O(1) lookup... right now field lookup is O(N) which is certainly no fun. Was this benchmarked? Perhaps there's the assumption that since documents often have few field
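
To make the trade-off in the question above concrete, here is a minimal sketch contrasting an O(N) scan over a field list with an O(1) lookup through a HashMap. It is plain Java and not Lucene's Document code: FieldLookupSketch, Field, getLinear and buildIndex are hypothetical names. It assumes a document may repeat a field name, so both versions return the first value added.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a document's field storage; not Lucene code.
final class FieldLookupSketch {
    static final class Field {
        final String name;
        final String value;

        Field(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    // O(N): walk the list and return the first field with a matching name.
    static String getLinear(List<Field> fields, String name) {
        for (Field f : fields) {
            if (f.name.equals(name)) {
                return f.value;
            }
        }
        return null;
    }

    // Build a name -> first-value map once; later lookups are O(1) on average.
    static Map<String, String> buildIndex(List<Field> fields) {
        Map<String, String> byName = new HashMap<>();
        for (Field f : fields) {
            byName.putIfAbsent(f.name, f.value); // keep the first value for a repeated name
        }
        return byName;
    }

    public static void main(String[] args) {
        List<Field> fields = new ArrayList<>();
        fields.add(new Field("title", "Lucene in Action"));
        fields.add(new Field("author", "first author"));
        fields.add(new Field("author", "second author"));

        System.out.println(getLinear(fields, "author"));
        System.out.println(buildIndex(fields).get("author"));
    }
}
```

With only a handful of fields per document the scan touches very few elements, while the map costs extra memory and build time for every document, so which approach wins would have to be measured.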

Re: PDF->Text Performance comparison

2004-09-08 Thread Doug Cutting
Ben Litchfield wrote: PDFBox: slow PDF text extraction for Java applications http://www.pdfbox.org Shouldn't that read, "PDFBox: *free* slow PDF text extraction for Java applications, with Lucene integration"? Doug

Re: maximum index size

2004-09-08 Thread Doug Cutting
Chris Fraschetti wrote: I've seen throughout the list mentions of millions of documents.. 8 million, 20 million, etc etc.. but can lucene potentially handle billions of documents and still efficiently search through them? Lucene can currently handle up to 2^31 documents in a single index. To a la
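
As a rough illustration of the 2^31 figure, the arithmetic below estimates how many separate indexes a collection would need if it were split so that no single index exceeds that ceiling. Splitting across several indexes is an assumption made for this sketch, not something stated in the visible part of the reply, and IndexCapacitySketch and indexesNeeded are hypothetical names.

```java
// Hypothetical helper; only the 2^31 ceiling comes from the thread.
final class IndexCapacitySketch {
    // Per-index document ceiling quoted in the reply: 2^31.
    static final long MAX_DOCS_PER_INDEX = 1L << 31;

    // Smallest number of indexes that can hold totalDocs documents (ceiling division).
    static long indexesNeeded(long totalDocs) {
        if (totalDocs < 0) {
            throw new IllegalArgumentException("totalDocs must be non-negative");
        }
        return (totalDocs + MAX_DOCS_PER_INDEX - 1) / MAX_DOCS_PER_INDEX;
    }

    public static void main(String[] args) {
        System.out.println(MAX_DOCS_PER_INDEX);             // prints 2147483648
        System.out.println(indexesNeeded(20_000_000L));     // prints 1
        System.out.println(indexesNeeded(10_000_000_000L)); // prints 5
    }
}
```

So the 8 million and 20 million document collections mentioned in the question fit in one index with room to spare, while ten billion documents would need at least five under this assumption.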

Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread Doug Cutting
Bill Janssen wrote: I'd think that if a user specified a query "cutting lucene", with an implicit AND and the default fields "title" and "author", they'd expect to see a match in which both "cutting" and "lucene" appears. That is, (title:cutting OR author:cutting) AND (title:lucene OR author:lucen
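
The expansion described in the message above can be sketched with plain string handling: every term is OR-ed across the default fields, and the per-term groups are AND-ed together. This is only an illustration of the expected query shape, not the MultiFieldQueryParser API or the attached fix. MultiFieldExpansionSketch and expand are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the expected expansion; not Lucene's parser.
final class MultiFieldExpansionSketch {
    // Builds "(f1:t1 OR f2:t1) AND (f1:t2 OR f2:t2)" for the given terms and fields.
    static String expand(List<String> terms, List<String> fields) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            List<String> alternatives = new ArrayList<>();
            for (String field : fields) {
                alternatives.add(field + ":" + term);
            }
            // A term may match in any one of the default fields.
            clauses.add("(" + String.join(" OR ", alternatives) + ")");
        }
        // Every term must match somewhere (implicit AND).
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("cutting", "lucene"),
                                  Arrays.asList("title", "author")));
        // prints (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
    }
}
```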

Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
Aad Nales wrote: Before I start reinventing wheels I would like to do a short check to see if anybody else has already tried this. A customer has requested us to look into the possibility to perform a spell check on queries. So far the most promising way of doing this seems to be to create an Analy

Re: combining open office spellchecker with Lucene

2004-09-09 Thread Doug Cutting
David Spencer wrote: Good heuristics but are there any more precise, standard guidelines as to how to balance or combine what I think are the following possible criteria in suggesting a better choice: Not that I know of. - ignore(penalize?) terms that are rare I think this one is easy to threshol

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Doug Cutting
It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fi

Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-10 Thread Doug Cutting
Daniel Naber wrote: On Thursday 09 September 2004 18:52, Doug Cutting wrote: I have not been able to construct a two-word query that returns a page without both words in either the content, the title, the url or in a single anchor. Can you? Like this one? konvens leitseite Leitseite is only in

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-10 Thread Doug Cutting
David Spencer wrote: Doug Cutting wrote: And one should not try correction at all for terms which occur in a large proportion of the collection. I keep thinking over this one and I don't understand it. If a user misspells a word and the "did you mean" spelling correction algori

Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
Andrzej Bialecki wrote: I was wondering about the way you build the n-gram queries. You basically don't care about their position in the input term. Originally I thought about using PhraseQuery with a slop - however, after checking the source of PhraseQuery I realized that this probably wouldn't
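
For readers unfamiliar with the technique under discussion, here is a minimal sketch of splitting a term into character n-grams while discarding where each gram occurred in the term, which is the position-insensitive behaviour the message refers to. It is not the NGramSpeller code: NGramSketch and ngrams are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of character n-grams; not the contributed NGramSpeller.
final class NGramSketch {
    // Returns every substring of length n, left to right; offsets are not recorded.
    static List<String> ngrams(String term, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("lucene", 3)); // prints [luc, uce, cen, ene]
    }
}
```

Because only the grams themselves are kept, two terms that share grams at different offsets look alike to a query built this way.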

Re: frequent terms - Re: combining open office spellchecker with Lucene

2004-09-14 Thread Doug Cutting
David Spencer wrote: [1] The user enters a query like: recursize descent parser [2] The search code parses this and sees that the 1st word is not a term in the index, but the next 2 are. So it ignores the last 2 terms ("descent" and "parser") and suggests alternatives to "recursize"...thu
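
Step [2] above can be sketched as a filter that keeps only the query words missing from the index's term set, since those are the only ones that get spelling suggestions. The in-memory Set stands in for a real term lookup against the index, and SpellCandidateSketch and unknownTerms are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; the Set is a stand-in for looking terms up in an index.
final class SpellCandidateSketch {
    // Returns the query words that are not known terms, in query order.
    static List<String> unknownTerms(String query, Set<String> indexTerms) {
        List<String> unknown = new ArrayList<>();
        for (String word : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty() && !indexTerms.contains(word)) {
                unknown.add(word);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        Set<String> indexTerms = new HashSet<>(Arrays.asList("recursive", "descent", "parser"));
        System.out.println(unknownTerms("recursize descent parser", indexTerms));
        // prints [recursize]
    }
}
```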
