Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: During an optimize I assume Lucene starts writing to a new segment and leaves all others in place until everything is done and THEN deletes them? That's correct. The only settings I use are: targetIndex.mergeFactor=10; targetIndex.minMergeDocs=1000; the resulting index has

Re: indexing help

2004-07-08 Thread Doug Cutting
John Wang wrote: Just for my education, can you maybe elaborate on using the "implement an IndexReader that delivers a synthetic index" approach? IndexReader is an abstract class. It has few data fields, and few non-static methods that are not implemented in terms of abstract methods. So, in ef

Re: Lucene shouldn't use java.io.tmpdir

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: This is why I think it makes more sense to use our own java.io.tmpdir to be on the safe side. I think the bug is that Tomcat changes java.io.tmpdir. I thought that the point of the system property java.io.tmpdir was to have a portable name for /tmp on unix, c:\windows\tmp

Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: No... I changed the mergeFactor back to 10 as you suggested. Then I am confused about why it should take so long. Did you by chance set the IndexWriter.infoStream to something, so that it logs merges? If so, it would be interesting to see that output, especially the last e

Re: problem running lucene 1.4 demo on a solaris machine (permission denied)

2004-07-08 Thread Doug Cutting
MATL (Mats Lindberg) wrote: When I copied the Lucene jar file to the Solaris machine from the Windows machine I used an FTP program. FTP probably mangled the file. You need to use FTP's binary mode. Doug - To unsubscribe, e-mail: [

Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: So is it possible to fix this index now? Can I just delete the most recent segment that was created? I can find this by ls -alt Sorry, I forgot to answer your question: this should work fine. I don't think you should even have to delete that segment. Also, to elaborate

Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Doug Cutting
Kevin A. Burton wrote: Also... what can I do to speed up this optimize? Ideally it wouldn't take 6 hours. Was this the index with the mergeFactor of 5000? If so, that's why it's so slow: you've delayed all of the work until the end. Indexing on a ramfs will make things faster in general, howe

Re: indexing help

2004-07-08 Thread Doug Cutting
John Wang wrote: The solution you proposed is still a derivative of creating a dummy document stream. Taking the same example, java (5), lucene (6), VectorTokenStream would create a total of 11 Tokens whereas only 2 are necessary. That's easy to fix. We just need to reuse the token: public cl

Re: indexing help

2004-07-07 Thread Doug Cutting
John Wang wrote: While lucene tokenizes the words in the document, it counts the frequency and figures out the position, we are trying to bypass this stage: For each document, I have a set of words with a known frequency, e.g. java (5), lucene (6) etc. (I don't care about the position, so it ca

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Doug Cutting
Julien, Thanks for the excellent explanation. I think this thread points to a documentation problem. We should improve the javadoc for these parameters to make it easier for folks to In particular, the javadoc for mergeFactor should mention that very large values (>100) are not recommended, sin

Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Doug Cutting
A mergeFactor of 5000 is a bad idea. If you want to index faster, try increasing minMergeDocs instead. If you have lots of memory this can probably be 5000 or higher. Also, why do you optimize before you're done? That only slows things. Perhaps you have to do it because you've set mergeFacto

Re: Running OutOfMemory while optimizing and searching

2004-07-01 Thread Doug Cutting
> What do your queries look like? The memory required > for a query can be computed by the following equation: > > 1 Byte * Number of fields in your query * Number of > docs in your index > > So if your query searches on all 50 fields of your 3.5 > Million document index then each search would tak

Re: Making a case for Lucene

2004-07-01 Thread Doug Cutting
> The best example that I've been able to find is the Yahoo research > lab - as I understand it, this is a Nutch (i.e. Lucene) > implementation that's providing impressive performance over a > 100 million document repository. This demo runs on a handful of boxes. It was originally running on thre

Re: Proximity Searches behavior

2004-06-10 Thread Doug Cutting
Erik Hatcher wrote: If you want something that does "quick fox*" where "quick" must be followed by something starting with "fox", you'll have to do this through the API, perhaps using the awkwardly named PhrasePrefixQuery, which does support slop also. It would be up to you to do the term expa

Re: Performance: compound vs. multi-file index, indexing and searching

2004-06-08 Thread Doug Cutting
Otis Gospodnetic wrote: Can anyone comment on performance differences? I'd expect multi-threaded performance to be a bit worse with the compound format, but single-threaded performance should be nearly identical. Doug

Re: Setting Similarity in IndexWriter and IndexSearcher

2004-06-08 Thread Doug Cutting
David Spencer wrote: Does it ever make sense to set the Similarity obj in either (only one of..) IndexWriter or IndexSearcher? i.e. If I set it in IndexWriter can I avoid setting it in IndexSearcher? Also, can I avoid setting it in IndexWriter and only set it in IndexSearcher? I noticed Nutch s

Re: problems with lucene in multithreaded environment

2004-06-07 Thread Doug Cutting
Jayant Kumar wrote: Thanks for the patch. It helped in increasing the search speed to a good extent. Good. I'll commit it. Thanks for testing it. But when we tried to give about 100 queries in 10 seconds, then again we found that after about 15 seconds, the response time per query increased. This

Re: problems with lucene in multithreaded environment

2004-06-04 Thread Doug Cutting
Doug Cutting wrote: Please tell me if you are able to simplify your queries and if that speeds things. I'll look into a ThreadLocal-based solution too. I've attached a patch that should help with the thread contention, although I've not tested it extensively. I still don't

Re: problems with lucene in multithreaded environment

2004-06-04 Thread Doug Cutting
Jayant Kumar wrote: Please find enclosed jvmdump.txt which contains a dump of our search program after about 20 seconds of starting the program. Also enclosed is the file queries.txt which contains few sample search queries. Thanks for the data. This is exactly what I was looking for. "Thread-14"

Re: problems with lucene in multithreaded environment

2004-06-02 Thread Doug Cutting
Jayant Kumar wrote: We recently tested lucene with an index size of 2 GB which has about 1,500,000 documents, each document having about 25 fields. The frequency of search was about 20 queries per second. This resulted in an average response time of about 20 seconds approx per search. That sounds s

Re: Memory usage

2004-05-26 Thread Doug Cutting
requirements for a search. Does this memory get used only during the search operation itself, or is it referenced by the Hits object or anything else after the actual search completes? Thanks again, Jim --- Doug Cutting <[EMAIL PROTECTED]> wrote: James Dunn wrote: Also I search across ab

Re: Memory usage

2004-05-26 Thread Doug Cutting
James Dunn wrote: Also I search across about 50 fields but I don't use wildcard or range queries. Lucene uses one byte of RAM per document per searched field, to hold the normalization values. So if you search a 10M document collection with 50 fields, then you'll end up using 500MB of RAM. If
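
The arithmetic in the reply above (one byte per document per searched field) can be sketched as a small calculation. This is only an illustration: the class and method names are hypothetical and are not part of Lucene, and the estimate covers only the normalization values mentioned in the message, not any other memory a search may use.

```java
// Hypothetical helper (not a Lucene API): estimates the RAM taken by
// normalization values, assuming one byte per document per searched field.
final class NormMemoryEstimate {

    /** Returns the estimated number of bytes used for norms. */
    static long normBytes(long numDocs, int searchedFields) {
        return numDocs * searchedFields;
    }

    public static void main(String[] args) {
        // The figures from the message: 10M documents, 50 searched fields.
        long bytes = normBytes(10_000_000L, 50);
        System.out.println(bytes / 1_000_000 + " MB"); // prints "500 MB"
    }
}
```

Searching fewer fields per query reduces the estimate proportionally, since only searched fields are counted.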

Re: exact the same score from different documents

2004-05-14 Thread Doug Cutting
hui wrote: I am getting exactly the same score, like 0.04809519, for documents of different sizes for some queries, and this happens quite frequently. Based on the score formula, it seems this should rarely happen. Or do I misunderstand the formula? Normalization factors (& document boosts) are represented

Re: BooleanQuery.add()

2004-05-13 Thread Doug Cutting
Leonid Portnoy wrote: Am I misunderstanding something here, or is the documentation unclear? The documentation is unclear. Can you propose an improvement? Doug

new Lucene release: 1.4 RC3

2004-05-11 Thread Doug Cutting
Version 1.4 RC3 of Lucene is available for download from: http://cvs.apache.org/dist/jakarta/lucene/v1.4-rc3/ Changes are described at: http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.85 Doug

Re: SQLDirectory implementation

2004-05-11 Thread Doug Cutting
code. ( see test code ) 2.) The first search is always really slow as everything initializes and the cache fills ;) so don't let that discourage you. -vito On Mon, 2004-04-26 at 14:59, Doug Cutting wrote: Anthony Vito wrote: I noticed some talk on SQLDirectory a month or so ago. . Di

Re: Help with scoring, coordination factor?

2004-04-29 Thread Doug Cutting
Matthew W. Bilotti wrote: We suspect the coordination term is driving down these documents' ranks and we would like to bring those documents back up to where they should be. That sounds right to me. Is there a relatively easy way to implement what we want using Lucene? Would it be better to t

Re: Understanding Boolean Queries

2004-04-29 Thread Doug Cutting
Please don't crosspost to lucene-user and lucene-dev! Tate Avery wrote: 3) The maxClauseCount threshold appears not to care whether or not my clauses are 'required' or 'prohibited'... only how many of them there are in total. That's correct. It is an attempt to stop out-of-memory errors which can

Re: need info for database based Lucene but not flat file

2004-04-27 Thread Doug Cutting
Incze Lajos wrote: Could anybody summarize what would be the technical pros/cons of a DB-based directory over the flat files? (What I see at the moment is that for some - significant? - performance penalty you'll get an index available over the network for multiple lucene engines -- if I'm right.) h

Re: "phrase search" AND term

2004-04-27 Thread Doug Cutting
Ioan Miftode wrote: I recently upgraded to lucene 1.4 RC2 because I needed some sorting capabilities. However some phrase searches don't work anymore (the hits don't even have the terms I'm searching on). Try the latest CVS. There were some bugs in 1.4RC2 that have been fixed. (We'll probably do

Re: need info for database based Lucene but not flat file

2004-04-27 Thread Doug Cutting
Yukun Song wrote: As is known, Lucene currently uses flat files to store information for indexing. Does anyone have ideas or resources for combining a database (like MySQL or PostgreSQL) with Lucene instead of the current flat index file formats? A few folks have implemented an SQL-based Lucene Directory, but n

Re: SQLDirectory implementation

2004-04-26 Thread Doug Cutting
Anthony Vito wrote: I noticed some talk on SQLDirectory a month or so ago. ( I just joined the list :) ) I have a JDBC implementation that stores the "files" in a couple of tables and stores the data for the files as blocks (BLOBs) of a certain size ( 16k by default ). It also has an LRU cache fo

Re: Not deleting temp files after updating/optimising.

2004-04-26 Thread Doug Cutting
Win32 seems to sometimes not permit one to delete a file immediately after it has been closed. Because of this, Lucene keeps a list of files that need to be deleted in the 'deleteable' file. Are your files listed in this file? If so, Lucene will again try to delete these files the next time

Re: Locking problems with NFS

2004-04-20 Thread Doug Cutting
Francesco Bellomi wrote: The only problem is that, as of Lucene 1.4rc2, FSDirectory is 'final'. Please submit a patch to lucene-dev to make FSDirectory non-final. In fact, a third architectural approach would be to define an API for "pluggable" lock implementations: IMHO that would be more robust to

Re: Locking problems with NFS

2004-04-20 Thread Doug Cutting
Francesco Bellomi wrote: we are experiencing some difficulties in using Lucene with an NFS filesystem. Basically, locking seems not to work properly, since it appears that attempted concurrent writes to the index (from different VMs) are not blocked, and this often causes the index to be corrupted.

Re: verifying index integrity

2004-04-08 Thread Doug Cutting
Weir, Michael wrote: So if our server is the only process that ever opens the index, I should be able to run through the indexes at startup and simply unlock them? Yes. Doug

Re: verifying index integrity

2004-04-06 Thread Doug Cutting
Weir, Michael wrote: I assume that it is possible to corrupt an index by crashing at just the right time. It should not be possible to corrupt an index this way. I notice that there's a method IndexReader.unlock(). Does this method ensure that the index has not been corrupted? If you use this met

Re: Index partitioning

2004-04-06 Thread Doug Cutting
Magnus Mellin wrote: i would like to partition an index over X number of remote searchers. Any ideas, or suggestions, on how to use the same term dictionary (one that represents the terms and frequencies for the whole document collection) over all my indices? Try using a ParallelMultiSearcher com

Re: "starts with" query functionality

2004-04-06 Thread Doug Cutting
Chad Small wrote: We have a requirement to return documents with a "title" field that starts with a certain letter. Is there a way to do something like this? We're using the StandardAnalyzer Example title fields: This is the title of a document. And this is a title of a different document.

Re: Find all Words in a Document

2004-04-06 Thread Doug Cutting
peters marcus wrote: is there a way to get all words stored in the index for a given document Yes, in the 1.4 release: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getTermFreqVectors(int) Doug

Re: Iterernal Document Numbers

2004-04-01 Thread Doug Cutting
Joe Rayguy wrote: So, assuming that sort as implemented in 1.4 doesn't work for me, my original question still stands. Do I have to worry about merges that occur as documents are added, or do I only have to rebuild my array after optimizations? Or, alternatively, how did everyone sort before 1.4?

Re: Wierd Search Behavior

2004-04-01 Thread Doug Cutting
Terry, Can you please try to develop a reproducible test case? Otherwise it's impossible to verify and debug this. For something like this it would suffice to provide: 1. The initial index, which satisfies the test queries; 2. The new index you add; 3. Your merge and test code, as a s

Re: Performance of hit highlighting and finding term positions for

2004-03-31 Thread Doug Cutting
Kevin A. Burton wrote: Doug Cutting wrote: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1413989 According to these, if your documents average 16k, then a 10-hit result page would require just 66ms to generate highlights using SimpleAnalyzer. The whole search takes only 3

Re: Performance of hit highlighting and finding term positions for

2004-03-31 Thread Doug Cutting
Doug Cutting wrote: According to these, if your documents average 16k, then a 10-hit result page would require just 66ms to generate highlights using SimpleAnalyzer. Oops. That should be 110ms. Doug

Re: Performance of hit highlighting and finding term positions for

2004-03-31 Thread Doug Cutting
[EMAIL PROTECTED] wrote: As a note of warning: I did find StandardTokenizer to be the major culprit in my tokenizing benchmarks (avg 75ms for 16k sized docs). I have found I can live without StandardTokenizer in my apps. FYI, the message with Mark's timings can be found at: http://nagoya.apache.o

Re: Performance of hit highlighting and finding term positions for a specific document

2004-03-31 Thread Doug Cutting
Kevin A. Burton wrote: I'm playing with this package: http://home.clara.net/markharwood/lucene/highlight.htm Trying to do hit highlighting. This implementation uses another Analyzer to find the positions for the result terms. This seems very inefficient. Does it just seem inefficient,

Re: Lucene optimization with one large index and numerous small indexes.

2004-03-30 Thread Doug Cutting
Esmond Pitt wrote: Don't want to start a buffer size war, but these have always seemed too small to me. I'd recommend upping both InputStream and OutputStream buffer sizes to at least 4k, as this is the cluster size on most disks these days, and also a common VM page size. Okay. Reading and writin

Re: Javadocs lucene 1.4

2004-03-29 Thread Doug Cutting
Lucene 1.4 has not been released. Until it is released, you need to check out the sources from CVS and build them, including javadoc. Doug Stephane James Vaucher wrote: Are the javadocs available on the site? I'd like to see the javadocs for lucene-1.4 (specifically SpanQuery) somewhere on the

Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Doug Cutting
Kevin A. Burton wrote: One way to force larger read-aheads might be to pump up Lucene's input buffer size. As an experiment, try increasing InputStream.BUFFER_SIZE to 1024*1024 or larger. You'll want to do this just for the merge process and not for searching and indexing. That should help yo

Re: Demoting results

2004-03-29 Thread Doug Cutting
[EMAIL PROTECTED] wrote: Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally useful than my implementation :-) Great! Glad to hear it was useful. BTW, I've had a thought about your suggestion for making the highlighter use some form of RAMindex of sentence fragments

Re: Overriding coordination

2004-03-29 Thread Doug Cutting
Boris Goldowsky wrote: I have a situation where I'm querying for something in several fields, with a clause similar to this: (title:(two words)^20 keywords:(two words)^10 body:(two words)) Some good documents are being scored too low if the query terms do not occur in the "body" field. I naive

Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Doug Cutting
Kevin A. Burton wrote: We're using lucene with one large target index which right now is 5G. Every night we take sub-indexes which are about 500M and merging them into this main index. This merge (done via IndexWriter.addIndexes(Directory[]) is taking way too much time. Looking at the stats f

Re: Lucene 1.4 - lobby for final release

2004-03-29 Thread Doug Cutting
Charlie Smith wrote: I'll vote yes please release new version with "too many files open" fixed. There is no "too many files open bug", except perhaps in your application. It is however an easy to encounter problem if you don't close indexes or if you change Lucene's default parameters. It will

Re: Demoting results

2004-03-29 Thread Doug Cutting
[EMAIL PROTECTED] wrote: I have not been able to work out how to get custom coordination going to demote results based on a specific term [ ... ] Yeah, it's a little more complicated than perhaps it should be. I've attached a class which does this. I think it's faster and more effective than wh

Re: Lucene 1.4 - lobby for final release

2004-03-26 Thread Doug Cutting
Chad Small wrote: thanks Erik. Ok this is my official lobby effort for the release of 1.4 to final status. Anyone else need/want a 1.4 release? Does anyone have any information on 1.4 release plans? I'd like to make an RC once I manage to fix bug #27799, which will hopefully be soon. Doug

Re: How to order search results by Field value?

2004-03-25 Thread Doug Cutting
Eric Jain wrote: I will need to have a look at the code, but I assume that in principle it should be possible to replace the strings with sequential integers once the sorting is done? I don't understand the question. Doug

Re: How to order search results by Field value?

2004-03-25 Thread Doug Cutting
Eric Jain wrote: That's reasonable. What I didn't quite understand yet: If I sort on a string field, will Lucene need to keep all values in memory all the time, or only during startup? It will cache one instance of each unique value. So if you have a million documents and string sort results on a

Re: How to order search results by Field value?

2004-03-25 Thread Doug Cutting
Eric Jain wrote: Just to clarify things: Does the current solution require all fields that can be used for sorting to be loaded and kept in memory? (I guess you can answer this question faster than I can figure it out by myself :-) Field values are loaded into memory. But values are kept in an arr

Re: Cover density ranking?

2004-03-23 Thread Doug Cutting
Boris Goldowsky wrote: How difficult would it be to implement something like Cover Density ranking for Lucene? Has anyone tried it? Cover density is described at http://citeseer.ist.psu.edu/558750.html , and is supposed to be particularly good for short queries of the type that you get in many

Re: Demoting results

2004-03-19 Thread Doug Cutting
Doug Cutting wrote: On Thu, 2004-03-18 at 13:32, Doug Cutting wrote: Have you tried assigning these very small boosts (0 < boost < 1) and assigning other query clauses relatively large boosts (boost > 1)? I don't think you understood my proposal. You should try boosting the docu

Re: Demoting results

2004-03-19 Thread Doug Cutting
Boris Goldowsky wrote: On Thu, 2004-03-18 at 13:32, Doug Cutting wrote: Have you tried assigning these very small boosts (0 < boost < 1) and assigning other query clauses relatively large boosts (boost > 1)? I was trying to formulate a query like, say +(title: asparagus) (doctyp

Re: Demoting results

2004-03-18 Thread Doug Cutting
Have you tried assigning these very small boosts (0 < boost < 1) and assigning other query clauses relatively large boosts (boost > 1)? Boris Goldowsky wrote: Is there any way to build a query where the occurrence of a particular Term (in a Keyword field) causes the rank of the document to be dec

Re: order of Field objects within Document

2004-03-18 Thread Doug Cutting
Sam Hough wrote: Can anybody confirm that no guarantee is given that Fields retain their order within a Document? Version 1.3 seems to (although reversing the order on occasion). In 1.3 they're reversed as added, then reversed as read, so that hits have fields in their added order. In 1.4 I've fi
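
The "reversed as added, then reversed as read" behaviour described for 1.3 can be pictured with plain Java collections. This is a toy model only: the class and method names are made up for illustration and do not reflect how Lucene actually stores fields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

// Toy model (not Lucene code): fields are pushed onto the front when added,
// then the stored order is reversed again when read back.
final class FieldOrderDemo {

    /** Simulates adding: each new field goes to the front of the list. */
    static List<String> addToFront(List<String> added) {
        LinkedList<String> stored = new LinkedList<>();
        for (String field : added) {
            stored.addFirst(field);
        }
        return stored;
    }

    /** Simulates reading: returns the stored fields in reverse order. */
    static List<String> readBack(List<String> stored) {
        List<String> copy = new ArrayList<>(stored);
        Collections.reverse(copy);
        return copy;
    }
}
```

In this model two reversals cancel out, which matches the statement that hits show fields in their added order; anything that sees only the first reversal would observe them backwards.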

Re: int vs long and document ids on 64bit machines.

2004-03-11 Thread Doug Cutting
hui wrote: If the document id is going to be changed, is it possible to define an interface so the user could provide other implementation to replace the default one? For example, the document unique timestamp or other fields as long as they are long could be used. I don't think that would be a goo

Re: update performance

2004-03-11 Thread Doug Cutting
Chris Kimm wrote: Unfortunately, I'm not able to batch the updates. The application needs to make some decisions based on what each document looks like before and after the update, so I have to do it one at a time. Are these decisions dependent on other documents? If not, you should be able

Re: update performance

2004-03-11 Thread Doug Cutting
It sounds like you're not batching your updates. The most efficient approach to update 1000 documents would be to: 1. Open an IndexReader; 2. Delete all 1000 documents; 3. Close the reader; 4. Open an IndexWriter; 5. Add all 1000 updated documents; 6. Close the IndexWriter. Is that wha

Re: int vs long and document ids on 64bit machines.

2004-03-11 Thread Doug Cutting
Kevin A. Burton wrote: In a discussion I had a while back, someone (Doug?) noted that the reason for going with 32-bit ints for document IDs was that 64-bit values weren't thread-safe on 32-bit machines. Someone, not me, perhaps provided that rationalization, which isn't a bad one. In fact, the situati

Re: Real time indexing and distribution to lucene on separate boxes (long)

2004-03-11 Thread Doug Cutting
Kevin A. Burton wrote: > 3. Have two directories on the searcher. The indexer would then sync to a tmp directory and then at run time swap them via a rename once the sync is over. The downside here is that this will take up 2x disk space on the searcher. The upside is that the box will only s

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-11 Thread Doug Cutting
Erik Hatcher wrote: Yes, I saw it. But is there a reason not to just expose HashSet given that it is the data structure that is most efficient? I bought into Kevin's arguments that it made sense to just expose HashSet. Just the general principle that one shouldn't expose more of the implementa

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-10 Thread Doug Cutting
Erik Hatcher wrote: Also... your HashSet constructor has to copy values from the original HashSet into the new HashSet ... not very clean and this can just be removed by forcing the caller to use a HashSet (which they should). I've caved in and gone HashSet all the way. Did you not see my mess

Re: 1.3-final builds as 1.4-rc1-dev?

2004-03-10 Thread Doug Cutting
Jeff Wong wrote: I noticed that Lucene 1.3-final source builds a JAR file whose version number is "1.4-rc1-dev". What does this mean? Will 1.4-final build as "1.5-rc1-dev"? Probably. If you modify the sources of a 1.3-final release, and build them, you're not building 1.3-final, but a derivativ

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Doug Cutting
David Spencer wrote: Maybe I missed something but I always thought the stop list should be a Set, not a Map (or Hashtable/Dictionary). After all, all you need to know is existence and that's what a Set does. Good point. Doug
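
The point that a stop list only needs an existence test can be illustrated with a small stand-alone sketch. The names below are hypothetical and this is not the StopFilter code from the patch; it only shows why a Set is a natural fit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only (not Lucene's StopFilter): a stop list needs nothing
// beyond "is this word present?", which is exactly what Set.contains gives.
final class StopWords {

    /** Builds a stop set from the given words. */
    static Set<String> stopSet(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    /** Returns the tokens that are not in the stop set, in original order. */
    static List<String> removeStops(List<String> tokens, Set<String> stops) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stops.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

A Map or Hashtable would work too, but the values would be unused, which is the objection raised in the message.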

Re: DocumentWriter, StopFilter should use HashMap... (patch)

2004-03-09 Thread Doug Cutting
Erik Hatcher wrote: Well, one issue you didn't consider is changing a public method signature. I will make this change, but leave the Hashtable signature method there. I suppose we could change the signature to use a Map instead, but I believe there are some issues with doing something like t

Re: Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-08 Thread Doug Cutting
hui wrote: Index time: compound format is 89 seconds slower. compound format: 1389507 total milliseconds non-compound format: 1300534 total milliseconds The index size is 85m with 4 fields only. The files are stored in the index. The compound format has only 3 files and the other has 13 files. T

Re: Storing numbers

2004-03-08 Thread Doug Cutting
Erik Hatcher wrote: private static final DecimalFormat formatter = new DecimalFormat("0"); // make this as wide as you need For ints, ten digits is probably safest. Since Lucene uses prefix compression on the term dictionary, you don't pay a penalty at search time for long shared pre
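
A minimal sketch of the zero-padding idea, assuming non-negative ints and a ten-digit pattern as suggested above. The class name and the negative-value check are additions for illustration, not part of the original code. Padding matters because terms compare as strings, so unpadded "9" would sort after "10".

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Sketch only: pads non-negative ints to ten digits so that string order
// matches numeric order. Ten digits covers Integer.MAX_VALUE (2147483647).
final class PaddedNumbers {

    private static final String PATTERN = "0000000000";

    /** Formats n as a fixed-width, zero-padded decimal string. */
    static String pad(int n) {
        if (n < 0) {
            // Negative numbers would need a different encoding; not handled here.
            throw new IllegalArgumentException("negative values are not supported");
        }
        // DecimalFormat is not thread-safe, so this sketch creates one per call.
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        DecimalFormat format =
                new DecimalFormat(PATTERN, DecimalFormatSymbols.getInstance(Locale.ROOT));
        return format.format(n);
    }

    public static void main(String[] args) {
        System.out.println(pad(42)); // prints "0000000042"
    }
}
```

The same padded form has to be used both when indexing the value and when building the query term, otherwise the strings will not match.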

Re: Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-04 Thread Doug Cutting
hui wrote: Not yet. For the compound file format, when the files get bigger, if I add a few new files frequently, the bigger files have to be updated. Will that affect search much and produce heavier disk I/O compared with the traditional index format? It seems the OS cache makes quite a difference wh

Re: Sys properties Was: java.io.tmpdir as lock dir .... once again

2004-03-03 Thread Doug Cutting
Stephane James Vaucher wrote: As I've stated in my earlier mail, I like this change. More importantly, could this become a "standard" way of changing configurations at runtime? For example, the default merge factor could also be set in this manner. Sure, that's reasonable, so this would be someth

Re: java.io.tmpdir as lock dir .... once again

2004-03-03 Thread Doug Cutting
Michael Duval wrote: > I've hacked the code for the time being by updating FSDirectory and replaced all System.getProperty("java.io.tmpdir") calls with a call to a new method "getLockDir()". This method checks for a "lucene.lockdir" prop before the "java.io.tmpdir" prop giving the end user a bi
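
The lookup order described in the quoted message (check "lucene.lockdir" first, then fall back to "java.io.tmpdir") might look roughly like the sketch below. The property names come from the message; the class name is hypothetical, and this is not the actual FSDirectory code.

```java
// Sketch of the described fallback, not the real FSDirectory implementation:
// prefer the "lucene.lockdir" system property, else use "java.io.tmpdir".
final class LockDirLookup {

    /** Returns the directory in which lock files should be created. */
    static String getLockDir() {
        return System.getProperty("lucene.lockdir",
                System.getProperty("java.io.tmpdir"));
    }

    public static void main(String[] args) {
        // Run with e.g. -Dlucene.lockdir=/some/dir to override the default.
        System.out.println(getLockDir());
    }
}
```

With this arrangement a container that changes java.io.tmpdir no longer decides where locks go, provided every process sharing the index sets the same override.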

Re: Best Practices for indexing in Web application

2004-03-03 Thread Doug Cutting
Michael Steiger wrote: I'm wondering that there are no samples for this job. I do not think that I am the first one looking for this. If you found this confusing, and would have been helped by some examples, please take the time to donate some good examples. Lucene is free, but requires donati

Re: Problem with search results

2004-03-03 Thread Doug Cutting
Morus Walter wrote: Now I think this can be fixed in the query parser alone by simply allowing '-' within words. That is change <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> ) > to <#_TERM_CHAR: ( <_TERM_START_CHAR> | <_ESCAPED_CHAR> | "-" ) > As a result, query parser will read '-' within w

Re: Indexing multiple instances of the same field for each document

2004-03-01 Thread Doug Cutting
Erik Hatcher wrote: On Feb 27, 2004, at 6:17 PM, Doug Cutting wrote: I think it's document.add(). Fields are pushed onto the front, rather than added to the end. Ah, ok DocumentFieldList/DocumentFieldEnumeration are the culprits. This is certainly a bug. Yes, a bug that's

Re: Indexing multiple instances of the same field for each document

2004-02-27 Thread Doug Cutting
I think it's document.add(). Fields are pushed onto the front, rather than added to the end. Doug Roy Klein wrote: I think it's got something to do with Document.invertDocument(). When I reverse the words in the phrase, the other document matches the phrase query. Roy -Original M

Re: Indexing multiple instances of the same field for each document

2004-02-27 Thread Doug Cutting
Roy Klein wrote: E.g. doc1.add(Field.indexed("field","the")); doc1.add(Field.indexed("field","quick")); doc1.add(Field.indexed("field","brown")); doc1.add(Field.indexed("field","fox")); doc1.add(Field.indexed("field","jumped")); writer.addDocument(doc1); Vs. doc2.add(Field.indexed("

Re: Iterating TermEnum backwards

2004-02-26 Thread Doug Cutting
Matt Quail wrote: Is there any way to iterate through a TermEnum backwards? Okay, I know that there isn't a way to do this via the TermEnum class, but is it "implementable" on top of the underlying Lucene datastore? Not really. The best you can do is skip back to the previous "indexed" term in Te

Re: Prevent duplicate results?

2004-02-25 Thread Doug Cutting
How could Lucene know that something is "duplicate but older"? Sounds like an application-specific thing. Doug Kevin A. Burton wrote: Is there any way to prevent lucene from returning duplicate (but 'older') results from returning within a search result? Kevin ---

Re: RE : Lucene scalability/clustering

2004-02-24 Thread Doug Cutting
Anson Lau wrote: I'm trying to see what are some common ways to scale lucene onto multiple boxes. Is RMI based search and using a MultiSearcher the general approach? Yes, although you probably want to use ParallelMultiSearcher. Doug

Re: problem with SearchFiles demo

2004-02-23 Thread Doug Cutting
Michael, What JVM and OS are you using? Your attachment did not make it through. If you continue to have problems please submit a bug report and attach test code there. Thanks, Doug Michael A. Schoen wrote: I am using 1.3-final. Specifically I'm using the jar files from lucene-1.3-final.zip.

Re: Concurrency

2004-02-20 Thread Doug Cutting
David Townsend wrote: Does this mean that if an IndexSearcher has hold of a segment file, then the index is optimised, any subsequent search will use a list of files that probably don't exist anymore? The IndexSearcher (through an IndexReader) has the files open, so it is still valid, and may be s

Re: Concurrency

2004-02-20 Thread Doug Cutting
Alan Smith wrote: 1. What happens if i make a backup (copy) of an index while documents are being added? Can it cause problems, and if so is there a way to safely do this? This is not in general safe. A copy may not be a usable index. The segments file points to the current set of files. An I

Re: MultiReader

2004-02-19 Thread Doug Cutting
Rasik Pandey wrote: Does anyone know of an implementation of a MultiReader (IndexReader over multiple indices) in the same spirit as the MultiSearcher? I just committed one! This was really already there, in SegmentsReader, but it was not public and needed a few minor changes. Enjoy. Doug

Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

2004-02-17 Thread Doug Cutting
David Spencer wrote: Code rewritten, automagically chooses lots of defaults, lets you override the defs thru the static vars at the bottom or the non-static vars also at the bottom. Has anyone used this? Was it useful? Should we add it to the sandbox? Doug

Re: SubstringQuery -- Re: Leading Wild Card Search

2004-02-17 Thread Doug Cutting
David Spencer wrote: 2 files attached, SubstringQuery (which you'll use) and SubstringTermEnum ( used by the former to be consistent w/ other Query code). I find this kind of query useful to have and think that the query parser should allow it in spite of the perception of this being slow, howev

Re: Inconsistent treatment of field-names between index-time and query-time

2004-02-17 Thread Doug Cutting
Esmond Pitt wrote: I have a field Author: and I'm using the StandardAnalyzer. When documents with this field are added to the index, the field name 'Author' is case-folded by the analyzer to 'author', and this is how it appears in the index. An analyzer does not process field names when indexing.

Re: 'Sponsored' links

2004-02-16 Thread Doug Cutting
Daniel B. Davis wrote: Are there other strategies not considered? Why not store sponsored documents in a separate index, separately searched, whose results are placed above those from the non-sponsored documents? Doug
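
The suggestion in the entry above (search a separate sponsored index, then place its results first) could be sketched roughly as follows. This is only an illustration in plain Java: the class name SponsoredFirst, the method combine, and the use of string identifiers for hits are all hypothetical, and the two input lists stand in for whatever the two separate searches return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SponsoredFirst {

    // Hypothetical helper: hits from the sponsored index come first,
    // followed by the hits from the regular (non-sponsored) index.
    // Each list keeps the order its own search produced.
    public static List<String> combine(List<String> sponsoredHits,
                                       List<String> regularHits) {
        List<String> combined = new ArrayList<>(sponsoredHits);
        combined.addAll(regularHits);
        return combined;
    }

    public static void main(String[] args) {
        // Placeholder identifiers standing in for search results.
        List<String> sponsored = Arrays.asList("ad-1", "ad-2");
        List<String> regular = Arrays.asList("doc-7", "doc-3", "doc-9");
        System.out.println(combine(sponsored, regular));
    }
}
```

Because the two result lists are never scored against each other, sponsored placement does not depend on how the regular documents rank.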

Re: SQLDirectory

2004-02-06 Thread Doug Cutting
would still be lousy, but add performance might be OK if the add operations were done in memory before committing them to the database. There would be a second index column, something like index number or something like that. Herb... -----Original Message----- From: Doug Cutting [mailto:[EMAIL PROT

Re: SQLDirectory

2004-02-06 Thread Doug Cutting
fast nor slow. I gather that each term's posting list was an individual BLOB in the database. The term string was used as the index column. I believe the group used stemming. Herb... -----Original Message----- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Friday, February 06, 2004

Re: SQLDirectory

2004-02-06 Thread Doug Cutting
Dror Matalon wrote: I suspect you're going to get lousy performance compared to using regular files. Perhaps, but in theory it shouldn't be a lot worse than, e.g., accessing an index over NFS. The tables might get fragmented as the index evolves, and database optimization might help performance.

Re: SQLDirectory

2004-02-05 Thread Doug Cutting
Philippe Laflamme wrote: I've worked on an implementation for Postgres. I used the Large Object API provided by the Postgres JDBC driver. It works fine but I doubt it is very scalable because the number of open connections during indexing can become very high. Lucene opens many different files when

Re: Using Explain and fieldNorm

2004-02-05 Thread Doug Cutting
Using the terminology in http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html fieldNorm is defined as getBoost(t.field in d) * lengthNorm(t.field in d) These two values are multiplied into a single value at index time, and it is unfortunately impossible to separa
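
A small worked example may help show why the product in the entry above cannot be split back into its two factors. This is a sketch in plain Java, not Lucene code: the class FieldNormSketch and its methods are hypothetical, and the length normalization of 1/sqrt(number of terms) is an assumption modeled on the default similarity.

```java
public class FieldNormSketch {

    // Assumption: length normalization of 1 / sqrt(number of terms in the field).
    public static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // fieldNorm as defined above: field boost multiplied by lengthNorm.
    public static float fieldNorm(float boost, int numTerms) {
        return boost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Boost 1.0 on a 4-term field: 1.0 * 1/2.
        float shortField = fieldNorm(1.0f, 4);
        // Boost 2.0 on a 16-term field: 2.0 * 1/4.
        float longBoostedField = fieldNorm(2.0f, 16);
        // Both products are 0.5, so the stored value alone cannot tell
        // which boost and which field length produced it.
        System.out.println(shortField + " " + longBoostedField);
    }
}
```

Two different boost and length combinations give the same product, so once only the product is kept, neither factor can be recovered from it.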

Re: setMaxClauseCount ??

2004-01-21 Thread Doug Cutting
Karl Koch wrote: Do you know good papers about strategies of how to select keywords effectively beyond the scope of stopword lists and stemming? Using term frequencies of the document is not really possible since Lucene is not providing access to a document vector, isn't it? Lucene does let you acce
