RE: File Handles issue
> > P.S. At one point I tried doing an in-memory index using the
> > RAMDirectory and then merging it with an on-disk index and it didn't
> > work. The RAMDirectory never flushed to disk... leaving me with an
> > empty index. I think this is because of a bug in the mechanism that
> > is supposed to copy the segments during the merge, but I didn't
> > follow up on this.
>
> That should work, it should be faster and would use a lot less memory
> than the approach you describe above. Can you please submit a simple
> test case illustrating the failure? Something self-contained would be
> best.

Ok. This will fail:

import java.io.*;

import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;

public class LuceneRAMDirectoryTest {
    public static void main(String[] args) {
        try {
            // create the index in RAM
            RAMDirectory ramDirectory = new RAMDirectory();
            Analyzer analyzer = new SimpleAnalyzer();
            IndexWriter ramWriter = new IndexWriter(ramDirectory, analyzer, true);
            try {
                for (int i = 0; i < 100; i++) {
                    Document doc = new Document();
                    doc.add(Field.Keyword("field1", "" + i));
                    ramWriter.addDocument(doc);
                }
            } finally {
                ramWriter.close();
            }

            // then merge it into a file-based index
            File file = new File("index");
            if (!file.exists())
                file.mkdir();
            IndexWriter fileWriter = new IndexWriter(file, analyzer, true);
            try {
                fileWriter.addIndexes(new Directory[] { ramDirectory });
            } finally {
                fileWriter.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
RE: File Handles issue
> From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
>
> Thanks for the detailed information, Doug! That helps a lot.
>
> Based on what you've said and on taking a closer look at the code, it
> looks like by setting mergeFactor and maxMergeDocs to
> Integer.MAX_VALUE, an entire index will be built in a single segment
> completely in memory (using the RAMDirectory) and then flushed to
> disk when closed.

Not quite. This would generate an index with a segment per document in
memory, and then try to merge them all in a single step. That should
work, but I do not think it is the most efficient way to build an index
in memory.

> P.S. At one point I tried doing an in-memory index using the
> RAMDirectory and then merging it with an on-disk index and it didn't
> work. The RAMDirectory never flushed to disk... leaving me with an
> empty index. I think this is because of a bug in the mechanism that
> is supposed to copy the segments during the merge, but I didn't
> follow up on this.

That should work, it should be faster and would use a lot less memory
than the approach you describe above. Can you please submit a simple
test case illustrating the failure? Something self-contained would be
best.

Doug
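One way to act on Doug's suggestion is to build documents in a
RAMDirectory in batches and merge each completed batch into the on-disk
index. This is a hypothetical sketch only, using the same Lucene calls
as the test case above; the BATCH_SIZE constant, the batching scheme,
and the synthetic document loop are illustrative assumptions, not from
the thread:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class BatchedRAMIndexer {
    static final int BATCH_SIZE = 1000; // illustrative; tune to available memory

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new SimpleAnalyzer();
        IndexWriter diskWriter = new IndexWriter("index", analyzer, true);
        RAMDirectory ram = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ram, analyzer, true);
        int inBatch = 0;
        for (int i = 0; i < 10000; i++) { // stand-in for a real document source
            Document doc = new Document();
            doc.add(Field.Keyword("field1", "" + i));
            ramWriter.addDocument(doc);
            if (++inBatch == BATCH_SIZE) {
                // flush the completed batch to disk and start a fresh one
                ramWriter.close();
                diskWriter.addIndexes(new Directory[] { ram });
                ram = new RAMDirectory();
                ramWriter = new IndexWriter(ram, analyzer, true);
                inBatch = 0;
            }
        }
        ramWriter.close();
        if (inBatch > 0) // merge any partial final batch
            diskWriter.addIndexes(new Directory[] { ram });
        diskWriter.close();
    }
}

This keeps the in-memory index bounded at BATCH_SIZE documents; whether
it beats indexing straight to disk depends on the cost of each merge,
so it is worth measuring on real data.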
RE: File Handles issue
Thanks for the detailed information, Doug! That helps a lot.

Based on what you've said and on taking a closer look at the code, it
looks like by setting mergeFactor and maxMergeDocs to Integer.MAX_VALUE,
an entire index will be built in a single segment completely in memory
(using the RAMDirectory) and then flushed to disk when closed. Given
enough memory, it would seem that this would be the fastest setting (as
well as using a minimum of file handles). Would you agree?

Thanks,
Scott

P.S. At one point I tried doing an in-memory index using the
RAMDirectory and then merging it with an on-disk index and it didn't
work. The RAMDirectory never flushed to disk... leaving me with an
empty index. I think this is because of a bug in the mechanism that is
supposed to copy the segments during the merge, but I didn't follow up
on this.
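For reference, the settings Scott describes would look roughly like
this; a minimal sketch, assuming the public mergeFactor and
maxMergeDocs fields named elsewhere in this thread, with an
illustrative "field1" field:

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

public class InMemoryBuildSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory ram = new RAMDirectory();
        IndexWriter writer = new IndexWriter(ram, new SimpleAnalyzer(), true);
        writer.mergeFactor = Integer.MAX_VALUE;  // never trigger an intermediate merge
        writer.maxMergeDocs = Integer.MAX_VALUE; // no cap on merged segment size
        for (int i = 0; i < 100; i++) {
            Document doc = new Document();
            doc.add(Field.Keyword("field1", "" + i));
            writer.addDocument(doc);
        }
        writer.close();
    }
}

As Doug notes in his reply above, this actually produces a segment per
document in memory and defers all merging to one final step, rather
than building a single segment incrementally.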
RE: File Handles issue
> From: Scott Ganyo [mailto:[EMAIL PROTECTED]]
>
> We're having a heck of a time with too many file handles around here.
> When we create large indexes, we often get thousands of temporary
> files in a given index!

Thousands, eh? That seems high.

The maximum number of segments should be f*log_f(N), where f is the
IndexWriter.mergeFactor and N is the number of documents. The default
merge factor is ten. There are seven files per segment, plus one per
field. If we assume that you have three fields per document, then it's
ten files per segment. So to get 1000 files in an index with three
fields and a mergeFactor of ten, you'd need 10 billion documents, which
I doubt you have. (Lucene can't handle more than 2 billion anyway...)

How many fields do you have? (How many different .f files are there
per segment?)

Have you lowered IndexWriter.maxMergeDocs? If you, e.g., lowered this
to 10,000, then with a million documents you'd have 100 segments, which
would give you 1000 files. So, to minimize the number of files, keep
maxMergeDocs at Integer.MAX_VALUE, its default.

Another possibility is that you're running on Win32 and obsolete files
are being kept open by IndexReaders and cannot be deleted. Could that
be the case?

> Even worse, we just plain run out of file handles--even on boxes
> where we've upped the limits as much as we think we can!

You should endeavor to keep just one IndexReader at a time for an
index. When it is out of date, don't close it, as this could break
queries running in other threads; just let it get garbage collected.
The finalizers will close things and free the file handles.

> I'm not very familiar with the Lucene file system yet, so can someone
> briefly explain how Lucene works on creating an index? How does it
> determine when to create a new temporary file in the index and when
> does it decide to compress the index?

Assume mergeFactor is ten, the default. A new segment is created on
disk for every ten documents added, or sooner if IndexWriter.close()
is called before ten have been added. When the tenth segment of size
ten is added, all ten are merged into a single segment of size 100.
When ten such segments of size 100 have been added, these are merged
into a single segment containing 1000 documents, and so on. So at any
time there can be no more than nine segments in each power-of-ten index
size. When optimize() is called all segments are merged into a single
segment.

The exception is that no segments will be created larger than
IndexWriter.maxMergeDocs. So if this were set to 1000, then when you
add the 10,000th document, instead of merging things into a single
segment of 10,000, it would add a tenth segment of size 1000, and keep
adding segments of size 1000 for every 1000 documents added.

> Also, is there any way we could limit the number of file handles used
> by Lucene?

An IndexReader keeps all files in all segments open while it is open.
So to minimize the number of file handles you should minimize the
number of segments, minimize the number of fields, and minimize the
number of IndexReaders open at once.

An IndexWriter also has all files in all segments open at once. So
updating in a separate process would also buy you more file handles.

Doug
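A hypothetical sketch of the "keep just one IndexReader" pattern Doug
describes above: one shared reader per index that is replaced, never
closed, when the index changes. The SharedReader class name is
illustrative, and the staleness check assumes the static
IndexReader.lastModified call of contemporary Lucene:

import java.io.IOException;

import org.apache.lucene.index.IndexReader;

public class SharedReader {
    private final String path;
    private IndexReader reader;
    private long version = -1;

    public SharedReader(String path) {
        this.path = path;
    }

    public synchronized IndexReader get() throws IOException {
        long lastModified = IndexReader.lastModified(path);
        if (reader == null || lastModified != version) {
            // Do not close the stale reader: queries in other threads
            // may still be using it. Dropping the reference lets the
            // garbage collector finalize it and free its file handles.
            reader = IndexReader.open(path);
            version = lastModified;
        }
        return reader;
    }
}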