Re: index size doubled?
On Tuesday 21 December 2004 05:49, aurora wrote:

> I'm testing the rebuilding of the index. I add several hundred documents,
> optimize, and add another few hundred, and so on. Right now I have around
> 7000 files. I observed that after the index gets to a certain size, every
> time after optimize there are two files of roughly the same size, like below:
>
>   12/20/2004 01:57p          13 deletable
>   12/20/2004 01:57p          29 segments
>   12/20/2004 01:53p  14,460,367 _5qf.cfs
>   12/20/2004 01:57p  15,069,013 _5zr.cfs
>
> The total index size is double what I expect. This is not always
> reproducible. (I'm constantly tuning my program and the set of documents.)
> Sometimes I do get a single index file after optimize. What was happening?

Lucene tried to delete the older version (_5qf.cfs above), but got an error back from the file system. After that it put the name of that segment in the deletable file, so that it can try to delete that segment later. This is known behaviour on FAT file systems, which can randomly take some time to finish closing a file after it has been correctly closed by a program.

Regards, Paul Elschot

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
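The situation Paul describes can be spotted by listing the compound segment files in the index directory after an optimize: normally there should be just one .cfs file, so more than one suggests a stale segment that could not be deleted yet. A minimal sketch, using only plain java.io (not the Lucene API); the file names below mirror the directory listing above and are illustrative only:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Sketch: count leftover compound-file segments (.cfs) in an index
// directory. After a clean optimize there should normally be one.
public class LeftoverSegments {
    static String[] compoundFiles(File indexDir) {
        // Lucene 1.x compound segments end in ".cfs".
        String[] cfs = indexDir.list((dir, name) -> name.endsWith(".cfs"));
        return cfs == null ? new String[0] : cfs;
    }

    public static void main(String[] args) throws IOException {
        // Simulate the directory from the thread in a temp folder.
        File dir = Files.createTempDirectory("idx").toFile();
        new File(dir, "_5qf.cfs").createNewFile();  // stale segment
        new File(dir, "_5zr.cfs").createNewFile();  // current segment
        new File(dir, "deletable").createNewFile();
        // Two .cfs files after an optimize means a stale segment remained.
        System.out.println(compoundFiles(dir).length);  // prints 2
    }
}
```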
Re: index size doubled?
Another possibility is that you are using an older version of Lucene, which was known to have a bug with similar symptoms. Get the latest version of Lucene. You shouldn't really have multiple .cfs files after optimizing your index. Also, optimize only at the end, if you care about indexing speed.

Otis
Re: index size doubled?
Thanks for the heads up. I'm using Lucene 1.4.2. I tried to do optimize() again but it had no effect. Adding just a tiny dummy document would get rid of it. I'm doing optimize every few hundred documents because I am trying to simulate incremental updates. This leads to another question I will post separately. Thanks.
-- Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
Re: Index Size
On Wednesday 18 August 2004 22:44, Rob Jose wrote:

> Hello I have indexed several thousand (52 to be exact) text files and I keep
> running out of disk space to store the indexes. The size of the documents I
> have indexed is around 2.5 GB. The size of the Lucene indexes is around 287
> GB. Does this seem correct? I am not storing the contents of the

As noted, one would expect the index size to be about 35% of the original text, i.e. about 2.5 GB * 35% = 875 MB. That is two orders of magnitude off from what you have. Could you provide some more information about the field structure: how many fields, which fields are stored, which fields are indexed, any use of non-standard analyzers, and any non-standard Lucene settings? You might also try changing to the non-compound format to have a look at the sizes of the individual index files; see the file formats page on the Lucene web site. You can then see the total disk size of, for example, the stored fields.

Regards, Paul Elschot
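The arithmetic behind Paul's estimate can be checked directly; the figures below are the ones from the thread, and the 35% rule of thumb is the one Paul cites:

```java
// Quick check of the size estimate: ~35% of the original 2.5 GB of text
// versus the 287 GB index Rob reports.
public class IndexSizeEstimate {
    public static void main(String[] args) {
        double originalGb = 2.5;
        double expectedGb = originalGb * 0.35;   // expected index size
        double actualGb = 287.0;
        double ratio = actualGb / expectedGb;    // how far off the real index is

        System.out.println(Math.round(expectedGb * 1000)); // expected size in MB: 875
        System.out.println(Math.round(ratio));             // ~328x too large
    }
}
```

A factor of roughly 328 is between two and three orders of magnitude, matching Paul's "two orders of magnitude off".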
Re: Index Size
Hi, please check for hidden files in the index folder. If you are using Linux, do something like:

  ls -al <index folder>

I am also facing a similar problem where the index size is greater than the data size. In my case there were some hidden temporary files which Lucene creates, and they were taking half of the total size. My problem is that after deleting the temporary files, the index size is still the same as that of the data. That again seems to be a problem. I am yet to find out the reason.

Thanks, george

--- Rob Jose [EMAIL PROTECTED] wrote:

> Hello I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes. The size of the documents I have indexed is around 2.5 GB. The size of the Lucene indexes is around 287 GB. Does this seem correct? I am not storing the contents of the file, just indexing and tokenizing. I am using Lucene 1.3 final. Can you guys let me know what you are experiencing? I don't want to go into production with something that I should be configuring better. I am not sure if this helps, but I have a temp index and a real index. I index the file into the temp index, and then merge the temp index into the real index using the addIndexes method on the IndexWriter. I have also set the production writer setUseCompoundFile to true. I did not set this on the temp index. The last thing that I do before closing the production writer is to call the optimize method. I would really appreciate any ideas to get the index size smaller if it is at all possible. Thanks Rob

___ALL-NEW Yahoo! Messenger - all new features - even more fun! http://uk.messenger.yahoo.com
RE: Index Size
Guys, are you optimizing the index before closing it? If not, try it... :}

karthik
Re: Index Size
Rob, as Doug and Paul already mentioned, the index size is definitely too big :-(. What could cause the problem, especially when running on a Windows platform, is an IndexReader that stays open during the whole indexing process. During indexing, the writer creates temporary segment files which are merged into bigger segments. When that is done, the old segment files are deleted. If there is an open IndexReader, the environment is unable to unlock the files and they stay in the index directory. You end up with an index several times bigger than the dataset. Can you check your code for any open IndexReaders when indexing, or paste the relevant part to the list so we can have a look at it?

hope this helps
Bernhard
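The ordering Bernhard recommends (close any reader on the files before the writer merges and cleans up) can be sketched as follows. Note the Reader and Writer classes here are simplified stand-ins for illustration only, not the real Lucene IndexReader/IndexWriter API:

```java
// Sketch of "close the reader before merging": on Windows, segment
// files held open by a reader cannot be deleted, so old segments pile
// up. The classes below only model that rule; they are not Lucene.
public class CloseBeforeMerge {
    static class Reader implements AutoCloseable {
        boolean open = true;
        public void close() { open = false; }
    }

    static class Writer {
        // Cleanup after a merge succeeds only if no reader holds the files.
        boolean mergeAndCleanUp(Reader r) { return !r.open; }
    }

    public static void main(String[] args) {
        Reader reader = new Reader();
        Writer writer = new Writer();
        reader.close();  // close the reader first...
        // ...then the writer can merge and delete old segments.
        System.out.println(writer.mergeAndCleanUp(reader)); // prints true
    }
}
```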
Re: Index Size
Paul, thank you for your response. I have appended to the bottom of this message the field structure that I am using. I hope that this helps. I am using the StandardAnalyzer. I do not believe that I am changing any default values, but I have also appended the code that adds the temp index to the production index. Thanks for your help. Rob

Here is the code that describes the field structure:

    public static Document Document(String contents, String path, Date modified,
            String runDate, String totalpages, String pagecount,
            String countycode, String reportnum, String reportdescr) {
        SimpleDateFormat showFormat =
            new SimpleDateFormat(TurbineResources.getString("date.default.format"));
        SimpleDateFormat searchFormat = new SimpleDateFormat("MMdd");
        Document doc = new Document();
        doc.add(Field.Keyword("path", path));
        doc.add(Field.Keyword("modified", showFormat.format(modified)));
        doc.add(Field.UnStored("searchDate", searchFormat.format(modified)));
        doc.add(Field.Keyword("runDate", runDate == null ? "" : runDate));
        doc.add(Field.UnStored("searchRunDate", runDate == null ? "" :
            runDate.substring(6) + runDate.substring(0, 2) + runDate.substring(3, 5)));
        doc.add(Field.Keyword("reportnum", reportnum));
        doc.add(Field.Text("reportdescr", reportdescr));
        doc.add(Field.UnStored("cntycode", countycode));
        doc.add(Field.Keyword("totalpages", totalpages));
        doc.add(Field.Keyword("page", pagecount));
        doc.add(Field.UnStored("contents", contents));
        return doc;
    }

Here is the code that adds the temp index to the production index:

    File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
    tempReader = IndexReader.open(tempFile);
    try {
        boolean createIndex = false;
        File f = new File(sIndex + File.separatorChar + sCntyCode);
        if (!f.exists()) {
            createIndex = true;
        }
        prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
            new StandardAnalyzer(), createIndex);
    } catch (Exception e) {
        IndexReader.unlock(FSDirectory.getDirectory(
            sIndex + File.separatorChar + sCntyCode, false));
        CasesReports.log("Tried to Unlock " + sIndex);
        prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
        CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar + sCntyCode);
    }
    prodWriter.setUseCompoundFile(true);
    prodWriter.addIndexes(new IndexReader[] { tempReader });
Re: Index Size
Hey George, thanks for responding. I am using Windows and I don't see any hidden files. I have a ton of CFS files (1366/1405). I have 22 F# (F1, F2, etc.) files. I have two FDT files, two FDX files, and three FNM files. Add these to the deletable and segments files and that is all of the files that I have. The CFS files are approximately 11 MB each. The totals I gave you before were for all of my indexes together. This particular index has a size of 21.6 GB. The files that it indexed have a size of 89 MB. OK - I just removed all of the CFS files from the directory and I can still read my indexes. So now I have to ask: what are these CFS files? Why are they created? And how can I get rid of them if I don't need them? I will also take a look at the Lucene website to see if I can find any information. Thanks Rob
Re: Index Size
Karthik, thanks for responding. Yes, I optimize right before I close the index writer. I added this a little while ago to try and get the size down. Rob
Re: Index Size
Bernhard, thanks for responding. I do have an IndexReader open on the temp index. I pass this IndexReader into the addIndexes method on the IndexWriter to add these files. I did notice that I have a ton of CFS files that I removed while still being able to read the indexes. Are these the temporary segment files you are talking about? My code that adds the temp index to the prod index is the same code I posted in my reply to Paul above. Am I doing something wrong? Any help would be extremely appreciated. Thanks Rob
Re: Index Size
I did a little more research into my production indexes, and so far the first index is the only one that has any other files besides the CFS files. The other indexes that I have seen have just the deletable and segments files and a whole bunch of CFS files. Very interesting. Also worth noting is that once in a while one of the production indexes will have a 0-length FNM file. Rob
Re: Index Size
I thought this was the case. I believe there was a bug in one of the recent Lucene releases that caused old CFS files not to be removed when they should be removed. This resulted in your index directory containing a bunch of old CFS files consuming your disk space. Try getting a recent nightly build and see if using that takes care of your problem.

Otis
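What the bug Otis describes leaves behind is .cfs files that are no longer referenced by the live segment list. A sketch of that idea, for illustration only: the live segment names below are hard-coded stand-ins, since real code would have to parse Lucene's binary "segments" file, and upgrading Lucene is the actual fix rather than deleting files by hand:

```java
import java.io.File;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: count .cfs files in an index directory that are not in the
// set of live segments. Segment names here are made up for the demo.
public class StaleCfs {
    static int countStale(File dir, Set<String> liveSegments) {
        String[] cfs = dir.list((d, name) -> name.endsWith(".cfs"));
        int stale = 0;
        for (String name : cfs == null ? new String[0] : cfs) {
            String segment = name.substring(0, name.length() - ".cfs".length());
            if (!liveSegments.contains(segment)) stale++;
        }
        return stale;
    }

    public static void main(String[] args) throws Exception {
        File dir = Files.createTempDirectory("idx").toFile();
        for (String s : new String[] {"_a1", "_a2", "_a3"}) {
            new File(dir, s + ".cfs").createNewFile();
        }
        // Only the newest segment is still referenced.
        Set<String> live = new HashSet<>(Arrays.asList("_a3"));
        System.out.println(countStale(dir, live)); // prints 2
    }
}
```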
Re: Index Size
Otis I am using Lucene 1.3 final. Would it help if I move to Lucene 1.4 final? Rob - Original Message - From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 7:13 AM Subject: Re: Index Size I thought this was the case. I believe there was a bug in one of the recent Lucene releases that caused old CFS files not to be removed when they should be removed. This resulted in your index directory containing a bunch of old CFS files consuming your disk space. Try getting a recent nightly build and see if using that takes car eof your problem. Otis --- Rob Jose [EMAIL PROTECTED] wrote: Hey George Thanks for responding. I am using windows and I don't see any hidden files. I have a ton of CFS files (1366/1405). I have 22 F# (F1, F2, etc.) files. I have two FDT files and two FDX files. And three FNM files. Add these files to the deletable and segments file and that is all of the files that I have. The CFS files are appoximately 11 MB each. The totals I gave you before were for all of my indexes together. This particular index has a size of 21.6 GB. The files that it indexed have a size of 89 MB. OK - I just removed all of the CFS files from the directory and I can still read my indexes. So know I have to ask what are these CFS files? Why are they created? And how can I get rid of them if I don't need them. I will also take a look at the Lucene website to see if I can find any information. Thanks Rob - Original Message - From: Honey George [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 12:29 AM Subject: Re: Index Size Hi, Please check for hidden files in the index folder. If you are using linx, do something like ls -al index folder I am also facing a similar problem where the index size is greater than the data size. In my case there were some hidden temproary files which the lucene creates. That was taking half of the total size. 
My problem is that after deleting the temporary files, the index size is the same as that of the data size. That again seems to be a problem. I am yet to find out the reason. Thanks, george --- Rob Jose [EMAIL PROTECTED] wrote: Hello I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes. [...]
Re: Index Size
Just go for 1.4.1 and look at the CHANGES.txt file to see if there were any index format changes. If there were, you'll need to re-index. Otis --- Rob Jose [EMAIL PROTECTED] wrote: Otis I am using Lucene 1.3 final. Would it help if I move to Lucene 1.4 final? Rob [...]
Re: Index Size
Otis I upgraded to 1.4.1. I deleted all of my old indexes and started from scratch. I indexed 2 MB worth of text files and my index size is 8 MB. Would it be better if I stopped using the IndexWriter.addIndexes(IndexReader) method and instead traverse the IndexReader on the temp index and use the IndexWriter.addDocument(Document) method? Thanks again for your input, I appreciate it. Rob - Original Message - From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 8:00 AM Subject: Re: Index Size Just go for 1.4.1 and look at the CHANGES.txt file to see if there were any index format changes. If there were, you'll need to re-index. Otis [...]
RE: Index Size
Have you tried looking at the contents of this small index with Luke, to see what actually got put into it? Maybe one of your stored fields is being fed something you didn't expect. Dan
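For readers without Luke at hand, the same sanity check can be done programmatically. The sketch below is a hypothetical example (the class name and index-path argument are mine, not from the thread); it uses the Lucene 1.4-era API to print the stored fields of the first document, where an unexpectedly large stored value would explain a bloated index:

```java
import java.util.Enumeration;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;

public class DumpStoredFields {
    public static void main(String[] args) throws Exception {
        // args[0] is the index directory
        IndexReader reader = IndexReader.open(args[0]);
        Document doc = reader.document(0); // inspect the first document
        for (Enumeration e = doc.fields(); e.hasMoreElements();) {
            Field f = (Field) e.nextElement();
            // Only stored fields appear here; UnStored fields will not.
            System.out.println(f.name() + " = " + f.stringValue());
        }
        reader.close();
    }
}
```

This is only a spot check on one document, but it quickly reveals whether a stored field (e.g. a report description) is accidentally being fed the full file contents.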
Re: Index Size
Dan Thanks for your response. Yes, I have used Luke to look at the index and everything looks good. Rob - Original Message - From: Armbrust, Daniel C. [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 9:14 AM Subject: RE: Index Size [...]
Re: Index Size
Stupid question: Are you sure you have the right number of docs in your index? i.e. you're not adding the same document twice into or via your tmp index. sv On Thu, 19 Aug 2004, Rob Jose wrote: Paul Thank you for your response. I have appended to the bottom of this message the field structure that I am using. I hope that this helps. I am using the StandardAnalyzer. I do not believe that I am changing any default values, but I have also appended the code that adds the temp index to the production index. Thanks for your help Rob

Here is the code that describes the field structure:

public static Document Document(String contents, String path, Date modified,
    String runDate, String totalpages, String pagecount, String countycode,
    String reportnum, String reportdescr) {
  SimpleDateFormat showFormat = new SimpleDateFormat(
      TurbineResources.getString("date.default.format"));
  SimpleDateFormat searchFormat = new SimpleDateFormat("MMdd");
  Document doc = new Document();
  doc.add(Field.Keyword("path", path));
  doc.add(Field.Keyword("modified", showFormat.format(modified)));
  doc.add(Field.UnStored("searchDate", searchFormat.format(modified)));
  doc.add(Field.Keyword("runDate", runDate == null ? "" : runDate));
  doc.add(Field.UnStored("searchRunDate", runDate == null ? "" :
      runDate.substring(6) + runDate.substring(0, 2) + runDate.substring(3, 5)));
  doc.add(Field.Keyword("reportnum", reportnum));
  doc.add(Field.Text("reportdescr", reportdescr));
  doc.add(Field.UnStored("cntycode", countycode));
  doc.add(Field.Keyword("totalpages", totalpages));
  doc.add(Field.Keyword("page", pagecount));
  doc.add(Field.UnStored("contents", contents));
  return doc;
}

Here is the code that adds the temp index to the production index:
File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
tempReader = IndexReader.open(tempFile);
try {
  boolean createIndex = false;
  File f = new File(sIndex + File.separatorChar + sCntyCode);
  if (!f.exists()) {
    createIndex = true;
  }
  prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
      new StandardAnalyzer(), createIndex);
} catch (Exception e) {
  IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar + sCntyCode, false));
  CasesReports.log("Tried to Unlock " + sIndex);
  prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
  CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar + sCntyCode);
}
prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });

- Original Message - From: Paul Elschot [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 12:16 AM Subject: Re: Index Size On Wednesday 18 August 2004 22:44, Rob Jose wrote: Hello I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes. The size of the documents I have indexed is around 2.5 GB. The size of the Lucene indexes is around 287 GB. Does this seem correct? I am not storing the contents of the [...] As noted, one would expect the index size to be about 35% of the original text, i.e. about 2.5 GB * 35% = roughly 875 MB. What you have is more than two orders of magnitude larger than that. Could you provide some more information about the field structure, i.e. how many fields, which fields are stored, which fields are indexed, possibly the use of non-standard analyzers, and possibly non-standard Lucene settings? You might also try changing to the non-compound format to have a look at the sizes of the individual index files; see the file formats page on the Lucene web site. You can then see the total disk size of, for example, the stored fields.
Regards, Paul Elschot
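Paul's suggestion above (switch off the compound format and compare the sizes of the individual index files) can be automated with plain Java. This is a hedged sketch, not part of Lucene itself (the class and method names are mine): it sums file sizes per extension, so you can see at a glance whether, say, stored fields (.fdt) or postings (.frq/.prx) dominate the index:

```java
import java.io.File;
import java.util.Iterator;
import java.util.TreeMap;

public class IndexSizeByExtension {
    // Sum file sizes per extension in one directory, e.g. ".fdt" -> 14460367.
    public static TreeMap sizesByExtension(File dir) {
        TreeMap totals = new TreeMap();
        File[] files = dir.listFiles();
        for (int i = 0; files != null && i < files.length; i++) {
            if (!files[i].isFile()) continue;
            String name = files[i].getName();
            int dot = name.lastIndexOf('.');
            String ext = (dot < 0) ? "(none)" : name.substring(dot);
            Long old = (Long) totals.get(ext);
            long sum = (old == null ? 0L : old.longValue()) + files[i].length();
            totals.put(ext, new Long(sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        TreeMap totals = sizesByExtension(new File(args[0]));
        for (Iterator it = totals.keySet().iterator(); it.hasNext();) {
            String ext = (String) it.next();
            System.out.println(ext + "\t" + totals.get(ext) + " bytes");
        }
    }
}
```

Run it against the index directory after indexing with setUseCompoundFile(false); a huge .fdt total points at oversized stored fields, while a huge .frq/.prx total points at the indexed terms.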
Re: Index Size
How many fields do you have and what analyzer are you using? [EMAIL PROTECTED] 8/19/2004 11:54:25 AM Otis I upgraded to 1.4.1. I deleted all of my old indexes and started from scratch. I indexed 2 MB worth of text files and my index size is 8 MB. Would it be better if I stopped using the IndexWriter.addIndexes(IndexReader) method and instead traverse the IndexReader on the temp index and use the IndexWriter.addDocument(Document) method? Thanks again for your input, I appreciate it. Rob [...]
Re: Index Size
Grant Thanks for your response. I have fixed this issue. I have indexed 5 MB worth of text files and I now only use 224 KB, where I was previously getting 80 MB. The only change I made was to change the way I merge my temp index into my prod index. My code changed from:

prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });

To:

int iNumDocs = tempReader.numDocs();
for (int y = 0; y < iNumDocs; y++) {
  Document tempDoc = tempReader.document(y);
  prodWriter.addDocument(tempDoc);
}

I don't know if this is a bug in the IndexWriter.addIndexes(IndexReader) method or something else I am doing that caused this, but I am getting much better results now. Thanks to everyone who helped, I really appreciate it. Rob - Original Message - From: Grant Ingersoll [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 10:51 AM Subject: Re: Index Size How many fields do you have and what analyzer are you using? [...]
Re: Index Size
From: Doug Cutting http://www.mail-archive.com/[EMAIL PROTECTED]/msg08757.html An index typically requires around 35% of the plain text size. I think it's a little big. sv On Wed, 18 Aug 2004, Rob Jose wrote: Hello I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes. The size of the documents I have indexed is around 2.5 GB. The size of the Lucene indexes is around 287 GB. Does this seem correct? [...]
Re: index size question
It's in the jguru FAQ http://www.jguru.com/faq/view.jsp?EID=538304 By the way, what platforms don't support files greater than 2GB in this day and age? Answer This question is often brought up because of the 2GB file size limit of some 32-bit operating systems. This is a slightly modified answer from Doug Cutting: The easiest thing is to set IndexWriter.maxMergeDocs. If, for instance, you hit the 2GB limit at 8M documents, set maxMergeDocs to 7M. That will keep Lucene from trying to merge an index that won't fit in your filesystem. It will effectively round this down to the next lower power of IndexWriter.mergeFactor. So with the default mergeFactor set to 10 and maxMergeDocs set to 7M, Lucene will generate a series of 1M document indexes, since merging 10 of these would exceed the maximum. A slightly more complex solution: you could further minimize the number of segments if, when you've added 7M documents, you optimize the index and start a new index. Then use MultiSearcher to search the indexes. An even more complex and optimal solution: write a version of FSDirectory that, when a file exceeds 2GB, creates a subdirectory and represents the file as a series of files. On Wed, Jan 14, 2004 at 10:50:48AM -0500, Chong, Herb wrote: this should probably be in the FAQ. what happens when i index tens of gigabytes of documents on a platform that doesn't support files larger than 2GB. does Lucene automatically stop merging index files intelligently so that its files don't exceed 2GB in size, or must i manage the incoming documents such that no index file exceeds 2GB? Herb -- Dror Matalon Zapatec Inc 1700 MLK Way Berkeley, CA 94709 http://www.fastbuzz.com http://www.zapatec.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
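The first suggestion from the FAQ can be sketched as follows. This is a hypothetical example, not code from the thread (the class name, index path, and the 7M figure are illustrative); it assumes the Lucene 1.4-era API, where mergeFactor and maxMergeDocs are public fields on IndexWriter:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class CappedWriter {
    public static void main(String[] args) throws Exception {
        // args[0] is the index directory; true = create a new index
        IndexWriter writer = new IndexWriter(args[0], new StandardAnalyzer(), true);
        // If 8M documents hit the 2 GB file limit, stop merging at 7M docs.
        // With mergeFactor = 10 this effectively yields a series of
        // 1M-document segments, since merging ten of those would exceed the cap.
        writer.maxMergeDocs = 7000000;
        writer.mergeFactor = 10;
        // ... add documents here ...
        writer.close();
    }
}
```

Note that optimize() would still try to merge everything into one segment, so with this scheme you would skip the final optimize, or follow the FAQ's second suggestion and search the resulting indexes with a MultiSearcher.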