Re: Index Size
On Wednesday 18 August 2004 22:44, Rob Jose wrote:
> Hello. I have indexed several thousand (52 to be exact) text files and I
> keep running out of disk space to store the indexes. The size of the
> documents I have indexed is around 2.5 GB. The size of the Lucene indexes
> is around 287 GB. Does this seem correct? I am not storing the contents
> of the file, just indexing and tokenizing.

As noted, one would expect the index size to be about 35% of the original text, i.e. about 2.5 GB x 35% = roughly 875 MB. That is two orders of magnitude off from what you have. Could you provide some more information about the field structure, i.e. how many fields, which fields are stored, which fields are indexed, possibly the use of non-standard analyzers, and possibly non-standard Lucene settings?

You might also try switching to the non-compound format to have a look at the sizes of the individual index files; see the file formats page on the Lucene web site. You can then see the total disk size of, for example, the stored fields.

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
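Paul's rule of thumb can be sanity-checked with a few lines of Java; the 35% figure and both sizes come from the thread, and the ratio confirms the "two orders of magnitude" observation:

```java
import java.util.Locale;

public class IndexSizeCheck {
    public static void main(String[] args) {
        double corpusGb = 2.5;        // size of the source documents
        double typicalRatio = 0.35;   // Paul's ~35% rule of thumb
        double expectedGb = corpusGb * typicalRatio;
        double observedGb = 287.0;    // the index size Rob reports

        // 2.5 GB * 0.35 = 0.875 GB expected; 287 GB observed.
        System.out.printf(Locale.ROOT, "expected ~%.2f GB, observed %.0f GB, ratio %.0fx%n",
                expectedGb, observedGb, observedGb / expectedGb);
    }
}
```

The ratio of roughly 330x is what makes a configuration or stale-file problem far more likely than normal index overhead.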
Re: Index Size
Hi,

Please check for hidden files in the index folder. If you are using Linux, do something like:

    ls -al <index folder>

I am also facing a similar problem where the index size is greater than the data size. In my case there were some hidden temporary files which Lucene creates; they were taking half of the total size. My problem is that after deleting the temporary files, the index size is still the same as the data size. That again seems to be a problem. I am yet to find out the reason.

Thanks,
George

--- Rob Jose [EMAIL PROTECTED] wrote:
> Hello. I have indexed several thousand (52 to be exact) text files and I keep
> running out of disk space to store the indexes. The size of the documents I
> have indexed is around 2.5 GB. The size of the Lucene indexes is around
> 287 GB. Does this seem correct? I am not storing the contents of the file,
> just indexing and tokenizing. I am using Lucene 1.3 final. Can you guys let
> me know what you are experiencing? I don't want to go into production with
> something that I should be configuring better. I am not sure if this helps,
> but I have a temp index and a real index. I index the file into the temp
> index, and then merge the temp index into the real index using the
> addIndexes method on the IndexWriter. I have also set setUseCompoundFile to
> true on the production writer. I did not set this on the temp index. The
> last thing that I do before closing the production writer is to call the
> optimize method. I would really appreciate any ideas to get the index size
> smaller if it is at all possible.
> Thanks, Rob
RE: Index Size
Guys, are you optimizing the index before the close? If not, try using it. :}

karthik

-----Original Message-----
From: Honey George [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 19, 2004 1:00 PM
To: Lucene Users List
Subject: Re: Index Size
Re: Index Size
Rob, as Doug and Paul already mentioned, the index size is definitely too big :-(. What could cause the problem, especially when running on a Windows platform, is an IndexReader that stays open during the whole indexing process. During indexing, the writer creates temporary segment files which are merged into bigger segments; when done, the old segment files are deleted. If there is an open IndexReader, the environment is unable to unlock the files and they stay in the index directory. You end up with an index several times bigger than the dataset.

Can you check your code for any open IndexReaders while indexing, or paste the relevant part to the list so we can have a look at it?

hope this helps
Bernhard
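Bernhard's point can be illustrated with plain java.io, no Lucene needed. On Windows, the first delete() fails while the stream is open, which is exactly how stale segment files pile up; on POSIX systems it succeeds immediately:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class OpenHandleDelete {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("segment", ".tmp");
        FileInputStream in = new FileInputStream(f);

        // On Windows this delete fails while 'in' is open;
        // on Linux/Unix it succeeds.
        boolean deletedWhileOpen = f.delete();

        in.close();
        if (!deletedWhileOpen) {
            f.delete(); // once the handle is closed, the file can be removed
        }
        System.out.println("exists after cleanup: " + f.exists());
    }
}
```

Either way, the file is gone once every handle is closed, which is why closing readers before (or re-opening them after) heavy merging keeps the directory from growing.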
RE: Restoring a corrupt index
This is what I did. There are two classes in the Lucene source which are not public and therefore cannot be accessed from outside the package:

1. org.apache.lucene.index.SegmentInfos - collection of segments
2. org.apache.lucene.index.SegmentInfo - represents a single segment

I took these two files, moved them to a separate folder, and then created a class with the following code fragment:

    public void displaySegments(String indexDir) throws Exception {
        Directory dir = FSDirectory.getDirectory(indexDir, false);
        SegmentInfos segments = new SegmentInfos();
        segments.read(dir);
        StringBuffer str = new StringBuffer();
        int size = segments.size();
        str.append("Index Dir = " + indexDir);
        str.append("\nTotal Number of Segments: " + size);
        str.append("\n--");
        for (int i = 0; i < size; i++) {
            str.append("\n");
            str.append((i + 1) + ". ");
            str.append(((SegmentInfo) segments.get(i)).name);
        }
        str.append("\n--");
        System.out.println(str.toString());
    }

    public void deleteSegment(String indexDir, String segmentName) throws Exception {
        Directory dir = FSDirectory.getDirectory(indexDir, false);
        SegmentInfos segments = new SegmentInfos();
        segments.read(dir);
        int size = segments.size();
        String name = null;
        boolean found = false;
        for (int i = 0; i < size; i++) {
            name = ((SegmentInfo) segments.get(i)).name;
            if (segmentName.equals(name)) {
                found = true;
                segments.remove(i);
                System.out.println("Deleted the segment with name " + name
                        + " from the segments file");
                break;
            }
        }
        if (found) {
            segments.write(dir);
        } else {
            System.out.println("Invalid segment name: " + segmentName);
        }
    }

Use the displaySegments() method to display the segments and deleteSegment() to delete the corrupt segment.

Thanks,
George

--- Karthik N S [EMAIL PROTECTED] wrote:
> Hi Guys,
> In our situation we would be indexing millions of documents, with huge
> gigabytes of data indexed, finally put into a MERGED INDEX, categorized
> accordingly. There may be a possibility of corruption, so please do post
> the code referrals.
> Thx, Karthik

-----Original Message-----
From: Honey George [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 18, 2004 5:51 PM
To: Lucene Users List
Subject: Re: Restoring a corrupt index

Thanks Erik, that worked. I was able to remove the corrupt segment and now it looks like the index is OK. I was able to view the number of documents in the index. Before that I was getting the error:

    java.io.IOException: read past EOF

I am yet to find out how my index got corrupted. There is another thread going on about this topic: http://www.mail-archive.com/[EMAIL PROTECTED]/msg03165.html. If anybody is facing a similar problem and is interested in the code, I can post it here.

Thanks,
George

--- Erik Hatcher [EMAIL PROTECTED] wrote:
> The details of the segments file (and all the others) are freely available
> here: http://jakarta.apache.org/lucene/docs/fileformats.html
> Also, there is Java code in Lucene, of course, that manipulates the
> segments file which could be leveraged (although probably package-scoped
> and not easily usable in a standalone repair tool).
> Erik
>
> On Aug 18, 2004, at 6:50 AM, Honey George wrote:
> > Looks like the problem is not with the hex editor; even in UltraEdit
> > (I had access to a Windows box) I am seeing the same display. The
> > problem is I am not able to identify where a record starts with just
> > one record in the file. Need to try some alternate approach.
> > Thanks, George
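The core of deleteSegment() is a remove-by-name scan over the segment list. The same logic can be exercised standalone, with an ArrayList of hypothetical segment names standing in for SegmentInfos (no Lucene required):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SegmentRemoveSketch {
    // Remove the first entry matching 'name'; report whether it was found,
    // mirroring deleteSegment()'s found/not-found branches.
    static boolean removeByName(List<String> segments, String name) {
        for (Iterator<String> it = segments.iterator(); it.hasNext(); ) {
            if (it.next().equals(name)) {
                it.remove();
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> segments = new ArrayList<>(Arrays.asList("_a", "_b", "_c"));
        System.out.println(removeByName(segments, "_b") + " " + segments);
        System.out.println(removeByName(segments, "_z") + " " + segments);
    }
}
```

In the real code the "write back" step (segments.write(dir)) only runs when the name was found, so an invalid name leaves the segments file untouched.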
RE: Restoring a corrupt index
Hi George,

Do you think the same would work for MERGED indexes? Please, can you suggest a solution?

Karthik

-----Original Message-----
From: Honey George [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 19, 2004 2:08 PM
To: Lucene Users List
Subject: RE: Restoring a corrupt index
searchhelp
Hi,

I am using the Lucene search engine for my application. I am able to search through the text files and HTML files as specified by Lucene. Can you please clarify my doubts:

1. Can Lucene search through PDFs and Word documents? If yes, then how?
2. Can Lucene search through a database? If yes, then how?

Thank you,
santosh

---SOFTPRO DISCLAIMER--
Information contained in this E-MAIL and any attachments is proprietary to SOFTPRO SYSTEMS and is 'privileged' and 'confidential'. If you are not an intended or authorised recipient of this E-MAIL or have received it in error, you are notified that any use, copying or dissemination of the information contained in this E-MAIL in any manner whatsoever is strictly prohibited. Please delete it immediately and notify the sender by E-MAIL. In such a case, reading, reproducing, printing or further disseminating this E-MAIL is strictly prohibited and may be unlawful. SOFTPRO SYSTEMS does not REPRESENT or WARRANT that an attachment hereto is free from computer viruses or other defects. The opinions expressed in this E-MAIL and any ATTACHMENTS may be those of the author and are not necessarily those of SOFTPRO SYSTEMS.
Re: searchhelp
For PDF you need to extract the text from the PDF files using the PDFBox library, and for Word documents you can use the Apache POI APIs. There are messages posted on the Lucene list related to your queries. About databases, I guess someone must have done it. :)

----- Original Message -----
From: Santosh [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 3:58 PM
Subject: searchhelp
Re: searchhelp
The PDF and Word stuff has been done too: have a look at http://www.zilverline.org.

Michael Franken
Re: searchhelp
I have recently joined the list and didn't go through any previous mails. If you have any mails or related code, please forward them to me.

----- Original Message -----
From: Chandan Tamrakar [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 3:47 PM
Subject: Re: searchhelp
Re: searchhelp
Hi,

Note that Lucene only provides an API to build a search engine; you can use it however you want. You can pass data for indexing in two forms:

1. java.lang.String
2. java.io.Reader

What Lucene receives is either of the two objects above. In the case of non-text documents you need to extract the text information from the document and either write it to a text file and convert that to a Reader object, or create a String object (for small files).

For indexing database contents, you need to write your own APIs to get data from the database (using JDBC/EJB etc.), convert the data to a String object, and pass it to Lucene for indexing. Again, Lucene is not responsible for getting the data from your application; it only indexes the data given to it.

Also, for extracting contents from PDF and Word files (generally known as text extraction) I know of two more tools:

- wvWare - for Word documents
- pdftotext (xpdf) - for PDF documents

Google around and you will get lots of links.

Hope this helps.

Thanks,
George

--- Santosh [EMAIL PROTECTED] wrote:
> I have recently joined the list and didn't go through any previous mails.
> If you have any mails or related code, please forward them to me.
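George's two accepted input forms can be sketched with stdlib types alone. The rows below are hypothetical stand-ins for a JDBC ResultSet, and the StringReader plays the role of the java.io.Reader you would hand to Lucene:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;

public class IndexInputSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical rows standing in for a JDBC ResultSet (id, title).
        List<String[]> rows = Arrays.asList(
                new String[]{"42", "Lucene in Action"},
                new String[]{"43", "Index internals"});

        // Concatenate the text columns into the String form Lucene accepts...
        StringBuilder text = new StringBuilder();
        for (String[] row : rows) text.append(row[1]).append(' ');

        // ...or wrap it as a java.io.Reader, the other accepted form.
        Reader reader = new StringReader(text.toString().trim());
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) out.append((char) c);
        System.out.println(out);
    }
}
```

The same pattern applies to PDF or Word content: whatever the extraction tool produces, reduce it to a String or a Reader before it ever touches the index.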
RE: Restoring a corrupt index
If I understand correctly, you have a situation where there is a large main index, you create small indexes, and you finally merge them into the main index. It can happen that halfway through merging the system crashes and the index gets corrupted. I do not think you can use my solution in that case. What I am trying to do is remove a corrupt segment and its associated files from the index folder, not fix a corrupt segment. This way at least I can add new documents to the index. Of course, I am sure I didn't lose anything, because my corrupt file's size was actually 0 bytes.

Thanks,
George

--- Karthik N S [EMAIL PROTECTED] wrote:
> Hi George,
> Do you think the same would work for MERGED indexes? Please, can you
> suggest a solution?
> Karthik
Re: searchhelp
For PDF you can refer to www.pdfbox.org, and please check the Apache POI project on the jakarta.apache.org site for indexing MS documents.

----- Original Message -----
From: Santosh [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 4:09 PM
Subject: Re: searchhelp
RE: searchhelp
JGURU FAQ http://www.jguru.com/faq/Lucene OFFICIAL FAQ http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi MAIL ARCHIVE http://www.mail-archive.com/[EMAIL PROTECTED]/ hope this helps. -Original Message- From: Santosh [mailto:[EMAIL PROTECTED] Sent: 19 August 2004 11:25 To: Lucene Users List Subject: Re: searchhelp I am recently joined into list, I didnt gone through any previous mails, if you have any mails or related code please forward it to me - Original Message - From: Chandan Tamrakar [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 3:47 PM Subject: Re: searchhelp For PDF you need to extract a text from pdf files using pdfbox library and for word documents u can use apache POI api's . There are messages posted on the lucene list related to your queries. About database ,i guess someone must have done it . :) - Original Message - From: Santosh [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 3:58 PM Subject: searchhelp Hi, I am using lucene search engine for my application. i am able to search through the text files and htmls as specified by lucene can you please clarify my doubts 1.can lucene search through pdfs and word documents? if yes then how? 2.can lucene search through database ? if yes then how? thankyou santosh ---SOFTPRO DISCLAIMER-- Information contained in this E-MAIL and any attachments are confidential being proprietary to SOFTPRO SYSTEMS is 'privileged' and 'confidential'. If you are not an intended or authorised recipient of this E-MAIL or have received it in error, You are notified that any use, copying or dissemination of the information contained in this E-MAIL in any manner whatsoever is strictly prohibited. Please delete it immediately and notify the sender by E-MAIL. In such a case reading, reproducing, printing or further dissemination of this E-MAIL is strictly prohibited and may be unlawful. 
Re: searchhelp
Thanks everybody, but I didn't get any code or any real help from these links. Has anybody performed this kind of search before? If yes, please send me the code, or tell me what code I have to add to my present Lucene setup.
Re: searchhelp
As far as I remember, the PDFBox release includes some existing code to index PDFs with Lucene, based upon the demo created for Lucene 1.3. In fact, I think the code only works with Lucene 1.3 - something to do with a change from arrays to Vectors in Lucene 1.4. I may be wrong, though. http://www.csh.rit.edu/~ben/projects/pdfbox/javadoc/org/pdfbox/searchengine/lucene/package-summary.html
RE: Re: Re: OutOfMemoryError
Terence,

Calling close() on IndexSearcher will not release the memory immediately. It will only release resources (e.g. other Java objects used by IndexSearcher), and it is up to the JVM's garbage collector to actually reclaim/release the previously used memory. There are command-line parameters you can use to tune garbage collection. Here is one example: java -XX:+UseParallelGC -XX:PermSize=20M -XX:MaxNewSize=32M -XX:NewSize=32M. This works with Sun's JVM. The above is just an example - you need to play with the options and see what works for you. There are other options, too:

-Xnoclassgc        disable class garbage collection
-Xincgc            enable incremental garbage collection
-Xloggc:<file>     log GC status to a file with time stamps
-Xbatch            disable background compilation
-Xms<size>         set initial Java heap size
-Xmx<size>         set maximum Java heap size
-Xss<size>         set java thread stack size
-Xprof             output cpu profiling data
-Xrunhprof[:help]|[:<option>=<value>, ...]  perform JVMPI heap, cpu, or monitor profiling

Otis

--- Terence Lai [EMAIL PROTECTED] wrote: Hi David, In my test program, I invoke the IndexSearcher.close() method at the end of the loop. However, it doesn't seem to release the memory. My concern is that even though I put the IndexSearcher.close() statement in the hook methods, it may not release all the memory until the application server is shut down. Every time the EJB object is re-activated, a new IndexSearcher is opened. If the resources allocated to the previous IndexSearcher cannot be fully released, the system will use up more memory. Eventually, it may run into the OutOfMemoryError. I am not very familiar with EJB; my interpretation could be wrong. I am going to try the hook methods. Thanks for pointing this out to me. Terence

I tried to reuse the IndexSearcher, but I have another question. What happens if an application server unloads the class after it is idle for a while, and then re-instantiates the object when it receives a new request?
The EJB spec takes this into account, as there are hook methods you can define that get called when your EJB object is about to be passivated or activated. Search for something like passivate/activate and/or ejbLoad/ejbSave. This is where you should close/open your single index searcher object. -- Cheers, David
RE: Re: OutOfMemoryError
Use the life-cycle hooks mentioned in another email (activate/passivate), and when you detect that the server is about to unload your class, call close() on IndexSearcher. I haven't used Lucene in an EJB environment, so I don't know the details, unfortunately. :(

Your simulation may be too fast for the JVM. Like I mentioned in the previous email, close() doesn't release the memory; it's the JVM that has to reclaim it. Your for loop is very fast (no pauses anywhere, probably), so maybe the garbage collector doesn't have time to reclaim the needed memory. I don't know enough about the low-level JVM stuff to be certain about this statement, but you could try adding some Thread.sleep calls in your test code.

Otis

--- Terence Lai [EMAIL PROTECTED] wrote: Hi, I tried to reuse the IndexSearcher, but I have another question. What happens if an application server unloads the class after it is idle for a while, and then re-instantiates the object when it receives a new request? Every time the server re-instantiates the class, a new IndexSearcher instance will be created. If the IndexSearcher.close() method does not release all the memory and the server keeps unloading and re-instantiating the class, it will eventually hit the OutOfMemoryError issue. The test program from my previous email simulates this condition. The reason why I instantiate/close the IndexSearcher inside the loop is to simulate the scenario where the server unloads and re-instantiates the object. I think the same issue will happen if the application is written as a servlet. Although the singleton pattern may resolve the problem I described above, it isn't permitted by the J2EE spec according to some newsletters. In other words, I can't use the singleton pattern in EJB. Please correct me if I am wrong on this. Thanks, Terence

Reuse your IndexSearcher! :) Also, I think somebody has written some EJB stuff to work with Lucene. The project is on SF.net.
Otis

--- Terence Lai [EMAIL PROTECTED] wrote: Hi All, I am getting an OutOfMemoryError when I deploy my EJB application. To debug the problem, I wrote the following test program:

public static void main(String[] args) {
    try {
        Query query = getQuery();
        for (int i = 0; i < 1000; i++) {
            search(query);
            if (i % 50 == 0) {
                System.out.println("Sleep...");
                Thread.currentThread().sleep(5000);
                System.out.println("Wake up!");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

private static void search(Query query) throws IOException {
    FSDirectory fsDir = null;
    IndexSearcher is = null;
    Hits hits = null;
    try {
        fsDir = FSDirectory.getDirectory("C:\\index", false);
        is = new IndexSearcher(fsDir);
        SortField sortField = new SortField("profile_modify_date", SortField.STRING, true);
        hits = is.search(query, new Sort(sortField));
    } finally {
        if (is != null) {
            try { is.close(); } catch (Exception ex) { }
        }
        if (fsDir != null) {
            try { fsDir.close(); } catch (Exception ex) { }
        }
    }
}

In the test program, I wrote a loop that keeps calling the search method. Every time it enters the search method, I instantiate the IndexSearcher. Before I exit the method, I close the IndexSearcher and the FSDirectory. I also made the thread sleep for 5 seconds every 50 searches; hopefully, this gives Java some time to do garbage collection. Unfortunately, when I observe the memory usage of my process, it keeps increasing until I get the java.lang.OutOfMemoryError. Note that I invoke IndexSearcher.search(Query query, Sort sort) to process the search. If I don't specify the Sort field (i.e. using IndexSearcher.search(query)), I don't have this problem, and the memory usage stays at a very static level. Does anyone experience a similar problem? Did I do something wrong in the test program? I thought that by closing the IndexSearcher and the FSDirectory, the memory would be released during garbage collection.
Thanks, Terence
RE: Re: OutOfMemoryError
Terence, 2) I have a background process to update the index files. If I keep the IndexSearcher opened, I am not sure whether it will pick up the changes from the index updates done in the background process. This is a frequently asked question. Basically, you have to make use of IndexReader's method for checking the index version. You can do it as often as you want; it's really up to you. When you detect that the index has been modified, throw away the old IndexSearcher and make a new one. If you are sure nobody is using your old IndexSearcher, you can close() it, but if somebody (e.g. another thread) is still using it and you close() it, you will get an error.

Otis
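The pattern Otis describes - cache one searcher and replace it only when the index version changes - can be sketched as follows. This is only an illustration of the control flow: FakeIndex and FakeSearcher are hypothetical stand-ins so the sketch runs without Lucene; with the real library the version check would use something like IndexReader.getCurrentVersion(indexPath) and the cached object would be an IndexSearcher.

```java
// Sketch of the reload-on-version-change pattern described above.
// FakeIndex and FakeSearcher are hypothetical stand-ins, not Lucene classes.
public class SearcherManagerSketch {

    // Stand-in for the on-disk index: exposes a version counter that a
    // background indexing process would bump on every commit.
    static class FakeIndex {
        long version = 1;
        void update() { version++; }
    }

    // Stand-in for IndexSearcher; remembers the version it was opened at.
    static class FakeSearcher {
        final long openedAtVersion;
        FakeSearcher(long v) { openedAtVersion = v; }
    }

    private final FakeIndex index;
    private FakeSearcher current;

    SearcherManagerSketch(FakeIndex index) {
        this.index = index;
        this.current = new FakeSearcher(index.version);
    }

    // Hand out the cached searcher, replacing it only when the index has
    // moved on. In real code the old searcher must not be closed while
    // another thread is still searching with it.
    synchronized FakeSearcher getSearcher() {
        if (index.version != current.openedAtVersion) {
            current = new FakeSearcher(index.version);
        }
        return current;
    }

    public static void main(String[] args) {
        FakeIndex idx = new FakeIndex();
        SearcherManagerSketch mgr = new SearcherManagerSketch(idx);
        FakeSearcher a = mgr.getSearcher();
        System.out.println(a == mgr.getSearcher()); // true: searcher is reused
        idx.update();                               // background writer commits
        System.out.println(a == mgr.getSearcher()); // false: searcher replaced
    }
}
```

The check can run on every request or on a timer; either way, only one searcher is ever open per index version.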
Re: Index Size
Paul Thank you for your response. I have appended to the bottom of this message the field structure that I am using. I hope that this helps. I am using the StandardAnalyzer. I do not believe that I am changing any default values, but I have also appended the code that adds the temp index to the production index. Thanks for your help. Rob

Here is the code that describes the field structure.

public static Document Document(String contents, String path, Date modified, String runDate,
        String totalpages, String pagecount, String countycode, String reportnum, String reportdescr) {
    SimpleDateFormat showFormat = new SimpleDateFormat(TurbineResources.getString("date.default.format"));
    SimpleDateFormat searchFormat = new SimpleDateFormat("MMdd");
    Document doc = new Document();
    doc.add(Field.Keyword("path", path));
    doc.add(Field.Keyword("modified", showFormat.format(modified)));
    doc.add(Field.UnStored("searchDate", searchFormat.format(modified)));
    doc.add(Field.Keyword("runDate", runDate == null ? "" : runDate));
    doc.add(Field.UnStored("searchRunDate", runDate == null ? "" :
        runDate.substring(6) + runDate.substring(0, 2) + runDate.substring(3, 5)));
    doc.add(Field.Keyword("reportnum", reportnum));
    doc.add(Field.Text("reportdescr", reportdescr));
    doc.add(Field.UnStored("cntycode", countycode));
    doc.add(Field.Keyword("totalpages", totalpages));
    doc.add(Field.Keyword("page", pagecount));
    doc.add(Field.UnStored("contents", contents));
    return doc;
}

Here is the code that adds the temp index to the production index.
File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
tempReader = IndexReader.open(tempFile);
try {
    boolean createIndex = false;
    File f = new File(sIndex + File.separatorChar + sCntyCode);
    if (!f.exists()) {
        createIndex = true;
    }
    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
        new StandardAnalyzer(), createIndex);
} catch (Exception e) {
    IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar + sCntyCode, false));
    CasesReports.log("Tried to Unlock " + sIndex);
    prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
    CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar + sCntyCode);
}
prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });

- Original Message - From: Paul Elschot [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 12:16 AM Subject: Re: Index Size [quoted message snipped]
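Paul's suggestion - switch off the compound format and look at the sizes of the individual index files - can be automated. The sketch below (plain Java, no Lucene dependency; pass the index directory as the argument) totals file sizes per extension, so you can see at a glance whether the space is going into stored fields (.fdt), the term dictionary (.tis), or leftover .cfs segments:

```java
import java.io.File;
import java.util.Map;
import java.util.TreeMap;

public class IndexSizeByExtension {

    // Total the file sizes per extension in one directory.
    static Map<String, Long> sizesByExtension(File dir) {
        Map<String, Long> totals = new TreeMap<String, Long>();
        File[] files = dir.listFiles();
        if (files == null) {
            return totals; // not a directory, or I/O error
        }
        for (File f : files) {
            if (!f.isFile()) continue;
            String name = f.getName();
            int dot = name.lastIndexOf('.');
            // Files without an extension (e.g. "segments") keep their full name.
            String ext = dot < 0 ? name : name.substring(dot + 1);
            totals.merge(ext, f.length(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Long> e : sizesByExtension(dir).entrySet()) {
            System.out.printf("%-10s %,d bytes%n", e.getKey(), e.getValue());
        }
    }
}
```

If the .cfs total dwarfs everything else after an optimize, that points at old segment files that were never deleted rather than at the field structure.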
Re: Index Size
Hey George Thanks for responding. I am using Windows and I don't see any hidden files. I have a ton of CFS files (1366/1405). I have 22 F# (F1, F2, etc.) files. I have two FDT files, two FDX files, and three FNM files. Add these to the deletable and segments files and that is all of the files that I have. The CFS files are approximately 11 MB each. The totals I gave you before were for all of my indexes together. This particular index has a size of 21.6 GB. The files that it indexed have a size of 89 MB. OK - I just removed all of the CFS files from the directory and I can still read my indexes. So now I have to ask: what are these CFS files? Why are they created? And how can I get rid of them if I don't need them? I will also take a look at the Lucene website to see if I can find any information. Thanks Rob

- Original Message - From: Honey George [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 12:29 AM Subject: Re: Index Size [quoted message snipped]
Re: Index Size
Karthik Thanks for responding. Yes, I optimize right before I close the index writer. I added this a little while ago to try and get the size down. Rob

- Original Message - From: Karthik N S [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 12:59 AM Subject: RE: Index Size

Guys, are you optimizing the index before the close process? If not, try using it... :} karthik

-Original Message- From: Honey George [mailto:[EMAIL PROTECTED]] Sent: Thursday, August 19, 2004 1:00 PM To: Lucene Users List Subject: Re: Index Size [quoted message snipped]
Re: Index Size
Bernhard Thanks for responding. I do have an IndexReader open on the temp index. I pass this IndexReader into the addIndexes method on the IndexWriter to add these files. I did notice that I have a ton of CFS files that I removed and was still able to read the indexes. Are these the temporary segment files you are talking about? Here is my code that adds the temp index to the prod index. [code snipped - same as posted in my reply to Paul] Am I doing something wrong? Any help would be extremely appreciated. Thanks Rob

- Original Message - From: Bernhard Messer [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 1:09 AM Subject: Re: Index Size

Rob, as Doug and Paul already mentioned, the index size is definitely too big :-(. What could cause the problem, especially when running on a Windows platform, is an IndexReader that stays open during the whole indexing process. During indexing, the writer creates temporary segment files which are merged into bigger segments; when done, the old segment files are deleted. If there is an open IndexReader, the environment is unable to unlock the files and they stay in the index directory. You will end up with an index several times bigger than the dataset. Can you check your code for any open IndexReaders when indexing, or paste the relevant part to the list so we can have a look at it? Hope this helps Bernhard
Re: Index Size
I did a little more research into my production indexes, and so far the first index is the only one that has any files besides the CFS files. The other indexes that I have seen have just the deletable and segments files and a whole bunch of CFS files. Very interesting. Also worth noting is that once in a while one of the production indexes will have a 0-length FNM file. Rob
about performance (newbie)
Luceners, I have elements (accounts, contacts, tasks, events) in which I have to find a word ("hello", for example) in any field. Which is the best way to do that with Lucene? In other words, I have several element types in which I have to search for a word. I can make one search and then sort the hits to separate the elements. The other option is to make as many searches as I have element types. Which of these options is better? I'm using MultiFieldQueryParser
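The first option (one search, then splitting the hits by element type) can be sketched generically. This is only an illustration, not Lucene API: the `Hit` class and the type values below are hypothetical stand-ins for reading a stored keyword field (e.g. the element type) from each matching Document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PartitionHits {
    // Hypothetical stand-in for a search hit; in Lucene you would read
    // the element type from a stored keyword field on each Document.
    static class Hit {
        final String type;
        final String id;
        Hit(String type, String id) { this.type = type; this.id = id; }
    }

    // Group the hits of a single query by their element type,
    // preserving the original (score) order within each group.
    static Map<String, List<Hit>> partitionByType(List<Hit> hits) {
        Map<String, List<Hit>> byType = new LinkedHashMap<>();
        for (Hit h : hits) {
            byType.computeIfAbsent(h.type, k -> new ArrayList<>()).add(h);
        }
        return byType;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("account", "a1"), new Hit("contact", "c1"), new Hit("account", "a2"));
        Map<String, List<Hit>> byType = partitionByType(hits);
        System.out.println(byType.get("account").size()); // prints 2
        System.out.println(byType.get("contact").size()); // prints 1
    }
}
```

The advantage of this shape is that it costs a single index scan; per-type queries would repeat the scan once per element type.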
Indexing Scheduler
Hi, I want to schedule indexing according to user-configured settings (date and time) using a job scheduler. How can I hook a job scheduler into indexing? If anyone has good experience with the Quartz Scheduler, please share it with me. Thanks, Natarajan.
Re: Index Size
I thought this was the case. I believe there was a bug in one of the recent Lucene releases that caused old CFS files not to be removed when they should be. This resulted in your index directory containing a bunch of old CFS files consuming your disk space. Try getting a recent nightly build and see if using that takes care of your problem. Otis --- Rob Jose [EMAIL PROTECTED] wrote: Hey George Thanks for responding. I am using Windows and I don't see any hidden files. I have a ton of CFS files (1366/1405). I have 22 F# (F1, F2, etc.) files. I have two FDT files and two FDX files. And three FNM files. Add these files to the deletable and segments files and that is all of the files that I have. The CFS files are approximately 11 MB each. The totals I gave you before were for all of my indexes together. This particular index has a size of 21.6 GB. The files that it indexed have a size of 89 MB. OK - I just removed all of the CFS files from the directory and I can still read my indexes. So now I have to ask: what are these CFS files? Why are they created? And how can I get rid of them if I don't need them? I will also take a look at the Lucene website to see if I can find any information. Thanks Rob - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Index Size
Otis I am using Lucene 1.3 final. Would it help if I move to Lucene 1.4 final? Rob - Original Message - From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 7:13 AM Subject: Re: Index Size I thought this was the case. I believe there was a bug in one of the recent Lucene releases that caused old CFS files not to be removed when they should be. This resulted in your index directory containing a bunch of old CFS files consuming your disk space. Try getting a recent nightly build and see if using that takes care of your problem. Otis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Index Size
Just go for 1.4.1 and look at the CHANGES.txt file to see if there were any index format changes. If there were, you'll need to re-index. Otis --- Rob Jose [EMAIL PROTECTED] wrote: Otis I am using Lucene 1.3 final. Would it help if I move to Lucene 1.4 final? Rob - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Index Size
Otis I upgraded to 1.4.1. I deleted all of my old indexes and started from scratch. I indexed 2 MB worth of text files and my index size is 8 MB. Would it be better if I stopped using the IndexWriter.addIndexes(IndexReader) method and instead traversed the IndexReader on the temp index and used the IndexWriter.addDocument(Document) method? Thanks again for your input, I appreciate it. Rob - Original Message - From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 8:00 AM Subject: Re: Index Size Just go for 1.4.1 and look at the CHANGES.txt file to see if there were any index format changes. If there were, you'll need to re-index. Otis - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Index Size
Have you tried looking at the contents of this small index with Luke, to see what actually got put into it? Maybe one of your stored fields is being fed something you didn't expect. Dan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Index Size
Dan Thanks for your response. Yes, I have used Luke to look at the index and everything looks good. Rob - Original Message - From: Armbrust, Daniel C. [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 9:14 AM Subject: RE: Index Size Have you tried looking at the contents of this small index with Luke, to see what actually got put into it? Maybe one of your stored fields is being fed something you didn't expect. Dan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Index Size
Stupid question: are you sure you have the right number of docs in your index? i.e. you're not adding the same document twice, into or via your tmp index. sv On Thu, 19 Aug 2004, Rob Jose wrote: Paul Thank you for your response. I have appended to the bottom of this message the field structure that I am using. I hope that this helps. I am using the StandardAnalyzer. I do not believe that I am changing any default values, but I have also appended the code that adds the temp index to the production index. Thanks for your help Rob Here is the code that describes the field structure.

public static Document Document(String contents, String path, Date modified, String runDate, String totalpages, String pagecount, String countycode, String reportnum, String reportdescr) {
    SimpleDateFormat showFormat = new SimpleDateFormat(TurbineResources.getString("date.default.format"));
    SimpleDateFormat searchFormat = new SimpleDateFormat("MMdd");
    Document doc = new Document();
    doc.add(Field.Keyword("path", path));
    doc.add(Field.Keyword("modified", showFormat.format(modified)));
    doc.add(Field.UnStored("searchDate", searchFormat.format(modified)));
    doc.add(Field.Keyword("runDate", runDate == null ? "" : runDate));
    doc.add(Field.UnStored("searchRunDate", runDate == null ? "" : runDate.substring(6) + runDate.substring(0, 2) + runDate.substring(3, 5)));
    doc.add(Field.Keyword("reportnum", reportnum));
    doc.add(Field.Text("reportdescr", reportdescr));
    doc.add(Field.UnStored("cntycode", countycode));
    doc.add(Field.Keyword("totalpages", totalpages));
    doc.add(Field.Keyword("page", pagecount));
    doc.add(Field.UnStored("contents", contents));
    return doc;
}

Here is the code that adds the temp index to the production index.

File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
tempReader = IndexReader.open(tempFile);
try {
    boolean createIndex = false;
    File f = new File(sIndex + File.separatorChar + sCntyCode);
    if (!f.exists()) {
        createIndex = true;
    }
    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode, new StandardAnalyzer(), createIndex);
} catch (Exception e) {
    IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar + sCntyCode, false));
    CasesReports.log("Tried to Unlock " + sIndex);
    prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
    CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar + sCntyCode);
}
prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
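The searchRunDate field above rearranges an MM/dd/yyyy runDate string into a yyyyMMdd key via substrings, so that the keyword sorts (and ranges) chronologically as plain text. A minimal sketch of that transformation, with a hypothetical helper name:

```java
public class RunDateKey {
    // Mirrors the searchRunDate expression above: "08/19/2004"
    // (MM/dd/yyyy) becomes "20040819" (yyyyMMdd), which sorts
    // chronologically when compared as a plain string.
    static String toSearchKey(String runDate) {
        if (runDate == null) return "";
        return runDate.substring(6)        // year:  "2004"
             + runDate.substring(0, 2)     // month: "08"
             + runDate.substring(3, 5);    // day:   "19"
    }

    public static void main(String[] args) {
        System.out.println(toSearchKey("08/19/2004")); // prints 20040819
    }
}
```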
Re: Index Size
How many fields do you have and what analyzer are you using? [EMAIL PROTECTED] 8/19/2004 11:54:25 AM Otis I upgraded to 1.4.1. I deleted all of my old indexes and started from scratch. I indexed 2 MB worth of text files and my index size is 8 MB. Would it be better if I stopped using the IndexWriter.addIndexes(IndexReader) method and instead traversed the IndexReader on the temp index and used the IndexWriter.addDocument(Document) method? Thanks again for your input, I appreciate it. Rob - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Index Size
Grant Thanks for your response. I have fixed this issue. I have indexed 5 MB worth of text files and now use only 224 KB. I was getting 80 MB. The only change I made was to the way I merge my temp index into my prod index. My code changed from:

prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });

to:

int iNumDocs = tempReader.numDocs();
for (int y = 0; y < iNumDocs; y++) {
    Document tempDoc = tempReader.document(y);
    prodWriter.addDocument(tempDoc);
}

I don't know if this is a bug in the IndexWriter.addIndexes(IndexReader) method or something else I am doing that caused this, but I am getting much better results now. Thanks to everyone who helped, I really appreciate it. Rob - Original Message - From: Grant Ingersoll [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 10:51 AM Subject: Re: Index Size How many fields do you have and what analyzer are you using? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Debian build problem with 1.4.1
Hi all, I am the Debian package maintainer for Lucene, and I'm having build problems with 1.4.1. We are very close to a major Debian release (code named 'sarge'), and the window for changes is very small. Can someone please help me in the next day or two? Otherwise Debian stable will ship Lucene 1.4-final for the next couple of years. It looks to me like the problem is in javacc-generated code, and it's not obvious to me what to do. For Debian sarge or sid users out there who want to reproduce the build problem, download the Lucene 1.4.1 source tarball, then:

apt-get install devscripts
apt-get source liblucene-java
cd lucene-1.4
uupdate -v 1.4.1 ../lucene-1.4.1-src.tar.gz
cd ../lucene-1.4.1
debuild -us -uc

Cheers, Jeff

=

compile-core:
[mkdir] Created dir: /tmp/lucene/lucene-1.4.1/build/classes/java
[javac] Compiling 160 source files to /tmp/lucene/lucene-1.4.1/build/classes/java
[javac] /tmp/lucene/lucene-1.4.1/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java:15: cannot resolve symbol
[javac] symbol  : class Reader
[javac] location: class org.apache.lucene.analysis.standard.StandardTokenizer
[javac] public StandardTokenizer(Reader reader) {
[javac]        ^
[javac] /tmp/lucene/lucene-1.4.1/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java:24: cannot resolve symbol
[javac] symbol  : class IOException
[javac] location: class org.apache.lucene.analysis.standard.StandardTokenizer
[javac] final public org.apache.lucene.analysis.Token next() throws ParseException, IOException {
[javac]        ^
[javac] /tmp/lucene/lucene-1.4.1/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java:15: recursive constructor invocation
[javac] public StandardTokenizer(Reader reader) {
[javac]        ^
[javac] 3 errors

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]