Re: Index Size

2004-08-19 Thread Paul Elschot
On Wednesday 18 August 2004 22:44, Rob Jose wrote: Hello I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes. The size of the documents I have indexed is around 2.5 GB. The size of the Lucene indexes is around 287 GB. Does

Re: Index Size

2004-08-19 Thread Honey George
Hi, Please check for hidden files in the index folder. If you are using Linux, do something like ls -al on the index folder. I am also facing a similar problem where the index size is greater than the data size. In my case there were some hidden temporary files which Lucene creates. That was taking
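
For platforms where ls -al is not at hand, the same check can be sketched in plain Java. This is only an illustration and does not use Lucene: the class name IndexDirReport and the method sizeByExtension are made up here. It totals the bytes per file extension for every regular file in a directory, hidden ones included, so leftover files stand out.

```java
import java.io.File;
import java.util.Map;
import java.util.TreeMap;

public class IndexDirReport {

    // Returns total bytes per lower-cased file extension for the regular
    // files directly inside dir. File.listFiles() also returns hidden files.
    // Names with no extension (or only a leading dot) are grouped as "(none)".
    static Map<String, Long> sizeByExtension(File dir) {
        Map<String, Long> totals = new TreeMap<>();
        File[] files = dir.listFiles();
        if (files == null) {
            return totals; // not a directory, or not readable
        }
        for (File f : files) {
            if (!f.isFile()) {
                continue; // skip subdirectories
            }
            String name = f.getName();
            int dot = name.lastIndexOf('.');
            String ext = (dot > 0 && dot < name.length() - 1)
                    ? name.substring(dot + 1).toLowerCase()
                    : "(none)";
            totals.merge(ext, f.length(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Directory to inspect; defaults to the current directory.
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Long> e : sizeByExtension(dir).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue() + " bytes");
        }
    }
}
```

Run it with the index directory as the only argument; an extension whose total is far larger than the indexed data is the place to look first.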

RE: Index Size

2004-08-19 Thread Karthik N S
Guys, are you optimizing the index before closing it? If not, try it... :} karthik -Original Message- From: Honey George [mailto:[EMAIL PROTECTED] Sent: Thursday, August 19, 2004 1:00 PM To: Lucene Users List Subject: Re: Index Size Hi, Please check

Re: Index Size

2004-08-19 Thread Bernhard Messer
Rob, as Doug and Paul already mentioned, the index size is definitely too big :-(. What could cause the problem, especially when running on a Windows platform, is that an IndexReader is open during the whole indexing process. During indexing, the writer creates temporary segment files which will be

RE: Restoring a corrupt index

2004-08-19 Thread Honey George
This is what I did. There are 2 classes in the Lucene source which are not public and therefore cannot be accessed from outside the package. The classes are 1. org.apache.lucene.index.SegmentInfos - collection of segments 2. org.apache.lucene.index.SegmentInfo - represents a single segment I

RE: Restoring a corrupt index

2004-08-19 Thread Karthik N S
Hi George, do you think the same would work for MERGED indexes? Please can you suggest a solution. Karthik -Original Message- From: Honey George [mailto:[EMAIL PROTECTED] Sent: Thursday, August 19, 2004 2:08 PM To: Lucene Users List Subject: RE: Restoring a corrupt index

searchhelp

2004-08-19 Thread Santosh
Hi, I am using the Lucene search engine for my application. I am able to search through text files and HTML as specified by Lucene. Can you please clarify my doubts? 1. Can Lucene search through PDFs and Word documents? If yes, then how? 2. Can Lucene search through a database? If yes, then how?

Re: searchhelp

2004-08-19 Thread Chandan Tamrakar
For PDF you need to extract the text from the PDF files using the pdfbox library, and for Word documents you can use the Apache POI APIs. There are messages posted on the Lucene list related to your queries. About databases, I guess someone must have done it. :) - Original Message - From: Santosh

Re: searchhelp

2004-08-19 Thread Zilverline info
The PDF and WORD stuff has been done too: have a look at http://www.zilverline.org. Michael Franken Chandan Tamrakar wrote: For PDF you need to extract the text from the PDF files using the pdfbox library, and for Word documents you can use the Apache POI APIs. There are messages posted on the Lucene list

Re: searchhelp

2004-08-19 Thread Santosh
I recently joined the list and have not gone through any previous mails. If you have any mails or related code, please forward them to me - Original Message - From: Chandan Tamrakar [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 3:47 PM Subject: Re:

Re: searchhelp

2004-08-19 Thread Honey George
Hi, Note that Lucene only provides an API to build a search engine; you can use it however you want. You can pass data to indexing in 2 forms. 1. java.lang.String 2. java.io.Reader What Lucene receives is either of the two objects above. Now in the case of non-text documents you need to extract
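
A minimal sketch of that idea, assuming nothing beyond the JDK: each document format gets its own extractor that hands back a java.io.Reader, which is one of the two forms mentioned above. The TextExtractor interface and the ExtractorSketch class are hypothetical names used only for illustration; a PDF or Word extractor would wrap a library such as pdfbox or POI and is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ExtractorSketch {

    // Hypothetical interface: one implementation per document format.
    interface TextExtractor {
        Reader extract(InputStream in) throws IOException;
    }

    // Plain text needs no conversion, only a character decoding step.
    // UTF-8 is an assumption; real input may use another encoding.
    static final TextExtractor PLAIN_TEXT =
            in -> new InputStreamReader(in, StandardCharsets.UTF_8);

    // Drains a Reader into a String, for callers that want the String form.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(
                "hello lucene".getBytes(StandardCharsets.UTF_8));
        try (Reader r = PLAIN_TEXT.extract(in)) {
            System.out.println(readAll(r));
        }
    }
}
```

Keeping the extraction step behind one small interface means the indexing code only ever sees a Reader or a String, whatever the original file type was.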

RE: Restoring a corrupt index

2004-08-19 Thread Honey George
If I understand correctly, you have a situation where you have a large main index, and then you create small indexes and finally merge them into the main index. It can happen that halfway through merging, the system crashes and the index gets corrupted. I do not think you can use my solution in this case.

Re: searchhelp

2004-08-19 Thread Chandan Tamrakar
For PDF you can refer to www.pdfbox.org, and please check the Apache POI project on the jakarta.apache.org site for indexing MS documents. - Original Message - From: Santosh [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 4:09 PM Subject: Re: searchhelp

RE: searchhelp

2004-08-19 Thread David Townsend
JGURU FAQ http://www.jguru.com/faq/Lucene OFFICIAL FAQ http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi MAIL ARCHIVE http://www.mail-archive.com/[EMAIL PROTECTED]/ hope this helps. -Original Message- From: Santosh [mailto:[EMAIL PROTECTED] Sent: 19 August 2004 11:25 To: Lucene

Re: searchhelp

2004-08-19 Thread Santosh
Thanks everybody, but I did not get any code or any real help from these links. Has anybody performed this kind of search previously? If yes, then please send me the code, or tell me what code I have to add to my present Lucene setup - Original Message - From: David Townsend [EMAIL PROTECTED] To: Lucene

Re: searchhelp

2004-08-19 Thread Reyhood Farhan
As far as I remember, the pdfbox release includes some existing code to index PDFs with Lucene, based upon the demo created for Lucene 1.3. In fact, I think the code only works for Lucene 1.3 - something to do with a change from arrays to vectors in Lucene 1.4. I may be wrong though.

RE: Re: Re: OutOfMemoryError

2004-08-19 Thread Otis Gospodnetic
Terence, Calling close() on IndexSearcher will not release the memory immediately. It will only release resources (e.g. other Java objects used by IndexSearcher), and it is up to the JVM's garbage collector to actually reclaim/release the previously used memory. There are command-line

RE: Re: OutOfMemoryError

2004-08-19 Thread Otis Gospodnetic
Use the life-cycle hooks mentioned in another email (activate/passivate) and when you detect that the server is about to unload your class, call close() on IndexSearcher. I haven't used Lucene in an EJB environment, so I don't know the details, unfortunately. :( Your simulation may be too fast

RE: Re: OutOfMemoryError

2004-08-19 Thread Otis Gospodnetic
Terence, 2) I have a background process to update the index files. If I keep the IndexSearcher opened, I am not sure whether it will pick up the changes from the index updates done in the background process. This is a frequently asked question. Basically, you have to make use of

Re: Index Size

2004-08-19 Thread Rob Jose
Paul Thank you for your response. I have appended to the bottom of this message the field structure that I am using. I hope that this helps. I am using the StandardAnalyzer. I do not believe that I am changing any default values, but I have also appended the code that adds the temp index to

Re: Index Size

2004-08-19 Thread Rob Jose
Hey George Thanks for responding. I am using windows and I don't see any hidden files. I have a ton of CFS files (1366/1405). I have 22 F# (F1, F2, etc.) files. I have two FDT files and two FDX files. And three FNM files. Add these files to the deletable and segments file and that is all of the

Re: Index Size

2004-08-19 Thread Rob Jose
Karthik Thanks for responding. Yes, I optimize right before I close the index writer. I added this a little while ago to try and get the size down. Rob - Original Message - From: Karthik N S [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 12:59

Re: Index Size

2004-08-19 Thread Rob Jose
Bernhard Thanks for responding. I do have an IndexReader open on the Temp index. I pass this IndexReader into the addIndexes method on the IndexWriter to add these files. I did notice that I have a ton of CFS files that I removed and was still able to read the indexes. Are these the temporary

Re: Index Size

2004-08-19 Thread Rob Jose
I did a little more research into my production indexes, and so far the first index is the only one that has any other files besides the CFS files. The other indexes that I have seen have just the deletable and segments files and a whole bunch of CFS files. Very interesting. Also worth noting is

about performance (newbie)

2004-08-19 Thread Wermus Fernando
Luceners, I have elements (accounts, contacts, tasks, events) where I have to find a word (hello, for example) in any field. Which is the best way to do that with Lucene? In other words, I have several elements in which I have to search for a word. I can make one search and then order the hits to

Indexing Scheduler

2004-08-19 Thread Natarajan.T
FYI, I want to schedule indexing according to user-configured values (date and time), using a job scheduler. How can I drive indexing from a job scheduler? If anyone has good experience with the Quartz Scheduler, please share it with me. Thanks, Natarajan.

Re: Index Size

2004-08-19 Thread Otis Gospodnetic
I thought this was the case. I believe there was a bug in one of the recent Lucene releases that caused old CFS files not to be removed when they should be removed. This resulted in your index directory containing a bunch of old CFS files consuming your disk space. Try getting a recent nightly

Re: Index Size

2004-08-19 Thread Rob Jose
Otis I am using Lucene 1.3 final. Would it help if I move to Lucene 1.4 final? Rob - Original Message - From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 7:13 AM Subject: Re: Index Size I thought this was the case. I

Re: Index Size

2004-08-19 Thread Otis Gospodnetic
Just go for 1.4.1 and look at the CHANGES.txt file to see if there were any index format changes. If there were, you'll need to re-index. Otis --- Rob Jose [EMAIL PROTECTED] wrote: Otis I am using Lucene 1.3 final. Would it help if I move to Lucene 1.4 final? Rob - Original Message

Re: Index Size

2004-08-19 Thread Rob Jose
Otis I upgraded to 1.4.1. I deleted all of my old indexes and started from scratch. I indexed 2 MB worth of text files and my index size is 8 MB. Would it be better if I stopped using the IndexWriter.addIndexes(IndexReader) method and instead traverse the IndexReader on the temp index and use

RE: Index Size

2004-08-19 Thread Armbrust, Daniel C.
Have you tried looking at the contents of this small index with Luke, to see what actually got put into it? Maybe one of your stored fields is being fed something you didn't expect. Dan

Re: Index Size

2004-08-19 Thread Rob Jose
Dan Thanks for your response. Yes, I have used Luke to look at the index and everything looks good. Rob - Original Message - From: Armbrust, Daniel C. [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Thursday, August 19, 2004 9:14 AM Subject: RE: Index Size Have you

Re: Index Size

2004-08-19 Thread Stephane James Vaucher
Stupid question: Are you sure you have the right number of docs in your index? i.e. you're not adding the same document twice into or via your tmp index. sv On Thu, 19 Aug 2004, Rob Jose wrote: Paul Thank you for your response. I have appended to the bottom of this message the field

Re: Index Size

2004-08-19 Thread Grant Ingersoll
How many fields do you have and what analyzer are you using? [EMAIL PROTECTED] 8/19/2004 11:54:25 AM Otis I upgraded to 1.4.1. I deleted all of my old indexes and started from scratch. I indexed 2 MB worth of text files and my index size is 8 MB. Would it be better if I stopped using the

Re: Index Size

2004-08-19 Thread Rob Jose
Grant Thanks for your response. I have fixed this issue. I have indexed 5 MB worth of text files and I now only use 224 KB. I was getting 80 MB. The only change I made was to change the way I merge my temp index into my prod index. My code changed from: prodWriter.setUseCompoundFile(true);

Debian build problem with 1.4.1

2004-08-19 Thread Jeff Breidenbach
Hi all, I am the Debian package maintainer for Lucene, and I'm having build problems with 1.4.1. We are very close to a major Debian release (code named 'sarge'), and the window for changes is very small. Can someone please help me in the next day or two, otherwise Debian stable will ship Lucene