Re: Index Size

2004-08-19 Thread Paul Elschot
On Wednesday 18 August 2004 22:44, Rob Jose wrote:
 Hello
 I have indexed several thousand (52 to be exact) text files and I keep
 running out of disk space to store the indexes.  The size of the documents
 I have indexed is around 2.5 GB.  The size of the Lucene indexes is around
 287 GB.  Does this seem correct?  I am not storing the contents of the

As noted, one would expect the index size to be about 35%
of the original text, i.e. about 2.5 GB * 35% = roughly 875 MB.
That is two orders of magnitude off from what you have.

Could you provide some more information about the field structure,
i.e. how many fields, which fields are stored, which fields are indexed,
any use of non-standard analyzers, and any non-standard
Lucene settings?

You might also try changing to the non-compound format to have a look
at the sizes of the individual index files; see the file formats document on the
Lucene web site. You can then see the total disk size of, for example, the
stored fields.
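
For illustration, a minimal sketch of rewriting an existing index in non-compound
format so the individual files can be inspected; this assumes the 1.3-era
IndexWriter API used elsewhere in this thread, and the index path is only an example:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class NonCompoundRewrite {
    public static void main(String[] args) throws Exception {
        // open the existing index (example path)
        IndexWriter writer = new IndexWriter("C:\\index", new StandardAnalyzer(), false);
        writer.setUseCompoundFile(false); // new segments get written as separate files
        writer.optimize();                // rewrites everything into one new segment
        writer.close();
        // The directory now holds separate files per data type (.fdt stored fields,
        // .tis/.tii term dictionary, .frq, .prx, ...) whose sizes can be compared.
    }
}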

Regards,
Paul Elschot





Re: Index Size

2004-08-19 Thread Honey George
Hi,
 Please check for hidden files in the index folder. If
you are using Linux, do something like

ls -al <index folder>

I am also facing a similar problem where the index
size is greater than the data size. In my case there
were some hidden temporary files which Lucene
creates, and they were taking half of the total size.

My problem is that after deleting the temporary files,
the index size is the same as the data size. That
again seems to be a problem. I have yet to find out the
reason.
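
For those on Windows (as Rob is), a small Java sketch that does the same
inspection; the directory path is an assumption:

import java.io.File;

public class ListIndexFiles {
    public static void main(String[] args) {
        File dir = new File("C:\\index"); // example index location
        File[] files = dir.listFiles();
        long total = 0;
        for (int i = 0; i < files.length; i++) {
            // print each index file with its size so the space hog is visible
            System.out.println(files[i].getName() + "\t" + files[i].length() + " bytes");
            total += files[i].length();
        }
        System.out.println("Total: " + total + " bytes");
    }
}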

Thanks,
   george


 --- Rob Jose [EMAIL PROTECTED] wrote: 
 Hello
 I have indexed several thousand (52 to be exact)
 text files and I keep running out of disk space to
 store the indexes.  The size of the documents I have
 indexed is around 2.5 GB.  The size of the Lucene
 indexes is around 287 GB.  Does this seem correct? 
 I am not storing the contents of the file, just
 indexing and tokenizing.  I am using Lucene 1.3
 final.  Can you guys let me know what you are
 experiencing?  I don't want to go into production
 with something that I should be configuring better. 
 
 
 I am not sure if this helps, but I have a temp index
 and a real index.  I index the file into the temp
 index, and then merge the temp index into the real
 index using the addIndexes method on the
 IndexWriter.  I have also set the production writer
 setUseCompoundFile to true.  I did not set this on
 the temp index.  The last thing that I do before
 closing the production writer is to call the
 optimize method.  
 
 I would really appreciate any ideas to get the index
 size smaller if it is at all possible.
 
 Thanks
 Rob 








RE: Index Size

2004-08-19 Thread Karthik N S
Guys

   Are you optimizing the index before closing it?

  If not, try it...  :}
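
For reference, a minimal sketch of that step against the 1.3-era IndexWriter
API (the path and analyzer are examples):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class OptimizeBeforeClose {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("C:\\index", new StandardAnalyzer(), false);
        // ... add documents here ...
        writer.optimize(); // merges all segments into one; old segment files become deletable
        writer.close();    // flushes buffers and releases the write lock
    }
}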



karthik







Re: Index Size

2004-08-19 Thread Bernhard Messer
Rob,
as Doug and Paul already mentioned, the index size is definitely too big :-(.
What could cause the problem, especially when running on a Windows
platform, is an IndexReader that stays open during the whole indexing process.
During indexing, the writer creates temporary segment files which are
merged into bigger segments. Once that is done, the old segment files are
deleted. If there is an open IndexReader, the environment is unable to
unlock the files and they stay in the index directory. You can
end up with an index several times bigger than the dataset.

Can you check your code for any open IndexReaders during indexing, or
paste the relevant part to the list so we can have a look at it?
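
A minimal sketch of the pattern Bernhard describes, assuming the 1.3-era API;
the point is only that no IndexReader is open on the index while the
IndexWriter runs (the path is an example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class SafeIndexing {
    public static void index(IndexReader openReader, Document doc) throws Exception {
        // Close any reader on the same index before writing, so Windows
        // can delete the old segment files during merges.
        if (openReader != null) {
            openReader.close();
        }
        IndexWriter writer = new IndexWriter("C:\\index", new StandardAnalyzer(), false);
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
        // Re-open a reader only after the writer is closed.
    }
}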

hope this helps
Bernhard


RE: Restoring a corrupt index

2004-08-19 Thread Honey George
This is what I did.

There are 2 classes in the Lucene source which are not
public and therefore cannot be accessed from outside
the package. The classes are
1. org.apache.lucene.index.SegmentInfos
   - the collection of segments
2. org.apache.lucene.index.SegmentInfo
   - represents a single segment

I took these two files and moved them to a separate folder,
then created a class with the following code fragment.

public void displaySegments(String indexDir)
    throws Exception
{
    Directory dir =
        (Directory)FSDirectory.getDirectory(indexDir, false);
    SegmentInfos segments = new SegmentInfos();
    segments.read(dir);

    StringBuffer str = new StringBuffer();
    int size = segments.size();
    str.append("Index Dir = " + indexDir);
    str.append("\nTotal Number of Segments " + size);
    str.append("\n--");
    for (int i = 0; i < size; i++)
    {
        str.append("\n");
        str.append((i+1) + ". ");
        str.append(((SegmentInfo)segments.get(i)).name);
    }
    str.append("\n--");

    System.out.println(str.toString());
}


public void deleteSegment(String indexDir, String segmentName)
    throws Exception
{
    Directory dir =
        (Directory)FSDirectory.getDirectory(indexDir, false);
    SegmentInfos segments = new SegmentInfos();
    segments.read(dir);

    int size = segments.size();
    String name = null;
    boolean found = false;
    for (int i = 0; i < size; i++)
    {
        name = ((SegmentInfo)segments.get(i)).name;
        if (segmentName.equals(name))
        {
            found = true;
            segments.remove(i);
            System.out.println("Deleted the segment with name " + name
                + " from the segments file");
            break;
        }
    }
    if (found)
    {
        segments.write(dir);
    }
    else
    {
        System.out.println("Invalid segment name: " + segmentName);
    }
}

Use the displaySegments() method to display the
segments and deleteSegment to delete the corrupt
segment.
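
A hypothetical driver for the two methods above, assuming they are placed in a
class named SegmentTool; the index path and the segment name "_a7" are examples,
and the real name should come from the displaySegments() output:

public static void main(String[] args) throws Exception
{
    SegmentTool tool = new SegmentTool(); // hypothetical class holding the two methods
    tool.displaySegments("C:\\index");       // list the segment names first
    tool.deleteSegment("C:\\index", "_a7");  // then drop the corrupt one by name
}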

Thanks,
  George

 --- Karthik N S [EMAIL PROTECTED] wrote: 
 Hi Guys

 In our situation we would be indexing millions
 and millions of information documents,
 with huge gigabytes of data indexed, finally
 put into a MERGED INDEX, categorized accordingly.

 There may be a possibility of corruption, so
 please do post the code referrals.

 Thx
 Karthik
 
 
 -----Original Message-----
 From: Honey George [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, August 18, 2004 5:51 PM
 To: Lucene Users List
 Subject: Re: Restoring a corrupt index

 Thanks Erik, that worked. I was able to remove the
 corrupt index and now it looks like the index is OK. I
 was able to view the number of documents in the index.
 Before that I was getting the error:
 java.io.IOException: read past EOF

 I am yet to find out how my index got corrupted. There
 is another thread going on about this topic:
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg03165.html

 If anybody is facing a similar problem and is interested
 in the code, I can post it here.

 Thanks,
   George

  --- Erik Hatcher [EMAIL PROTECTED] wrote:
  The details of the segments file (and all the
  others) are freely available here:
  http://jakarta.apache.org/lucene/docs/fileformats.html

  Also, there is Java code in Lucene, of course, that
  manipulates the segments file which could be leveraged
  (although probably package scoped and not easily usable
  in a standalone repair tool).

  Erik

  On Aug 18, 2004, at 6:50 AM, Honey George wrote:

   Looks like the problem is not with the hex editor; even
   in UltraEdit (I had access to a Windows box) I am
   seeing the same display. The problem is I am not able
   to identify where a record starts with just 1 record
   in the file.

   Need to try some alternate approach.

   Thanks,
     George
 
 
 
 
 
 




RE: Restoring a corrupt index

2004-08-19 Thread Karthik N S
Hi George,

   Do you think the same would work for MERGED indexes?
   Can you suggest a solution?

  Karthik


searchhelp

2004-08-19 Thread Santosh
Hi,

I am using the Lucene search engine for my application.

I am able to search through text files and HTML files as specified by Lucene.

Can you please clarify my doubts:

1. Can Lucene search through PDFs and Word documents? If yes, then how?

2. Can Lucene search through a database? If yes, then how?

Thank you,

Santosh







Re: searchhelp

2004-08-19 Thread Chandan Tamrakar
For PDF you need to extract text from the PDF files using the PDFBox library, and
for Word documents you can use the Apache POI APIs. There are messages
posted on the Lucene list related to your queries. About databases, I guess
someone must have done it. :)
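
A minimal sketch, assuming the LucenePDFDocument helper from PDFBox's
org.pdfbox.searchengine.lucene package (its javadoc is linked later in this
thread); the file and index paths are examples:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class IndexPdf {
    public static void main(String[] args) throws Exception {
        // LucenePDFDocument extracts the text and returns a ready-made Document
        Document doc = LucenePDFDocument.getDocument(new File("report.pdf"));
        IndexWriter writer = new IndexWriter("C:\\index", new StandardAnalyzer(), true);
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
    }
}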




Re: searchhelp

2004-08-19 Thread Zilverline info
The PDF and Word stuff has been done too: have a look at
http://www.zilverline.org.

Michael Franken


Re: searchhelp

2004-08-19 Thread Santosh
I recently joined the list and have not gone through any previous mails. If
you have any mails or related code, please forward them to me.



Re: searchhelp

2004-08-19 Thread Honey George
Hi,
  Note that Lucene only provides an API to build a
search engine; you can use it however you want. You
can pass data to indexing in 2 forms:
1. java.lang.String
2. java.io.Reader

What Lucene receives is one of the two objects above.
In the case of non-text documents you need to
extract the text information from the documents and
either create a text file and wrap it in a Reader
object or create a String object (for small files).

For indexing database contents, you need to write your
own APIs to get data from the database (using JDBC/EJB
etc.), convert the data to a String object and pass it
to Lucene for indexing.

Again, Lucene is not responsible for getting the data
from your application. It only indexes the data you
give it.
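
A minimal sketch of the database case above, under the assumption of a plain
JDBC source; the connection URL, table, and column names are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexDatabase {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection("jdbc:..."); // your JDBC URL here
        IndexWriter writer = new IndexWriter("C:\\index", new StandardAnalyzer(), true);
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT id, body FROM articles"); // example table
        while (rs.next()) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", rs.getString("id")));           // stored, not tokenized
            doc.add(Field.UnStored("contents", rs.getString("body"))); // indexed only
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
        con.close();
    }
}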

Also, for extracting contents from PDF & DOC
files I know of 2 more tools:
wvWare - for Word documents
pdftotext (xpdf) - for PDF documents.

Google around and you will get lots of links.

Hope this helps.

Thanks,
   George




RE: Restoring a corrupt index

2004-08-19 Thread Honey George
If I understand correctly, you have a situation where
you have a large main index, you create small
indexes, and finally merge them into the main index. It can
happen that halfway through merging the system
crashes and the index gets corrupted. I do not think
you can use my solution in this case.

What I am trying to do is to remove a corrupt segment
and its associated files from the index folder, not
fix a corrupt segment. This way at least I can add
new documents to the index. Of course I am sure I
didn't lose anything, because my corrupt index file was
actually 0 bytes.


Thanks,
  George


Re: searchhelp

2004-08-19 Thread Chandan Tamrakar
For PDF you can refer to www.pdfbox.org, and please check the Apache POI project
on the jakarta.apache.org site for indexing MS documents.




RE: searchhelp

2004-08-19 Thread David Townsend
JGURU FAQ
http://www.jguru.com/faq/Lucene

OFFICIAL FAQ
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi

MAIL ARCHIVE
http://www.mail-archive.com/[EMAIL PROTECTED]/

hope this helps.





Re: searchhelp

2004-08-19 Thread Santosh
Thanks everybody,

but I didn't get any code or any real help from these links.
Has anybody performed this search before? If yes, then please send me the
code, or tell me what code I have to add to my present Lucene setup.



Re: searchhelp

2004-08-19 Thread Reyhood Farhan
As far as I remember, the PDFBox release includes some existing code to
index PDFs with Lucene, based upon the demo created for Lucene 1.3. In
fact, I think the code only works with Lucene 1.3 - something to do with
a change from arrays to Vectors in Lucene 1.4. I may be wrong though.

http://www.csh.rit.edu/~ben/projects/pdfbox/javadoc/org/pdfbox/searchengine/lucene/package-summary.html





RE: Re: Re: OutOfMemoryError

2004-08-19 Thread Otis Gospodnetic
Terence,

Calling close() on IndexSearcher will not release the memory
immediately.  It will only release resources (e.g. other Java objects
used by IndexSearcher), and it is up to the JVM's garbage collector to
actually reclaim/release the previously used memory.  There are
command-line parameters you can use to tune garbage collection.  Here
is one example:

java -XX:+UseParallelGC -XX:PermSize=20M -XX:MaxNewSize=32M
-XX:NewSize=32M .

This works with Sun's JVM.  The above is just an example - you need to
play with the options and see what works for you.  There are other
options, too:

-Xnoclassgc        disable class garbage collection
-Xincgc            enable incremental garbage collection
-Xloggc:<file>     log GC status to a file with time stamps
-Xbatch            disable background compilation
-Xms<size>         set initial Java heap size
-Xmx<size>         set maximum Java heap size
-Xss<size>         set java thread stack size
-Xprof             output cpu profiling data
-Xrunhprof[:help]|[:<option>=<value>, ...]
                   perform JVMPI heap, cpu, or monitor profiling
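
To see whether reclaim is actually happening, a small self-contained probe
(an assumption of this note, not something from the thread) can log heap use
around an explicit collection hint:

public class MemoryProbe {
    public static void printUsage(String label) {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        System.out.println(label + ": " + (used / 1024) + " KB in use");
    }

    public static void main(String[] args) throws Exception {
        printUsage("before");
        System.gc();        // only a hint; the JVM decides when to collect
        Thread.sleep(1000); // give the collector a moment
        printUsage("after");
    }
}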


Otis



--- Terence Lai [EMAIL PROTECTED] wrote:

 Hi David,

 In my test program, I invoke the IndexSearcher.close() method at the
 end of the loop. However, it doesn't seem to release the memory. My
 concern is that even though I put the IndexSearcher.close() statement
 in the hook methods, it may not release all the memory until the
 application server is shut down. Every time the EJB object is
 re-activated, a new IndexSearcher is opened. If the resources allocated
 to the previous IndexSearcher cannot be fully released, the system
 will use up more memory. Eventually, it may run into the
 OutOfMemoryError.

 I am not very familiar with EJB. My interpretation could be wrong. I
 am going to try the hook methods. Thanks for pointing this out to me.

 Terence

   I tried to reuse the IndexSearcher, but I have another question.
   What happens if an application server unloads the class after it is
   idle for a while, and then re-instantiates the object back when it
   receives a new request?
 
  The EJB spec takes this into account, as there are hook methods you can
  define that get called when your EJB object is about to be passivated or
  activated. Search for something like passivate/activate and/or
  ejbLoad/ejbSave. This is where you should close/open your single index
  searcher object.
 
  --
  Cheers,
  David





RE: Re: OutOfMemoryError

2004-08-19 Thread Otis Gospodnetic
Use the life-cycle hooks mentioned in another email
(activate/passivate) and when you detect that the server is about to
unload your class, call close() on IndexSearcher.  I haven't used
Lucene in an EJB environment, so I don't know the details,
unfortunately. :(

Your simulation may be too fast for the JVM.  Like I mentioned in the
previous email, close() doesn't release the memory, it's the JVM that
has to reclaim it.  Your for loop is very fast (no pauses anywhere,
probably), so maybe the garbage collector doesn't have time to reclaim
the needed memory.  I don't know enough about the low-level JVM stuff
to be certain about this statement, but you could try adding some
Thread.sleep calls in your test code.
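
For contrast, a minimal sketch of the reuse Otis recommends: one IndexSearcher
opened outside the loop and closed once at the end. The path, query, and sort
field mirror the test program quoted below but are only examples:

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class ReuseSearcher {
    public static void run(Query query) throws Exception {
        IndexSearcher is = new IndexSearcher("C:\\index"); // open once
        Sort sort = new Sort(new SortField("profile_modify_date", SortField.STRING, true));
        for (int i = 0; i < 1000; i++) {
            Hits hits = is.search(query, sort); // reuse the same searcher every time
        }
        is.close(); // close once, when no thread needs it any more
    }
}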

Otis

--- Terence Lai [EMAIL PROTECTED] wrote:

 Hi,

 I tried to reuse the IndexSearcher, but I have another question. What
 happens if an application server unloads the class after it is idle
 for a while, and then re-instantiates the object when it receives
 a new request?

 Every time the server re-instantiates the class, a new IndexSearcher
 instance will be created. If the IndexSearcher.close() method does
 not release all the memory and the server keeps unloading and
 re-instantiating the class, it will eventually hit the
 OutOfMemoryError issue. The test program from my previous email
 simulates this condition. The reason why I instantiate/close the
 IndexSearcher inside the loop is to simulate the scenario where the
 server unloads and re-instantiates the object. I think the same
 issue would happen if the application were written as a servlet.

 Although the singleton pattern may resolve the problem described
 above, it isn't permitted by the J2EE spec according to some
 newsletters. In other words, I can't use the singleton pattern in
 EJB. Please correct me if I am wrong on this.

 Thanks,
 Terence
 
  Reuse your IndexSearcher! :)

  Also, I think somebody has written some EJB stuff to work with Lucene.
  The project is on SF.net.

  Otis

  --- Terence Lai [EMAIL PROTECTED] wrote:

   Hi All,

   I am getting an OutOfMemoryError when I deploy my EJB application. To
   debug the problem, I wrote the following test program:
   
public static void main(String[] args) {
    try {
        Query query = getQuery();

        for (int i = 0; i < 1000; i++) {
            search(query);

            if (i % 50 == 0) {
                System.out.println("Sleep...");
                Thread.currentThread().sleep(5000);
                System.out.println("Wake up!");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

private static void search(Query query) throws IOException {
    FSDirectory fsDir = null;
    IndexSearcher is = null;
    Hits hits = null;

    try {
        fsDir = FSDirectory.getDirectory("C:\\index", false);
        is = new IndexSearcher(fsDir);
        SortField sortField = new SortField("profile_modify_date",
            SortField.STRING, true);

        hits = is.search(query, new Sort(sortField));
    } finally {
        if (is != null) {
            try {
                is.close();
            } catch (Exception ex) {
            }
        }

        if (fsDir != null) {
            try {
                fsDir.close(); // note: the code as originally posted called is.close() here a second time
            } catch (Exception ex) {
            }
        }
    }
}
   
   In the test program, I wrote a loop that keeps calling the search
   method. Every time it enters the search method, it instantiates
   the IndexSearcher. Before exiting the method, I close the
   IndexSearcher and FSDirectory. I also make the thread sleep for 5
   seconds every 50 searches. Hopefully, this gives the JVM some time
   to do garbage collection. Unfortunately, when I observe
   the memory usage of my process, it keeps increasing until I get
   java.lang.OutOfMemoryError.

   Note that I invoke IndexSearcher.search(Query query, Sort sort)
   to process the search. If I don't specify the sort field (i.e. using
   IndexSearcher.search(query)), I don't have this problem, and the
   memory usage stays at a very static level.

   Does anyone experience a similar problem? Did I do something wrong in
   the test program? I thought that by closing the IndexSearcher and the
   FSDirectory, the memory would be released during garbage
   collection.

   Thanks,
   Terence
   
   
   
   

RE: Re: OutOfMemoryError

2004-08-19 Thread Otis Gospodnetic
Terence,

 2) I have a background process to update the index files. If I keep
 the IndexSearcher opened, I am not sure whether it will pick up the
 changes from the index updates done in the background process.

This is a frequently asked question.  Basically, you have to make use
of IndexReader's method for checking the index version. You can do it
as often as you want; it's really up to you. When you detect that
the index has been modified, throw away the old IndexSearcher and make
a new one. If you are sure nobody is using your old IndexSearcher, you
can close() it, but if somebody (e.g. another thread) is still using it
and you close() it, you will get an error.
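
A minimal sketch of that check, assuming the static IndexReader.getCurrentVersion()
available in Lucene 1.4; the path is an example and the close() call carries
the caveat above:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherCache {
    private IndexSearcher searcher;
    private long version = -1;
    private String path = "C:\\index"; // hypothetical index location

    public synchronized IndexSearcher getSearcher() throws Exception {
        long current = IndexReader.getCurrentVersion(path);
        if (searcher == null || current != version) {
            // the index changed on disk: swap in a fresh searcher
            if (searcher != null) {
                searcher.close(); // only safe if no other thread still uses it
            }
            searcher = new IndexSearcher(path);
            version = current;
        }
        return searcher;
    }
}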

Otis






Re: Index Size

2004-08-19 Thread Rob Jose
Paul
Thank you for your response.  I have appended to the bottom of this message
the field structure that I am using.  I hope that this helps.  I am using
the StandardAnalyzer.  I do not believe that I am changing any default
values, but I have also appended the code that adds the temp index to the
production index.

Thanks for your help
Rob

Here is the code that describes the field structure.
public static Document Document(String contents, String path, Date modified,
    String runDate, String totalpages, String pagecount, String countycode,
    String reportnum, String reportdescr)
{
    SimpleDateFormat showFormat = new
        SimpleDateFormat(TurbineResources.getString("date.default.format"));

    SimpleDateFormat searchFormat = new SimpleDateFormat("MMdd");

    Document doc = new Document();

    doc.add(Field.Keyword("path", path));

    doc.add(Field.Keyword("modified", showFormat.format(modified)));

    doc.add(Field.UnStored("searchDate", searchFormat.format(modified)));

    doc.add(Field.Keyword("runDate", runDate == null ? "" : runDate));

    doc.add(Field.UnStored("searchRunDate",
        runDate == null ? "" : runDate.substring(6) + runDate.substring(0,2) + runDate.substring(3,5)));

    doc.add(Field.Keyword("reportnum", reportnum));

    doc.add(Field.Text("reportdescr", reportdescr));

    doc.add(Field.UnStored("cntycode", countycode));

    doc.add(Field.Keyword("totalpages", totalpages));

    doc.add(Field.Keyword("page", pagecount));

    doc.add(Field.UnStored("contents", contents));

    return doc;
}



Here is the code that adds the temp index to the production index.

File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);

tempReader = IndexReader.open(tempFile);

try
{
    boolean createIndex = false;

    File f = new File(sIndex + File.separatorChar + sCntyCode);

    if (!f.exists())
    {
        createIndex = true;
    }

    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode, new
        StandardAnalyzer(), createIndex);
}
catch (Exception e)
{
    IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar +
        sCntyCode, false));

    CasesReports.log("Tried to Unlock " + sIndex);

    prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);

    CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar +
        sCntyCode);
}

prodWriter.setUseCompoundFile(true);

prodWriter.addIndexes(new IndexReader[] { tempReader });








Re: Index Size

2004-08-19 Thread Rob Jose
Hey George
Thanks for responding.  I am using Windows and I don't see any hidden files.
I have a ton of CFS files (1366/1405).  I have 22 F# (F1, F2, etc.) files.
I have two FDT files and two FDX files, and three FNM files.  Add these
files to the deletable and segments files and that is all of the files that I
have.  The CFS files are approximately 11 MB each.  The totals I gave you
before were for all of my indexes together.  This particular index has a
size of 21.6 GB.  The files that it indexed have a size of 89 MB.

OK - I just removed all of the CFS files from the directory and I can still
read my indexes.  So now I have to ask: what are these CFS files?  Why are
they created?  And how can I get rid of them if I don't need them?  I will
also take a look at the Lucene website to see if I can find any information.

Thanks
Rob
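
The CFS files are Lucene's compound-format segment files, produced because the
production writer calls setUseCompoundFile(true); each .cfs bundles one
segment's individual files into a single file.  A minimal sketch of switching a
writer to non-compound format so the per-segment files (and their sizes) become
visible; the path here is a placeholder and the Lucene 1.3/1.4 API is assumed:

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.setUseCompoundFile(false); // new segments get individual files instead of one .cfs
writer.optimize();                // rewrites the whole index in the current format
writer.close();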




Re: Index Size

2004-08-19 Thread Rob Jose
Karthik
Thanks for responding.  Yes, I optimize right before I close the index
writer.  I added this a little while ago to try and get the size down.

Rob



Re: Index Size

2004-08-19 Thread Rob Jose
Bernhard
Thanks for responding.  I do have an IndexReader open on the Temp index.  I
pass this IndexReader into the addIndexes method on the IndexWriter to add
these files.  I did notice that I have a ton of CFS files that I removed and
was still able to read the indexes.  Are these the temporary segment files
you are talking about?  Here is my code that adds the temp index to the prod
index.
File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
tempReader = IndexReader.open(tempFile);

try
{
    boolean createIndex = false;
    File f = new File(sIndex + File.separatorChar + sCntyCode);
    if (!f.exists())
    {
        createIndex = true;
    }
    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
        new StandardAnalyzer(), createIndex);
}
catch (Exception e)
{
    IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar +
        sCntyCode, false));
    CasesReports.log("Tried to Unlock " + sIndex);
    prodWriter = new IndexWriter(sIndex, new StandardAnalyzer(), false);
    CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar +
        sCntyCode);
}

prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });



Am I doing something wrong?  Any help would be extremely appreciated.



Thanks

Rob

- Original Message - 
From: Bernhard Messer [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 1:09 AM
Subject: Re: Index Size


Rob,

as Doug and Paul already mentioned, the index size is definitely too big :-(.

What could cause the problem, especially when running on a Windows
platform, is that an IndexReader is open during the whole index process.
During indexing, the writer creates temporary segment files which are
merged into bigger segments. Once that is done, the old segment files are
deleted. If there is an open IndexReader, the environment is unable to
delete the files and they stay in the index directory. You will
end up with an index several times bigger than the dataset.

Can you check your code for any open IndexReaders when indexing, or
paste the relevant part to the list so we can have a look at it.

hope this helps
Bernhard
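
A minimal sketch of keeping the temp reader open only for the duration of the
merge, reusing the names from the snippet above (Lucene 1.3/1.4 API assumed;
exception handling trimmed):

IndexReader tempReader = IndexReader.open(tempFile);
try
{
    prodWriter.addIndexes(new IndexReader[] { tempReader });
}
finally
{
    tempReader.close();  // let the environment delete merged-away segment files
    prodWriter.close();  // flush and release the production index
}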





Re: Index Size

2004-08-19 Thread Rob Jose
I did a little more research into my production indexes, and so far the
first index is the only one that has any other files besides the CFS files.
The other indexes that I have seen have just the deletable and segments
files and a whole bunch of CFS files.  Very interesting.  Also worth noting
is that once in a while one of the production indexes will have a 0-length
FNM file.

Rob



about performance (newbie)

2004-08-19 Thread Wermus Fernando
Luceners,
I have elements (accounts, contacts, tasks, events) in which I have to find
a word (hello, for example) in any field.  Which is the best way to do that
with Lucene?

In other words, I have several element types in which I have to search for a
word.  I can make one search and then sort the hits to separate the element
types, or I can make as many searches as I have element types.  Which of
these options is better?

I'm using MultiFieldQueryParser
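
A minimal sketch of the single-search option with MultiFieldQueryParser
(Lucene 1.4 API assumed; the field names, index path, and the stored "type"
field are invented for illustration):

String[] fields = { "name", "description", "notes" };  // hypothetical field names
Query query = MultiFieldQueryParser.parse("hello", fields, new StandardAnalyzer());
IndexSearcher searcher = new IndexSearcher("/path/to/index");
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++)
{
    Document doc = hits.doc(i);
    // a stored keyword field holding the element type would let one
    // separate accounts/contacts/tasks/events from a single hit list
    System.out.println(doc.get("type") + ": " + doc.get("name"));
}
searcher.close();

One search plus a type field is usually cheaper than one search per element
type, since every extra search walks the index again.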
 
 


Indexing Scheduler

2004-08-19 Thread Natarajan.T
FYI,

I want to run indexing according to user-configured settings (date and
time), i.e. via a job scheduler.  How can I hook indexing up to a job
scheduler?

If anyone has good experience with the Quartz Scheduler, please share it
with me.

Thanks,
Natarajan.
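
A rough sketch of one way to wire an indexing job to Quartz 1.x (the class,
group, and trigger names and the cron expression are made up for illustration;
exception handling omitted):

import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class IndexJob implements Job
{
    public void execute(JobExecutionContext context) throws JobExecutionException
    {
        // open an IndexWriter here, add the new/changed documents,
        // then optimize and close it
    }
}

// at application startup:
Scheduler scheduler = new StdSchedulerFactory().getScheduler();
JobDetail job = new JobDetail("indexJob", "indexing", IndexJob.class);
// cron fields: second minute hour day-of-month month day-of-week
Trigger trigger = new CronTrigger("indexTrigger", "indexing", "0 30 2 * * ?");
scheduler.scheduleJob(job, trigger);
scheduler.start();

The cron expression above would run the job every day at 2:30 AM; it can be
built from whatever date and time the user configures.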


Re: Index Size

2004-08-19 Thread Otis Gospodnetic
I thought this was the case.  I believe there was a bug in one of the
recent Lucene releases that caused old CFS files not to be removed when
they should be removed.  This resulted in your index directory
containing a bunch of old CFS files consuming your disk space.

Try getting a recent nightly build and see if using that takes care of
your problem.

Otis




Re: Index Size

2004-08-19 Thread Rob Jose
Otis
I am using Lucene 1.3 final.  Would it help if I moved to Lucene 1.4 final?

Rob



Re: Index Size

2004-08-19 Thread Otis Gospodnetic
Just go for 1.4.1 and look at the CHANGES.txt file to see if there were
any index format changes.  If there were, you'll need to re-index.

Otis




Re: Index Size

2004-08-19 Thread Rob Jose
Otis
I upgraded to 1.4.1.  I deleted all of my old indexes and started from
scratch.  I indexed 2 MB worth of text files and my index size is 8 MB.
Would it be better if I stopped using the
IndexWriter.addIndexes(IndexReader) method and instead traversed the
IndexReader on the temp index and used the
IndexWriter.addDocument(Document) method?

Thanks again for your input, I appreciate it.

Rob



RE: Index Size

2004-08-19 Thread Armbrust, Daniel C.
Have you tried looking at the contents of this small index with Luke, to see what 
actually got put into it?  Maybe one of your stored fields is being fed something you 
didn't expect.

Dan 




Re: Index Size

2004-08-19 Thread Rob Jose
Dan
Thanks for your response.  Yes, I have used Luke to look at the index and
everything looks good.

Rob



Re: Index Size

2004-08-19 Thread Stephane James Vaucher
Stupid question:

Are you sure you have the right number of docs in your index? i.e. you're
not adding the same document twice, either into or via your tmp index.

sv
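
A quick way to check the counts, reusing the names from Rob's snippet (the
CasesReports logger is from his code; a plain System.out.println would do):

IndexReader prodReader = IndexReader.open(sIndex + File.separatorChar + sCntyCode);
CasesReports.log("temp docs: " + tempReader.numDocs() +
    ", prod docs: " + prodReader.numDocs());
prodReader.close();

If the production count grows by more than the temp count after each merge,
documents are being added more than once.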




Re: Index Size

2004-08-19 Thread Grant Ingersoll
How many fields do you have and what analyzer are you using?




Re: Index Size

2004-08-19 Thread Rob Jose
Grant
Thanks for your response.  I have fixed this issue.  I have indexed 5 MB
worth of text files and the index now only uses 224 KB.  I was getting 80 MB
before.  The only change I made was to the way I merge my temp index into my
prod index.  My code changed from:
prodWriter.setUseCompoundFile(true);

prodWriter.addIndexes(new IndexReader[] { tempReader });

To:

int iNumDocs = tempReader.numDocs();
for (int y = 0; y < iNumDocs; y++)
{
    Document tempDoc = tempReader.document(y);
    prodWriter.addDocument(tempDoc);
}



I don't know if this is a bug in the IndexWriter.addIndexes(IndexReader)
method or something else I am doing that caused this, but I am getting much
better results now.



Thanks to everyone who helped, I really appreciate it.



Rob
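
One caveat on the new loop: IndexReader.document(y) returns only the stored
fields of a document, so unstored fields (such as the contents field in the
Document factory posted earlier in this thread) are not carried over to the
production index this way, and that alone could explain much of the size drop.
A fuller sketch of the loop with deleted-document handling and cleanup
(Lucene 1.4 API assumed; exception handling trimmed):

int iMaxDoc = tempReader.maxDoc();
for (int y = 0; y < iMaxDoc; y++)
{
    if (tempReader.isDeleted(y))
    {
        continue; // skip documents deleted from the temp index
    }
    // document(y) holds stored fields only; unstored fields (e.g. contents)
    // would have to be re-analyzed from the source text to stay searchable
    prodWriter.addDocument(tempReader.document(y));
}
tempReader.close();
prodWriter.optimize();
prodWriter.close();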


Debian build problem with 1.4.1

2004-08-19 Thread Jeff Breidenbach

Hi all,

I am the Debian package maintainer for Lucene, and I'm having build
problems with 1.4.1. We are very close to a major Debian release (code
named 'sarge'), and the window for changes is very small. Can someone
please help me in the next day or two, otherwise Debian stable will ship
Lucene 1.4-final for the next couple of years. It looks to me like the
problem is in javacc generated code, and it's not obvious to me what
to do.

For Debian sarge or sid users out there who want to reproduce the
build problem, download the lucene 1.4.1 source tarball, then:

  apt-get install devscripts
  apt-get source liblucene-java
  cd lucene-1.4
  uupdate -v 1.4.1 ../lucene-1.4.1-src.tar.gz 
  cd ../lucene-1.4.1
  debuild -us -uc

Cheers,
Jeff

=


compile-core:
[mkdir] Created dir: /tmp/lucene/lucene-1.4.1/build/classes/java
[javac] Compiling 160 source files to /tmp/lucene/lucene-1.4.1/build/classes/java
[javac] 
/tmp/lucene/lucene-1.4.1/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java:15:
 cannot resolve symbol
[javac] symbol  : class Reader
[javac] location: class org.apache.lucene.analysis.standard.StandardTokenizer
[javac]   public StandardTokenizer(Reader reader) {
[javac]^
[javac] 
/tmp/lucene/lucene-1.4.1/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java:24:
 cannot resolve symbol
[javac] symbol  : class IOException
[javac] location: class org.apache.lucene.analysis.standard.StandardTokenizer
[javac]   final public org.apache.lucene.analysis.Token next() throws 
ParseException, IOException {
[javac]
   ^
[javac] 
/tmp/lucene/lucene-1.4.1/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java:15:
 recursive constructor invocation
[javac]   public StandardTokenizer(Reader reader) {
[javac]  ^
[javac] 3 errors
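
For the record, the first two errors say the generated StandardTokenizer.java
cannot resolve java.io.Reader and java.io.IOException, which is what one would
expect if the build regenerated the file with a different javacc version and
dropped the import block from StandardTokenizer.jj; the recursive constructor
invocation error also suggests a javacc version mismatch. A quick local check,
not a confirmed fix, would be adding the imports to the generated file by hand:

import java.io.IOException;
import java.io.Reader;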
