Re: Index Size

2004-08-19 Thread Rob Jose
Paul
Thank you for your response.  I have appended to the bottom of this message
the field structure that I am using.  I hope that this helps.  I am using
the StandardAnalyzer.  I do not believe that I am changing any default
values, but I have also appended the code that adds the temp index to the
production index.

Thanks for your help
Rob

Here is the code that describes the field structure.
public static Document Document(String contents, String path, Date modified,
    String runDate, String totalpages, String pagecount, String countycode,
    String reportnum, String reportdescr)
{
    SimpleDateFormat showFormat =
        new SimpleDateFormat(TurbineResources.getString("date.default.format"));
    // pattern inferred from the searchRunDate substring logic below
    SimpleDateFormat searchFormat = new SimpleDateFormat("yyyyMMdd");

    Document doc = new Document();
    doc.add(Field.Keyword("path", path));                                 // stored, indexed, not tokenized
    doc.add(Field.Keyword("modified", showFormat.format(modified)));
    doc.add(Field.UnStored("searchDate", searchFormat.format(modified))); // indexed only
    doc.add(Field.Keyword("runDate", runDate == null ? "" : runDate));
    doc.add(Field.UnStored("searchRunDate", runDate == null ? "" :
        runDate.substring(6) + runDate.substring(0, 2) + runDate.substring(3, 5)));
    doc.add(Field.Keyword("reportnum", reportnum));
    doc.add(Field.Text("reportdescr", reportdescr));                      // stored, indexed, tokenized
    doc.add(Field.UnStored("cntycode", countycode));
    doc.add(Field.Keyword("totalpages", totalpages));
    doc.add(Field.Keyword("page", pagecount));
    doc.add(Field.UnStored("contents", contents));                        // indexed only, not stored
    return doc;
}



Here is the code that adds the temp index to the production index.

File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
tempReader = IndexReader.open(tempFile);
try
{
    boolean createIndex = false;
    File f = new File(sIndex + File.separatorChar + sCntyCode);
    if (!f.exists())
    {
        createIndex = true;
    }
    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
        new StandardAnalyzer(), createIndex);
}
catch (Exception e)
{
    // the index was probably left locked; force-unlock and retry
    IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar +
        sCntyCode, false));
    CasesReports.log("Tried to Unlock " + sIndex);
    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
        new StandardAnalyzer(), false);
    CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar +
        sCntyCode);
}
prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });





- Original Message - 
From: Paul Elschot [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 12:16 AM
Subject: Re: Index Size


On Wednesday 18 August 2004 22:44, Rob Jose wrote:
 Hello
 I have indexed several thousand (52 to be exact) text files and I keep
 running out of disk space to store the indexes.  The size of the documents
 I have indexed is around 2.5 GB.  The size of the Lucene indexes is around
 287 GB.  Does this seem correct?  I am not storing the contents of the

As noted, one would expect the index size to be about 35%
of the original text, i.e. about 2.5 GB * 35% = 800 MB.
That is more than two orders of magnitude off from what you have.

Could you provide some more information about the field structure,
i.e. how many fields, which fields are stored, which fields are indexed,
any use of non-standard analyzers, and any non-standard
Lucene settings?

You might also try changing to the non-compound format to have a look
at the sizes of the individual index files; see the file formats page on the
Lucene web site.
You can then see the total disk size of, for example, the stored fields.
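
For example, something like this would do it (a minimal sketch against the
Lucene 1.3/1.4 API; the index path is a placeholder):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
writer.setUseCompoundFile(false);  // newly written segments use individual files
writer.optimize();                 // rewrites the whole index in non-compound form
writer.close();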

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Rob Jose
Hey George
Thanks for responding.  I am using Windows and I don't see any hidden files.
I have a ton of CFS files (1366/1405).  I have 22 F# (F1, F2, etc.) files.
I have two FDT files and two FDX files, and three FNM files.  Add these
files to the deletable and segments files and that is all of the files that I
have.  The CFS files are approximately 11 MB each.  The totals I gave you
before were for all of my indexes together.  This particular index has a
size of 21.6 GB.  The files that it indexed have a size of 89 MB.

OK - I just removed all of the CFS files from the directory and I can still
read my indexes.  So now I have to ask: what are these CFS files?  Why are
they created?  And how can I get rid of them if I don't need them?  I will
also take a look at the Lucene website to see if I can find any information.

Thanks
Rob

- Original Message - 
From: Honey George [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 12:29 AM
Subject: Re: Index Size


Hi,
 Please check for hidden files in the index folder. If
you are using linux, do something like

ls -al <index folder>

I am also facing a similar problem where the index
size is greater than the data size. In my case there
were some hidden temporary files which Lucene
creates; they were taking half of the total size.

My problem is that after deleting the temporary files,
the index size is the same as the data size. That
again seems to be a problem. I have yet to find out the
reason.

Thanks,
   george





Re: Index Size

2004-08-19 Thread Rob Jose
Karthik
Thanks for responding.  Yes, I optimize right before I close the index
writer.  I added this a little while ago to try and get the size down.

Rob
- Original Message - 
From: Karthik N S [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 12:59 AM
Subject: RE: Index Size


Guys

   Are you optimizing the index before closing it?

  If not, try using it...  :}



karthik







Re: Index Size

2004-08-19 Thread Rob Jose
Bernhard
Thanks for responding.  I do have an IndexReader open on the Temp index.  I
pass this IndexReader into the addIndexes method on the IndexWriter to add
these files.  I did notice that I have a ton of CFS files that I removed and
was still able to read the indexes.  Are these the temporary segment files
you are talking about?  Here is my code that adds the temp index to the prod
index.
File tempFile = new File(sIndex + File.separatorChar + "temp" + sCntyCode);
tempReader = IndexReader.open(tempFile);
try
{
    boolean createIndex = false;
    File f = new File(sIndex + File.separatorChar + sCntyCode);
    if (!f.exists())
    {
        createIndex = true;
    }
    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
        new StandardAnalyzer(), createIndex);
}
catch (Exception e)
{
    // the index was probably left locked; force-unlock and retry
    IndexReader.unlock(FSDirectory.getDirectory(sIndex + File.separatorChar +
        sCntyCode, false));
    CasesReports.log("Tried to Unlock " + sIndex);
    prodWriter = new IndexWriter(sIndex + File.separatorChar + sCntyCode,
        new StandardAnalyzer(), false);
    CasesReports.log("Successfully Unlocked " + sIndex + File.separatorChar +
        sCntyCode);
}
prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });



Am I doing something wrong?  Any help would be extremely appreciated.



Thanks

Rob

- Original Message - 
From: Bernhard Messer [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 1:09 AM
Subject: Re: Index Size


Rob,

as Doug and Paul already mentioned, the index size is definitely too big :-(.

What could cause the problem, especially when running on a Windows
platform, is that an IndexReader is open during the whole index process.
During indexing, the writer creates temporary segment files which will
be merged into bigger segments. Once done, the old segment files will be
deleted. If there is an open IndexReader, the environment is unable to
delete the files, and they stay in the index directory. You will
end up with an index several times bigger than the dataset.

Can you check your code for any open IndexReaders when indexing, or
paste the relevant part to the list so we can have a look at it.

hope this helps
Bernhard
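
In code, the pattern Bernhard describes looks roughly like this (a sketch
only; the path and the surrounding flow are assumptions, not code from the
thread):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

IndexReader reader = IndexReader.open("/path/to/index");
// ... run searches ...
reader.close();  // release file handles BEFORE the writer merges segments

IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
// ... addDocument calls trigger merges; the old segment files can now be deleted
writer.optimize();
writer.close();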





Re: Index Size

2004-08-19 Thread Rob Jose
I did a little more research into my production indexes, and so far the
first index is the only one that has any other files besides the CFS files.
The other indexes that I have seen have just the deletable and segments
files and a whole bunch of CFS files.  Very interesting.  Also worth noting
is that once in a while one of the production indexes will have a 0-length
FNM file.

Rob



Re: Index Size

2004-08-19 Thread Rob Jose
Otis
I am using Lucene 1.3 final.  Would it help if I moved to Lucene 1.4 final?

Rob
- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 7:13 AM
Subject: Re: Index Size


I thought this was the case.  I believe there was a bug in one of the
recent Lucene releases that caused old CFS files not to be removed when
they should be.  This resulted in your index directory
containing a bunch of old CFS files consuming your disk space.

Try getting a recent nightly build and see if using that takes care of
your problem.

Otis




Re: Index Size

2004-08-19 Thread Rob Jose
Otis
I upgraded to 1.4.1.  I deleted all of my old indexes and started from
scratch.  I indexed 2 MB worth of text files and my index size is 8 MB.
Would it be better if I stopped using the
IndexWriter.addIndexes(IndexReader) method and instead traversed the
IndexReader on the temp index and used the
IndexWriter.addDocument(Document) method?

Thanks again for your input, I appreciate it.

Rob
- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 8:00 AM
Subject: Re: Index Size


Just go for 1.4.1 and look at the CHANGES.txt file to see if there were
any index format changes.  If there were, you'll need to re-index.

Otis




Re: Index Size

2004-08-19 Thread Rob Jose
Dan
Thanks for your response.  Yes, I have used Luke to look at the index and
everything looks good.

Rob
- Original Message - 
From: Armbrust, Daniel C. [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 9:14 AM
Subject: RE: Index Size


Have you tried looking at the contents of this small index with Luke, to see
what actually got put into it?  Maybe one of your stored fields is being fed
something you didn't expect.

Dan

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Index Size

2004-08-19 Thread Rob Jose
Grant
Thanks for your response.  I have fixed this issue.  I have indexed 5 MB
worth of text files and I now only use 224 KB.  I was getting 80 MB.  The
only change I made was to change the way I merge my temp index into my prod
index.  My code changed from:
prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });

To:

int iNumDocs = tempReader.numDocs();
for (int y = 0; y < iNumDocs; y++) {
    Document tempDoc = tempReader.document(y);
    prodWriter.addDocument(tempDoc);
}
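
Two caveats about this loop, as general Lucene behavior rather than anything
confirmed in the thread: a Document read back from an IndexReader contains
only its stored fields, so UnStored fields (such as contents above) are not
carried over and re-indexed by addDocument, which by itself would account for
a much smaller index; and if the temp index can contain deletions, the loop
copies deleted documents too unless it skips them with tempReader.isDeleted(y).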



I don't know if this is a bug in the IndexWriter.addIndexes(IndexReader)
method or something else I am doing that caused this, but I am getting much
better results now.



Thanks to everyone who helped, I really appreciate it.



Rob

- Original Message - 
From: Grant Ingersoll [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, August 19, 2004 10:51 AM
Subject: Re: Index Size


How many fields do you have and what analyzer are you using?


Index Size

2004-08-18 Thread Rob Jose
Hello
I have indexed several thousand (52 to be exact) text files and I keep running out of 
disk space to store the indexes.  The size of the documents I have indexed is around 
2.5 GB.  The size of the Lucene indexes is around 287 GB.  Does this seem correct?  I 
am not storing the contents of the file, just indexing and tokenizing.  I am using 
Lucene 1.3 final.  Can you guys let me know what you are experiencing?  I don't want 
to go into production with something that I should be configuring better.  

I am not sure if this helps, but I have a temp index and a real index.  I index the 
file into the temp index, and then merge the temp index into the real index using the 
addIndexes method on the IndexWriter.  I have also set the production writer 
setUseCompoundFile to true.  I did not set this on the temp index.  The last thing 
that I do before closing the production writer is to call the optimize method.  
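
In code, that workflow looks roughly like this (a sketch reconstructed from
the description above; the paths and the doc variable are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

// 1. Index the file into the temp index (no compound files here).
IndexWriter tempWriter = new IndexWriter("/idx/temp", new StandardAnalyzer(), true);
tempWriter.addDocument(doc);  // doc built elsewhere
tempWriter.close();

// 2. Merge the temp index into the real index.
IndexReader tempReader = IndexReader.open("/idx/temp");
IndexWriter prodWriter = new IndexWriter("/idx/prod", new StandardAnalyzer(), false);
prodWriter.setUseCompoundFile(true);
prodWriter.addIndexes(new IndexReader[] { tempReader });
prodWriter.optimize();  // last thing before closing, as described
prodWriter.close();
tempReader.close();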

I would really appreciate any ideas to get the index size smaller if it is at all 
possible.

Thanks
Rob

Re: Slightly off topic, I need to have luke use my Analyzer

2004-07-22 Thread Rob Jose
Thanks Kannan

Rob
- Original Message - 
From: Chellappa, Kannan [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, July 21, 2004 12:19 PM
Subject: RE: Slightly off topic, I need to have luke use my Analyzer


Sorry, typo in the version date in my previous mail -- I meant Luke v0.5
(2004-06-25)

-Original Message-
From: Chellappa, Kannan
Sent: Wednesday, July 21, 2004 12:16 PM
To: Lucene Users List
Subject: RE: Slightly off topic, I need to have luke use my Analyzer


Worked for me.
I added my jar to the classpath and my analyzer appeared in the analyzers
list in the search tab as well as in the analyzers list in the plugins tab.

I am using Luke v 0.5 (2004-05-25)

Kannan
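
(For reference, that amounts to launching Luke with both jars on the
classpath, e.g. java -cp luke.jar;my-analyzer.jar org.getopt.luke.Luke on
Windows, using : instead of ; on Unix. The jar names here are placeholders,
and the main class name may differ between Luke versions.)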


-Original Message-
From: Rob Jose [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 21, 2004 11:37 AM
To: Lucene Users List
Subject: Slightly off topic, I need to have luke use my Analyzer


Sorry for the slightly off-topic post, but I have a need to use Luke with my
Analyzer.  Has anyone done this?  I have added a jar file to my classpath,
but that didn't help.

Thanks in advance
Rob


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Slightly off topic, I need to have luke use my Analyzer

2004-07-21 Thread Rob Jose
Sorry for the slightly off-topic post, but I have a need to use Luke with my
Analyzer.  Has anyone done this?  I have added a jar file to my classpath,
but that didn't help.

Thanks in advance
Rob


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Doing a join?

2004-04-22 Thread Rob Jose
Is it possible to do a join on two fields when searching a Lucene index?
For example, I have an index of documents that have a StudentName and a
StudentId field, and other documents that have ClassId, ClassName, and
StudentId fields.  I want to do a search on ClassId or ClassName and get a
list of StudentNames.  Both kinds of documents are in one index, but they
are loaded from separate files, so I can't join at creation time.  Any help
is greatly appreciated.

Rob
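
Lucene has no built-in join, but a two-pass lookup can emulate one. Below is
a minimal sketch using the field names from the question; everything else
(the index path, the query value, analyzer details) is assumed. Pass 1
collects the StudentIds for the matching class; pass 2 fetches the student
documents:

import java.util.*;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

IndexSearcher searcher = new IndexSearcher("/path/to/index");

// Pass 1: class documents matching the class name -> collect StudentIds.
Hits classHits = searcher.search(new TermQuery(new Term("ClassName", "biology")));
Set studentIds = new HashSet();
for (int i = 0; i < classHits.length(); i++) {
    studentIds.add(classHits.doc(i).get("StudentId"));
}

// Pass 2: one query per collected id to pull the student documents.
List studentNames = new ArrayList();
for (Iterator it = studentIds.iterator(); it.hasNext();) {
    Hits studentHits = searcher.search(
        new TermQuery(new Term("StudentId", (String) it.next())));
    for (int j = 0; j < studentHits.length(); j++) {
        String name = studentHits.doc(j).get("StudentName");
        if (name != null) {  // class documents match too but store no StudentName
            studentNames.add(name);
        }
    }
}
searcher.close();

This assumes StudentId and StudentName are stored fields (e.g. Field.Keyword),
so they can be read back from the hits.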


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]