Re: Speed up indexing?

2005-07-21 Thread Andrzej Bialecki

Matthias Jaekle wrote:

050721 071234 * Optimizing index...
... this takes a long time ...



Hello,

optimizing the index takes extremely long.
I have the feeling this was much faster in earlier versions.

I am currently trying to index a segment of 7,000,000 pages.
It has been running for 10 days now.
Processing started at around 280 pages/second, then fell over the next 
3 days to 4 pages/second.


It has now been optimizing, or maybe writing the index, for 6 days, and 
I have no idea how long it will take.


In the index dir I now have around 1,000,000 files, totalling around 15 GB.

Any ideas how to figure out how long this process will run?

How could I speed up this process?

Is the system normally this slow, or have I misconfigured something?


This is related to how Lucene manages its indexes during incremental 
updates. You can change the default parameters in nutch-default.xml (or 
better, in nutch-site.xml) - look for indexer.mergeFactor, 
indexer.minMergeDocs and indexer.maxMergeDocs. If you know Lucene well, 
you could hack a bit and test a scenario where updates are batched in a 
RAMDirectory, and only above a certain threshold are they merged with 
the on-disk indexes.
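
Something along these lines (an untested sketch against the Lucene 1.4 
API - the batch threshold and merge settings below are made-up numbers 
you would have to tune for your heap and file descriptor limits):

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.store.RAMDirectory;

  public class BatchedIndexer {

    // Assumption: how many documents to buffer in RAM before
    // flushing them to the on-disk index in a single merge.
    private static final int BATCH_SIZE = 50000;

    public static void index(Iterator docs, String path) throws IOException {
      Directory fsDir = FSDirectory.getDirectory(path, true);
      IndexWriter fsWriter = new IndexWriter(fsDir, new StandardAnalyzer(), true);
      fsWriter.mergeFactor = 50;    // cf. indexer.mergeFactor
      fsWriter.minMergeDocs = 50;   // cf. indexer.minMergeDocs

      RAMDirectory ramDir = new RAMDirectory();
      IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);

      while (docs.hasNext()) {
        ramWriter.addDocument((Document) docs.next());
        if (ramWriter.docCount() >= BATCH_SIZE) {
          // Merge the whole RAM batch into the disk index at once,
          // instead of letting the disk writer churn through many
          // tiny on-disk segments.
          ramWriter.close();
          fsWriter.addIndexes(new Directory[] { ramDir });
          ramDir = new RAMDirectory();
          ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
        }
      }

      // Flush the last partial batch and close.
      ramWriter.close();
      fsWriter.addIndexes(new Directory[] { ramDir });
      fsWriter.close();
    }
  }

This way most of the small-segment merging happens in RAM, and the disk 
mostly sees large sequential writes, one merge per batch.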


A bit of background: the exact performance depends on a lot of factors, 
but the key ones are related to how well your disk subsystem copes with 
managing many small files. This depends on the filesystem and the raw 
disk I/O. Some filesystems don't like directories with millions of 
files, and incur a significant performance penalty. Some disk subsystems 
are good with bursty traffic (because of a large cache) but quite bad 
with sustained traffic.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Speed up indexing?

2005-07-21 Thread Matthias Jaekle

Hi Andrzej,

thanks for your response. I am not really familiar with the Lucene internals.

I am just running nutch with the default parameters on a Debian Sarge 
system with an ext3 filesystem, a maximum of 1024 open files, and 1 GB RAM.


So is ext3 a bad filesystem for millions of files?

I cannot change the filesystem at the moment, so I think I should 
change the parameters.


Which values would you suggest for
* indexer.mergeFactor?
* indexer.minMergeDocs?
* indexer.maxMergeDocs?
* indexer.termIndexInterval?

Many thanks for your support

Matthias







Re: [Nutch-general] Re: Speed up indexing?

2005-07-21 Thread ogjunk-nutch
Hi,

--- Andrzej Bialecki [EMAIL PROTECTED] wrote:

 Matthias Jaekle wrote:
  Hi Andrzej,
  
  thanks for your response. I am not really familiar with the Lucene 
  internals.
  
  I am just running nutch with the default parameters on a Debian Sarge 
  system with an ext3 filesystem, a maximum of 1024 open files, and 
  1 GB RAM.
  
  So is ext3 a bad filesystem for millions of files?
 
 AFAIK reiserfs comes out much better in benchmarks than ext3 mounted 
 with noatime, especially for small files.

Never used reiserfs, but I heard the same.

  I cannot change the filesystem at the moment, so I think I should 
  change the parameters.
  
  Which values would you suggest for
  * indexer.mergeFactor?
  * indexer.minMergeDocs?
  * indexer.maxMergeDocs?
  * indexer.termIndexInterval?

You probably don't want to touch indexer.termIndexInterval or
indexer.maxMergeDocs (which determines the max size of an individual
segment).
How high you can go with minMergeDocs is determined by your RAM/heap
size, and your maximum open file descriptor limit determines how high
you can go with mergeFactor.
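
Back-of-the-envelope, assuming the compound file format is off (the 
per-segment file count below is an assumption, not a measured number):

  // A non-compound Lucene segment is roughly 7 fixed files plus one
  // norms file per indexed field, and a merge can hold up to
  // mergeFactor + 1 segments open at the same time.
  int mergeFactor = 50;      // indexer.mergeFactor
  int indexedFields = 10;    // assumption: depends on your documents
  int filesPerSegment = 7 + indexedFields;
  int peakOpenFiles = (mergeFactor + 1) * filesPerSegment;  // ~870

So a mergeFactor of 50 would already put you close to your 1024 open 
file limit; raise it (ulimit -n) before going much higher.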

Otis


Simpy -- simpy.com -- tags, social bookmarks, personal search engine


Re: [Nutch-general] Re: Speed up indexing?

2005-07-21 Thread Matthias Jaekle

You probably don't want to touch indexer.termIndexInterval or
indexer.maxMergeDocs (which determines the max size of an individual
segment).

Why is maxMergeDocs 50 by default? Shouldn't this value be much higher?

I found out how to calculate the number of open files, but how can I 
calculate how much memory would be used?

And is there any way to calculate how many files nutch will create at 
its peak while indexing?


Many thanks

Matthias