Hi, I'd have to poke around the machine(s) to give you better guidance, but here is some initial feedback:
- mergeFactor of 1000 seems crazy. mergeFactor is probably not your problem. I'd go back to the default of 10.
- 256 MB for ramBufferSizeMB sounds OK.
- Pinging the DB won't tell you much about the DB server's performance - ssh to the machine and check its CPU load, memory usage, disk IO.

Other things to look into:
- Network as the bottleneck?
- Field analysis as the bottleneck?

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

----- Original Message ----
> From: Chantal Ackermann <chantal.ackerm...@btelligent.de>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Monday, August 3, 2009 12:32:12 PM
> Subject: Re: mergeFactor / indexing speed
>
> Hi all,
>
> I'm still struggling with the index performance. I've moved the indexer
> to a different machine, now, which is faster and less occupied.
>
> The new machine is a 64bit 8Gig-RAM RedHat. JDK1.6, Tomcat 6.0.18,
> running with those settings (and others):
> -server -Xms1G -Xmx7G
>
> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.
> It has been processing roughly 70k documents in half an hour, so far.
> Which means 1,5 hours at least for 200k - which is as fast/slow as
> before (on the less performant machine).
>
> The machine is not swapping. It is only using 13% of the memory.
> iostat gives me:
> iostat
> Linux 2.6.9-67.ELsmp 08/03/2009
>
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>            1.23    0.00    0.03    0.03   98.71
>
> Basically, it is doing very little? *scratch*
>
> The sourcing database is responding as fast as ever. (I checked that
> from my own machine, and did only a ping from the linux box to the db
> server.)
>
> Any help, any hint on where to look would be greatly appreciated.
>
>
> Thanks!
> Chantal
>
>
> Chantal Ackermann wrote:
> > Hi again!
> >
> > Thanks for the answer, Grant.
> >
> > > It could very well be the case that you aren't seeing any merges with
> > > only 20K docs.
> > > Ultimately, if you really want to, you can look in
> > > your data.dir and count the files. If you have indexed a lot and have
> > > an MF of 100 and haven't done an optimize, you will see a lot more
> > > index files.
> >
> > Do you mean that 20k is not representative enough to test those settings?
> > I've chosen the smaller data set so that the index can run completely
> > but doesn't take too long at the same time.
> > If it would be faster to begin with, I could use a larger data set, of
> > course. I still can't believe that 11 minutes is normal (I haven't
> > managed to make it run faster or slower than that, that duration is very
> > stable).
> >
> > It "feels kinda" slow to me...
> > Out of your experience - what would you expect as duration for an index
> > with:
> > - 21 fields, some using a text type with 6 filters
> > - database access using DataImportHandler with a query of (far) less
> > than 20ms
> > - 2 transformers
> >
> > If I knew that indexing time should be shorter than that, at least, I
> > would know that something is definitely wrong with what I am doing or
> > with the environment I am using.
> >
> > > Likely, but not guaranteed. Typically, larger merge factors are good
> > > for batch indexing, but a lot of that has changed with Lucene's new
> > > background merger, such that I don't know if it matters as much anymore.
> >
> > Ok. I also read some posting where it basically said that the default
> > parameters are ok. And one shouldn't mess around with them.
> >
> > The thing is that our current search setup uses Lucene directly, and the
> > indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The
> > fields are different, the complete setup is different. But it will be
> > hard to advertise a new implementation/setup where indexing is three
> > times slower - unless I can give some reasons why that is.
> >
> > The full index should be fairly fast because the backing data is updated
> > every few hours.
> > I want to put in place an incremental/partial update as
> > main process, but full indexing might have to be done at certain times
> > if data has changed completely, or the schema has to be changed/extended.
> >
> > > No, those are separate things. The ramBufferSizeMB (although, I like
> > > the thought of a "rum"BufferSizeMB too! ;-) ) controls how many docs
> > > Lucene holds in memory before it has to flush. MF controls how many
> > > segments are on disk
> >
> > alas! the rum. I had that typo on the commandline before. that's my
> > subconscious telling me what I should do when I get home, tonight...
> >
> > So, increasing ramBufferSize should lead to higher memory usage,
> > shouldn't it? I'm not seeing that. :-(
> >
> > I'll try once more with MF 10 and a higher rum... well, you know... ;-)
> >
> > Cheers,
> > Chantal
> >
> > Grant Ingersoll wrote:
> >> On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
> >>
> >>> Dear all,
> >>>
> >>> I want to find out which settings give the best full index
> >>> performance for my setup.
> >>> Therefore, I have been running a small index (less than 20k
> >>> documents) with a mergeFactor of 10 and 100.
> >>> In both cases, indexing took about 11.5 min:
> >>>
> >>> mergeFactor: 10
> >>> 0:11:46.792
> >>> mergeFactor: 100
> >>> /admin/cores?action=RELOAD
> >>> 0:11:44.441
> >>> Tomcat restart
> >>> 0:11:34.143
> >>>
> >>> This is a Tomcat 5.5.20, started with a max heap size of 1GB. But it
> >>> always used much less. No swapping (RedHat Linux 32bit, 3GB RAM, old
> >>> ATA disk).
> >>>
> >>>
> >>> Now, I have three questions:
> >>>
> >>> 1. How can I check which mergeFactor is really being used? The
> >>> solrconfig.xml that is displayed in the admin application is the
> >>> up-to-date view on the file system. I tested that. But it's not
> >>> necessarily what the current SOLR core is using, isn't it?
> >>> Is there a way to check on the actually used mergeFactor (while the
> >>> index is running)?
> >> It could very well be the case that you aren't seeing any merges with
> >> only 20K docs. Ultimately, if you really want to, you can look in
> >> your data.dir and count the files. If you have indexed a lot and have
> >> an MF of 100 and haven't done an optimize, you will see a lot more
> >> index files.
> >>
> >>
> >>> 2. I changed the mergeFactor in both available settings (default and
> >>> main index) in the solrconfig.xml file of the core I am reindexing.
> >>> That is the correct place? Should a change in performance be
> >>> noticeable when increasing from 10 to 100? Or is the change not
> >>> perceivable if the requests for data are taking far longer than all
> >>> the indexing itself?
> >> Likely, but not guaranteed. Typically, larger merge factors are good
> >> for batch indexing, but a lot of that has changed with Lucene's new
> >> background merger, such that I don't know if it matters as much anymore.
> >>
> >>
> >>> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor?
> >>> (Or some other setting?)
> >> No, those are separate things. The ramBufferSizeMB (although, I like
> >> the thought of a "rum"BufferSizeMB too! ;-) ) controls how many docs
> >> Lucene holds in memory before it has to flush. MF controls how many
> >> segments are on disk
> >>
> >>> (I am still trying to get profiling information on how much
> >>> application time is eaten up by db connection/requests/processing.
> >>> The root entity query is about (average) 20ms. The child entity
> >>> query is less than 10ms.
> >>> I have my custom entity processor running on the child entity that
> >>> populates the map using a multi-row result set. I have also attached
> >>> one regex and one script transformer.)
> >>>
> >>> Thank you for any tips!
> >>> Chantal
> >>>
> >>>
> >>>
> >>> --
> >>> Chantal Ackermann
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com/
> >>
> >> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
> >> using Solr/Lucene:
> >> http://www.lucidimagination.com/search
> >>
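
Grant's suggestion in the quoted thread (look in the data.dir and count the files) could be scripted roughly as follows. This is only a sketch: the function name count_index_files is made up for illustration, and it assumes the usual Lucene naming where per-segment files start with an underscore (for example _3.fdt or _3.cfs) while segments_N and segments.gen are index-wide files.

```python
import os
from collections import Counter


def count_index_files(index_dir):
    """Return (total_files, files_per_segment) for a Lucene index directory.

    Hypothetical helper, not part of Solr or Lucene.
    """
    per_segment = Counter()
    total = 0
    for name in os.listdir(index_dir):
        # Skip subdirectories; only plain files belong to the index.
        if not os.path.isfile(os.path.join(index_dir, name)):
            continue
        total += 1
        # Assumption: per-segment files look like "_3.fdt"; everything else
        # (segments_N, segments.gen) is counted in the total only.
        if name.startswith("_"):
            per_segment[name.split(".", 1)[0]] += 1
    return total, dict(per_segment)
```

Calling it on the index directory under the core's data.dir (the exact path depends on the installation) before and after a run with a different mergeFactor should show whether the number of segments on disk actually changes.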
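
To put the numbers quoted in this thread side by side, here is a small back-of-the-envelope calculation. It only uses figures from the messages above (70k documents in half an hour, 200k documents in total, just under 20k documents in about 11.5 minutes); the helper names are made up for illustration.

```python
def docs_per_second(docs, minutes):
    """Average indexing throughput in documents per second."""
    return docs / (minutes * 60.0)


def eta_hours(total_docs, rate):
    """Hours needed to index total_docs at the given docs/second rate."""
    return total_docs / rate / 3600.0


# New machine: roughly 70k documents in half an hour.
rate_new = docs_per_second(70000, 30)
eta_full = eta_hours(200000, rate_new)

# Small test index: treated as 20k documents in 11.5 minutes
# (the thread says "less than 20k", so this is an upper bound on the rate).
rate_small = docs_per_second(20000, 11.5)
ms_per_doc = 1000.0 / rate_small

print("new machine: %.1f docs/s, 200k docs in %.2f h" % (rate_new, eta_full))
print("small index: %.1f docs/s, %.1f ms per doc" % (rate_small, ms_per_doc))
```

The small index works out to roughly 34.5 ms per document. If the quoted query times (about 20 ms for the root entity, under 10 ms for the child entity) were paid once per document, they would account for most of that; this is an assumption, since the thread does not show the DataImportHandler configuration, but it fits the near-idle CPU in the iostat output and is worth checking before tuning mergeFactor further.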