Re: mergeFactor / indexing speed
> And - indexing 160k documents now takes 5 min instead of 1.5 h!

Awesome! It works for all!

> (Now I can go relaxed on vacation. :-D )

Take me along!

Cheers
Avlesh

On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann <chantal.ackerm...@btelligent.de> wrote:
> Juhu, great news, guys. I merged my child entity into the root entity and changed the custom EntityProcessor to handle the additional columns correctly. And - indexing 160k documents now takes 5 min instead of 1.5 h! (Now I can go relaxed on vacation. :-D )
>
> Conclusion: in my case, performance was so bad because of constantly querying a database on a different machine (network traffic + a DB query per document).
>
> Thanks for all your help!
> Chantal
>
> Avlesh Singh schrieb:
> >> does DIH call commit periodically, or are things done in one big batch?
> > AFAIK, one big batch.
>
> Yes. There is no index available once the full-import has started (assuming the searcher has no cache; otherwise it still reads from that). No data is visible (e.g. in the Admin/Luke frontend) until the import has finished correctly.
Re: mergeFactor / indexing speed
Juhu, great news, guys.

I merged my child entity into the root entity and changed the custom EntityProcessor to handle the additional columns correctly. And - indexing 160k documents now takes 5 min instead of 1.5 h! (Now I can go relaxed on vacation. :-D )

Conclusion: in my case, performance was so bad because of constantly querying a database on a different machine (network traffic + a DB query per document).

Thanks for all your help!
Chantal

Avlesh Singh schrieb:
>> does DIH call commit periodically, or are things done in one big batch?
> AFAIK, one big batch.

Yes. There is no index available once the full-import has started (assuming the searcher has no cache; otherwise it still reads from that). No data is visible (e.g. in the Admin/Luke frontend) until the import has finished correctly.
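For readers landing on this thread later: "merging the child entity into the root entity" in DIH usually means replacing the nested entity with a join in the root query, so that several rows per document arrive from one statement instead of one extra query per document. A rough sketch of what that could look like in data-config.xml - the table, column, and field names below are invented for illustration and are not the actual configuration from this thread:

    <document>
      <entity name="doc" processor="EpgValueEntityProcessor"
              query="SELECT d.id, d.title, c.name, c.value
                     FROM documents d
                     LEFT JOIN child_values c ON c.doc_id = d.id
                     ORDER BY d.id">
        <field column="id" name="id"/>
        <field column="title" name="title"/>
      </entity>
    </document>

The ORDER BY keeps the rows belonging to one document contiguous, which is what allows a custom processor to aggregate them.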
Re: mergeFactor / indexing speed
On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann <chantal.ackerm...@btelligent.de> wrote:
> Juhu, great news, guys. I merged my child entity into the root entity and changed the custom EntityProcessor to handle the additional columns correctly. And - indexing 160k documents now takes 5 min instead of 1.5 h!

I'm a little late to the party, but you may also want to look at CachedSqlEntityProcessor.

--
Regards,
Shalin Shekhar Mangar.
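For context: CachedSqlEntityProcessor lets a nested entity run its query once and serve subsequent lookups from an in-memory cache, instead of issuing one SQL query per parent row. A rough sketch of how it is wired up in data-config.xml - entity and column names here are made up for illustration:

    <entity name="doc" query="SELECT id, title FROM documents">
      <entity name="child" processor="CachedSqlEntityProcessor"
              query="SELECT doc_id, name, value FROM child_values"
              where="doc_id=doc.id">
        <field column="name" name="name"/>
        <field column="value" name="value"/>
      </entity>
    </entity>

Whether this helps depends on whether the child table fits comfortably in memory; for very large child tables the join-in-the-root-query approach above may still be preferable.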
Re: mergeFactor / indexing speed
Thanks for the tip, Shalin. I'm happy with 6 indexes running in parallel and completing in less than 10 min right now, but I'll have a look anyway.

Shalin Shekhar Mangar schrieb:
> On Fri, Aug 7, 2009 at 3:58 PM, Chantal Ackermann <chantal.ackerm...@btelligent.de> wrote:
>> Juhu, great news, guys. I merged my child entity into the root entity and changed the custom EntityProcessor to handle the additional columns correctly. And - indexing 160k documents now takes 5 min instead of 1.5 h!
>
> I'm a little late to the party but you may also want to look at CachedSqlEntityProcessor.
>
> --
> Regards,
> Shalin Shekhar Mangar.
Re: mergeFactor / indexing speed
Hi all,

to keep this thread up to date... ;-)

d) JDBC batch size changed to 10. (Was default: 500, then 1000.)

The problem with my DIH setup is that the root entity query returns a huge set (all ids that shall be indexed). A larger fetch size would be good for that query. The nested entity, however, only ever returns up to 9 rows. The constraints are so strict (by id) that there is no way any additional data could be pre-fetched. (Actually, shouldn't anyone using DIH with nested entities run into that problem?) After changing to 10, I cannot see that this low batch size slowed the indexer down (significantly).

As I would like to stick with DIH (instead of dumping the data into CSV and importing that), here is my question: do you think it's possible to return (in the nested entity) rows independent of the unique id, and let the processor decide when a document is complete? The examples in the wiki always use an ID to get the data for the nested entity, so I'm not sure it was planned with that in mind. But as I'm already handling multiple DB rows for one document, it might not be too difficult to handle the unique id correctly as well? Of course, I would need something like a look-ahead to know whether the next row is already part of the next document.

Cheers,
Chantal

Concerning the other settings (just FYI):

a) mergeFactor: 10 (also tried 100). I don't think that changed anything for the worse, rather for the better. So I'll stick with 10 from now on.

b) ramBufferSizeMB: tried 512, 1024. RAM usage went up when I increased from 256 to 512. Not sure about 1024. I'll stick with 512.
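The look-ahead described above - streaming id-ordered rows and deciding when one document ends and the next begins - can be prototyped independently of DIH as a simple one-row-buffer grouper. The sketch below only illustrates that pattern; it is not tied to the DIH EntityProcessor API, and the class name and the "id" column are invented for the example:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    /** Groups consecutive rows that share the same "id" column into one document map. */
    public class RowGrouper {
        private final Iterator<Map<String, Object>> rows; // id-ordered rows, e.g. from a JDBC result set
        private Map<String, Object> lookAhead;             // one-row buffer used as the look-ahead

        public RowGrouper(Iterator<Map<String, Object>> rows) {
            this.rows = rows;
            this.lookAhead = rows.hasNext() ? rows.next() : null;
        }

        /** Returns the next complete document, or null when the rows are exhausted. */
        public Map<String, Object> nextDocument() {
            if (lookAhead == null) return null;
            Object id = lookAhead.get("id");
            Map<String, Object> doc = new LinkedHashMap<String, Object>();
            while (lookAhead != null && id.equals(lookAhead.get("id"))) {
                merge(doc, lookAhead);
                lookAhead = rows.hasNext() ? rows.next() : null; // peek at the next row
            }
            return doc; // the buffered row (if any) already belongs to the next document
        }

        /** Copies columns into the document, promoting repeated columns to lists. */
        @SuppressWarnings("unchecked")
        private void merge(Map<String, Object> doc, Map<String, Object> row) {
            for (Map.Entry<String, Object> e : row.entrySet()) {
                Object existing = doc.get(e.getKey());
                if (existing == null) {
                    doc.put(e.getKey(), e.getValue());
                } else if (existing.equals(e.getValue())) {
                    // identical value repeated on every row (e.g. the id column) - keep it single-valued
                } else if (existing instanceof List) {
                    ((List<Object>) existing).add(e.getValue());
                } else {
                    List<Object> values = new ArrayList<Object>();
                    values.add(existing);
                    values.add(e.getValue());
                    doc.put(e.getKey(), values);
                }
            }
        }
    }

Inside a custom DIH EntityProcessor the same idea would presumably live in nextRow(): keep one row buffered, and only hand the aggregated document to Solr once the buffered row carries a different id.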
Re: mergeFactor / indexing speed
On Mon, Aug 3, 2009 at 12:32 PM, Chantal Ackermann <chantal.ackerm...@btelligent.de> wrote:
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>            1.23    0.00    0.03    0.03   98.71
>
> Basically, it is doing very little? *scratch*

How often is commit being called? (A Lucene commit syncs all of the index files so a crash won't result in a corrupted index... this can be costly.)

Guys - does DIH call commit periodically, or are things done in one big batch?

Chantal - is autocommit configured in solrconfig.xml?

-Yonik
http://www.lucidimagination.com
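For reference, autocommit lives in the <updateHandler> section of solrconfig.xml; if a block like the one below is present and uncommented, Solr commits on its own during the import. The values shown are arbitrary examples, not a recommendation for this particular setup:

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>10000</maxDocs>   <!-- commit after this many buffered documents -->
        <maxTime>60000</maxTime>   <!-- or after this many milliseconds -->
      </autoCommit>
    </updateHandler>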
Re: mergeFactor / indexing speed
> does DIH call commit periodically, or are things done in one big batch?

AFAIK, one big batch.

Cheers
Avlesh

On Thu, Aug 6, 2009 at 11:23 PM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> On Mon, Aug 3, 2009 at 12:32 PM, Chantal Ackermann <chantal.ackerm...@btelligent.de> wrote:
>> avg-cpu:  %user   %nice    %sys %iowait   %idle
>>            1.23    0.00    0.03    0.03   98.71
>>
>> Basically, it is doing very little? *scratch*
>
> How often is commit being called? (A Lucene commit syncs all of the index files so a crash won't result in a corrupted index... this can be costly.)
>
> Guys - does DIH call commit periodically, or are things done in one big batch?
>
> Chantal - is autocommit configured in solrconfig.xml?
>
> -Yonik
> http://www.lucidimagination.com
Re: mergeFactor / indexing speed
Hi Avlesh, hi Otis, hi Grant, hi all,

(enumerating to keep track of all the input)

a) mergeFactor 1000 too high:
I'll change that back to 10. I thought it would make Lucene use more RAM before starting IO.

b) ramBufferSize: OK, or maybe more. I'll keep that in mind.

c) solrconfig.xml - default and main index:
I've always changed both sections, the default and the main index one.

d) JDBC batch size: I haven't set it. I'll do that.

e) DB server performance:
I agree, ping is definitely not much information. I also ran queries against it from my own computer (while the indexer ran), and they came back as fast as usual. Currently I don't have a login to ssh to that machine, but I'm going to try to get one.

f) Network: I'll definitely need to have a look at that once I have access to the DB machine.

g) the data

g.1) nested entity in the DIH config:
There is only the root and one nested entity. However, that nested entity returns multiple rows (about 10) per query. (The fetched-rows count is about 10 times the number of processed documents.)

g.2) my custom EntityProcessor (the code is pasted at the very end of this e-mail):
- iterates over those multiple rows,
- uses one column to create a key in a map,
- uses two other columns to create the corresponding value (string concatenation),
- if a key already exists, it gets the value; if that value is a list, it adds the new value to that list; if it's not a list, it creates one and adds the old and the new value to it.
I refrained from adding any business logic to that processor. It treats all rows alike, no matter whether they hold values that can appear multiple times or values that must appear only once.

g.3) the two transformers
- to split one value into two (regex):

<field column="person" />
<field column="participant" sourceColName="person" regex="([^\|]+)\|.*" />
<field column="role" sourceColName="person" regex="[^\|]+\|\d+,\d+,\d+,(.*)" />

- to extract a number from an existing number (bit calculation using the script transformer). As that one works on a field that is potentially multiValued, it needs to take care of creating and populating a list as well:

<field column="cat" name="cat" />
<script><![CDATA[
function getMainCategory(row) {
    var cat = row.get('cat');
    var mainCat;
    if (cat != null) {
        // check whether cat is an array
        if (cat instanceof java.util.List) {
            var arr = new java.util.ArrayList();
            for (var i = 0; i < cat.size(); i++) {
                mainCat = new java.lang.Integer(cat.get(i) >> 8);
                if (!arr.contains(mainCat)) {
                    arr.add(mainCat);
                }
            }
            row.put('maincat', arr);
        } else {
            // it is a single value
            var mainCat = new java.lang.Integer(cat >> 8);
            row.put('maincat', mainCat);
        }
    }
    return row;
}
]]></script>

(The EpgValueEntityProcessor decides on creating lists on a case-by-case basis: only if a value is specified multiple times for a certain data set does it create a list. This is because I didn't want to put any complex configuration or business logic into it.)

g.4) fields
The DIH extracts 5 fields from the root entity and 11 fields from the nested entity, and the transformers may create 3 additional (multiValued) ones.

schema.xml defines 21 fields (two additional fields: the timestamp field (default=NOW) and a field collecting three other text fields for default search (using copyField)):
- 2 long
- 3 integer
- 3 sint
- 3 date
- 6 text_cs (class="solr.TextField" positionIncrementGap="100"):

<analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" />
</analyzer>

- 4 text_de (one is the field populated by copying from the 3 others):

<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LengthFilterFactory" min="2" max="5000" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.SnowballPorterFilterFactory" language="German" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>

Thank you for taking your time!

Cheers,
Chantal

** EpgValueEntityProcessor.java ***

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Logger;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class
Re: mergeFactor / indexing speed
Hi all,

I'm still struggling with the index performance. I've moved the indexer to a different machine now, which is faster and less busy. The new machine is a 64-bit, 8 GB RAM RedHat box, JDK 1.6, Tomcat 6.0.18, running with these settings (among others): -server -Xms1G -Xmx7G

Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB. It has been processing roughly 70k documents in half an hour so far, which means at least 1.5 hours for 200k - as fast/slow as before (on the less performant machine).

The machine is not swapping. It is only using 13% of the memory. iostat gives me:

iostat
Linux 2.6.9-67.ELsmp    08/03/2009
avg-cpu:  %user   %nice    %sys %iowait   %idle
           1.23    0.00    0.03    0.03   98.71

Basically, it is doing very little? *scratch*

The sourcing database is responding as fast as ever. (I checked that from my own machine, and did only a ping from the Linux box to the DB server.)

Any help, any hint on where to look would be greatly appreciated.

Thanks!
Chantal
Re: mergeFactor / indexing speed
> avg-cpu:  %user   %nice    %sys %iowait   %idle
>            1.23    0.00    0.03    0.03   98.71

I agree, real bad statistics, actually.

> Currently, I've set mergeFactor to 1000 and ramBufferSize to 256MB.

To me the former appears to be too high and the latter too low (for your machine configuration). You can safely increase the ramBufferSize (or maxBufferedDocs) to a higher value.

Couple of things -
1. The stock solrconfig.xml comes with two sections, indexDefaults and mainIndex. Options in the latter override the former. Just make sure that you have the right values in the right place.
2. Do you have too many nested entities inside the DIH's data-config? If yes, a database-level optimization (creating views, in-memory tables, ...) might hold the answer.
3. Have you tried playing around with the JDBC parameters in the data source? Setting the batchSize property to a considerable value might help.

Cheers
Avlesh
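To illustrate Avlesh's third point: the JDBC batchSize is set on the dataSource element in data-config.xml. A rough sketch, with the driver and connection details obviously made up for the example:

    <dataSource type="JdbcDataSource"
                driver="oracle.jdbc.OracleDriver"
                url="jdbc:oracle:thin:@dbhost:1521:db"
                user="user" password="password"
                batchSize="1000"/>

batchSize controls how many rows the JDBC driver fetches per round trip, so it mostly matters for queries that return many rows (like the root entity query here).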
Re: mergeFactor / indexing speed
Hi,

I'd have to poke around the machine(s) to give you better guidance, but here is some initial feedback:

- mergeFactor of 1000 seems crazy. mergeFactor is probably not your problem. I'd go back to the default of 10.
- 256 MB for ramBufferSizeMB sounds OK.
- Pinging the DB won't tell you much about the DB server's performance - ssh to the machine and check its CPU load, memory usage, and disk IO.

Other things to look into:
- Network as the bottleneck?
- Field analysis as the bottleneck?

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
Re: mergeFactor / indexing speed
How big are your documents?

I haven't benchmarked DIH, so I am not sure what to expect, but it does seem like something isn't right. Can you fully describe how you are indexing? Have you done any profiling?
mergeFactor / indexing speed
Dear all,

I want to find out which settings give the best full index performance for my setup. Therefore, I have been running a small index (less than 20k documents) with a mergeFactor of 10 and 100. In both cases, indexing took about 11.5 min:

mergeFactor: 10
<str name="Time taken">0:11:46.792</str>

mergeFactor: 100 (after /admin/cores?action=RELOAD)
<str name="Time taken">0:11:44.441</str>

after Tomcat restart
<str name="Time taken">0:11:34.143</str>

This is a Tomcat 5.5.20, started with a max heap size of 1GB, but it always used much less. No swapping (RedHat Linux 32-bit, 3GB RAM, old ATA disk).

Now, I have three questions:

1. How can I check which mergeFactor is really being used? The solrconfig.xml that is displayed in the admin application is the up-to-date view on the file system - I tested that. But it's not necessarily what the current Solr core is using, is it? Is there a way to check the actually used mergeFactor (while the index is running)?

2. I changed the mergeFactor in both available settings (default and main index) in the solrconfig.xml file of the core I am reindexing. Is that the correct place? Should a change in performance be noticeable when increasing from 10 to 100? Or is the change not perceivable if the requests for data take far longer than the indexing itself?

3. Do I have to increase rumBufferSizeMB if I increase mergeFactor? (Or some other setting?)

(I am still trying to get profiling information on how much application time is eaten up by DB connection/requests/processing. The root entity query takes about 20ms on average. The child entity query is less than 10ms. I have my custom entity processor running on the child entity; it populates the map using a multi-row result set. I have also attached one regex and one script transformer.)

Thank you for any tips!
Chantal

--
Chantal Ackermann
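For orientation on questions 2 and 3: in a Solr 1.3/1.4-era solrconfig.xml the two sections in question look roughly like the sketch below, and values in mainIndex win over indexDefaults. The numbers shown are just the usual defaults, not a recommendation for this setup:

    <indexDefaults>
      <mergeFactor>10</mergeFactor>
      <ramBufferSizeMB>32</ramBufferSizeMB>
      <maxFieldLength>10000</maxFieldLength>
    </indexDefaults>

    <mainIndex>
      <!-- settings here override indexDefaults for the main index -->
      <mergeFactor>10</mergeFactor>
      <ramBufferSizeMB>32</ramBufferSizeMB>
    </mainIndex>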
Re: mergeFactor / indexing speed
On Jul 31, 2009, at 8:04 AM, Chantal Ackermann wrote:
> Dear all,
> I want to find out which settings give the best full index performance for my setup. Therefore, I have been running a small index (less than 20k documents) with a mergeFactor of 10 and 100. In both cases, indexing took about 11.5 min:
>
> mergeFactor: 10
> <str name="Time taken">0:11:46.792</str>
> mergeFactor: 100 (after /admin/cores?action=RELOAD)
> <str name="Time taken">0:11:44.441</str>
> after Tomcat restart
> <str name="Time taken">0:11:34.143</str>
>
> This is a Tomcat 5.5.20, started with a max heap size of 1GB, but it always used much less. No swapping (RedHat Linux 32-bit, 3GB RAM, old ATA disk).
>
> Now, I have three questions:
>
> 1. How can I check which mergeFactor is really being used? The solrconfig.xml that is displayed in the admin application is the up-to-date view on the file system - I tested that. But it's not necessarily what the current Solr core is using, is it? Is there a way to check the actually used mergeFactor (while the index is running)?

It could very well be the case that you aren't seeing any merges with only 20K docs. Ultimately, if you really want to, you can look in your data.dir and count the files. If you have indexed a lot and have an MF of 100 and haven't done an optimize, you will see a lot more index files.

> 2. I changed the mergeFactor in both available settings (default and main index) in the solrconfig.xml file of the core I am reindexing. Is that the correct place? Should a change in performance be noticeable when increasing from 10 to 100? Or is the change not perceivable if the requests for data take far longer than the indexing itself?

Likely, but not guaranteed. Typically, larger merge factors are good for batch indexing, but a lot of that has changed with Lucene's new background merger, such that I don't know if it matters as much anymore.

> 3. Do I have to increase rumBufferSizeMB if I increase mergeFactor? (Or some other setting?)

No, those are separate things. The ramBufferSizeMB (although I like the thought of a rumBufferSizeMB too! ;-) ) controls how many docs Lucene holds in memory before it has to flush. MF controls how many segments are on disk.

> (I am still trying to get profiling information on how much application time is eaten up by DB connection/requests/processing. The root entity query takes about 20ms on average. The child entity query is less than 10ms. I have my custom entity processor running on the child entity; it populates the map using a multi-row result set. I have also attached one regex and one script transformer.)
>
> Thank you for any tips!
> Chantal
>
> --
> Chantal Ackermann

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
Re: mergeFactor / indexing speed
Hi again!

Thanks for the answer, Grant.

> It could very well be the case that you aren't seeing any merges with only 20K docs. Ultimately, if you really want to, you can look in your data.dir and count the files. If you have indexed a lot and have an MF of 100 and haven't done an optimize, you will see a lot more index files.

Do you mean that 20k is not representative enough to test those settings? I've chosen the smaller data set so that the index can run completely but doesn't take too long at the same time. If it were faster to begin with, I could use a larger data set, of course.

I still can't believe that 11 minutes is normal (I haven't managed to make it run faster or slower than that; that duration is very stable). It feels kinda slow to me... Out of your experience - what would you expect as duration for an index with:
- 21 fields, some using a text type with 6 filters,
- database access using DataImportHandler with a query of (far) less than 20ms,
- 2 transformers?

If I knew that indexing time should be shorter than that, at least, I would know that something is definitely wrong with what I am doing or with the environment I am using.

> Likely, but not guaranteed. Typically, larger merge factors are good for batch indexing, but a lot of that has changed with Lucene's new background merger, such that I don't know if it matters as much anymore.

Ok. I also read some postings where it basically said that the default parameters are OK and one shouldn't mess around with them. The thing is that our current search setup uses Lucene directly, and the indexer takes less than an hour (MF: 500, maxBufferedDocs: 7500). The fields are different, the complete setup is different. But it will be hard to advertise a new implementation/setup where indexing is three times slower - unless I can give some reasons why that is. The full index should be fairly fast because the backing data is updated every few hours. I want to put in place an incremental/partial update as the main process, but full indexing might have to be done at certain times if data has changed completely, or the schema has to be changed/extended.

> No, those are separate things. The ramBufferSizeMB (although I like the thought of a rumBufferSizeMB too! ;-) ) controls how many docs Lucene holds in memory before it has to flush. MF controls how many segments are on disk.

alas! the rum. I had that typo on the command line before. That's my subconscious telling me what I should do when I get home tonight...

So, increasing ramBufferSize should lead to higher memory usage, shouldn't it? I'm not seeing that. :-(

I'll try once more with MF 10 and a higher rum... well, you know... ;-)

Cheers,
Chantal