Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
Doing a standard commit after every document is a Solr anti-pattern. commitWithin is a “near-realtime” commit in recent versions of Solr and not a standard commit.

https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching

- Mark
http://about.me/markrmiller

On Feb 12, 2014, at 9:52 AM, Pisarev, Vitaliy vitaliy.pisa...@hp.com wrote:

I am running a very simple performance experiment where I post 2000 documents to my application, which in turn persists them to a relational DB and sends them to Solr for indexing (synchronously, in the same request). I am testing 3 use cases:

1. No indexing at all - ~45 sec to post 2000 documents
2. Indexing included - commit after each add. ~8 minutes (!) to post and index 2000 documents
3. Indexing included - commitWithin 1ms. ~55 seconds (!) to post and index 2000 documents

The 3rd result does not make any sense; I would expect the behavior to be similar to the one in point 2. At first I thought that the documents were not really committed, but I could actually see them being added by executing some queries during the experiment (via the Solr web UI). I am worried that I am missing something very big.

The code I use for point 2:

    SolrInputDocument doc = // get doc
    SolrServer solrConnection = // get connection
    solrConnection.add(doc);
    solrConnection.commit();

Whereas the code for point 3:

    SolrInputDocument doc = // get doc
    SolrServer solrConnection = // get connection
    solrConnection.add(doc, 1); // According to API documentation I understand there is no need to explicitly call commit with this API

Is it possible that committing after each add will degrade performance by a factor of 40?
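The "factor of 40" in the question can be checked with a little arithmetic on the three reported timings. The sketch below is only a back-of-the-envelope calculation; it assumes "~8 minutes" means 480 seconds and treats the 45-second no-indexing run as the baseline. The class and method names are made up for illustration.

```java
// Back-of-the-envelope check of the timings reported in the question.
final class CommitOverhead {
    /** Extra milliseconds per document that indexing adds on top of the no-indexing baseline. */
    static double overheadMsPerDoc(double totalSeconds, double baselineSeconds, int docs) {
        return (totalSeconds - baselineSeconds) * 1000.0 / docs;
    }

    public static void main(String[] args) {
        int docs = 2000;
        double baseline = 45.0;        // case 1: no indexing
        double commitEachAdd = 480.0;  // case 2: "~8 minutes", assumed to be 480 s
        double commitWithin = 55.0;    // case 3: commitWithin 1 ms

        double hard = overheadMsPerDoc(commitEachAdd, baseline, docs);
        double soft = overheadMsPerDoc(commitWithin, baseline, docs);
        System.out.printf("commit after each add: %.1f ms/doc%n", hard);
        System.out.printf("commitWithin 1 ms:     %.1f ms/doc%n", soft);
        System.out.printf("ratio of overheads:    %.1f%n", hard / soft);
    }
}
```

Under those assumptions the indexing overhead is about 217.5 ms per document with a commit after each add and about 5 ms per document with commitWithin, a ratio of roughly 43. The end-to-end times differ by less than a factor of 9, so the "factor of 40" applies to the indexing overhead alone.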
Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
Yes, committing after each document will greatly degrade performance. I typically use autoCommit and autoSoftCommit to set the time interval between commits, but commitWithin should have a similar effect. I often see indexing rates of 2000+ docs per second when loading with auto commits. When you explicitly commit after each document, your commits happen too frequently, overworking the indexing process.

Joel Bernstein
Search Engineer at Heliosearch
RE: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
I absolutely agree, and I even read the NRT page before posting this question. The thing that baffles me is this: doing a commit after each add kills the performance. On the other hand, when I use commitWithin and specify an (absurd) 1 ms delay, I expect the behavior to be equivalent to making a commit, from a functional perspective.

Seeing that there is no magic in the world, I am trying to understand what price I am actually paying when using the commitWithin feature. On the one hand it commits almost immediately; on the other hand, it performs wonderfully. Where is the catch?

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Wednesday, February 12, 2014 17:00
To: solr-user
Subject: Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something

Doing a standard commit after every document is a Solr anti-pattern. commitWithin is a “near-realtime” commit in recent versions of Solr and not a standard commit.

https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching
Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
Cross-posting my answer from SO:

According to this wiki: https://wiki.apache.org/solr/NearRealtimeSearch commitWithin is a soft commit by default. Soft commits are very efficient at making the added documents immediately searchable. But! They are not on disk yet: the documents are committed into RAM. In this setup you would use updateLog to make the Solr instance crash tolerant.

What you do in point 2 is a hard commit, i.e. flushing the added documents to disk. Doing this after each document add is very expensive. So instead, post a bunch of documents and then issue a hard commit, or even have your autoCommit set to some reasonable value, like 10 min or 1 hour (depending on your user expectations).

--
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: twitter.com/dmitrykan
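The advice to post a bunch of documents and then issue one hard commit could look roughly like the sketch below. It does not use SolrJ: `Indexer` is a hypothetical stand-in for the `add` and `commit` calls shown in the question, and `CountingIndexer` only counts calls so the effect of batching is visible. The batch size of 500 is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchedCommitSketch {
    /** Hypothetical stand-in for the two SolrServer calls used in the thread. */
    interface Indexer {
        void add(String doc);
        void commit();
    }

    /** Test double that only counts calls; a real one would delegate to Solr. */
    static final class CountingIndexer implements Indexer {
        int adds = 0;
        int commits = 0;
        @Override public void add(String doc) { adds++; }
        @Override public void commit() { commits++; }
    }

    /** Adds every document, committing once per full batch and once for any remainder. */
    static void indexInBatches(Indexer indexer, List<String> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int pending = 0;
        for (String doc : docs) {
            indexer.add(doc);
            if (++pending == batchSize) {
                indexer.commit();
                pending = 0;
            }
        }
        if (pending > 0) {
            indexer.commit(); // flush the final partial batch
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            docs.add("doc-" + i);
        }
        CountingIndexer indexer = new CountingIndexer();
        indexInBatches(indexer, docs, 500);
        // prints "2000 adds, 4 commits"
        System.out.println(indexer.adds + " adds, " + indexer.commits + " commits");
    }
}
```

With a batch size of 1 this degenerates into the commit-after-each-add pattern from point 2 (2000 commits for 2000 documents), which is the case the thread warns against.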
Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
Here's some additional background that may shed light on the performance:

http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best,
Erick
Re: Solr performance with commitWithin seems too good to be true. I am afraid I am missing something
The explicit commit will cause your app to be delayed until that commit completes, and then Solr is idle until the response makes its way back to your app and you submit another request that finds its way to Solr, maybe a few ms later. That includes network latency. That interval could well be more than enough for the short-interval autoCommit or commitWithin to run in the background, in parallel with the response returning to your app and your app submitting the subsequent request.

The magic of asynchronous operation in a parallel and distributed computing environment, coupled with multi-core processors and parallel threads.

-- Jack Krupansky
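The difference between a commit that blocks the client and one that runs in the background can be illustrated with a toy timing model. This is not a measurement of Solr and not how Solr schedules commits internally; the per-request time (25 ms) and per-commit time (5 ms) are invented numbers, and the model assumes a single background worker that handles commits one at a time.

```java
// Toy virtual-time model: blocking commits versus commits on a background worker.
final class BlockingVsBackgroundCommit {
    /** Client waits for every commit before sending the next add. */
    static long blockingTotalMs(int docs, long addMs, long commitMs) {
        return docs * (addMs + commitMs);
    }

    /**
     * Commits run on one background worker while the client keeps sending adds.
     * Each commit starts when its add finishes or when the previous commit ends,
     * whichever is later.
     */
    static long backgroundTotalMs(int docs, long addMs, long commitMs) {
        long clientClock = 0;   // time at which the client finishes its latest add
        long workerFreeAt = 0;  // time at which the background worker is idle again
        for (int i = 0; i < docs; i++) {
            clientClock += addMs;
            long start = Math.max(clientClock, workerFreeAt);
            workerFreeAt = start + commitMs;
        }
        return Math.max(clientClock, workerFreeAt);
    }

    public static void main(String[] args) {
        int docs = 2000;
        long addMs = 25;    // assumed round-trip time per add request
        long commitMs = 5;  // assumed cost of one commit
        // prints "blocking: 60000 ms, background: 50005 ms"
        System.out.println("blocking: " + blockingTotalMs(docs, addMs, commitMs)
                + " ms, background: " + backgroundTotalMs(docs, addMs, commitMs) + " ms");
    }
}
```

In this model, as long as a commit takes less time than the gap between requests, the background variant costs the client only one trailing commit instead of one per document.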