Parallel Indexing With Solr?
Does Solr allow parallelism (parallel computing) for indexing?
Re: Parallel Indexing With Solr?
On 29 March 2013 14:56, Furkan KAMACI furkankam...@gmail.com wrote:
> Does Solr allow parallelism (parallel computing) for indexing?

What do you mean by parallel computing in this context? Solr can use multiple threads for indexing, if that is what you are asking.

Regards,
Gora
Re: Parallel Indexing With Solr?
Yes.

You can index from any app that can hit Solr with multiple threads.
You can use StreamingUpdateSolrServer, at least in older Solrs, to handle multi-threading for you.
You can index from a MapReduce job.

Otis
--
Solr ElasticSearch Support
http://sematext.com/

On Fri, Mar 29, 2013 at 5:26 AM, Furkan KAMACI furkankam...@gmail.com wrote:
> Does Solr allow parallelism (parallel computing) for indexing?
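A minimal sketch of the "hit Solr with multiple threads" approach, written in Python with only the standard library. The URL, the JSON update format, and the function names (`batches`, `post_batch`, `index_in_parallel`) are assumptions for illustration and are not taken from the thread; whether the update handler accepts JSON at that path depends on the Solr version and configuration, and commits are not shown.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def batches(docs, size):
    """Yield consecutive lists of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def post_batch(batch, url="http://localhost:8983/solr/update"):
    """Send one batch as a JSON array to an update handler (the URL is an assumption)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_in_parallel(docs, send=post_batch, threads=4, batch_size=100):
    """Send batches from several client threads; return the number of batches sent."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Each batch is an independent request, so the server can process them concurrently.
        # list() forces evaluation so that any exception raised by `send` propagates.
        results = list(pool.map(send, batches(docs, batch_size)))
    return len(results)
```

Passing a different `send` callable makes it possible to try the batching and threading logic without a running server. In Java, StreamingUpdateSolrServer is the SolrJ class the thread recommends for the same job.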
Re: Parallel Indexing With Solr?
Can you tell me more about "You can index from a MapReduce job"? I use Nutch, and it tells Solr to index and reindex. I know that I can use MapReduce jobs on the Nutch side; however, can I use MapReduce jobs on the Solr side (i.e. for indexing etc.)?

2013/3/29 Otis Gospodnetic otis.gospodne...@gmail.com
> Yes.
>
> You can index from any app that can hit Solr with multiple threads.
> You can use StreamingUpdateSolrServer, at least in older Solrs, to handle multi-threading for you.
> You can index from a MapReduce job.
>
> Otis
> --
> Solr ElasticSearch Support
> http://sematext.com/
>
> On Fri, Mar 29, 2013 at 5:26 AM, Furkan KAMACI furkankam...@gmail.com wrote:
>> Does Solr allow parallelism (parallel computing) for indexing?
Re: Parallel indexing in Solr
On Mon, Feb 6, 2012 at 5:55 PM, Per Steffensen st...@designware.dk wrote:
> Sami Siren wrote:
>> On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote:
>>> Actually, right now I am trying to find out what my bottleneck is. The setup is more complex than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem; I am investigating different things, but I just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess.
>>
>> What kind of/how many discs do you have for your shards? Also, what kind of server are you experimenting with?
>
> Grrr, that's where I have a little fight with operations. For now they gave me one (fairly big) machine with XenServer. I create my machines as Xen VMs on top of that. One of the things I don't like about this (besides that I don't trust Xen to do its virtualization right, or at least to give me correct readings on IO) is that disk space is assigned from an iSCSI-connected SAN that they all share (including the line out there). But for now it actually doesn't look like a disk IO problem. It looks like network bottlenecks (but to some extent they all also share the network) among all the components in our setup: our client plus the Lily stack (HDFS, HBase, ZK, Lily Server, Solr etc.). Well, it is complex, but anyway ...

You could try to isolate the bottleneck by testing the indexing speed from the local machine hosting Solr. Also, tools like iostat or sar might give you more details about the disk side.

--
Sami Siren
Re: Parallel indexing in Solr
> You could try to isolate the bottleneck by testing the indexing speed from the local machine hosting Solr. Also, tools like iostat or sar might give you more details about the disk side.

Yes, I am doing various things to isolate the bottleneck. I'm also profiling the JVM, and I am already using iostat, top and sar. Thanks.

This question was originally just meant to get an early indication of whether or not Jetty was at all designed for parallel, production-like processing. Now I believe it is, until I prove that it does not live up to my requirements. Thanks!

> --
> Sami Siren
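As a rough illustration of the IO-wait figure that tools like iostat, top and sar report, the following Python sketch computes the share of CPU time spent in iowait between two snapshots of the aggregate `cpu` line in Linux's `/proc/stat`. The function names are hypothetical, and this is an assumption about how one might cross-check those tools, not something anyone in the thread did.

```python
def cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of counters."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_share(before, after):
    """Fraction of elapsed CPU time spent in iowait between two samples."""
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal, ...
    # Guest time is already counted in user/nice, so only the first eight
    # fields are summed.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0
```

To use it, read `/proc/stat` twice a few seconds apart, pass each text through `cpu_times`, and hand both results to `iowait_share`. Inside a virtual machine the counters may not reflect what the underlying host is doing, which is the concern raised about Xen in this thread.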
Re: Parallel indexing in Solr
See response below

Erick Erickson wrote:
> Unfortunately, the answer is "it depends(tm)". First question: How are you indexing things? SolrJ? post.jar?

SolrJ, CommonsHttpSolrServer

> But some observations:
>
> 1. Sure, using multiple cores will give you some parallelism. So will using a single core with something like SolrJ and StreamingUpdateSolrServer.

So SolrJ with CommonsHttpSolrServer will not support handling several requests concurrently?

> Especially with trunk (4.0) and the Document Writer Per Thread stuff.

We are using trunk (4.0). Can you provide me with a little more info on this Document Writer Per Thread stuff? A link or something?

> In 3.x, you'll see some pauses when segments are merged that you can't get around (per core). See http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/ for an excellent writeup. But whether or not you use several cores should be determined by your problem space, certainly not by trying to increase the throughput. Indexing usually takes a back seat to search performance.

We will have few searches, but a lot of indexing.

> 2. General settings are hard to come by. If you're sending structured documents that use Tika to parse the data behind the scenes, your performance will be much different (slower) than sending SolrInputDocuments (SolrJ).

We are sending SolrInputDocuments.

> 3. The recommended servlet container is, generally, "the one you're most comfortable with". Tomcat is certainly popular. That said, use whatever you're most comfortable with until you see a performance problem. Odds are you'll find your load on Solr is at its limit before your servlet container has problems.

So Jetty is not an easy-to-use but non-performant container?

> 4. Monitor your CPU, and fire more requests at it until it hits 100%. Note that there are occasions where the servlet container limits the number of outstanding requests it will allow and queues the ones over that limit (find the magic setting to increase this if it's a problem; it differs by container). If you start to see your response times lengthen but the CPU not being fully utilized, that may be the cause.

Actually, right now I am trying to find out what my bottleneck is. The setup is more complex than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem; I am investigating different things, but I just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess.

> 5. How high is "high performance"? On a stock Solr with the Wikipedia dump (11M docs), all running on my laptop, I see 7K docs/sec indexed. I know of installations that see 60 docs/sec or even less. I'm sending simple docs with SolrJ locally, and they're sending huge documents over the wire that Tika handles. There are just so many variables that it's hard to say anything except "try it and see".

Well, eventually we need to be able to index and delete about 50 million documents per day. We will need to keep a history of 2 years of data in our system; deletion will not start before we have been in production for 2 years. At that point in time the system needs to contain 2 years * 365 days/year * 50 million docs/day = 36.5 billion documents. At that point 50 million documents need to be deleted and indexed per day; before that we only need to index 50 million documents per day. We are aware that we are probably going to need a certain amount of hardware for this, but the most important thing is that we make a scalable setup so that we can get to these kinds of numbers at all. Right now I am focusing on getting the most out of one Solr instance, potentially with several cores, though.

> Best
> Erick
>
> On Fri, Feb 3, 2012 at 3:55 AM, Per Steffensen st...@designware.dk wrote:
>> Hi
>>
>> This topic has probably been covered before, but I haven't had the luck to find the answer.
>>
>> We are running Solr instances with several cores inside, with Solr running out of the box on top of Jetty. I believe Jetty receives all the HTTP requests for indexing new documents and forwards them to the Solr engine. What kind of parallelism does this setup provide? Can more than one index request get processed concurrently? How many? How can I increase the number of index requests that can be handled in parallel? Will I get better parallelism by running on a web container other than Jetty, e.g. Tomcat? What is the recommended web container for high-performance production systems?
>>
>> Thanks!
>>
>> Regards, Per Steffensen
Re: Parallel indexing in Solr
Right. See below.

On Mon, Feb 6, 2012 at 7:53 AM, Per Steffensen st...@designware.dk wrote:
> See response below
>
> Erick Erickson wrote:
>> Unfortunately, the answer is "it depends(tm)". First question: How are you indexing things? SolrJ? post.jar?
>
> SolrJ, CommonsHttpSolrServer
>
>> But some observations:
>>
>> 1. Sure, using multiple cores will give you some parallelism. So will using a single core with something like SolrJ and StreamingUpdateSolrServer.
>
> So SolrJ with CommonsHttpSolrServer will not support handling several requests concurrently?

Nope. Use StreamingUpdateSolrServer; it should be just a drop-in with a different constructor.

>> Especially with trunk (4.0) and the Document Writer Per Thread stuff.
>
> We are using trunk (4.0). Can you provide me with a little more info on this Document Writer Per Thread stuff? A link or something?

I already did; follow the link I provided.

>> In 3.x, you'll see some pauses when segments are merged that you can't get around (per core). See http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/ for an excellent writeup. But whether or not you use several cores should be determined by your problem space, certainly not by trying to increase the throughput. Indexing usually takes a back seat to search performance.
>
> We will have few searches, but a lot of indexing.

Hmmm, this is the inverse of most installations, so it's good to know.

>> 2. General settings are hard to come by. If you're sending structured documents that use Tika to parse the data behind the scenes, your performance will be much different (slower) than sending SolrInputDocuments (SolrJ).
>
> We are sending SolrInputDocuments.
>
>> 3. The recommended servlet container is, generally, "the one you're most comfortable with". Tomcat is certainly popular. That said, use whatever you're most comfortable with until you see a performance problem. Odds are you'll find your load on Solr is at its limit before your servlet container has problems.
>
> So Jetty is not an easy-to-use but non-performant container?

Again, test and see. Lots of commercial systems use Jetty. Consider that you're just sending sets of documents at Solr; the container is doing very little work. You are batching up your Solr documents, aren't you?

>> 4. Monitor your CPU, and fire more requests at it until it hits 100%. Note that there are occasions where the servlet container limits the number of outstanding requests it will allow and queues the ones over that limit (find the magic setting to increase this if it's a problem; it differs by container). If you start to see your response times lengthen but the CPU not being fully utilized, that may be the cause.
>
> Actually, right now I am trying to find out what my bottleneck is. The setup is more complex than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem; I am investigating different things, but I just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess.

You should see this differ with StreamingUpdateSolrServer, assuming your client can feed documents fast enough. You can consider having multiple clients feed the same Solr indexer if necessary.

>> 5. How high is "high performance"? On a stock Solr with the Wikipedia dump (11M docs), all running on my laptop, I see 7K docs/sec indexed. I know of installations that see 60 docs/sec or even less. I'm sending simple docs with SolrJ locally, and they're sending huge documents over the wire that Tika handles. There are just so many variables that it's hard to say anything except "try it and see".
>
> Well, eventually we need to be able to index and delete about 50 million documents per day. We will need to keep a history of 2 years of data in our system; deletion will not start before we have been in production for 2 years. At that point in time the system needs to contain 2 years * 365 days/year * 50 million docs/day = 36.5 billion documents. At that point 50 million documents need to be deleted and indexed per day; before that we only need to index 50 million documents per day. We are aware that we are probably going to need a certain amount of hardware for this, but the most important thing is that we make a scalable setup so that we can get to these kinds of numbers at all. Right now I am focusing on getting the most out of one Solr instance, potentially with several cores, though.

My off-the-top-of-my-head feeling is that this will be a LOT of hardware.

You'll without doubt be sharding the index. NOTE: Shards are cores, just special-purpose ones, i.e. they all use the same schema. When Solr folks see "cores", we assume several cores that may have different schemas and handle unrelated queries. It sounds like you're talking about a sharded
Re: Parallel indexing in Solr
On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote:
> Actually, right now I am trying to find out what my bottleneck is. The setup is more complex than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem; I am investigating different things, but I just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess.

What kind of/how many discs do you have for your shards? Also, what kind of server are you experimenting with?

--
Sami Siren
Re: Parallel indexing in Solr
>> So SolrJ with CommonsHttpSolrServer will not support handling several requests concurrently?
>
> Nope. Use StreamingUpdateSolrServer; it should be just a drop-in with a different constructor.

I will try to do that. It is a little bit difficult for me, as we are actually not dealing with Solr ourselves. We are using Lily, but I will modify Lily, compile it and see how it goes.

>>> Especially with trunk (4.0) and the Document Writer Per Thread stuff.
>>
>> We are using trunk (4.0). Can you provide me with a little more info on this Document Writer Per Thread stuff? A link or something?
>
> I already did; follow the link I provided.

Ahh ok, I didn't get it the first time that the link below was about that: http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/

>> So Jetty is not an easy-to-use but non-performant container?
>
> Again, test and see. Lots of commercial systems use Jetty. Consider that you're just sending sets of documents at Solr; the container is doing very little work. You are batching up your Solr documents, aren't you?

I haven't looked into Lily to see whether or not documents are batched, but I will. I didn't expect Jetty to be the problem; I basically just wanted to know that it was not a stupid everything-in-a-single-thread container, almost designed not to perform (because the focus might be different, e.g. providing an easy-to-use/understand container for testing etc.).

>> Actually, right now I am trying to find out what my bottleneck is.
>
> You should see this differ with StreamingUpdateSolrServer, assuming your client can feed documents fast enough. You can consider having multiple clients feed the same Solr indexer if necessary.

Thanks!

>>> 5. How high is "high performance"? On a stock Solr with the Wikipedia dump (11M docs), all running on my laptop, I see 7K docs/sec indexed. I know of installations that see 60 docs/sec or even less. I'm sending simple docs with SolrJ locally, and they're sending huge documents over the wire that Tika handles. There are just so many variables that it's hard to say anything except "try it and see".
>>
>> 50 million documents need to be deleted and indexed per day. 2 years of history = 36 billion docs in store.
>
> My off-the-top-of-my-head feeling is that this will be a LOT of hardware.

Well, it takes what it takes. Someone else will buy the hardware. My first concern is to make sure we have a system that scales, so that we can buy ourselves out of problems by buying more hardware. On the other hand, of course, I want to provide a system that makes the most of the hardware.

> You'll without doubt be sharding the index. NOTE: Shards are cores, just special-purpose ones, i.e. they all use the same schema. When Solr folks see "cores", we assume several cores that may have different schemas and handle unrelated queries. It sounds like you're talking about a sharded system rather than independent cores, is that so?

Yes, that is correct. We only have one single schema/config shared by all cores through ZK. So the many cores are just for sharding, because I do not expect that it will work very well with 20 billion docs in the same core/shard :-)

> You should have no trouble indexing 50M documents/day, even assuming that the ingestion rate is not evenly distributed. The link I referenced talks about indexing 10M documents in a little over 6 minutes. YMMV, however.
>
> I think you're going along the right path when trying to push a single indexer to the max. My setup uses Jetty and is getting 5-7K docs/second, so I doubt it's inherently a Jetty problem, although there may be configuration tweaks getting in your way.
>
> Bottom line: I doubt it's a Jetty issue at this point, but I've been wrong on too many occasions to count. I'd be looking in other places first, though. Start with the streaming update Solr server, and also check whether your clients can spit out documents fast enough...

I will have a look at all that. Thanks!

> Best
> Erick
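The figures discussed in this message can be checked with a short back-of-the-envelope calculation. The Python sketch below uses only numbers quoted in the thread, except for the 8-hour ingestion window, which is a hypothetical stand-in for "not evenly distributed" load.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def docs_per_second(docs, seconds):
    """Average rate needed to process `docs` documents in `seconds` seconds."""
    return docs / seconds


docs_per_day = 50_000_000

# Index size once two years of history are retained.
retained_docs = 2 * 365 * docs_per_day

# Average indexing rate if the load were spread evenly over 24 hours.
average_rate = docs_per_second(docs_per_day, SECONDS_PER_DAY)

# Hypothetical uneven load: the whole day's documents arrive within 8 hours.
peak_rate = docs_per_second(docs_per_day, 8 * 60 * 60)

# Rate implied by the cited benchmark (10M documents in roughly 6 minutes).
benchmark_rate = docs_per_second(10_000_000, 6 * 60)

print(f"retained documents: {retained_docs:,}")
print(f"average rate: {average_rate:,.0f} docs/sec")
print(f"8-hour peak rate: {peak_rate:,.0f} docs/sec")
print(f"benchmark rate: {benchmark_rate:,.0f} docs/sec")
```

An even spread works out to roughly 579 docs/sec, and the 8-hour case to roughly 1,736 docs/sec, both below the 5-7K docs/sec reported for a single Jetty-hosted indexer. The retained total of 36.5 billion documents is the harder requirement, which is why sharding comes up.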
Re: Parallel indexing in Solr
Sami Siren wrote:
> On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote:
>> Actually, right now I am trying to find out what my bottleneck is. The setup is more complex than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem; I am investigating different things, but I just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess.
>
> What kind of/how many discs do you have for your shards? Also, what kind of server are you experimenting with?

Grrr, that's where I have a little fight with operations. For now they gave me one (fairly big) machine with XenServer. I create my machines as Xen VMs on top of that. One of the things I don't like about this (besides that I don't trust Xen to do its virtualization right, or at least to give me correct readings on IO) is that disk space is assigned from an iSCSI-connected SAN that they all share (including the line out there). But for now it actually doesn't look like a disk IO problem. It looks like network bottlenecks (but to some extent they all also share the network) among all the components in our setup: our client plus the Lily stack (HDFS, HBase, ZK, Lily Server, Solr etc.). Well, it is complex, but anyway ...

> --
> Sami Siren
Re: Parallel indexing in Solr
*grin*. I've had recurring discussions with executive-level folks that no matter how many VMs you host on a machine, and no matter how big that machine is, there really, truly, *is* some hardware underlying it all that really, truly, *does* have some limits. And adding more VMs doesn't somehow get around those limits.

Good luck!
Erick

On Mon, Feb 6, 2012 at 10:55 AM, Per Steffensen st...@designware.dk wrote:
> Sami Siren wrote:
>> On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote:
>>> Actually, right now I am trying to find out what my bottleneck is. The setup is more complex than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem; I am investigating different things, but I just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess.
>>
>> What kind of/how many discs do you have for your shards? Also, what kind of server are you experimenting with?
>
> Grrr, that's where I have a little fight with operations. For now they gave me one (fairly big) machine with XenServer. I create my machines as Xen VMs on top of that. One of the things I don't like about this (besides that I don't trust Xen to do its virtualization right, or at least to give me correct readings on IO) is that disk space is assigned from an iSCSI-connected SAN that they all share (including the line out there). But for now it actually doesn't look like a disk IO problem. It looks like network bottlenecks (but to some extent they all also share the network) among all the components in our setup: our client plus the Lily stack (HDFS, HBase, ZK, Lily Server, Solr etc.). Well, it is complex, but anyway ...
>
>> --
>> Sami Siren
Parallel indexing in Solr
Hi

This topic has probably been covered before, but I haven't had the luck to find the answer.

We are running Solr instances with several cores inside, with Solr running out of the box on top of Jetty. I believe Jetty receives all the HTTP requests for indexing new documents and forwards them to the Solr engine. What kind of parallelism does this setup provide? Can more than one index request get processed concurrently? How many? How can I increase the number of index requests that can be handled in parallel? Will I get better parallelism by running on a web container other than Jetty, e.g. Tomcat? What is the recommended web container for high-performance production systems?

Thanks!

Regards, Per Steffensen
Re: Parallel indexing in Solr
Unfortunately, the answer is "it depends(tm)". First question: How are you indexing things? SolrJ? post.jar?

But some observations:

1. Sure, using multiple cores will give you some parallelism. So will using a single core with something like SolrJ and StreamingUpdateSolrServer. Especially with trunk (4.0) and the Document Writer Per Thread stuff. In 3.x, you'll see some pauses when segments are merged that you can't get around (per core). See http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/ for an excellent writeup. But whether or not you use several cores should be determined by your problem space, certainly not by trying to increase the throughput. Indexing usually takes a back seat to search performance.

2. General settings are hard to come by. If you're sending structured documents that use Tika to parse the data behind the scenes, your performance will be much different (slower) than sending SolrInputDocuments (SolrJ).

3. The recommended servlet container is, generally, "the one you're most comfortable with". Tomcat is certainly popular. That said, use whatever you're most comfortable with until you see a performance problem. Odds are you'll find your load on Solr is at its limit before your servlet container has problems.

4. Monitor your CPU, and fire more requests at it until it hits 100%. Note that there are occasions where the servlet container limits the number of outstanding requests it will allow and queues the ones over that limit (find the magic setting to increase this if it's a problem; it differs by container). If you start to see your response times lengthen but the CPU not being fully utilized, that may be the cause.

5. How high is "high performance"? On a stock Solr with the Wikipedia dump (11M docs), all running on my laptop, I see 7K docs/sec indexed. I know of installations that see 60 docs/sec or even less. I'm sending simple docs with SolrJ locally, and they're sending huge documents over the wire that Tika handles. There are just so many variables that it's hard to say anything except "try it and see".

Best
Erick

On Fri, Feb 3, 2012 at 3:55 AM, Per Steffensen st...@designware.dk wrote:
> Hi
>
> This topic has probably been covered before, but I haven't had the luck to find the answer.
>
> We are running Solr instances with several cores inside, with Solr running out of the box on top of Jetty. I believe Jetty receives all the HTTP requests for indexing new documents and forwards them to the Solr engine. What kind of parallelism does this setup provide? Can more than one index request get processed concurrently? How many? How can I increase the number of index requests that can be handled in parallel? Will I get better parallelism by running on a web container other than Jetty, e.g. Tomcat? What is the recommended web container for high-performance production systems?
>
> Thanks!
>
> Regards, Per Steffensen
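Observation 4 above ("fire more requests at it until it hits 100%") can be approximated from the client side by raising the number of sender threads until throughput stops improving. The Python sketch below is one way to do that under stated assumptions: `send` is any callable that submits one batch, and the function names, thread counts and the 10% gain threshold are hypothetical choices, not values from the thread. It measures client-side throughput only; CPU and IO on the server still need to be watched separately.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(send, batches, threads):
    """Send all batches with `threads` workers; return documents per second."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() forces evaluation so that errors raised by `send` surface here.
        list(pool.map(send, batches))
    elapsed = time.perf_counter() - started
    return sum(len(batch) for batch in batches) / elapsed


def find_plateau(send, batches, thread_counts=(1, 2, 4, 8), min_gain=1.10):
    """Raise the thread count until throughput improves by less than `min_gain`."""
    results = []
    for threads in thread_counts:
        rate = measure_throughput(send, batches, threads)
        plateaued = bool(results) and rate < results[-1][1] * min_gain
        results.append((threads, rate))
        if plateaued:
            break
    return results
```

`batches` must be a list, because it is reused for every thread count. If the measured rate flattens while the server's CPU is far from saturated, the limit is likely somewhere else, for example the container's request queue, the network, or disk IO.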