Re: SOLR Data Locality
On 3/17/2017 11:14 AM, Imad Qureshi wrote:
> I understand that but unfortunately that's not an option right now. We
> already have 16 TB of index in HDFS.
>
> So let me rephrase this question. How important is data locality for SOLR. Is
> performance impacted if SOLR data is on a remote node?

What's going to matter is how fast the data can be retrieved. With standard local filesystems, the operating system will use unallocated memory to cache the data, so if you have enough available memory for that caching to be effective, access is lightning fast -- the most requested index data will be in memory, and pulled directly from there into the application.

If the disk has to be read to obtain the needed data, it will be slow. If data has to be transferred over a network that's gigabit or slower, that is also slow. Faster network technologies are available for a price premium, but if a disk has to be read to get the data, the network speed won't matter.

Good performance means avoiding going to the disk or transferring over the network. SSD storage is faster than regular disks, but still not as fast as main memory, and increased storage speed probably won't matter if the network can't keep up.

If I'm not mistaken, I think an HDFS client can allocate system memory for caching purposes to avoid the slow transfer for frequently requested data. If my understanding is correct, then enough memory allocated to the HDFS client MIGHT avoid network/disk transfer for the important data in the index ... but whether this works in practice is a question I cannot answer.

Unless your 16TB of index data is being utilized by MANY Solr servers that each use a very small part of the data and have the ability to cache a significant percentage of the data they're using, it's highly unlikely that you're going to have enough memory for good caching. Indexes that large are typically slow unless you can afford a LOT of hardware, which means a lot of money.

Thanks,
Shawn
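For anyone who wants to put rough numbers on the caching argument above, here is a minimal back-of-the-envelope sketch. The 16 TB index and the 40 nodes (30 existing plus 10 new) come from this thread; the RAM per node and the memory reserved for the JVM heap and other services are purely hypothetical placeholders, and the sketch assumes the index is spread evenly with no extra replicas. It only estimates how much of each node's share of the index could sit in memory, not actual query latency.

```python
def cache_coverage(index_tb, nodes, ram_gb_per_node, reserved_gb_per_node):
    """Return (index GB per node, fraction of that share that fits in free RAM).

    Assumes the index is spread evenly across nodes and that all RAM not
    reserved for heap/other services is available for caching index data.
    """
    index_gb_per_node = index_tb * 1024 / nodes
    cache_gb = max(ram_gb_per_node - reserved_gb_per_node, 0)
    fraction = min(cache_gb / index_gb_per_node, 1.0)
    return index_gb_per_node, fraction


# 16 TB and 40 nodes are from the thread; 128 GB RAM and 32 GB reserved
# are hypothetical values chosen only for illustration.
per_node, frac = cache_coverage(
    index_tb=16, nodes=40, ram_gb_per_node=128, reserved_gb_per_node=32
)
print(f"{per_node:.1f} GB of index per node, {frac:.0%} cacheable")
```

With these placeholder figures, each node holds roughly 410 GB of index and could cache a bit under a quarter of it. Adding replicas increases the total amount of index data, so the cacheable fraction per node would drop further unless memory grows with it.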
Re: SOLR Data Locality
Imad Qureshi wrote:
> I understand that but unfortunately that's not an option right now.
> We already have 16 TB of index in HDFS.
>
> So let me rephrase this question. How important is data locality for
> SOLR. Is performance impacted if SOLR data is on a remote node?

The short answer is yes, the long answer is
https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Anecdotally we did some experiments prior to building our multi-TB search setup, where we compared local SSDs with remote (Isilon) SSDs. That setup was with simple searches and some faceting. I was a bit surprised that the slowdown was only 3x. I would expect the speed difference to be even smaller if the underlying storage is slow (spinning disks). Old blog post at
https://sbdevel.wordpress.com/2013/12/06/danish-webscale/

I don't understand the expected gain of adding replicas, if the data are remote. Why can't the replica Solrs run on the nodes with the data? Do you have very CPU-intensive search?

- Toke Eskildsen
Re: SOLR Data Locality
Hi Mike

I understand that but unfortunately that's not an option right now. We already have 16 TB of index in HDFS.

So let me rephrase this question. How important is data locality for SOLR. Is performance impacted if SOLR data is on a remote node?

Thanks
Imad

> On Mar 17, 2017, at 12:02 PM, Mike Thomsen wrote:
>
> I've only ever used the HDFS support with Cloudera's build, but my experience
> turned me off to use HDFS. I'd much rather use the native file system over
> HDFS.
>
>> On Tue, Mar 14, 2017 at 10:19 AM, Muhammad Imad Qureshi wrote:
>> We have a 30 node Hadoop cluster and each data node has a SOLR instance also
>> running. Data is stored in HDFS. We are adding 10 nodes to the cluster.
>> After adding nodes, we'll run HDFS balancer and also create SOLR replicas on
>> new nodes. This will affect data locality. does this impact how solr works
>> (I mean performance) if the data is on a remote node?
>> Thanks
>> Imad
Re: SOLR Data Locality
I've only ever used the HDFS support with Cloudera's build, but my experience turned me off to use HDFS. I'd much rather use the native file system over HDFS.

On Tue, Mar 14, 2017 at 10:19 AM, Muhammad Imad Qureshi <imadgr...@yahoo.com.invalid> wrote:
> We have a 30 node Hadoop cluster and each data node has a SOLR instance
> also running. Data is stored in HDFS. We are adding 10 nodes to the
> cluster. After adding nodes, we'll run HDFS balancer and also create SOLR
> replicas on new nodes. This will affect data locality. does this impact how
> solr works (I mean performance) if the data is on a remote node?
> Thanks
> Imad
SOLR Data Locality
We have a 30 node Hadoop cluster and each data node has a SOLR instance also running. Data is stored in HDFS. We are adding 10 nodes to the cluster. After adding nodes, we'll run HDFS balancer and also create SOLR replicas on new nodes. This will affect data locality. Does this impact how Solr works (I mean performance) if the data is on a remote node?

Thanks
Imad