Re: SOLR Data Locality

2017-03-20 Thread Shawn Heisey
On 3/17/2017 11:14 AM, Imad Qureshi wrote:
> I understand that but unfortunately that's not an option right now. We 
> already have 16 TB of index in HDFS. 
>
> So let me rephrase this question. How important is data locality for SOLR? Is 
> performance impacted if SOLR data is on a remote node?

What's going to matter is how fast the data can be retrieved.  With
standard local filesystems, the operating system will use unallocated
memory to cache the data, so if you have enough available memory for
that caching to be effective, access is lightning fast -- the most
requested index data will be in memory, and pulled directly from there
into the application.  If the disk has to be read to obtain the needed
data, it will be slow.  If data has to be transferred over a network
that's gigabit or slower, that is also slow.  Faster network
technologies are available for a price premium, but if a disk has to be
read to get the data, the network speed won't matter.  Good performance
means avoiding going to the disk or transferring over the network.

SSD storage is faster than regular disks, but still not as fast as main
memory, and increased storage speed probably won't matter if the network
can't keep up.
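To put a rough number on the network side, here is a minimal sketch in Python. It computes only the theoretical lower bound for moving data over a link and ignores protocol overhead, latency, and contention, so real transfers will be slower. The function name and the 1 GiB payload are illustrative assumptions, not something measured on any cluster.

```python
def transfer_seconds(num_bytes: float, link_bits_per_second: float) -> float:
    """Lower bound on the time to move num_bytes over a link.

    Ignores protocol overhead, latency, and contention, so real
    transfers take longer than this.
    """
    return num_bytes * 8 / link_bits_per_second


GIGABIT = 1e9       # 1 Gbit/s, i.e. at most 125 MB/s
TEN_GIGABIT = 10e9  # 10 Gbit/s

if __name__ == "__main__":
    one_gib = 1024 ** 3  # hypothetical amount of index data to fetch
    for name, rate in (("1 Gbit/s", GIGABIT), ("10 Gbit/s", TEN_GIGABIT)):
        print(f"{name}: at least {transfer_seconds(one_gib, rate):.2f} s per GiB")
```

Even this best case is measured in seconds per gigabyte on a gigabit link, which is why index data that has to cross the network on most queries will feel slow next to data served from the local page cache.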

If I'm not mistaken, an HDFS client can allocate system memory
for caching purposes to avoid the slow transfer for frequently requested
data.  If my understanding is correct, then enough memory allocated to
the HDFS client MIGHT avoid network/disk transfer for the important data
in the index ... but whether this works in practice is a question I
cannot answer.

Unless your 16TB of index data is being utilized by MANY Solr servers
that each use a very small part of the data and have the ability to
cache a significant percentage of the data they're using, it's highly
unlikely that you're going to have enough memory for good caching. 
Indexes that large are typically slow unless you can afford a LOT of
hardware, which means a lot of money.
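The caching argument above can be sketched as back-of-the-envelope arithmetic. The 16 TB index and the 40 nodes (30 existing plus 10 new) come from this thread; the 64 GB of memory available for caching per node is purely a hypothetical figure, and the sketch assumes the index is spread evenly with no extra replicas, which would make the picture worse.

```python
def cacheable_fraction(index_bytes: float, nodes: int,
                       cache_bytes_per_node: float) -> float:
    """Fraction of each node's share of the index that fits in its cache.

    Assumes the index is spread evenly across nodes and that no
    additional replicas are stored.
    """
    per_node_index = index_bytes / nodes
    return min(1.0, cache_bytes_per_node / per_node_index)


if __name__ == "__main__":
    index_bytes = 16e12   # 16 TB of index, from the thread
    nodes = 40            # 30 existing nodes plus 10 new ones
    cache_per_node = 64e9 # ASSUMPTION: 64 GB free for caching per node

    share_gb = index_bytes / nodes / 1e9
    fraction = cacheable_fraction(index_bytes, nodes, cache_per_node)
    print(f"Index per node: {share_gb:.0f} GB")
    print(f"Cacheable fraction: {fraction:.0%}")
```

Under these assumptions each node holds about 400 GB of index, so only a small part of it can stay in memory. Whether that is enough depends on how concentrated the queries are on a small, frequently used part of the index.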

Thanks,
Shawn



Re: SOLR Data Locality

2017-03-17 Thread Toke Eskildsen
Imad Qureshi wrote:
> I understand that but unfortunately that's not an option right now.
> We already have 16 TB of index in HDFS.
> 
> So let me rephrase this question. How important is data locality for
> SOLR? Is performance impacted if SOLR data is on a remote node?

The short answer is yes, the long answer is 
https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Anecdotally we did some experiments prior to building our multi-TB search 
setup, where we compared local SSDs with remote (Isilon) SSDs. That setup was 
with simple searches and some faceting. I was a bit surprised that the slowdown 
was only 3x. I would expect the speed difference to be even smaller if the 
underlying storage is slow (spinning disks). Old blog post at 
https://sbdevel.wordpress.com/2013/12/06/danish-webscale/


I don't understand the expected gain of adding replicas, if the data are 
remote. Why can't the replica Solrs run on the nodes with the data? Do you have 
very CPU-intensive search?

- Toke Eskildsen


Re: SOLR Data Locality

2017-03-17 Thread Imad Qureshi
Hi Mike

I understand that but unfortunately that's not an option right now. We already 
have 16 TB of index in HDFS. 

So let me rephrase this question. How important is data locality for SOLR? Is 
performance impacted if SOLR data is on a remote node?

Thanks
Imad

> On Mar 17, 2017, at 12:02 PM, Mike Thomsen wrote:
> 
> I've only ever used the HDFS support with Cloudera's build, but my experience 
> turned me off using HDFS. I'd much rather use the native file system over 
> HDFS.
> 
>> On Tue, Mar 14, 2017 at 10:19 AM, Muhammad Imad Qureshi wrote:
>> We have a 30 node Hadoop cluster and each data node has a SOLR instance also 
>> running. Data is stored in HDFS. We are adding 10 nodes to the cluster. 
>> After adding nodes, we'll run HDFS balancer and also create SOLR replicas on 
>> new nodes. This will affect data locality. Does this impact how Solr works 
>> (I mean performance) if the data is on a remote node?
>>
>> Thanks
>> Imad
> 


Re: SOLR Data Locality

2017-03-17 Thread Mike Thomsen
I've only ever used the HDFS support with Cloudera's build, but my
experience turned me off using HDFS. I'd much rather use the native file
system over HDFS.

On Tue, Mar 14, 2017 at 10:19 AM, Muhammad Imad Qureshi <
imadgr...@yahoo.com.invalid> wrote:

> We have a 30 node Hadoop cluster and each data node has a SOLR instance
> also running. Data is stored in HDFS. We are adding 10 nodes to the
> cluster. After adding nodes, we'll run HDFS balancer and also create SOLR
> replicas on new nodes. This will affect data locality. Does this impact how
> Solr works (I mean performance) if the data is on a remote node?
>
> Thanks
> Imad
>


SOLR Data Locality

2017-03-14 Thread Muhammad Imad Qureshi
We have a 30 node Hadoop cluster and each data node has a SOLR instance also 
running. Data is stored in HDFS. We are adding 10 nodes to the cluster. After 
adding nodes, we'll run HDFS balancer and also create SOLR replicas on new 
nodes. This will affect data locality. Does this impact how Solr works (I mean 
performance) if the data is on a remote node?

Thanks
Imad