Re: Solr on HDFS: increase in query time with increase in data
On 12/16/2016 11:58 AM, Chetas Joshi wrote:
> How different is the index data caching mechanism for the Streaming
> API from the cursor approach?

Solr and Lucene do not handle that caching. Systems external to Solr (like the OS, or HDFS) handle the caching. The cache effectiveness will be a combination of the cache size, overall data size, and the data access patterns of the application.

I do not know enough to tell you how the cursorMark feature and the streaming API work when they access the index data. I would imagine them to be pretty similar, but cannot be sure about that.

Thanks,
Shawn
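To illustrate the point about cache size, data size, and access patterns, here is a minimal simulation sketch. It is not how the OS page cache or HDFS actually evicts data: the LRU policy, the block counts, and the "90% of reads hit 10% of the blocks" skew are all assumptions picked for illustration.

```python
import random
from collections import OrderedDict


def lru_hit_rate(accesses, cache_blocks):
    """Replay block ids through an LRU cache and return the fraction of hits."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for block in accesses:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / total if total else 0.0


def uniform_accesses(rng, data_blocks, n):
    """Every block is equally likely to be read."""
    return [rng.randrange(data_blocks) for _ in range(n)]


def skewed_accesses(rng, data_blocks, n, hot_fraction=0.1, hot_probability=0.9):
    """Assumed skew: most reads go to a small 'hot' subset of the blocks."""
    hot_blocks = max(1, int(data_blocks * hot_fraction))
    out = []
    for _ in range(n):
        if rng.random() < hot_probability:
            out.append(rng.randrange(hot_blocks))
        else:
            out.append(rng.randrange(hot_blocks, data_blocks))
    return out


rng = random.Random(42)   # fixed seed so runs are repeatable
DATA_BLOCKS = 1000        # assumed total data size, in blocks
CACHE_BLOCKS = 200        # assumed cache size: 20% of the data
N = 50_000

uniform_rate = lru_hit_rate(uniform_accesses(rng, DATA_BLOCKS, N), CACHE_BLOCKS)
skewed_rate = lru_hit_rate(skewed_accesses(rng, DATA_BLOCKS, N), CACHE_BLOCKS)
print(f"uniform access hit rate: {uniform_rate:.2f}")
print(f"skewed access hit rate:  {skewed_rate:.2f}")
```

With the same cache and the same data size, the skewed workload hits the cache far more often than the uniform one. A full scan of the index, which is roughly what a large export does, behaves more like the uniform case.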
Re: Solr on HDFS: increase in query time with increase in data
Thank you everyone. I will add nodes to the SolrCloud and split the shards.

Shawn,

Thank you for explaining why putting index data on the local file system could be a better idea than using HDFS. I need to find out how HDFS caches the index files in a resource-constrained environment.

I would also like to add that when I try the Streaming API instead of the cursor approach, it starts running into JSON parsing exceptions when my nodes (running Solr shards) don't have enough RAM to fit the entire index into memory. FYI: I have other services (YARN, Spark) deployed on the same boxes as well. Spark jobs also use a lot of disk cache.

When I have enough RAM (more than 70 GB, so that all the index data can fit in memory), the Streaming API succeeds without running into any exceptions.

How different is the index data caching mechanism for the Streaming API from the cursor approach?

Thanks!

On Fri, Dec 16, 2016 at 6:52 AM, Shawn Heisey wrote:
> On 12/14/2016 11:58 AM, Chetas Joshi wrote:
> > I am running Solr 5.5.0 on HDFS. It is a SolrCloud of 50 nodes and I have
> > the following config.
> > maxShardsPerNode: 1
> > replicationFactor: 1
> >
> > I have been ingesting data into Solr for the last 3 months. With increase
> > in data, I am observing increase in the query time. Currently the size of
> > my indices is 70 GB per shard (i.e. per node).
>
> Query times will increase as the index size increases, but significant
> jumps in the query time may be an indication of a performance problem.
> Performance problems are usually caused by insufficient resources,
> memory in particular.
>
> With HDFS, I am honestly not sure *where* the cache memory is needed. I
> would assume that it's needed on the HDFS hosts, that a lot of spare
> memory on the Solr (HDFS client) hosts probably won't make much
> difference. I could be wrong -- I have no idea what kind of caching
> HDFS does. If the HDFS client can cache data, then you probably would
> want extra memory on the Solr machines.
>
> > I am using cursor approach (/export handler) using SolrJ client to get back
> > results from Solr. All the fields I am querying on and all the fields that
> > I get back from Solr are indexed and have docValues enabled as well. What
> > could be the reason behind increase in query time?
>
> If actual disk access is required to satisfy a query, Solr is going to
> be slow. Caching is absolutely required for good performance. If your
> query times are really long but used to be short, chances are that your
> index size has exceeded your system's ability to cache it effectively.
>
> One thing to keep in mind: Gigabit Ethernet is comparable in speed to
> the sustained transfer rate of a single modern SATA magnetic disk, so if
> the data has to traverse a gigabit network, it probably will be nearly
> as slow as it would be if it were coming from a single disk. Having a
> 10gig network for your storage is probably a good idea ... but current
> fast memory chips can leave 10gig in the dust, so if the data can come
> from cache and the chips are new enough, then it can be faster than
> network storage.
>
> Because the network can be a potential bottleneck, I strongly recommend
> putting index data on local disks. If you have enough memory, the disk
> doesn't even need to be super-fast.
>
> > Has this got something to do with the OS disk cache that is used for
> > loading the Solr indices? When a query is fired, will Solr wait for all
> > (70GB) of disk cache being available so that it can load the index file?
>
> Caching the files on the disk is not handled by Solr, so Solr won't wait
> for the entire index to be cached unless the underlying storage waits
> for some reason. The caching is usually handled by the OS. For HDFS,
> it might be handled by a combination of the OS and Hadoop, but I don't
> know enough about HDFS to comment. Solr makes a request for the parts
> of the index files that it needs to satisfy the request. If the
> underlying system is capable of caching the data, if that feature is
> enabled, and if there's memory available for that purpose, then it gets
> cached.
>
> Thanks,
> Shawn
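For a rough sense of the network-versus-disk-versus-memory comparison quoted above, here is a back-of-envelope sketch of how long it would take to read one 70 GB shard end to end from each source. Only the 70 GB figure comes from this thread. The Ethernet numbers are theoretical line rates with no protocol overhead, and the disk and RAM throughputs are assumed ballpark values, not measurements.

```python
INDEX_GB = 70  # per-shard index size reported in this thread

# Throughput in MB/s. Ethernet entries are raw line rates (bits / 8);
# the disk and RAM entries are assumptions chosen only for illustration.
THROUGHPUT_MB_S = {
    "gigabit Ethernet (line rate)": 1000 / 8,
    "10-gigabit Ethernet (line rate)": 10_000 / 8,
    "single SATA magnetic disk (assumed)": 150,
    "RAM (assumed, conservative)": 10_000,
}


def seconds_to_read(size_gb, mb_per_s):
    """Time to stream size_gb gigabytes at a sustained rate of mb_per_s."""
    return size_gb * 1000 / mb_per_s


for name, rate in THROUGHPUT_MB_S.items():
    print(f"{name:38s} {seconds_to_read(INDEX_GB, rate):7.0f} s")
```

Under these assumptions the gigabit link and the single disk both land in the range of several minutes, while memory is in the range of seconds, which is the ordering described in the quoted message. Real queries rarely read the whole index, so treat this as an upper-bound illustration rather than a prediction of query time.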
Re: Solr on HDFS: increase in query time with increase in data
I think 70 GB is too large for a single shard. How much memory does the system have? In case Solr does not have sufficient memory to load the indexes, it will use only the amount of memory defined in your Solr caches.

Although you are on HDFS, Solr performance will be really bad if it has to do disk I/O at query time.

The best option for you is to shard it across at least 8-10 nodes and create appropriate replicas according to your read traffic.

Regards,
Piyush

On Fri, Dec 16, 2016 at 12:15 PM, Reth RM wrote:
> I think the shard index size is huge and should be split.
>
> On Wed, Dec 14, 2016 at 10:58 AM, Chetas Joshi wrote:
>
> > Hi everyone,
> >
> > I am running Solr 5.5.0 on HDFS. It is a SolrCloud of 50 nodes and I have
> > the following config.
> > maxShardsPerNode: 1
> > replicationFactor: 1
> >
> > I have been ingesting data into Solr for the last 3 months. With increase
> > in data, I am observing increase in the query time. Currently the size of
> > my indices is 70 GB per shard (i.e. per node).
> >
> > I am using cursor approach (/export handler) using SolrJ client to get back
> > results from Solr. All the fields I am querying on and all the fields that
> > I get back from Solr are indexed and have docValues enabled as well. What
> > could be the reason behind increase in query time?
> >
> > Has this got something to do with the OS disk cache that is used for
> > loading the Solr indices? When a query is fired, will Solr wait for all
> > (70GB) of disk cache being available so that it can load the index file?
> >
> > Thanks!
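As a rough sizing aid for the "split the shards and add nodes" advice, here is a small arithmetic sketch. The 50 nodes and 70 GB per shard come from the original post. The amount of memory left over for caching on each node is purely an assumed figure, and the sketch also assumes one shard per node, replicationFactor 1, and that the cache that matters lives on the Solr node, which this thread leaves open for HDFS.

```python
import math

NODES = 50                      # from the original post
INDEX_GB_PER_SHARD = 70         # from the original post
ASSUMED_CACHE_GB_PER_NODE = 32  # assumption: RAM left for caching index data


def nodes_needed(total_index_gb, cache_gb_per_node):
    """Smallest node count at which each node's share of the index fits in cache."""
    return math.ceil(total_index_gb / cache_gb_per_node)


total_index_gb = NODES * INDEX_GB_PER_SHARD
target_nodes = nodes_needed(total_index_gb, ASSUMED_CACHE_GB_PER_NODE)

print(f"total index size: {total_index_gb} GB")
print(f"nodes needed at {ASSUMED_CACHE_GB_PER_NODE} GB cache each: {target_nodes}")
print(f"index per node after the split: {total_index_gb / target_nodes:.1f} GB")
```

Replicas multiply the total: with replicationFactor 2, the same calculation would start from twice the index size.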
Re: Solr on HDFS: increase in query time with increase in data
I think the shard index size is huge and should be split.

On Wed, Dec 14, 2016 at 10:58 AM, Chetas Joshi wrote:
> Hi everyone,
>
> I am running Solr 5.5.0 on HDFS. It is a SolrCloud of 50 nodes and I have
> the following config.
> maxShardsPerNode: 1
> replicationFactor: 1
>
> I have been ingesting data into Solr for the last 3 months. With increase
> in data, I am observing increase in the query time. Currently the size of
> my indices is 70 GB per shard (i.e. per node).
>
> I am using cursor approach (/export handler) using SolrJ client to get back
> results from Solr. All the fields I am querying on and all the fields that
> I get back from Solr are indexed and have docValues enabled as well. What
> could be the reason behind increase in query time?
>
> Has this got something to do with the OS disk cache that is used for
> loading the Solr indices? When a query is fired, will Solr wait for all
> (70GB) of disk cache being available so that it can load the index file?
>
> Thanks!
Solr on HDFS: increase in query time with increase in data
Hi everyone,

I am running Solr 5.5.0 on HDFS. It is a SolrCloud of 50 nodes and I have the following config.

maxShardsPerNode: 1
replicationFactor: 1

I have been ingesting data into Solr for the last 3 months. With the increase in data, I am observing an increase in query time. Currently the size of my indices is 70 GB per shard (i.e. per node).

I am using the cursor approach (/export handler) with the SolrJ client to get results back from Solr. All the fields I am querying on and all the fields that I get back from Solr are indexed and have docValues enabled as well. What could be the reason behind the increase in query time?

Has this got something to do with the OS disk cache that is used for loading the Solr indices? When a query is fired, will Solr wait for all 70 GB of disk cache to be available so that it can load the index files?

Thanks!
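For readers unfamiliar with the cursor approach mentioned above, here is a minimal sketch of the paging loop. The post mentions both a cursor and the /export handler; this sketch covers only cursorMark-style paging. The `cursorMark=*` starting value and the `nextCursorMark` response field are Solr's. Everything else is hypothetical: `fetch_page` stands in for whatever client call issues the request (the poster uses SolrJ), and `make_fake_fetch` is an in-memory substitute so the loop can run without a Solr server.

```python
def iterate_with_cursor(fetch_page, rows=100):
    """Yield every document by following nextCursorMark until it stops changing."""
    cursor = "*"  # Solr's starting value for cursorMark
    while True:
        response = fetch_page(cursor, rows)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means no more results
            return
        cursor = next_cursor


def make_fake_fetch(docs):
    """Hypothetical stand-in for a Solr request, for demonstration only.

    It encodes the cursor as a plain offset; real Solr cursor marks are
    opaque strings and must not be interpreted by the client.
    """
    ordered = sorted(docs, key=lambda d: d["id"])  # cursors require a stable sort

    def fetch_page(cursor, rows):
        start = 0 if cursor == "*" else int(cursor)
        page = ordered[start:start + rows]
        next_cursor = str(start + len(page)) if page else cursor
        return {"response": {"docs": page}, "nextCursorMark": next_cursor}

    return fetch_page


docs = [{"id": f"doc{i:02d}"} for i in range(10)]
fetched = list(iterate_with_cursor(make_fake_fetch(docs), rows=3))
print(len(fetched), "documents fetched")
```

Each iteration is a separate request that touches whichever parts of the index the query and sort need, so a long cursor walk over a 70 GB shard will read a large share of the index files from wherever they are cached, or from storage if they are not.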