Re: persistent cache
On Tue, 2010-02-16 at 10:35 +0100, Tim Terlegård wrote:
> I actually tried SSD yesterday. Queries which need to go to disk are
> much faster now. I did expect that warmup for sort fields would be
> much quicker as well, but that seems to be cpu bound.

That and bulk I/O. The sorter imports the terms into RAM by iterating over them, which means that the I/O access for this is sequential. Most modern SSDs are faster than conventional harddisks for this, but not by much.

> It still takes a minute to cache the six sort fields of the 40 million
> document index.

I am not aware of any solutions to this, besides beefing up the hardware: bulk reads and processor speed (the sorter is not threaded, as far as I remember). It is technically possible to move this step to the indexer, but the only win would be for setups with few builders and many searchers.

> Are there any differences among SSD disks. Why is Intel X25-M your favourite?

A soft reason is that I have faith in support from Intel: there have been problems with earlier versions of the drive (nuking content in some edge cases, and performance degradation, which hits all SSDs) and Intel has responded well by acknowledging the problems and resolving them. That's very subjective though, and I'm sure that some would turn that around and say that Intel delivered crap in the first place.

On the harder side, the Intel drive is surprisingly cheap and provides random I/O performance ahead of most competitors, especially for random writes, which is normally the weak point for SSDs. Some graphs can be found at Anandtech: http://anandtech.com/storage/showdoc.aspx?i=3631&p=22 Anandtech is, by the way, a very fine starting point on SSDs, as they go into details that too many reviewers skip over.

To be truthful here, standard index building and searching with Lucene requires three things from the I/O system: bulk writes, bulk reads (mainly for sorting) and random reads.
The Intel drive is not stellar for bulk writes, and being superior for random writes does not make a difference for Lucene/Solr if we're only talking search: pick whatever SSD you can get your hands on. They are all fine for random reads, and the CPU will probably be the bottleneck. However, random write speed is a bonus that might show indirectly: untarring a million small files, updating a database and similar tasks are often part of the workflow with search.

Back in 2007 we were fortunate enough to get a test machine with 2 types of SSD, 2 10,000 RPM harddisks and 2 15,000 RPM harddisks. Some quick notes can be found at http://wiki.statsbiblioteket.dk/summa/Hardware The world has moved on since then, but that has only widened the gap between SSDs and harddisks.

Regards,
Toke Eskildsen
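For the sort-field warmup discussed in this thread, one sorted query per field is enough to force each field's values to be loaded. A minimal sketch that builds such warmup requests against a Solr /select endpoint (the base URL and field names are illustrative assumptions, not from the thread's actual setup):

```python
from urllib.parse import urlencode

def sort_warmup_urls(base_url, sort_fields, q="*:*", rows=0):
    """One sorted query per sort field: executing these once after
    startup triggers the (CPU-bound) loading of each field's values.
    base_url and field names are illustrative assumptions."""
    urls = []
    for field in sort_fields:
        params = urlencode({"q": q, "sort": f"{field} asc", "rows": rows})
        urls.append(f"{base_url}/select?{params}")
    return urls
```

Fetching each URL once (e.g. with curl or urllib) right after startup would replace ad-hoc manual warmup queries; rows=0 keeps the responses small since only the side effect matters.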
Re: persistent cache
On a related note: maybe it'd be good to have a wiki page of experiences, and possibly stats, for various SSD drives? Either on the Lucene or Solr wiki sites?

2010/2/16 Tim Terlegård :
> 2010/2/15 Toke Eskildsen :
>> From: Tim Terlegård [tim.terleg...@gmail.com]
>>> If the index size is more than you can have in RAM, do you recommend
>>> to split the index to several servers so it can all be in RAM?
>>>
>>> I do expect phrase queries. Total index size is 107 GB. *prx files are
>>> total 65GB and *frq files 38GB. It's probably worth buying more RAM.
>>
>> Have you considered throwing one or more SSD's at the problem? Intel
>> X25-M G2 (or X25-E if you're dictated by your organization to buy
>> enterprise level) is my personal favorite right now. They are, compared
>> to RAM or even high-end spinning harddrives, often quite cost-effective.
>
> I actually tried SSD yesterday. Queries which need to go to disk are
> much faster now. I did expect that warmup for sort fields would be
> much quicker as well, but that seems to be cpu bound. It still takes a
> minute to cache the six sort fields of the 40 million document index.
> But I'm happy the queries are faster with SSD.
>
> Are there any differences among SSD disks. Why is Intel X25-M your favourite?
>
> /Tim
Re: persistent cache
2010/2/15 Toke Eskildsen :
> From: Tim Terlegård [tim.terleg...@gmail.com]
>> If the index size is more than you can have in RAM, do you recommend
>> to split the index to several servers so it can all be in RAM?
>>
>> I do expect phrase queries. Total index size is 107 GB. *prx files are
>> total 65GB and *frq files 38GB. It's probably worth buying more RAM.
>
> Have you considered throwing one or more SSD's at the problem? Intel
> X25-M G2 (or X25-E if you're dictated by your organization to buy
> enterprise level) is my personal favorite right now. They are, compared
> to RAM or even high-end spinning harddrives, often quite cost-effective.

I actually tried SSD yesterday. Queries which need to go to disk are much faster now. I did expect that warmup for sort fields would be much quicker as well, but that seems to be CPU bound. It still takes a minute to cache the six sort fields of the 40 million document index. But I'm happy the queries are faster with SSD.

Are there any differences among SSD disks? Why is Intel X25-M your favourite?

/Tim
Re: persistent cache
Hi Tim,

Due to our performance needs, we optimize the index early in the morning and then run the cache-warming queries once we mount the optimized index on our servers. If you are indexing and serving using the same Solr instance, you shouldn't have to re-run the cache-warming queries when you add documents. I believe that the disk writes caused by adding the documents to the index should put that data in the OS cache.

Actually, 1600 queries are not a lot of queries. If you are using actual user queries from your logs, you may need more. We used some tools based on Luke to analyze our index and determine which words would most benefit from being in the OS cache (assuming users entered a phrase query containing those words). You can experiment to see how many queries you need to fill memory by emptying the OS cache, then sending queries and using top to watch memory usage.

Your options (assuming performance with the current hardware does not meet your needs) are using SSDs, increasing memory on the machine, or splitting the index using Solr shards. If you either increase memory on the machine or split the index, you will still have to run cache-warming queries.

One other thing you might consider is to use stop words or CommonGrams to reduce disk I/O requirements for phrase queries containing common words. (Our experiments with CommonGrams and cache-warming are described in our blog: http://www.hathitrust.org/blogs/large-scale-search )

Tom

> Hi Tom,
>
> 1600 warming queries, that's quite many. Do you run them every time a
> document is added to the index? Do you have any tips on warming?
>
> If the index size is more than you can have in RAM, do you recommend
> to split the index to several servers so it can all be in RAM?
>
> I do expect phrase queries. Total index size is 107 GB. *prx files are
> total 65GB and *frq files 38GB. It's probably worth buying more RAM.
> /Tim

--
View this message in context: http://old.nabble.com/persistent-cache-tp27562126p27598026.html
Sent from the Solr - User mailing list archive at Nabble.com.
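Tom's suggestion of watching memory with top while sending warming queries can also be scripted: on Linux, the page-cache size appears as the `Cached:` line in /proc/meminfo. A small helper for sampling it between batches of warming queries (a sketch; Linux-only, and only a rough proxy for what top shows):

```python
def cached_kb(meminfo_text):
    """Return the kB of page cache reported by /proc/meminfo.
    Call with open('/proc/meminfo').read() between warming batches
    to watch the OS cache fill up as index data is read."""
    for line in meminfo_text.splitlines():
        if line.startswith("Cached:"):
            return int(line.split()[1])
    raise ValueError("no 'Cached:' line found")
```

When the reported value stops growing while queries are still being sent, the cache is full and further warming queries only evict earlier data, which is the tradeoff Tom describes.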
RE: persistent cache
From: Tim Terlegård [tim.terleg...@gmail.com]
> If the index size is more than you can have in RAM, do you recommend
> to split the index to several servers so it can all be in RAM?
>
> I do expect phrase queries. Total index size is 107 GB. *prx files are
> total 65GB and *frq files 38GB. It's probably worth buying more RAM.

Have you considered throwing one or more SSDs at the problem? Intel X25-M G2 (or X25-E if you're dictated by your organization to buy enterprise level) is my personal favorite right now. They are, compared to RAM or even high-end spinning harddrives, often quite cost-effective.

Most SSDs have a random access time for reads of about 0.1 ms. For us that meant that we moved the bottleneck for a 70GB index (10 million documents) from I/O to CPU on a quad-core machine. We tried testing SSD vs. RAMDirectory and found it to perform at about 75% speed for a 14GB subset of the index.

- Toke Eskildsen - http://statsbiblioteket.dk
Re: persistent cache
Hi Tom,

1600 warming queries, that's quite many. Do you run them every time a document is added to the index? Do you have any tips on warming?

If the index size is more than you can have in RAM, do you recommend to split the index to several servers so it can all be in RAM?

I do expect phrase queries. Total index size is 107 GB. *prx files are total 65GB and *frq files 38GB. It's probably worth buying more RAM.

/Tim

2010/2/12 Tom Burton-West :
>
> Hi Tim,
>
> We generally run about 1600 cache-warming queries to warm up the OS disk
> cache and the Solr caches when we mount a new index.
>
> Do you have/expect phrase queries? If you don't, then you don't need to
> get any position information into your OS disk cache. Our position
> information takes about 85% of the total index size (*prx files). So with a
> 100GB index, your *frq files might only be 15-20GB and you could probably
> get more than half of that in 16GB of memory.
>
> If you have limited memory and a large index, then you need to choose cache
> warming queries carefully as once the cache is full, further queries will
> start evicting older data from the cache. The tradeoff is to populate the
> cache with data that would require the most disk access if the data was not
> in the cache versus populating the cache based on your best guess of what
> queries your users will execute. A good overview of the issues is the paper
> by Baeza-Yates ( http://doi.acm.org/10.1145/1277741.125 The Impact of
> Caching on Search Engines )
>
> Tom Burton-West
> Digital Library Production Service
> University of Michigan Library
> --
> View this message in context:
> http://old.nabble.com/persistent-cache-tp27562126p27567840.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: persistent cache
Hi Tim,

We generally run about 1600 cache-warming queries to warm up the OS disk cache and the Solr caches when we mount a new index.

Do you have/expect phrase queries? If you don't, then you don't need to get any position information into your OS disk cache. Our position information takes about 85% of the total index size (*prx files). So with a 100GB index, your *frq files might only be 15-20GB and you could probably get more than half of that in 16GB of memory.

If you have limited memory and a large index, then you need to choose cache-warming queries carefully, as once the cache is full, further queries will start evicting older data from the cache. The tradeoff is to populate the cache with data that would require the most disk access if the data was not in the cache, versus populating the cache based on your best guess of what queries your users will execute. A good overview of the issues is the paper by Baeza-Yates ( http://doi.acm.org/10.1145/1277741.125 The Impact of Caching on Search Engines )

Tom Burton-West
Digital Library Production Service
University of Michigan Library

--
View this message in context: http://old.nabble.com/persistent-cache-tp27562126p27567840.html
Sent from the Solr - User mailing list archive at Nabble.com.
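When the warming queries are known up front, Solr can run them automatically whenever a searcher opens, instead of having a script fire them externally. A minimal solrconfig.xml sketch using the QuerySenderListener; the query text and sort field here are illustrative, borrowed from examples elsewhere in this thread:

```xml
<!-- runs once when the first searcher is created at startup -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- illustrative warming queries: a common term, plus a sorted
         query so the sort field gets loaded as well -->
    <lst><str name="q">hockey</str></lst>
    <lst><str name="q">hockey</str><str name="sort">date asc</str></lst>
  </arr>
</listener>
```

A similar `newSearcher` listener covers the case after commits; note these warm the Solr caches and sort fields, while the OS disk cache still depends on how much index data those queries actually touch.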
Re: persistent cache
One solution is to add a persistent cache with memcached at the application layer.

--
Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com

On 2/12/10 5:19 AM, Tim Terlegård wrote:
> 2010/2/12 Shalin Shekhar Mangar:
>> 2010/2/12 Tim Terlegård
>>> Does Solr use some sort of a persistent cache?
>>
>> Solr does not have a persistent cache. That is the operating system's file
>> cache at work.
>
> Aha, that's very interesting and seems to make sense. So is the primary
> goal of warmup queries to allow the operating system to cache all the
> files in the data/index directory? Because I think the difference
> (768ms vs 52ms) is pretty big.
>
> I just do one warmup query and get 52 ms response on a 40 million
> documents index. I think that's pretty nice performance without
> tinkering with the caches at all. The only tinkering that seems to be
> needed is this operating system file caching.
>
> What's the best way to make sure that my warmup queries have cached all
> the files? And does a file cache have the complete file in memory? I
> guess it can get tough to get my 100GB index into the 16GB memory.
>
> /Tim
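The application-layer idea can be sketched as a thin wrapper that keys cached results on the query parameters. In this sketch a plain dict stands in for the real memcached client (an assumption made to keep it self-contained); a memcached client exposing get/set would slot in the same way:

```python
import hashlib
import json

class QueryCache:
    """Application-layer result cache for search queries.

    A plain dict stands in for a real memcached client here (an
    assumption for the sketch); unlike the OS file cache, entries
    survive as long as the cache process does, not the searcher."""

    def __init__(self, backend=None):
        self.backend = {} if backend is None else backend

    def _key(self, q, **params):
        # stable key: hash of the sorted query parameters
        raw = json.dumps({"q": q, **params}, sort_keys=True)
        return hashlib.md5(raw.encode("utf-8")).hexdigest()

    def get_or_fetch(self, q, fetch, **params):
        # return a cached result, or call fetch(q, **params) and cache it
        key = self._key(q, **params)
        hit = self.backend.get(key)
        if hit is not None:
            return hit
        result = fetch(q, **params)
        self.backend[key] = result  # a memcached client would use .set()
        return result
```

The usual caveat applies: an application-layer cache returns stale results after index updates unless it is invalidated on commit.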
Re: persistent cache
2010/2/12 Shalin Shekhar Mangar :
> 2010/2/12 Tim Terlegård
>
>> Does Solr use some sort of a persistent cache?
>>
> Solr does not have a persistent cache. That is the operating system's file
> cache at work.

Aha, that's very interesting and seems to make sense. So is the primary goal of warmup queries to allow the operating system to cache all the files in the data/index directory? Because I think the difference (768 ms vs 52 ms) is pretty big.

I just do one warmup query and get 52 ms response on a 40 million document index. I think that's pretty nice performance without tinkering with the caches at all. The only tinkering that seems to be needed is this operating system file caching.

What's the best way to make sure that my warmup queries have cached all the files? And does a file cache have the complete file in memory? I guess it can get tough to get my 100GB index into the 16GB memory.

/Tim
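One blunt way to get index files into the OS file cache, independent of query choice, is simply to read them sequentially, the equivalent of `cat data/index/* > /dev/null`. A sketch (the extension list is an illustrative subset of Lucene file types, not an exhaustive or authoritative one, and this only helps while the files actually fit in free RAM):

```python
import pathlib

def warm_index(index_dir, extensions=(".frq", ".prx", ".tis", ".tii")):
    """Sequentially read selected index files so the OS page cache
    holds them. Returns the number of bytes read. The extension list
    is an illustrative subset of Lucene file types (an assumption);
    e.g. skip *prx if no phrase queries are expected."""
    total = 0
    for path in sorted(pathlib.Path(index_dir).iterdir()):
        if not path.is_file() or path.suffix not in extensions:
            continue
        with open(path, "rb") as f:
            while True:
                chunk = f.read(1 << 20)  # 1 MB sequential reads
                if not chunk:
                    break
                total += len(chunk)
    return total
```

This mirrors the advice earlier in the thread: with a 100GB index and 16GB of RAM, it pays to warm only the files a workload needs (e.g. *frq but not *prx when phrase queries are absent), since the cache cannot hold everything.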
Re: persistent cache
2010/2/12 Tim Terlegård

> Does Solr use some sort of a persistent cache?
>
> I do this 10 times in a loop:
> * start solr
> * create a core
> * execute warmup query
> * execute query with sort fields
> * stop solr
>
> Executing the query with sort fields takes 5-20 times longer the first
> iteration than the other 9 iterations. For instance I have a query
> 'hockey' with one date sort field. That takes 768 ms in the first
> iteration of the loop. The next 9 iterations the query takes 52 ms.
> The solr and jetty server really stops in each iteration so the RAM
> must be emptied. So the only way I can think of why this happens is
> because there is some persistent cache that survives the solr
> restarts. Is this the case? Or why could this be?

Solr does not have a persistent cache. That is the operating system's file cache at work.

--
Regards,
Shalin Shekhar Mangar.
persistent cache
Does Solr use some sort of a persistent cache?

I do this 10 times in a loop:
* start solr
* create a core
* execute warmup query
* execute query with sort fields
* stop solr

Executing the query with sort fields takes 5-20 times longer in the first iteration than in the other 9 iterations. For instance, I have a query 'hockey' with one date sort field. That takes 768 ms in the first iteration of the loop. The next 9 iterations the query takes 52 ms. The Solr and Jetty server really stop in each iteration, so the RAM must be emptied. So the only way I can think of why this happens is that there is some persistent cache that survives the Solr restarts. Is this the case? Or why could this be?

/Tim
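The per-iteration timings described above can be collected with a small wall-clock helper wrapped around whatever issues the query (a sketch; how the query itself is sent, HTTP client or otherwise, is left to the harness):

```python
import time

def timed(run_query):
    """Wall-clock a single call; returns (result, elapsed milliseconds).
    Wrap whatever issues the query: an HTTP request, a client library
    call, etc. (the caller supplies that part)."""
    t0 = time.perf_counter()
    result = run_query()
    return result, (time.perf_counter() - t0) * 1000.0
```

Comparing the first measurement against later ones, as in the 768 ms vs 52 ms numbers above, is exactly what exposes a cold versus warm OS file cache.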