Re: persistent cache

2010-02-17 Thread Toke Eskildsen
On Tue, 2010-02-16 at 10:35 +0100, Tim Terlegård wrote:
> I actually tried SSD yesterday. Queries which need to go to disk are
> much faster now. I did expect that warmup for sort fields would be
> much quicker as well, but that seems to be cpu bound.

That and bulk I/O. The sorter imports the Terms into RAM by iterating,
which means that the I/O access for this is sequential. Most modern SSDs
are faster than conventional harddisks for this, but not by much.

> It still takes a minute to cache the six sort fields of the 40 million 
> document index.

I am not aware of any solutions to this, besides beefing up hardware for
bulk reads and processor speed (the sorter is not threaded, as far as I
remember). It is technically possible to move this step to the indexer,
but the only win would be for setups with few builders and many
searchers.

> Are there any differences among SSD disks? Why is Intel X25-M your favourite?

A soft reason is that I have faith in support from Intel: There have been
problems with earlier versions of the drive (nuking content in some
edge cases, and performance degradation (which hits all SSDs)), and Intel
has responded well by acknowledging the problems and resolving them.
That's very subjective though, and I'm sure that some would turn that
around and say that Intel delivered crap in the first place.

On the harder side, the Intel drive is surprisingly cheap and provides
random I/O performance ahead of most competitors, especially for random
writes, which are normally the weak point for SSDs. Some graphs can be
found at Anandtech:
http://anandtech.com/storage/showdoc.aspx?i=3631&p=22
Anandtech is, BTW, a very fine starting point on SSDs, as they go into
details that too many reviewers skip over.

To be truthful here, standard index building and searching with Lucene
requires three things from the I/O system: bulk writes, bulk reads
(mainly for sorting) and random reads. The Intel drive is not stellar
for bulk writes, and being superior for random writes does not make a
difference for Lucene/Solr. If we're only talking search: Pick whatever
SSD you can get your hands on. They are all fine for random reads, and
the CPU will probably be the bottleneck.

However, random write speed is a bonus that might show indirectly:
Untarring a million small files, updating a database and similar tasks
are often part of the workflow around search.


Back in 2007 we were fortunate enough to get a test-machine with 2 types
of SSD, 2 10,000 RPM harddisks and 2 15,000 RPM harddisks. Some quick
notes can be found at http://wiki.statsbiblioteket.dk/summa/Hardware

The world has moved on since then, but that has only widened the gap
between SSDs and harddisks.

Regards,
Toke Eskildsen



Re: persistent cache

2010-02-16 Thread Jason Rutherglen
On a related note: maybe it'd be good to have a wiki page of
experiences and possibly stats for various SSD drives? Either on
the Lucene or Solr wiki sites?

2010/2/16 Tim Terlegård :
> 2010/2/15 Toke Eskildsen :
>> From: Tim Terlegård [tim.terleg...@gmail.com]
>>> If the index size is more than you can have in RAM, do you recommend
>>> to split the index to several servers so it can all be in RAM?
>>>
>>> I do expect phrase queries. Total index size is 107 GB. *prx files are
>>> total 65GB and *frq files 38GB. It's probably worth buying more RAM.
>>
>> Have you considered throwing one or more SSD's at the problem? Intel
>> X25-M G2 (or X25-E if you're dictated by your organization to buy
>> enterprise level) is my personal favorite right now. They are, compared
>> to RAM or even high-end spinning harddrives, often quite cost-effective.
>
> I actually tried SSD yesterday. Queries which need to go to disk are
> much faster now. I did expect that warmup for sort fields would be
> much quicker as well, but that seems to be cpu bound. It still takes a
> minute to cache the six sort fields of the 40 million document index.
> But I'm happy the queries are faster with SSD.
>
> Are there any differences among SSD disks? Why is Intel X25-M your favourite?
>
> /Tim
>


Re: persistent cache

2010-02-16 Thread Tim Terlegård
2010/2/15 Toke Eskildsen :
> From: Tim Terlegård [tim.terleg...@gmail.com]
>> If the index size is more than you can have in RAM, do you recommend
>> to split the index to several servers so it can all be in RAM?
>>
>> I do expect phrase queries. Total index size is 107 GB. *prx files are
>> total 65GB and *frq files 38GB. It's probably worth buying more RAM.
>
> Have you considered throwing one or more SSD's at the problem? Intel
> X25-M G2 (or X25-E if you're dictated by your organization to buy
> enterprise level) is my personal favorite right now. They are, compared
> to RAM or even high-end spinning harddrives, often quite cost-effective.

I actually tried SSD yesterday. Queries which need to go to disk are
much faster now. I did expect that warmup for sort fields would be
much quicker as well, but that seems to be cpu bound. It still takes a
minute to cache the six sort fields of the 40 million document index.
But I'm happy the queries are faster with SSD.

Are there any differences among SSD disks? Why is Intel X25-M your favourite?

/Tim


Re: persistent cache

2010-02-15 Thread Tom Burton-West

Hi Tim,

Due to our performance needs, we optimize the index early in the morning and
then run the cache-warming queries once we mount the optimized index on our
servers. If you are indexing and serving using the same Solr instance, you
shouldn't have to re-run the cache-warming queries when you add documents:
I believe the disk writes caused by adding the documents to the index
should put that data in the OS cache. Actually, 1600 queries is not a lot.
If you are using actual user queries from your logs you may need more.
We used some tools based on Luke to analyze our index and determine which
words would benefit most from being in the OS cache (assuming users enter
a phrase query containing those words). You can experiment to see how many
queries you need to fill memory by emptying the OS cache, then sending
queries and using top to watch memory usage.
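(For reference, and not something specific to our setup: in Solr, cache-warming queries can be wired into solrconfig.xml as firstSearcher/newSearcher listeners, so they run automatically whenever a searcher is opened. A minimal sketch; the "hockey" query and date sort below are placeholders borrowed from elsewhere in this thread, not real warming queries.)

```xml
<!-- solrconfig.xml: run warming queries whenever a new searcher opens.
     Replace the placeholder queries with real ones from your logs. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">hockey</str><str name="sort">date desc</str></lst>
  </arr>
</listener>
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">hockey</str><str name="sort">date desc</str></lst>
  </arr>
</listener>
```

The firstSearcher listener covers a cold start; newSearcher re-warms after each commit so users never hit an entirely cold searcher.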

Your options (assuming performance with current hardware does not meet your
needs) are using SSDs, increasing memory on the machine, or splitting the
index using Solr shards. If you either increase memory on the machine or
split the index, you will still have to run cache-warming queries.

One other thing you might consider is using stop words or CommonGrams to
reduce disk I/O requirements for phrase queries containing common words.
(Our experiments with CommonGrams and cache-warming are described in our
blog: http://www.hathitrust.org/blogs/large-scale-search )
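(To make the CommonGrams idea concrete, here is a hedged, minimal sketch of the token-stream transform; this is an illustration, not Solr's actual CommonGramsFilter code. Common words get glued to their neighbors, so a phrase query can match one rare combined token instead of scanning the huge postings list for "the" or "in".)

```python
def common_grams(tokens, common_words):
    """Sketch of a CommonGrams-style transform: alongside each unigram,
    emit a bigram whenever either member of a pair is a common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and (tok in common_words or nxt in common_words):
            out.append(tok + "_" + nxt)
    return out

grams = common_grams(["the", "rain", "in", "spain"], {"the", "in"})
# grams: ['the', 'the_rain', 'rain', 'rain_in', 'in', 'in_spain', 'spain']
```

A phrase query for "the rain" can then be answered from the much rarer "the_rain" token, which is where the disk I/O saving comes from.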

Tom






-- 
View this message in context: 
http://old.nabble.com/persistent-cache-tp27562126p27598026.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: persistent cache

2010-02-15 Thread Toke Eskildsen
From: Tim Terlegård [tim.terleg...@gmail.com]
> If the index size is more than you can have in RAM, do you recommend
> to split the index to several servers so it can all be in RAM?
>
> I do expect phrase queries. Total index size is 107 GB. *prx files are
> total 65GB and *frq files 38GB. It's probably worth buying more RAM.

Have you considered throwing one or more SSD's at the problem? Intel X25-M G2 
(or X25-E if you're dictated by your organization to buy enterprise level) is 
my personal favorite right now. They are, compared to RAM or even high-end 
spinning harddrives, often quite cost-effective. Most SSD's has random access 
time for reads at about 0.1ms. For us that meant that we moved the bottleneck 
for a 70GB index (10 million documents) from IO to CPU on a quad-core machine. 
We tried testing SSD vs. RAMDirectory and found it to perform at about 75% 
speed for a 14GB subset of the index.

- Toke Eskildsen - http://statsbiblioteket.dk

Re: persistent cache

2010-02-15 Thread Tim Terlegård
Hi Tom,

1600 warming queries, that's quite many. Do you run them every time a
document is added to the index? Do you have any tips on warming?

If the index size is more than you can have in RAM, do you recommend
to split the index to several servers so it can all be in RAM?

I do expect phrase queries. Total index size is 107 GB. *prx files are
total 65GB and *frq files 38GB. It's probably worth buying more RAM.

/Tim

2010/2/12 Tom Burton-West :
>
> Hi Tim,
>
> We generally run about 1600 cache-warming queries to warm up the OS disk
> cache and the Solr caches when we mount a new index.
>
> Do you have/expect phrase queries?   If you don't, then you don't need to
> get any position information into your OS disk cache.  Our position
> information takes about 85% of the total index size (*prx files).  So with a
> 100GB index, your *frq files might only be 15-20GB and you could probably
> get more than half of that in 16GB of memory.
>
> If you have limited memory and a large index, then you need to choose cache
> warming queries carefully as once the cache is full, further queries will
> start evicting older data from the cache.  The tradeoff is to populate the
> cache with data that would require the most disk access if the data was not
> in the cache versus populating the cache based on your best guess of what
> queries your users will execute.  A good overview of the issues is the paper
> by Baeza-Yates ( http://doi.acm.org/10.1145/1277741.125 The Impact of
> Caching on Search Engines )
>
>
> Tom Burton-West
> Digital Library Production Service
> University of Michigan Library
>


Re: persistent cache

2010-02-12 Thread Tom Burton-West

Hi Tim,

We generally run about 1600 cache-warming queries to warm up the OS disk
cache and the Solr caches when we mount a new index.

Do you have/expect phrase queries?   If you don't, then you don't need to
get any position information into your OS disk cache.  Our position
information takes about 85% of the total index size (*prx files).  So with a
100GB index, your *frq files might only be 15-20GB and you could probably
get more than half of that in 16GB of memory.
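(A hedged back-of-the-envelope check, plugging in the sizes Tim quotes in this thread: 107GB total, 65GB of *prx, 38GB of *frq, and a 16GB machine.)

```python
# Cache math using the index sizes quoted in this thread.
total_gb = 107.0   # whole index
prx_gb = 65.0      # position data (*prx), only needed for phrase queries
frq_gb = 38.0      # frequency data (*frq)
ram_gb = 16.0      # machine RAM

prx_share = prx_gb / total_gb               # ~0.61: share of index that is positions
frq_cached = min(ram_gb, frq_gb) / frq_gb   # ~0.42: share of *frq that fits in RAM
```

So for that particular index, positions are roughly 61% of the total rather than 85%, and somewhat under half of the *frq data could be resident in 16GB, before accounting for the OS and the JVM heap.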

If you have limited memory and a large index, then you need to choose cache
warming queries carefully as once the cache is full, further queries will
start evicting older data from the cache.  The tradeoff is to populate the
cache with data that would require the most disk access if the data was not
in the cache versus populating the cache based on your best guess of what
queries your users will execute.  A good overview of the issues is the paper
by Baeza-Yates ( http://doi.acm.org/10.1145/1277741.125 The Impact of
Caching on Search Engines )


Tom Burton-West
Digital Library Production Service
University of Michigan Library
-- 
View this message in context: 
http://old.nabble.com/persistent-cache-tp27562126p27567840.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: persistent cache

2010-02-12 Thread Tommy Chheng
One solution is to add a persistent cache with memcache at the
application layer.


--
Tommy Chheng

Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com



On 2/12/10 5:19 AM, Tim Terlegård wrote:

> 2010/2/12 Shalin Shekhar Mangar:
>> 2010/2/12 Tim Terlegård
>>
>>> Does Solr use some sort of a persistent cache?
>>>
>> Solr does not have a persistent cache. That is the operating system's file
>> cache at work.
>
> Aha, that's very interesting and seems to make sense.
>
> So is the primary goal of warmup queries to allow the operating system
> to cache all the files in the data/index directory? Because I think
> the difference (768ms vs 52ms) is pretty big. I just do one warmup
> query and get 52 ms response on a 40 million documents index. I think
> that's pretty nice performance without tinkering with the caches at
> all. The only tinkering that seems to be needed is this operating
> system file caching. What's the best way to make sure that my warmup
> queries have cached all the files? And does a file cache have the
> complete file in memory? I guess it can get tough to get my 100GB
> index into the 16GB memory.
>
> /Tim






Re: persistent cache

2010-02-12 Thread Tim Terlegård
2010/2/12 Shalin Shekhar Mangar :
> 2010/2/12 Tim Terlegård 
>
>> Does Solr use some sort of a persistent cache?
>>
> Solr does not have a persistent cache. That is the operating system's file
> cache at work.

Aha, that's very interesting and seems to make sense.

So is the primary goal of warmup queries to allow the operating system
to cache all the files in the data/index directory? Because I think
the difference (768ms vs 52ms) is pretty big. I just do one warmup
query and get 52 ms response on a 40 million documents index. I think
that's pretty nice performance without tinkering with the caches at
all. The only tinkering that seems to be needed is this operating
system file caching. What's the best way to make sure that my warmup
queries have cached all the files? And does a file cache have the
complete file in memory? I guess it can get tough to get my 100GB
index into the 16GB memory.

/Tim


Re: persistent cache

2010-02-12 Thread Shalin Shekhar Mangar
2010/2/12 Tim Terlegård 

> Does Solr use some sort of a persistent cache?
>
> I do this 10 times in a loop:
>  * start solr
>  * create a core
>  * execute warmup query
>  * execute query with sort fields
>  * stop solr
>
> Executing the query with sort fields takes 5-20 times longer in the first
> iteration than in the other 9 iterations. For instance, I have a query
> 'hockey' with one date sort field. That takes 768 ms in the first
> iteration of the loop. In the next 9 iterations the query takes 52 ms.
> The Solr and Jetty server really does stop in each iteration, so the RAM
> must be emptied. So the only explanation I can think of is that there is
> some persistent cache that survives the Solr restarts. Is this the case?
> Or why could this be?
>
>
Solr does not have a persistent cache. That is the operating system's file
cache at work.
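(As an illustration of what "the operating system's file cache at work" amounts to, here is a hedged sketch, not anything from Solr itself: reading every index file once pulls its pages into the OS cache, which is all a brute-force warmup does. The file names and sizes below are made up for the demo.)

```python
import os
import tempfile

def warm_directory(path):
    """Read every regular file under `path` once, 1 MB at a time.
    Reading pulls the pages into the OS file cache (given free RAM).
    Returns the total number of bytes read."""
    total = 0
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isfile(full):
            with open(full, "rb") as f:
                while True:
                    chunk = f.read(1 << 20)
                    if not chunk:
                        break
                    total += len(chunk)
    return total

# Tiny demo on a throwaway directory standing in for Solr's data/index:
demo = tempfile.mkdtemp()
for fname, size in (("_0.frq", 4096), ("_0.prx", 8192)):
    with open(os.path.join(demo, fname), "wb") as f:
        f.write(b"\0" * size)

warmed = warm_directory(demo)  # 12288 bytes read
```

Whether the pages stay resident depends on memory pressure: a 100GB index cannot all fit in 16GB, so the OS evicts the least-recently-used pages first.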

-- 
Regards,
Shalin Shekhar Mangar.


persistent cache

2010-02-12 Thread Tim Terlegård
Does Solr use some sort of a persistent cache?

I do this 10 times in a loop:
  * start solr
  * create a core
  * execute warmup query
  * execute query with sort fields
  * stop solr

Executing the query with sort fields takes 5-20 times longer in the first
iteration than in the other 9 iterations. For instance, I have a query
'hockey' with one date sort field. That takes 768 ms in the first
iteration of the loop. In the next 9 iterations the query takes 52 ms.
The Solr and Jetty server really does stop in each iteration, so the RAM
must be emptied. So the only explanation I can think of is that there is
some persistent cache that survives the Solr restarts. Is this the case?
Or why could this be?

/Tim