On 10/24/2012 6:29 PM, Aaron Daubman wrote:
Let me be clear that I am not interested in RAMDirectory.
However, I would like to better understand the oft-recommended and
currently-default MMapDirectory, and what the tradeoffs would be, on a
64-bit Linux server dedicated to this single Solr instance with plenty
of RAM (more than 2x the index size), between storing the index files
on SSDs and storing them on a ramfs mount.
I understand that using the default MMapDirectory will allow caching
of the index in memory; however, my understanding is that mmapped files
are demand-paged (lazily loaded), meaning that only after a block is
read from disk will it be paged into memory - is this correct? Is it
actually block-by-block (page size by page size)? Any pointers to
decent documentation on this, regardless of the effectiveness of the
approach, would be appreciated...
You are correct that the data must have been recently accessed to be in
the disk cache. This does, however, include writes -- so any data that
gets indexed will be in the cache because it has just been written. I
do believe that it is read in one page-sized block at a time, and I
believe that the pages are 4K in size.
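To illustrate the mechanism, here is a minimal plain-Java sketch using
the standard FileChannel/MappedByteBuffer APIs (this is not Solr or
Lucene code, and the file path is hypothetical). Mapping a file only
reserves virtual address space; a page is read from disk into the OS
page cache the first time the mapped bytes are actually touched:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical path; substitute one of your own index files.
        Path file = Paths.get("/path/to/index/_0.fdt");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single map() call is limited to 2GB, so cap the size
            // for this demo (Lucene maps large files in chunks).
            long size = Math.min(channel.size(), Integer.MAX_VALUE);
            // Mapping reserves address space only; nothing is read yet.
            MappedByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            // Touching a byte triggers a page fault, and the kernel
            // pages that block (typically 4K) into the OS page cache
            // on demand.
            byte first = buf.get(0);
            System.out.println("First byte: " + first);
        }
    }
}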
My concern with using MMapDirectory for an index stored on disk (even
on SSDs), if my understanding is correct, is that there is still a
large startup cost: it may take many queries before even most of a 20G
index has been loaded into memory, and there may still be "dark
corners" that only come up in edge-case queries, causing QTime spikes
should those queries ever occur.
I would like to ensure that, at startup, no query will incur
disk-seek/read penalties.
Is the "right" way to achieve this to copy the index to a ramfs (NOT
ramdisk) mount and then continue to use MMapDirectory in Solr to read
the index? I am under the impression that when using ramfs (rather
than ramdisk, for which this would not work), a file mmapped on a ramfs
mount will actually share the same address space, and so would not
incur the typical double-RAM overhead of mmapping a file in memory just
to have yet another copy of the file created in a second memory
location. Is this correct? If not, would you please point me to
documentation stating otherwise? (I haven't found much documentation
either way.)
I am not familiar with any "double-RAM overhead" from using mmap. It
should be extraordinarily efficient, so much so that even when your
index won't fit in RAM, performance is typically still excellent. Using
an SSD instead of a spinning disk will increase performance across the
board, until enough of the index is cached in RAM, after which it won't
make a lot of difference.
My parting thoughts, with a general note to the masses: Do not try this
if you are not absolutely sure your index will fit in memory! It will
tend to cause WAY more problems than it will solve for most people with
large indexes.
If you actually do have considerably more RAM than your index size, and
you know that the index will never grow to where it might not fit, you
can use a simple trick to get it all cached, even before running
queries. Just read the entire contents of the index, discarding
everything you read. There are two main OS variants to consider here,
and both can be scripted, as noted below; a rough Java equivalent is
sketched after the commands. Run the command twice to see the
difference that caching makes for the second run. Note that an SSD
would speed up the first run of these commands considerably:
*NIX (may work on a Mac too):
cat /path/to/index/files/* > /dev/null
Windows:
type C:\Path\To\Index\Files\* > NUL
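
For completeness, here is a rough Java equivalent of the same trick (a
sketch only -- the class name and index path are hypothetical, and it
is not part of Solr). It streams every file in the index directory and
discards the bytes, so the only effect is that the OS page cache ends
up holding the index:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WarmIndexCache {
    public static void main(String[] args) throws IOException {
        // Hypothetical location; point this at your core's data/index dir.
        Path indexDir = Paths.get("/path/to/index/files");
        byte[] scratch = new byte[1 << 20]; // 1MB buffer, contents discarded
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                // Reading the file pulls its pages into the OS page cache.
                try (InputStream in = Files.newInputStream(file)) {
                    while (in.read(scratch) != -1) {
                        // Discard everything; we only want the caching side effect.
                    }
                }
            }
        }
    }
}

As with the shell commands, a second run should finish almost
instantly once the index is fully cached.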
Thanks,
Shawn