Do you know if there are any publicly available benchmarks on disk_access_mode, preferably after the fix from CASSANDRA-10249?
If it turns out that syscall I/O is not significantly slower, I'd consider
switching. If I don't know the costs, I think I'd prefer to stick with the
devil I know how to mitigate (i.e. by policing my block devices) rather than
switching to the devil that is non-standard and undocumented. :)

I may have time to do some benchmarking myself. If so, I'll be sure to
inform the list.

Josh

On Sun, Oct 9, 2016 at 2:39 AM, Benedict Elliott Smith <bened...@apache.org> wrote:
> The biggest problem with pread was the issue of over-reading (reading 64k
> where 4k would suffice), which was significantly improved in 2.2 iirc. I
> don't think the penalty is very significant anymore, and if you are
> experiencing time-to-safepoint issues it's very likely a worthwhile switch
> to flip.
>
>
> On Sunday, 9 October 2016, Graham Sanderson <gra...@vast.com> wrote:
>>
>> I was using the term “touch” loosely, hopefully to mean pre-fetch; I
>> suspect you can still issue a sensible prefetch instruction in native
>> code (though I think Intel has been de-emphasizing it). Even if not, you
>> are still better off blocking in JNI code - I haven’t looked at the link
>> to see if the correct barriers are enforced by the sun.misc.Unsafe
>> method.
>>
>> I do suspect that you’ll see up to about 5-10% syscall overhead if you
>> hit pread.
>>
>> > On Oct 8, 2016, at 11:02 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>> >
>> > Hi,
>> >
>> > This is starting to get into dev list territory.
>> >
>> > Interesting idea to touch every 4K page you are going to read.
>> >
>> > You could use this to minimize the cost.
>> >
>> > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
>> >
>> > Maybe faster than doing buffered IO. It's a lot of cache and TLB
>> > misses without prefetching though.
>> >
>> > There is a system call to page the memory in, which might be better
>> > for larger reads. Still no guarantee things stay cached though.
>> >
>> > Ariel
>> >
>> >
>> > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>> >> I haven’t studied the read path that carefully, but there might be a
>> >> spot at the C* level rather than the JVM level where you could
>> >> effectively do a JNI touch of the mmap region you’re going to need
>> >> next.
>> >>
>> >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <gra...@vast.com> wrote:
>> >>>
>> >>> We don’t use Azul’s Zing, but it does have the nice feature that all
>> >>> threads don’t have to reach safepoints at the same time. That said,
>> >>> we make heavy use of Cassandra (with off-heap memtables - not
>> >>> directly related, but allows us a lot more GC headroom) and SOLR,
>> >>> where we switched to mmap because it FAR outperformed pread variants
>> >>> - in no case have we noticed long times to safepoint (then again our
>> >>> IO is lightning fast).
>> >>>
>> >>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <j...@jonhaddad.com>
>> >>>> wrote:
>> >>>>
>> >>>> Linux automatically uses free memory as cache. It's not swap.
>> >>>>
>> >>>> http://www.tldp.org/LDP/lki/lki-4.html
>> >>>>
>> >>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin
>> >>>> <vla...@winguzone.com> wrote:
>> >>>>> Sorry, I'm not catching something. What page (memory) cache can
>> >>>>> exist if there is no swap file? Where are those pages written/read?
>> >>>>>
>> >>>>> Best regards, Vladimir Yudovin,
>> >>>>> Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
>> >>>>> Cassandra on Azure and SoftLayer.
>> >>>>> Launch your cluster in minutes.
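A minimal sketch of the "touch every 4K page" idea from Ariel's and Graham's
messages above, in plain Java NIO (file path taken from argv; the 4K page
size is an assumption). Note the catch they are discussing: done from pure
Java like this, the faults still happen in JITed code, so to actually help
time to safepoint the same loop would have to run behind JNI or the Unsafe
trick linked above. This only illustrates the access pattern:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public final class TouchPages {
        private static final int PAGE_SIZE = 4096; // assumed page size

        /** Read one byte per page so any page faults happen up front. */
        static long touch(MappedByteBuffer buf) {
            long sink = 0;
            for (int i = 0; i < buf.capacity(); i += PAGE_SIZE) {
                sink += buf.get(i); // may fault: pulls the page into the page cache
            }
            return sink; // returned so the JIT can't elide the reads
        }

        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
                                                   StandardOpenOption.READ)) {
                // a single map() is limited to 2 GB; real code would map chunks
                MappedByteBuffer map =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                System.out.println(touch(map));
            }
        }
    }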
>> >>>>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg
>> >>>>> <ar...@weisberg.ws> wrote ----
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> Nope, I mean page cache. Linux doesn't call the cache it
>> >>>>>> maintains using free memory a file cache. It uses free (and some
>> >>>>>> of the time not so free!) memory to buffer writes and to cache
>> >>>>>> recently written/read data.
>> >>>>>>
>> >>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>> >>>>>>
>> >>>>>> When Linux decides it needs free memory it can either evict stuff
>> >>>>>> from the page cache, flush dirty pages and then evict, or swap
>> >>>>>> anonymous memory out. When you disable swap you only disable the
>> >>>>>> last behavior.
>> >>>>>>
>> >>>>>> Maybe we are talking at cross purposes? What I meant is that
>> >>>>>> increasing the heap size to reduce GC frequency is a legitimate
>> >>>>>> thing to do, and it does have an impact on the performance of the
>> >>>>>> page cache even if you have swap disabled.
>> >>>>>>
>> >>>>>> Ariel
>> >>>>>>
>> >>>>>>
>> >>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>> >>>>>>>> Page cache is data pending flush to disk and data cached from
>> >>>>>>>> disk.
>> >>>>>>>
>> >>>>>>> Do you mean file cache?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Best regards, Vladimir Yudovin,
>> >>>>>>> Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
>> >>>>>>> Cassandra on Azure and SoftLayer.
>> >>>>>>> Launch your cluster in minutes.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg
>> >>>>>>> <ar...@weisberg.ws> wrote ----
>> >>>>>>>> Hi,
>> >>>>>>>>
>> >>>>>>>> Page cache is in use even if you disable swap. Swap is anonymous
>> >>>>>>>> memory, and whatever else the Linux kernel supports paging out.
>> >>>>>>>> Page cache is data pending flush to disk and data cached from
>> >>>>>>>> disk.
>> >>>>>>>>
>> >>>>>>>> Given how bad the GC pauses are in C*, I don't think this is the
>> >>>>>>>> high pole in the tent until key things are off heap and C* can
>> >>>>>>>> run with CMS and get 10 millisecond GCs all day long.
>> >>>>>>>>
>> >>>>>>>> You can go through tuning and hardware selection to try to get
>> >>>>>>>> more consistent IO pauses and remove outliers, as you mention,
>> >>>>>>>> and as a user I think this is your best bet. Generally it's
>> >>>>>>>> either bad device or filesystem behavior if you get page faults
>> >>>>>>>> taking more than 200 milliseconds, i.e. on the order of a G1 GC
>> >>>>>>>> collection.
>> >>>>>>>>
>> >>>>>>>> I think a JVM change to allow safepoints around memory-mapped
>> >>>>>>>> file access is really unlikely, although I agree it would be
>> >>>>>>>> great. I think the best hack around it is to code up your
>> >>>>>>>> memory-mapped file access into JNI methods and find some way to
>> >>>>>>>> get that to work. Right now if you want to create a safepoint, a
>> >>>>>>>> JNI method is the way to do it. The problem is that JNI methods
>> >>>>>>>> and POJOs don't get along well.
>> >>>>>>>>
>> >>>>>>>> If you think about it, the reason non-memory-mapped IO works
>> >>>>>>>> well is that it's all JNI methods, so they don't impact time to
>> >>>>>>>> safepoint. I think there is a tradeoff between tolerance for
>> >>>>>>>> outliers and performance.
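To make the two read paths Ariel contrasts concrete, a small sketch in
standard Java NIO (nothing Cassandra-specific; file path from argv, file
assumed non-empty and under 2 GB):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public final class TwoReadPaths {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
                                                   StandardOpenOption.READ)) {
                // pread()-style: FileChannel.read(dst, position) is a JNI
                // method. If the device is slow, the thread blocks inside the
                // syscall, and the JVM counts threads in native code as
                // already "safe", so a safepoint can proceed around it.
                ByteBuffer buf = ByteBuffer.allocate(4096);
                ch.read(buf, 0);

                // mmap-style: get() is an ordinary JITed memory access. A
                // page cache miss becomes a page fault inside Java code, and
                // a safepoint requested meanwhile waits on that fault.
                MappedByteBuffer map =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                System.out.println(map.get(0));
            }
        }
    }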
>> >>>>>>>> I don't know the state of the non-memory-mapped path and how
>> >>>>>>>> reliable that is. If it were reliable and I couldn't tolerate
>> >>>>>>>> the outliers, I would use that. I have to ask, though: why are
>> >>>>>>>> you not able to tolerate the outliers? If you are reading and
>> >>>>>>>> writing at quorum, how is this impacting you?
>> >>>>>>>>
>> >>>>>>>> Regards,
>> >>>>>>>> Ariel
>> >>>>>>>>
>> >>>>>>>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>> >>>>>>>>> Hi Josh,
>> >>>>>>>>>
>> >>>>>>>>>> Running with increased heap size would reduce GC frequency,
>> >>>>>>>>>> at the cost of page cache.
>> >>>>>>>>>
>> >>>>>>>>> Actually it's recommended to run C* without swap enabled, so
>> >>>>>>>>> if there is not enough memory the JVM fails instead of
>> >>>>>>>>> blocking.
>> >>>>>>>>>
>> >>>>>>>>> Best regards, Vladimir Yudovin,
>> >>>>>>>>> Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
>> >>>>>>>>> Cassandra on Azure and SoftLayer.
>> >>>>>>>>> Launch your cluster in minutes.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder
>> >>>>>>>>> <j...@code406.com> wrote ----
>> >>>>>>>>>> Hello cassandra-users,
>> >>>>>>>>>>
>> >>>>>>>>>> I'm investigating an issue with JVMs taking a while to reach
>> >>>>>>>>>> a safepoint. I'd like the list's input on confirming my
>> >>>>>>>>>> hypothesis and finding mitigations.
>> >>>>>>>>>>
>> >>>>>>>>>> My hypothesis is that slow block devices are causing
>> >>>>>>>>>> Cassandra's JVM to pause completely while attempting to reach
>> >>>>>>>>>> a safepoint.
>> >>>>>>>>>>
>> >>>>>>>>>> Background:
>> >>>>>>>>>>
>> >>>>>>>>>> Hotspot occasionally performs maintenance tasks that
>> >>>>>>>>>> necessitate stopping all of its threads. Threads running
>> >>>>>>>>>> JITed code periodically read from a given safepoint page. If
>> >>>>>>>>>> Hotspot has initiated a safepoint, reading from that page
>> >>>>>>>>>> essentially catapults the thread into purgatory until the
>> >>>>>>>>>> safepoint completes (the mechanism behind this is pretty
>> >>>>>>>>>> cool). Threads performing syscalls or executing native code
>> >>>>>>>>>> do this check upon their return into the JVM.
>> >>>>>>>>>>
>> >>>>>>>>>> In this way, during the safepoint Hotspot can be sure that
>> >>>>>>>>>> all of its threads are either patiently waiting for safepoint
>> >>>>>>>>>> completion or in a system call.
>> >>>>>>>>>>
>> >>>>>>>>>> Cassandra makes heavy use of mmapped reads in normal
>> >>>>>>>>>> operation. When doing mmapped reads, the JVM executes
>> >>>>>>>>>> userspace code to effect a read from a file. On the fast path
>> >>>>>>>>>> (when the page needed is already mapped into the process),
>> >>>>>>>>>> this instruction is very fast. When the page is not cached,
>> >>>>>>>>>> the CPU triggers a page fault and asks the OS to go fetch the
>> >>>>>>>>>> page. The JVM doesn't even realize that anything interesting
>> >>>>>>>>>> is happening: to it, the thread is just executing a mov
>> >>>>>>>>>> instruction that happens to take a while.
>> >>>>>>>>>>
>> >>>>>>>>>> The OS, meanwhile, puts the thread in question in the D state
>> >>>>>>>>>> (assuming Linux, here) and goes off to find the desired page.
>> >>>>>>>>>> This may take microseconds, this may take milliseconds, or it
>> >>>>>>>>>> may take seconds (or longer). When I/O occurs while the JVM
>> >>>>>>>>>> is trying to enter a safepoint, every thread has to wait for
>> >>>>>>>>>> the laggard I/O to complete.
>> >>>>>>>>>> If you log safepoints with the right options [1], you can
>> >>>>>>>>>> see these occurrences in the JVM output:
>> >>>>>>>>>>
>> >>>>>>>>>>> # SafepointSynchronize::begin: Timeout detected:
>> >>>>>>>>>>> # SafepointSynchronize::begin: Timed out while spinning to
>> >>>>>>>>>>> reach a safepoint.
>> >>>>>>>>>>> # SafepointSynchronize::begin: Threads which did not reach
>> >>>>>>>>>>> the safepoint:
>> >>>>>>>>>>> # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0
>> >>>>>>>>>>> tid=0x00007f8785bb1f30 nid=0x4e14 runnable
>> >>>>>>>>>>> [0x0000000000000000]
>> >>>>>>>>>>> java.lang.Thread.State: RUNNABLE
>> >>>>>>>>>>>
>> >>>>>>>>>>> # SafepointSynchronize::begin: (End of list)
>> >>>>>>>>>>> vmop  [threads: total initially_running wait_to_block]
>> >>>>>>>>>>>       [time: spin block sync cleanup vmop]  page_trap_count
>> >>>>>>>>>>> 58099.941: G1IncCollectionPause
>> >>>>>>>>>>>       [ 447 1 1 ]  [ 3304 0 3305 1 190 ]  1
>> >>>>>>>>>>
>> >>>>>>>>>> If that safepoint happens to be a garbage collection (which
>> >>>>>>>>>> this one was), you can also see it in GC logs:
>> >>>>>>>>>>
>> >>>>>>>>>>> 2016-10-07T13:19:50.029+0000: 58103.440: Total time for
>> >>>>>>>>>>> which application threads were stopped: 3.4971808 seconds,
>> >>>>>>>>>>> Stopping threads took: 3.3050644 seconds
>> >>>>>>>>>>
>> >>>>>>>>>> In this way, JVM safepoints become a powerful weapon for
>> >>>>>>>>>> transmuting a single thread's slow I/O into the entire JVM's
>> >>>>>>>>>> lockup.
>> >>>>>>>>>>
>> >>>>>>>>>> Does all of the above sound correct?
>> >>>>>>>>>>
>> >>>>>>>>>> Mitigations:
>> >>>>>>>>>>
>> >>>>>>>>>> 1) don't tolerate block devices that are slow
>> >>>>>>>>>>
>> >>>>>>>>>> This is easy in theory, and only somewhat difficult in
>> >>>>>>>>>> practice. Tools like perf and iosnoop [2] can do a pretty
>> >>>>>>>>>> good job of letting you know when a block device is slow.
>> >>>>>>>>>>
>> >>>>>>>>>> It is sad, though, because this makes running Cassandra on
>> >>>>>>>>>> mixed hardware (e.g. fast SSD and slow disks in a JBOD) quite
>> >>>>>>>>>> unappetizing.
>> >>>>>>>>>>
>> >>>>>>>>>> 2) have fewer safepoints
>> >>>>>>>>>>
>> >>>>>>>>>> Two of the biggest sources of safepoints are garbage
>> >>>>>>>>>> collection and revocation of biased locks. Evidence points
>> >>>>>>>>>> toward biased locking being unhelpful for Cassandra's
>> >>>>>>>>>> purposes, so turning it off (-XX:-UseBiasedLocking) is a
>> >>>>>>>>>> quick way to eliminate one source of safepoints (see the
>> >>>>>>>>>> snippet below).
>> >>>>>>>>>>
>> >>>>>>>>>> Garbage collection, on the other hand, is unavoidable.
>> >>>>>>>>>> Running with increased heap size would reduce GC frequency,
>> >>>>>>>>>> at the cost of page cache. But sacrificing page cache would
>> >>>>>>>>>> increase page fault frequency, which is another thing we're
>> >>>>>>>>>> trying to avoid! I don't view this as a serious option.
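For reference, the flags from mitigation #2 and footnote [1] in one place,
assuming the usual cassandra-env.sh mechanism for passing JVM options
(adjust for your install):

    # cassandra-env.sh - assumed location for JVM flags; varies by setup
    JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"   # drop biased-lock revocation safepoints
    # safepoint logging from [1]: report safepoints taking >100 ms to reach
    JVM_OPTS="$JVM_OPTS -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100"
    JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"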
>> >>>>>>>>>> 3) use a different IO strategy
>> >>>>>>>>>>
>> >>>>>>>>>> Looking at the Cassandra source code, there appears to be an
>> >>>>>>>>>> un(der)documented configuration parameter called
>> >>>>>>>>>> disk_access_mode. It appears that changing this to 'standard'
>> >>>>>>>>>> would switch to using pread() and pwrite() for I/O, instead
>> >>>>>>>>>> of mmap. I imagine there would be a throughput penalty here
>> >>>>>>>>>> for the case when pages are in the disk cache.
>> >>>>>>>>>>
>> >>>>>>>>>> Is this a serious option? It seems far too underdocumented to
>> >>>>>>>>>> be thought of as a contender.
>> >>>>>>>>>>
>> >>>>>>>>>> 4) modify the JVM
>> >>>>>>>>>>
>> >>>>>>>>>> This is a longer-term option. For the purposes of safepoints,
>> >>>>>>>>>> perhaps the JVM could treat reads from an mmapped file in the
>> >>>>>>>>>> same way it treats threads that are running JNI code. That
>> >>>>>>>>>> is, the safepoint would proceed even though the reading
>> >>>>>>>>>> thread has not "joined in". Upon finishing its mmapped read,
>> >>>>>>>>>> the reading thread would test the safepoint page (check
>> >>>>>>>>>> whether a safepoint is in progress, in other words).
>> >>>>>>>>>>
>> >>>>>>>>>> Conclusion:
>> >>>>>>>>>>
>> >>>>>>>>>> I don't imagine there's an easy solution here. I plan to go
>> >>>>>>>>>> ahead with mitigation #1: "don't tolerate block devices that
>> >>>>>>>>>> are slow", but I'd appreciate any approach that doesn't
>> >>>>>>>>>> require my hardware to be flawless all the time.
>> >>>>>>>>>>
>> >>>>>>>>>> Josh
>> >>>>>>>>>>
>> >>>>>>>>>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100
>> >>>>>>>>>> -XX:+PrintSafepointStatistics
>> >>>>>>>>>> -XX:PrintSafepointStatisticsCount=1
>> >>>>>>>>>> [2]
>> >>>>>>>>>> https://github.com/brendangregg/perf-tools/blob/master/iosnoop
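For anyone who wants to experiment with mitigation #3, the setting goes in
cassandra.yaml. Since it is undocumented, the value names below are
best-effort recollection; verify them against DatabaseDescriptor in the
source for your version:

    # cassandra.yaml
    # Believed-accepted values: auto, mmap, mmap_index_only, standard.
    # 'standard' reads via pread() instead of mmap; 'mmap_index_only'
    # memory-maps only index files.
    disk_access_mode: standard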