I was using the term “touch” loosely to mean pre-fetch. I suspect you can 
still issue a sensible prefetch instruction in native code, though I think 
Intel has been de-emphasizing them. Even if not, you are still better off 
blocking in JNI code - I haven’t looked at the link to see whether the 
correct barriers are enforced by the sun.misc.Unsafe method.

I do suspect that you’ll see roughly 5-10% syscall overhead if you go 
through pread.

> On Oct 8, 2016, at 11:02 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
> 
> Hi,
> 
> This is starting to get into dev list territory.
> 
> Interesting idea to touch every 4K page you are going to read.
> 
> You could use this to minimize the cost.
> http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
> 
> Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
> without prefetching though.
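
Something like the following untested sketch is presumably what you mean by 
touching each 4K page ahead of the read - the names here are all mine, and it 
assumes the region is already mapped via FileChannel:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class PageToucher {
    private static final int PAGE_SIZE = 4096; // assuming 4K pages

    // Touch one byte per page so the kernel faults the range in before
    // the real read path dereferences it; may block on page faults.
    static int touch(MappedByteBuffer buf, int offset, int length) {
        int sink = 0;
        int end = Math.min(offset + length, buf.limit());
        for (int pos = offset; pos < end; pos += PAGE_SIZE)
            sink += buf.get(pos);
        return sink; // returned so the JIT cannot dead-code the loop
    }

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY,
                                          0, Math.min(ch.size(), Integer.MAX_VALUE));
            touch(buf, 0, buf.limit());
        }
    }
}
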
> 
> There is a system call to page the memory in, which might be better for
> larger reads. Still no guarantee things stay cached though.
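
For what it’s worth, the closest pure-Java analogue I know of is 
MappedByteBuffer.load(), which hints the OS to bring the whole mapping into 
physical memory (a madvise-style hint on Linux, as far as I know - and as you 
say, no residency guarantee). Reusing ch from the sketch above:

MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
buf.load();                        // hint the OS to page the whole mapping in now
boolean resident = buf.isLoaded(); // best effort; pages can be evicted again
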
> 
> Ariel
> 
> 
> On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>> I haven’t studied the read path that carefully, but there might be a spot at 
>> the C* level rather than JVM level where you could effectively do a JNI 
>> touch of the mmap region you’re going to need next.
>> 
>>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <gra...@vast.com> wrote:
>>> 
>>> We don’t use Azul’s Zing, but it does have the nice feature that all 
>>> threads don’t have to reach safepoints at the same time. That said we make 
>>> heavy use of Cassandra (with off heap memtables - not directly related but 
>>> allows us a lot more GC headroom) and SOLR where we switched to mmap 
>>> because it FAR outperformed pread variants - in no case have we noticed 
>>> long times to safepoint (then again our IO is lightning fast).
>>> 
>>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>> 
>>>> Linux automatically uses free memory as cache.  It's not swap.
>>>> 
>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>> 
>>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vla...@winguzone.com> 
>>>> wrote:
>>>>> Sorry, I don't quite follow. What page (memory) cache can exist if 
>>>>> there is no swap file? Where are those pages written/read?
>>>>> 
>>>>> 
>>>>> Best regards, Vladimir Yudovin, 
>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra on 
>>>>> Azure and SoftLayer.
>>>>> Launch your cluster in minutes.*
>>>>> 
>>>>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 *Ariel 
>>>>> Weisberg<ar...@weisberg.ws>* wrote ---- 
>>>>>> Hi,
>>>>>> 
>>>>>> Nope I mean page cache. Linux doesn't call the cache it maintains using 
>>>>>> free memory a file cache. It uses free (and some of the time not so 
>>>>>> free!) memory to buffer writes and to cache recently written/read data.
>>>>>> 
>>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>>> 
>>>>>> When Linux decides it needs free memory it can either evict stuff from 
>>>>>> the page cache, flush dirty pages and then evict, or swap anonymous 
>>>>>> memory out. When you disable swap you only disable the last behavior.
>>>>>> 
>>>>>> Maybe we are talking at cross purposes? What I meant is that increasing 
>>>>>> the heap size to reduce GC frequency is a legitimate thing to do, and it 
>>>>>> does have an impact on the performance of the page cache even if you 
>>>>>> have swap disabled.
>>>>>> 
>>>>>> Ariel
>>>>>> 
>>>>>> 
>>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>>>>>>> Page cache is data pending flush to disk and data cached from disk.
>>>>>>> 
>>>>>>> Do you mean file cache?
>>>>>>> 
>>>>>>> 
>>>>>>> Best regards, Vladimir Yudovin, 
>>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra 
>>>>>>> on Azure and SoftLayer.
>>>>>>> Launch your cluster in minutes.*
>>>>>>> 
>>>>>>> 
>>>>>>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 *Ariel Weisberg 
>>>>>>> <ar...@weisberg.ws>* wrote ---- 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Page cache is in use even if you disable swap. Swap is anonymous 
>>>>>>>> memory, and whatever else the Linux kernel supports paging out. Page 
>>>>>>>> cache is data pending flush to disk and data cached from disk.
>>>>>>>> 
>>>>>>>> Given how bad the GC pauses are in C*, I don't think this is the high 
>>>>>>>> pole in the tent - not until key things are off heap and C* can run 
>>>>>>>> with CMS and get 10 millisecond GCs all day long.
>>>>>>>> 
>>>>>>>> You can go through tuning and hardware selection to try to get more 
>>>>>>>> consistent IO pauses and remove outliers as you mention, and as a user 
>>>>>>>> I think this is your best bet. Generally it's either bad device or 
>>>>>>>> filesystem behavior if you get page faults taking more than 200 
>>>>>>>> milliseconds, i.e. on the order of a G1 GC collection.
>>>>>>>> 
>>>>>>>> I think a JVM change to allow safe points around memory mapped file 
>>>>>>>> access is really unlikely although I agree it would be great. I think 
>>>>>>>> the best hack around it is to code up your memory mapped file access 
>>>>>>>> into JNI methods and find some way to get that to work. Right now, if 
>>>>>>>> you want to create a safepoint, a JNI method is the way to do it. The 
>>>>>>>> problem is that JNI methods and POJOs don't get along well.
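
The Java half of that hack might look like the sketch below - everything here 
is invented for illustration; the C side would simply memcpy from the mapped 
address into the array:

public final class NativeMmapReader {
    static { System.loadLibrary("nativemmapread"); } // hypothetical library

    // While a thread is inside a native method, Hotspot treats it as
    // already safepoint-safe, so a stop-the-world pause can proceed even
    // if this call blocks on a page fault.
    public static native int read(long mmapAddress, long offset,
                                  byte[] dst, int dstOffset, int length);
}
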
>>>>>>>> 
>>>>>>>> If you think about it, the reason non-memory-mapped IO works well is 
>>>>>>>> that it's all JNI methods, so they don't impact time to safepoint. I 
>>>>>>>> think there is a tradeoff between tolerance for outliers and 
>>>>>>>> performance.
>>>>>>>> 
>>>>>>>> I don't know the state of the non-memory mapped path and how reliable 
>>>>>>>> that is. If it were reliable and I couldn't tolerate the outliers I 
>>>>>>>> would use that. I have to ask though, why are you not able to tolerate 
>>>>>>>> the outliers? If you are reading and writing at quorum how is this 
>>>>>>>> impacting you?
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Ariel
>>>>>>>> 
>>>>>>>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>>>>>>>>> Hi Josh,
>>>>>>>>> 
>>>>>>>>>> Running with increased heap size would reduce GC frequency, at the 
>>>>>>>>>> cost of page cache.
>>>>>>>>> 
>>>>>>>>> Actually it's recommended to run C* without swap enabled, so if there 
>>>>>>>>> is not enough memory the JVM fails instead of blocking.
>>>>>>>>> 
>>>>>>>>> Best regards, Vladimir Yudovin, 
>>>>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra 
>>>>>>>>> on Azure and SoftLayer.
>>>>>>>>> Launch your cluster in minutes.*
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 *Josh 
>>>>>>>>> Snyder<j...@code406.com>* wrote ---- 
>>>>>>>>>> Hello cassandra-users, 
>>>>>>>>>> 
>>>>>>>>>> I'm investigating an issue with JVMs taking a while to reach a 
>>>>>>>>>> safepoint.  I'd 
>>>>>>>>>> like the list's input on confirming my hypothesis and finding 
>>>>>>>>>> mitigations. 
>>>>>>>>>> 
>>>>>>>>>> My hypothesis is that slow block devices are causing Cassandra's JVM 
>>>>>>>>>> to pause 
>>>>>>>>>> completely while attempting to reach a safepoint. 
>>>>>>>>>> 
>>>>>>>>>> Background: 
>>>>>>>>>> 
>>>>>>>>>> Hotspot occasionally performs maintenance tasks that necessitate 
>>>>>>>>>> stopping all 
>>>>>>>>>> of its threads. Threads running JITed code occasionally read from a 
>>>>>>>>>> given 
>>>>>>>>>> safepoint page. If Hotspot has initiated a safepoint, reading from 
>>>>>>>>>> that page 
>>>>>>>>>> essentially catapults the thread into purgatory until the safepoint 
>>>>>>>>>> completes 
>>>>>>>>>> (the mechanism behind this is pretty cool). Threads performing 
>>>>>>>>>> syscalls or 
>>>>>>>>>> executing native code do this check upon their return into the JVM. 
>>>>>>>>>> 
>>>>>>>>>> In this way, during the safepoint Hotspot can be sure that all of 
>>>>>>>>>> its threads 
>>>>>>>>>> are either patiently waiting for safepoint completion or in a system 
>>>>>>>>>> call. 
>>>>>>>>>> 
>>>>>>>>>> Cassandra makes heavy use of mmapped reads in normal operation. When 
>>>>>>>>>> doing 
>>>>>>>>>> mmapped reads, the JVM executes userspace code to effect a read from 
>>>>>>>>>> a file. On 
>>>>>>>>>> the fast path (when the page needed is already mapped into the 
>>>>>>>>>> process), this 
>>>>>>>>>> instruction is very fast. When the page is not cached, the CPU 
>>>>>>>>>> triggers a page 
>>>>>>>>>> fault and asks the OS to go fetch the page. The JVM doesn't even 
>>>>>>>>>> realize that 
>>>>>>>>>> anything interesting is happening: to it, the thread is just 
>>>>>>>>>> executing a mov 
>>>>>>>>>> instruction that happens to take a while. 
>>>>>>>>>> 
>>>>>>>>>> The OS, meanwhile, puts the thread in question in the D state 
>>>>>>>>>> (assuming Linux, 
>>>>>>>>>> here) and goes off to find the desired page. This may take 
>>>>>>>>>> microseconds, this 
>>>>>>>>>> may take milliseconds, or it may take seconds (or longer). When I/O 
>>>>>>>>>> occurs 
>>>>>>>>>> while the JVM is trying to enter a safepoint, every thread has to 
>>>>>>>>>> wait for the 
>>>>>>>>>> laggard I/O to complete. 
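
A toy reproducer of the effect (my own untested sketch; assumes a file under 
2 GB that is not in the page cache, e.g. after `echo 3 > 
/proc/sys/vm/drop_caches`): one thread walks a mapped file while the main 
thread forces safepoints with System.gc():

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class SafepointStall {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

            Thread reader = new Thread(() -> {
                long sink = 0;
                for (int pos = 0; pos < buf.limit(); pos += 4096)
                    sink += buf.get(pos); // the page fault happens inside this read
                System.out.println("checksum " + sink);
            });
            reader.start();

            while (reader.isAlive()) { // each gc() drags all threads to a safepoint
                long t0 = System.nanoTime();
                System.gc();
                System.out.printf("gc + safepoint: %d ms%n",
                                  (System.nanoTime() - t0) / 1_000_000);
                Thread.sleep(1000);
            }
        }
    }
}

Running it with the safepoint flags from [1] makes the time-to-safepoint 
visible directly.
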
>>>>>>>>>> 
>>>>>>>>>> If you log safepoints with the right options [1], you can see these 
>>>>>>>>>> occurrences 
>>>>>>>>>> in the JVM output: 
>>>>>>>>>> 
>>>>>>>>>>> # SafepointSynchronize::begin: Timeout detected: 
>>>>>>>>>>> # SafepointSynchronize::begin: Timed out while spinning to reach a 
>>>>>>>>>>> safepoint. 
>>>>>>>>>>> # SafepointSynchronize::begin: Threads which did not reach the 
>>>>>>>>>>> safepoint: 
>>>>>>>>>>> # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 
>>>>>>>>>>> tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000] 
>>>>>>>>>>>   java.lang.Thread.State: RUNNABLE 
>>>>>>>>>>> 
>>>>>>>>>>> # SafepointSynchronize::begin: (End of list) 
>>>>>>>>>>>         vmop                    [threads: total initially_running 
>>>>>>>>>>> wait_to_block]    [time: spin block sync cleanup vmop] 
>>>>>>>>>>> page_trap_count 
>>>>>>>>>>> 58099.941: G1IncCollectionPause             [     447          1    
>>>>>>>>>>>           1    ]      [  3304     0  3305     1   190    ]  1 
>>>>>>>>>> 
>>>>>>>>>> If that safepoint happens to be a garbage collection (which this one 
>>>>>>>>>> was), you 
>>>>>>>>>> can also see it in GC logs: 
>>>>>>>>>> 
>>>>>>>>>>> 2016-10-07T13:19:50.029+0000: 58103.440: Total time for which 
>>>>>>>>>>> application threads were stopped: 3.4971808 seconds, Stopping 
>>>>>>>>>>> threads took: 3.3050644 seconds 
>>>>>>>>>> 
>>>>>>>>>> In this way, JVM safepoints become a powerful weapon for transmuting 
>>>>>>>>>> a single 
>>>>>>>>>> thread's slow I/O into the entire JVM's lockup. 
>>>>>>>>>> 
>>>>>>>>>> Does all of the above sound correct? 
>>>>>>>>>> 
>>>>>>>>>> Mitigations: 
>>>>>>>>>> 
>>>>>>>>>> 1) don't tolerate block devices that are slow 
>>>>>>>>>> 
>>>>>>>>>> This is easy in theory, and only somewhat difficult in practice. 
>>>>>>>>>> Tools like 
>>>>>>>>>> perf and iosnoop [2] can do pretty good jobs of letting you know 
>>>>>>>>>> when a block 
>>>>>>>>>> device is slow. 
>>>>>>>>>> 
>>>>>>>>>> It is sad, though, because this makes running Cassandra on mixed 
>>>>>>>>>> hardware (e.g. 
>>>>>>>>>> fast SSD and slow disks in a JBOD) quite unappetizing. 
>>>>>>>>>> 
>>>>>>>>>> 2) have fewer safepoints 
>>>>>>>>>> 
>>>>>>>>>> Two of the biggest sources of safepoints are garbage collection and 
>>>>>>>>>> revocation 
>>>>>>>>>> of biased locks. Evidence points toward biased locking being 
>>>>>>>>>> unhelpful for 
>>>>>>>>>> Cassandra's purposes, so turning it off (-XX:-UseBiasedLocking) is a 
>>>>>>>>>> quick way 
>>>>>>>>>> to eliminate one source of safepoints. 
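
In cassandra-env.sh terms (assuming the stock script, which builds up 
JVM_OPTS), that is a one-liner:

JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
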
>>>>>>>>>> 
>>>>>>>>>> Garbage collection, on the other hand, is unavoidable. Running with 
>>>>>>>>>> increased 
>>>>>>>>>> heap size would reduce GC frequency, at the cost of page cache. But 
>>>>>>>>>> sacrificing 
>>>>>>>>>> page cache would increase page fault frequency, which is another 
>>>>>>>>>> thing we're 
>>>>>>>>>> trying to avoid! I don't view this as a serious option. 
>>>>>>>>>> 
>>>>>>>>>> 3) use a different IO strategy 
>>>>>>>>>> 
>>>>>>>>>> Looking at the Cassandra source code, there appears to be an 
>>>>>>>>>> un(der)documented 
>>>>>>>>>> configuration parameter called disk_access_mode. It appears that 
>>>>>>>>>> changing this 
>>>>>>>>>> to 'standard' would switch to using pread() and pwrite() for I/O, 
>>>>>>>>>> instead of 
>>>>>>>>>> mmap. I imagine there would be a throughput penalty here for the 
>>>>>>>>>> case when 
>>>>>>>>>> pages are in the disk cache. 
>>>>>>>>>> 
>>>>>>>>>> Is this a serious option? It seems far too underdocumented to be 
>>>>>>>>>> thought of as 
>>>>>>>>>> a contender. 
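
For reference, the knob itself is just a cassandra.yaml entry; as far as I 
can tell 'auto' is the default and resolves to mmap where it can:

disk_access_mode: standard   # switch reads to pread() instead of mmap
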
>>>>>>>>>> 
>>>>>>>>>> 4) modify the JVM 
>>>>>>>>>> 
>>>>>>>>>> This is a longer term option. For the purposes of safepoints, 
>>>>>>>>>> perhaps the JVM 
>>>>>>>>>> could treat reads from an mmapped file in the same way it treats 
>>>>>>>>>> threads that 
>>>>>>>>>> are running JNI code. That is, the safepoint will proceed even 
>>>>>>>>>> though the 
>>>>>>>>>> reading thread has not "joined in". Upon finishing its mmapped read, 
>>>>>>>>>> the 
>>>>>>>>>> reading thread would test the safepoint page (check whether a 
>>>>>>>>>> safepoint is in 
>>>>>>>>>> progress, in other words). 
>>>>>>>>>> 
>>>>>>>>>> Conclusion: 
>>>>>>>>>> 
>>>>>>>>>> I don't imagine there's an easy solution here. I plan to go ahead 
>>>>>>>>>> with 
>>>>>>>>>> mitigation #1: "don't tolerate block devices that are slow", but I'd 
>>>>>>>>>> appreciate 
>>>>>>>>>> any approach that doesn't require my hardware to be flawless all the 
>>>>>>>>>> time. 
>>>>>>>>>> 
>>>>>>>>>> Josh 
>>>>>>>>>> 
>>>>>>>>>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100 
>>>>>>>>>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 
>>>>>>>>>> [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop 
>>>>>>>> 
>>>>>> 