Do you know if there are any publicly available benchmarks on disk_access_mode, preferably after the fix from CASSANDRA-10249?
If it turns out that syscall I/O is not significantly slower, I'd consider
switching. If I don't know the costs, I think I'd prefer to stick with the
devil I know how to mitigate (i.e. by policing my block devices) rather than
switching to the devil that is non-standard and undocumented. :)

I may have time to do some benchmarking myself. If so, I'll be sure to
inform the list.

Josh

On Sun, Oct 9, 2016 at 2:39 AM, Benedict Elliott Smith <bened...@apache.org> wrote:
> The biggest problem with pread was the issue of over-reading (reading 64k
> where 4k would suffice), which was significantly improved in 2.2 iirc. I
> don't think the penalty is very significant anymore, and if you are
> experiencing time-to-safepoint issues it's very likely a worthwhile switch
> to flip.
>
>
> On Sunday, 9 October 2016, Graham Sanderson <gra...@vast.com> wrote:
>>
>> I was using the term “touch” loosely, hopefully to mean pre-fetch; I
>> suspect you can still issue a sensible prefetch instruction in native
>> code (though I think Intel has been de-emphasizing it). Even if not, you
>> are still better off blocking in JNI code - I haven’t looked at the link
>> to see if the correct barriers are enforced by the sun.misc.Unsafe
>> method.
>>
>> I do suspect that you’ll see up to about 5-10% syscall overhead if you
>> hit pread.
>>
>> > On Oct 8, 2016, at 11:02 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>> >
>> > Hi,
>> >
>> > This is starting to get into dev list territory.
>> >
>> > Interesting idea to touch every 4K page you are going to read.
>> >
>> > You could use this to minimize the cost.
>> >
>> > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
>> >
>> > Maybe faster than doing buffered IO. It's a lot of cache and TLB
>> > misses without prefetching though.
>> >
>> > There is a system call to page the memory in, which might be better
>> > for larger reads. Still no guarantee things stay cached though.
>> >
>> > Ariel
>> >
>> >
>> > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>> >> I haven’t studied the read path that carefully, but there might be a
>> >> spot at the C* level rather than the JVM level where you could
>> >> effectively do a JNI touch of the mmap region you’re going to need
>> >> next.
>> >>
>> >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <gra...@vast.com> wrote:
>> >>>
>> >>> We don’t use Azul’s Zing, but it does have the nice feature that all
>> >>> threads don’t have to reach safepoints at the same time. That said,
>> >>> we make heavy use of Cassandra (with off-heap memtables - not
>> >>> directly related, but allows us a lot more GC headroom) and SOLR,
>> >>> where we switched to mmap because it FAR outperformed pread variants
>> >>> - in no case have we noticed long times to safepoint (then again our
>> >>> IO is lightning fast).
>> >>>
>> >>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <j...@jonhaddad.com>
>> >>>> wrote:
>> >>>>
>> >>>> Linux automatically uses free memory as cache. It's not swap.
>> >>>>
>> >>>> http://www.tldp.org/LDP/lki/lki-4.html
>> >>>>
>> >>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin
>> >>>> <vla...@winguzone.com> wrote:
>> >>>>> Sorry, I'm not catching something. What page (memory) cache can
>> >>>>> exist if there is no swap file? Where are those pages written/read?
>> >>>>>
>> >>>>> Best regards, Vladimir Yudovin,
>> >>>>> Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
>> >>>>> Cassandra on Azure and SoftLayer.
>> >>>>> Launch your cluster in minutes.
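A minimal sketch of the "touch every 4K page" idea from Ariel's and Graham's
messages above, in plain Java NIO (file path taken from argv; the 4K page
size is an assumption). Note the catch they are discussing: done from pure
Java like this, the faults still happen in JITed code, so to actually help
time to safepoint the same loop would have to run behind JNI or the Unsafe
trick linked above. This only illustrates the access pattern:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public final class TouchPages {
        private static final int PAGE_SIZE = 4096; // assumed page size

        /** Read one byte per page so any page faults happen up front. */
        static long touch(MappedByteBuffer buf) {
            long sink = 0;
            for (int i = 0; i < buf.capacity(); i += PAGE_SIZE) {
                sink += buf.get(i); // may fault: pulls the page into the page cache
            }
            return sink; // returned so the JIT can't elide the reads
        }

        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
                                                   StandardOpenOption.READ)) {
                // a single map() is limited to 2 GB; real code would map chunks
                MappedByteBuffer map =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                System.out.println(touch(map));
            }
        }
    }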
>> >>>>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg
>> >>>>> <ar...@weisberg.ws> wrote ----
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> Nope, I mean page cache. Linux doesn't call the cache it
>> >>>>>> maintains using free memory a file cache. It uses free (and some
>> >>>>>> of the time not so free!) memory to buffer writes and to cache
>> >>>>>> recently written/read data.
>> >>>>>>
>> >>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>> >>>>>>
>> >>>>>> When Linux decides it needs free memory it can either evict stuff
>> >>>>>> from the page cache, flush dirty pages and then evict, or swap
>> >>>>>> anonymous memory out. When you disable swap you only disable the
>> >>>>>> last behavior.
>> >>>>>>
>> >>>>>> Maybe we are talking at cross purposes? What I meant is that
>> >>>>>> increasing the heap size to reduce GC frequency is a legitimate
>> >>>>>> thing to do, and it does have an impact on the performance of the
>> >>>>>> page cache even if you have swap disabled.
>> >>>>>>
>> >>>>>> Ariel
>> >>>>>>
>> >>>>>>
>> >>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>> >>>>>>>> Page cache is data pending flush to disk and data cached from
>> >>>>>>>> disk.
>> >>>>>>>
>> >>>>>>> Do you mean file cache?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Best regards, Vladimir Yudovin,
>> >>>>>>> Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
>> >>>>>>> Cassandra on Azure and SoftLayer.
>> >>>>>>> Launch your cluster in minutes.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg
>> >>>>>>> <ar...@weisberg.ws> wrote ----
>> >>>>>>>> Hi,
>> >>>>>>>>
>> >>>>>>>> Page cache is in use even if you disable swap. Swap is anonymous
>> >>>>>>>> memory, and whatever else the Linux kernel supports paging out.
>> >>>>>>>> Page cache is data pending flush to disk and data cached from
>> >>>>>>>> disk.
>> >>>>>>>>
>> >>>>>>>> Given how bad the GC pauses are in C*, I don't think this is the
>> >>>>>>>> high pole in the tent until key things are off heap and C* can
>> >>>>>>>> run with CMS and get 10 millisecond GCs all day long.
>> >>>>>>>>
>> >>>>>>>> You can go through tuning and hardware selection to try to get
>> >>>>>>>> more consistent IO pauses and remove outliers, as you mention,
>> >>>>>>>> and as a user I think this is your best bet. Generally it's
>> >>>>>>>> either bad device or filesystem behavior if you get page faults
>> >>>>>>>> taking more than 200 milliseconds, i.e. on the order of a G1 GC
>> >>>>>>>> collection.
>> >>>>>>>>
>> >>>>>>>> I think a JVM change to allow safepoints around memory-mapped
>> >>>>>>>> file access is really unlikely, although I agree it would be
>> >>>>>>>> great. I think the best hack around it is to code up your
>> >>>>>>>> memory-mapped file access into JNI methods and find some way to
>> >>>>>>>> get that to work. Right now if you want to create a safepoint, a
>> >>>>>>>> JNI method is the way to do it. The problem is that JNI methods
>> >>>>>>>> and POJOs don't get along well.
>> >>>>>>>>
>> >>>>>>>> If you think about it, the reason non-memory-mapped IO works
>> >>>>>>>> well is that it's all JNI methods, so they don't impact time to
>> >>>>>>>> safepoint. I think there is a tradeoff between tolerance for
>> >>>>>>>> outliers and performance.
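To make the two read paths Ariel contrasts concrete, a small sketch in
standard Java NIO (nothing Cassandra-specific; file path from argv, file
assumed non-empty and under 2 GB):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public final class TwoReadPaths {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
                                                   StandardOpenOption.READ)) {
                // pread()-style: FileChannel.read(dst, position) is a JNI
                // method. If the device is slow, the thread blocks inside the
                // syscall, and the JVM counts threads in native code as
                // already "safe", so a safepoint can proceed around it.
                ByteBuffer buf = ByteBuffer.allocate(4096);
                ch.read(buf, 0);

                // mmap-style: get() is an ordinary JITed memory access. A
                // page cache miss becomes a page fault inside Java code, and
                // a safepoint requested meanwhile waits on that fault.
                MappedByteBuffer map =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                System.out.println(map.get(0));
            }
        }
    }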
>> >>>>>>>> I don't know the state of the non-memory-mapped path and how
>> >>>>>>>> reliable that is. If it were reliable and I couldn't tolerate
>> >>>>>>>> the outliers, I would use that. I have to ask, though: why are
>> >>>>>>>> you not able to tolerate the outliers? If you are reading and
>> >>>>>>>> writing at quorum, how is this impacting you?
>> >>>>>>>>
>> >>>>>>>> Regards,
>> >>>>>>>> Ariel
>> >>>>>>>>
>> >>>>>>>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>> >>>>>>>>> Hi Josh,
>> >>>>>>>>>
>> >>>>>>>>>> Running with increased heap size would reduce GC frequency,
>> >>>>>>>>>> at the cost of page cache.
>> >>>>>>>>>
>> >>>>>>>>> Actually it's recommended to run C* without swap enabled, so
>> >>>>>>>>> if there is not enough memory the JVM fails instead of
>> >>>>>>>>> blocking.
>> >>>>>>>>>
>> >>>>>>>>> Best regards, Vladimir Yudovin,
>> >>>>>>>>> Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
>> >>>>>>>>> Cassandra on Azure and SoftLayer.
>> >>>>>>>>> Launch your cluster in minutes.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder
>> >>>>>>>>> <j...@code406.com> wrote ----
>> >>>>>>>>>> Hello cassandra-users,
>> >>>>>>>>>>
>> >>>>>>>>>> I'm investigating an issue with JVMs taking a while to reach
>> >>>>>>>>>> a safepoint. I'd like the list's input on confirming my
>> >>>>>>>>>> hypothesis and finding mitigations.
>> >>>>>>>>>>
>> >>>>>>>>>> My hypothesis is that slow block devices are causing
>> >>>>>>>>>> Cassandra's JVM to pause completely while attempting to reach
>> >>>>>>>>>> a safepoint.
>> >>>>>>>>>>
>> >>>>>>>>>> Background:
>> >>>>>>>>>>
>> >>>>>>>>>> Hotspot occasionally performs maintenance tasks that
>> >>>>>>>>>> necessitate stopping all of its threads. Threads running
>> >>>>>>>>>> JITed code periodically read from a given safepoint page. If
>> >>>>>>>>>> Hotspot has initiated a safepoint, reading from that page
>> >>>>>>>>>> essentially catapults the thread into purgatory until the
>> >>>>>>>>>> safepoint completes (the mechanism behind this is pretty
>> >>>>>>>>>> cool). Threads performing syscalls or executing native code
>> >>>>>>>>>> do this check upon their return into the JVM.
>> >>>>>>>>>>
>> >>>>>>>>>> In this way, during the safepoint Hotspot can be sure that
>> >>>>>>>>>> all of its threads are either patiently waiting for safepoint
>> >>>>>>>>>> completion or in a system call.
>> >>>>>>>>>>
>> >>>>>>>>>> Cassandra makes heavy use of mmapped reads in normal
>> >>>>>>>>>> operation. When doing mmapped reads, the JVM executes
>> >>>>>>>>>> userspace code to effect a read from a file. On the fast path
>> >>>>>>>>>> (when the page needed is already mapped into the process),
>> >>>>>>>>>> this instruction is very fast. When the page is not cached,
>> >>>>>>>>>> the CPU triggers a page fault and asks the OS to go fetch the
>> >>>>>>>>>> page. The JVM doesn't even realize that anything interesting
>> >>>>>>>>>> is happening: to it, the thread is just executing a mov
>> >>>>>>>>>> instruction that happens to take a while.
>> >>>>>>>>>>
>> >>>>>>>>>> The OS, meanwhile, puts the thread in question in the D state
>> >>>>>>>>>> (assuming Linux, here) and goes off to find the desired page.
>> >>>>>>>>>> This may take microseconds, this may take milliseconds, or it
>> >>>>>>>>>> may take seconds (or longer). When I/O occurs while the JVM
>> >>>>>>>>>> is trying to enter a safepoint, every thread has to wait for
>> >>>>>>>>>> the laggard I/O to complete.
>> >>>>>>>>>> If you log safepoints with the right options [1], you can
>> >>>>>>>>>> see these occurrences in the JVM output:
>> >>>>>>>>>>
>> >>>>>>>>>>> # SafepointSynchronize::begin: Timeout detected:
>> >>>>>>>>>>> # SafepointSynchronize::begin: Timed out while spinning to
>> >>>>>>>>>>> reach a safepoint.
>> >>>>>>>>>>> # SafepointSynchronize::begin: Threads which did not reach
>> >>>>>>>>>>> the safepoint:
>> >>>>>>>>>>> # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0
>> >>>>>>>>>>> tid=0x00007f8785bb1f30 nid=0x4e14 runnable
>> >>>>>>>>>>> [0x0000000000000000]
>> >>>>>>>>>>> java.lang.Thread.State: RUNNABLE
>> >>>>>>>>>>>
>> >>>>>>>>>>> # SafepointSynchronize::begin: (End of list)
>> >>>>>>>>>>> vmop  [threads: total initially_running wait_to_block]
>> >>>>>>>>>>>       [time: spin block sync cleanup vmop]  page_trap_count
>> >>>>>>>>>>> 58099.941: G1IncCollectionPause
>> >>>>>>>>>>>       [ 447 1 1 ]  [ 3304 0 3305 1 190 ]  1
>> >>>>>>>>>>
>> >>>>>>>>>> If that safepoint happens to be a garbage collection (which
>> >>>>>>>>>> this one was), you can also see it in GC logs:
>> >>>>>>>>>>
>> >>>>>>>>>>> 2016-10-07T13:19:50.029+0000: 58103.440: Total time for
>> >>>>>>>>>>> which application threads were stopped: 3.4971808 seconds,
>> >>>>>>>>>>> Stopping threads took: 3.3050644 seconds
>> >>>>>>>>>>
>> >>>>>>>>>> In this way, JVM safepoints become a powerful weapon for
>> >>>>>>>>>> transmuting a single thread's slow I/O into the entire JVM's
>> >>>>>>>>>> lockup.
>> >>>>>>>>>>
>> >>>>>>>>>> Does all of the above sound correct?
>> >>>>>>>>>>
>> >>>>>>>>>> Mitigations:
>> >>>>>>>>>>
>> >>>>>>>>>> 1) don't tolerate block devices that are slow
>> >>>>>>>>>>
>> >>>>>>>>>> This is easy in theory, and only somewhat difficult in
>> >>>>>>>>>> practice. Tools like perf and iosnoop [2] can do a pretty
>> >>>>>>>>>> good job of letting you know when a block device is slow.
>> >>>>>>>>>>
>> >>>>>>>>>> It is sad, though, because this makes running Cassandra on
>> >>>>>>>>>> mixed hardware (e.g. fast SSD and slow disks in a JBOD) quite
>> >>>>>>>>>> unappetizing.
>> >>>>>>>>>>
>> >>>>>>>>>> 2) have fewer safepoints
>> >>>>>>>>>>
>> >>>>>>>>>> Two of the biggest sources of safepoints are garbage
>> >>>>>>>>>> collection and revocation of biased locks. Evidence points
>> >>>>>>>>>> toward biased locking being unhelpful for Cassandra's
>> >>>>>>>>>> purposes, so turning it off (-XX:-UseBiasedLocking) is a
>> >>>>>>>>>> quick way to eliminate one source of safepoints (see the
>> >>>>>>>>>> snippet below).
>> >>>>>>>>>>
>> >>>>>>>>>> Garbage collection, on the other hand, is unavoidable.
>> >>>>>>>>>> Running with increased heap size would reduce GC frequency,
>> >>>>>>>>>> at the cost of page cache. But sacrificing page cache would
>> >>>>>>>>>> increase page fault frequency, which is another thing we're
>> >>>>>>>>>> trying to avoid! I don't view this as a serious option.
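For reference, the flags from mitigation #2 and footnote [1] in one place,
assuming the usual cassandra-env.sh mechanism for passing JVM options
(adjust for your install):

    # cassandra-env.sh - assumed location for JVM flags; varies by setup
    JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"   # drop biased-lock revocation safepoints
    # safepoint logging from [1]: report safepoints taking >100 ms to reach
    JVM_OPTS="$JVM_OPTS -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100"
    JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"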
>> >>>>>>>>>> 3) use a different IO strategy
>> >>>>>>>>>>
>> >>>>>>>>>> Looking at the Cassandra source code, there appears to be an
>> >>>>>>>>>> un(der)documented configuration parameter called
>> >>>>>>>>>> disk_access_mode. It appears that changing this to 'standard'
>> >>>>>>>>>> would switch to using pread() and pwrite() for I/O, instead
>> >>>>>>>>>> of mmap. I imagine there would be a throughput penalty here
>> >>>>>>>>>> for the case when pages are in the disk cache.
>> >>>>>>>>>>
>> >>>>>>>>>> Is this a serious option? It seems far too underdocumented to
>> >>>>>>>>>> be thought of as a contender.
>> >>>>>>>>>>
>> >>>>>>>>>> 4) modify the JVM
>> >>>>>>>>>>
>> >>>>>>>>>> This is a longer-term option. For the purposes of safepoints,
>> >>>>>>>>>> perhaps the JVM could treat reads from an mmapped file in the
>> >>>>>>>>>> same way it treats threads that are running JNI code. That
>> >>>>>>>>>> is, the safepoint would proceed even though the reading
>> >>>>>>>>>> thread has not "joined in". Upon finishing its mmapped read,
>> >>>>>>>>>> the reading thread would test the safepoint page (check
>> >>>>>>>>>> whether a safepoint is in progress, in other words).
>> >>>>>>>>>>
>> >>>>>>>>>> Conclusion:
>> >>>>>>>>>>
>> >>>>>>>>>> I don't imagine there's an easy solution here. I plan to go
>> >>>>>>>>>> ahead with mitigation #1: "don't tolerate block devices that
>> >>>>>>>>>> are slow", but I'd appreciate any approach that doesn't
>> >>>>>>>>>> require my hardware to be flawless all the time.
>> >>>>>>>>>>
>> >>>>>>>>>> Josh
>> >>>>>>>>>>
>> >>>>>>>>>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100
>> >>>>>>>>>> -XX:+PrintSafepointStatistics
>> >>>>>>>>>> -XX:PrintSafepointStatisticsCount=1
>> >>>>>>>>>> [2]
>> >>>>>>>>>> https://github.com/brendangregg/perf-tools/blob/master/iosnoop
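For anyone who wants to experiment with mitigation #3, the setting goes in
cassandra.yaml. Since it is undocumented, the value names below are
best-effort recollection; verify them against DatabaseDescriptor in the
source for your version:

    # cassandra.yaml
    # Believed-accepted values: auto, mmap, mmap_index_only, standard.
    # 'standard' reads via pread() instead of mmap; 'mmap_index_only'
    # memory-maps only index files.
    disk_access_mode: standard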