That's a great idea. Even if the results were immediately thrown away, pre-reading in a JNI method would eliminate cache misses with very high probability. The only thing I'd worry about is the increased overhead of JNI interfering with the fast path (cache hits). I don't have enough knowledge on the read path or about JNI latency to comment on whether this concern is "real" or not.
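For concreteness, here is roughly how I'd picture the Java side of such a pre-touch looking (purely a hypothetical sketch - the class, the native library name, and the touch() signature are mine, not anything that exists in Cassandra's read path today):

// Hypothetical sketch only: illustrates the shape of a "JNI touch" of an
// mmapped region; neither this class nor the native library exists today.
public final class MmapPretoucher {
    static {
        System.loadLibrary("mmappretouch"); // assumed name of the native library
    }

    /*
     * Implemented in C: reads one byte per page of [address, address + length).
     * Any page faults taken here happen while the thread is executing native
     * code, so a pending safepoint does not have to wait for the slow I/O.
     */
    private static native void touch(long address, long length);

    /* Called on the read path just before the Java-side mmapped read. */
    public static void pretouch(long address, long length) {
        touch(address, length);
    }
}

The bytes read are thrown away; the only goal is that the subsequent Java-side read hits pages that are already resident.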
Josh

On Sat, Oct 8, 2016 at 5:21 PM, Graham Sanderson <gra...@vast.com> wrote:
> I haven’t studied the read path that carefully, but there might be a spot at the C* level, rather than the JVM level, where you could effectively do a JNI touch of the mmap region you’re going to need next.
>
> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <gra...@vast.com> wrote:
>
> We don’t use Azul’s Zing, but it does have the nice feature that all threads don’t have to reach safepoints at the same time. That said, we make heavy use of Cassandra (with off-heap memtables - not directly related, but it gives us a lot more GC headroom) and SOLR, where we switched to mmap because it FAR outperformed the pread variants - in no case have we noticed a long time to safepoint (then again, our IO is lightning fast).
>
> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
> Linux automatically uses free memory as cache. It's not swap.
>
> http://www.tldp.org/LDP/lki/lki-4.html
>
> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vla...@winguzone.com> wrote:
>>
>> Sorry, I'm not catching something. What page (memory) cache can exist if there is no swap file? Where are those pages written/read?
>>
>> Best regards, Vladimir Yudovin,
>> Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
>> Launch your cluster in minutes.
>>
>> ---- On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
>>
>> Hi,
>>
>> Nope, I mean page cache. Linux doesn't call the cache it maintains using free memory a file cache. It uses free (and some of the time not-so-free!) memory to buffer writes and to cache recently written/read data.
>>
>> http://www.tldp.org/LDP/lki/lki-4.html
>>
>> When Linux decides it needs free memory, it can either evict stuff from the page cache, flush dirty pages and then evict, or swap anonymous memory out. When you disable swap, you only disable the last behavior.
>>
>> Maybe we are talking at cross purposes? What I meant is that increasing the heap size to reduce GC frequency is a legitimate thing to do, and it does have an impact on the performance of the page cache even if you have swap disabled.
>>
>> Ariel
>>
>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>
>> > Page cache is data pending flush to disk and data cached from disk.
>>
>> Do you mean file cache?
>>
>> Best regards, Vladimir Yudovin,
>> Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
>> Launch your cluster in minutes.
>>
>> ---- On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
>>
>> Hi,
>>
>> Page cache is in use even if you disable swap. Swap is anonymous memory, and whatever else the Linux kernel supports paging out. Page cache is data pending flush to disk and data cached from disk.
>>
>> Given how bad the GC pauses are in C*, I don't think this is the high pole in the tent - not until key things are off heap and C* can run with CMS and get 10 millisecond GCs all day long.
>>
>> You can go through tuning and hardware selection to try to get more consistent IO pauses and remove outliers, as you mention, and as a user I think this is your best bet. Generally it's either bad device or filesystem behavior if you get page faults taking more than 200 milliseconds (on the order of a G1 collection).
>>
>> I think a JVM change to allow safepoints around memory-mapped file access is really unlikely, although I agree it would be great.
>> I think the best hack around it is to code up your memory-mapped file access as JNI methods and find some way to get that to work. Right now, if you want to create a safepoint, a JNI method is the way to do it. The problem is that JNI methods and POJOs don't get along well.
>>
>> If you think about it, the reason non-memory-mapped IO works well is that it's all JNI methods, so they don't impact time to safepoint. I think there is a tradeoff between tolerance for outliers and performance.
>>
>> I don't know the state of the non-memory-mapped path and how reliable it is. If it were reliable and I couldn't tolerate the outliers, I would use that. I have to ask, though: why are you not able to tolerate the outliers? If you are reading and writing at quorum, how is this impacting you?
>>
>> Regards,
>> Ariel
>>
>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>>
>> Hi Josh,
>>
>> > Running with increased heap size would reduce GC frequency, at the cost of page cache.
>>
>> Actually, it's recommended to run C* without virtual memory enabled. So if there is not enough memory, the JVM fails instead of blocking.
>>
>> Best regards, Vladimir Yudovin,
>> Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
>> Launch your cluster in minutes.
>>
>> ---- On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder <j...@code406.com> wrote ----
>>
>> Hello cassandra-users,
>>
>> I'm investigating an issue with JVMs taking a while to reach a safepoint. I'd like the list's input on confirming my hypothesis and finding mitigations.
>>
>> My hypothesis is that slow block devices are causing Cassandra's JVM to pause completely while attempting to reach a safepoint.
>>
>> Background:
>>
>> Hotspot occasionally performs maintenance tasks that necessitate stopping all of its threads. Threads running JITed code periodically read from a designated safepoint page. If Hotspot has initiated a safepoint, reading from that page essentially catapults the thread into purgatory until the safepoint completes (the mechanism behind this is pretty cool). Threads performing syscalls or executing native code do this check upon their return into the JVM.
>>
>> In this way, during the safepoint Hotspot can be sure that all of its threads are either patiently waiting for safepoint completion or in a system call.
>>
>> Cassandra makes heavy use of mmapped reads in normal operation. When doing mmapped reads, the JVM executes userspace code to effect a read from a file. On the fast path (when the needed page is already mapped into the process), this instruction is very fast. When the page is not cached, the CPU triggers a page fault and asks the OS to go fetch the page. The JVM doesn't even realize that anything interesting is happening: to it, the thread is just executing a mov instruction that happens to take a while.
>>
>> The OS, meanwhile, puts the thread in question in the D state (assuming Linux here) and goes off to find the desired page. This may take microseconds, it may take milliseconds, or it may take seconds (or longer). When such I/O occurs while the JVM is trying to enter a safepoint, every thread has to wait for the laggard I/O to complete.
>>
>> If you log safepoints with the right options [1], you can see these occurrences in the JVM output:
>>
>> > # SafepointSynchronize::begin: Timeout detected:
>> > # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
>> > # SafepointSynchronize::begin: Threads which did not reach the safepoint:
>> > # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000]
>> >    java.lang.Thread.State: RUNNABLE
>> >
>> > # SafepointSynchronize::begin: (End of list)
>> >          vmop                  [threads: total initially_running wait_to_block]  [time: spin block sync cleanup vmop]  page_trap_count
>> > 58099.941: G1IncCollectionPause  [ 447 1 1 ]  [ 3304 0 3305 1 190 ]  1
>>
>> If that safepoint happens to be a garbage collection (which this one was), you can also see it in the GC logs:
>>
>> > 2016-10-07T13:19:50.029+0000: 58103.440: Total time for which application threads were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644 seconds
>>
>> In this way, JVM safepoints become a powerful weapon for transmuting a single thread's slow I/O into a lockup of the entire JVM.
>>
>> Does all of the above sound correct?
>>
>> Mitigations:
>>
>> 1) Don't tolerate block devices that are slow.
>>
>> This is easy in theory, and only somewhat difficult in practice. Tools like perf and iosnoop [2] can do a pretty good job of letting you know when a block device is slow.
>>
>> It is sad, though, because this makes running Cassandra on mixed hardware (e.g. fast SSDs and slow disks in a JBOD) quite unappetizing.
>>
>> 2) Have fewer safepoints.
>>
>> Two of the biggest sources of safepoints are garbage collection and revocation of biased locks. Evidence points toward biased locking being unhelpful for Cassandra's purposes, so turning it off (-XX:-UseBiasedLocking) is a quick way to eliminate one source of safepoints.
>>
>> Garbage collection, on the other hand, is unavoidable. Running with an increased heap size would reduce GC frequency, at the cost of page cache. But sacrificing page cache would increase page fault frequency, which is another thing we're trying to avoid! I don't view this as a serious option.
>>
>> 3) Use a different IO strategy.
>>
>> Looking at the Cassandra source code, there appears to be an un(der)documented configuration parameter called disk_access_mode. It appears that changing this to 'standard' would switch to using pread() and pwrite() for I/O instead of mmap. I imagine there would be a throughput penalty here for the case when pages are already in the page cache.
>>
>> Is this a serious option? It seems far too underdocumented to be thought of as a contender.
>>
>> 4) Modify the JVM.
>>
>> This is a longer-term option. For the purposes of safepoints, perhaps the JVM could treat reads from an mmapped file the same way it treats threads that are running JNI code. That is, the safepoint would proceed even though the reading thread has not "joined in". Upon finishing its mmapped read, the reading thread would test the safepoint page (check whether a safepoint is in progress, in other words).
>>
>> Conclusion:
>>
>> I don't imagine there's an easy solution here. I plan to go ahead with mitigation #1 ("don't tolerate block devices that are slow"), but I'd appreciate any approach that doesn't require my hardware to be flawless all the time.
>>
>> Josh
>>
>> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100 -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
>> [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop
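P.S. For anyone who wants to poke at the mechanism I described above in isolation, here is a minimal, self-contained sketch (plain Java, not Cassandra code; the file path is just an illustration) that maps a file and reads a single byte. If that byte's page is not resident, the get() call is exactly the "mov instruction that happens to take a while": the thread blocks in the kernel while the JVM still counts it as running Java code.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public final class MmapReadExample {
    public static void main(String[] args) throws IOException {
        // Illustrative path only; point it at a large file on a slow device to reproduce.
        try (FileChannel ch = FileChannel.open(Paths.get("/path/to/large-file"),
                                               StandardOpenOption.READ)) {
            long size = Math.min(ch.size(), Integer.MAX_VALUE); // a single map() is limited to 2 GB
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);

            // Fast path: the page is resident, so this is an ordinary memory load.
            // Slow path: the page is not resident, the load triggers a page fault, and the
            // thread sits in the D state until the kernel fetches the page - all while
            // Hotspot still treats it as a running Java thread for safepoint purposes.
            byte first = buf.get(0);
            System.out.println("first byte = " + first);
        }
    }
}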