Linux automatically uses free memory as cache. It's not swap. http://www.tldp.org/LDP/lki/lki-4.html
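
A quick way to see this for yourself (a minimal sketch, Linux-only; it just greps /proc/meminfo): even with SwapTotal at 0 you will still see a large Cached figure, which is the page cache at work.

    // Prints the page-cache-related lines from /proc/meminfo. "Cached" stays
    // large even when "SwapTotal" is 0: page cache does not require swap.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class MemInfo {
        public static void main(String[] args) throws IOException {
            for (String line : Files.readAllLines(Paths.get("/proc/meminfo"))) {
                if (line.startsWith("Cached:")
                        || line.startsWith("Dirty:")
                        || line.startsWith("SwapTotal:")) {
                    System.out.println(line);
                }
            }
        }
    }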
On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vla...@winguzone.com> wrote:

> Sorry, I'm missing something. What page (memory) cache can exist if there
> is no swap file? Where are those pages written/read?
>
> Best regards, Vladimir Yudovin
>
> ---- On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
>
> Hi,
>
> Nope, I mean page cache. Linux doesn't call the cache it maintains using
> free memory a file cache. It uses free (and some of the time not so free!)
> memory to buffer writes and to cache recently written/read data.
>
> http://www.tldp.org/LDP/lki/lki-4.html
>
> When Linux decides it needs free memory it can either evict stuff from the
> page cache, flush dirty pages and then evict, or swap anonymous memory
> out. When you disable swap you only disable the last behavior.
>
> Maybe we are talking at cross purposes? What I meant is that increasing
> the heap size to reduce GC frequency is a legitimate thing to do, and it
> does have an impact on the page cache even if you have swap disabled.
>
> Ariel
>
> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>
> > Page cache is data pending flush to disk and data cached from disk.
>
> Do you mean file cache?
>
> Best regards, Vladimir Yudovin
>
> ---- On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote ----
>
> Hi,
>
> Page cache is in use even if you disable swap. Swap is anonymous memory,
> and whatever else the Linux kernel supports paging out. Page cache is data
> pending flush to disk and data cached from disk.
>
> Given how bad the GC pauses are in C*, I don't think this is the tall pole
> in the tent, at least not until key things are off heap and C* can run
> with CMS and get 10 millisecond GCs all day long.
>
> You can go through tuning and hardware selection to try to get more
> consistent I/O pauses and remove outliers, as you mention, and as a user I
> think this is your best bet. Generally it's either bad device or
> filesystem behavior if you get page faults taking more than 200
> milliseconds (on the order of a G1 collection).
>
> I think a JVM change to allow safepoints around memory-mapped file access
> is really unlikely, although I agree it would be great. I think the best
> hack around it is to code up your memory-mapped file access as JNI methods
> and find some way to get that to work. Right now, if you want to create a
> safepoint opportunity, a JNI method is the way to do it. The problem is
> that JNI methods and POJOs don't get along well.
>
> If you think about it, the reason non-memory-mapped I/O works well is that
> it's all JNI methods, so they don't impact time to safepoint. I think
> there is a tradeoff between tolerance for outliers and performance.
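>
> To make that concrete, here is a skeletal sketch (untested; the library
> name and method are hypothetical) of the Java half of what I mean. The
> native half, not shown, would just copy bytes out of the mapped region:
>
>     // Hypothetical sketch: hide the mmap access behind a native method so
>     // the thread counts as "in native" while it may be page-faulting, and
>     // a safepoint can proceed without waiting for it.
>     public final class MappedReader {
>         static { System.loadLibrary("mappedreader"); } // hypothetical lib
>
>         // The native side (not shown) would memcpy 'len' bytes from the
>         // mapped address 'addr' into 'dst'. A thread blocked in here on a
>         // page fault does not hold up time-to-safepoint.
>         private static native void copyFromMap(long addr, byte[] dst, int len);
>
>         public static byte[] read(long addr, int len) {
>             byte[] dst = new byte[len];
>             copyFromMap(addr, dst, len);
>             return dst;
>         }
>     }
>
> The POJO problem shows up as soon as you want something richer than a
> byte[] back: crossing the JNI boundary with object graphs is where it
> gets ugly.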
>
> I don't know the state of the non-memory-mapped path and how reliable
> that is. If it were reliable and I couldn't tolerate the outliers, I
> would use that. I have to ask, though: why are you not able to tolerate
> the outliers? If you are reading and writing at quorum, how is this
> impacting you?
>
> Regards,
> Ariel
>
> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>
> Hi Josh,
>
> > Running with increased heap size would reduce GC frequency, at the cost
> > of page cache.
>
> Actually, it's recommended to run C* with swap disabled. So if there is
> not enough memory, the JVM fails instead of blocking.
>
> Best regards, Vladimir Yudovin
>
> ---- On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder <j...@code406.com> wrote ----
>
> Hello cassandra-users,
>
> I'm investigating an issue with JVMs taking a while to reach a safepoint.
> I'd like the list's input on confirming my hypothesis and finding
> mitigations.
>
> My hypothesis is that slow block devices are causing Cassandra's JVM to
> pause completely while attempting to reach a safepoint.
>
> Background:
>
> Hotspot occasionally performs maintenance tasks that necessitate stopping
> all of its threads. Threads running JITed code occasionally read from a
> designated safepoint page. If Hotspot has initiated a safepoint, reading
> from that page essentially catapults the thread into purgatory until the
> safepoint completes (the mechanism behind this is pretty cool). Threads
> performing syscalls or executing native code do this check upon their
> return into the JVM.
>
> In this way, during the safepoint Hotspot can be sure that all of its
> threads are either patiently waiting for safepoint completion or in a
> system call.
>
> Cassandra makes heavy use of mmapped reads in normal operation. When
> doing mmapped reads, the JVM executes userspace code to effect a read
> from a file. On the fast path (when the page needed is already mapped
> into the process), this instruction is very fast. When the page is not
> cached, the CPU triggers a page fault and asks the OS to go fetch the
> page. The JVM doesn't even realize that anything interesting is
> happening: to it, the thread is just executing a mov instruction that
> happens to take a while.
>
> The OS, meanwhile, puts the thread in question in the D state (assuming
> Linux, here) and goes off to find the desired page. This may take
> microseconds, this may take milliseconds, or it may take seconds (or
> longer). When I/O occurs while the JVM is trying to enter a safepoint,
> every thread has to wait for the laggard I/O to complete.
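>
> For concreteness, the read in question is nothing more exotic than this
> (a minimal sketch; the file argument is hypothetical):
>
>     // A MappedByteBuffer get() compiles down to a plain memory load. If
>     // the page is resident it takes nanoseconds; if not, the thread takes
>     // a page fault inside what the JVM believes is an ordinary instruction.
>     import java.io.RandomAccessFile;
>     import java.nio.MappedByteBuffer;
>     import java.nio.channels.FileChannel;
>
>     public class MmapRead {
>         public static void main(String[] args) throws Exception {
>             try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
>                  FileChannel ch = raf.getChannel()) {
>                 long size = Math.min(ch.size(), Integer.MAX_VALUE);
>                 MappedByteBuffer buf =
>                     ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
>                 long sum = 0;
>                 for (int pos = 0; pos < buf.limit(); pos += 4096) {
>                     sum += buf.get(pos); // may page-fault; JVM can't tell
>                 }
>                 System.out.println(sum);
>             }
>         }
>     }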
>
> If you log safepoints with the right options [1], you can see these
> occurrences in the JVM output:
>
> > # SafepointSynchronize::begin: Timeout detected:
> > # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
> > # SafepointSynchronize::begin: Threads which did not reach the safepoint:
> > # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0
> >   tid=0x00007f8785bb1f30 nid=0x4e14 runnable [0x0000000000000000]
> >     java.lang.Thread.State: RUNNABLE
> > # SafepointSynchronize::begin: (End of list)
> >       vmop  [threads: total initially_running wait_to_block]
> >             [time: spin block sync cleanup vmop]  page_trap_count
> > 58099.941: G1IncCollectionPause  [ 447  1  1 ]  [ 3304  0  3305  1  190 ]  1
>
> If that safepoint happens to be a garbage collection (which this one
> was), you can also see it in GC logs:
>
> > 2016-10-07T13:19:50.029+0000: 58103.440: Total time for which
> > application threads were stopped: 3.4971808 seconds, Stopping threads
> > took: 3.3050644 seconds
>
> In this way, JVM safepoints become a powerful weapon for transmuting a
> single thread's slow I/O into the entire JVM's lockup.
>
> Does all of the above sound correct?
>
> Mitigations:
>
> 1) don't tolerate block devices that are slow
>
> This is easy in theory, and only somewhat difficult in practice. Tools
> like perf and iosnoop [2] can do pretty good jobs of letting you know
> when a block device is slow.
>
> It is sad, though, because this makes running Cassandra on mixed hardware
> (e.g. fast SSD and slow disks in a JBOD) quite unappetizing.
>
> 2) have fewer safepoints
>
> Two of the biggest sources of safepoints are garbage collection and
> revocation of biased locks. Evidence points toward biased locking being
> unhelpful for Cassandra's purposes, so turning it off
> (-XX:-UseBiasedLocking) is a quick way to eliminate one source of
> safepoints.
>
> Garbage collection, on the other hand, is unavoidable. Running with an
> increased heap size would reduce GC frequency, at the cost of page cache.
> But sacrificing page cache would increase page fault frequency, which is
> another thing we're trying to avoid! I don't view this as a serious
> option.
>
> 3) use a different I/O strategy
>
> Looking at the Cassandra source code, there appears to be an
> un(der)documented configuration parameter called disk_access_mode. It
> appears that changing this to 'standard' would switch to using pread()
> and pwrite() for I/O, instead of mmap. I imagine there would be a
> throughput penalty here for the case when pages are in the disk cache.
>
> Is this a serious option? It seems far too underdocumented to be thought
> of as a contender.
>
> 4) modify the JVM
>
> This is a longer-term option. For the purposes of safepoints, perhaps the
> JVM could treat reads from an mmapped file in the same way it treats
> threads that are running JNI code. That is, the safepoint would proceed
> even though the reading thread has not "joined in". Upon finishing its
> mmapped read, the reading thread would test the safepoint page (check
> whether a safepoint is in progress, in other words).
>
> Conclusion:
>
> I don't imagine there's an easy solution here. I plan to go ahead with
> mitigation #1: "don't tolerate block devices that are slow", but I'd
> appreciate any approach that doesn't require my hardware to be flawless
> all the time.
>
> Josh
>
> [1] -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=100
>     -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
> [2] https://github.com/brendangregg/perf-tools/blob/master/iosnoop
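
Regarding mitigation 3 above: as far as I can tell, 'standard' mode amounts
to positional reads, i.e. something like this sketch (the file argument is
hypothetical):

    // FileChannel.read(dst, position) bottoms out in a pread() syscall, so
    // a thread stuck in slow I/O here is already "in native" from the JVM's
    // point of view and does not delay time-to-safepoint.
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class StandardRead {
        public static void main(String[] args) throws Exception {
            try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
                                                   StandardOpenOption.READ)) {
                ByteBuffer buf = ByteBuffer.allocate(4096);
                int n = ch.read(buf, 0); // positional read: pread() underneath
                System.out.println("read " + n + " bytes");
            }
        }
    }

The cost is an extra copy through a heap buffer instead of reading straight
out of the page cache, which lines up with the throughput penalty Josh
guesses at.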