Re: JVM safepoints, mmap, and slow disks

2016-10-10 Thread Ariel Weisberg
Hi,

> That StackOverflow headline is interesting. Based on my reading of
> Hotspot's code, it looks like sun.misc.unsafe is used under the hood to
> perform mmapped I/O. I need to learn more about Hotspot's implementation
> before I can comment further.
A memory mapped file is "just" memory so it's accessed using a
ByteBuffer pointing to off heap memory. Works the same as if you had
mapped in some anonymous memory.

> Not sure what you mean here. Aren't there going to be cache and TLB
> misses for any I/O, whether via mmap or syscall?
>
The beauty of memory mapped files can be that if the data is already in
the page cache it's just a regular read of memory. If you touch each
page and it's all in memory it's going to be a slower operation that
blocks the CPU as it has to synchronously load each cache line. It's
possible you might be able to touch multiple pages in parallel if you
are clever.

So if the data is in the page cache and you just access it regularly
(sequentially) you get all the benefits of the prefetcher. If you go and
touch every page first, the prefetcher cannot hide that memory latency
for you.
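
To make the two access patterns concrete, here is a minimal Java sketch. It is not taken from Cassandra; the class name, the method names, and the 4 KiB page size are assumptions for illustration. `touchPages` reads one byte per page of a mapped file (the pre-touch approach), while `sequentialSum` simply walks the buffer in order.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class PageTouch {
    // Assumed page size: 4 KiB is typical on x86-64 Linux but not guaranteed.
    static final int PAGE_SIZE = 4096;

    // Written to so the per-page loads are not optimized away.
    static volatile byte sink;

    /** Maps the whole file read-only; the mapping outlives the channel. */
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /** Reads one byte per page so every page is faulted in; returns pages touched. */
    static int touchPages(MappedByteBuffer buf) {
        int pages = 0;
        byte acc = 0;
        for (int pos = 0; pos < buf.limit(); pos += PAGE_SIZE) {
            acc ^= buf.get(pos);
            pages++;
        }
        sink = acc;
        return pages;
    }

    /** Plain sequential pass over the mapping; returns the sum of all bytes. */
    static long sequentialSum(MappedByteBuffer buf) {
        long sum = 0;
        for (int pos = 0; pos < buf.limit(); pos++) {
            sum += buf.get(pos);
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        MappedByteBuffer buf = map(Paths.get(args[0]));
        System.out.println("pages touched: " + touchPages(buf));
        System.out.println("byte sum: " + sequentialSum(buf));
    }
}
```

Either loop can stall on a page fault if the data is not resident, and in plain Java code that stall happens outside a JNI call, which is the time-to-safepoint concern discussed in this thread.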

> There is a system call to page the memory in which might be better for
> larger reads. Still no guarantee things stay cached though.
When you fault a page the kernel has no idea how much you are going to
read. If there is a mismatch then you may end up going back and forth to
the device several times and for spinning disk this is worse. If you
express up front what you want to read either by fadvise/madvise or a
buffered read it can do something "smart". Granted IO scheduling ranges
from middling to non-existent most of the time, and the fadvise/madvise
stuff for this has holes I can't recall right now.
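
Java has no direct binding for madvise/fadvise in the standard library. The closest standard call I know of is `MappedByteBuffer.load()`, which makes a best effort to bring the mapped region into physical memory. A minimal sketch follows, with a hypothetical class and method name; whether `load()` issues an madvise hint underneath is a JVM implementation detail I am assuming, not something the API promises.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class Prefetch {
    /**
     * Maps [offset, offset + length) read-only and asks the JVM to page the
     * whole range in up front, so the request size is known before any read.
     */
    static MappedByteBuffer mapAndLoad(Path file, long offset, long length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
            // Best effort only: pages may be evicted again before they are read.
            buf.load();
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        MappedByteBuffer buf = mapAndLoad(file, 0, file.toFile().length());
        // isLoaded() is documented as a hint, not a guarantee.
        System.out.println("resident (hint): " + buf.isLoaded());
    }
}
```

As noted above, there is still no guarantee the pages stay cached between the load and the actual read.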

Ariel

On Mon, Oct 10, 2016, at 02:19 PM, Josh Snyder wrote:
> On Sat, Oct 8, 2016 at 9:02 PM, Ariel Weisberg wrote:
> ...
>
> > You could use this to minimize the cost.
> > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
>
> That StackOverflow headline is interesting. Based on my reading of
> Hotspot's code, it looks like sun.misc.unsafe is used under the hood to
> perform mmapped I/O. I need to learn more about Hotspot's implementation
> before I can comment further.
>
> > Maybe faster than doing buffered IO. It's a lot of cache and TLB
> > misses without prefetching though.
>
> Not sure what you mean here. Aren't there going to be cache and TLB
> misses for any I/O, whether via mmap or syscall?
>
> > There is a system call to page the memory in which might be
> > better for larger reads. Still no guarantee things stay cached though.
>
> The approaches I've seen just involve something in userspace going
> through and touching every desired page. It works, especially if you
> touch pages in parallel.
>
> Thanks for the pointers. If I get anywhere with them, I'll be sure to
> let you know.
>
> Josh
>
> > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
> >> I haven’t studied the read path that carefully, but there might be
> >> a spot at the C* level rather than JVM level where you could
> >> effectively do a JNI touch of the mmap region you’re going to need
> >> next.
> >>
> >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson 
> >>> wrote:
> >>>
> >>> We don’t use Azul’s Zing, but it does have the nice feature that
> >>> all threads don’t have to reach safepoints at the same time. That
> >>> said we make heavy use of Cassandra (with off heap memtables - not
> >>> directly related but allows us a lot more GC headroom) and SOLR
> >>> where we switched to mmap because it FAR out performed pread
> >>> variants - in no cases have we noticed long time to safe point
> >>> (then again our IO is lightning fast).
> >>>

Re: JVM safepoints, mmap, and slow disks

2016-10-10 Thread Josh Snyder
That's a great idea. Even if the results were immediately thrown away,
pre-reading in a JNI method would eliminate cache misses with very high
probability. The only thing I'd worry about is the increased overhead of JNI
interfering with the fast path (cache hits). I don't know enough about the
read path or JNI latency to comment on whether this concern is "real" or
not.

Josh

On Sat, Oct 8, 2016 at 5:21 PM, Graham Sanderson  wrote:
> I haven’t studied the read path that carefully, but there might be a spot at
> the C* level rather than JVM level where you could effectively do a JNI
> touch of the mmap region you’re going to need next.
>
> On Oct 8, 2016, at 7:17 PM, Graham Sanderson  wrote:
>
> We don’t use Azul’s Zing, but it does have the nice feature that all threads
> don’t have to reach safepoints at the same time. That said we make heavy use
> of Cassandra (with off heap memtables - not directly related but allows us a
> lot more GC headroom) and SOLR where we switched to mmap because it FAR out
> performed pread variants - in no cases have we noticed long time to safe
> point (then again our IO is lightning fast).
>
> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad  wrote:
>
> Linux automatically uses free memory as cache.  It's not swap.
>
> http://www.tldp.org/LDP/lki/lki-4.html
>
> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin 
> wrote:
>>
>> Sorry, I'm missing something. What page (memory) cache can exist if
>> there is no swap file?
>> Where are those pages written/read?
>>
>>
>> Best regards, Vladimir Yudovin,
>> Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
>> Launch your cluster in minutes.
>>
>>
>>
>>  On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg
>> wrote 
>>
>> Hi,
>>
>> Nope I mean page cache. Linux doesn't call the cache it maintains using
>> free memory a file cache. It uses free (and some of the time not so free!)
>> memory to buffer writes and to cache recently written/read data.
>>
>> http://www.tldp.org/LDP/lki/lki-4.html
>>
>> When Linux decides it needs free memory it can either evict stuff from the
>> page cache, flush dirty pages and then evict, or swap anonymous memory out.
>> When you disable swap you only disable the last behavior.
>>
>> Maybe we are talking at cross purposes? What I meant is that increasing
>> the heap size to reduce GC frequency is a legitimate thing to do and it does
>> have an impact on the performance of the page cache even if you have swap
>> disabled?
>>
>> Ariel
>>
>>
>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>
>> >Page cache is data pending flush to disk and data cached from disk.
>>
>> Do you mean file cache?
>>
>>
>> Best regards, Vladimir Yudovin,
>> Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
>> Launch your cluster in minutes.
>>
>>
>>  On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg 
>> wrote 
>>
>> Hi,
>>
>> Page cache is in use even if you disable swap. Swap is anonymous memory,
>> and whatever else the Linux kernel supports paging out. Page cache is data
>> pending flush to disk and data cached from disk.
>>
>> Given how bad the GC pauses are in C* I think it's not the high pole in
>> the tent, at least until key things are off heap and C* can run with CMS
>> and get 10 millisecond GCs all day long.
>>
>> You can go through tuning and hardware selection to try to get more
>> consistent IO pauses and remove outliers as you mention, and as a user I
>> think this is your best bet. Generally it's either bad device or filesystem
>> behavior if you get page faults taking more than 200 milliseconds (on the
>> order of a G1 collection).
>>
>> I think a JVM change to allow safe points around memory mapped file access
>> is really unlikely although I agree it would be great. I think the best hack
>> around it is to code up your memory mapped file access into JNI methods and
>> find some way to get that to work. Right now if you want to create a safe
>> point a JNI method is the way to do it. The problem is that JNI methods and
>> POJOs don't get along well.
>>
>> If you think about it the reason non-memory mapped IO works well is that
>> it's all JNI methods so they don't impact time to safe point. I think there
>> is a tradeoff between tolerance for outliers and performance.
>>
>> I don't know the state of the non-memory mapped path and how reliable that
>> is. If it were reliable and I couldn't tolerate the outliers I would use
>> that. I have to ask though, why are you not able to tolerate the outliers?
>> If you are reading and writing at quorum how is this impacting you?
>>
>> Regards,
>> Ariel
>>
>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>>
>> Hi Josh,
>>
>> >Running with increased heap size would reduce GC frequency, at the cost
>> > of page cache.
>>
>> Actually it's recommended to run C* without virtual memory enabled. So if
>> there is not enough memory the JVM fails instead of blocking.
>>
>> Best regards, Vladimir Yudovin,
>> Winguzone - Hosted Cloud Cassandra on Azure a

Re: JVM safepoints, mmap, and slow disks

2016-10-10 Thread Josh Snyder
Do you know if there are any publicly available benchmarks on disk_access_mode,
preferably after the fix from CASSANDRA-10249?

If it turns out that syscall I/O is not significantly slower, I'd consider
switching. If I don't know the costs, I think I'd prefer to stick with the
devil I know how to mitigate (i.e. by policing my block devices) rather than
switching to the devil that is non-standard and undocumented. :)

I may have time to do some benchmarking myself. If so, I'll be sure to inform
the list.

Josh

On Sun, Oct 9, 2016 at 2:39 AM, Benedict Elliott Smith
 wrote:
> The biggest problem with pread was the issue of over reading (reading 64k
> where 4k would suffice), which was significantly improved in 2.2 iirc. I
> don't think the penalty is very significant anymore, and if you are
> experiencing time to safe point issues it's very likely a worthwhile switch
> to flip.
>
>
> On Sunday, 9 October 2016, Graham Sanderson  wrote:
>>
>> I was using the term “touch” loosely to hopefully mean pre-fetch, though I
>> suspect (I think intel has been de-emphasizing) you can still do a sensible
>> prefetch instruction in native code. Even if not you are still better
>> blocking in JNI code - I haven’t looked at the link to see if the correct
>> barriers are enforced by the sun-misc-unsafe method.
>>
>> I do suspect that you’ll see up to about 5-10% sys call overhead if you
>> hit pread.
>>
>> > On Oct 8, 2016, at 11:02 PM, Ariel Weisberg  wrote:
>> >
>> > Hi,
>> >
>> > This is starting to get into dev list territory.
>> >
>> > Interesting idea to touch every 4K page you are going to read.
>> >
>> > You could use this to minimize the cost.
>> >
>> > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
>> >
>> > Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
>> > with out prefetching though.
>> >
>> > There is a system call to page the memory in which might be better for
>> > larger reads. Still no guarantee things stay cached though.
>> >
>> > Ariel
>> >
>> >
>> > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>> >> I haven’t studied the read path that carefully, but there might be a
>> >> spot at the C* level rather than JVM level where you could effectively
>> >> do a JNI touch of the mmap region you’re going to need next.
>> >>
>> >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson wrote:
>> >>>
>> >>> We don’t use Azul’s Zing, but it does have the nice feature that all
>> >>> threads don’t have to reach safepoints at the same time. That said we
>> >>> make heavy use of Cassandra (with off heap memtables - not directly
>> >>> related but allows us a lot more GC headroom) and SOLR where we
>> >>> switched to mmap because it FAR out performed pread variants - in no
>> >>> cases have we noticed long time to safe point (then again our IO is
>> >>> lightning fast).
>> >>>

Re: JVM safepoints, mmap, and slow disks

2016-10-10 Thread Josh Snyder
On Sat, Oct 8, 2016 at 9:02 PM, Ariel Weisberg  wrote:
...

> You could use this to minimize the cost.
> http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652

That StackOverflow headline is interesting. Based on my reading of Hotspot's
code, it looks like sun.misc.unsafe is used under the hood to perform mmapped
I/O. I need to learn more about Hotspot's implementation before I can comment
further.

> Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
> with out prefetching though.

Not sure what you mean here. Aren't there going to be cache and TLB misses for
any I/O, whether via mmap or syscall?

> There is a system call to page the memory in which might be better for
> larger reads. Still no guarantee things stay cached though.

The approaches I've seen just involve something in userspace going through and
touching every desired page. It works, especially if you touch pages in
parallel.

Thanks for the pointers. If I get anywhere with them, I'll be sure to
let you know.

Josh

> On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>> I haven’t studied the read path that carefully, but there might be a spot at 
>> the C* level rather than JVM level where you could effectively do a JNI 
>> touch of the mmap region you’re going to need next.
>>
>>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson  wrote:
>>>
>>> We don’t use Azul’s Zing, but it does have the nice feature that all 
>>> threads don’t have to reach safepoints at the same time. That said we make 
>>> heavy use of Cassandra (with off heap memtables - not directly related but 
>>> allows us a lot more GC headroom) and SOLR where we switched to mmap 
>>> because it FAR out performed pread variants - in no cases have we noticed 
>>> long time to safe point (then again our IO is lightning fast).
>>>

Re: JVM safepoints, mmap, and slow disks

2016-10-09 Thread Benedict Elliott Smith
Well, you seem to be assuming:

1) read ahead is done unconditionally, with an equal claim to disk resources
2) read ahead is actually enabled (tuning recommendations are that it be
disabled, or at least drastically reduced, to my knowledge)
3) read ahead happens synchronously (even if you burn some bandwidth, not
waiting out the increased latency for all blocks means a faster turnaround
to the client)

Ignoring all of this, 64kb is 1/3 default read ahead in Linux, so you're
talking a ~50% increase, which is not an amount I would readily dismiss.

On Sunday, 9 October 2016, Ariel Weisberg  wrote:

> Hi,
>
> Even with memory mapped IO the kernel is going to do read ahead. It seems
> like if the issue is reading too much from the device it isn't going to help
> to use memory mapped files or smaller buffered reads. Maybe it helps by some
> percentage, but it's still going to read quite a bit extra.
>
> Ariel
>
> On Sun, Oct 9, 2016, at 05:39 AM, Benedict Elliott Smith wrote:
>
> The biggest problem with pread was the issue of over reading (reading 64k
> where 4k would suffice), which was significantly improved in 2.2 iirc. I
> don't think the penalty is very significant anymore, and if you are
> experiencing time to safe point issues it's very likely a worthwhile
> switch to flip.
>
> On Sunday, 9 October 2016, Graham Sanderson wrote:
>
> I was using the term “touch” loosely to hopefully mean pre-fetch, though I
> suspect (I think intel has been de-emphasizing) you can still do a sensible
> prefetch instruction in native code. Even if not you are still better
> blocking in JNI code - I haven’t looked at the link to see if the correct
> barriers are enforced by the sun-misc-unsafe method.
>
> I do suspect that you’ll see up to about 5-10% sys call overhead if you
> hit pread.
>
> > On Oct 8, 2016, at 11:02 PM, Ariel Weisberg  wrote:
> >
> > Hi,
> >
> > This is starting to get into dev list territory.
> >
> > Interesting idea to touch every 4K page you are going to read.
> >
> > You could use this to minimize the cost.
> > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
> >
> > Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
> > with out prefetching though.
> >
> > There is a system call to page the memory in which might be better for
> > larger reads. Still no guarantee things stay cached though.
> >
> > Ariel
> >
> >
> > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
> >> I haven’t studied the read path that carefully, but there might be a
> >> spot at the C* level rather than JVM level where you could effectively
> >> do a JNI touch of the mmap region you’re going to need next.
> >>
> >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson wrote:
> >>>
> >>> We don’t use Azul’s Zing, but it does have the nice feature that all
> >>> threads don’t have to reach safepoints at the same time. That said we
> >>> make heavy use of Cassandra (with off heap memtables - not directly
> >>> related but allows us a lot more GC headroom) and SOLR where we
> >>> switched to mmap because it FAR out performed pread variants - in no
> >>> cases have we noticed long time to safe point (then again our IO is
> >>> lightning fast).
> >>>

Re: JVM safepoints, mmap, and slow disks

2016-10-09 Thread Ariel Weisberg
Hi,

Even with memory mapped IO the kernel is going to do read ahead. It
seems like if the issue is reading too much from the device it isn't
going to help to use memory mapped files or smaller buffered reads.
Maybe it helps by some percentage, but it's still going to read quite a
bit extra.

Ariel

On Sun, Oct 9, 2016, at 05:39 AM, Benedict Elliott Smith wrote:
> The biggest problem with pread was the issue of over reading (reading
> 64k where 4k would suffice), which was significantly improved in 2.2
> iirc. I don't think the penalty is very significant anymore, and if
> you are experiencing time to safe point issues it's very likely a
> worthwhile switch to flip.
>
> On Sunday, 9 October 2016, Graham Sanderson  wrote:
>> I was using the term “touch” loosely to hopefully mean pre-fetch,
>> though I suspect (I think intel has been de-emphasizing) you can
>> still do a sensible prefetch instruction in native code. Even if not
>> you are still better blocking in JNI code - I haven’t looked at the
>> link to see if the correct barriers are enforced by the
>> sun-misc-unsafe method.
>>
>>  I do suspect that you’ll see up to about 5-10% sys call overhead if
>>  you hit pread.
>>
>>  > On Oct 8, 2016, at 11:02 PM, Ariel Weisberg 
>>  > wrote:
>>  >
>>  > Hi,
>>  >
>>  > This is starting to get into dev list territory.
>>  >
>>  > Interesting idea to touch every 4K page you are going to read.
>>  >
>>  > You could use this to minimize the cost.
>>  > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
>>  >
>>  > Maybe faster than doing buffered IO. It's a lot of cache and TLB
>>  > misses
>>  > with out prefetching though.
>>  >
>>  > There is a system call to page the memory in which might be
>>  > better for
>>  > larger reads. Still no guarantee things stay cached though.
>>  >
>>  > Ariel
>>  >
>>  >
>>  > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>>  >> I haven’t studied the read path that carefully, but there might
>>  >> be a spot at the C* level rather than JVM level where you could
>>  >> effectively do a JNI touch of the mmap region you’re going to
>>  >> need next.
>>  >>
>>  >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson 
>>  >>> wrote:
>>  >>>
>>  >>> We don’t use Azul’s Zing, but it does have the nice feature that
>>  >>> all threads don’t have to reach safepoints at the same time.
>>  >>> That said we make heavy use of Cassandra (with off heap
>>  >>> memtables - not directly related but allows us a lot more GC
>>  >>> headroom) and SOLR where we switched to mmap because it FAR out
>>  >>> performed pread variants - in no cases have we noticed long time
>>  >>> to safe point (then again our IO is lightning fast).
>>  >>>

Re: JVM safepoints, mmap, and slow disks

2016-10-09 Thread Jeff Jirsa
Potentially relevant reading:
https://issues.apache.org/jira/browse/CASSANDRA-10249

From: Benedict Elliott Smith 
Reply-To: "user@cassandra.apache.org" 
Date: Sunday, October 9, 2016 at 2:39 AM
To: "user@cassandra.apache.org" 
Subject: Re: JVM safepoints, mmap, and slow disks

 

The biggest problem with pread was the issue of over reading (reading 64k where 
4k would suffice), which was significantly improved in 2.2 iirc. I don't think 
the penalty is very significant anymore, and if you are experiencing time to 
safe point issues it's very likely a worthwhile switch to flip.

On Sunday, 9 October 2016, Graham Sanderson  wrote:

I was using the term “touch” loosely to hopefully mean pre-fetch, though I 
suspect (I think intel has been de-emphasizing) you can still do a sensible 
prefetch instruction in native code. Even if not you are still better blocking 
in JNI code - I haven’t looked at the link to see if the correct barriers are 
enforced by the sun-misc-unsafe method.

I do suspect that you’ll see up to about 5-10% sys call overhead if you hit 
pread.

> On Oct 8, 2016, at 11:02 PM, Ariel Weisberg  wrote:
>
> Hi,
>
> This is starting to get into dev list territory.
>
> Interesting idea to touch every 4K page you are going to read.
>
> You could use this to minimize the cost.
> http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
>
> Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
> with out prefetching though.
>
> There is a system call to page the memory in which might be better for
> larger reads. Still no guarantee things stay cached though.
>
> Ariel
>
>
> On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>> I haven’t studied the read path that carefully, but there might be a spot at 
>> the C* level rather than JVM level where you could effectively do a JNI 
>> touch of the mmap region you’re going to need next.
>>
>>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson  wrote:
>>>
>>> We don’t use Azul’s Zing, but it does have the nice feature that all 
>>> threads don’t have to reach safepoints at the same time. That said we make 
>>> heavy use of Cassandra (with off heap memtables - not directly related but 
>>> allows us a lot more GC headroom) and SOLR where we switched to mmap 
>>> because it FAR out performed pread variants - in no cases have we noticed 
>>> long time to safe point (then again our IO is lightning fast).
>>>
>>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad  wrote:
>>>>
>>>> Linux automatically uses free memory as cache.  It's not swap.
>>>>
>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>
>>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin  
>>>> wrote:
>>>>> __
>>>>> Sorry, I don't catch something. What page (memory) cache can exist if 
>>>>> there is no swap file.
>>>>> Where are those page written/read?
>>>>>
>>>>>
>>>>> Best regards, Vladimir Yudovin,
>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra on 
>>>>> Azure and SoftLayer.
>>>>> Launch your cluster in minutes.*
>>>>>
>>>>>  On Sat, 08 Oct 2016 14:09:50 -0400 *Ariel 
>>>>> Weisberg* wrote 
>>>>>> Hi,
>>>>>>
>>>>>> Nope I mean page cache. Linux doesn't call the cache it maintains using 
>>>>>> free memory a file cache. It uses free (and some of the time not so 
>>>>>> free!) memory to buffer writes and to cache recently written/read data.
>>>>>>
>>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>>>
>>>>>> When Linux decides it needs free memory it can either evict stuff from 
>>>>>> the page cache, flush dirty pages and then evict, or swap anonymous 
>>>>>> memory out. When you disable swap you only disable the last behavior.
>>>>>>
>>>>>> Maybe we are talking at cross purposes? What I meant is that increasing 
>>>>>> the heap size to reduce GC frequency is a legitimate thing to do and it 
>>>>>> does have an impact on the performance of the page cache even if you 
>>>>>> have swap disabled?
>>>>>>
>>>>>> Ariel
>>>>>>
>>>>>>
>>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote

Re: JVM safepoints, mmap, and slow disks

2016-10-09 Thread Benedict Elliott Smith
The biggest problem with pread was over-reading (reading 64k where 4k
would suffice), which was significantly improved in 2.2, iirc. I don't
think the penalty is very significant anymore, and if you are
experiencing time-to-safepoint issues it's very likely a worthwhile switch
to flip.
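
To put a number on the over-read described above, here is a minimal sketch of the arithmetic, assuming a reader that rounds every request out to whole chunks of a fixed size. The class and method names are hypothetical, and this is a simplified model rather than Cassandra's actual reader code; only the 64k and 4k figures come from the message.

```java
class ReadAmplification {
    /**
     * Bytes actually fetched when the range [offset, offset + length) is
     * served in whole chunks of chunkSize bytes (a simplified model).
     */
    static long bytesRead(long offset, long length, long chunkSize) {
        if (length <= 0) {
            return 0;
        }
        // Round the start down and the end up to chunk boundaries.
        long start = (offset / chunkSize) * chunkSize;
        long end = ((offset + length + chunkSize - 1) / chunkSize) * chunkSize;
        return end - start;
    }

    public static void main(String[] args) {
        long wanted = 4096; // the 4k that "would suffice"
        for (long chunk : new long[] {4096, 65536}) {
            long fetched = bytesRead(0, wanted, chunk);
            System.out.println("chunk=" + chunk + " fetched=" + fetched
                    + " amplification=" + (fetched / wanted) + "x");
        }
    }
}
```

Under this model a 4k request served from 64k chunks fetches sixteen times the data it needs, which is the kind of penalty a smaller read size removes.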

On Sunday, 9 October 2016, Graham Sanderson  wrote:

> I was using the term “touch” loosely to hopefully mean pre-fetch, though I
> suspect (I think intel has been de-emphasizing) you can still do a sensible
> prefetch instruction in native code. Even if not you are still better
> blocking in JNI code - I haven’t looked at the link to see if the correct
> barriers are enforced by the sun-misc-unsafe method.
>
> I do suspect that you’ll see up to about 5-10% sys call overhead if you
> hit pread.
>
> > On Oct 8, 2016, at 11:02 PM, Ariel Weisberg wrote:
> >
> > Hi,
> >
> > This is starting to get into dev list territory.
> >
> > Interesting idea to touch every 4K page you are going to read.
> >
> > You could use this to minimize the cost.
> > http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
> >
> > Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
> > with out prefetching though.
> >
> > There is a system call to page the memory in which might be better for
> > larger reads. Still no guarantee things stay cached though.
> >
> > Ariel
> >
> >
> > On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
> >> I haven’t studied the read path that carefully, but there might be a
> >> spot at the C* level rather than JVM level where you could effectively do a
> >> JNI touch of the mmap region you’re going to need next.
> >>
> >>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson wrote:
> >>>
> >>> We don’t use Azul’s Zing, but it does have the nice feature that all
> >>> threads don’t have to reach safepoints at the same time. That said we make
> >>> heavy use of Cassandra (with off heap memtables - not directly related but
> >>> allows us a lot more GC headroom) and SOLR where we switched to mmap
> >>> because it FAR out performed pread variants - in no cases have we noticed
> >>> long time to safe point (then again our IO is lightning fast).
> >>>
> >>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad wrote:
> >>>>
> >>>> Linux automatically uses free memory as cache.  It's not swap.
> >>>>
> >>>> http://www.tldp.org/LDP/lki/lki-4.html
> >>>>
> >>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vla...@winguzone.com> wrote:
> >>>>> Sorry, I don't catch something. What page (memory) cache can exist
> >>>>> if there is no swap file.
> >>>>> Where are those page written/read?
> >>>>>
> >>>>>
> >>>>> Best regards, Vladimir Yudovin,
> >>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
> >>>>> Cassandra on Azure and SoftLayer.
> >>>>> Launch your cluster in minutes.*
> >>>>>
> >>>>> On Sat, 08 Oct 2016 14:09:50 -0400 *Ariel Weisberg <ar...@weisberg.ws>* wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Nope I mean page cache. Linux doesn't call the cache it maintains
> >>>>>> using free memory a file cache. It uses free (and some of the time not so
> >>>>>> free!) memory to buffer writes and to cache recently written/read data.
> >>>>>>
> >>>>>> http://www.tldp.org/LDP/lki/lki-4.html
> >>>>>>
> >>>>>> When Linux decides it needs free memory it can either evict stuff
> >>>>>> from the page cache, flush dirty pages and then evict, or swap anonymous
> >>>>>> memory out. When you disable swap you only disable the last behavior.
> >>>>>>
> >>>>>> Maybe we are talking at cross purposes? What I meant is that
> >>>>>> increasing the heap size to reduce GC frequency is a legitimate thing to do
> >>>>>> and it does have an impact on the performance of the page cache even if you
> >>>>>> have swap disabled?
> >>>>>>
> >>>>>> Ariel
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
> >>>>>>> >Page cache is data pending flush to disk and data cached from disk.
> >>>>>>>
> >>>>>>> Do you mean file cache?
> >>>>>>>
> >>>>>>>
> >>>>>>> Best regards, Vladimir Yudovin,
> >>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud
> >>>>>>> Cassandra on Azure and SoftLayer.
> >>>>>>> Launch your cluster in minutes.*
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sat, 08 Oct 2016 13:40:19 -0400 *Ariel Weisberg <ar...@weisberg.ws>* wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Page cache is in use even if you disable swap. Swap is anonymous
> >>>>>>>> memory, and whatever else the Linux kernel supports paging out. Page cache
> >>>>>>>> is data pending flush to disk and data cached from disk.
> >>>>>>>>
> >>>>>>>> Given how bad the GC pauses are in C* I think it's not the high
> >>>>>>>> pole in the tent. Until key things are off heap and C* can run with CMS and
> >>>>>>>> get 10 millisecond GCs all day long.
> >>>>>>>>
> >>>>>>>> You can go through tuning and hardware selection try to get more
> >>>>>>>> consistent IO pauses and remove outliers as you mention and as a user I
> >>>>>>>> think this is you

Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Graham Sanderson
I was using the term “touch” loosely, hoping it would be read as pre-fetch. I
suspect you can still issue a sensible prefetch instruction in native code
(though I think Intel has been de-emphasizing it). Even if not, you are still
better off blocking in JNI code. I haven’t looked at the link to see whether
the correct barriers are enforced by the sun.misc.Unsafe method.

I do suspect that you’ll see up to about 5-10% syscall overhead if you hit
pread.
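
For readers who want to see the two read paths being compared, here is a minimal Java sketch. It assumes the positional `FileChannel.read(ByteBuffer, long)` call is serviced by a pread-style system call on Linux (typical for HotSpot, but an assumption here), while the mapped path reads through a `MappedByteBuffer`. The class and helper names are hypothetical, and the sketch only shows that both paths return the same bytes; it does not measure the overhead mentioned above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

class PreadVsMmap {
    /** Writes data to a temporary file that is deleted when the JVM exits. */
    static Path writeTempFile(byte[] data) {
        try {
            Path p = Files.createTempFile("pread-vs-mmap", ".bin");
            p.toFile().deleteOnExit();
            Files.write(p, data);
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Positional read: goes through a system call, so the thread is "in native" while it blocks. */
    static byte[] readPositional(Path p, long pos, int len) {
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            ByteBuffer dst = ByteBuffer.allocate(len);
            while (dst.hasRemaining()) {
                int n = ch.read(dst, pos + dst.position());
                if (n < 0) {
                    break; // reached end of file before len bytes were read
                }
            }
            return Arrays.copyOf(dst.array(), dst.position());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Mapped read: a plain memory access that may page-fault inside Java code. */
    static byte[] readMapped(Path p, long pos, int len) {
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
            byte[] out = new byte[len];
            map.get(out);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[10000];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        Path p = writeTempFile(data);
        byte[] a = readPositional(p, 5000, 100);
        byte[] b = readMapped(p, 5000, 100);
        System.out.println("same bytes from both paths: " + Arrays.equals(a, b));
    }
}
```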

> On Oct 8, 2016, at 11:02 PM, Ariel Weisberg  wrote:
> 
> Hi,
> 
> This is starting to get into dev list territory.
> 
> Interesting idea to touch every 4K page you are going to read.
> 
> You could use this to minimize the cost.
> http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
> 
> Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
> with out prefetching though.
> 
> There is a system call to page the memory in which might be better for
> larger reads. Still no guarantee things stay cached though.
> 
> Ariel
> 
> 
> On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>> I haven’t studied the read path that carefully, but there might be a spot at 
>> the C* level rather than JVM level where you could effectively do a JNI 
>> touch of the mmap region you’re going to need next.
>> 
>>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson  wrote:
>>> 
>>> We don’t use Azul’s Zing, but it does have the nice feature that all 
>>> threads don’t have to reach safepoints at the same time. That said we make 
>>> heavy use of Cassandra (with off heap memtables - not directly related but 
>>> allows us a lot more GC headroom) and SOLR where we switched to mmap 
>>> because it FAR out performed pread variants - in no cases have we noticed 
>>> long time to safe point (then again our IO is lightning fast).
>>> 
>>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad wrote:
>>>>
>>>> Linux automatically uses free memory as cache.  It's not swap.
>>>>
>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>
>>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin wrote:
>>>>> Sorry, I don't catch something. What page (memory) cache can exist if
>>>>> there is no swap file.
>>>>> Where are those page written/read?
>>>>>
>>>>>
>>>>> Best regards, Vladimir Yudovin,
>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra on
>>>>> Azure and SoftLayer.
>>>>> Launch your cluster in minutes.*
>>>>>
>>>>> On Sat, 08 Oct 2016 14:09:50 -0400 *Ariel Weisberg* wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Nope I mean page cache. Linux doesn't call the cache it maintains using
>>>>>> free memory a file cache. It uses free (and some of the time not so
>>>>>> free!) memory to buffer writes and to cache recently written/read data.
>>>>>>
>>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>>>
>>>>>> When Linux decides it needs free memory it can either evict stuff from
>>>>>> the page cache, flush dirty pages and then evict, or swap anonymous
>>>>>> memory out. When you disable swap you only disable the last behavior.
>>>>>>
>>>>>> Maybe we are talking at cross purposes? What I meant is that increasing
>>>>>> the heap size to reduce GC frequency is a legitimate thing to do and it
>>>>>> does have an impact on the performance of the page cache even if you
>>>>>> have swap disabled?
>>>>>>
>>>>>> Ariel
>>>>>>
>>>>>>
>>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>>>>>> >Page cache is data pending flush to disk and data cached from disk.
>>>>>>>
>>>>>>> Do you mean file cache?
>>>>>>>
>>>>>>>
>>>>>>> Best regards, Vladimir Yudovin,
>>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra
>>>>>>> on Azure and SoftLayer.
>>>>>>> Launch your cluster in minutes.*
>>>>>>>
>>>>>>>
>>>>>>> On Sat, 08 Oct 2016 13:40:19 -0400 *Ariel Weisberg* wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Page cache is in use even if you disable swap. Swap is anonymous
>>>>>>>> memory, and whatever else the Linux kernel supports paging out. Page
>>>>>>>> cache is data pending flush to disk and data cached from disk.
>>>>>>>>
>>>>>>>> Given how bad the GC pauses are in C* I think it's not the high pole
>>>>>>>> in the tent. Until key things are off heap and C* can run with CMS and
>>>>>>>> get 10 millisecond GCs all day long.
>>>>>>>>
>>>>>>>> You can go through tuning and hardware selection try to get more
>>>>>>>> consistent IO pauses and remove outliers as you mention and as a user
>>>>>>>> I think this is your best bet. Generally it's either bad device or
>>>>>>>> filesystem behavior if you get page faults taking more than 200
>>>>>>>> milliseconds O(G1 gc collection).
>>>>>>>>
>>>>>>>> I think a JVM change to allow safe points around memory mapped file
>>>>>>>> access is really unlikely although I agree it would be great. I think
>>>>>>>> the best hack around it is to code up your memory mapped file access
>>>>>>>

Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Ariel Weisberg
Hi,

This is starting to get into dev list territory.

Interesting idea to touch every 4K page you are going to read.

You could use this to minimize the cost.
http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652

Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
without prefetching though.

There is a system call to page the memory in, which might be better for
larger reads. Still no guarantee things stay cached though.
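
As an illustration of the access pattern only, here is a minimal Java sketch that reads one byte from every 4K page of a mapped range. The class name, helper methods, and the fixed 4096-byte page size are assumptions for the example. Doing the touch from plain Java still takes any page fault inside Java code, so this sketch does not by itself address time to safepoint; the idea in this thread is to do the equivalent from native code. The JDK's own `MappedByteBuffer.load()` is a best-effort alternative that asks for the whole mapping to be brought into physical memory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class PageToucher {
    // Assumption: 4 KiB pages, as in the message above; the real page size can differ.
    static final int PAGE_SIZE = 4096;

    // Written to so the JIT cannot discard the page reads as dead code.
    static volatile byte sink;

    /**
     * Reads one byte from each page overlapping [offset, offset + length)
     * and returns the number of pages touched. Assumes buffer index 0 is
     * page-aligned, which holds when the file is mapped from position 0.
     */
    static int touchPages(MappedByteBuffer buf, int offset, int length) {
        if (offset < 0 || length <= 0 || offset >= buf.limit()) {
            return 0;
        }
        long end = Math.min((long) buf.limit(), (long) offset + length);
        int touched = 0;
        byte acc = 0;
        for (long pos = (offset / PAGE_SIZE) * (long) PAGE_SIZE; pos < end; pos += PAGE_SIZE) {
            acc ^= buf.get((int) pos); // absolute get: may fault the page in
            touched++;
        }
        sink = acc;
        return touched;
    }

    /** Maps a zero-filled temporary file of the given size, read-only, from position 0. */
    static MappedByteBuffer mapTempFile(int size) {
        try {
            Path p = Files.createTempFile("touch-sketch", ".bin");
            p.toFile().deleteOnExit();
            Files.write(p, new byte[size]);
            try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
                // The mapping stays valid after the channel is closed.
                return ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MappedByteBuffer buf = mapTempFile(10 * PAGE_SIZE + 1);
        System.out.println("pages touched: " + touchPages(buf, 0, buf.limit()));
    }
}
```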

Ariel


On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
> I haven’t studied the read path that carefully, but there might be a spot at 
> the C* level rather than JVM level where you could effectively do a JNI touch 
> of the mmap region you’re going to need next.
> 
>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson  wrote:
>> 
>> We don’t use Azul’s Zing, but it does have the nice feature that all threads 
>> don’t have to reach safepoints at the same time. That said we make heavy use 
>> of Cassandra (with off heap memtables - not directly related but allows us a 
>> lot more GC headroom) and SOLR where we switched to mmap because it FAR out 
>> performed pread variants - in no cases have we noticed long time to safe 
>> point (then again our IO is lightning fast).
>> 
>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad  wrote:
>>> 
>>> Linux automatically uses free memory as cache.  It's not swap.
>>> 
>>> http://www.tldp.org/LDP/lki/lki-4.html
>>> 
>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin  
>>> wrote:
>>>> Sorry, I don't catch something. What page (memory) cache can exist if
>>>> there is no swap file.
>>>> Where are those page written/read?
>>>>
>>>>
>>>> Best regards, Vladimir Yudovin,
>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra on
>>>> Azure and SoftLayer.
>>>> Launch your cluster in minutes.*
>>>>
>>>> On Sat, 08 Oct 2016 14:09:50 -0400 *Ariel Weisberg* wrote:
>>>>> Hi,
>>>>>
>>>>> Nope I mean page cache. Linux doesn't call the cache it maintains using
>>>>> free memory a file cache. It uses free (and some of the time not so
>>>>> free!) memory to buffer writes and to cache recently written/read data.
>>>>>
>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>>
>>>>> When Linux decides it needs free memory it can either evict stuff from
>>>>> the page cache, flush dirty pages and then evict, or swap anonymous
>>>>> memory out. When you disable swap you only disable the last behavior.
>>>>>
>>>>> Maybe we are talking at cross purposes? What I meant is that increasing
>>>>> the heap size to reduce GC frequency is a legitimate thing to do and it
>>>>> does have an impact on the performance of the page cache even if you have
>>>>> swap disabled?
>>>>>
>>>>> Ariel
>>>>>
>>>>>
>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>>>>> >Page cache is data pending flush to disk and data cached from disk.
>>>>>>
>>>>>> Do you mean file cache?
>>>>>>
>>>>>>
>>>>>> Best regards, Vladimir Yudovin,
>>>>>> *Winguzone[https://winguzone.com/?from=list] - Hosted Cloud Cassandra on
>>>>>> Azure and SoftLayer.
>>>>>> Launch your cluster in minutes.*
>>>>>>
>>>>>>
>>>>>> On Sat, 08 Oct 2016 13:40:19 -0400 *Ariel Weisberg* wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Page cache is in use even if you disable swap. Swap is anonymous
>>>>>>> memory, and whatever else the Linux kernel supports paging out. Page
>>>>>>> cache is data pending flush to disk and data cached from disk.
>>>>>>>
>>>>>>> Given how bad the GC pauses are in C* I think it's not the high pole in
>>>>>>> the tent. Until key things are off heap and C* can run with CMS and get
>>>>>>> 10 millisecond GCs all day long.
>>>>>>>
>>>>>>> You can go through tuning and hardware selection try to get more
>>>>>>> consistent IO pauses and remove outliers as you mention and as a user I
>>>>>>> think this is your best bet. Generally it's either bad device or
>>>>>>> filesystem behavior if you get page faults taking more than 200
>>>>>>> milliseconds O(G1 gc collection).
>>>>>>>
>>>>>>> I think a JVM change to allow safe points around memory mapped file
>>>>>>> access is really unlikely although I agree it would be great. I think
>>>>>>> the best hack around it is to code up your memory mapped file access
>>>>>>> into JNI methods and find some way to get that to work. Right now if
>>>>>>> you want to create a safe point a JNI method is the way to do it. The
>>>>>>> problem is that JNI methods and POJOs don't get along well.
>>>>>>>
>>>>>>> If you think about it the reason non-memory mapped IO works well is
>>>>>>> that it's all JNI methods so they don't impact time to safe point. I
>>>>>>> think there is a tradeoff between tolerance for outliers and
>>>>>>> performance.
>>>>>>>
>>>>>>> I don't know the state of the non-memory mapped path and how reliable
>>>>>>> that is. If it were reliable and I couldn't tolerate

Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Graham Sanderson
I haven’t studied the read path that carefully, but there might be a spot at 
the C* level rather than JVM level where you could effectively do a JNI touch 
of the mmap region you’re going to need next.

> On Oct 8, 2016, at 7:17 PM, Graham Sanderson  wrote:
> 
> We don’t use Azul’s Zing, but it does have the nice feature that all threads 
> don’t have to reach safepoints at the same time. That said we make heavy use 
> of Cassandra (with off heap memtables - not directly related but allows us a 
> lot more GC headroom) and SOLR where we switched to mmap because it FAR out 
> performed pread variants - in no cases have we noticed long time to safe 
> point (then again our IO is lightning fast).
> 
>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad wrote:
>> 
>> Linux automatically uses free memory as cache.  It's not swap.
>> 
>> http://www.tldp.org/LDP/lki/lki-4.html 
>> 
>> 
>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin wrote:
>> Sorry, I don't catch something. What page (memory) cache can exist if there 
>> is no swap file.
>> Where are those page written/read?
>> 
>> 
>> Best regards, Vladimir Yudovin, 
>> Winguzone  - Hosted Cloud Cassandra on 
>> Azure and SoftLayer.
>> Launch your cluster in minutes.
>> 
>> 
>> 
>> On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg wrote:
>> Hi,
>> 
>> Nope I mean page cache. Linux doesn't call the cache it maintains using free 
>> memory a file cache. It uses free (and some of the time not so free!) memory 
>> to buffer writes and to cache recently written/read data.
>> 
>> http://www.tldp.org/LDP/lki/lki-4.html 
>> 
>> 
>> When Linux decides it needs free memory it can either evict stuff from the 
>> page cache, flush dirty pages and then evict, or swap anonymous memory out. 
>> When you disable swap you only disable the last behavior.
>> 
>> Maybe we are talking at cross purposes? What I meant is that increasing the 
>> heap size to reduce GC frequency is a legitimate thing to do and it does 
>> have an impact on the performance of the page cache even if you have swap 
>> disabled?
>> 
>> Ariel
>> 
>> 
>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>> >Page cache is data pending flush to disk and data cached from disk.
>> 
>> Do you mean file cache?
>> 
>> 
>> Best regards, Vladimir Yudovin, 
>> Winguzone  - Hosted Cloud Cassandra on 
>> Azure and SoftLayer.
>> Launch your cluster in minutes.
>> 
>> 
>> On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg wrote:
>> Hi,
>> 
>> Page cache is in use even if you disable swap. Swap is anonymous memory, and 
>> whatever else the Linux kernel supports paging out. Page cache is data 
>> pending flush to disk and data cached from disk.
>> 
>> Given how bad the GC pauses are in C* I think it's not the high pole in the 
>> tent. Until key things are off heap and C* can run with CMS and get 10 
>> millisecond GCs all day long.
>> 
>> You can go through tuning and hardware selection try to get more consistent 
>> IO pauses and remove outliers as you mention and as a user I think this is 
>> your best bet. Generally it's either bad device or filesystem behavior if 
>> you get page faults taking more than 200 milliseconds O(G1 gc collection).
>> 
>> I think a JVM change to allow safe points around memory mapped file access 
>> is really unlikely although I agree it would be great. I think the best hack 
>> around it is to code up your memory mapped file access into JNI methods and 
>> find some way to get that to work. Right now if you want to create a safe 
>> point a JNI method is the way to do it. The problem is that JNI methods and 
>> POJOs don't get along well.
>> 
>> If you think about it the reason non-memory mapped IO works well is that 
>> it's all JNI methods so they don't impact time to safe point. I think there 
>> is a tradeoff between tolerance for outliers and performance.
>> 
>> I don't know the state of the non-memory mapped path and how reliable that 
>> is. If it were reliable and I couldn't tolerate the outliers I would use 
>> that. I have to ask though, why are you not able to tolerate the outliers? 
>> If you are reading and writing at quorum how is this impacting you?
>> 
>> Regards,
>> Ariel
>> 
>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>> Hi Josh,
>> 
>> >Running with increased heap size would reduce GC frequency, at the cost of 
>> >page cache.
>> 
>> Actually it's recommended to run C* without virtual memory enabled. So if 
>> there is no enough memory JVM fails instead of blocking
>> 
>> Best regards, Vladimir Yudovin, 
>> Winguzone  - Hosted Cloud Cassandra on 
>> Azure and SoftLayer.
>> Launch your cluster in minutes.
>> 
>> 
>>  On Fri, 07 Oc

Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Graham Sanderson
We don’t use Azul’s Zing, but it does have the nice feature that not all
threads have to reach safepoints at the same time. That said, we make heavy use
of Cassandra (with off-heap memtables; not directly related, but it allows us a
lot more GC headroom) and SOLR, where we switched to mmap because it FAR
outperformed the pread variants. In no case have we noticed a long time to
safepoint (then again, our IO is lightning fast).

> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad  wrote:
> 
> Linux automatically uses free memory as cache.  It's not swap.
> 
> http://www.tldp.org/LDP/lki/lki-4.html 
> 
> 
> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin wrote:
> Sorry, I don't catch something. What page (memory) cache can exist if there 
> is no swap file.
> Where are those page written/read?
> 
> 
> Best regards, Vladimir Yudovin, 
> Winguzone  - Hosted Cloud Cassandra on 
> Azure and SoftLayer.
> Launch your cluster in minutes.
> 
> 
> 
> On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg wrote:
> Hi,
> 
> Nope I mean page cache. Linux doesn't call the cache it maintains using free 
> memory a file cache. It uses free (and some of the time not so free!) memory 
> to buffer writes and to cache recently written/read data.
> 
> http://www.tldp.org/LDP/lki/lki-4.html 
> 
> 
> When Linux decides it needs free memory it can either evict stuff from the 
> page cache, flush dirty pages and then evict, or swap anonymous memory out. 
> When you disable swap you only disable the last behavior.
> 
> Maybe we are talking at cross purposes? What I meant is that increasing the 
> heap size to reduce GC frequency is a legitimate thing to do and it does have 
> an impact on the performance of the page cache even if you have swap disabled?
> 
> Ariel
> 
> 
> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
> >Page cache is data pending flush to disk and data cached from disk.
> 
> Do you mean file cache?
> 
> 
> Best regards, Vladimir Yudovin, 
> Winguzone  - Hosted Cloud Cassandra on 
> Azure and SoftLayer.
> Launch your cluster in minutes.
> 
> 
> On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg wrote:
> Hi,
> 
> Page cache is in use even if you disable swap. Swap is anonymous memory, and 
> whatever else the Linux kernel supports paging out. Page cache is data 
> pending flush to disk and data cached from disk.
> 
> Given how bad the GC pauses are in C* I think it's not the high pole in the 
> tent. Until key things are off heap and C* can run with CMS and get 10 
> millisecond GCs all day long.
> 
> You can go through tuning and hardware selection try to get more consistent 
> IO pauses and remove outliers as you mention and as a user I think this is 
> your best bet. Generally it's either bad device or filesystem behavior if you 
> get page faults taking more than 200 milliseconds O(G1 gc collection).
> 
> I think a JVM change to allow safe points around memory mapped file access is 
> really unlikely although I agree it would be great. I think the best hack 
> around it is to code up your memory mapped file access into JNI methods and 
> find some way to get that to work. Right now if you want to create a safe 
> point a JNI method is the way to do it. The problem is that JNI methods and 
> POJOs don't get along well.
> 
> If you think about it the reason non-memory mapped IO works well is that it's 
> all JNI methods so they don't impact time to safe point. I think there is a 
> tradeoff between tolerance for outliers and performance.
> 
> I don't know the state of the non-memory mapped path and how reliable that 
> is. If it were reliable and I couldn't tolerate the outliers I would use 
> that. I have to ask though, why are you not able to tolerate the outliers? If 
> you are reading and writing at quorum how is this impacting you?
> 
> Regards,
> Ariel
> 
> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
> Hi Josh,
> 
> >Running with increased heap size would reduce GC frequency, at the cost of 
> >page cache.
> 
> Actually it's recommended to run C* without virtual memory enabled. So if 
> there is no enough memory JVM fails instead of blocking
> 
> Best regards, Vladimir Yudovin, 
> Winguzone  - Hosted Cloud Cassandra on 
> Azure and SoftLayer.
> Launch your cluster in minutes.
> 
> 
> On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder wrote:
> Hello cassandra-users, 
> 
> I'm investigating an issue with JVMs taking a while to reach a safepoint. I'd 
> like the list's input on confirming my hypothesis and finding mitigations. 
> 
> My hypothesis is that slow block devices are causing Cassandra's JVM to pause 
> completely while attempting to reach a safepoint. 
> 
> Background: 
>

Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Jonathan Haddad
Linux automatically uses free memory as cache.  It's not swap.

http://www.tldp.org/LDP/lki/lki-4.html
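
One way to see the distinction on a running Linux box is to compare the `Cached` and `SwapTotal` fields of `/proc/meminfo`: the first reports page cache, the second reports configured swap, and the page cache figure is normally non-zero even when swap is zero. The sketch below is a hedged illustration; the class name and parsing approach are assumptions, and the sample text used when `/proc/meminfo` is unavailable contains made-up values, not measurements.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class MeminfoSketch {
    /** Parses "Name:   12345 kB" lines into a name -> value map (values as reported, usually kB). */
    static Map<String, Long> parse(String text) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String[] parts = line.substring(colon + 1).trim().split("\\s+");
            if (parts[0].isEmpty()) {
                continue;
            }
            try {
                out.put(line.substring(0, colon).trim(), Long.parseLong(parts[0]));
            } catch (NumberFormatException e) {
                // Skip lines whose first field is not a number.
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path meminfo = Paths.get("/proc/meminfo");
        String text;
        if (Files.isReadable(meminfo)) {
            text = new String(Files.readAllBytes(meminfo), StandardCharsets.UTF_8);
        } else {
            // Illustrative values only, for systems without /proc/meminfo.
            text = "MemTotal: 100 kB\nCached: 40 kB\nSwapTotal: 0 kB\n";
        }
        Map<String, Long> m = parse(text);
        System.out.println("Cached (page cache): " + m.get("Cached"));
        System.out.println("SwapTotal (swap):    " + m.get("SwapTotal"));
    }
}
```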

On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin 
wrote:

> Sorry, I don't catch something. What page (memory) cache can exist if
> there is no swap file.
> Where are those page written/read?
>
>
> Best regards, Vladimir Yudovin,
>
>
> *Winguzone  - Hosted Cloud Cassandra on
> Azure and SoftLayer.Launch your cluster in minutes.*
>
>
> On Sat, 08 Oct 2016 14:09:50 -0400 *Ariel Weisberg* wrote:
>
> Hi,
>
> Nope I mean page cache. Linux doesn't call the cache it maintains using
> free memory a file cache. It uses free (and some of the time not so free!)
> memory to buffer writes and to cache recently written/read data.
>
> http://www.tldp.org/LDP/lki/lki-4.html
>
> When Linux decides it needs free memory it can either evict stuff from the
> page cache, flush dirty pages and then evict, or swap anonymous memory out.
> When you disable swap you only disable the last behavior.
>
> Maybe we are talking at cross purposes? What I meant is that increasing
> the heap size to reduce GC frequency is a legitimate thing to do and it
> does have an impact on the performance of the page cache even if you have
> swap disabled?
>
> Ariel
>
>
> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>
> >Page cache is data pending flush to disk and data cached from disk.
>
> Do you mean file cache?
>
>
> Best regards, Vladimir Yudovin,
>
>
> *Winguzone  - Hosted Cloud Cassandra on
> Azure and SoftLayer.Launch your cluster in minutes.*
>
>
> On Sat, 08 Oct 2016 13:40:19 -0400 *Ariel Weisberg* wrote:
>
> Hi,
>
> Page cache is in use even if you disable swap. Swap is anonymous memory,
> and whatever else the Linux kernel supports paging out. Page cache is data
> pending flush to disk and data cached from disk.
>
> Given how bad the GC pauses are in C* I think it's not the high pole in
> the tent. Until key things are off heap and C* can run with CMS and get 10
> millisecond GCs all day long.
>
> You can go through tuning and hardware selection try to get more
> consistent IO pauses and remove outliers as you mention and as a user I
> think this is your best bet. Generally it's either bad device or filesystem
> behavior if you get page faults taking more than 200 milliseconds O(G1 gc
> collection).
>
> I think a JVM change to allow safe points around memory mapped file access
> is really unlikely although I agree it would be great. I think the best
> hack around it is to code up your memory mapped file access into JNI
> methods and find some way to get that to work. Right now if you want to
> create a safe point a JNI method is the way to do it. The problem is that
> JNI methods and POJOs don't get along well.
>
> If you think about it the reason non-memory mapped IO works well is that
> it's all JNI methods so they don't impact time to safe point. I think there
> is a tradeoff between tolerance for outliers and performance.
>
> I don't know the state of the non-memory mapped path and how reliable that
> is. If it were reliable and I couldn't tolerate the outliers I would use
> that. I have to ask though, why are you not able to tolerate the outliers?
> If you are reading and writing at quorum how is this impacting you?
>
> Regards,
> Ariel
>
> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>
> Hi Josh,
>
> >Running with increased heap size would reduce GC frequency, at the cost
> of page cache.
>
> Actually it's recommended to run C* without virtual memory enabled. So if
> there is no enough memory JVM fails instead of blocking
>
> Best regards, Vladimir Yudovin,
>
> *Winguzone  - Hosted Cloud Cassandra on
> Azure and SoftLayer.Launch your cluster in minutes.*
>
>
> On Fri, 07 Oct 2016 21:06:24 -0400 *Josh Snyder* wrote:
>
> Hello cassandra-users,
>
> I'm investigating an issue with JVMs taking a while to reach a safepoint.
> I'd
> like the list's input on confirming my hypothesis and finding mitigations.
>
> My hypothesis is that slow block devices are causing Cassandra's JVM to
> pause
> completely while attempting to reach a safepoint.
>
> Background:
>
> Hotspot occasionally performs maintenance tasks that necessitate stopping
> all
> of its threads. Threads running JITed code occasionally read from a given
> safepoint page. If Hotspot has initiated a safepoint, reading from that
> page
> essentially catapults the thread into purgatory until the safepoint
> completes
> (the mechanism behind this is pretty cool). Threads performing syscalls or
> executing native code do this check upon their return into the JVM.
>
> In this way, during the safepoint Hotspot can be sure that all of its
> threads
> are either patiently waiting for safepoint completion or in a system call.
>
> Cassandra makes heavy use of mmapped reads in normal operation. When doing
> mmapped reads, the JVM execute

Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Vladimir Yudovin
Sorry, I'm missing something. What page (memory) cache can exist if there is
no swap file?
Where are those pages written and read?

Best regards, Vladimir Yudovin, 
Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
Launch your cluster in minutes.




 On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg 
wrote  

Hi,

 

 Nope I mean page cache. Linux doesn't call the cache it maintains using free 
memory a file cache. It uses free (and some of the time not so free!) memory to 
buffer writes and to cache recently written/read data.

 

 http://www.tldp.org/LDP/lki/lki-4.html

 

 When Linux decides it needs free memory it can either evict stuff from the 
page cache, flush dirty pages and then evict, or swap anonymous memory out. 
When you disable swap you only disable the last behavior.

 

 Maybe we are talking at cross purposes? What I meant is that increasing the 
heap size to reduce GC frequency is a legitimate thing to do and it does have 
an impact on the performance of the page cache even if you have swap disabled?

 

 Ariel

 

 


Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Ariel Weisberg
Hi,

Nope I mean page cache. Linux doesn't call the cache it maintains using
free memory a file cache. It uses free (and some of the time not so
free!) memory to buffer writes and to cache recently written/read data.

http://www.tldp.org/LDP/lki/lki-4.html

When Linux decides it needs free memory it can either evict stuff from
the page cache, flush dirty pages and then evict, or swap anonymous
memory out. When you disable swap you only disable the last behavior.

Maybe we are talking at cross purposes? What I meant is that increasing
the heap size to reduce GC frequency is a legitimate thing to do, and
that it does have an impact on the performance of the page cache even if
you have swap disabled: memory given to the heap is memory the kernel
can no longer use for caching.

Ariel


On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
> >Page cache is data pending flush to disk and data cached from disk.
>
> Do you mean file cache?
>
>
> Best regards, Vladimir Yudovin,
> *Winguzone[1] - Hosted Cloud Cassandra on Azure and SoftLayer. Launch
> your cluster in minutes.*
>
>
> On Sat, 08 Oct 2016 13:40:19 -0400 *Ariel Weisberg* wrote
>> Hi,
>>
>> Page cache is in use even if you disable swap. Swap is anonymous
>> memory, and whatever else the Linux kernel supports paging out. Page
>> cache is data pending flush to disk and data cached from disk.
>>
>> Given how bad the GC pauses are in C* I think it's not the high pole
>> in the tent. Until key things are off heap and C* can run with CMS
>> and get 10 millisecond GCs all day long.
>>
>> You can go through tuning and hardware selection try to get more
>> consistent IO pauses and remove outliers as you mention and as a user
>> I think this is your best bet. Generally it's either bad device or
>> filesystem behavior if you get page faults taking more than 200
>> milliseconds O(G1 gc collection).
>>
>> I think a JVM change to allow safe points around memory mapped file
>> access is really unlikely although I agree it would be great. I think
>> the best hack around it is to code up your memory mapped file access
>> into JNI methods and find some way to get that to work. Right now if
>> you want to create a safe point a JNI method is the way to do it. The
>> problem is that JNI methods and POJOs don't get along well.
>>
>> If you think about it the reason non-memory mapped IO works well is
>> that it's all JNI methods so they don't impact time to safe point. I
>> think there is a tradeoff between tolerance for outliers and
>> performance.
>>
>> I don't know the state of the non-memory mapped path and how reliable
>> that is. If it were reliable and I couldn't tolerate the outliers I
>> would use that. I have to ask though, why are you not able to
>> tolerate the outliers? If you are reading and writing at quorum how
>> is this impacting you?
>>
>> Regards,
>> Ariel
>>
>> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
>>> Hi Josh,
>>>
>>> >Running with increased heap size would reduce GC frequency, at the
>>> >cost of page cache.
>>>
>>> Actually  it's recommended to run C* without virtual memory enabled.
>>> So if there  is no enough memory JVM fails instead of blocking
>>>
>>> Best regards, Vladimir Yudovin,
>>> *Winguzone[2] - Hosted Cloud Cassandra on Azure and SoftLayer.
>>> Launch your cluster in minutes.*
>>>
>>>
>>> On Fri, 07 Oct 2016 21:06:24 -0400 *Josh Snyder* wrote
>>>> Hello cassandra-users,
>>>>
>>>> I'm investigating an issue with JVMs taking a while to reach a
>>>> safepoint. I'd like the list's input on confirming my hypothesis and
>>>> finding mitigations.
>>>>
>>>> My hypothesis is that slow block devices are causing Cassandra's JVM
>>>> to pause completely while attempting to reach a safepoint.
>>>>
>>>> Background:
>>>>
>>>> Hotspot occasionally performs maintenance tasks that necessitate
>>>> stopping all of its threads. Threads running JITed code occasionally
>>>> read from a given safepoint page. If Hotspot has initiated a
>>>> safepoint, reading from that page essentially catapults the thread
>>>> into purgatory until the safepoint completes (the mechanism behind
>>>> this is pretty cool). Threads performing syscalls or executing native
>>>> code do this check upon their return into the JVM.
>>>>
>>>> In this way, during the safepoint Hotspot can be sure that all of its
>>>> threads are either patiently waiting for safepoint completion or in a
>>>> system call.
>>>>
>>>> Cassandra makes heavy use of mmapped reads in normal operation. When
>>>> doing mmapped reads, the JVM executes userspace code to effect a read
>>>> from a file. On the fast path (when the page needed is already mapped
>>>> into the process), this instruction is very fast. When the page is
>>>> not cached, the CPU triggers a page fault and asks the OS to go fetch
>>>> the page. The JVM doesn't even realize that anything interesting is
>>>> happening: to it, the thread is just executing a mov instruction that
>>>> happens to take a while.
>>>>
>>>> The OS

Re: Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Vladimir Yudovin
>Page cache is data pending flush to disk and data cached from disk.

Do you mean file cache?


Best regards, Vladimir Yudovin, 
Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
Launch your cluster in minutes.




 On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg wrote

> Hi,
>
> Page cache is in use even if you disable swap. Swap is anonymous memory, and
> whatever else the Linux kernel supports paging out. Page cache is data pending
> flush to disk and data cached from disk.
>
> Given how bad the GC pauses are in C* I think it's not the high pole in the
> tent. Until key things are off heap and C* can run with CMS and get 10
> millisecond GCs all day long.
>
> You can go through tuning and hardware selection try to get more consistent IO
> pauses and remove outliers as you mention and as a user I think this is your
> best bet. Generally it's either bad device or filesystem behavior if you get
> page faults taking more than 200 milliseconds O(G1 gc collection).
>
> I think a JVM change to allow safe points around memory mapped file access is
> really unlikely although I agree it would be great. I think the best hack
> around it is to code up your memory mapped file access into JNI methods and
> find some way to get that to work. Right now if you want to create a safe point
> a JNI method is the way to do it. The problem is that JNI methods and POJOs
> don't get along well.
>
> If you think about it the reason non-memory mapped IO works well is that it's
> all JNI methods so they don't impact time to safe point. I think there is a
> tradeoff between tolerance for outliers and performance.
>
> I don't know the state of the non-memory mapped path and how reliable that is.
> If it were reliable and I couldn't tolerate the outliers I would use that. I
> have to ask though, why are you not able to tolerate the outliers? If you are
> reading and writing at quorum how is this impacting you?
>
> Regards,
> Ariel
 


Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Ariel Weisberg
Hi,

Page cache is in use even if you disable swap. Swap is anonymous memory,
and whatever else the Linux kernel supports paging out. Page cache is
data pending flush to disk and data cached from disk.

Given how bad the GC pauses are in C*, I don't think this is the long
pole in the tent, at least not until key things are off heap and C* can
run with CMS and get 10 millisecond GCs all day long.

You can go through tuning and hardware selection to try to get more
consistent IO pauses and remove outliers, as you mention, and as a user
I think this is your best bet. Generally it's either bad device or
filesystem behavior if you get page faults taking more than 200
milliseconds, which is on the order of a G1 GC collection.

I think a JVM change to allow safe points around memory mapped file
access is really unlikely although I agree it would be great. I think
the best hack around it is to code up your memory mapped file access
into JNI methods and find some way to get that to work. Right now if you
want to create a safe point a JNI method is the way to do it. The
problem is that JNI methods and POJOs don't get along well.

If you think about it, the reason non-memory-mapped IO works well is
that it's all JNI methods, and a thread inside a native method doesn't
hold up a safepoint, so slow reads there don't impact time to safepoint.
I think there is a tradeoff between tolerance for outliers and
performance.

I don't know the state of the non-memory mapped path and how reliable
that is. If it were reliable and I couldn't tolerate the outliers I
would use that. I have to ask though, why are you not able to tolerate
the outliers? If you are reading and writing at quorum how is this
impacting you?

Regards,
Ariel

On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
> Hi Josh,
>
> >Running with increased heap size would reduce GC frequency, at the
> >cost of page cache.
>
> Actually  it's recommended to run C* without virtual memory enabled.
> So if there  is no enough memory JVM fails instead of blocking
>
> Best regards, Vladimir Yudovin,
> *Winguzone[1] - Hosted Cloud Cassandra on Azure and SoftLayer. Launch
> your cluster in minutes.*
>
>
> On Fri, 07 Oct 2016 21:06:24 -0400 *Josh Snyder* wrote
>> Hello cassandra-users,
>>
>> I'm investigating an issue with JVMs taking a while to reach a
>> safepoint. I'd like the list's input on confirming my hypothesis and
>> finding mitigations.
>>
>> My hypothesis is that slow block devices are causing Cassandra's JVM
>> to pause completely while attempting to reach a safepoint.
>>
>> Background:
>>
>> Hotspot occasionally performs maintenance tasks that necessitate
>> stopping all of its threads. Threads running JITed code occasionally
>> read from a given safepoint page. If Hotspot has initiated a safepoint,
>> reading from that page essentially catapults the thread into purgatory
>> until the safepoint completes (the mechanism behind this is pretty
>> cool). Threads performing syscalls or executing native code do this
>> check upon their return into the JVM.
>>
>> In this way, during the safepoint Hotspot can be sure that all of its
>> threads are either patiently waiting for safepoint completion or in a
>> system call.
>>
>> Cassandra makes heavy use of mmapped reads in normal operation. When
>> doing mmapped reads, the JVM executes userspace code to effect a read
>> from a file. On the fast path (when the page needed is already mapped
>> into the process), this instruction is very fast. When the page is not
>> cached, the CPU triggers a page fault and asks the OS to go fetch the
>> page. The JVM doesn't even realize that anything interesting is
>> happening: to it, the thread is just executing a mov instruction that
>> happens to take a while.
>>
>> The OS, meanwhile, puts the thread in question in the D state
>> (assuming Linux, here) and goes off to find the desired page. This may
>> take microseconds, this may take milliseconds, or it may take seconds
>> (or longer). When I/O occurs while the JVM is trying to enter a
>> safepoint, every thread has to wait for the laggard I/O to complete.
>>
>> If you log safepoints with the right options [1], you can see these
>> occurrences in the JVM output:
>>
>> > # SafepointSynchronize::begin: Timeout detected:
>> > # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
>> > # SafepointSynchronize::begin: Threads which did not reach the safepoint:
>> > # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 tid=0x7f8785bb1f30 nid=0x4e14 runnable [0x]
>> > java.lang.Thread.State: RUNNABLE
>> >
>> > # SafepointSynchronize::begin: (End of list)
>> > vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
>> > 58099.941: G1IncCollectionPause [ 447 1 1 ] [ 3304 0 3305 1 190 ] 1
>>
>> If that safepoint happens to be a garbage collection (which this one
>> was), you

Re: JVM safepoints, mmap, and slow disks

2016-10-07 Thread Vladimir Yudovin
Hi Josh,

>Running with increased heap size would reduce GC frequency, at the cost of 
>page cache.

Actually, it's recommended to run C* without virtual memory (swap) enabled, so 
that if there is not enough memory the JVM fails instead of blocking.

Best regards, Vladimir Yudovin, 
Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
Launch your cluster in minutes.




 On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder wrote

Hello cassandra-users, 
 
I'm investigating an issue with JVMs taking a while to reach a safepoint. I'd 
like the list's input on confirming my hypothesis and finding mitigations. 
 
My hypothesis is that slow block devices are causing Cassandra's JVM to pause 
completely while attempting to reach a safepoint. 
 
Background: 
 
Hotspot occasionally performs maintenance tasks that necessitate stopping all 
of its threads. Threads running JITed code occasionally read from a given 
safepoint page. If Hotspot has initiated a safepoint, reading from that page 
essentially catapults the thread into purgatory until the safepoint completes 
(the mechanism behind this is pretty cool). Threads performing syscalls or 
executing native code do this check upon their return into the JVM. 
 
In this way, during the safepoint Hotspot can be sure that all of its threads 
are either patiently waiting for safepoint completion or in a system call. 
 
Cassandra makes heavy use of mmapped reads in normal operation. When doing 
mmapped reads, the JVM executes userspace code to effect a read from a file. On 
the fast path (when the page needed is already mapped into the process), this 
instruction is very fast. When the page is not cached, the CPU triggers a page 
fault and asks the OS to go fetch the page. The JVM doesn't even realize that 
anything interesting is happening: to it, the thread is just executing a mov 
instruction that happens to take a while. 
 
The OS, meanwhile, puts the thread in question in the D state (uninterruptible 
sleep, assuming Linux here) and goes off to find the desired page. This may 
take microseconds, this may take milliseconds, or it may take seconds (or 
longer). When I/O occurs while the JVM is trying to enter a safepoint, every 
thread has to wait for the laggard I/O to complete. 
 
If you log safepoints with the right options [1], you can see these occurrences 
in the JVM output: 
 
> # SafepointSynchronize::begin: Timeout detected: 
> # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint. 
> # SafepointSynchronize::begin: Threads which did not reach the safepoint: 
> # "SharedPool-Worker-5" #468 daemon prio=5 os_prio=0 tid=0x7f8785bb1f30 nid=0x4e14 runnable [0x] 
> java.lang.Thread.State: RUNNABLE 
> 
> # SafepointSynchronize::begin: (End of list) 
> vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count 
> 58099.941: G1IncCollectionPause [ 447 1 1 ] [ 3304 0 3305 1 190 ] 1 
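
For anyone who wants to scan a lot of this output, here is a minimal sketch 
of pulling the fields out of the statistics row above. The class name 
SafepointStatRow and its methods are hypothetical (not part of Hotspot or 
Cassandra), the column order is taken from the header line as quoted, and the 
time columns are assumed to be milliseconds, which would be consistent with 
the roughly 3.3 seconds reported in the GC log for the same pause. 

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for one safepoint statistics row; not a Hotspot API. */
public final class SafepointStatRow {
    // Layout assumed from the quoted header:
    // timestamp: vmop [ total initially_running wait_to_block ]
    //            [ spin block sync cleanup vmop ] page_trap_count
    private static final Pattern ROW = Pattern.compile(
        "^\\s*([0-9]+\\.[0-9]+):\\s+(\\S+)\\s+"
        + "\\[\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\]\\s+"
        + "\\[\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\]\\s+"
        + "(\\d+)\\s*$");

    public final double timestampSeconds;
    public final String vmOperation;
    public final long syncTime; // time to bring every thread to the safepoint
    public final long vmopTime; // time spent in the VM operation itself

    private SafepointStatRow(double ts, String op, long sync, long vmop) {
        this.timestampSeconds = ts;
        this.vmOperation = op;
        this.syncTime = sync;
        this.vmopTime = vmop;
    }

    /** Parses one row (without any "> " quote prefix); returns null on a mismatch. */
    public static SafepointStatRow parse(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new SafepointStatRow(
            Double.parseDouble(m.group(1)), m.group(2),
            Long.parseLong(m.group(8)), Long.parseLong(m.group(10)));
    }

    /** True when reaching the safepoint took longer than the work it was for. */
    public boolean dominatedBySync() {
        return syncTime > vmopTime;
    }

    public static void main(String[] args) {
        SafepointStatRow row = parse(
            "58099.941: G1IncCollectionPause [ 447 1 1 ] [ 3304 0 3305 1 190 ] 1");
        System.out.println(row.vmOperation + " sync=" + row.syncTime
            + " vmop=" + row.vmopTime + " dominatedBySync=" + row.dominatedBySync());
        // prints: G1IncCollectionPause sync=3305 vmop=190 dominatedBySync=true
    }
}
```

For the row above, the sync column dwarfs the vmop column, which is the 
signature of a pause spent waiting for a laggard thread rather than 
collecting garbage. 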
 
If that safepoint happens to be a garbage collection (which this one was), you 
can also see it in GC logs: 
 
> 2016-10-07T13:19:50.029+: 58103.440: Total time for which application threads were stopped: 3.4971808 seconds, Stopping threads took: 3.3050644 seconds 
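
To find these pauses in a long GC log, something like the following sketch 
could work. It assumes each entry sits on a single line in the shape quoted 
above. The class StoppedTimeLine and the idea of looking at the fraction of 
the pause spent stopping threads are illustrative assumptions, not an 
existing tool. 

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for "Total time for which application threads were stopped" lines. */
public final class StoppedTimeLine {
    private static final Pattern LINE = Pattern.compile(
        "Total time for which application threads were stopped: "
        + "(\\d+(?:\\.\\d+)?) seconds, "
        + "Stopping threads took: (\\d+(?:\\.\\d+)?) seconds");

    /** Returns {totalStoppedSeconds, stoppingThreadsSeconds}, or null if the line does not match. */
    public static double[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new double[] {
            Double.parseDouble(m.group(1)),
            Double.parseDouble(m.group(2))
        };
    }

    /** Fraction of the pause spent getting threads to the safepoint (0 for a zero-length pause). */
    public static double stoppingFraction(double[] times) {
        return times[0] == 0.0 ? 0.0 : times[1] / times[0];
    }

    public static void main(String[] args) {
        double[] t = parse("2016-10-07T13:19:50.029+: 58103.440: Total time for which "
            + "application threads were stopped: 3.4971808 seconds, "
            + "Stopping threads took: 3.3050644 seconds");
        System.out.println("stopped=" + t[0] + "s, stopping threads=" + t[1]
            + "s, fraction=" + stoppingFraction(t));
    }
}
```

For the entry above, 3.3050644 of the 3.4971808 seconds (about 94%) went to 
stopping threads, so almost none of the pause was actual collection work. 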
 
In this way, JVM safepoints become a powerful weapon for transmuting a single 
thread's slow I/O into the entire JVM's lockup. 
 
Does all of the above sound correct? 
 
Mitigations: 
 
1) don't tolerate block devices that are slow 
 
This is easy in theory, and only somewhat difficult in practice. Tools like 
perf and iosnoop [2] can do pretty good jobs of letting you know when a block 
device is slow. 
 
It is sad, though, because this makes running Cassandra on mixed hardware (e.g. 
fast SSD and slow disks in a JBOD) quite unappetizing. 
 
2) have fewer safepoints 
 
Two of the biggest sources of safepoints are garbage collection and revocation 
of biased locks. Evidence points toward biased locking being unhelpful for 
Cassandra's purposes, so turning it off (-XX:-UseBiasedLocking) is a quick way 
to eliminate one source of safepoints. 
 
Garbage collection, on the other hand, is unavoidable. Running with increased 
heap size would reduce GC frequency, at the cost of page cache. But sacrificing 
page cache would increase page fault frequency, which is another thing we're 
trying to avoid! I don't view this as a serious option. 
 
3) use a different IO strategy 
 
Looking at the Cassandra source code, there appears to be an un(der)documented 
configuration parameter called disk_access_mode. It appears that changing this 
to 'standard' would switch to using pread() and pwrite() for I/O, instead of 
mmap. I imagine there would be a throughput penalty here for the case when 
pages are in the disk cache. 
 
Is this a serious option? It seems far too underdocumented to be thought of as 
a contender. 
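
To make the difference between the two strategies concrete, here is a small 
Java sketch of both read paths. It is not Cassandra's code: ReadPaths and its 
methods are made up for illustration, and the claim that the positional 
FileChannel.read ends up in pread() on Linux is my assumption about the JDK 
implementation. The point is only that the mapped get() is an ordinary load 
executed in Java code, while the positional read happens inside a native 
call. 

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative only; not taken from the Cassandra source. */
public final class ReadPaths {
    /** mmap path: get() is a plain memory load, so a page fault stalls the thread in Java code. */
    static byte readMapped(Path file, int offset) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Sketch assumes the file is smaller than 2 GiB (one mapping).
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(offset);
        }
    }

    /** Positional-read path: any blocking on the device happens inside a native method. */
    static byte readPositional(Path file, long offset) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer dst = ByteBuffer.allocate(1);
            if (ch.read(dst, offset) != 1) {
                throw new IOException("short read at offset " + offset);
            }
            return dst.get(0);
        }
    }

    /** Writes a tiny temp file and reads byte 2 through both paths. */
    static byte[] demo() {
        try {
            Path tmp = Files.createTempFile("readpaths", ".bin");
            try {
                Files.write(tmp, new byte[] {10, 20, 30, 40});
                return new byte[] {readMapped(tmp, 2), readPositional(tmp, 2)};
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] result = demo();
        System.out.println("mapped=" + result[0] + " positional=" + result[1]);
        // prints: mapped=30 positional=30
    }
}
```

Both paths return the same data. What differs is where the thread is when 
the device is slow: in the first case it is stuck on a load and cannot reach 
a safepoint, in the second it is in native code and does not hold one up. 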
 
4) modify the JVM 
 
This is a longer term option. For the purposes of safepoints, perhaps the JVM 
could treat reads from an mmapped file in