Re: RFD: 8252768: Fast, asynchronous heap dumps

Volker Simonis Fri, 04 Sep 2020 08:13:22 -0700

On Fri, Sep 4, 2020 at 12:01 PM Thomas Stüfe <[email protected]> wrote:
>
>
>
> On Fri, Sep 4, 2020 at 11:23 AM Volker Simonis <[email protected]> 
> wrote:
>>
>> On Thu, Sep 3, 2020 at 8:26 PM Thomas Stüfe <[email protected]> wrote:
>> >
>> > Hi Volker,
>> >
>> > On Thu, Sep 3, 2020 at 7:41 PM Volker Simonis <[email protected]> 
>> > wrote:
>> >>
>> >> Hi Thomas,
>> >>
>> >> Thanks for sharing your concerns. Please find my anserwers inline:
>> >>
>> >> On Thu, Sep 3, 2020 at 6:58 PM Thomas Stüfe <[email protected]> 
>> >> wrote:
>> >> >
>> >> > Hi Volker,
>> >> >
>> >> > hah, that is a cool idea :)
>> >> >
>> >> > Lets see what could go wrong:
>> >> >
>> >> > So, child process is forked, but in the child only the forking thread 
>> >> > survives, right? The forking thread then proceeds to dump. What happens 
>> >> > when, during dumping, it needs a lock owned by one of the non-running 
>> >> > threads? Could also be something non-obvious, like a lock drawn by one 
>> >> > of the lower level subsystems, e.g. memory allocation (and then NMT), 
>> >> > UL etc.
>> >>
>> >> I haven't completely analyzed all the possible code paths yet but the
>> >> good thing is that the child process is at a safepoint and as you
>> >> correctly mentioned it only has a single running thread. I think if a
>> >> VM operation at a safepoint would require a lock owned by another
>> >> thread, it would dead-lock anyway, even without forking, wouldn't it?
>> >> Normally, VM_HeapDumper::doit() runs at a safepoint without blocking
>> >> so I currently don't see why it shouldn't be able to run without
>> >> blocking in a forked process?
>> >>
>> >
>> > You are right, I overlooked that the parent VM is at a safepoint at fork 
>> > time.
>> >
>> >>
>> >> >
>> >> > --
>> >> >
>> >> > In the child, I would be afraid of running any kind of cleanup code 
>> >> > when exiting, since that may somehow modify state in the parent (e.g. 
>> >> > via explicitly shared memory, or whatever third party native code may 
>> >> > be up to). So I would use _exit(), not exit(), to avoid running any 
>> >> > stray onexit()/atexit() handlers.
>> >> >
>> >>
>> >> Thanks, that's a good point. I'll change that.
>> >
>> >
>> >>
>> >>
>> >>
>> >> > Of course, then you need to make sure the dump is flushed and the file 
>> >> > handle is closed before exiting.
>> >> >
>> >> > --
>> >> >
>> >> > Depending on the overcommit settings fork() may fail with ENOMEM, 
>> >> > regardless of copy-on-write.
>> >> >
>> >>
>> >> Not sure this is related to overcommit settings. According to the
>> >> man-page, fork() only fails with "ENOMEM" if fork() "failed to
>> >> allocate the necessary kernel structures because memory is tight". But
>> >> a failing fork is actually no problem at all. Currently, if fork()
>> >> fails, I just fall back to normal, synchronous dumping. Of course this
>> >> could be made configurable such that a failing asynchronous dump
>> >> wouldn't result in a synchronous dump with its long safepoint timeout
>> >> but instead completely skip the dump altogether.
>> >>
>> >> > --
>> >> >
>> >> > If the parent process is, at the time of the fork, touching a lot of 
>> >> > pages, and the child takes its sweet time writing the dump, total 
>> >> > memory usage will go up, right? Compared to the original, non-async 
>> >> > variant.
>> >> >
>> >>
>> >> Yes, that's right. The child is not writing more than a few kb, but
>> >> the total memory consumption of the system will increase by the amount
>> >> of common pages which are changed by the parent while the child is
>> >> dumping.
>> >>
>> >
>> > Plus costs for page tables and vma structures; even with copy-on-write 
>> > this is not free. But I do not think that would be a reason not to do it.
>>
>> Yes, you're actually trading memory for speed :)
>>
>> >
>> >>
>> >> > --
>> >> >
>> >> > We will now have a second java process popping up, existing for some 
>> >> > seconds, then vanishing. Outside tooling might be confused. OTOH the 
>> >> > same happens when forking via Runtime.exec, but there this state only 
>> >> > persists for some microseconds, until the first exec() call.
>> >>
>> >> That's also right. But the new process is at a safepoint and will exit
>> >> right from there. So at least other tools won't be able to attach to
>> >> the new process. It won't be even detectable by tools like jps.
>> >>
>> >> >
>> >> > --
>> >> >
>> >> > UL in child: this log output now gets mixed in asynchronously with the 
>> >> > parent's log? I would probably avoid logging in the child process. 
>> >> > Also, as stated above, I am not sure if UL uses locks internally, which 
>> >> > may hang.
>> >>
>> >> As I wrote in my mail, the UL in the child is currently only there for
>> >> debugging purposes. It would be removed in a final version.
>> >
>> >
>> > Sorry, I missed that. Note that UL may still be active though, we may 
>> > inherit active logging from the parent and the code you call may contain 
>> > UL logging points. To be safe you may just deactivate logging at the start 
>> > of the child's life.
>>
>> I thought about that already but didn't know how to achieve that. I've
>> found "LogConfiguration::disable_logging()" now. Do you know if it is
>> safe to call that? I'll give it a try...
>
>
> I wonder whether a more generic approach in addition to switching off UL may 
> be safer:
>
> We want to prevent writing to any descriptor we inherited from the parent.


Not really. We still want to dump the heap into the file which the
parent process has opened for us :)

> It is unlikely that this happens since we kind of control what code runs in 
> the child, but there may be accidental writes, UL is one example. Another 
> example may be if someone instrumented the VM, so we run with some code 
> "under us" without knowing.
>
> We could just close all file descriptors at the beginning of the child's 
> life. Like we do in the jspawnhelper. Worst thing that happens is that trips 
> off coding in the child which really wanted to write to the file. Either the 
> code is able to handle this gracefully, or it will assert, but then we'll 
> know. In any case we have not disturbed the parent, we only risk an error in 
> the child process.
>
> Lets look at UL: if we close the underlying file descriptor, I believe it 
> would not even notice. Looks like it just soldiers on, e.g. 
> LogFileOutput::write(), just takes the negative return of jio_fprintf() and 
> counts that. I think this is actually a bug, this should be fixed.
>
> ..Thomas
>
>
>
>

Re: RFD: 8252768: Fast, asynchronous heap dumps

Reply via email to