Ok - I can successfully recreate this on the public cloud - I'll send you some details separately.
On Mon, Apr 11, 2016 at 9:08 PM, Nigel Magnay <[email protected]> wrote:
> Sure - I'm going to try to get a test container up on the Joyent cloud to at least see if it does the same thing there.
>
> Interestingly, it looks like it's timing-related. If I attach a Java debugger to a debug socket when running the build, it doesn't seem to fail.
>
>
> On Mon, Apr 11, 2016 at 8:28 PM, Elijah Zupancic <[email protected]> wrote:
>
>> Hi Nigel,
>>
>> I'm trying to verify this bug. I wasn't able to get your Dockerfile to build because it was missing associated files. Is there a way for you to build the Docker image and push it to Docker Hub? If so, could you then try to distill the repro steps down to a single command that I can execute using docker exec?
>>
>> Thanks,
>> Elijah
>>
>> On Mon, Apr 11, 2016 at 7:54 AM, Nigel Magnay <[email protected]> wrote:
>>
>>> Hm - sounds related. I updated the GZ to 20160411T120144Z, but the docker container still exhibits the behaviour (unless there are other updates I need to apply?)
>>>
>>> On Mon, Apr 11, 2016 at 2:44 PM, Jerry Jelinek <[email protected]> wrote:
>>>
>>>> It is possible that you are hitting https://smartos.org/bugview/OS-5311, which I just fixed on Friday. To summarize the problem:
>>>>
>>>> The bug is that the active contract template is only set up on the initial LWP in the process, so any time that LWP forks, the child will be in a different contract, as configured. However, if any other LWP initiates the fork, then that LWP has a NULL active contract template, so the child will be in the same contract.
>>>>
>>>> Due to how we set up the application process for a Docker zone, if the application forks a child, it must be in a different contract. If it's not, then when the child exits, it will cause all of the other processes in the same contract to die and the zone will halt.
>>>>
>>>> Jerry
>>>>
>>>>
>>>> On Mon, Apr 11, 2016 at 6:49 AM, Nigel Magnay <[email protected]> wrote:
>>>>
>>>>> Ok, so I have traced a little further.
>>>>>
>>>>> Part of the maven build shells out to run git (via the plugin code at [2]); the last lines of the log show this before the mysterious death:
>>>>>
>>>>> [INFO] Executing: /bin/sh -c cd '/home/jenkins/realtime/commons/csw-commons-utils' && 'git' 'rev-parse' '--verify' 'HEAD'
>>>>> [INFO] Working directory: /home/jenkins/realtime/commons/csw-commons-utils
>>>>> [INFO] Storing buildNumber: a20f2793de8f449ff5c638479e73dab4a884936c at timestamp: 1460376264112
>>>>>
>>>>> The output shows that the git process returned (the INFO lines show what it returned). However, in the dtrace log, I can see this:
>>>>>
>>>>> 937/1: lx_emulate(7fffff08eca0, 231, [8d, 0, 8d, ffffffffffffff90, 3c, e7])
>>>>> 937/1: lx_exit_common(LX_ET_EXIT_GROUP, 141)
>>>>>
>>>>> I'm assuming this is the git process returning an error from exit(), and googling around, 141 seems to indicate a pipe being closed (e.g. [1]).
>>>>>
>>>>> Sometimes the build does not die at that point - and the corresponding part of the log did not show 141, but instead a 0:
>>>>>
>>>>> 1264/1: lx_exit_common(LX_ET_EXIT_GROUP, 0)
>>>>>
>>>>> If I remove the part of the build that calls git, it completes successfully, so it certainly seems to be related to this.
>>>>>
>>>>> What I don't really understand is why the child process exiting causes the parent to exit mysteriously - but not immediately?
>>>>>
>>>>>
>>>>> [1] https://bugs.launchpad.net/mksh/+bug/1532621
>>>>> [2] https://github.com/mojohaus/buildnumber-maven-plugin/blob/buildnumber-maven-plugin-1.3/src/main/java/org/codehaus/mojo/build/CreateMojo.java
>>>>>
>>>>>
>>>>> On Mon, Apr 11, 2016 at 10:37 AM, Nigel Magnay <[email protected]> wrote:
>>>>>
>>>>>> So uname of the GZ is
>>>>>> SunOS headnode 5.11 joyent_20160317T000105Z i86pc i386 i86pc
>>>>>>
>>>>>> The docker container is
>>>>>> Linux ff4a058f3f5e 3.13.0 BrandZ virtual linux x86_64 GNU/Linux
>>>>>>
>>>>>> I've attached the Dockerfile, though it's not that exciting. I seem to get failures quite often when the docker API chooses the HN to provision a node rather than a CN, but I'm not totally sure about that. I don't think it's running out of memory - vmstat in the GZ shows
>>>>>>
>>>>>> I've attached the output from the tail of truss on the GZ for the java process (bear in mind I currently know nothing about dtrace, but I will go away and try to read up) - I assume, though, that I'll see segfaults as a natural product of the java memory management.
>>>>>>
>>>>>> The only thing that stands out to my eye in the trace is
>>>>>> vforkx(0) = 65788
>>>>>>
>>>>>> which seems a strange return value, if I'm reading it correctly?
>>>>>>
>>>>>>
>>>>>> On Sat, Apr 9, 2016 at 11:16 AM, Nigel Magnay <[email protected]> wrote:
>>>>>>
>>>>>>> I'll dig out the Dockerfile on Monday - it's not particularly complex.
>>>>>>>
>>>>>>> Basically the (jenkins) build process asks triton to instantiate a (java:8) docker image, then uses ssh to connect and invoke maven. I can connect 'by hand' and also make it die - though it may be that it's only doing this on some nodes.
>>>>>>>
>>>>>>> The most likely explanation is that I've either misunderstood or misconfigured something. There would seem to be enough free memory and swap on the host whilst it's running, but might there be other limits (processes?) that may have been breached, causing it to be killed?
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Apr 8, 2016 at 10:20 PM, Elijah Zupancic <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Nigel,
>>>>>>>>
>>>>>>>> Could you please give us some additional information? Could you tell us the platform image version? You can find it by running uname -a, or /native/bin/uname -a if you are within a docker container.
>>>>>>>>
>>>>>>>> Also, with regard to the Docker build process, are you using Docker build with Triton, or are you using Docker build with Linux Docker and then running the built image on Triton?
>>>>>>>>
>>>>>>>> Also, if your Dockerfile isn't sensitive, could you please share it with us? I can attempt to reproduce. Another way to narrow down problems is to try to run it in the Joyent public cloud and see if it dies there.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Elijah
>>>>>>>>
>>>>>>>> On Fri, Apr 8, 2016 at 6:11 AM, Nigel Magnay <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi -
>>>>>>>>>
>>>>>>>>> I've successfully built a Triton cloud, and am using it to provision docker containers to build our java software.
>>>>>>>>>
>>>>>>>>> The java container I use is derived from the "java:8" docker image.
>>>>>>>>>
>>>>>>>>> Mostly it works. However, I am getting mysterious failures where the build process (maven) simply exits halfway through with no error. I have a test system that gets spun up and, most of the time, fails consistently in the same place.
>>>>>>>>>
>>>>>>>>> dmesg from inside the container yields nothing. Irritatingly, if I do 'strace -ff -o me ...', the build continues past the point where it was failing. If I try to monitor the process from a second connection, sometimes the build continues and *that* connection just dies.
>>>>>>>>>
>>>>>>>>> I was able to connect briefly with truss from the GZ, but I'm not sure it tells me very much.
>>>>>>>>>
>>>>>>>>> Could I have hit some process accounting limit? I don't think it's memory (the process is set to a fairly low java Xmx limit). Where else can I look to figure this out?
>>>>>>>>
>>>>>>>> --
>>>>>>>> -Elijah
>>>>
>>>> *smartos-discuss* | Archives <https://www.listbox.com/member/archive/184463/=now> <https://www.listbox.com/member/archive/rss/184463/22541698-24d6dc34> | Modify <https://www.listbox.com/member/?&> Your Subscription <http://www.listbox.com>
>>
>> --
>> -Elijah
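Two details in this thread can be exercised from ordinary user code: the exit status 141 that Nigel found, and the condition Jerry describes for OS-5311, where the bug only bites when a thread other than the initial LWP initiates the fork. The Python sketch below is an illustration, not code from SmartOS, Triton or the Maven plugin, and the helper names describe_status and run_from_worker_thread are hypothetical. It decodes a status using the shell convention that a child killed by signal N is reported as 128 + N (so 141 is signal 13, SIGPIPE), and it spawns a command from a worker thread so the fork comes from a non-initial LWP. It has not been run on SmartOS/LX; that it would reproduce the zone halt on a platform image without the OS-5311 fix is an assumption. Its relevance to the Maven build also rests on the assumption that the JVM spawns git from a thread other than the process's initial one, which is typical because the standard java launcher runs main on a newly created thread.

```python
from __future__ import annotations

import signal
import subprocess
import threading


def describe_status(code: int) -> str:
    """Decode a process status.

    Shells report a child killed by signal N as 128 + N, so 141 is signal 13
    (SIGPIPE). Python's subprocess reports the same event as -N instead.
    """
    if code < 0:
        sig = -code
    elif code > 128:
        sig = code - 128
    else:
        return f"exited with {code}"
    try:
        return f"killed by {signal.Signals(sig).name}"
    except ValueError:
        # Not a known signal number; report the raw status instead.
        return f"exited with {code}"


def run_from_worker_thread(argv: list[str]) -> int:
    """Run argv from a thread other than the initial one and return its status.

    This mirrors the OS-5311 trigger (a fork initiated by a non-initial LWP);
    on a fixed platform, or on ordinary Linux, it just runs the command.
    """
    result: dict[str, int] = {}

    def worker() -> None:
        proc = subprocess.run(argv, capture_output=True, text=True)
        result["code"] = proc.returncode

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["code"]


if __name__ == "__main__":
    # Illustrative repro loop (hypothetical): the command and the count of 5
    # are arbitrary choices, not taken from the original build.
    for _ in range(5):
        status = run_from_worker_thread(["sh", "-c", "git rev-parse --verify HEAD"])
        print(describe_status(status))
```

Run as a script inside the container, it spawns git from a worker thread five times and prints the decoded status of each run. Outside a git work tree, git exits with 128, which this convention reports as a plain exit rather than a signal.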
