[ 
https://issues.apache.org/jira/browse/AURORA-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526955#comment-14526955
 ] 

Bill Farner commented on AURORA-1303:
-------------------------------------

Thanks for reporting! Are you able to reproduce this in our Vagrant image?

> Thermos runner broken with non-root account
> -------------------------------------------
>
>                 Key: AURORA-1303
>                 URL: https://issues.apache.org/jira/browse/AURORA-1303
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor
>    Affects Versions: 0.7.0
>            Reporter: Ovidiu Predescu
>
> This happens with the latest code from GitHub.
> I'm trying to schedule the hello_world example using a non-root role. The 
> thermos_runner crashes when it tries to write the checkpoint in the 
> fetch_package process.
> It looks like the runner is executing as the non-root user, but the 
> checkpoint is owned by root.
> Unfortunately the error handling in Aurora is not very good. The exception 
> thrown by the runner is silently swallowed, and the fetch_package process is 
> running without showing any failures in the log files. I was able to figure 
> out what's going on by manually running the command.
> As a workaround I added user 'ovidiu' to group 'root', since the directory 
> containing the checkpoint has 'rwx' permissions for the group.
> This is the command:
> /usr/bin/python2.7 \
>   /var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/thermos_runner.pex \
>   --setuid=ovidiu \
>   --thermos_json=/var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/task.json \
>   --sandbox=/var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/sandbox \
>   --log_dir=. \
>   --task_id=1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1 \
>   --log_to_disk=DEBUG \
>   --checkpoint_root=/var/run/thermos \
>   --hostname=m1a.dc
> And here is the output:
> Writing log files to disk in .
> ERROR] Found existing runner, cannot take control.
> ERROR] Unknown exception: Unable to open checkpoint /var/run/thermos/checkpoints/1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runner
> ERROR] Traceback (most recent call last):
> ERROR]   File "/var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/thermos_runner.pex/apache/thermos/bin/thermos_runner.py", line 176, in proxy_main
> ERROR]   File "/var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/thermos_runner.pex/apache/thermos/core/runner.py", line 859, in run
> ERROR]     with self.control(force):
> ERROR]   File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
> ERROR]     return self.gen.next()
> ERROR]   File "/var/lib/mesos/slaves/20150502-132057-838930604-5050-17297-S23/frameworks/20150502-132057-838930604-5050-17297-0000/executors/thermos-1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runs/68c1af87-c531-424f-9fdb-0840cde02815/thermos_runner.pex/apache/thermos/core/runner.py", line 552, in control
> ERROR]     raise self.PermissionError('Unable to open checkpoint %s' % ckpt_file)
> ERROR] PermissionError: Unable to open checkpoint /var/run/thermos/checkpoints/1430629905212-ovidiu-devel-hello_world-0-bc87c672-9cb2-4e4b-84c1-2b7d0e8726c1/runner
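
The ownership mismatch described in the report (runner executing as a non-root user, checkpoint owned by root) can be checked directly with a short diagnostic. The following is only a sketch and is not part of Thermos or Aurora: the helper name `checkpoint_access` is hypothetical, and it is written for Python 3 although the report uses python2.7. It reports who owns a path, its permission bits, and whether the user running the script can write to it.

```python
import os
import pwd
import stat


def checkpoint_access(path):
    """Hypothetical helper: summarize ownership and write access for `path`."""
    st = os.stat(path)
    try:
        # Resolve the numeric owner to a user name when the account exists.
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # Fall back to the raw uid if there is no passwd entry for it.
        owner = str(st.st_uid)
    return {
        "owner": owner,
        # Permission bits only, e.g. '0o775'.
        "mode": oct(stat.S_IMODE(st.st_mode)),
        # os.access checks against the real uid/gid of the calling process.
        "writable": os.access(path, os.W_OK),
    }
```

Run as the role user (for example via `sudo -u ovidiu`), calling `checkpoint_access("/var/run/thermos/checkpoints")` would be expected to show a root owner and `writable` as false if the diagnosis in the report is right. That expectation is an inference from the report, not a verified result.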



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
