[jira] [Resolved] (AURORA-1949) PreemptionVictimFilterImpl comparator violates transitivity causing exceptions

2019-07-24 Thread Stephan Erb (JIRA)


 [ 
https://issues.apache.org/jira/browse/AURORA-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-1949.
-
Resolution: Fixed
  Assignee: Stephan Erb  (was: Jordan Ly)

> PreemptionVictimFilterImpl comparator violates transitivity causing exceptions
> --
>
> Key: AURORA-1949
> URL: https://issues.apache.org/jira/browse/AURORA-1949
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Jordan Ly
>Assignee: Stephan Erb
>Priority: Critical
>
> The PreemptionVictimFilterImpl uses a comparator to sort ResourceBags in 
> order to preempt the biggest tasks first when searching for a victim. 
> However, the current implementation can throw an exception that causes the 
> Scheduler to fail:
> {noformat}
> SEVERE: Service PreemptorService [FAILED] has failed in the RUNNING state.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeLo(TimSort.java:777)
> at java.util.TimSort.mergeAt(TimSort.java:514)
> at java.util.TimSort.mergeCollapse(TimSort.java:441)
> at java.util.TimSort.sort(TimSort.java:245)
> at java.util.Arrays.sort(Arrays.java:1438)
> at 
> com.google.common.collect.Ordering.immutableSortedCopy(Ordering.java:882)
> at 
> org.apache.aurora.scheduler.preemptor.PreemptionVictimFilter$PreemptionVictimFilterImpl.filterPreemptionVictims(PreemptionVictimFilter.java:210)
> at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$0(PendingTaskProcessor.java:178)
> at 
> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:147)
> at 
> org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
> at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> at 
> org.apache.aurora.scheduler.storage.log.LogStorage.read(LogStorage.java:562)
> at 
> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:113)
> at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(PendingTaskProcessor.java:135)
> at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> at 
> org.apache.aurora.scheduler.preemptor.PreemptorModule$PreemptorService.runOneIteration(PreemptorModule.java:205)
> at 
> com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:188)
> at 
> com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Looking at the code, it seems the comparator violates transitivity:
> {code:java}
> @VisibleForTesting
> static final Ordering<ResourceBag> ORDER = new Ordering<ResourceBag>() {
>   @Override
>   public int compare(ResourceBag left, ResourceBag right) {
>     Set<ResourceType> types = ImmutableSet.<ResourceType>builder()
>         .addAll(left.streamResourceVectors().map(e -> e.getKey()).iterator())
>         .addAll(right.streamResourceVectors().map(e -> e.getKey()).iterator())
>         .build();
>     boolean allZero = true;
>     boolean allGreaterOrEqual = true;
>     boolean allLessOrEqual = true;
>     for (ResourceType type : types) {
>       int compare = left.valueOf(type).compareTo(right.valueOf(type));
>       if (compare != 0) {
>         allZero = false;
>       }
>       if (compare < 0) {
>         allGreaterOrEqual = false;
>       }
>       if (compare > 0) {
>         allLessOrEqual = false;
>       }
>     }
>     if (allZero) {
>       return 0;
>     }
>     if (allGreaterOrEqual) {
>       return 1;
>     }
>     if (allLessOrEqual) {
>       return -1;
>     }
>     return 0;
>   }
> };
> {code}
> The example below illustrates the error:
> {noformat}
> Resource:X Y Z
> Bag A:   2 0 2
> Bag B:   1 2 1
> Bag C:   2 2 1
> {noformat}
> We can see that A = B, B < C, and C = A, which violates transitivity: A = B
> and B < C would require A < C, yet the comparator reports A and C as equal.
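
To make the violation concrete, here is a small self-contained sketch (plain int arrays stand in for the ResourceBags above; the class and values are illustrative only, not Aurora code) that applies the same comparison rule to bags A, B and C:

{code:java}
import java.util.Arrays;
import java.util.Comparator;

// Standalone illustration of the intransitivity: the same "all greater / all less /
// otherwise equal" rule from the snippet above, applied to plain resource vectors.
public class IntransitivityDemo {
  // Mirrors the quoted comparator, assuming vectors of equal length.
  static final Comparator<int[]> ORDER = (left, right) -> {
    boolean allZero = true;
    boolean allGreaterOrEqual = true;
    boolean allLessOrEqual = true;
    for (int i = 0; i < left.length; i++) {
      int compare = Integer.compare(left[i], right[i]);
      if (compare != 0) {
        allZero = false;
      }
      if (compare < 0) {
        allGreaterOrEqual = false;
      }
      if (compare > 0) {
        allLessOrEqual = false;
      }
    }
    if (allZero) {
      return 0;
    }
    if (allGreaterOrEqual) {
      return 1;
    }
    if (allLessOrEqual) {
      return -1;
    }
    return 0;  // incomparable vectors are reported as "equal"
  };

  public static void main(String[] args) {
    int[] a = {2, 0, 2};
    int[] b = {1, 2, 1};
    int[] c = {2, 2, 1};
    System.out.println(ORDER.compare(a, b));  //  0 -> A "equal to" B
    System.out.println(ORDER.compare(b, c));  // -1 -> B less than C
    System.out.println(ORDER.compare(a, c));  //  0 -> A "equal to" C, contradicting A = B and B < C
    // With enough such bags in one list, TimSort's contract check can trip exactly as in the trace above.
    Arrays.sort(new int[][] {a, b, c}, ORDER);
  }
}
{code}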

[jira] [Commented] (AURORA-1949) PreemptionVictimFilterImpl comparator violates transitivity causing exceptions

2019-07-21 Thread Stephan Erb (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889813#comment-16889813
 ] 

Stephan Erb commented on AURORA-1949:
-

As the issue is still unsolved, I have aimed for a simple & deterministic 
approach: [https://github.com/apache/aurora/pull/61]
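
For reference, the committed change lives in the pull request above. Just to illustrate what a deterministic, transitive ordering can look like (a sketch under assumed names, not the code from the PR): comparing resource values type by type in one fixed global order is a lexicographic comparison and therefore always satisfies the comparator contract. The Map stand-in and the resource names below are illustrative only.

{code:java}
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch only: lexicographic comparison over a fixed, global order of resource types.
// Each step compares a totally ordered value and ties fall through to the next type,
// so the resulting order is transitive and deterministic.
final class LexicographicResourceOrder {
  // Hypothetical resource keys; a ResourceBag is approximated as Map<String, Double>.
  static final List<String> RESOURCE_ORDER = Arrays.asList("cpus", "ram_mb", "disk_mb");

  static final Comparator<Map<String, Double>> ORDER = (left, right) -> {
    for (String type : RESOURCE_ORDER) {
      int cmp = Double.compare(
          left.getOrDefault(type, 0.0),
          right.getOrDefault(type, 0.0));
      if (cmp != 0) {
        return cmp;
      }
    }
    return 0;
  };
}
{code}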

> PreemptionVictimFilterImpl comparator violates transitivity causing exceptions
> --
>
> Key: AURORA-1949
> URL: https://issues.apache.org/jira/browse/AURORA-1949
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Jordan Ly
>Assignee: Jordan Ly
>Priority: Critical
>
> The PreemptionVictimFilterImpl uses a comparator to sort ResourceBags in 
> order to preempt the biggest tasks first when searching for a victim. 
> However, the current implementation can throw an exception that causes the 
> Scheduler to fail:
> {noformat}
> SEVERE: Service PreemptorService [FAILED] has failed in the RUNNING state.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeLo(TimSort.java:777)
> at java.util.TimSort.mergeAt(TimSort.java:514)
> at java.util.TimSort.mergeCollapse(TimSort.java:441)
> at java.util.TimSort.sort(TimSort.java:245)
> at java.util.Arrays.sort(Arrays.java:1438)
> at 
> com.google.common.collect.Ordering.immutableSortedCopy(Ordering.java:882)
> at 
> org.apache.aurora.scheduler.preemptor.PreemptionVictimFilter$PreemptionVictimFilterImpl.filterPreemptionVictims(PreemptionVictimFilter.java:210)
> at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$0(PendingTaskProcessor.java:178)
> at 
> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:147)
> at 
> org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
> at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> at 
> org.apache.aurora.scheduler.storage.log.LogStorage.read(LogStorage.java:562)
> at 
> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:113)
> at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(PendingTaskProcessor.java:135)
> at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> at 
> org.apache.aurora.scheduler.preemptor.PreemptorModule$PreemptorService.runOneIteration(PreemptorModule.java:205)
> at 
> com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:188)
> at 
> com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Looking at the code, it seems the comparator violates transitivity:
> {code:java}
> @VisibleForTesting
> static final Ordering<ResourceBag> ORDER = new Ordering<ResourceBag>() {
>   @Override
>   public int compare(ResourceBag left, ResourceBag right) {
>     Set<ResourceType> types = ImmutableSet.<ResourceType>builder()
>         .addAll(left.streamResourceVectors().map(e -> e.getKey()).iterator())
>         .addAll(right.streamResourceVectors().map(e -> e.getKey()).iterator())
>         .build();
>     boolean allZero = true;
>     boolean allGreaterOrEqual = true;
>     boolean allLessOrEqual = true;
>     for (ResourceType type : types) {
>       int compare = left.valueOf(type).compareTo(right.valueOf(type));
>       if (compare != 0) {
>         allZero = false;
>       }
>       if (compare < 0) {
>         allGreaterOrEqual = false;
>       }
>       if (compare > 0) {
>         allLessOrEqual = false;
>       }
>     }
>     if (allZero) {
>       return 0;
>     }
>     if (allGreaterOrEqual) {
>       return 1;
>     }
>     if (allLessOrEqual) {
>       return -1;
>     }
>     return 0;
>   }
> };
> {code}
> The example below illustrates the error:
> {noformat}
> Resource:X Y Z
> Bag A:   2 0 2
> Bag 

[jira] [Commented] (AURORA-1949) PreemptionVictimFilterImpl comparator violates transitivity causing exceptions

2018-06-21 Thread Stephan Erb (JIRA)


[ 
https://issues.apache.org/jira/browse/AURORA-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519805#comment-16519805
 ] 

Stephan Erb commented on AURORA-1949:
-

A recent version of the stack trace we have been bumping into:
{code:java}
 Jun 20, 2018 2:07:26 PM 
com.google.common.util.concurrent.ServiceManager$ServiceListener failed
 SEVERE: Service PreemptorService [FAILED] has failed in the RUNNING state.
 java.lang.IllegalArgumentException: Comparison method violates its general 
contract!
 at java.util.TimSort.mergeHi(TimSort.java:899)
 at java.util.TimSort.mergeAt(TimSort.java:516)
 at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
 at java.util.TimSort.sort(TimSort.java:254)
 at java.util.Arrays.sort(Arrays.java:1512)
 at java.util.ArrayList.sort(ArrayList.java:1462)
 at java.util.stream.SortedOps$RefSortingSink.end(SortedOps.java:387)
 at java.util.stream.Sink$ChainedReference.end(Sink.java:258)
 at 
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
 at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
 at 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
 at 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
 at 
java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
 at 
org.apache.aurora.scheduler.preemptor.PreemptionVictimFilter$PreemptionVictimFilterImpl.filterPreemptionVictims(PreemptionVictimFilter.java:194)
 at 
org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$0(PendingTaskProcessor.java:185)
 at 
org.apache.aurora.scheduler.storage.mem.MemStorage.read(MemStorage.java:90)
 at 
org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
 at 
org.apache.aurora.scheduler.storage.durability.DurableStorage.read(DurableStorage.java:232)
 at 
org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:125)
 at 
org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(PendingTaskProcessor.java:141)
 at 
org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
 at 
org.apache.aurora.scheduler.preemptor.PreemptorModule$PreemptorService.runOneIteration(PreemptorModule.java:184)
 at 
com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:188)
 at 
com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
 at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
 E0620 14:07:26.743 [PreemptorService RUNNING, 
GuavaUtils$LifecycleShutdownListener] Service: PreemptorService [FAILED] failed 
unexpectedly. Triggering shutdown.
{code}

> PreemptionVictimFilterImpl comparator violates transitivity causing exceptions
> --
>
> Key: AURORA-1949
> URL: https://issues.apache.org/jira/browse/AURORA-1949
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Jordan Ly
>Assignee: Jordan Ly
>Priority: Critical
>
> The PreemptionVictimFilterImpl uses a comparator to sort ResourceBags in 
> order to preempt the biggest tasks first when searching for a victim. 
> However, the current implementation can throw an exception that causes the 
> Scheduler to fail:
> {noformat}
> SEVERE: Service PreemptorService [FAILED] has failed in the RUNNING state.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeLo(TimSort.java:777)
> at java.util.TimSort.mergeAt(TimSort.java:514)
> at java.util.TimSort.mergeCollapse(TimSort.java:441)
> at java.util.TimSort.sort(TimSort.java:245)
> at java.util.Arrays.sort(Arrays.java:1438)
> at 
> com.google.common.collect.Ordering.immutableSortedCopy(Ordering.java:882)
> at 
> org.apache.aurora.scheduler.preemptor.PreemptionVictimFilter$PreemptionVictimFilterImpl.filterPreemptionVictims(PreemptionVictimFilter.java:210)
> at 
> 

[jira] [Commented] (AURORA-1958) Improve Vagrant setup with vagrant-hostmanager

2018-02-25 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16376054#comment-16376054
 ] 

Stephan Erb commented on AURORA-1958:
-

Hi Rogier,

sorry for the super late response. The problem with this plugin is that it 
requires root privileges on the host. I am not a great fan of this.

The Aurora client config allows setting a [proxy 
URL|https://github.com/apache/aurora/blob/master/docs/reference/client-cluster-configuration.md#proxy_url].
In the Vagrant environment one could change the config to:
{code:java}
[{
  "name": "devcluster",
  "zk": "192.168.33.7",
  "scheduler_zk_path": "/aurora/scheduler",
  "auth_mechanism": "UNAUTHENTICATED",
  "slave_run_directory": "latest",
  "slave_root": "/var/lib/mesos",
  "proxy_url": "http://192.168.33.7:8081/;
}]
{code}
This would then lead to the generation of URLs that work both outside and inside 
of the VM:
{code:java}
vagrant@aurora:~/aurora$ aurora job create devcluster/www-data/prod/hello 
examples/jobs/hello_world.aurora
 INFO] Creating job hello
 INFO] Checking status of devcluster/www-data/prod/hello
Job create succeeded: job 
url=http://192.168.33.7:8081/scheduler/www-data/prod/hello
{code}
This would at least solve the immediate issue with the URLs generated by the 
Aurora client.

A completely different alternative, which should also solve the Mesos case, 
would be to change the hostname of the VM to localhost and ensure via 
portmapping that "localhost:8081" means the same thing inside and outside of 
the VM. This is how we currently do it for the packaging tests (see [this 
vagrant ssh 
command|https://github.com/apache/aurora-packaging/blob/master/test/test-artifact.sh#L28]).
 I have not tested this, but I [guess the config could also be moved to the 
Vagrantfile|https://www.vagrantup.com/docs/networking/forwarded_ports.html] so 
that it works out of the box.

Would any of those ideas work for you?

> Improve Vagrant setup with vagrant-hostmanager
> --
>
> Key: AURORA-1958
> URL: https://issues.apache.org/jira/browse/AURORA-1958
> Project: Aurora
>  Issue Type: Task
>  Components: Usability
>Affects Versions: 0.19.0
> Environment: Vagrant setup
>Reporter: Rogier Dikkes
>Priority: Trivial
>  Labels: newbie, vagrant
> Fix For: 0.20.0
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> Currently the Vagrant devcluster is set up with aurora.local in the /etc/hosts 
> file of the guest virtual machine; however, the /etc/hosts file on the host 
> machine where Vagrant is running does not get changed. The result is that 
> clients outside of the Vagrant environment have issues connecting to the 
> devcluster, and the URLs generated within the devcluster environment do not 
> work when you use them from your browser. An example of this is the URL of the 
> Aurora framework on the Mesos master page. 
> I found vagrant-hostmanager, which is easy to implement: it generates the 
> /etc/hosts entries and removes them upon destroying the Vagrant setup. It also 
> has the ability to add entries within the Vagrant environment; for now I left 
> that out of the scope of this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AURORA-1971) Access to the job name in aurora configuration?

2018-02-09 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358591#comment-16358591
 ] 

Stephan Erb commented on AURORA-1971:
-

What you are observing is the behaviour described here 
[https://github.com/apache/aurora/blob/master/docs/reference/configuration-templating.md#mustaches-within-structurals].
  

You cannot see the name of the job as it is shadowed by the name of the process 
(see [https://github.com/wickman/pystachio#object-scopes] for details on 
scoping). A possible workaround could be to define a custom pystachio variable 
and use it in both places:
{code:java}
hello = Process(
  name='my_process_name',
  cmdline="""
while true; do
  echo {{job_name}}
  sleep 10
done
  """)

task = SequentialTask(
  processes=[hello],
  resources=Resources(cpu = 1.0, ram = 128*MB, disk = 128*MB))

jobs = [
   Service(
  task=task,
  cluster='devcluster',
  role = 'www-data',
  environment = 'prod',
  name = '{{job_name}}'
   ).bind(job_name='hello')
]
{code}



> Access to the job name in aurora configuration?
> ---
>
> Key: AURORA-1971
> URL: https://issues.apache.org/jira/browse/AURORA-1971
> Project: Aurora
>  Issue Type: Story
>Reporter: Allan Feid
>Priority: Minor
>
> I see there are a few different variables exposed in the pystachio 
> configurations (environment, role, task.name, mesos); however, I have not been 
> able to figure out how to extract the job name. It seems a job's name 
> defaults to task.name but in my case these are not the same. The use case for 
> this is to simply export environment variables that give running processes 
> access to their environment, role, and job names.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AURORA-1972) Flapping Executor HealthChecker Test

2018-02-09 Thread Stephan Erb (JIRA)
Stephan Erb created AURORA-1972:
---

 Summary: Flapping Executor HealthChecker Test
 Key: AURORA-1972
 URL: https://issues.apache.org/jira/browse/AURORA-1972
 Project: Aurora
  Issue Type: Bug
  Components: Executor
Reporter: Stephan Erb


We currently have a set of flapping HealthChecker tests that prevent our builds 
from passing (e.g. see https://reviews.apache.org/r/65565/).

{code}
  FAILURES 
  
TestThreadedHealthCheckerWithDefaults.test_run_unhealthy_after_callback

 self = 

 mock_sleep = 

 
@mock.patch('apache.aurora.executor.common.health_checker.time.sleep', 
spec=time.sleep)
 def test_run_unhealthy_after_callback(self, 
mock_sleep):
   mock_sleep.return_value = None
   health_status = [(True, None), (True, None), (False, 
'failure-4'), (False, 'failure-5')]
   self.health.side_effect = lambda: 
health_status.pop(0)
   mock_is_set = mock.Mock(spec=threading._Event.is_set)
   liveness = [False, False, False, False, True]
   mock_is_set.side_effect = lambda: liveness.pop(0)
   
self.health_checker.threaded_health_checker.dead.is_set = mock_is_set
   self.health_checker.threaded_health_checker.run()
 > assert mock_sleep.call_count == 4
 E AssertionError: assert 9403 == 4
 E  +  where 9403 = .call_count

 
.pants.d/pyprep/sources/365105c9a0472d6a1d7576426d316fe2aa7dcc77/apache/aurora/executor/common/test_health_checker.py:1292:
 AssertionError
{code}
Please notice the huge difference between actual and expected calls.



This works as expected:
{code}
./pants --cache-ignore --no-test-pytest-fast test.pytest  src/test/python::
{code}

This triggers the problem with traces like the one posted above:
{code}
./pants --cache-ignore --no-test-pytest-fast test.pytest  src/test/python::
{code}

The flapping seems to depend on how pants executes the tests, which appears to 
have a side effect on how the {{time.sleep}} mocking is performed.

The relevant [test 
class|https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/test/python/apache/aurora/executor/common/test_health_checker.py#L1033]
 and the [failing 
tests|https://github.com/apache/aurora/blob/c85bffdd6f68312261697eee868d57069adda434/src/test/python/apache/aurora/executor/common/test_health_checker.py#L1265-L1280]
 in particular are somewhat low-quality as they make heavy use of mocking.

How do we want to proceed here? 




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (AURORA-1233) primary_port warning does not appear to respect portmap

2018-02-01 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-1233.
-
Resolution: Fixed

> primary_port warning does not appear to respect portmap
> ---
>
> Key: AURORA-1233
> URL: https://issues.apache.org/jira/browse/AURORA-1233
> Project: Aurora
>  Issue Type: Bug
>  Components: Client
>Reporter: Kevin Sweeney
>Assignee: Stephan Erb
>Priority: Minor
>
> From an internal bug report:
> {noformat}
> *Announcer specified primary port as 'thrift' but no processes have bound 
> that port.
> If you would like to utilize this port, you should listen on 
> {{thermos.ports\[thrift]}}
> from some Process bound to your task.*
> However, we *are* using that port in our Process.
> '-fetcher_port={{thermos.ports\[thrift]}}',
> It seems that when the primary port is statically linked, it causes this 
> warning.
> announce = Announcer(primary_port = 'thrift', portmap = {'thrift': 10001, 
> 'http': 8080, 'aurora': 'http', 'health': 'http'}
> If I don't statically link it, then it runs as normal.
> FYI: Everything works as expected. The only issue is that warning/error is 
> displayed in error.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (AURORA-1233) primary_port warning does not appear to respect portmap

2018-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb reassigned AURORA-1233:
---

Assignee: Stephan Erb

> primary_port warning does not appear to respect portmap
> ---
>
> Key: AURORA-1233
> URL: https://issues.apache.org/jira/browse/AURORA-1233
> Project: Aurora
>  Issue Type: Bug
>  Components: Client
>Reporter: Kevin Sweeney
>Assignee: Stephan Erb
>Priority: Minor
>
> From an internal bug report:
> {noformat}
> *Announcer specified primary port as 'thrift' but no processes have bound 
> that port.
> If you would like to utilize this port, you should listen on 
> {{thermos.ports\[thrift]}}
> from some Process bound to your task.*
> However, we *are* using that port in our Process.
> '-fetcher_port={{thermos.ports\[thrift]}}',
> It seems that when the primary port is statically linked, it causes this 
> warning.
> announce = Announcer(primary_port = 'thrift', portmap = {'thrift': 10001, 
> 'http': 8080, 'aurora': 'http', 'health': 'http'}
> If I don't statically link it, then it runs as normal.
> FYI: Everything works as expected. The only issue is that warning/error is 
> displayed in error.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AURORA-1964) Move Vagrant setup from Trusty to Xenial

2018-01-19 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331922#comment-16331922
 ] 

Stephan Erb commented on AURORA-1964:
-

[~rdelvalle] a simple workaround for the Packer Thrift issue would be to not 
include Thrift at all. With the current pants setup, we download a 
pre-compiled Thrift automatically. It is therefore no longer necessary to have a 
system version installed.

> Move Vagrant setup from Trusty to Xenial
> 
>
> Key: AURORA-1964
> URL: https://issues.apache.org/jira/browse/AURORA-1964
> Project: Aurora
>  Issue Type: Task
>Reporter: Renan DelValle
>Assignee: Renan DelValle
>Priority: Major
>
> We're really behind the curve on this one as the next LTS will be released in 
> April.
> The move is made difficult by the change in init systems between Trusty and 
> Xenial.
> Furthermore, our recent upgrade to Thrift 0.10.0 has caused some issues with 
> our Packer setup, as the deb packages for 0.10.0 are not in the correct 
> repository. Latest version in the repository is 0.9.3: 
> http://dl.bintray.com/apache/thrift/debian/dists/
> Making Packer fail at: 
> https://github.com/apache/aurora/blob/master/build-support/packer/build.sh#L118
> [~jfarrell] any chance you can help us unblock this by releasing official 
> packages?
> Otherwise, we could compile 0.10.0 from scratch in our Packer process, but 
> that might balloon the image size somewhat.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AURORA-1962) Incorrect parsing of empty strings into list command line options

2017-12-23 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16302469#comment-16302469
 ] 

Stephan Erb commented on AURORA-1962:
-

Workaround for our packaging scripts https://reviews.apache.org/r/64824

> Incorrect parsing of empty strings into list command line options
> -
>
> Key: AURORA-1962
> URL: https://issues.apache.org/jira/browse/AURORA-1962
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.19.0
>Reporter: Bill Farner
>
> When the scheduler parses a command line option like 
> {{-thermos_executor_resources=}}, which maps to {{List<String>}}, the result 
> is equivalent to {{[""]}} (list of size 1 containing an empty string), while 
> we would expect {{[]}} (an empty list).
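
For illustration of the general pitfall (a sketch using Guava's Splitter; not necessarily the scheduler's actual parsing code path), splitting an empty option value yields a one-element list containing "" unless empty segments are filtered out explicitly:

{code:java}
import com.google.common.base.Splitter;
import java.util.List;

public class EmptyListOptionDemo {
  public static void main(String[] args) {
    // Splitting "" naively produces a list of size 1 holding an empty string.
    List<String> naive = Splitter.on(',').splitToList("");
    // Dropping empty segments produces the empty list we would expect.
    List<String> filtered = Splitter.on(',').omitEmptyStrings().splitToList("");
    System.out.println(naive);     // [""]
    System.out.println(filtered);  // []
  }
}
{code}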



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1961) Aurora build is flaky

2017-12-12 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287552#comment-16287552
 ] 

Stephan Erb commented on AURORA-1961:
-

Regarding Issue c): The package {{libkrb5-dev}} is missing on the new Xenial 
build slaves. This leads to the pykerberos build error above
{code}
# Trusty:
ii  krb5-locales
1.12+dfsg-2ubuntu5.2   all  Internationalization 
support for MIT Kerberos
ii  krb5-multidev   
1.12+dfsg-2ubuntu5.2   amd64Development files for 
MIT Kerberos without Heimdal conflict
ii  libgssapi-krb5-2:amd64  
1.12+dfsg-2ubuntu5.2   amd64MIT Kerberos runtime 
libraries - krb5 GSS-API Mechanism
ii  libkrb5-26-heimdal:amd64
1.6~git20131207+dfsg-1ubuntu1.2amd64Heimdal Kerberos - 
libraries
ii  libkrb5-3:amd64 
1.12+dfsg-2ubuntu5.2   amd64MIT Kerberos runtime 
libraries
ii  libkrb5-dev 
1.12+dfsg-2ubuntu5.2   amd64Headers and development 
libraries for MIT Kerberos
ii  libkrb5support0:amd64  

# Xenial
ii  krb5-locales
1.13.2+dfsg-5ubuntu2   all  Internationalization 
support for MIT Kerberos
ii  libgssapi-krb5-2:amd64  
1.13.2+dfsg-5ubuntu2   amd64MIT Kerberos runtime 
libraries - krb5 GSS-API Mechanism
ii  libkrb5-26-heimdal:amd64
1.7~git20150920+dfsg-4ubuntu1.16.04.1  amd64Heimdal Kerberos - 
libraries
ii  libkrb5-3:amd64 
1.13.2+dfsg-5ubuntu2   amd64MIT Kerberos runtime 
libraries
ii  libkrb5support0:amd64   
1.13.2+dfsg-5ubuntu2   amd64MIT Kerberos runtime 
libraries - 
{code}

As a workaround I have pinned {{AuroraBot}} to trusty build slaves for now. 
Before I file an issue with Apache Infra: [~jsirois] any idea why this error 
only shows up for the new pants version?

> Aurora build is flaky
> -
>
> Key: AURORA-1961
> URL: https://issues.apache.org/jira/browse/AURORA-1961
> Project: Aurora
>  Issue Type: Bug
>Reporter: Stephan Erb
>
> The current Aurora build is flaky. This is causing unnecessary headaches for 
> our contributors.
> *Affected patches right now:*
> * https://reviews.apache.org/r/64341/ by [~jingc]
> * https://reviews.apache.org/r/64290/ by [~jsirois]
> *Observed issues:*
> a) Build prints hundreds of lines of spotbugs output rather than capping it 
> at 40 lines as configured in Jenkins.
> {code}
> Pass 2: Analyzing classes (0 / 59) - 00% complete
> Pass 2: Analyzing classes (1 / 59) - 01% complete
> Pass 2: Analyzing classes (2 / 59) - 03% complete
> Pass 2: Analyzing classes (3 / 59) - 05% complete
> {code}
> There is an [upstream issue|https://github.com/spotbugs/spotbugs/issues/506] 
> meant to reduce the output itself. In any case, our build script should 
> properly tail only the last lines of it.
> b) Failing webhooks test. [The attempted 
> fix|https://github.com/apache/aurora/commit/ef24c2ce355e857c4fcce531b4f16028a6c6e75d]
>  does not seem to work 
> {code}
> org.apache.aurora.scheduler.events.WebhookTest > 
> testTaskChangedWithOldStateError FAILED
> java.lang.AssertionError at WebhookTest.java:193
> {code}
> c) Kerberos development headers missing on some build slaves
> {code}
>Invalidated 1 target. Failed to install 
> pykerberos-1.1.14 (caused by: NonZeroExit("received exit code 1 during 
> execution of `[u'/usr/bin/python2.7', '-', 'bdist_wheel', 
> '--dist-dir=/tmp/tmpoky0pr']` while trying to execute 
> `[u'/usr/bin/python2.7', '-', 'bdist_wheel', '--dist-dir=/tmp/tmpoky0pr']`",)
> ):
> stdout:
> running bdist_wheel
> running build
> running build_ext
> building 'kerberos' extension
> creating build
> creating build/temp.linux-x86_64-2.7
> creating build/temp.linux-x86_64-2.7/src
> x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall 
> -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g 
> -fstack-protector-strong -Wformat -Werror=format-security -fPIC 
> -I/usr/include/python2.7 -c src/kerberos.c -o 
> build/temp.linux-x86_64-2.7/src/kerberos.o
> stderr:
> In file included from src/kerberos.c:19:0:
> src/kerberosbasic.h:17:27: fatal error: gssapi/gssapi.h: No such file or 
> directory
> compilation terminated.
> error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
> 07:19:10 00:02   

[jira] [Commented] (AURORA-1961) Aurora build is flaky

2017-12-11 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286528#comment-16286528
 ] 

Stephan Erb commented on AURORA-1961:
-

Convert carriage returns to newlines in reviews 
https://reviews.apache.org/r/64508/

> Aurora build is flaky
> -
>
> Key: AURORA-1961
> URL: https://issues.apache.org/jira/browse/AURORA-1961
> Project: Aurora
>  Issue Type: Bug
>Reporter: Stephan Erb
>
> The current Aurora build is flaky. This is causing unnecessary headaches for 
> our contributors.
> *Affected patches right now:*
> * https://reviews.apache.org/r/64341/ by [~jingc]
> * https://reviews.apache.org/r/64290/ by [~jsirois]
> *Observed issues:*
> a) Build prints hundreds of lines of spotbugs output rather than capping it 
> at 40 lines as configured in Jenkins.
> {code}
> Pass 2: Analyzing classes (0 / 59) - 00% complete
> Pass 2: Analyzing classes (1 / 59) - 01% complete
> Pass 2: Analyzing classes (2 / 59) - 03% complete
> Pass 2: Analyzing classes (3 / 59) - 05% complete
> {code}
> There is an [upstream issue|https://github.com/spotbugs/spotbugs/issues/506] 
> meant to reduce the output itself. In any case, our build script should 
> properly tail only the last lines of it.
> b) Failing webhooks test. [The attempted 
> fix|https://github.com/apache/aurora/commit/ef24c2ce355e857c4fcce531b4f16028a6c6e75d]
>  does not seem to work 
> {code}
> org.apache.aurora.scheduler.events.WebhookTest > 
> testTaskChangedWithOldStateError FAILED
> java.lang.AssertionError at WebhookTest.java:193
> {code}
> c) Kerberos development headers missing on some build slaves
> {code}
>Invalidated 1 target. Failed to install 
> pykerberos-1.1.14 (caused by: NonZeroExit("received exit code 1 during 
> execution of `[u'/usr/bin/python2.7', '-', 'bdist_wheel', 
> '--dist-dir=/tmp/tmpoky0pr']` while trying to execute 
> `[u'/usr/bin/python2.7', '-', 'bdist_wheel', '--dist-dir=/tmp/tmpoky0pr']`",)
> ):
> stdout:
> running bdist_wheel
> running build
> running build_ext
> building 'kerberos' extension
> creating build
> creating build/temp.linux-x86_64-2.7
> creating build/temp.linux-x86_64-2.7/src
> x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall 
> -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g 
> -fstack-protector-strong -Wformat -Werror=format-security -fPIC 
> -I/usr/include/python2.7 -c src/kerberos.c -o 
> build/temp.linux-x86_64-2.7/src/kerberos.o
> stderr:
> In file included from src/kerberos.c:19:0:
> src/kerberosbasic.h:17:27: fatal error: gssapi/gssapi.h: No such file or 
> directory
> compilation terminated.
> error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
> 07:19:10 00:02   [complete]
>FAILURE
> Exception caught: ()
> Exception message: Package 
> SourcePackage(u'file:///home/jenkins/jenkins-slave/workspace/AuroraBot/.pants.d/python-setup/resolved_requirements/CPython-2.7.12/pykerberos-1.1.14.tar.gz')
>  is not translateable by ChainedTranslator(WheelTranslator, EggTranslator, 
> SourceTranslator)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1961) Aurora build is flaky

2017-12-11 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286463#comment-16286463
 ] 

Stephan Erb commented on AURORA-1961:
-

I am starting with the less critical but most annoying one: a)

> Aurora build is flaky
> -
>
> Key: AURORA-1961
> URL: https://issues.apache.org/jira/browse/AURORA-1961
> Project: Aurora
>  Issue Type: Bug
>Reporter: Stephan Erb
>
> The current Aurora build is flaky. This is causing unnecessary headaches for 
> our contributors.
> *Affected patches right now:*
> * https://reviews.apache.org/r/64341/ by [~jingc]
> * https://reviews.apache.org/r/64290/ by [~jsirois]
> *Observed issues:*
> a) Build prints hundreds of lines of spotbugs output rather than capping it 
> at 40 lines as configured in Jenkins.
> {code}
> Pass 2: Analyzing classes (0 / 59) - 00% complete
> Pass 2: Analyzing classes (1 / 59) - 01% complete
> Pass 2: Analyzing classes (2 / 59) - 03% complete
> Pass 2: Analyzing classes (3 / 59) - 05% complete
> {code}
> There is an [upstream issue|https://github.com/spotbugs/spotbugs/issues/506] 
> meant to reduce the output itself. In any case, our build script should 
> properly tail only the last lines of it.
> b) Failing webhooks test. [The attempted 
> fix|https://github.com/apache/aurora/commit/ef24c2ce355e857c4fcce531b4f16028a6c6e75d]
>  does not seem to work 
> {code}
> org.apache.aurora.scheduler.events.WebhookTest > 
> testTaskChangedWithOldStateError FAILED
> java.lang.AssertionError at WebhookTest.java:193
> {code}
> c) Kerberos development headers missing on some build slaves
> {code}
>Invalidated 1 target. Failed to install 
> pykerberos-1.1.14 (caused by: NonZeroExit("received exit code 1 during 
> execution of `[u'/usr/bin/python2.7', '-', 'bdist_wheel', 
> '--dist-dir=/tmp/tmpoky0pr']` while trying to execute 
> `[u'/usr/bin/python2.7', '-', 'bdist_wheel', '--dist-dir=/tmp/tmpoky0pr']`",)
> ):
> stdout:
> running bdist_wheel
> running build
> running build_ext
> building 'kerberos' extension
> creating build
> creating build/temp.linux-x86_64-2.7
> creating build/temp.linux-x86_64-2.7/src
> x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall 
> -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g 
> -fstack-protector-strong -Wformat -Werror=format-security -fPIC 
> -I/usr/include/python2.7 -c src/kerberos.c -o 
> build/temp.linux-x86_64-2.7/src/kerberos.o
> stderr:
> In file included from src/kerberos.c:19:0:
> src/kerberosbasic.h:17:27: fatal error: gssapi/gssapi.h: No such file or 
> directory
> compilation terminated.
> error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
> 07:19:10 00:02   [complete]
>FAILURE
> Exception caught: ()
> Exception message: Package 
> SourcePackage(u'file:///home/jenkins/jenkins-slave/workspace/AuroraBot/.pants.d/python-setup/resolved_requirements/CPython-2.7.12/pykerberos-1.1.14.tar.gz')
>  is not translateable by ChainedTranslator(WheelTranslator, EggTranslator, 
> SourceTranslator)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AURORA-1961) Aurora build is flaky

2017-12-11 Thread Stephan Erb (JIRA)
Stephan Erb created AURORA-1961:
---

 Summary: Aurora build is flaky
 Key: AURORA-1961
 URL: https://issues.apache.org/jira/browse/AURORA-1961
 Project: Aurora
  Issue Type: Bug
Reporter: Stephan Erb


The current Aurora build is flaky. This is causing unnecessary headaches for our 
contributors.

*Affected patches right now:*

* https://reviews.apache.org/r/64341/ by [~jingc]
* https://reviews.apache.org/r/64290/ by [~jsirois]

*Observed issues:*

a) Build prints hundreds of lines of spotbugs output rather than capping it at 
40 lines as configured in Jenkins.
{code}
Pass 2: Analyzing classes (0 / 59) - 00% complete
Pass 2: Analyzing classes (1 / 59) - 01% complete
Pass 2: Analyzing classes (2 / 59) - 03% complete
Pass 2: Analyzing classes (3 / 59) - 05% complete
{code}
There is an [upstream issue|https://github.com/spotbugs/spotbugs/issues/506] 
meant to reduce the output itself. In any case, our build script should properly 
tail only the last lines of it.

b) Failing webhooks test. [The attempted 
fix|https://github.com/apache/aurora/commit/ef24c2ce355e857c4fcce531b4f16028a6c6e75d]
 does not seem to work 
{code}
org.apache.aurora.scheduler.events.WebhookTest > 
testTaskChangedWithOldStateError FAILED
java.lang.AssertionError at WebhookTest.java:193
{code}

c) Kerberos development headers missing on some build slaves
{code}
   Invalidated 1 target. Failed to install 
pykerberos-1.1.14 (caused by: NonZeroExit("received exit code 1 during 
execution of `[u'/usr/bin/python2.7', '-', 'bdist_wheel', 
'--dist-dir=/tmp/tmpoky0pr']` while trying to execute `[u'/usr/bin/python2.7', 
'-', 'bdist_wheel', '--dist-dir=/tmp/tmpoky0pr']`",)
):
stdout:
running bdist_wheel
running build
running build_ext
building 'kerberos' extension
creating build
creating build/temp.linux-x86_64-2.7
creating build/temp.linux-x86_64-2.7/src
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes 
-fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g 
-fstack-protector-strong -Wformat -Werror=format-security -fPIC 
-I/usr/include/python2.7 -c src/kerberos.c -o 
build/temp.linux-x86_64-2.7/src/kerberos.o

stderr:
In file included from src/kerberos.c:19:0:
src/kerberosbasic.h:17:27: fatal error: gssapi/gssapi.h: No such file or 
directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1



07:19:10 00:02   [complete]
   FAILURE
Exception caught: ()

Exception message: Package 
SourcePackage(u'file:///home/jenkins/jenkins-slave/workspace/AuroraBot/.pants.d/python-setup/resolved_requirements/CPython-2.7.12/pykerberos-1.1.14.tar.gz')
 is not translateable by ChainedTranslator(WheelTranslator, EggTranslator, 
SourceTranslator)
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1949) PreemptionVictimFilterImpl comparator violates transitivity causing exceptions

2017-12-10 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285405#comment-16285405
 ] 

Stephan Erb commented on AURORA-1949:
-

[~jordanly] are you seeing this regularly or is there a known workaround? 

> PreemptionVictimFilterImpl comparator violates transitivity causing exceptions
> --
>
> Key: AURORA-1949
> URL: https://issues.apache.org/jira/browse/AURORA-1949
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Jordan Ly
>Assignee: Jordan Ly
>Priority: Critical
>
> The PreemptionVictimFilterImpl uses a comparator to sort ResourceBags in 
> order to preempt the biggest tasks first when searching for a victim. 
> However, the current implementation can throw an exception that causes the 
> Scheduler to fail:
> {noformat}
> SEVERE: Service PreemptorService [FAILED] has failed in the RUNNING state.
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeLo(TimSort.java:777)
> at java.util.TimSort.mergeAt(TimSort.java:514)
> at java.util.TimSort.mergeCollapse(TimSort.java:441)
> at java.util.TimSort.sort(TimSort.java:245)
> at java.util.Arrays.sort(Arrays.java:1438)
> at 
> com.google.common.collect.Ordering.immutableSortedCopy(Ordering.java:882)
> at 
> org.apache.aurora.scheduler.preemptor.PreemptionVictimFilter$PreemptionVictimFilterImpl.filterPreemptionVictims(PreemptionVictimFilter.java:210)
> at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$0(PendingTaskProcessor.java:178)
> at 
> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:147)
> at 
> org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
> at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> at 
> org.apache.aurora.scheduler.storage.log.LogStorage.read(LogStorage.java:562)
> at 
> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:113)
> at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(PendingTaskProcessor.java:135)
> at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> at 
> org.apache.aurora.scheduler.preemptor.PreemptorModule$PreemptorService.runOneIteration(PreemptorModule.java:205)
> at 
> com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:188)
> at 
> com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Looking at the code, it seems the comparator violates transitivity:
> {code:java}
> @VisibleForTesting
> static final Ordering<ResourceBag> ORDER = new Ordering<ResourceBag>() {
>   @Override
>   public int compare(ResourceBag left, ResourceBag right) {
>     Set<ResourceType> types = ImmutableSet.<ResourceType>builder()
>         .addAll(left.streamResourceVectors().map(e -> e.getKey()).iterator())
>         .addAll(right.streamResourceVectors().map(e -> e.getKey()).iterator())
>         .build();
>     boolean allZero = true;
>     boolean allGreaterOrEqual = true;
>     boolean allLessOrEqual = true;
>     for (ResourceType type : types) {
>       int compare = left.valueOf(type).compareTo(right.valueOf(type));
>       if (compare != 0) {
>         allZero = false;
>       }
>       if (compare < 0) {
>         allGreaterOrEqual = false;
>       }
>       if (compare > 0) {
>         allLessOrEqual = false;
>       }
>     }
>     if (allZero) {
>       return 0;
>     }
>     if (allGreaterOrEqual) {
>       return 1;
>     }
>     if (allLessOrEqual) {
>       return -1;
>     }
>     return 0;
>   }
> };
> {code}
> The example below illustrates the error:
> {noformat}
> Resource:X Y Z
> Bag A:   2 0 2
> Bag B:   1 2 1
> Bag C:   2 2 1
> {noformat}
> We 

[jira] [Resolved] (AURORA-1412) Conflicting `slow_query_log_threshold` arguments for aurora-scheduler

2017-12-10 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-1412.
-
Resolution: Fixed
  Assignee: Bill Farner

Solved via 
https://github.com/apache/aurora/commit/94276046606da4e1491ee3d0e0c29cd3649a82e6

> Conflicting `slow_query_log_threshold` arguments for aurora-scheduler
> -
>
> Key: AURORA-1412
> URL: https://issues.apache.org/jira/browse/AURORA-1412
> Project: Aurora
>  Issue Type: Story
>Affects Versions: 0.10.0
> Environment: Ubuntu 14.04, OpenJDK8, Mesos 0.22.1. Running inside 
> Docker container.
>Reporter: Anthony Seure
>Assignee: Bill Farner
>Priority: Minor
>
> When starting the {{aurora-scheduler}}, a WARNING from 
> {{com.twitter.common.args.ArgScanner}} is yelled. The issue is that the 
> {{slow_query_log_threshold}} option is used to configure both:
>  - 
> {{org.apache.aurora.scheduler.storage.db.DbModule.slow_query_log_threshold}}
>  - 
> {{org.apache.aurora.scheduler.storage.mem.MemTaskStore.slow_query_log_threshold}}
> Even if the parameter is used to set them both with the same value, this 
> warning should be avoided by giving explicit and different names for these 
> two options.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (AURORA-1380) Upgrade to guice 4.0

2017-12-09 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb reassigned AURORA-1380:
---

Resolution: Fixed
  Assignee: Bill Farner

Resolved in https://reviews.apache.org/r/64362/

> Upgrade to guice 4.0
> 
>
> Key: AURORA-1380
> URL: https://issues.apache.org/jira/browse/AURORA-1380
> Project: Aurora
>  Issue Type: Story
>  Components: Scheduler
>Reporter: Kevin Sweeney
>Assignee: Bill Farner
>Priority: Critical
>
> Guice 4.0 has been released. Among the new features, probably the most 
> significant is Java 8 support - in Guice 3.0 stack traces are obfuscated by 
> https://github.com/google/guice/issues/757. As our code expands use of 
> lambdas and method references this will become even more critical.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (AURORA-1471) Reconsider use of shiro-guice

2017-12-09 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-1471.
-
Resolution: Fixed
  Assignee: Bill Farner

Guice update (with necessary shiro adjustments) was completed in 
https://reviews.apache.org/r/64362/

> Reconsider use of shiro-guice
> -
>
> Key: AURORA-1471
> URL: https://issues.apache.org/jira/browse/AURORA-1471
> Project: Aurora
>  Issue Type: Story
>  Components: Scheduler, Security
>Reporter: Kevin Sweeney
>Assignee: Bill Farner
>
> shiro-guice is a wrapper around shiro core, and unfortunately uses the guice 
> SPI in a way that broke between 3.0 and 4.0 (see SHIRO-493). Consider the 
> possibility of removing use of its guice wrapper in favor of standard 
> guice-servlet idioms and configuring the shiro-core object graph directly.
> Alternatively investigate contributing a patch upstream.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (AURORA-1249) Upgrade 3rdparty python dependencies

2017-12-09 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-1249.
-
Resolution: Fixed
  Assignee: Stephan Erb

Solved via https://reviews.apache.org/r/64382/

> Upgrade 3rdparty python dependencies
> 
>
> Key: AURORA-1249
> URL: https://issues.apache.org/jira/browse/AURORA-1249
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Executor
>Reporter: Kevin Sweeney
>Assignee: Stephan Erb
>Priority: Minor
>  Labels: newbie
>
> Some of our thirdparty dependencies have updates - we should consider 
> incorporating them:
> {noformat}
> (pycharm.venv)~aurora git aurora/. kts/32559
> % pip-review
> No update information found for apache.gen.aurora
> No update information found for apache.gen.thermos
> coverage==4.0a5 is available (you have 3.7.1)
> futures==2.2.0 is available (you have 2.1.6)
> kazoo==2.0 is available (you have 1.3.1)
> mesos.interface==0.22.0 is available (you have 0.21.1)
> pex==0.8.6 is available (you have 0.8.2)
> protobuf==3.0.0-alpha-1 is available (you have 2.6.1)
> psutil==2.2.1 is available (you have 2.1.3)
> pytest==2.7.0 is available (you have 2.6.4)
> requests==2.6.0 is available (you have 2.3.0)
> thrift==0.9.2 is available (you have 0.9.1)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1750) Expose Aurora task metadata to thermos task

2017-12-09 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284730#comment-16284730
 ] 

Stephan Erb commented on AURORA-1750:
-

RB patch https://reviews.apache.org/r/64341/

> Expose Aurora task metadata to thermos task
> ---
>
> Key: AURORA-1750
> URL: https://issues.apache.org/jira/browse/AURORA-1750
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Min Cai
>Priority: Minor
>
> Much like how we expose mesos hostname, aurora instance number, etc to 
> thermos I think we should be able to expose Aurora task metadata to thermos 
> tasks.
> I don't foresee complexity or harm for this, but it allows users to plumb more 
> information into the task. For example, one could encode a 'package version' 
> or 'build pipeline' or 'audit pipeline' metadata into the task. The task 
> could then expose this to others or act differently if required.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Reopened] (AURORA-942) Explore using a replicated log on top of ZooKeeper

2017-11-29 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb reopened AURORA-942:


https://reviews.apache.org/r/64126/

> Explore using a replicated log on top of ZooKeeper
> --
>
> Key: AURORA-942
> URL: https://issues.apache.org/jira/browse/AURORA-942
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Bill Farner
>Assignee: Bill Farner
>Priority: Minor
>
> The scheduler uses the replicated log implementation provided by mesos 
> (native libmesos.so).  It would be interesting to compare this against a 
> replacement that allows us to:
> - shed code to implement backups and recovery
> - remove one use of a dynamically-linked native library
> - use a store that allows non-leaders to read, for faster recovery and 
> serving from non-active members
> - avoid the need for periodic failover (we currently have to do this to 
> induce compaction in LevelDB and minimize log replay time)
> At first glance, it seems like it would be relatively straightforward to come 
> up with a Log implementation \[1\] that persists transactions as nodes in 
> ZooKeeper.  This would enable all the above results.
> \[1\] 
> https://github.com/apache/incubator-aurora/blob/10da38a3a0ad6ebbee055c26adc3ed3437ec3930/src/main/java/org/apache/aurora/scheduler/log/Log.java#L26
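
To sketch the general idea (this is not an implementation of the Log interface referenced above; the class and parent path are illustrative), each appended transaction could become a persistent sequential znode, so ZooKeeper's sequence numbers give a totally ordered, replicated log:

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Rough sketch of "transactions as nodes in ZooKeeper".
public class ZkLogSketch {
  private final ZooKeeper zk;
  private final String parent;  // e.g. "/aurora/replicated-log" (illustrative path)

  public ZkLogSketch(ZooKeeper zk, String parent) {
    this.zk = zk;
    this.parent = parent;
  }

  // Appends one entry; the returned path ends in the monotonically increasing
  // sequence number that ZooKeeper assigns to the new znode.
  public String append(byte[] entry) throws KeeperException, InterruptedException {
    return zk.create(
        parent + "/entry-",
        entry,
        ZooDefs.Ids.OPEN_ACL_UNSAFE,
        CreateMode.PERSISTENT_SEQUENTIAL);
  }
}
{code}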



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (AURORA-942) Explore using a replicated log on top of ZooKeeper

2017-11-29 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb reassigned AURORA-942:
--

Assignee: Bill Farner

> Explore using a replicated log on top of ZooKeeper
> --
>
> Key: AURORA-942
> URL: https://issues.apache.org/jira/browse/AURORA-942
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Bill Farner
>Assignee: Bill Farner
>Priority: Minor
>
> The scheduler uses the replicated log implementation provided by mesos 
> (native libmesos.so).  It would be interesting to compare this against a 
> replacement that allows us to:
> - shed code to implement backups and recovery
> - remove one use of a dynamically-linked native library
> - use a store that allows non-leaders to read, for faster recovery and 
> serving from non-active members
> - avoid the need for periodic failover (we currently have to do this to 
> induce compaction in LevelDB and minimize log replay time)
> At first glance, it seems like it would be relatively straightforward to come 
> up with a Log implementation \[1\] that persists transactions as nodes in 
> ZooKeeper.  This would enable all the above results.
> \[1\] 
> https://github.com/apache/incubator-aurora/blob/10da38a3a0ad6ebbee055c26adc3ed3437ec3930/src/main/java/org/apache/aurora/scheduler/log/Log.java#L26



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1955) thermos should exit on irrecoverable errors to avoid zombies

2017-11-02 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235548#comment-16235548
 ] 

Stephan Erb commented on AURORA-1955:
-

Patch has been committed to master.

> thermos should exit on irrecoverable errors to avoid zombies
> 
>
> Key: AURORA-1955
> URL: https://issues.apache.org/jira/browse/AURORA-1955
> Project: Aurora
>  Issue Type: Bug
>  Components: Thermos
>Reporter: Mohit Jaggi
>Assignee: Stephan Erb
>Priority: Major
>
> We found several zombie executors on a cluster. Thermos logs indicate 
> reaching system limits while trying to shut down(?). The Mesos agent is unable to 
> get status of this container from docker daemon (docker inspect fails). 
> Shouldn't thermos exit in such a case?
> {code}
>  22 WARNING: Your kernel does not support swap limit capabilities, memory 
> limited without swap.
>  23 twitter.common.app debug: Initializing: twitter.common.log (Logging 
> subsystem.)
>  24 Writing log files to disk in /mnt/mesos/sandbox
>  25 I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
>  26 I1023 19:04:32.26487042 exec.cpp:237] Executor registered on agent 
> b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
>  27 Writing log files to disk in /mnt/mesos/sandbox
>  28 Traceback (most recent call last):
>  29   File 
> "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
>  line 126, in _excepting_run
>  30 self.__real_run(*args, **kw)
>  31   File "apache/thermos/monitoring/resource.py", line 243, in run
>  32   File 
> "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py",
>  line 79, in wait
>  33 thread.start()
>  34   File "/usr/lib/python2.7/threading.py", line 745, in start
>  35 _start_new_thread(self.__bootstrap, ())
>  36 thread.error: can't start new thread
>  37 ERROR] Failed to stop health checkers:
>  38 ERROR] Traceback (most recent call last):
>  39   File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
>  40 propagate_deadline(self._chained_checker.stop, 
> timeout=self.STOP_TIMEOUT)
>  41   File "apache/aurora/executor/aurora_executor.py", line 35, in 
> propagate_deadline
>  42 return deadline(*args, daemon=True, propagate=True, **kw)
>  43   File 
> "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py",
>  line 61, in deadline
>  44 AnonymousThread().start()
>  45   File "/usr/lib/python2.7/threading.py", line 745, in start
>  46 _start_new_thread(self.__bootstrap, ())
>  47 error: can't start new thread
> 48
>  49 ERROR] Failed to stop runner:
> 50 ERROR] Traceback (most recent call last):
>  51   File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
>  52 propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
>  53   File "apache/aurora/executor/aurora_executor.py", line 35, in 
> propagate_deadline
>  54 return deadline(*args, daemon=True, propagate=True, **kw)
>  55   File 
> "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py",
>  line 61, in deadline
>  56 AnonymousThread().start()
>  57   File "/usr/lib/python2.7/threading.py", line 745, in start
>  58 _start_new_thread(self.__bootstrap, ())
>  59 error: can't start new thread
>  60
>  61 Traceback (most recent call last):
>  62   File 
> "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
>  line 126, in _excepting_run
>  63 self.__real_run(*args, **kw)
>  64   File "apache/aurora/executor/status_manager.py", line 62, in run
>  65   File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
>  66   File 
> "/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py",
>  line 56, in defer
>  67 deferred.start()
>  68   File "/usr/lib/python2.7/threading.py", line 745, in start
>  69 _start_new_thread(self.__bootstrap, ())
>  70 thread.error: can't start new thread
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (AURORA-1955) thermos should exit on irrecoverable errors to avoid zombies

2017-10-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1955:

Description: 
We found several zombie executors on a cluster. Thermos logs indicate reaching 
system limits while trying to shutdown(?). Mesos agent is unable to get status 
of this container from docker daemon (docker inspect fails). Shouldn't thermos 
exit in such a case?

{code}
 22 WARNING: Your kernel does not support swap limit capabilities, memory 
limited without swap.
 23 twitter.common.app debug: Initializing: twitter.common.log (Logging 
subsystem.)
 24 Writing log files to disk in /mnt/mesos/sandbox
 25 I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
 26 I1023 19:04:32.26487042 exec.cpp:237] Executor registered on agent 
b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
 27 Writing log files to disk in /mnt/mesos/sandbox
 28 Traceback (most recent call last):
 29   File 
"/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
 line 126, in _excepting_run
 30 self.__real_run(*args, **kw)
 31   File "apache/thermos/monitoring/resource.py", line 243, in run
 32   File 
"/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/event_muxer.py",
 line 79, in wait
 33 thread.start()
 34   File "/usr/lib/python2.7/threading.py", line 745, in start
 35 _start_new_thread(self.__bootstrap, ())
 36 thread.error: can't start new thread
 37 ERROR] Failed to stop health checkers:
 38 ERROR] Traceback (most recent call last):
 39   File "apache/aurora/executor/aurora_executor.py", line 209, in _shutdown
 40 propagate_deadline(self._chained_checker.stop, 
timeout=self.STOP_TIMEOUT)
 41   File "apache/aurora/executor/aurora_executor.py", line 35, in 
propagate_deadline
 42 return deadline(*args, daemon=True, propagate=True, **kw)
 43   File 
"/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py",
 line 61, in deadline
 44 AnonymousThread().start()
 45   File "/usr/lib/python2.7/threading.py", line 745, in start
 46 _start_new_thread(self.__bootstrap, ())
 47 error: can't start new thread
48
 49 ERROR] Failed to stop runner:
50 ERROR] Traceback (most recent call last):
 51   File "apache/aurora/executor/aurora_executor.py", line 217, in _shutdown
 52 propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
 53   File "apache/aurora/executor/aurora_executor.py", line 35, in 
propagate_deadline
 54 return deadline(*args, daemon=True, propagate=True, **kw)
 55   File 
"/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deadline.py",
 line 61, in deadline
 56 AnonymousThread().start()
 57   File "/usr/lib/python2.7/threading.py", line 745, in start
 58 _start_new_thread(self.__bootstrap, ())
 59 error: can't start new thread
 60
 61 Traceback (most recent call last):
 62   File 
"/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
 line 126, in _excepting_run
 63 self.__real_run(*args, **kw)
 64   File "apache/aurora/executor/status_manager.py", line 62, in run
 65   File "apache/aurora/executor/aurora_executor.py", line 235, in _shutdown
 66   File 
"/root/.pex/install/twitter.common.concurrent-0.3.7-py2-none-any.whl.f1ab836a5554c86d07fa3f075905c95fb20c78dd/twitter.common.concurrent-0.3.7-py2-none-any.whl/twitter/common/concurrent/deferred.py",
 line 56, in defer
 67 deferred.start()
 68   File "/usr/lib/python2.7/threading.py", line 745, in start
 69 _start_new_thread(self.__bootstrap, ())
 70 thread.error: can't start new thread
{code}

  was:
We found several zombie executors on a cluster. Thermos logs indicate reaching 
system limits while trying to shutdown(?). Mesos agent is unable to get status 
of this container from docker daemon (docker inspect fails). Shouldn't thermos 
exit in such a case?


 22 WARNING: Your kernel does not support swap limit capabilities, memory 
limited without swap.
 23 twitter.common.app debug: Initializing: twitter.common.log (Logging 
subsystem.)
 24 Writing log files to disk in /mnt/mesos/sandbox
 25 I1023 19:04:32.261165 7 exec.cpp:162] Version: 1.2.0
 26 I1023 19:04:32.26487042 exec.cpp:237] Executor registered on agent 
b4fff262-c925-4edf-a2ef-2a5bbe89c42b-S3295
 27 Writing log files to disk in 

[jira] [Resolved] (AURORA-1669) Kill twitter/commons ZK libs when Curator replacements are vetted

2017-10-23 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-1669.
-
Resolution: Fixed
  Assignee: Bill Farner  (was: John Sirois)

> Kill twitter/commons ZK libs when Curator replacements are vetted
> -
>
> Key: AURORA-1669
> URL: https://issues.apache.org/jira/browse/AURORA-1669
> Project: Aurora
>  Issue Type: Task
>Reporter: John Sirois
>Assignee: Bill Farner
>
> Once we have reports from production users that the Curator zk plumbing 
> introduced in AURORA-1468 is working well, the {{-zk_use_curator}} flag 
> should be deprecated and then the flag and commons code killed.  If the 
> vetting happens before the next release ({{0.14.0}}), we can dispense with a 
> deprecation cycle.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1669) Kill twitter/commons ZK libs when Curator replacements are vetted

2017-10-23 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215965#comment-16215965
 ] 

Stephan Erb commented on AURORA-1669:
-

https://reviews.apache.org/r/62652/ has landed

> Kill twitter/commons ZK libs when Curator replacements are vetted
> -
>
> Key: AURORA-1669
> URL: https://issues.apache.org/jira/browse/AURORA-1669
> Project: Aurora
>  Issue Type: Task
>Reporter: John Sirois
>Assignee: John Sirois
>
> Once we have reports from production users that the Curator zk plumbing 
> introduced in AURORA-1468 is working well, the {{-zk_use_curator}} flag 
> should be deprecated and then the flag and commons code killed.  If the 
> vetting happens before the next release ({{0.14.0}}), we can dispense with a 
> deprecation cycle.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1895) Expose stats on ZooKeeperClient connection state

2017-10-23 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215956#comment-16215956
 ] 

Stephan Erb commented on AURORA-1895:
-

https://reviews.apache.org/r/62652/ has landed, dropping legacy ZK

> Expose stats on ZooKeeperClient connection state
> 
>
> Key: AURORA-1895
> URL: https://issues.apache.org/jira/browse/AURORA-1895
> Project: Aurora
>  Issue Type: Story
>  Components: Scheduler
>Reporter: Mehrdad Nurolahzade
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: newbie
>
> Expose stats on the connection state of the {{ZooKeeperClient}} in 
> {{CommonsServiceDiscoveryModule}}. This can be through the ZooKeeper client 
> [Watcher|https://zookeeper.apache.org/doc/r3.4.8/api/org/apache/zookeeper/Watcher.html]
>  interface.
> [AURORA-1838] exposed ZooKeeper stats for {{CuratorServiceDiscoveryModule}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (AURORA-319) Allow job environments other than prod, devel, test or staging

2017-10-23 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-319.

Resolution: Fixed
  Assignee: (was: Benjamin Staffin)

> Allow job environments other than prod, devel, test or staging
> --
>
> Key: AURORA-319
> URL: https://issues.apache.org/jira/browse/AURORA-319
> Project: Aurora
>  Issue Type: Story
>  Components: Client
>Reporter: Jay Buffington
>Priority: Minor
>
> git commit bcabfce6 introduced limitations on what job environments must be 
> named.  Are these names arbitrary or is there some policy attached to these 
> names?  
> I suspect they are not arbitrary because I believe aurora will preempt tasks 
> that are not in the prod environment to make room for jobs that are.
> What would the ramifications be of removing the "_validate_environment_name" 
> function from src/main/python/apache/aurora/client/config.py ?  Are there 
> reasons why this was introduced in the first place?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-319) Allow job environments other than prod, devel, test or staging

2017-10-23 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215853#comment-16215853
 ] 

Stephan Erb commented on AURORA-319:


This has now been submitted https://reviews.apache.org/r/62692/

Thanks for the contribution!

> Allow job environments other than prod, devel, test or staging
> --
>
> Key: AURORA-319
> URL: https://issues.apache.org/jira/browse/AURORA-319
> Project: Aurora
>  Issue Type: Story
>  Components: Client
>Reporter: Jay Buffington
>Assignee: Benjamin Staffin
>Priority: Minor
>
> git commit bcabfce6 introduced limitations on what job environments must be 
> named.  Are these names arbitrary or is there some policy attached to these 
> names?  
> I suspect they are not arbitrary because I believe aurora will preempt tasks 
> that are not in the prod environment to make room for jobs that are.
> What would the ramifications be of removing the "_validate_environment_name" 
> function from src/main/python/apache/aurora/client/config.py ?  Are there 
> reasons why this was introduced in the first place?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1952) race condition in offers by agent id map (and potentially others) caused(probably) a crash

2017-10-19 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211824#comment-16211824
 ] 

Stephan Erb commented on AURORA-1952:
-

I would prefer to prevent the race condition in the first place. If this turns out to be difficult, another angle to look at the problem would be: how can we prevent it from causing damage if it does occur?

In that case it would be rather trivial (at least it seems so on a quick glance) to switch the `PendingTaskProcessor` from a uniqueIndex to a Multimap. The only necessary change would then be in `filterPreemptionVictims`, which currently requests an `Optional` but ends up translating it to a set in most cases anyway.

> race condition in offers by agent id map (and potentially others) 
> caused(probably) a crash
> --
>
> Key: AURORA-1952
> URL: https://issues.apache.org/jira/browse/AURORA-1952
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.18.0
> Environment: nothing special
>Reporter: Mohit Jaggi
>Assignee: Mohit Jaggi
> Fix For: 0.18.0
>
>
> Crashed here
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145
> due to duplicates in map. Most likely a concurrency issue. [~wfarner] pointed 
> out the following code:
> I'm looking at this chunk here, where a concurrent map would not help.
> {code:java}
>   Optional<HostOffer> sameSlave = hostOffers.get(offer.getOffer().getAgentId());
>   if (sameSlave.isPresent()) {
>     // If there are existing offers for the slave, decline all of them so the master can
>     // compact all of those offers into a single offer and send them back.
>     LOG.info("Returning offers for " + offer.getOffer().getAgentId().getValue()
>         + " for compaction.");
>     decline(offer.getOffer().getId());
>     removeAndDecline(sameSlave.get().getOffer().getId());
>   } else {
>     hostOffers.add(offer);
> {code}
> - logs --
> {code:java}
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017 6:09:00 PM 
> com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service 
> PreemptorService [FAILED] has failed in the RUNNING state.
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: 
> java.lang.IllegalArgumentException: Multiple entries with same key: 
> 1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102=HostOffer{offer=id {
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value: 
> "f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202"
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: }
> ...
> ...
> ep 28 18:09:00 machine1163 aurora-scheduler[14266]: , 
> hostAttributes=IHostAttributes{host=compute606-dca1.prod.uber.internal, 
> attributes=[IAttribute{name=host, values=[compute606-dca1]}, 
> IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]}, 
> IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE, 
> slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple values 
> under a key, use Multimaps.index.
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> com.google.common.collect.Maps.uniqueIndex(Maps.java:1251)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> com.google.common.collect.Maps.uniqueIndex(Maps.java:1208)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$0(PendingTaskProcessor.java:146)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:147)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.storage.log.LogStorage.read(LogStorage.java:562)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:113)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(PendingTaskProcessor.java:135)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
> 

[jira] [Updated] (AURORA-1952) race condition in offers by agent id map (and potentially others) caused(probably) a crash

2017-10-19 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1952:

Description: 
Crashed here
https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145

due to duplicates in map. Most likely a concurrency issue. [~wfarner] pointed 
out the following code:
I'm looking at this chunk here, where a concurrent map would not help.
{code:java}
  Optional<HostOffer> sameSlave = hostOffers.get(offer.getOffer().getAgentId());
  if (sameSlave.isPresent()) {
    // If there are existing offers for the slave, decline all of them so the master can
    // compact all of those offers into a single offer and send them back.
    LOG.info("Returning offers for " + offer.getOffer().getAgentId().getValue()
        + " for compaction.");
    decline(offer.getOffer().getId());
    removeAndDecline(sameSlave.get().getOffer().getId());
  } else {
    hostOffers.add(offer);
{code}
- logs --
{code:java}
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017 6:09:00 PM 
com.google.common.util.concurrent.ServiceManager$ServiceListener failed
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service 
PreemptorService [FAILED] has failed in the RUNNING state.
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: 
java.lang.IllegalArgumentException: Multiple entries with same key: 
1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102=HostOffer{offer=id {
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value: 
"f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202"
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: }
...
...
ep 28 18:09:00 machine1163 aurora-scheduler[14266]: , 
hostAttributes=IHostAttributes{host=compute606-dca1.prod.uber.internal, 
attributes=[IAttribute{name=host, values=[compute606-dca1]}, 
IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]}, 
IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE, 
slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple values 
under a key, use Multimaps.index.
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
com.google.common.collect.Maps.uniqueIndex(Maps.java:1251)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
com.google.common.collect.Maps.uniqueIndex(Maps.java:1208)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.lambda$run$0(PendingTaskProcessor.java:146)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStorage.java:147)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
org.apache.aurora.scheduler.storage.log.LogStorage.read(LogStorage.java:562)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
org.apache.aurora.scheduler.storage.CallOrderEnforcingStorage.read(CallOrderEnforcingStorage.java:113)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.run(PendingTaskProcessor.java:135)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
org.apache.aurora.common.inject.TimedInterceptor.invoke(TimedInterceptor.java:83)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
org.apache.aurora.scheduler.preemptor.PreemptorModule$PreemptorService.runOneIteration(PreemptorModule.java:161)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:188)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
com.google.common.util.concurrent.Callables$4.run(Callables.java:122)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Sep 28 18:09:00 

[jira] [Commented] (AURORA-1944) Aurora is unable to elect leader after losing ZK for an extended period of time

2017-10-08 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196200#comment-16196200
 ] 

Stephan Erb commented on AURORA-1944:
-

As far as I know this should be resolved by 
https://github.com/apache/aurora/commit/dfd06771a5e4c63f2e3407cdf3bbb20201a7fbc1.

> Aurora is unable to elect leader after losing ZK for an extended period of 
> time
> ---
>
> Key: AURORA-1944
> URL: https://issues.apache.org/jira/browse/AURORA-1944
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
> Environment: Running on 0.17.0
>Reporter: Renan DelValle
> Attachments: aurora-0.log, aurora-1.log, aurora-2.log
>
>
> Using Apache Curator as the Zookeeper library causes an issue where Aurora is 
> unable to elect a leader if Zookeeper loses quorum for an extended period of 
> time.
> Scheduler seems to crash around:
> {{W0802 14:01:14.436 [TaskEventBatchWorker, SchedulerLifecycle] Failed to 
> leave leadership: 
> org.apache.aurora.common.zookeeper.SingletonService$LeaveException: Failed to 
> abdicate leadership of group at /aurora/scheduler}}
> When the init system brings the scheduler back up, it is unable to elect a 
> leader if ZK is still down.
> Specifically, the redirect monitor fails:
> {{E0802 14:09:37.063 [RedirectMonitor STARTING, 
> GuavaUtils$LifecycleShutdownListener] Service: RedirectMonitor [FAILED] 
> failed unexpectedly. Triggering shutdown.}}
> Leading to every scheduler showing the following:
> {{W0802 14:16:34.646 [qtp576711849-43, LeaderRedirect] No serviceGroupMonitor 
> in host set, will not redirect despite not being leader.}}
> Once the scheduler enters this state, it is unable to snap out of it until it 
> is manually restarted.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1781) Sandbox taskfs setup fails (groupadd error)

2017-09-21 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175542#comment-16175542
 ] 

Stephan Erb commented on AURORA-1781:
-

Now that you mention it, I realize that I have observed something similar as well. Using Debian 7 (with backport kernel 3.16) we were not able to launch containers featuring newer versions such as Debian 8. Only after the upgrade to Debian 8 (with backport kernel 4.x) did the problem disappear.

> Sandbox taskfs setup fails (groupadd error)
> ---
>
> Key: AURORA-1781
> URL: https://issues.apache.org/jira/browse/AURORA-1781
> Project: Aurora
>  Issue Type: Bug
>  Components: Docker, Executor
>Affects Versions: 0.16.0
>Reporter: Justin Venus
>
> I hit what smells like a permission issue w/ `/etc/group` when trying to use 
> a docker-image (unified containerizer setup) with mesos-1.0.0. and 
> aurora-0.16.0-rc2.  I cannot reproduce issue w/ mesos-0.28.2 and aurora-015.0.
> {code}
> Failed to initialize sandbox: Failed to create group in sandbox for task 
> image: Command '['groupadd', '-R', 
> '/var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs',
>  '-g', '99', 'nobody']' returned non-zero exit status 10
> {code}
> {code}
> [root@mesos-master01of2 taskfs]# pwd
> /var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs
> [root@mesos-master01of2 taskfs]# groupadd -R $PWD -g 99 nobody
> groupadd: cannot lock /etc/group; try again later.
> {code}
> Maybe related to AURORA-1761
> I'm running CoreOS with the mesos-agent (and thermos) inside docker.  Here is 
> the gist of how it's started.
> {code}
> /usr/bin/sh -c "exec /usr/bin/docker run \
> --name=mesos_slave \
> --net=host \
> --pid=host \
> --privileged \
> -v /sys:/sys \
> -v /usr/bin/docker:/usr/bin/docker:ro \
> -v /var/lib/docker:/var/lib/docker \
> -v /var/run/docker.sock:/root/docker.sock \
> -v /run/systemd/system:/run/systemd/system \
> -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
> -v /sys/fs/cgroup:/sys/fs/cgroup \
> -v /var/lib/mesos:/var/lib/mesos \
> -e MESOS_CONTAINERIZERS=docker,mesos \
> -e MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins \
> -e MESOS_WORK_DIR=/var/lib/mesos \
> -e MESOS_LOGGING_LEVEL=INFO \
> -e AMAZON_REGION=us-office-2 \
> -e AVAILABILITY_ZONE=us-office-2b \
> -e MESOS_ATTRIBUTES=\"platform:linux;host:$(hostname);rack:us-office-2b\" 
> \
> -e MESOS_CLUSTER=ZeroZero \
> -e MESOS_DOCKER_SOCKET=/root/docker.sock \
> -e 
> MESOS_MASTER=zk://10.150.150.224:2181,10.150.150.225:2181,10.150.150.226:2181/mesos
>  \
> -e MESOS_LOG_DIR=/var/log/mesos \
> -e 
> MESOS_ISOLATION=\"filesystem/linux,cgroups/cpu,cgroups/mem,docker/runtime\" \
> -e MESOS_IMAGE_PROVIDERS=docker \
> -e MESOS_IMAGE_PROVISIONER_BACKEND=copy \
> -e MESOS_DOCKER_REGISTRY=http://docker-registry:31000 \
> -e MESOS_DOCKER_STORE_DIR=/var/lib/mesos/docker \
> --entrypoint=/usr/sbin/mesos-slave \
> docker-registry.thebrighttag.com:31000/mesos:latest \
> --no-systemd_enable_support \
> || rm -f /var/lib/mesos/meta/slaves/latest"
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1944) Aurora is unable to elect leader after losing ZK for an extended period of time

2017-08-06 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115938#comment-16115938
 ] 

Stephan Erb commented on AURORA-1944:
-

At first glance, there appear to be two problems and we should probably fix both of them:

* When ZK is down at startup, {{CuratorServiceGroupMonitor.start}} fails with {{Failed to begin monitoring service group.}}. The start is not retried in any form; instead, a shutdown is triggered.
* The scheduler shutdown then fails with {{State transition from CONSTRUCTED to STOPPED is not allowed.}}

With this patch applied, unit tests, e2e tests, and a scheduler start with ZK down all seem to work at first sight:
{code}
diff --git 
a/src/main/java/org/apache/aurora/scheduler/discovery/CuratorServiceGroupMonitor.java
 
b/src/main/java/org/apache/aurora/scheduler/discovery/CuratorServiceGroupMonitor.java
index eba56be..4551b44 100644
--- 
a/src/main/java/org/apache/aurora/scheduler/discovery/CuratorServiceGroupMonitor.java
+++ 
b/src/main/java/org/apache/aurora/scheduler/discovery/CuratorServiceGroupMonitor.java
@@ -69,10 +69,7 @@ class CuratorServiceGroupMonitor implements 
ServiceGroupMonitor {
   @Override
   public void start() throws MonitorException {
 try {
-  // NB: This blocks on an initial group population to emulate legacy 
ServerSetMonitor behavior;
-  // asynchronous population is an option using NORMAL or 
POST_INITIALIZED_EVENT StartModes
-  // though.
-  groupCache.start(PathChildrenCache.StartMode.BUILD_INITIAL_CACHE);
+  groupCache.start(PathChildrenCache.StartMode.NORMAL);
 } catch (Exception e) {
   throw new MonitorException("Failed to begin monitoring service group.", 
e);
 }
{code}


[~jsirois] what do you think?


> Aurora is unable to elect leader after losing ZK for an extended period of 
> time
> ---
>
> Key: AURORA-1944
> URL: https://issues.apache.org/jira/browse/AURORA-1944
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
> Environment: Running on 0.17.0
>Reporter: Renan DelValle
> Attachments: aurora-0.log, aurora-1.log, aurora-2.log
>
>
> Using Apache Curator as the Zookeeper library causes an issue where Aurora is 
> unable to elect a leader if Zookeeper loses quorum for an extended period of 
> time.
> Scheduler seems to crash around:
> {{W0802 14:01:14.436 [TaskEventBatchWorker, SchedulerLifecycle] Failed to 
> leave leadership: 
> org.apache.aurora.common.zookeeper.SingletonService$LeaveException: Failed to 
> abdicate leadership of group at /aurora/scheduler}}
> When the init system brings the scheduler back up, it is unable to elect a 
> leader if ZK is still down.
> Specifically, the redirect monitor fails:
> {{E0802 14:09:37.063 [RedirectMonitor STARTING, 
> GuavaUtils$LifecycleShutdownListener] Service: RedirectMonitor [FAILED] 
> failed unexpectedly. Triggering shutdown.}}
> Leading to every scheduler showing the following:
> {{W0802 14:16:34.646 [qtp576711849-43, LeaderRedirect] No serviceGroupMonitor 
> in host set, will not redirect despite not being leader.}}
> Once the scheduler enters this state, it is unable to snap out of it until it 
> is manually restarted.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1837) Improve implicit task history pruning

2017-08-04 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114319#comment-16114319
 ] 

Stephan Erb commented on AURORA-1837:
-

Are this ticket and the associated RB still valid? In AURORA-1929 we have merged a task history pruning improvement by [~kaih] that might have improved the implicit pruning as well.

> Improve implicit task history pruning
> -
>
> Key: AURORA-1837
> URL: https://issues.apache.org/jira/browse/AURORA-1837
> Project: Aurora
>  Issue Type: Task
>Reporter: Reza Motamedi
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: scheduler
>
> The current implementation of {{TaskHistoryPrunner}} registers all inactive tasks for pruning upon a terminal _state_ change. {{TaskHistoryPrunner::registerInactiveTask()}} uses a delay executor to schedule the pruning of _task_s. However, we have noticed that most of the pruning takes place after the scheduler recovers from a fail-over.
> Modify {{TaskHistoryPruner}} to a design similar to {{JobUpdateHistoryPruner}}:
> # Instead of registering delay executors upon terminal task state transitions, have it wake up at preconfigured intervals, find all terminal-state tasks that meet the pruning criteria, and delete them.
> # Make the initial task history pruning delay configurable so that it does not hamper the scheduler upon start.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1939) Thermos landing (host) page reports incorrect CPU rates when it is busy

2017-07-23 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16097798#comment-16097798
 ] 

Stephan Erb commented on AURORA-1939:
-

This is now on master. Thanks for the patch!

{code}
commit cdc5b8efd5bb86d38f73cca6d91903078b120333
Author: Reza Motamedi reza.motam...@gmail.com
Date:   Sat Jul 22 20:28:50 2017 +0200

Remove psutil's oneshot

After a lot of testing on busy machines, I realized that psutil's oneshot is
not threadsafe. I contacted the developer however, have not recevied a conceret
fix.

Please read https://issues.apache.org/jira/browse/AURORA-1939 and
https://github.com/giampaolo/psutil/issues/1110 for more information.

These inconsistencies disappear after removing oneshot.

Reviewed at https://reviews.apache.org/r/61016/

src/main/python/apache/thermos/monitoring/process_collector_psutil.py | 23 
+++
 1 file changed, 11 insertions(+), 12 deletions(-)
{code}

> Thermos landing (host) page reports incorrect CPU rates when it is busy
> ---
>
> Key: AURORA-1939
> URL: https://issues.apache.org/jira/browse/AURORA-1939
> Project: Aurora
>  Issue Type: Bug
>Reporter: Reza Motamedi
>Assignee: Reza Motamedi
>Priority: Minor
>
> Thermos Observer uses `psutil` to monitor the resource consumption of Thermos Processes. On a busy machine, I have noticed negative CPU values when visiting the Thermos landing page.
> In my test I reproduced this by starting many processes that constantly create short-lived children. This indicates that in the time between `process_collector_psutil` looking up the Process children and calculating the CPU time, the pid of a child is actually reused by another, much younger process, which leads to negative CPU times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1938) Aurora failed without log detail

2017-06-26 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16063380#comment-16063380
 ] 

Stephan Erb commented on AURORA-1938:
-

The current snippet you posted does not tell us why Aurora thinks the storage 
is not ready. Normally those messages point to problems with the replicated 
log, or maybe connectivity issues between your Aurora schedulers. 

The log lines indicate that Aurora cannot even properly connect to the ZooKeeper ensemble. This is a prerequisite for a working cluster as well.
{code}
2017-06-20 17:38:58,527:1(0x7f13511fc700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [10.176.128.91:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
{code}

How many Aurora schedulers do you have, 3 or 5? It would be great to have their full logs (if you feel comfortable sharing them).


> Aurora failed without log detail
> 
>
> Key: AURORA-1938
> URL: https://issues.apache.org/jira/browse/AURORA-1938
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.13.0
>Reporter: Luc Nguyen
> Fix For: 0.13.0
>
> Attachments: Error_1.txt, Error_2.txt
>
>
> Aurora failed without log detail.
> We also had a backup for Aurora. However, the Aurora backup failed as well.
> It bothered us that there was no log showing the failure in detail.
> Has anyone else run into the same problem?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1938) Aurora failed without log detail

2017-06-23 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060890#comment-16060890
 ] 

Stephan Erb commented on AURORA-1938:
-

Would you be willing to share your log files? 

My current guess would be that there were either problems with ZK or with the network connectivity in general. A look into the log files of the individual masters might help to support or dismiss this theory.

> Aurora failed without log detail
> 
>
> Key: AURORA-1938
> URL: https://issues.apache.org/jira/browse/AURORA-1938
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.13.0
>Reporter: Luc Nguyen
> Fix For: 0.13.0
>
>
> Aurora failed without log detail.
> We also had a backup for Aurora. However, the Aurora backup failed as well.
> It bothered us that there was no log showing the failure in detail.
> Has anyone else run into the same problem?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (AURORA-1927) Allow to define mesos health_check.

2017-05-24 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023097#comment-16023097
 ] 

Stephan Erb commented on AURORA-1927:
-

Aurora has had health checks for quite some time before they were added to Mesos as well. It might therefore make sense to leverage some of the new Mesos primitives in Aurora, too.

In any case, at least for now, you can run the Aurora shell health checker instead of the HTTP checker. It should allow you to do everything you have mentioned above:
 


> Allow to define mesos health_check.
> ---
>
> Key: AURORA-1927
> URL: https://issues.apache.org/jira/browse/AURORA-1927
> Project: Aurora
>  Issue Type: Story
>  Components: Scheduler
>Reporter: Haralds Ulmanis
>
> The idea is to allow defining Mesos health checks.
> We are running plenty of applications (in our home-made apps we have health ports etc.). Some of them do not have a dedicated health port or a REST/web interface at all. Sometimes it is enough to check a TCP port, and sometimes we may need to run a command to determine the health status.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (AURORA-1916) Incompatibility with mesos 1.2

2017-04-06 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb closed AURORA-1916.
---
Resolution: Duplicate

This has been fixed on master as part of AURORA-1882. For details, see https://reviews.apache.org/r/55951/.

You will either have to wait for the new release or build the latest master.

> Incompatibility with mesos 1.2
> --
>
> Key: AURORA-1916
> URL: https://issues.apache.org/jira/browse/AURORA-1916
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.17.0
> Environment: Ubuntu 16.04, Mesos 1.2
>Reporter: Kostiantyn Bokhan
>
> The list of mesos-containerizer arguments has been changed since 1.2:
> {code}
> /usr/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --[no-]help  Prints this help message (default: false)
>   --launch_info=VALUE  
>   --namespace_mnt_target=VALUE The target 'pid' of the process whose 
> mount namespace we'd like
>to enter before executing the command.
>   --pipe_read=VALUEThe read end of the control pipe. This is 
> a file descriptor 
>on Posix, or a handle on Windows. It's 
> caller's responsibility 
>to make sure the file descriptor or the 
> handle is inherited 
>properly in the subprocess. It's used to 
> synchronize with the 
>parent process. If not specified, no 
> synchronization will happen.
>   --pipe_write=VALUE   The write end of the control pipe. This is 
> a file descriptor 
>on Posix, or a handle on Windows. It's 
> caller's responsibility 
>to make sure the file descriptor or the 
> handle is inherited 
>properly in the subprocess. It's used to 
> synchronize with the 
>parent process. If not specified, no 
> synchronization will happen.
>   --runtime_directory=VALUEThe runtime directory for the container 
> (used for checkpointing)
>   --[no-]unshare_namespace_mnt Whether to launch the command in a new 
> mount namespace. (default: false)
> {code}
> Mesos 1.1.0:
> {code}
> /usr/libexec/mesos/mesos-containerizer launch --help
> Usage: launch [options]
>   --capabilities=VALUE Capabilities the command can use.
>   --command=VALUE  The command to execute.
>   --environment=VALUE  The environment variables for the command.
>   --[no-]help  Prints this help message (default: false)
>   --pipe_read=VALUEThe read end of the control pipe. This is 
> a file descriptor 
>on Posix, or a handle on Windows. It's 
> caller's responsibility 
>to make sure the file descriptor or the 
> handle is inherited 
>properly in the subprocess. It's used to 
> synchronize with the 
>parent process. If not specified, no 
> synchronization will happen.
>   --pipe_write=VALUE   The write end of the control pipe. This is 
> a file descriptor 
>on Posix, or a handle on Windows. It's 
> caller's responsibility 
>to make sure the file descriptor or the 
> handle is inherited 
>properly in the subprocess. It's used to 
> synchronize with the 
>parent process. If not specified, no 
> synchronization will happen.
>   --pre_exec_commands=VALUEThe additional preparation commands to 
> execute before
>executing the command.
>   --rootfs=VALUE   Absolute path to the container root 
> filesystem. The command will be 
>interpreted relative to this path
>   --runtime_directory=VALUEThe runtime directory for the container 
> (used for checkpointing)
>   --[no-]unshare_namespace_mnt Whether to launch the command in a new 
> mount namespace. (default: false)
>   --user=VALUE The user to change to.
>   --working_directory=VALUEThe working directory for the command. It 
> has to be an absolute path 
>w.r.t. the root filesystem used for the 
> command.
> {code}
> It causes the next error:
> {code}
> Failed to parse the flags: Failed to load unknown flag 'command'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (AURORA-1909) Thermos Health Check fails for MesosContainerizer if `--nosetuid-health-checks` is set

2017-04-05 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-1909.
-
Resolution: Fixed

> Thermos Health Check fails for MesosContainerizer if 
> `--nosetuid-health-checks` is set
> --
>
> Key: AURORA-1909
> URL: https://issues.apache.org/jira/browse/AURORA-1909
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Charles Raimbert
>Assignee: Charles Raimbert
>  Labels: easyfix
>
> With MesosContainerizer, the sandbox is of type FileSystemImageSandbox and 
> the health check is performed using a "mesos-containerizer launch" process, 
> but there is actually a code bug in the way of getting the user under which 
> to run the health check process:
> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/health_checker.py#L370
> {code}
> health_check_user = (os.getusername() if self._nosetuid_health_checks
> else assigned_task.task.job.role)
> {code}
> If the Aurora scheduler is configured with `--nosetuid-health-checks` then 
> "os.getusername()" is executed, but the python "os" module does not present a 
> "getusername()" function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1909) Thermos Health Check fails for MesosContainerizer if `--nosetuid-health-checks` is set

2017-04-05 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15956568#comment-15956568
 ] 

Stephan Erb commented on AURORA-1909:
-

This is now on master. Thanks for your contribution!
{code}
commit 7678d194f918143d5e8d771796e7dfbaabc931e7
Author: Charles Raimbert 
Date:   Wed Apr 5 11:25:03 2017 +0200

Fix Thermos Health Check for MesosContainerizer with 
`--nosetuid-health-checks`

With MesosContainerizer, the health check is performed using a 
"mesos-containerizer
launch" process, but there is actually a code bug in the way of getting the 
user
under which to run the health check process:

https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/health_checker.py#L370
```
health_check_user = (os.getusername() if self._nosetuid_health_checks
else assigned_task.task.job.role)
```

If the scheduler is configured with `--nosetuid-health-checks` then 
"os.getusername()"
is executed, but the "os" python module does not present any 
"getusername()" function,
which leads the Thermos execution to abort as follow:
```
D0323 01:08:15.453372 16 aurora_executor.py:159] Task started.
E0323 01:08:15.571124 16 aurora_executor.py:121] Traceback (most recent 
call last):
File "apache/aurora/executor/aurora_executor.py", line 119, in _run
self._start_status_manager(driver, assigned_task)
File "apache/aurora/executor/aurora_executor.py", line 168, in 
_start_status_manager
status_checker = status_provider.from_assigned_task(assigned_task, 
self._sandbox)
File "apache/aurora/executor/common/health_checker.py", line 370, in 
from_assigned_task
health_check_user = (os.getusername() if self._nosetuid_health_checks
AttributeError: 'module' object has no attribute 'getusername'
```

Following the existing unit testing pattern from test_health_checker.py, a 
test case
was added to cover the `--nosetuid-health-checks` case for 
MesosContainerizer.

Bugs closed: AURORA-1909

Reviewed at https://reviews.apache.org/r/58167/

 src/main/python/apache/aurora/executor/common/health_checker.py  |   3 ++-
 src/test/python/apache/aurora/executor/common/test_health_checker.py | 185 
++---
 2 files changed, 120 insertions(+), 68 deletions(-)
{code}
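
For reference, the {{os}} module indeed has no {{getusername()}}. The standard ways to obtain the current user in Python are shown below (illustrative only, not necessarily the exact expression used in the commit above):
{code}
import getpass
import os
import pwd

# Two standard ways to get the name of the user running the process:
user_from_env = getpass.getuser()                   # consults LOGNAME/USER/... env vars first
user_from_uid = pwd.getpwuid(os.getuid()).pw_name   # resolves the real uid via the passwd database
{code}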

> Thermos Health Check fails for MesosContainerizer if 
> `--nosetuid-health-checks` is set
> --
>
> Key: AURORA-1909
> URL: https://issues.apache.org/jira/browse/AURORA-1909
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Charles Raimbert
>Assignee: Charles Raimbert
>  Labels: easyfix
>
> With MesosContainerizer, the sandbox is of type FileSystemImageSandbox and 
> the health check is performed using a "mesos-containerizer launch" process, 
> but there is actually a code bug in the way of getting the user under which 
> to run the health check process:
> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/health_checker.py#L370
> {code}
> health_check_user = (os.getusername() if self._nosetuid_health_checks
> else assigned_task.task.job.role)
> {code}
> If the Aurora scheduler is configured with `--nosetuid-health-checks` then 
> "os.getusername()" is executed, but the python "os" module does not present a 
> "getusername()" function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1909) Thermos Health Check fails for MesosContainerizer if `--nosetuid-health-checks` is set

2017-04-04 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15954725#comment-15954725
 ] 

Stephan Erb commented on AURORA-1909:
-

RB: https://reviews.apache.org/r/58167/ 

> Thermos Health Check fails for MesosContainerizer if 
> `--nosetuid-health-checks` is set
> --
>
> Key: AURORA-1909
> URL: https://issues.apache.org/jira/browse/AURORA-1909
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Charles Raimbert
>Assignee: Charles Raimbert
>  Labels: easyfix
>
> With MesosContainerizer, the sandbox is of type FileSystemImageSandbox and 
> the health check is performed using a "mesos-containerizer launch" process, 
> but there is actually a code bug in the way of getting the user under which 
> to run the health check process:
> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/health_checker.py#L370
> {code}
> health_check_user = (os.getusername() if self._nosetuid_health_checks
> else assigned_task.task.job.role)
> {code}
> If the Aurora scheduler is configured with `--nosetuid-health-checks` then 
> "os.getusername()" is executed, but the python "os" module does not present a 
> "getusername()" function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1909) Thermos Health Check fails for MesosContainerizer if `--nosetuid-health-checks` is set

2017-03-24 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940062#comment-15940062
 ] 

Stephan Erb commented on AURORA-1909:
-

Regardless of the bug itself, I am wondering why you are using the 
{{nosetuid_health_checks}} option. It sounds like a very severe security risk 
to me if you allow arbitrary users to run their health checks as root. This 
might be acceptable in the DockerContainerizer which (as far as I know) uses 
user namespaces, but this is not the case for the MesosContainerizer.

> Thermos Health Check fails for MesosContainerizer if 
> `--nosetuid-health-checks` is set
> --
>
> Key: AURORA-1909
> URL: https://issues.apache.org/jira/browse/AURORA-1909
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Reporter: Charles Raimbert
>Assignee: Charles Raimbert
>  Labels: easyfix
>
> With MesosContainerizer, the sandbox is of type FileSystemImageSandbox and 
> the health check is performed using a "mesos-containerizer launch" process, 
> but there is actually a code bug in the way of getting the user under which 
> to run the health check process:
> https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/health_checker.py#L370
> {code}
> health_check_user = (os.getusername() if self._nosetuid_health_checks
> else assigned_task.task.job.role)
> {code}
> If the Aurora scheduler is configured with `--nosetuid-health-checks` then 
> "os.getusername()" is executed, but the python "os" module does not present a 
> "getusername()" function.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host

2017-03-21 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935416#comment-15935416
 ] 

Stephan Erb commented on AURORA-1908:
-

Are you talking about this code here 
https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/filter/SchedulingFilterImpl.java#L120?
 It seems like we are already returning early.



> Short-circuit preemption filtering when a Veto applies to entire host
> -
>
> Key: AURORA-1908
> URL: https://issues.apache.org/jira/browse/AURORA-1908
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When matching a {{ResourceRequest}} against a {{UnusedResource}} in 
> {{SchedulerFilterImpl.filter}} there are 4 kinds of {{Veto}}es that can be 
> returned. 3 out of the 4 {{Veto}}es apply to the entire host (namely 
> {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, {{LIMIT_NOT_SATISFIED}} 
> or {{CONSTRAINT_MISMATCH}}). In this case we can short-circuit and return 
> early and move on to the next host to consider.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1907) Thermos unresponsive on hosts with many active task

2017-03-19 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931728#comment-15931728
 ] 

Stephan Erb commented on AURORA-1907:
-

First patch submitted:

{code}
commit b8f72d1461c2f13f1f73c13211b428f60596c11e
Author: Stephan Erb 
Date:   Sun Mar 19 16:01:50 2017 +0100

Use Process.oneshot() in latest psutils for faster stats retrieval.

Without the Process.oneshot() decorator stats retrieval can lead to
multiple reads of the same `/proc` filesystem values. The oneshot
decorator enables caching to speed this up. It has been added in
psutils 5.0.

Oneshot docs: https://pythonhosted.org/psutil/#psutil.Process.oneshot
Changelog: https://github.com/giampaolo/psutil/blob/master/HISTORY.rst#520

Bugs closed: AURORA-1907

Reviewed at https://reviews.apache.org/r/57732/

 3rdparty/python/requirements.txt  |  2 +-
 src/main/python/apache/thermos/monitoring/process_collector_psutil.py | 23 
---
 2 files changed, 13 insertions(+), 12 deletions(-)
{code}
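
The {{oneshot()}} API referenced in the commit is a context manager (psutil >= 5.0) that caches process stats so that repeated reads do not hit {{/proc}} again. A minimal usage sketch follows; note that AURORA-1939 later removed oneshot from the executor again because it turned out not to be thread-safe in that setting:
{code}
import psutil

p = psutil.Process()  # defaults to the current process; any pid works
with p.oneshot():
    # All of these are served from a single cached snapshot of /proc/<pid>.
    cpu = p.cpu_times()
    mem = p.memory_info()
    nthreads = p.num_threads()
print(cpu, mem, nthreads)
{code}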

> Thermos unresponsive on hosts with many active task
> ---
>
> Key: AURORA-1907
> URL: https://issues.apache.org/jira/browse/AURORA-1907
> Project: Aurora
>  Issue Type: Story
>  Components: Observer
>Reporter: Stephan Erb
>Assignee: Stephan Erb
>
> We have noticed that on hosts with lots of active tasks (~100) and many 
> terminated tasks (~1500) the Thermos UI is not usable. Thermos spins at 300% 
> CPU but does not render any HTTP requests.
> Dumping {{/threads}} indicates we might be blocked by the hundred 
> {{TaskResourceMonitor}} threads trying to read values from {{/proc}}:
> {code}
> # Thread (daemon): TaskResourceMonitor (TaskResourceMonitor[mytask-id] 
> [TID=45241], 140682825963264)
>   File: "/usr/lib/python2.7/threading.py", line 525, in __bootstrap
> self.__bootstrap_inner()
>   File: "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
> self.run()
>   File: 
> "/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py",
>  line 115, in identified
> return instancemethod(self, *args, **kwargs)
>   File: 
> "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
>  line 126, in _excepting_run
> self.__real_run(*args, **kw)
>   File: "apache/thermos/monitoring/resource.py", line 204, in run
> collector.sample()
>   File: "apache/thermos/monitoring/process_collector_psutil.py", line 70, in 
> sample
> for child in parent.children(recursive=True)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 326, in wrapper
> return fun(self, *args, **kwargs)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 861, in children
> table[p.ppid()].append(p)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 545, in ppid
> return self._proc.ppid()
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 962, in wrapper
> return fun(self, *args, **kwargs)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 1459, in ppid
> return int(self._parse_stat_file()[2])
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 1001, in _parse_stat_file
> return [name] + fields_after_name
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1907) Thermos unresponsive on hosts with many active task

2017-03-17 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930269#comment-15930269
 ] 

Stephan Erb commented on AURORA-1907:
-

First low-hanging fruit https://reviews.apache.org/r/57732. I will need to dig 
deeper as well.

> Thermos unresponsive on hosts with many active task
> ---
>
> Key: AURORA-1907
> URL: https://issues.apache.org/jira/browse/AURORA-1907
> Project: Aurora
>  Issue Type: Story
>  Components: Observer
>Reporter: Stephan Erb
>Assignee: Stephan Erb
>
> We have noticed that on hosts with lots of active tasks (~100) and many 
> terminated tasks (~1500) the Thermos UI is not usable. Thermos spins at 300% 
> CPU but does not serve any HTTP requests.
> Dumping {{/threads}} indicates we might be blocked by the hundred 
> {{TaskResourceMonitor}} threads trying to read values from {{/proc}}:
> {code}
> # Thread (daemon): TaskResourceMonitor (TaskResourceMonitor[mytask-id] 
> [TID=45241], 140682825963264)
>   File: "/usr/lib/python2.7/threading.py", line 525, in __bootstrap
> self.__bootstrap_inner()
>   File: "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
> self.run()
>   File: 
> "/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py",
>  line 115, in identified
> return instancemethod(self, *args, **kwargs)
>   File: 
> "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
>  line 126, in _excepting_run
> self.__real_run(*args, **kw)
>   File: "apache/thermos/monitoring/resource.py", line 204, in run
> collector.sample()
>   File: "apache/thermos/monitoring/process_collector_psutil.py", line 70, in 
> sample
> for child in parent.children(recursive=True)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 326, in wrapper
> return fun(self, *args, **kwargs)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 861, in children
> table[p.ppid()].append(p)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 545, in ppid
> return self._proc.ppid()
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 962, in wrapper
> return fun(self, *args, **kwargs)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 1459, in ppid
> return int(self._parse_stat_file()[2])
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 1001, in _parse_stat_file
> return [name] + fields_after_name
> {code}
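
For context, each of the {{TaskResourceMonitor}} samples in the dump above effectively walks the whole process table via psutil. A minimal standalone sketch of that per-sample cost (illustrative only, not the actual Thermos code; root PID and metric chosen arbitrarily):

{code}
# Sketch: what one resource-monitor sample roughly amounts to. psutil's
# children(recursive=True) re-reads /proc entries for every process on the
# host, so hundreds of threads doing this concurrently multiply that cost.
import os
import psutil

def sample_process_tree(root_pid):
    parent = psutil.Process(root_pid)
    procs = [parent] + parent.children(recursive=True)
    return sum(p.cpu_percent(interval=None) for p in procs)

if __name__ == '__main__':
    print(sample_process_tree(os.getpid()))
{code}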



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (AURORA-1907) Thermos unresponsive on hosts with many active tasks

2017-03-17 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb reassigned AURORA-1907:
---

Assignee: Stephan Erb

> Thermos unresponsive on hosts with many active tasks
> ---
>
> Key: AURORA-1907
> URL: https://issues.apache.org/jira/browse/AURORA-1907
> Project: Aurora
>  Issue Type: Story
>  Components: Observer
>Reporter: Stephan Erb
>Assignee: Stephan Erb
>
> We have noticed that on hosts with lots of active tasks (~100) and many 
> terminated tasks (~1500) the Thermos UI is not usable. Thermos spins at 300% 
> CPU but does not serve any HTTP requests.
> Dumping {{/threads}} indicates we might be blocked by the hundred 
> {{TaskResourceMonitor}} threads trying to read values from {{/proc}}:
> {code}
> # Thread (daemon): TaskResourceMonitor (TaskResourceMonitor[mytask-id] 
> [TID=45241], 140682825963264)
>   File: "/usr/lib/python2.7/threading.py", line 525, in __bootstrap
> self.__bootstrap_inner()
>   File: "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
> self.run()
>   File: 
> "/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py",
>  line 115, in identified
> return instancemethod(self, *args, **kwargs)
>   File: 
> "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
>  line 126, in _excepting_run
> self.__real_run(*args, **kw)
>   File: "apache/thermos/monitoring/resource.py", line 204, in run
> collector.sample()
>   File: "apache/thermos/monitoring/process_collector_psutil.py", line 70, in 
> sample
> for child in parent.children(recursive=True)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 326, in wrapper
> return fun(self, *args, **kwargs)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 861, in children
> table[p.ppid()].append(p)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
>  line 545, in ppid
> return self._proc.ppid()
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 962, in wrapper
> return fun(self, *args, **kwargs)
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 1459, in ppid
> return int(self._parse_stat_file()[2])
>   File: 
> "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
>  line 1001, in _parse_stat_file
> return [name] + fields_after_name
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1907) Thermos unresponsive on hosts with many active tasks

2017-03-17 Thread Stephan Erb (JIRA)
Stephan Erb created AURORA-1907:
---

 Summary: Thermos unresponsive on hosts with many active tasks
 Key: AURORA-1907
 URL: https://issues.apache.org/jira/browse/AURORA-1907
 Project: Aurora
  Issue Type: Story
  Components: Observer
Reporter: Stephan Erb


We have noticed that on hosts with lots of active tasks (~100) and many 
terminated tasks (~1500) the Thermos UI is not usable. Thermos spins at 300% 
CPU but does not serve any HTTP requests.

Dumping {{/threads}} indicates we might be blocked by the hundred 
{{TaskResourceMonitor}} threads trying to read values from {{/proc}}:
{code}
# Thread (daemon): TaskResourceMonitor (TaskResourceMonitor[mytask-id] 
[TID=45241], 140682825963264)
  File: "/usr/lib/python2.7/threading.py", line 525, in __bootstrap
self.__bootstrap_inner()
  File: "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
  File: 
"/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py",
 line 115, in identified
return instancemethod(self, *args, **kwargs)
  File: 
"/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
 line 126, in _excepting_run
self.__real_run(*args, **kw)
  File: "apache/thermos/monitoring/resource.py", line 204, in run
collector.sample()
  File: "apache/thermos/monitoring/process_collector_psutil.py", line 70, in 
sample
for child in parent.children(recursive=True)
  File: 
"/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
 line 326, in wrapper
return fun(self, *args, **kwargs)
  File: 
"/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
 line 861, in children
table[p.ppid()].append(p)
  File: 
"/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py",
 line 545, in ppid
return self._proc.ppid()
  File: 
"/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
 line 962, in wrapper
return fun(self, *args, **kwargs)
  File: 
"/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
 line 1459, in ppid
return int(self._parse_stat_file()[2])
  File: 
"/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py",
 line 1001, in _parse_stat_file
return [name] + fields_after_name
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1707) Remove deprecated resource fields in TaskConfig and ResourceAggregate

2017-03-13 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15922964#comment-15922964
 ] 

Stephan Erb commented on AURORA-1707:
-

Submitted a patch towards the deprecation:

{code}
commit 2f08d9110eb180968b2633523f1c2386874790e9
Author: Nicolás Donatucci 
Date:   Mon Mar 13 21:59:42 2017 +0100

Change Resource Validation in ConfigurationManager so that it validates the 
Resource Set instead of deprecated fields

The Resource validation in ConfigurationManager is now done against the 
Resource set instead of the NumCpus, RamMb and DiskMb fields.

Related Issue: AURORA-1707

Reviewed at https://reviews.apache.org/r/56395/

 
src/main/java/org/apache/aurora/scheduler/configuration/ConfigurationManager.java
 | 44 +---
 src/main/java/org/apache/aurora/scheduler/storage/log/ThriftBackfill.java  
   | 16 ++--
 src/main/python/apache/aurora/config/thrift.py 
   |  2 +-
 
src/test/java/org/apache/aurora/scheduler/configuration/ConfigurationManagerTest.java
 | 11 ---
 
src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java
|  4 
 5 files changed, 28 insertions(+), 49 deletions(-)
{code}

> Remove deprecated resource fields in TaskConfig and ResourceAggregate
> -
>
> Key: AURORA-1707
> URL: https://issues.apache.org/jira/browse/AURORA-1707
> Project: Aurora
>  Issue Type: Task
>Reporter: Maxim Khutornenko
>Assignee: Nicolas Donatucci
>
> Remove individual resource fields in TaskConfig and ResourceAggregate 
> replaced by the new {{Resource}} struct.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1899) Expose per role metrics around Thrift activity

2017-03-03 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893992#comment-15893992
 ] 

Stephan Erb commented on AURORA-1899:
-

When you say that a client causes havoc, what exactly do you have in mind? Are 
they sending too many or too large requests?

> Expose per role metrics around Thrift activity
> --
>
> Key: AURORA-1899
> URL: https://issues.apache.org/jira/browse/AURORA-1899
> Project: Aurora
>  Issue Type: Task
>Reporter: David McLaughlin
>
> It's currently pretty easy for a single client to cause havoc on an Aurora 
> cluster. We triage most of these issues by grepping the Scheduler logs for 
> Thrift API calls and finding patterns around role names. 
> Figuring out what changed would be a lot easier if we could take the current 
> Thrift API metrics and export an additional metric for each one that is 
> scoped by the role. 
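
For illustration, this is roughly the kind of ad-hoc log analysis the proposed per-role metrics would replace; a standalone sketch, and the {{role=...}} pattern is only an assumption about what such a grep would look for, not the real scheduler log format:

{code}
# Sketch: count Thrift-related scheduler log lines per role, fed via stdin.
import collections
import re
import sys

ROLE = re.compile(r'role=(\w+)')

def calls_per_role(lines):
    counts = collections.Counter()
    for line in lines:
        match = ROLE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == '__main__':
    for role, count in calls_per_role(sys.stdin).most_common():
        print(role, count)
{code}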



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1894) Inline preemption filter in PreemptionVictimFilterImpl

2017-02-17 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871481#comment-15871481
 ] 

Stephan Erb commented on AURORA-1894:
-

I am surprised that the JVM did not optimize this properly. 

How were you profiling here? Simple sampling, or true profiling that had to 
alter the bytecode to inject instrumentation?

> Inline preemption filter in PreemptionVictimFilterImpl
> --
>
> Key: AURORA-1894
> URL: https://issues.apache.org/jira/browse/AURORA-1894
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: newbie
>
> Profiling preemption logic, I can see 
> {{PreemptionVictimFilterImpl#preemptionFilter()}} is producing ~200K/sec 
> lambda objects to be used by {{filterPreemptionVictims()}}:
> {code:title=PreemptionVictimFilterImpl.filterPreemptionVictims()}
>   FluentIterable preemptableTasks = 
> FluentIterable.from(possibleVictims)
>   .filter(preemptionFilter(pendingTask));
> {code}
> Inline this logic (refactor it into a loop) to remove the need to create these 
> short-lived lambda objects.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (AURORA-1872) Binary distributions on Ubuntu 16.04

2017-02-07 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-1872.
-
   Resolution: Fixed
 Assignee: Renan DelValle
Fix Version/s: 0.17.0

{code}
commit 2b19410bf014d6d756772e846a97e4da61f51a92
Author: Renan DelValle 
Date:   Wed Feb 8 08:56:28 2017 +0100

Adding support for Ubuntu Xenial packages

Added builder and test environment for Xenial as well as updated 
instructions
on how to test it. Added distribution to release-candidate script.

Bugs closed: AURORA-1872

Reviewed at https://reviews.apache.org/r/52437/

 build-support/release/release-candidate  |  1 +
 builder/deb/debian-jessie/Dockerfile |  1 +
 builder/deb/debian-jessie/build.sh   |  9 +
 builder/deb/ubuntu-trusty/Dockerfile |  1 +
 builder/deb/ubuntu-trusty/build.sh   |  9 +
 builder/deb/ubuntu-xenial/Dockerfile | 55 
+++
 builder/deb/ubuntu-xenial/build.sh   | 51 
+++
 specs/debian/aurora-executor.thermos.default |  3 +++
 specs/debian/aurora-executor.thermos.service | 13 +++--
 specs/debian/aurora-pants.ini|  3 ---
 specs/debian/aurora-scheduler.service|  6 +-
 specs/debian/aurora-scheduler.startup.sh | 40 

 specs/debian/rules   |  9 -
 test/deb/ubuntu-xenial/README.md | 67 
+++
 test/deb/ubuntu-xenial/Vagrantfile   | 14 ++
 test/deb/ubuntu-xenial/provision.sh  | 17 +
 16 files changed, 284 insertions(+), 15 deletions(-)
{code}

> Binary distributions on Ubuntu 16.04
> 
>
> Key: AURORA-1872
> URL: https://issues.apache.org/jira/browse/AURORA-1872
> Project: Aurora
>  Issue Type: Task
>Reporter: Bing-Qian Luan
>Assignee: Renan DelValle
>Priority: Minor
>  Labels: build
> Fix For: 0.17.0
>
>
> Request for pre-compiled packages for Ubuntu 16.04 (Xenial) on 
> https://bintray.com/apache/aurora, since Xenial was released on July 21, 2016.
> I have several Ubuntu 16.04 boxes in my organization, so it would be great if 
> there were a pre-compiled package for it.
> Regards,
> BQ



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1788) vagrant up does not properly configure network adapters

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1788:

Fix Version/s: 0.17.0

> vagrant up does not properly configure network adapters
> ---
>
> Key: AURORA-1788
> URL: https://issues.apache.org/jira/browse/AURORA-1788
> Project: Aurora
>  Issue Type: Bug
>Reporter: Andrew Jorgensen
>Assignee: Andrew Jorgensen
> Fix For: 0.17.0
>
>
> I am not sure of the specifics of why this happens but on vagrant 1.8.6 the 
> network interface does not come up correctly and the private_network is 
> attached to the eth0 nat interface rather than the host-only interface. I 
> tried a number of different parameters but none of them were able to 
> configure the network appropriately. This change manually configures the 
> static ip so that it is connected to the correct adapter. Without this change 
> I could not access the aurora web interface when running vagrant up.
> I've created a patch here: https://reviews.apache.org/r/52609/
> This is what the configuration looks like when run off master:
> {code}
> ip addr
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
> default
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
>valid_lft forever preferred_lft forever
> inet6 ::1/128 scope host
>valid_lft forever preferred_lft forever
> 2: eth0:  mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:b3:1b:30 brd ff:ff:ff:ff:ff:ff
> inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
>valid_lft forever preferred_lft forever
> inet 192.168.33.7/24 brd 192.168.33.255 scope global eth1
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:feb3:1b30/64 scope link
>valid_lft forever preferred_lft forever
> 3: eth1:  mtu 1500 qdisc pfifo_fast state 
> DOWN group default
> link/ether 08:00:27:7c:4e:72 brd ff:ff:ff:ff:ff:ff
> 4: docker0:  mtu 1500 qdisc noqueue state 
> DOWN group default
> link/ether 02:42:f6:de:a3:ca brd ff:ff:ff:ff:ff:ff
> inet 172.17.0.1/16 scope global docker0
>valid_lft forever preferred_lft forever
> {code}
> here is what it is supposed to look like:
> {code}
> ip addr
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
> default
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
>valid_lft forever preferred_lft forever
> inet6 ::1/128 scope host
>valid_lft forever preferred_lft forever
> 2: eth0:  mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:b3:1b:30 brd ff:ff:ff:ff:ff:ff
> inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:feb3:1b30/64 scope link
>valid_lft forever preferred_lft forever
> 3: eth1:  mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:7c:4e:72 brd ff:ff:ff:ff:ff:ff
> inet 192.168.33.7/24 brd 192.168.33.255 scope global eth1
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:fe7c:4e72/64 scope link
>valid_lft forever preferred_lft forever
> 4: docker0:  mtu 1500 qdisc noqueue state 
> DOWN group default
> link/ether 02:42:f6:de:a3:ca brd ff:ff:ff:ff:ff:ff
> inet 172.17.0.1/16 scope global docker0
>valid_lft forever preferred_lft forever
> {code}
> Steps to reproduce:
> 1. Update to vagrant 1.8.6 (unsure if previous versions are affected as well)
> 2. Run `vagrant up`
> 3. Try to visit http://192.168.33.7:8081
> Expected outcome:
> I expect that following the steps in 
> http://aurora.apache.org/documentation/latest/getting-started/vagrant/ I 
> would be able to visit the web interface for aurora.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1224) Add a new "min_consecutive_health_checks" setting in .aurora config

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1224:

Fix Version/s: 0.17.0

> Add a new "min_consecutive_health_checks" setting in .aurora config
> ---
>
> Key: AURORA-1224
> URL: https://issues.apache.org/jira/browse/AURORA-1224
> Project: Aurora
>  Issue Type: Task
>  Components: Client, Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
> Fix For: 0.17.0
>
>
> HealthCheckConfig should accept a new configuration value that will tell how 
> many positive consecutive health checks an instance requires to move from 
> STARTING to RUNNING.
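
In an .aurora file this would look roughly as follows; a config fragment only, with the new field name taken from the ticket title and the surrounding fields being the existing HealthCheckConfig knobs (values illustrative):

{code}
# Fragment of an .aurora job file (values illustrative).
health_check_config = HealthCheckConfig(
  initial_interval_secs = 15,
  interval_secs = 10,
  max_consecutive_failures = 2,
  # Proposed: successful checks required before STARTING -> RUNNING.
  min_consecutive_health_checks = 2,
)
{code}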



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1786) -zk_session_timeout option does not work

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1786:

Fix Version/s: 0.17.0

> -zk_session_timeout option does not work
> 
>
> Key: AURORA-1786
> URL: https://issues.apache.org/jira/browse/AURORA-1786
> Project: Aurora
>  Issue Type: Bug
>Reporter: David Robinson
> Fix For: 0.17.0
>
>
> Looks like the -zk_session_timeout option has no effect. I've set 
> -zk_session_timeout="60mins" to attempt to work around ZK session timeouts 
> (due to GC pauses caused by TaskHistoryPruner pruning a huge number of 
> inactive tasks), but the default 30 seconds seems to always be used.
> {noformat}
> I0929 22:36:10.804 [main, ArgScanner:411] zk_chroot_path: null 
> I0929 22:36:10.804 [main, ArgScanner:411] zk_digest_credentials: : 
> I0929 22:36:10.805 [main, ArgScanner:411] zk_endpoints: [zk.example.com:2181] 
> I0929 22:36:10.805 [main, ArgScanner:411] zk_in_proc: false 
> I0929 22:36:10.805 [main, ArgScanner:411] zk_session_timeout: (30, mins) 
> I0929 22:36:10.805 [main, ArgScanner:411] zk_use_curator: true 
> {noformat}
> {noformat}
> I0929 22:48:37.678 [AsyncProcessor-3, TaskHistoryPruner:137] Pruning inactive 
> tasks 
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d, 
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3, 
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621, 
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:37.738 [AsyncProcessor-5, TaskHistoryPruner:137] Pruning inactive 
> tasks 
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d, 
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3, 
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621, 
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> 2016-09-29 
> 22:48:37,794:47040(0x7f07f4c3c940):ZOO_WARN@zookeeper_interest@1570: Exceeded 
> deadline by 12ms
> I0929 22:48:37.805 [AsyncProcessor-0, TaskHistoryPruner:137] Pruning inactive 
> tasks 
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d, 
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3, 
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621, 
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:37.814 [AsyncProcessor-6, MemTaskStore:148] Query took 588 ms: 
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[], 
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[], 
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}], 
> offset=0, limit=0} 
> I0929 22:48:37.867 [AsyncProcessor-1, TaskHistoryPruner:137] Pruning inactive 
> tasks 
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d, 
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3, 
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621, 
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:37.873 [AsyncProcessor-2, MemTaskStore:148] Query took 304 ms: 
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[], 
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[], 
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}], 
> offset=0, limit=0} 
> I0929 22:48:37.875 [AsyncProcessor-7, MemTaskStore:148] Query took 289 ms: 
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[], 
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[], 
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}], 
> offset=0, limit=0} 
> I0929 22:48:37.886 [AsyncProcessor-4, TaskHistoryPruner:137] Pruning inactive 
> tasks 
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d, 
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3, 
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621, 
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:38.045 [AsyncProcessor-3, MemTaskStore:148] Query took 359 ms: 
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[], 
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[], 
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}], 
> offset=0, limit=0} 
> I0929 22:48:38.152 [AsyncProcessor-5, MemTaskStore:148] Query took 405 ms: 
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[], 
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[], 
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}], 
> offset=0, limit=0} 
> I0929 22:48:38.407 [AsyncProcessor-0, MemTaskStore:148] 

[jira] [Updated] (AURORA-1878) Increased executor logs can lead to tasks running out of disk space

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1878:

Fix Version/s: 0.17.0

> Increased executor logs can lead to tasks running out of disk space
> 
>
> Key: AURORA-1878
> URL: https://issues.apache.org/jira/browse/AURORA-1878
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
> Fix For: 0.17.0
>
>
> After the health check for updates patch, this log statement is being emitted 
> once every 500ms: 
> https://github.com/apache/aurora/commit/2992c8b4#diff-6d60c873330419a828fb992f46d53372R121
> This is due to this 
> [code|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/status_checker.py#L120-L124]:
> {code}
> if status_result is not None:
>   log.info('%s reported %s' % (status_checker.__class__.__name__, 
> status_result))
> {code}
> Previously, {{status_result}} would be {{None}} unless the status checker had 
> a terminal event. Now, {{status_result}} will always be set, but we only 
> consider the {{status_result}} to be terminal if the {{status}} is not 
> {{TASK_STARTING}} or {{TASK_RUNNING}}. So, for the healthy case, we log that 
> the task is {{TASK_RUNNING}} every 500ms.
> !https://frinkiac.com/meme/S10E02/818984.jpg?b64lines=IFRISVMgV0lMTCBTT1VORCBFVkVSWQogVEhSRUUgU0VDT05EUyBVTkxFU1MKIFNPTUVUSElORyBJU04nVCBPS0FZIQ==!
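
One possible shape for a fix, sketched with invented helper names and caller-owned state (not the actual patch): log a checker's result only when it changes instead of on every 500ms poll.

{code}
# Sketch: suppress repeated identical status lines from the poll loop.
# `previous` is a dict owned by the caller; all names are illustrative.
import logging

log = logging.getLogger(__name__)

def log_if_changed(previous, checker_name, status_result):
    if status_result is None:
        return
    if previous.get(checker_name) != status_result:
        log.info('%s reported %s', checker_name, status_result)
        previous[checker_name] = status_result
{code}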



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1861) Remove duplicate Snapshot fields for DB stores

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1861:

Fix Version/s: 0.17.0

> Remove duplicate Snapshot fields for DB stores
> --
>
> Key: AURORA-1861
> URL: https://issues.apache.org/jira/browse/AURORA-1861
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>Assignee: David McLaughlin
> Fix For: 0.17.0
>
> Attachments: select-all-job-update-details time.png, 
> snapshot-create-time-only.png, snapshot-total-time.png
>
>
> Currently we double-write any DB-backed stores into a Snapshot struct when 
> creating a Snapshot. This inflates the size of the Snapshot, which is already 
> a problem for large production clusters (see AURORA-74). 
> Example for LockStore from 
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java:
> {code}
>   new SnapshotField() {
> // It's important for locks to be replayed first, since there are 
> relations that expect
> // references to be valid on insertion.
> @Override
> public void saveToSnapshot(MutableStoreProvider store, Snapshot 
> snapshot) {
>   
> snapshot.setLocks(ILock.toBuildersSet(store.getLockStore().fetchLocks()));
> }
> @Override
> public void restoreFromSnapshot(MutableStoreProvider store, Snapshot 
> snapshot) {
>   if (hasDbSnapshot(snapshot)) {
> LOG.info("Deferring lock restore to dbsnapshot");
> return;
>   }
>   store.getLockStore().deleteLocks();
>   if (snapshot.isSetLocks()) {
> for (Lock lock : snapshot.getLocks()) {
>   store.getLockStore().saveLock(ILock.build(lock));
> }
>   }
> }
>   },
> {code}
> The saveToSnapshot here is totally redundant as the entire H2 database is 
> dumped into the dbScript field. 
> Note: one major side-effect is that anyone trying to read these snapshots and 
> utilize the data outside of Java loses the ability to process the data without 
> being able to apply the DB script. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1792) Executor does not log full task information.

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1792:

Fix Version/s: 0.17.0

> Executor does not log full task information.
> 
>
> Key: AURORA-1792
> URL: https://issues.apache.org/jira/browse/AURORA-1792
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.17.0
>
>
> I launched a task that has an {{initial_interval_secs}} in the health check 
> config. However the log contains no information about this field:
> {noformat}
> $ grep "initial_interval_secs" __main__.log
> {noformat}
> We should log the entire ExecutorInfo blob.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1541) Observer logs are noisy

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1541:

Fix Version/s: 0.17.0

> Observer logs are noisy
> ---
>
> Key: AURORA-1541
> URL: https://issues.apache.org/jira/browse/AURORA-1541
> Project: Aurora
>  Issue Type: Bug
>  Components: Observer
>Reporter: David Robinson
>Assignee: Stephan Erb
>Priority: Minor
> Fix For: 0.17.0
>
>
> The observer's logs consist of lots of warnings about being unable to find 
> PIDs. This is likely due to the checkpoint pointing to PIDs that have been 
> cleaned by Mesos.
> {noformat}
> W1117 20:11:38.103549 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 39594
> W1117 20:11:38.151583 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 14012
> W1117 20:11:38.232773 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 26565
> W1117 20:11:38.486680 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 44902
> W1117 20:11:38.612293 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 32871
> W1117 20:11:38.694812 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 7182
> {noformat}
> The warning messages should probably be debug messages, since Mesos cleaning 
> sandboxes is an expected operation.
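
A minimal sketch of the suggested change (assuming psutil and a standard logger; not the observer's actual sampling code): treat a vanished PID as an expected, debug-level event.

{code}
# Sketch: a missing PID is expected once Mesos has cleaned the sandbox,
# so report it at debug level rather than as a warning.
import logging
import psutil

log = logging.getLogger(__name__)

def sample_rss(pid):
    try:
        return psutil.Process(pid).memory_info().rss
    except psutil.NoSuchProcess:
        log.debug('Error during process sampling: no process found with pid %s', pid)
        return None
{code}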



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1789) Incorrect --mesos_containerizer_path value results in thermos failure loop

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1789:

Fix Version/s: 0.17.0

> Incorrect --mesos_containerizer_path value results in thermos failure loop
> --
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
> Fix For: 0.17.0
>
>
> When using the Mesos containerizer with the namespaces/pid isolator and a Docker 
> image, the Thermos executor is unable to launch processes. The executor forks 
> the process but is then unable to locate it after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1793:

Fix Version/s: 0.17.0

> Revert Commit ca683 which is not backwards compatible
> -
>
> Key: AURORA-1793
> URL: https://issues.apache.org/jira/browse/AURORA-1793
> Project: Aurora
>  Issue Type: Bug
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Blocker
> Fix For: 0.17.0
>
>
> The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards 
> compatible. We decided to revert this commit.
> The changes that directly causes problems is:
> {code}
> Modify executor state transition logic to rely on health checks (if enabled).
> commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
> {code}
> There are two downstream commits that depend on the above commit:
> {code}
> Add min_consecutive_health_checks in HealthCheckConfig
> commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
> {code}
> {code}
> Add support for receiving min_consecutive_successes in health checker
> commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
> {code}
> We will drop all three of these commits and revert to the commit before the 
> problematic one:
> {code}
> Running task ssh without an instance should pick a random instance
> commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1798) resolv.conf is not copied when using the Mesos containerizer with a Docker image

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1798:

Fix Version/s: 0.17.0

> resolv.conf is not copied when using the Mesos containerizer with a Docker 
> image
> 
>
> Key: AURORA-1798
> URL: https://issues.apache.org/jira/browse/AURORA-1798
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
> Fix For: 0.17.0
>
>
> When Thermos launches a task using a Docker image it mounts the image as a 
> volume and manually chroots into it. One consequence of this is the logic 
> inside of the {{network/cni}} isolator that copies {{resolv.conf}} from the 
> host into the new rootfs is bypassed. The Thermos executor should manually 
> copy this file into the rootfs until Mesos pod support is implemented.
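
A minimal sketch of the stop-gap described above, using only the standard library (the rootfs layout handling is illustrative, not the actual executor change):

{code}
# Sketch: copy the host resolv.conf into the task's rootfs before chrooting,
# mirroring what the network/cni isolator would otherwise have done.
import os
import shutil

def copy_resolv_conf(rootfs):
    target_dir = os.path.join(rootfs, 'etc')
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    shutil.copy('/etc/resolv.conf', os.path.join(target_dir, 'resolv.conf'))
{code}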



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1225:

Fix Version/s: 0.17.0

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Santhosh Kumar Shanmugham
> Fix For: 0.17.0
>
>
> The executor needs to start executing user content in STARTING and transition to 
> RUNNING once the required number of successful health checks is reached.
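
In sketch form (invented function names, not the executor implementation), the desired gating looks like this:

{code}
# Sketch: remain in STARTING until the required number of consecutive
# successful health checks is seen, then transition to RUNNING.
import time

def health_gated_startup(required_successes, interval_secs, check_health, on_running):
    consecutive = 0
    while consecutive < required_successes:
        consecutive = consecutive + 1 if check_health() else 0
        if consecutive < required_successes:
            time.sleep(interval_secs)
    on_running()

if __name__ == '__main__':
    # Toy usage: two trivially passing checks, no waiting in between.
    def announce_running():
        print('RUNNING')
    health_gated_startup(2, 0, lambda: True, announce_running)
{code}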



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1791) Commit ca683 is not backwards compatible.

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1791:

Fix Version/s: 0.17.0

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
> Fix For: 0.17.0
>
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds, so the earliest a task can fail is at the 10th second.
> On master, health checking starts right away, which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}
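
For reference, the 0.16.0 semantics described above amount to a grace period before failures count; a minimal sketch with invented names (the exact threshold comparison in the real health checker may differ):

{code}
# Sketch of the pre-ca683 behaviour: health-check failures observed before
# initial_interval_secs has elapsed do not count towards failing the task.
import time

def should_fail_task(started_at, initial_interval_secs,
                     consecutive_failures, max_consecutive_failures):
    in_grace_period = (time.time() - started_at) < initial_interval_secs
    return (not in_grace_period
            and consecutive_failures >= max_consecutive_failures)
{code}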



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1795) Internal server error in scheduler Thrift API on missing Content-Type

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1795:

Fix Version/s: 0.17.0

> Internal server error in scheduler Thrift API on missing Content-Type
> -
>
> Key: AURORA-1795
> URL: https://issues.apache.org/jira/browse/AURORA-1795
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.16.0
>Reporter: Stephan Erb
>Assignee: Zameer Manji
> Fix For: 0.17.0
>
>
> This happens if a user has a very old browser, e.g. Firefox 41.
> {code}
> I1017 09:38:15.618 [qtp1426166274-44336, Slf4jRequestLog:60] 10.x.x.x - - 
> [17/Oct/2016:09:38:15 +] "POST //foobar.example.org/api HTTP/1.1" 200 794
> W1017 09:38:15.627 [qtp1426166274-44066, ServletHandler:631] /api 
> java.lang.NullPointerException: null
> at java.util.Objects.requireNonNull(Objects.java:203) 
> ~[na:1.8.0-internal]
> at java.util.Optional.<init>(Optional.java:96) ~[na:1.8.0-internal]
> at java.util.Optional.of(Optional.java:108) ~[na:1.8.0-internal]
> at 
> org.apache.aurora.scheduler.http.api.TContentAwareServlet.doPost(TContentAwareServlet.java:123)
>  ~[aurora-0.16.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.api.TContentAwareServlet.doGet(TContentAwareServlet.java:164)
>  ~[aurora-0.16.0.jar:na]
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) 
> ~[javax.servlet-api-3.1.0.jar:3.1.0]
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) 
> ~[javax.servlet-api-3.1.0.jar:3.1.0]
> at 
> com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.LeaderRedirectFilter.doFilter(LeaderRedirectFilter.java:72)
>  ~[aurora-0.16.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.AbstractFilter.doFilter(AbstractFilter.java:44)
>  ~[aurora-0.16.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.HttpStatsFilter.doFilter(HttpStatsFilter.java:71)
>  ~[aurora-0.16.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.AbstractFilter.doFilter(AbstractFilter.java:44)
>  ~[aurora-0.16.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> 

[jira] [Updated] (AURORA-655) Order job update events and instance events by ID rather than timestamp

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-655:
---
Fix Version/s: 0.17.0

> Order job update events and instance events by ID rather than timestamp
> ---
>
> Key: AURORA-655
> URL: https://issues.apache.org/jira/browse/AURORA-655
> Project: Aurora
>  Issue Type: Story
>  Components: Scheduler
>Reporter: Bill Farner
>Assignee: Jing Chen
>Priority: Trivial
>  Labels: newbie
> Fix For: 0.17.0
>
>
> In {{JobUpdateDetailsMapper.xml}} we order by timestamps, which could be 
> brittle if the system time changes.  Instead of using the timestamp, use the 
> built-in database {{IDENTITY}} for sort order.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1684) Cron tasks are sanitized multiple times (once when being created via the API, and again when actually being triggered)

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1684:

Fix Version/s: 0.17.0

> Cron tasks are sanitized multiple times (once when being created via the API, 
> and again when actually being triggered)
> --
>
> Key: AURORA-1684
> URL: https://issues.apache.org/jira/browse/AURORA-1684
> Project: Aurora
>  Issue Type: Bug
>Reporter: Steve Niemitz
>Assignee: Steve Niemitz
> Fix For: 0.17.0
>
>
> This can cause issues in the following scenario:
> - An operator sets default_docker_parameters on the scheduler
> - The operator DOES NOT allow docker parameters (via allow_docker_parameters)
> - A user schedules a cron job using a docker container.
> Because the first pass of ConfigurationManager.validateAndPopulate will 
> mutate the task to have docker parameters (the defaults), the second pass in 
> SanitizedCronJob.fromUnsanitized will fail validation.
> A solution here may be to remove fromUnsanitized and instead pass the job 
> configuration directly, since we know it will always be safe.
> {code}
> W0427 17:01:35.286 [QuartzScheduler_Worker-5, AuroraCronJob:134] Invalid cron 
> job for IJobKey{role=tcdc-infra, environment=prod, 
> name=security-group-alerter} in storage - failed to parse with {} 
> org.apache.aurora.scheduler.configuration.ConfigurationManager$TaskDescriptionException:
>  Docker parameters not allowed.
>   at 
> org.apache.aurora.scheduler.configuration.ConfigurationManager.validateAndPopulate(ConfigurationManager.java:249)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.configuration.ConfigurationManager.validateAndPopulate(ConfigurationManager.java:166)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.configuration.SanitizedConfiguration.fromUnsanitized(SanitizedConfiguration.java:60)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.cron.SanitizedCronJob.<init>(SanitizedCronJob.java:45)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.cron.SanitizedCronJob.fromUnsanitized(SanitizedCronJob.java:102)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$doExecute$163(AuroraCronJob.java:132)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.log.LogStorage.lambda$doInTransaction$222(LogStorage.java:524)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage.transactionedWrite(DbStorage.java:160)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4.CGLIB$transactionedWrite$2()
>  ~[guice-3.0.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4$$FastClassByGuice$$e3e3ff55.invoke()
>  ~[guice-3.0.jar:na]
>   at 
> com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
>  ~[guice-3.0.jar:na]
>   at 
> com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
>  ~[guice-3.0.jar:na]
>   at 
> org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
>  ~[mybatis-guice-3.7.jar:3.7]
>   at 
> com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
>  ~[guice-3.0.jar:na]
>   at 
> com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
>  ~[guice-3.0.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4.transactionedWrite()
>  ~[guice-3.0.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$188(DbStorage.java:174)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:172) 
> ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4.CGLIB$write$3()
>  ~[guice-3.0.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4$$FastClassByGuice$$e3e3ff55.invoke()
>  ~[guice-3.0.jar:na]
>   at 
> com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
>  ~[guice-3.0.jar:na]
>   at 
> com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
>  

[jira] [Updated] (AURORA-1794) Scheduler fails to start if -enable_revocable_ram is toggled

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1794:

Fix Version/s: 0.17.0

> Scheduler fails to start if -enable_revocable_ram is toggled
> 
>
> Key: AURORA-1794
> URL: https://issues.apache.org/jira/browse/AURORA-1794
> Project: Aurora
>  Issue Type: Story
>Affects Versions: 0.16.0
>Reporter: Stephan Erb
>Assignee: Stephan Erb
> Fix For: 0.17.0
>
>
> The scheduler does not start if {{-enable_revocable_ram}} is set:
> {code}
> Exception in thread "main" java.lang.IllegalStateException: A value cannot be 
> changed after it was read.
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:174)
> at org.apache.aurora.common.args.Arg.set(Arg.java:54)
> at 
> org.apache.aurora.common.args.ArgumentInfo.setValue(ArgumentInfo.java:128)
> at org.apache.aurora.common.args.OptionInfo.load(OptionInfo.java:131)
> at 
> org.apache.aurora.common.args.ArgScanner.process(ArgScanner.java:368)
> at org.apache.aurora.common.args.ArgScanner.parse(ArgScanner.java:200)
> at org.apache.aurora.common.args.ArgScanner.parse(ArgScanner.java:178)
> at org.apache.aurora.common.args.ArgScanner.parse(ArgScanner.java:155)
> at 
> org.apache.aurora.scheduler.app.SchedulerMain.applyStaticArgumentValues(SchedulerMain.java:226)
> at 
> org.apache.aurora.scheduler.app.SchedulerMain.main(SchedulerMain.java:197)
> {code}
> This is an unfortunate oversight at my end. When introducing the feature, I 
> deferred the e2e test. It 'worked' in a manual test - at least that is what I 
> believed. Probably, I had only added the flag to the config in the repo, but 
> not to the one that was actually started in vagrant.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1880) How to set the environment variable for Mesos Containerizer?

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1880:

Fix Version/s: 0.17.0

> How to set the environment variable for Mesos Containerizer?
> 
>
> Key: AURORA-1880
> URL: https://issues.apache.org/jira/browse/AURORA-1880
> Project: Aurora
>  Issue Type: Bug
>Affects Versions: 0.16.0, 0.15.0
>Reporter: jackyoh
> Fix For: 0.17.0
>
>
> I'm running a Docker image on the Aurora framework.
> The question is: how do I set an environment variable for the Mesos containerizer?
> For example, the Docker equivalent would be:
> docker run -e ENV1=env1 ...
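
As an aside, and independent of whatever the containerizer itself supports, the variable can also be exported by the process command line in the .aurora file. A config fragment only, with invented process and binary names:

{code}
# Fragment of an .aurora file: export the variable inside the process cmdline,
# which works regardless of the containerizer in use.
main = Process(
  name = 'main',
  cmdline = 'export ENV1=env1 && exec ./my_service'
)
{code}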



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1110) Running task ssh without an instance should pick a random instance

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1110:

Fix Version/s: 0.17.0

> Running task ssh without an instance should pick a random instance
> --
>
> Key: AURORA-1110
> URL: https://issues.apache.org/jira/browse/AURORA-1110
> Project: Aurora
>  Issue Type: Story
>  Components: Client
>Reporter: Joshua Cohen
>Assignee: Jing Chen
>Priority: Trivial
>  Labels: newbie
> Fix For: 0.17.0
>
>
> I always forget to add an instance to the end of the job key when ssh'ing. It 
> might be nice if running {{aurora task ssh ...}} without specifying an 
> instance either picked a random instance or just defaulted to instance 0.
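
The client-side selection itself is small; a standalone sketch (a plain set stands in for the job's active instance IDs):

{code}
# Sketch: fall back to a random instance when none was specified.
import random

def pick_instance(requested_instance, active_instances):
    if requested_instance is not None:
        return requested_instance
    return random.choice(sorted(active_instances))

if __name__ == '__main__':
    print(pick_instance(None, {0, 1, 2, 3}))
{code}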



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1875) The thriftw compatibility thrift binary check is too loose

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1875:

Fix Version/s: 0.17.0

> The thriftw compatibility thrift binary check is too loose
> --
>
> Key: AURORA-1875
> URL: https://issues.apache.org/jira/browse/AURORA-1875
> Project: Aurora
>  Issue Type: Bug
>Reporter: John Sirois
>Assignee: John Sirois
> Fix For: 0.17.0
>
>
> Right now the 
> [check|https://github.com/apache/aurora/blob/master/build-support/thrift/thriftw#L31]
>  is only for the proper version. We also need to check that java and python 
> codegen are both supported by the binary, since we use both.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1858) Expose stats on offers known to scheduler

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1858:

Fix Version/s: 0.17.0

> Expose stats on offers known to scheduler
> -
>
> Key: AURORA-1858
> URL: https://issues.apache.org/jira/browse/AURORA-1858
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: newbie
> Fix For: 0.17.0
>
>
> Expose stats on the number of offers tracked by {{OfferManager}}. This can 
> simply be defined as a collection size gauge on {{offers}} set.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1823:

Fix Version/s: 0.17.0

> `createJob` API uses single thread to move all tasks to PENDING 
> 
>
> Key: AURORA-1823
> URL: https://issues.apache.org/jira/browse/AURORA-1823
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>Priority: Minor
> Fix For: 0.17.0
>
>
> If you create a single job with many tasks (let's say 10k+) the `createJob` 
> API will take a long time. This is because the `createJob` API only returns 
> when all of the tasks have moved to PENDING and it uses a single thread to do 
> so. Here is a snippet of the logs:
> {noformat}
> ...
> I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
>  state machine transition INIT -> PENDING
> I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
> I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
>  state machine transition INIT -> PENDING
> I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
> I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
>  state machine transition INIT -> PENDING
> I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
> I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
>  state machine transition INIT -> PENDING
> I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
> I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
>  state machine transition INIT -> PENDING
> I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
> I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
>  state machine transition INIT -> PENDING
> I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
> I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
>  state machine transition INIT -> PENDING
> I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
> ...
> {noformat}
> Observe that a single jetty thread is doing this.
> We should leverage {{BatchWorker}} to have concurrent mutations here.
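
For illustration only, a toy sketch of fanning the INIT -> PENDING transitions out over a worker pool (plain Python with concurrent.futures; this is not the scheduler's actual {{BatchWorker}}):

{code}
from concurrent.futures import ThreadPoolExecutor

def move_to_pending(task_id):
    # Stand-in for the real state transition plus the SAVE_STATE work command.
    return (task_id, "PENDING")

task_ids = ["task-%d" % i for i in range(10000)]

# Instead of one request thread walking all 10k tasks sequentially,
# split the transitions across a small pool of workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(move_to_pending, task_ids))
{code}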



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1737) Descheduling a cron job checks role access before job key existence

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1737:

Fix Version/s: 0.17.0

> Descheduling a cron job checks role access before job key existence
> ---
>
> Key: AURORA-1737
> URL: https://issues.apache.org/jira/browse/AURORA-1737
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Joshua Cohen
>Assignee: Jing Chen
>Priority: Minor
> Fix For: 0.17.0
>
>
> Trying to deschedule a cron job for a non-existent role returns a permission 
> error rather than a no-such-job error. This leads to confusion for users in 
> the event of a typo in the role.
> Given that jobs are world-readable, we should check for a valid job key 
> before applying permissions.
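
A minimal sketch of the intended check ordering (plain Python with stand-in data structures; the real scheduler logic lives in the thrift API layer):

{code}
def deschedule_cron(job_key, principal, cron_jobs, authorized_roles):
    """Check job existence (world-readable) before enforcing role permissions."""
    role = job_key.split("/")[0]
    if job_key not in cron_jobs:
        return "error: no cron job found for %s" % job_key
    if (principal, role) not in authorized_roles:
        return "error: %s may not modify jobs of role %s" % (principal, role)
    cron_jobs.remove(job_key)
    return "ok"

# A typo in the role now yields a no-such-job error instead of a permission error.
print(deschedule_cron("wwwdata/prod/cron_job", "alice",
                      {"www-data/prod/cron_job"}, {("alice", "www-data")}))
{code}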



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1787) `-global_container_mounts` does not appear to work with the unified containerizer

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1787:

Fix Version/s: 0.17.0

> `-global_container_mounts` does not appear to work with the unified 
> containerizer
> -
>
> Key: AURORA-1787
> URL: https://issues.apache.org/jira/browse/AURORA-1787
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Priority: Critical
> Fix For: 0.17.0
>
>
> Perhaps I misunderstand how this feature is supposed to be used, but apply 
> the following patch to master:
> {noformat}
> From 1ebb5f4c5815c647e31f3253d5e5c316a0d5edd2 Mon Sep 17 00:00:00 2001
> From: Zameer Manji 
> Date: Tue, 4 Oct 2016 20:45:41 -0700
> Subject: [PATCH] Reproduce the issue.
> ---
>  examples/vagrant/upstart/aurora-scheduler.conf |  2 +-
>  src/test/sh/org/apache/aurora/e2e/run-server.sh|  4 
>  .../sh/org/apache/aurora/e2e/test_end_to_end.sh| 26 
> +++---
>  3 files changed, 18 insertions(+), 14 deletions(-)
> diff --git a/examples/vagrant/upstart/aurora-scheduler.conf 
> b/examples/vagrant/upstart/aurora-scheduler.conf
> index 91b27d7..851b5a1 100644
> --- a/examples/vagrant/upstart/aurora-scheduler.conf
> +++ b/examples/vagrant/upstart/aurora-scheduler.conf
> @@ -40,7 +40,7 @@ exec bin/aurora-scheduler \
>-native_log_file_path=/var/db/aurora \
>-backup_dir=/var/lib/aurora/backups \
>-thermos_executor_path=$DIST_DIR/thermos_executor.pex \
> -  
> -global_container_mounts=/home/vagrant/aurora/examples/vagrant/config:/home/vagrant/aurora/examples/vagrant/config:ro
>  \
> +  -global_container_mounts=/etc/rsyslog.d:rsyslog.d.container:ro \
>-thermos_executor_flags="--announcer-ensemble localhost:2181 
> --announcer-zookeeper-auth-config 
> /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json 
> --mesos-containerizer-path=/usr/libexec/mesos/mesos-containerizer" \
>-allowed_container_types=MESOS,DOCKER \
>-http_authentication_mechanism=BASIC \
> diff --git a/src/test/sh/org/apache/aurora/e2e/run-server.sh 
> b/src/test/sh/org/apache/aurora/e2e/run-server.sh
> index 1fe0909..a0ee76f 100755
> --- a/src/test/sh/org/apache/aurora/e2e/run-server.sh
> +++ b/src/test/sh/org/apache/aurora/e2e/run-server.sh
> @@ -1,6 +1,10 @@
>  #!/bin/bash
>  
>  echo "Starting up server..."
> +if [ ! -d "./rsyslog.d.container" ]; then
> +  echo "Mountpoint Doesn't Exist";
> +  exit 1;
> +fi
>  while true
>  do
>echo -e "HTTP/1.1 200 OK\r\n\r\nHello from a filesystem image." | nc -l 
> "$1"
> diff --git a/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
> b/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> index c93be9b..094d776 100755
> --- a/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> +++ b/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> @@ -514,27 +514,27 @@ trap collect_result EXIT
>  aurorabuild all
>  setup_ssh
>  
> -test_version
> -test_http_example "${TEST_JOB_ARGS[@]}"
> -test_health_check
> +# test_version
> +# test_http_example "${TEST_JOB_ARGS[@]}"
> +# test_health_check
>  
> -test_http_example_basic "${TEST_JOB_REVOCABLE_ARGS[@]}"
> +# test_http_example_basic "${TEST_JOB_REVOCABLE_ARGS[@]}"
>  
> -test_http_example_basic "${TEST_JOB_GPU_ARGS[@]}"
> +# test_http_example_basic "${TEST_JOB_GPU_ARGS[@]}"
>  
>  # build the test docker image
> -sudo docker build -t http_example -f "${TEST_ROOT}/Dockerfile.python" 
> ${TEST_ROOT}
> -test_http_example "${TEST_JOB_DOCKER_ARGS[@]}"
> +# sudo docker build -t http_example -f "${TEST_ROOT}/Dockerfile.python" 
> ${TEST_ROOT}
> +# test_http_example "${TEST_JOB_DOCKER_ARGS[@]}"
>  
>  setup_image_stores
>  test_appc_unified
> -test_docker_unified
> +# test_docker_unified
>  
> -test_admin "${TEST_ADMIN_ARGS[@]}"
> -test_basic_auth_unauthenticated  "${TEST_JOB_ARGS[@]}"
> +# test_admin "${TEST_ADMIN_ARGS[@]}"
> +# test_basic_auth_unauthenticated  "${TEST_JOB_ARGS[@]}"
>  
> -test_ephemeral_daemon_with_final 
> "${TEST_JOB_EPHEMERAL_DAEMON_WITH_FINAL_ARGS[@]}"
> +# test_ephemeral_daemon_with_final 
> "${TEST_JOB_EPHEMERAL_DAEMON_WITH_FINAL_ARGS[@]}"
>  
> -/vagrant/src/test/sh/org/apache/aurora/e2e/test_kerberos_end_to_end.sh
> -/vagrant/src/test/sh/org/apache/aurora/e2e/test_bypass_leader_redirect_end_to_end.sh
> +# /vagrant/src/test/sh/org/apache/aurora/e2e/test_kerberos_end_to_end.sh
> +# 
> /vagrant/src/test/sh/org/apache/aurora/e2e/test_bypass_leader_redirect_end_to_end.sh
>  RETCODE=0
> -- 
> 2.10.0
> {noformat}
> You can apply the patch by copying the content to a {{.patch}} file and 
> running {{git am < file.patch}}
> Run the e2e tests.
> Observe that the tests fail because the tasks fail. The tasks fail because 
> the mountpoint in their sandbox does not exist.
> I observe the correct 

[jira] [Updated] (AURORA-894) Server updater should watch healthy instances

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-894:
---
Fix Version/s: 0.17.0

> Server updater should watch healthy instances
> -
>
> Key: AURORA-894
> URL: https://issues.apache.org/jira/browse/AURORA-894
> Project: Aurora
>  Issue Type: Epic
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>  Labels: 2015-Q2
> Fix For: 0.17.0
>
>
> Instead of starting the {{minWaitInInstanceRunningMs}} (aka {{watch_secs}}) 
> countdown when an instance reaches RUNNING state, the updater should rely on 
> the first successful health check. This will potentially speed up 
> updates as the {{minWaitInInstanceRunningMs}} will no longer have to be 
> chosen based on the worst observed instance startup/warmup delay but rather 
> as a desired health check duration according to the following formula:
> {noformat}
> minWaitInInstanceRunningMs = interval_secs x num_desired_healthchecks x 1000
> {noformat}
> where:
>   {{interval_secs}} - 
> https://github.com/apache/incubator-aurora/blob/master/docs/configuration-reference.md#healthcheckconfig-objects
>   {{num_desired_healthchecks}} - the desired number of OK health checks to 
> observe before declaring an instance updated successfully
>   
> The above would allow every instance to start its watching interval based on 
> the individual instance's performance and potentially exit the updater earlier. 
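
As a quick worked example of the proposed formula (the concrete numbers below are made up for illustration):

{code}
interval_secs = 10             # HealthCheckConfig interval between health checks
num_desired_healthchecks = 6   # consecutive OK health checks to observe

min_wait_in_instance_running_ms = interval_secs * num_desired_healthchecks * 1000
print(min_wait_in_instance_running_ms)  # 60000, i.e. watch each instance for about 60 seconds
{code}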



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-343) HTTP thrift service is not over SSL

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-343:
---
Fix Version/s: 0.17.0

> HTTP thrift service is not over SSL
> ---
>
> Key: AURORA-343
> URL: https://issues.apache.org/jira/browse/AURORA-343
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Bill Farner
>Assignee: Stephan Erb
>Priority: Minor
>  Labels: newbie
> Fix For: 0.17.0
>
>
> {{SchedulerAPIServlet}} is bound against the default debug HTTP server, which 
> is non-encrypted.  This leaves the door open to snooping.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-133) write_lock_wait_nanos stat is misleading and of little use

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-133:
---
Fix Version/s: 0.17.0

> write_lock_wait_nanos stat is misleading and of little use
> --
>
> Key: AURORA-133
> URL: https://issues.apache.org/jira/browse/AURORA-133
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Bill Farner
>Priority: Minor
> Fix For: 0.17.0
>
>
> {{write_lock_wait_nanos}} is not useful since the intrinsic lock on 
> {{LogStorage}} will already be contended for and held by the time the read/write lock 
> is acquired.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Reopened] (AURORA-1712) Debian Jessie packages are embedding the mesos egg build for Ubuntu trusty

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb reopened AURORA-1712:
-

This bug will only be fixed once we adapt the packaging scripts.

> Debian Jessie packages are embedding the mesos egg build for Ubuntu trusty
> ---
>
> Key: AURORA-1712
> URL: https://issues.apache.org/jira/browse/AURORA-1712
> Project: Aurora
>  Issue Type: Bug
>Reporter: Stephan Erb
>Assignee: Renan DelValle
>
> The Debian packaging scripts for Trusty and Jessie share the same 
> override mechanism for the pants third_party repository. We therefore end up 
> using egg files built for Ubuntu on Debian as well 
> (https://github.com/apache/aurora-packaging/blob/master/specs/debian/aurora-pants.ini).
> This seems to kind of work, but it is clearly not optimal.
> We should extend 
> https://github.com/apache/aurora/blob/master/build-support/python/make-mesos-native-egg
>  to support Debian and then make use of it in our packaging infrastructure 
> https://github.com/apache/aurora-packaging.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (AURORA-894) Server updater should watch healthy instances

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-894.

Resolution: Fixed

> Server updater should watch healthy instances
> -
>
> Key: AURORA-894
> URL: https://issues.apache.org/jira/browse/AURORA-894
> Project: Aurora
>  Issue Type: Epic
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>  Labels: 2015-Q2
>
> Instead of starting the {{minWaitInInstanceRunningMs}} (aka {{watch_secs}}) 
> countdown when an instance reaches RUNNING state, the updater should rely on 
> the first successful health check. This will potentially speed up 
> updates as the {{minWaitInInstanceRunningMs}} will no longer have to be 
> chosen based on the worst observed instance startup/warmup delay but rather 
> as a desired health check duration according to the following formula:
> {noformat}
> minWaitInInstanceRunningMs = interval_secs x num_desired_healthchecks x 1000
> {noformat}
> where:
>   {{interval_secs}} - 
> https://github.com/apache/incubator-aurora/blob/master/docs/configuration-reference.md#healthcheckconfig-objects
>   {{num_desired_healthchecks}} - the desired number of OK health checks to 
> observe before declaring an instance updated successfully
>   
> The above would allow every instance to start its watching interval based on 
> the individual instance's performance and potentially exit the updater earlier. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key

2017-01-30 Thread Stephan Erb (JIRA)

Stephan Erb commented on AURORA-1879:
-

commit c385b63f43b84519f7bee74178906bc76a8c8cfb
Author: Stephan Erb 
Date:   Mon Jan 30 21:37:59 2017 +0100

Fix pendingTasks endpoint in case of multiple TaskGroups per job.

Central idea of this patch is to change the return value of `getPendingReasons`
from a map keyed by JobKey to a map keyed by `TaskGroupKey`. This prevents the
`IllegalArgumentException` during the map construction.

Bugs closed: AURORA-1879

Reviewed at https://reviews.apache.org/r/56058/

 src/main/java/org/apache/aurora/scheduler/http/PendingTasks.java   | 24 +++-
 src/main/java/org/apache/aurora/scheduler/metadata/NearestFit.java |  7 ---
 src/test/java/org/apache/aurora/scheduler/http/PendingTasksTest.java   | 28 ++--
 src/test/java/org/apache/aurora/scheduler/metadata/NearestFitTest.java |  8 
 4 files changed, 45 insertions(+), 22 deletions(-)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346-sha1:dbc023d)


[jira] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key

2017-01-29 Thread Stephan Erb (JIRA)

Stephan Erb assigned an issue to Stephan Erb

Aurora / AURORA-1879

Change By: Stephan Erb
Assignee: Stephan Erb

--
This message was sent by Atlassian JIRA
(v6.3.15#6346-sha1:dbc023d)


[jira] (AURORA-1812) Upgrading scheduler multiple times in succession can lead to incompatible snapshot restore

2017-01-29 Thread Stephan Erb (JIRA)

Stephan Erb commented on AURORA-1812:
-

While I still believe this is important, I have removed the 0.17 milestone. We 
don't have the necessary capacity to get this ready for 0.17.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346-sha1:dbc023d)


[jira] (AURORA-1812) Upgrading scheduler multiple times in succession can lead to incompatible snapshot restore

2017-01-29 Thread Stephan Erb (JIRA)

Stephan Erb updated an issue

Aurora / AURORA-1812

Change By: Stephan Erb
Fix Version/s: 0.17.0

--
This message was sent by Atlassian JIRA
(v6.3.15#6346-sha1:dbc023d)


[jira] (AURORA-1811) sla_list_safe_domain no longer reports SLA usage

2017-01-29 Thread Stephan Erb (JIRA)

Stephan Erb updated an issue

Aurora / AURORA-1811

Change By: Stephan Erb
Fix Version/s: 0.17.0

--
This message was sent by Atlassian JIRA
(v6.3.15#6346-sha1:dbc023d)


[jira] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed

2017-01-29 Thread Stephan Erb (JIRA)

Stephan Erb assigned an issue to Stephan Erb

https://reviews.apache.org/r/56062/

Aurora / AURORA-1809

Change By: Stephan Erb
Assignee: Stephan Erb
Status: Open Reviewable

--
This message was sent by Atlassian JIRA
(v6.3.15#6346-sha1:dbc023d)


[jira] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key

2017-01-29 Thread Stephan Erb (JIRA)

Stephan Erb commented on AURORA-1879:
-

https://reviews.apache.org/r/56058

--
This message was sent by Atlassian JIRA
(v6.3.15#6346-sha1:dbc023d)


[jira] [Comment Edited] (AURORA-1751) Update org.apache.aurora/aurora-api in Maven

2017-01-25 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838747#comment-15838747
 ] 

Stephan Erb edited comment on AURORA-1751 at 1/25/17 10:54 PM:
---

[~rohit.aggarwal] are you blocked on this? Or do you see a way that you just 
build the API yourself?

Edit: Before you opened this issue I was not aware that the Aurora project used 
to release this on Maven Central. If you are really the only user of this API 
package, it might not make sense for us to distribute this as an official part 
of the Apache Aurora project. The Apache foundation requires us to perform 
testing and voting for each deliverable and this comes with some burden and 
process overhead for the committers/PMC members.


was (Author: stephanerb):
[~rohit.aggarwal] are you blocked on this? Or do you see a way that you just 
build the API yourself?

> Update org.apache.aurora/aurora-api in Maven
> 
>
> Key: AURORA-1751
> URL: https://issues.apache.org/jira/browse/AURORA-1751
> Project: Aurora
>  Issue Type: Task
>  Components: Packaging
>Affects Versions: 0.13.0
>Reporter: Derek Slager
>Assignee: Jake Farrell
>Priority: Minor
>
> Currently the version of org.apache.aurora/aurora-api available on Maven 
> Central is 0.8.0, which is several versions out of date. It would be ideal to 
> have up-to-date versions available as new Aurora releases are cut.
> https://mvnrepository.com/artifact/org.apache.aurora/aurora-api



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1751) Update org.apache.aurora/aurora-api in Maven

2017-01-25 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838747#comment-15838747
 ] 

Stephan Erb commented on AURORA-1751:
-

[~rohit.aggarwal] are you blocked on this? Or do you see a way that you just 
build the API yourself?

> Update org.apache.aurora/aurora-api in Maven
> 
>
> Key: AURORA-1751
> URL: https://issues.apache.org/jira/browse/AURORA-1751
> Project: Aurora
>  Issue Type: Task
>  Components: Packaging
>Affects Versions: 0.13.0
>Reporter: Derek Slager
>Assignee: Jake Farrell
>Priority: Minor
>
> Currently the version of org.apache.aurora/aurora-api available on Maven 
> Central is 0.8.0, which is several versions out of date. It would be ideal to 
> have up-to-date versions available as new Aurora releases are cut.
> https://mvnrepository.com/artifact/org.apache.aurora/aurora-api



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1880) How to set the environment variable for Mesos Containerizer?

2017-01-25 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838731#comment-15838731
 ] 

Stephan Erb commented on AURORA-1880:
-

There are multiple ways to achieve this. The easiest is probably to just place 
the environment variable before the command you are executing, i.e.

{code}
my_process = Process(
  name = 'my_process',
  cmdline = "ENV1=env ./my_cool_script.sh")
{code}


Another alternative would be to use a thermos_profile, as sketched below and described in 
https://github.com/apache/aurora/blob/master/docs/reference/configuration-tutorial.md#getting-environment-variables-into-the-sandbox.
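
As a rough sketch of that approach (process names here are illustrative; the {{.thermos_profile}} sourcing mechanism is the one described in the linked tutorial):

{code}
# Write a .thermos_profile into the sandbox; thermos sources it for every process.
setup_env = Process(
  name = 'setup_env',
  cmdline = "echo 'export ENV1=env1' > .thermos_profile")

my_process = Process(
  name = 'my_process',
  cmdline = './my_cool_script.sh')  # sees ENV1 via the sourced profile

env_task = SequentialTask(
  processes = [setup_env, my_process],
  resources = Resources(cpu = 0.1, ram = 16*MB, disk = 16*MB))
{code}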

I am closing this ticket as it is not a real bug. If you have any further 
questions, please contact us via the mailing list or in Slack/IRC: 
http://aurora.apache.org/community/ 



> How to set the environment variable for Mesos Containerizer?
> 
>
> Key: AURORA-1880
> URL: https://issues.apache.org/jira/browse/AURORA-1880
> Project: Aurora
>  Issue Type: Bug
>Affects Versions: 0.16.0, 0.15.0
>Reporter: jackyoh
>
> I'm running a Docker container on the Aurora framework.
> The question is: how to set the environment variable for Mesos Containerizer?
> For example:
> docker run -e ENV1=env1 ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (AURORA-1880) How to set the environment variable for Mesos Containerizer?

2017-01-25 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb closed AURORA-1880.
---
Resolution: Not A Bug

> How to set the environment variable for Mesos Containerizer?
> 
>
> Key: AURORA-1880
> URL: https://issues.apache.org/jira/browse/AURORA-1880
> Project: Aurora
>  Issue Type: Bug
>Affects Versions: 0.16.0, 0.15.0
>Reporter: jackyoh
>
> I'm running a Docker container on the Aurora framework.
> The question is: how to set the environment variable for Mesos Containerizer?
> For example:
> docker run -e ENV1=env1 ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key

2017-01-25 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838704#comment-15838704
 ] 

Stephan Erb commented on AURORA-1879:
-

This is a regression on master. We should fix it before releasing 0.17.0.

> /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending 
> tasks with the same key
> ---
>
> Key: AURORA-1879
> URL: https://issues.apache.org/jira/browse/AURORA-1879
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Kai Huang
> Fix For: 0.17.0
>
> Attachments: pending_tasks.png
>
>
> When we have multiple TaskGroups that have the same key but different 
> TaskConfigs, the /pendingTasks endpoint will give a 500 HTTP Error.
> This bug seems to be related to a recent commit ("Added the 'reason' to the 
> /pendingTasks 
> endpoint", https://github.com/apache/aurora/commit/8e07b04bbd4de23b8f492627da4a614d1e517cf1).
>  
> Attached is a screenshot of the /pendingTasks endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1879) /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending tasks with the same key

2017-01-25 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1879:

Fix Version/s: 0.17.0

> /pendingTasks endpoint shows 500 HTTP Error when there are multiple pending 
> tasks with the same key
> ---
>
> Key: AURORA-1879
> URL: https://issues.apache.org/jira/browse/AURORA-1879
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Kai Huang
> Fix For: 0.17.0
>
> Attachments: pending_tasks.png
>
>
> When we have multiple TaskGroups that have the same key but different 
> TaskConfigs, the /pendingTasks endpoint will give a 500 HTTP Error.
> This bug seems to be related to a recent commit ("Added the 'reason' to the 
> /pendingTasks 
> endpoint", https://github.com/apache/aurora/commit/8e07b04bbd4de23b8f492627da4a614d1e517cf1).
>  
> Attached is a screenshot of the /pendingTasks endpoint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed

2017-01-24 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836806#comment-15836806
 ] 

Stephan Erb commented on AURORA-1809:
-

This test fails due to the recent introduction of {{PR_SET_CHILD_SUBREAPER}}. 
Only in the full suite is {{setup_child_subreaping()}} called before the 
above-mentioned test case runs. If we additionally call 
{{setup_child_subreaping()}} from within the test case, it fails every time.
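
For context, a minimal standalone illustration of what child subreaping does (plain ctypes on Linux >= 3.4; this is not the actual thermos helper):

{code}
import ctypes

PR_SET_CHILD_SUBREAPER = 36  # constant from <linux/prctl.h>

# Mark the current process as a child subreaper, so orphaned descendants are
# re-parented to it instead of to PID 1. Once set, the attribute stays with the
# process, which is consistent with the test interference described above.
libc = ctypes.CDLL("libc.so.6", use_errno=True)
if libc.prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0:
    raise OSError(ctypes.get_errno(), "prctl(PR_SET_CHILD_SUBREAPER) failed")
{code}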


> Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
> ---
>
> Key: AURORA-1809
> URL: https://issues.apache.org/jira/browse/AURORA-1809
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
> Fix For: 0.17.0
>
>
> If you run it as part of the full test suite it fails like this:
> {noformat}
>   FAILURES 
>  __ TestRunnerKillProcessGroup.test_pg_is_killed __
>  
>  self =  object at 0x7f0c79893e10>
>  
>  def test_pg_is_killed(self):
>    runner = self.start_runner()
>    tm = TaskMonitor(runner.tempdir, 
> runner.task_id)
>    self.wait_until_running(tm)
>    process_state, run_number = 
> tm.get_active_processes()[0]
>    assert process_state.process == 'process'
>    assert run_number == 0
>  
>    child_pidfile = os.path.join(runner.sandbox, 
> runner.task_id, 'child.txt')
>    while not os.path.exists(child_pidfile):
>  time.sleep(0.1)
>    parent_pidfile = os.path.join(runner.sandbox, 
> runner.task_id, 'parent.txt')
>    while not os.path.exists(parent_pidfile):
>  time.sleep(0.1)
>    with open(child_pidfile) as fp:
>  child_pid = int(fp.read().rstrip())
>    with open(parent_pidfile) as fp:
>  parent_pid = int(fp.read().rstrip())
>  
>    ps = ProcessProviderFactory.get()
>    ps.collect_all()
>    assert parent_pid in ps.pids()
>    assert child_pid in ps.pids()
>    assert child_pid in 
> ps.children_of(parent_pid)
>  
>    with open(os.path.join(runner.sandbox, 
> runner.task_id, 'exit.txt'), 'w') as fp:
>  fp.write('go away!')
>  
>    while tm.task_state() is not 
> TaskState.SUCCESS:
>  time.sleep(0.1)
>  
>    state = tm.get_state()
>    assert state.processes['process'][0].state == 
> ProcessState.SUCCESS
>  
>    ps.collect_all()
>    assert parent_pid not in ps.pids()
>  > assert child_pid not in ps.pids()
>  E assert 30475 not in set([1, 2, 3, 5, 7, 
> 8, ...])
>  E  +  where set([1, 2, 3, 5, 7, 8, ...]) = 
>   at 0x7f0c798b1990>>()
>  E  +where  ProcessProvider_Procfs.pids of 
>  at 0x7f0c798b1990>> = 
>  at 0x7f0c798b1990>.pids
>  
>  
> src/test/python/apache/thermos/core/test_staged_kill.py:287: AssertionError
>  -- Captured stderr call --
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
> 

[jira] [Commented] (AURORA-1781) Sandbox taskfs setup fails (groupadd error)

2017-01-24 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836710#comment-15836710
 ] 

Stephan Erb commented on AURORA-1781:
-

Unfortunately, I have removed the milestone from this one. We don't have a way 
to reproduce this yet so we cannot fix it in time for 0.17.0.

> Sandbox taskfs setup fails (groupadd error)
> ---
>
> Key: AURORA-1781
> URL: https://issues.apache.org/jira/browse/AURORA-1781
> Project: Aurora
>  Issue Type: Bug
>  Components: Docker, Executor
>Affects Versions: 0.16.0
>Reporter: Justin Venus
>
> I hit what smells like a permission issue w/ `/etc/group` when trying to use 
> a docker image (unified containerizer setup) with mesos-1.0.0 and 
> aurora-0.16.0-rc2.  I cannot reproduce the issue w/ mesos-0.28.2 and aurora-0.15.0.
> {code}
> Failed to initialize sandbox: Failed to create group in sandbox for task 
> image: Command '['groupadd', '-R', 
> '/var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs',
>  '-g', '99', 'nobody']' returned non-zero exit status 10
> {code}
> {code}
> [root@mesos-master01of2 taskfs]# pwd
> /var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs
> [root@mesos-master01of2 taskfs]# groupadd -R $PWD -g 99 nobody
> groupadd: cannot lock /etc/group; try again later.
> {code}
> Maybe related to AURORA-1761
> I'm running CoreOS with the mesos-agent (and thermos) inside docker.  Here is 
> the gist of how it's started.
> {code}
> /usr/bin/sh -c "exec /usr/bin/docker run \
> --name=mesos_slave \
> --net=host \
> --pid=host \
> --privileged \
> -v /sys:/sys \
> -v /usr/bin/docker:/usr/bin/docker:ro \
> -v /var/lib/docker:/var/lib/docker \
> -v /var/run/docker.sock:/root/docker.sock \
> -v /run/systemd/system:/run/systemd/system \
> -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
> -v /sys/fs/cgroup:/sys/fs/cgroup \
> -v /var/lib/mesos:/var/lib/mesos \
> -e MESOS_CONTAINERIZERS=docker,mesos \
> -e MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins \
> -e MESOS_WORK_DIR=/var/lib/mesos \
> -e MESOS_LOGGING_LEVEL=INFO \
> -e AMAZON_REGION=us-office-2 \
> -e AVAILABILITY_ZONE=us-office-2b \
> -e MESOS_ATTRIBUTES=\"platform:linux;host:$(hostname);rack:us-office-2b\" 
> \
> -e MESOS_CLUSTER=ZeroZero \
> -e MESOS_DOCKER_SOCKET=/root/docker.sock \
> -e 
> MESOS_MASTER=zk://10.150.150.224:2181,10.150.150.225:2181,10.150.150.226:2181/mesos
>  \
> -e MESOS_LOG_DIR=/var/log/mesos \
> -e 
> MESOS_ISOLATION=\"filesystem/linux,cgroups/cpu,cgroups/mem,docker/runtime\" \
> -e MESOS_IMAGE_PROVIDERS=docker \
> -e MESOS_IMAGE_PROVISIONER_BACKEND=copy \
> -e MESOS_DOCKER_REGISTRY=http://docker-registry:31000 \
> -e MESOS_DOCKER_STORE_DIR=/var/lib/mesos/docker \
> --entrypoint=/usr/sbin/mesos-slave \
> docker-registry.thebrighttag.com:31000/mesos:latest \
> --no-systemd_enable_support \
> || rm -f /var/lib/mesos/meta/slaves/latest"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1781) Sandbox taskfs setup fails (groupadd error)

2017-01-24 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1781:

Fix Version/s: (was: 0.17.0)

> Sandbox taskfs setup fails (groupadd error)
> ---
>
> Key: AURORA-1781
> URL: https://issues.apache.org/jira/browse/AURORA-1781
> Project: Aurora
>  Issue Type: Bug
>  Components: Docker, Executor
>Affects Versions: 0.16.0
>Reporter: Justin Venus
>
> I hit what smells like a permission issue w/ `/etc/group` when trying to use 
> a docker image (unified containerizer setup) with mesos-1.0.0 and 
> aurora-0.16.0-rc2.  I cannot reproduce the issue w/ mesos-0.28.2 and aurora-0.15.0.
> {code}
> Failed to initialize sandbox: Failed to create group in sandbox for task 
> image: Command '['groupadd', '-R', 
> '/var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs',
>  '-g', '99', 'nobody']' returned non-zero exit status 10
> {code}
> {code}
> [root@mesos-master01of2 taskfs]# pwd
> /var/lib/mesos/slaves/5d28d0cc-2793-4471-82d5-e67276c53f70-S2/frameworks/20160221-001235-3801519626-5050-1-/executors/thermos-nobody-prod-jenkins-0-47cc7824-565b-4265-9ab4-9ba3f364ebed/runs/a3f78288-4865-4166-8685-1ad941562f2f/taskfs
> [root@mesos-master01of2 taskfs]# groupadd -R $PWD -g 99 nobody
> groupadd: cannot lock /etc/group; try again later.
> {code}
> Maybe related to AURORA-1761
> I'm running CoreOS with the mesos-agent (and thermos) inside docker.  Here is 
> the gist of how it's started.
> {code}
> /usr/bin/sh -c "exec /usr/bin/docker run \
> --name=mesos_slave \
> --net=host \
> --pid=host \
> --privileged \
> -v /sys:/sys \
> -v /usr/bin/docker:/usr/bin/docker:ro \
> -v /var/lib/docker:/var/lib/docker \
> -v /var/run/docker.sock:/root/docker.sock \
> -v /run/systemd/system:/run/systemd/system \
> -v /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro \
> -v /sys/fs/cgroup:/sys/fs/cgroup \
> -v /var/lib/mesos:/var/lib/mesos \
> -e MESOS_CONTAINERIZERS=docker,mesos \
> -e MESOS_EXECUTOR_REGISTRATION_TIMEOUT=5mins \
> -e MESOS_WORK_DIR=/var/lib/mesos \
> -e MESOS_LOGGING_LEVEL=INFO \
> -e AMAZON_REGION=us-office-2 \
> -e AVAILABILITY_ZONE=us-office-2b \
> -e MESOS_ATTRIBUTES=\"platform:linux;host:$(hostname);rack:us-office-2b\" 
> \
> -e MESOS_CLUSTER=ZeroZero \
> -e MESOS_DOCKER_SOCKET=/root/docker.sock \
> -e 
> MESOS_MASTER=zk://10.150.150.224:2181,10.150.150.225:2181,10.150.150.226:2181/mesos
>  \
> -e MESOS_LOG_DIR=/var/log/mesos \
> -e 
> MESOS_ISOLATION=\"filesystem/linux,cgroups/cpu,cgroups/mem,docker/runtime\" \
> -e MESOS_IMAGE_PROVIDERS=docker \
> -e MESOS_IMAGE_PROVISIONER_BACKEND=copy \
> -e MESOS_DOCKER_REGISTRY=http://docker-registry:31000 \
> -e MESOS_DOCKER_STORE_DIR=/var/lib/mesos/docker \
> --entrypoint=/usr/sbin/mesos-slave \
> docker-registry.thebrighttag.com:31000/mesos:latest \
> --no-systemd_enable_support \
> || rm -f /var/lib/mesos/meta/slaves/latest"
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1107) Add support for mounting task specified external volumes into containers

2017-01-24 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836671#comment-15836671
 ] 

Stephan Erb commented on AURORA-1107:
-

I am removing the milestone as this won't make it in time for 0.17.

> Add support for mounting task specified external volumes into containers
> 
>
> Key: AURORA-1107
> URL: https://issues.apache.org/jira/browse/AURORA-1107
> Project: Aurora
>  Issue Type: Task
>  Components: Docker
>Reporter: Steve Niemitz
>Assignee: Zameer Manji
>Priority: Minor
>
> The Mesos docker API allows specifying volumes on the host to mount into the 
> container when it runs.  We should expose this.  I propose:
>  - Add a volumes() set to the Docker object in base.py
>  - Add a similar set to the DockerContainer struct in api.thrift 
>  - Create a way for administrators to restrict the ability to use this.  
> Because mounts are set up by the docker daemon, they effectively allow 
> someone who can configure mounts to access anything on the machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1107) Add support for mounting task specified external volumes into containers

2017-01-24 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1107:

Fix Version/s: (was: 0.17.0)

> Add support for mounting task specified external volumes into containers
> 
>
> Key: AURORA-1107
> URL: https://issues.apache.org/jira/browse/AURORA-1107
> Project: Aurora
>  Issue Type: Task
>  Components: Docker
>Reporter: Steve Niemitz
>Assignee: Zameer Manji
>Priority: Minor
>
> The Mesos docker API allows specifying volumes on the host to mount into the 
> container when it runs.  We should expose this.  I propose:
>  - Add a volumes() set to the Docker object in base.py
>  - Add a similar set to the DockerContainer struct in api.thrift 
>  - Create a way for administrators to restrict the ability to use this.  
> Because mounts are set up by the docker daemon, they effectively allow 
> someone who can configure mounts to access anything on the machine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (AURORA-1811) sla_list_safe_domain no longer reports SLA usage

2017-01-18 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827991#comment-15827991
 ] 

Stephan Erb edited comment on AURORA-1811 at 1/18/17 12:56 PM:
---

[~a-nldisr] to clarify: Do you have a workaround in place, or do you need a fix 
for this in the next version? If not, we could resolve this when dropping the 
deprecated `production` code in 0.18.


was (Author: stephanerb):
[~a-nldisr] to clarify: Do you have a workaround in place, or do you need a fix 
for this in the next version. If not, we could resolve this when dropping the 
deprecated `production` code in 0.18.

> sla_list_safe_domain no longer reports SLA usage
> 
>
> Key: AURORA-1811
> URL: https://issues.apache.org/jira/browse/AURORA-1811
> Project: Aurora
>  Issue Type: Bug
>  Components: Client, Maintenance, SLA
>Affects Versions: 0.16.0
> Environment: Vagrant image - Ubuntu, Centos 7.2
>Reporter: Rogier Dikkes
>Priority: Minor
>  Labels: client, features, sla
> Fix For: 0.17.0
>
>
> We recently had to patch hosts. In our situation we have a couple of services 
> that run only 2-5 instances with production = true and tier = preferred, 
> as provided in the default example documentation. 
> As we understand it, host_drain is not configurable to set the minimum job 
> instance count; the default is 10. We tried to compile a list of hosts with 
> aurora_admin sla_list_safe_domain that are running these services to feed 
> host_drain with an unsafe_hosts_file. 
> When we ran aurora_admin sla_list_safe_domain --min_job_instance_count=2 
> devcluster 95 1m, the scheduler returned: 
>  INFO] Response from scheduler: OK (message: )
> As if there were no hosts. We tried to change the percentage and duration to 
> see if anything was returned but we never received a different response.
> To ensure that the client is not the cause, we used the 0.16.0 client against 
> a 0.14.0 cluster; this cluster reports hosts that are safe to kill without 
> violating job SLAs. 
> To ensure it's not a faulty cluster setup on our part, we started the vagrant 
> sandbox and launched a task with 3 instances with tier = preferred and 
> production = True.
> Commands used:
> aurora_admin sla_list_safe_domain --min_job_instance_count=2 devcluster 20 50m
> aurora_admin sla_list_safe_domain --min_job_instance_count=2 devcluster 90 5m
> Using -l or varying the time and percentage never changes the outcome.
> Changing the instance_count to a higher number does not change the output either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

