Thanks Bill. I will take you up on that offer. I filed AURORA-1952 <https://issues.apache.org/jira/browse/AURORA-1952>. Do I need some privileges to assign it to myself?
On Fri, Sep 29, 2017 at 10:09 AM, Bill Farner <[email protected]> wrote: > concurrent map > > > I'm looking at this chunk here > <https://github.com/apache/aurora/blob/7a803730c95fc7d1f788292d83c3d2eeb81a936d/src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java#L193-L202>, > where a concurrent map would not help. > > Optional<HostOffer> sameSlave = hostOffers.get(offer.getOffer( > ).getAgentId()); > if (sameSlave.isPresent()) { > // If there are existing offers for the slave, decline all of them > so the master can > // compact all of those offers into a single offer and send them > back. > LOG.info("Returning offers for " + offer.getOffer().getAgentId(). > getValue() > + " for compaction."); > decline(offer.getOffer().getId()); > removeAndDecline(sameSlave.get().getOffer().getId()); > } else { > hostOffers.add(offer); > > This exhibits a classic check-then-act race on hostOffers, which could > allow a second offer with the same agent ID. An obvious fix here would be > to move the "if exists, remove, else add" sequence in a synchronized method > in hostOffers. > > Happy to help guide you on a patch! > > > > On Fri, Sep 29, 2017 at 9:57 AM, Mohit Jaggi <[email protected]> wrote: > >> Will do. If the fix involves making the map of offers by agent id a >> concurrent map...I can contribute that. >> >> On Fri, Sep 29, 2017 at 9:09 AM, Bill Farner <[email protected]> wrote: >> >>> This is due to multiple offers for the same agent, rather than duplicate >>> offers. I don't see a specific bug in the suspect code >>> (OfferManager.java), but it does stand out as subject to races. >>> Specifically, there is a lack of synchronization when checking for an offer >>> exists for a given agent ID and subsequently removing that offer. >>> >>> Can you file a bug? >>> >>> On Thu, Sep 28, 2017 at 1:56 PM, Mohit Jaggi <[email protected]> >>> wrote: >>> >>>> Folks, >>>> >>>> I saw the following crash in my scheduler. It appears to be due to >>>> duplicates offers. Any insights appreciated! >>>> >>>> Mohit. >>>> >>>> *Code:* >>>> >>>> https://github.com/apache/aurora/blob/master/src/main/java/o >>>> rg/apache/aurora/scheduler/preemptor/PendingTaskProcessor.java#L145 >>>> >>>> *Logs:* >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: Sep 28, 2017 >>>> 6:09:00 PM com.google.common.util.concurrent.ServiceManager$ServiceListener >>>> failed >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: SEVERE: Service >>>> PreemptorService [FAILED] has failed in the RUNNING state. >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: >>>> java.lang.IllegalArgumentException: Multiple entries with same key: >>>> 1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102=HostOffer{offer=id { >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: value: >>>> "f7b84805-a0c5-4405-be77-f7f1b7110405-O56597202" >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: } >>>> >>>> >>>> ... >>>> >>>> ... >>>> >>>> >>>> ep 28 18:09:00 machine1163 aurora-scheduler[14266]: , >>>> hostAttributes=IHostAttributes{host=compute606-dca1.prod.uber.internal, >>>> attributes=[IAttribute{name=host, values=[compute606-dca1]}, >>>> IAttribute{name=rack, values=[as13]}, IAttribute{name=pod, values=[d]}, >>>> IAttribute{name=dedicated, values=[infra/cassandra]}], mode=NONE, >>>> slaveId=1ed038e0-a3ef-4476-adfd-70c86241c5f7-S102}}. To index multiple >>>> values under a key, use Multimaps.index. >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> com.google.common.collect.Maps.uniqueIndex(Maps.java:1251) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> com.google.common.collect.Maps.uniqueIndex(Maps.java:1208) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.l >>>> ambda$run$0(PendingTaskProcessor.java:146) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> org.apache.aurora.scheduler.storage.db.DbStorage.read(DbStor >>>> age.java:147) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> org.mybatis.guice.transactional.TransactionalMethodIntercept >>>> or.invoke(TransactionalMethodInterceptor.java:101) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> org.apache.aurora.common.inject.TimedInterceptor.invoke(Time >>>> dInterceptor.java:83) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> org.apache.aurora.scheduler.storage.log.LogStorage.read(LogS >>>> torage.java:562) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> org.apache.aurora.scheduler.storage.CallOrderEnforcingStorag >>>> e.read(CallOrderEnforcingStorage.java:113) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> org.apache.aurora.scheduler.preemptor.PendingTaskProcessor.r >>>> un(PendingTaskProcessor.java:135) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> org.apache.aurora.common.inject.TimedInterceptor.invoke(Time >>>> dInterceptor.java:83) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> org.apache.aurora.scheduler.preemptor.PreemptorModule$Preemp >>>> torService.runOneIteration(PreemptorModule.java:161) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> com.google.common.util.concurrent.AbstractScheduledService$S >>>> erviceDelegate$Task.run(AbstractScheduledService.java:188) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> com.google.common.util.concurrent.Callables$4.run(Callables.java:122) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu >>>> tureTask.access$301(ScheduledThreadPoolExecutor.java:180) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFu >>>> tureTask.run(ScheduledThreadPoolExecutor.java:294) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPool >>>> Executor.java:1142) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoo >>>> lExecutor.java:617) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: at >>>> java.lang.Thread.run(Thread.java:748) >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: E0928 18:09:00.316 >>>> [PreemptorService RUNNING, GuavaUtils$LifecycleShutdownListener:55] >>>> Service: PreemptorService [FAILED] failed unexpectedly. Triggering >>>> shutdown. >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316 >>>> [qtp1000734462-3068369, Slf4jRequestLog:60] 10.187.28.19 - - >>>> [28/Sep/2017:18:09:00 +0000] "POST //10.188.43.6:8082/api HTTP/1.1" >>>> 200 95 >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316 >>>> [PreemptorService RUNNING, Lifecycle:84] Shutting down application >>>> >>>> >>>> >>>> Sep 28 18:09:00 machine1163 aurora-scheduler[14266]: I0928 18:09:00.316 >>>> [PreemptorService RUNNING, ShutdownRegistry$ShutdownRegistryImpl:77] >>>> Executing 4 shutdown commands. >>>> >>>> >>>> >>>> >>>> >>> >> >
