[jira] [Created] (MESOS-10159) Running unit test command hangs
Jinesh Patel created MESOS-10159:

Summary: Running unit test command hangs
Key: MESOS-10159
URL: https://issues.apache.org/jira/browse/MESOS-10159
Project: Mesos
Issue Type: Bug
Components: test
Environment: OS: Ubuntu 20.04 Arch: Intel
Reporter: Jinesh Patel

Running the `make check` command to execute mesos test cases hangs after printing failed test results. The process doesn't hang if all test cases pass.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10158) Mesos Agent gets stuck in Draining due to pending unacknowledged status updates
Andrei Budnik created MESOS-10158:

Summary: Mesos Agent gets stuck in Draining due to pending unacknowledged status updates
Key: MESOS-10158
URL: https://issues.apache.org/jira/browse/MESOS-10158
Project: Mesos
Issue Type: Bug
Components: master
Reporter: Andrei Budnik

A Mesos agent can get stuck in the DRAINING state because of pending unacknowledged status updates. When a framework becomes disconnected, the agent keeps sending task status updates for that framework's terminated tasks, which the disconnected framework cannot acknowledge. Because the master transitions the agent from DRAINING to DRAINED only after all task status updates have been acknowledged, the agent never leaves the DRAINING state.

This problem can be resolved by sending the ["Teardown" operation|https://github.com/apache/mesos/blob/8ce5d30808f3744eeded09d530f226079d569a94/include/mesos/v1/master/master.proto#L299-L303] for all lost frameworks. However, it would be much better if the master could handle this situation automatically. At the very least, we should make it easier for an operator to find out what prevents the draining operation from completing.
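For operators applying the Teardown workaround described above, the request bodies could be prepared along these lines. This is a minimal sketch, assuming the call is sent as JSON to the master's v1 operator endpoint (`/api/v1`); the helper names and the framework ID are hypothetical placeholders, not taken from the report.

```python
import json


def build_teardown_call(framework_id: str) -> dict:
    """Build the body of a v1 master TEARDOWN call for one framework."""
    if not framework_id:
        raise ValueError("framework_id must be a non-empty string")
    return {
        "type": "TEARDOWN",
        "teardown": {"framework_id": {"value": framework_id}},
    }


def build_teardown_calls(framework_ids):
    """Return one serialized TEARDOWN body per lost framework."""
    return [
        json.dumps(build_teardown_call(fid), sort_keys=True)
        for fid in framework_ids
    ]


if __name__ == "__main__":
    # Hypothetical framework ID; substitute the IDs of the lost frameworks.
    for body in build_teardown_calls(["example-framework-0001"]):
        # Each body would be POSTed to http://<master>:5050/api/v1
        # with the header "Content-Type: application/json".
        print(body)
```

The sketch only builds the payloads, so the IDs can be reviewed before anything is sent; tearing down a framework is not reversible.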
[jira] [Commented] (MESOS-10140) CMake Error: Problem with archive_read_open_file(): Unrecognized archive format
[ https://issues.apache.org/jira/browse/MESOS-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152815#comment-17152815 ]

Greg Mann commented on MESOS-10140:

[~QuellaZhang] could you try building again on the latest master branch of Mesos? We believe the issue should be fixed now. If so, please close out this ticket; otherwise, let us know. Thanks!

> CMake Error: Problem with archive_read_open_file(): Unrecognized archive format
>
> Key: MESOS-10140
> URL: https://issues.apache.org/jira/browse/MESOS-10140
> Project: Mesos
> Issue Type: Bug
> Components: build
> Reporter: QuellaZhang
> Priority: Major
> Labels: windows
> Attachments: mesos_build.log
>
> Hi All,
> We tried to build Mesos on Windows with VS2019 (MSVC). The build failed with "CUSTOMBUILD : CMake error : Problem with archive_read_open_file(): Unrecognized archive format [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]". It can be reproduced on the latest revision, d4634f4, on the master branch. Could you help confirm? We use cmake version 3.17.2.
>
> Reproduce steps:
> 1. git clone -c core.autocrlf=true [https://github.com/apache/mesos] F:\gitP\apache\mesos
> 2. Open a VS 2019 x64 command prompt as admin and browse to F:\gitP\apache\mesos
> 3. mkdir build_amd64 && pushd build_amd64
> 4. cmake -G "Visual Studio 16 2019" -A x64 -DCMAKE_SYSTEM_VERSION=10.0.18362.0 -DENABLE_LIBEVENT=1 -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="F:\tools\gnuwin32\bin" -T host=x64 ..
> 5. set _CL_=/D_SILENCE_TR1_NAMESPACE_DEPRECATION_WARNING
> 6. msbuild /maxcpucount:4 /p:Platform=x64 /p:Configuration=Debug Mesos.sln /t:Rebuild
>
> ErrorMessage:
> *manual run:*
> F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP\src>cmake --version
> cmake version 3.17.2
> CMake suite maintained and supported by Kitware (kitware.com/cmake).
> F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP\src>cmake -E tar xjf archive.tar
> CMake Error: Problem with archive_read_open_file(): Unrecognized archive format
> CMake Error: Problem extracting tar: archive.tar
> *build log: (see attachment)*
> 59>CUSTOMBUILD : CMake error : Problem with archive_read_open_file(): Unrecognized archive format [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]
> 59>CUSTOMBUILD : CMake error : Problem extracting tar: F:/gitP/apache/mesos/build_amd64/3rdparty/wclayer-WIP/src/archive.tar [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj]
> -- extracting... [error clean up]
> CMake Error at wclayer-WIP-stamp/extract-wclayer-WIP.cmake:33 (message):
> 59>CUSTOMBUILD : error : extract of [F:\gitP\apache\mesos\build_amd64\3rdparty\wclayer-WIP.vcxproj] 'F:/gitP/apache/mesos/build_amd64/3rdparty/wclayer-WIP/src/archive.tar' failed
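Since `cmake -E tar` rejects `archive.tar` as an unrecognized format, one way to narrow the problem down is to look at the file's leading bytes. The following is a minimal sketch, assuming the file can be read locally; the function name `sniff_archive_format` is illustrative, and the signatures are the standard magic numbers of each format, nothing Mesos-specific.

```python
import sys

# Well-known leading signatures of common archive/compression formats.
MAGIC_PREFIXES = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"PK\x03\x04", "zip"),
]


def sniff_archive_format(data: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    for magic, name in MAGIC_PREFIXES:
        if data.startswith(magic):
            return name
    # POSIX tar stores "ustar" at byte offset 257 of the first header block.
    if data[257:262] == b"ustar":
        return "tar"
    return "unknown"


if __name__ == "__main__":
    # Usage: python sniff.py <path-to-archive>
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as f:
            print(sniff_archive_format(f.read(512)))
```

An `unknown` result would suggest that the downloaded file is not an archive at all (for example a truncated download or an error page); that is only one possible explanation for the failure reported here.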
[jira] [Commented] (MESOS-10143) Outstanding Offers accumulating
[ https://issues.apache.org/jira/browse/MESOS-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152811#comment-17152811 ]

Greg Mann commented on MESOS-10143:

[~puneetku287] it's unclear to me from the description whether this is an issue in Mesos or in your scheduler. A more precise description of the framework's behavior during the incidents would help: what does the scheduler do with the offers during this time? Feel free to find us on Mesos Slack; that might be an easier place to have a synchronous discussion about your issue.

> Outstanding Offers accumulating
>
> Key: MESOS-10143
> URL: https://issues.apache.org/jira/browse/MESOS-10143
> Project: Mesos
> Issue Type: Bug
> Components: master, scheduler driver
> Affects Versions: 1.7.0
> Environment: Mesos Version 1.7.0, JDK 8.0
> Reporter: Puneet Kumar
> Priority: Minor
>
> We manage an Apache Mesos cluster, version 1.7.0. We have written a framework in Java that schedules tasks to the Mesos master at a rate of 300 TPS. Everything works fine for almost 24 hours, but then outstanding offers accumulate and saturate within 15 minutes. The outstanding offers aren't reclaimed by the Mesos master. We observe "RescindOffer" messages in the verbose (GLOG v=3) framework logs, but the number of outstanding offers doesn't decrease. New resources aren't offered to the framework once outstanding offers saturate. We have to restart the scheduler to reset outstanding offers to zero.
> Any suggestions to debug this issue are welcome.
[jira] [Commented] (MESOS-10146) Removing task from slave when framework is disconnected causes master to crash
[ https://issues.apache.org/jira/browse/MESOS-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152809#comment-17152809 ]

Greg Mann commented on MESOS-10146:

[~sunshine123] thank you for the bug report! Would it be possible to get a full verbose master log from an incident? The logs surrounding the check failure may help us pinpoint the issue more precisely.

> Removing task from slave when framework is disconnected causes master to crash
>
> Key: MESOS-10146
> URL: https://issues.apache.org/jira/browse/MESOS-10146
> Project: Mesos
> Issue Type: Bug
> Components: c++ api, framework
> Affects Versions: 1.9.0
> Environment: Mesos master with three master nodes
> Reporter: Naveen
> Priority: Major
>
> Hello,
> we want to report an issue we observed when removing tasks from a slave.
> There is a condition that checks for a valid framework before tasks can be removed. A framework can be disconnected for several reasons. In that case the check fails and crashes the Mesos master node.
> [https://github.com/apache/mesos/blob/1.9.0/src/master/master.cpp#L11842]
> There is also unguarded access to the internal framework state on line 11853.
> Error logs -
> {noformat}
> mesos-master[5483]: I0618 14:05:20.859189 5491 master.cpp:9512] Marked agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 (10.160.73.79) unreachable: health check timed out
> mesos-master[5483]: F0618 14:05:20.859347 5491 master.cpp:11842] Check failed: framework != nullptr Framework 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067 not found while removing agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 at slave(1)@10.160.73.79:5051 (10.160.73.79); agent tasks: { 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067: { } }
> mesos-master[5483]: *** Check failure stack trace: ***
> mesos-master[5483]: I0618 14:05:20.859781 5490 hierarchical.cpp:1013] Removed all filters for agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
> mesos-master[5483]: I0618 14:05:20.872217 5490 hierarchical.cpp:890] Removed agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
> mesos-master[5483]: I0618 14:05:20.859922 5487 replica.cpp:695] Replica received learned notice for position 42070 from log-network(1)@10.160.73.212:5050
> mesos-master[5483]: @ 0x7f2fdf6a5b1d google::LogMessage::Fail()
> mesos-master[5483]: @ 0x7f2fdf6a7dfd google::LogMessage::SendToLog()
> mesos-master[5483]: @ 0x7f2fdf6a56ab google::LogMessage::Flush()
> mesos-master[5483]: @ 0x7f2fdf6a8859 google::LogMessageFatal::~LogMessageFatal()
> mesos-master[5483]: @ 0x7f2fde2677f2 mesos::internal::master::Master::__removeSlave()
> mesos-master[5483]: @ 0x7f2fde267ebe mesos::internal::master::Master::_markUnreachable()
> mesos-master[5483]: @ 0x7f2fde268215 _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKSsEUlbE_JbclEv
> mesos-master[5483]: @ 0x7f2fddf30688 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEclEOS3_
> mesos-master[5483]: @ 0x7f2fdf5e3b91 process::ProcessBase::consume()
> mesos-master[5483]: @ 0x7f2fdf608f77 process::ProcessManager::resume()
> mesos-master[5483]: @ 0x7f2fdf60cb36 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> mesos-master[5483]: @ 0x7f2fdf8c34d0 execute_native_thread_routine
> mesos-master[5483]: @ 0x7f2fdba02ea5 start_thread
> mesos-master[5483]: @ 0x7f2fdb20e8dd __clone
> systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT
> systemd[1]: Unit mesos-master.service entered failed state.
> systemd[1]: mesos-master.service failed.
> systemd[1]: mesos-master.service holdoff time over, scheduling restart.
> systemd[1]: Stopped Mesos Master.
> systemd[1]: Started Mesos Master.
> mesos-master[28757]: I0618 14:05:41.461403 28748 logging.cpp:201] INFO level logging started!
> mesos-master[28757]: I0618 14:05:41.461712 28748 main.cpp:243] Build: 2020-05-09 10:42:00 by centos
> mesos-master[28757]: I0618 14:05:41.461721 28748 main.cpp:244] Version: 1.9.0
> mesos-master[28757]: I0618 14:05:41.461726 28748 main.cpp:247] Git tag: 1.9.0
> mesos-master[28757]: I0618 14:05:41.461730 28748 main.cpp:251] Git SHA: 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e
> {noformat}
[jira] [Created] (MESOS-10157) Add document for the `volume/csi` isolator
Qian Zhang created MESOS-10157:

Summary: Add document for the `volume/csi` isolator
Key: MESOS-10157
URL: https://issues.apache.org/jira/browse/MESOS-10157
Project: Mesos
Issue Type: Task
Reporter: Qian Zhang
[jira] [Created] (MESOS-10156) Enable the `volume/csi` isolator in UCR
Qian Zhang created MESOS-10156:

Summary: Enable the `volume/csi` isolator in UCR
Key: MESOS-10156
URL: https://issues.apache.org/jira/browse/MESOS-10156
Project: Mesos
Issue Type: Task
Reporter: Qian Zhang
[jira] [Created] (MESOS-10155) Implement the `recover` method of the `volume/csi` isolator
Qian Zhang created MESOS-10155:

Summary: Implement the `recover` method of the `volume/csi` isolator
Key: MESOS-10155
URL: https://issues.apache.org/jira/browse/MESOS-10155
Project: Mesos
Issue Type: Task
Reporter: Qian Zhang
[jira] [Created] (MESOS-10154) Implement the `cleanup` method of the `volume/csi` isolator
Qian Zhang created MESOS-10154:

Summary: Implement the `cleanup` method of the `volume/csi` isolator
Key: MESOS-10154
URL: https://issues.apache.org/jira/browse/MESOS-10154
Project: Mesos
Issue Type: Task
Reporter: Qian Zhang
[jira] [Created] (MESOS-10153) Implement the `prepare` method of the `volume/csi` isolator
Qian Zhang created MESOS-10153:

Summary: Implement the `prepare` method of the `volume/csi` isolator
Key: MESOS-10153
URL: https://issues.apache.org/jira/browse/MESOS-10153
Project: Mesos
Issue Type: Task
Reporter: Qian Zhang
[jira] [Created] (MESOS-10152) Implement the `create` method of the `volume/csi` isolator
Qian Zhang created MESOS-10152:

Summary: Implement the `create` method of the `volume/csi` isolator
Key: MESOS-10152
URL: https://issues.apache.org/jira/browse/MESOS-10152
Project: Mesos
Issue Type: Task
Reporter: Qian Zhang
[jira] [Commented] (MESOS-10151) Introduce a new agent flag `--csi_plugin_config_dir`
[ https://issues.apache.org/jira/browse/MESOS-10151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152554#comment-17152554 ]

Qian Zhang commented on MESOS-10151:

See [here|https://docs.google.com/document/d/1NfWLS2OdiYjXZa2dpd_DOWOK4eou-SedY396Jl68s9Y/edit#heading=h.iobmmefa9bop] for the detailed design.

> Introduce a new agent flag `--csi_plugin_config_dir`
>
> Key: MESOS-10151
> URL: https://issues.apache.org/jira/browse/MESOS-10151
> Project: Mesos
> Issue Type: Task
> Reporter: Qian Zhang
> Priority: Major
[jira] [Created] (MESOS-10151) Implement the `create` method of the `volume/csi` isolator
Qian Zhang created MESOS-10151:

Summary: Implement the `create` method of the `volume/csi` isolator
Key: MESOS-10151
URL: https://issues.apache.org/jira/browse/MESOS-10151
Project: Mesos
Issue Type: Task
Reporter: Qian Zhang
[jira] [Created] (MESOS-10150) Refactor CSI volume manager to support pre-provisioned CSI volumes
Qian Zhang created MESOS-10150:

Summary: Refactor CSI volume manager to support pre-provisioned CSI volumes
Key: MESOS-10150
URL: https://issues.apache.org/jira/browse/MESOS-10150
Project: Mesos
Issue Type: Task
Reporter: Qian Zhang

The existing [VolumeManager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L138] is essentially a wrapper around the various CSI gRPC calls, so we could consider using it to call CSI plugins rather than making raw CSI gRPC calls in the `volume/csi` isolator. There is a problem, though: the lifecycle of the volumes managed by VolumeManager starts with the `[createVolume|https://github.com/apache/mesos/blob/1.10.0/src/csi/v1_volume_manager.cpp#L281:L329]` CSI call, whereas what we plan to support in the MVP is pre-provisioned volumes. We therefore need to refactor VolumeManager so that it supports pre-provisioned volumes.
[jira] [Created] (MESOS-10149) Refactor CSI service manager to support unmanaged CSI plugins
Qian Zhang created MESOS-10149:

Summary: Refactor CSI service manager to support unmanaged CSI plugins
Key: MESOS-10149
URL: https://issues.apache.org/jira/browse/MESOS-10149
Project: Mesos
Issue Type: Task
Reporter: Qian Zhang

Refactor the [CSI service manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L81] so that it supports unmanaged plugins (i.e. plugins deployed outside of Mesos) and so that its `getServiceEndpoint` method can also return the endpoints of unmanaged plugins.