[jira] [Commented] (MESOS-9080) Port mapping isolator leaks ephemeral ports when a container is destroyed during preparation

2019-01-24 Thread Ilya Pronin (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751362#comment-16751362
 ] 

Ilya Pronin commented on MESOS-9080:


[67936|https://reviews.apache.org/r/67936/] should fix the problem. We can
close this.

> Port mapping isolator leaks ephemeral ports when a container is destroyed 
> during preparation
> 
>
> Key: MESOS-9080
> URL: https://issues.apache.org/jira/browse/MESOS-9080
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.6.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Major
>
> The {{network/port_mapping}} isolator leaks ephemeral ports during container 
> cleanup if {{Isolator::isolate()}} was not called, i.e. when the container is 
> being destroyed during preparation. If the isolator doesn't know the main 
> container's PID, it skips both the filter cleanup (the filters should not 
> exist in this case) and the ephemeral ports deallocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9476) XFS project IDs aren't released upon task completion

2018-12-13 Thread Ilya Pronin (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720643#comment-16720643
 ] 

Ilya Pronin commented on MESOS-9476:


This change was introduced in 1.7. The isolator now periodically (every 
{{\-\-disk_watch_interval}}) checks which container sandboxes and persistent 
volumes have been removed (e.g. by disk GC) and reclaims their project IDs. The 
main reason for doing so is that project IDs cannot be removed from symlinks, 
which can lead to inaccurate accounting. Also, isolators currently don't get 
notified when a persistent volume is removed, so {{disk/xfs}} can only do 
periodic scans to reclaim volume project IDs. See MESOS-5158 and MESOS-9007 for 
more information.

XFS project IDs are 16- or 32-bit integers, so there should usually be plenty 
of them available. Can you give your Mesos agents a larger ID range?
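
For reference, here is a minimal sketch of such a periodic reclamation pass. 
The member names ({{scheduledProjects}}, {{returnProjectId()}}) are 
illustrative assumptions, not the actual {{disk/xfs}} isolator code:
{noformat}
// Sketch only: every disk_watch_interval, return project IDs whose backing
// sandbox or persistent volume directory no longer exists on disk.
void XfsDiskIsolatorProcess::reclaimProjectIds()
{
  hashmap<prid_t, std::string> remaining;

  foreachpair (prid_t projectId, const std::string& dir, scheduledProjects) {
    if (os::exists(dir)) {
      remaining.put(projectId, dir);  // Directory still there; keep tracking.
    } else {
      returnProjectId(projectId);     // Make the ID available for reuse.
    }
  }

  scheduledProjects = remaining;

  // Schedule the next scan.
  delay(flags.disk_watch_interval,
        self(),
        &XfsDiskIsolatorProcess::reclaimProjectIds);
}
{noformat}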

> XFS project IDs aren't released upon task completion
> 
>
> Key: MESOS-9476
> URL: https://issues.apache.org/jira/browse/MESOS-9476
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.7.0
> Environment: Centos 7.1
> Mesos 1.7
>Reporter: Omar AitMous
>Priority: Major
> Attachments: Vagrantfile, build.sh
>
>
> The XFS isolation doesn't release project IDs when a task finishes on Mesos 
> 1.7 (branch 1.7.x), and once all project IDs are taken, scheduling new tasks 
> fails with:
> {code:java}
> Failed to assign project ID, range exhausted
> {code}
>  
> Attached is a vagrant configuration that sets up a VM with an XFS disk 
> (mounted on /var/opt/mesos), zookeeper 3.4.12, mesos 1.7 and marathon 1.6.
> Once the box is ready, start zookeeper, mesos-master, mesos-agent (using the 
> XFS disk) and marathon:
> {code:java}
> sudo bin/zkServer.sh start
> sudo /home/vagrant/mesos/build/bin/mesos-master.sh --ip=192.168.33.10 
> --work_dir=/mnt/mesos
> sudo /home/vagrant/mesos/build/bin/mesos-agent.sh --master=192.168.33.10:5050 
> --work_dir=/var/opt/mesos --enforce_container_disk_quota --isolation=disk/xfs 
> --xfs_project_range=[5000-5009]
> sudo 
> MESOS_NATIVE_JAVA_LIBRARY="/home/vagrant/mesos/build/src/.libs/libmesos.so" 
> sbt 'run --master 192.168.33.10:5050 --zk zk://localhost:2181/marathon'
> {code}
>  
> Create an app on marathon, for example:
> {code:java}
> {"id": "/test", "cmd": "sleep 3600", "cpus": 0.01, "mem": 32, "disk": 1, 
> "instances": 5}  
> {code}
>  
> You should see 5 project IDs being used:
> {code:java}
> $ sudo xfs_quota -x -c "report -a -n -L 5000 -U 5009" | grep '^#[1-9][0-9]*'
> #5000 4 1024 1024 00 []
> #5001 4 1024 1024 00 []
> #5002 4 1024 1024 00 []
> #5003 4 1024 1024 00 []
> #5004 4 1024 1024 00 []
> {code}
>  
> If you scale down to 0 instances, the project IDs aren't released.
> If you scale back up to 8 instances, only 5 of them will start, the remaining 
> 3 will fail with errors like this:
> {code:java}
> E1213 14:38:36.190430 20813 slave.cpp:6204] Container 
> '064b8a6b-c42d-4905-b2a7-632318aa2b83' for executor 
> 'test.c5e88a67-fee4-11e8-9cc6-0800278a1a98' of framework 
> 0473e272-04f7-4b1d-ae1d-f7177940e295- failed to start: Failed to assign 
> project ID, range exhausted
> {code}
>  
> I've tested on Mesos 1.4: the project IDs are properly released when the task 
> finishes.
> (I haven't tested other versions)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9451) Libprocess endpoints can ignore required gzip compression

2018-12-04 Thread Ilya Pronin (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709355#comment-16709355
 ] 

Ilya Pronin edited comment on MESOS-9451 at 12/4/18 10:32 PM:
--

Per [RFC 7231|https://tools.ietf.org/html/rfc7231#section-5.3.4], the 
{{Accept-Encoding}} header field is an advertisement that a particular encoding 
is supported by the requester. The server may still use the {{identity}} 
encoding (no encoding) unless the client forbids it with {{identity;q=0}}.

I think it's OK for libprocess to continue applying the body length threshold 
as long as it checks that the weight of {{identity}} is not 0.
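
A minimal sketch of that check, assuming the existing 
{{request.acceptsEncoding()}} helper honours q-values (i.e. returns {{false}} 
for "identity" when the client sent {{identity;q=0}}):
{noformat}
// Sketch only: gzip when the body is large enough, or when the client has
// explicitly forbidden the identity encoding and therefore must get gzip.
bool gzip =
  response.type == http::Response::BODY &&
  !headers.contains("Content-Encoding") &&
  request.acceptsEncoding("gzip") &&
  (response.body.length() >= GZIP_MINIMUM_BODY_LENGTH ||
   !request.acceptsEncoding("identity"));
{noformat}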


was (Author: ipronin):
Per [RFC 7231|https://tools.ietf.org/html/rfc7231#section-5.3.4] 
{{Accept-Encoding}} header field is an advertisement that a particular encoding 
is supported by the requestor. The server may still use {{identity}} encoding 
(no encoding) unless the client forbids it with {{identity;q=0}}.

I think it's OK for libprocess to continue to apply body length threshold as 
long as it checks that {{identity}}'s weight is not 0.

> Libprocess endpoints can ignore required gzip compression
> -
>
> Key: MESOS-9451
> URL: https://issues.apache.org/jira/browse/MESOS-9451
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: libprocess
>
> Currently, libprocess decides whether a response should be compressed by the 
> following conditional:
> {noformat}
> if (response.type == http::Response::BODY &&
> response.body.length() >= GZIP_MINIMUM_BODY_LENGTH &&
> !headers.contains("Content-Encoding") &&
> request.acceptsEncoding("gzip")) {
>   [...]
> {noformat}
> However, this implies that a request sent with the header "Accept-Encoding: 
> gzip" can not rely on actually getting a gzipped response, e.g. when the 
> response size is below the threshold:
> {noformat}
> $ nc localhost 5050
> GET /tasks HTTP/1.1
> Accept-Encoding: gzip
> HTTP/1.1 200 OK
> Date: Tue, 04 Dec 2018 12:49:56 GMT
> Content-Type: application/json
> Content-Length: 12
> {"tasks":[]}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9451) Libprocess endpoints can ignore required gzip compression

2018-12-04 Thread Ilya Pronin (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709355#comment-16709355
 ] 

Ilya Pronin commented on MESOS-9451:


Per [RFC 7231|https://tools.ietf.org/html/rfc7231#section-5.3.4], the 
{{Accept-Encoding}} header field is an advertisement that a particular encoding 
is supported by the requester. The server may still use the {{identity}} 
encoding (no encoding) unless the client forbids it with {{identity;q=0}}.

I think it's OK for libprocess to continue applying the body length threshold 
as long as it checks that the weight of {{identity}} is not 0.

> Libprocess endpoints can ignore required gzip compression
> -
>
> Key: MESOS-9451
> URL: https://issues.apache.org/jira/browse/MESOS-9451
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Priority: Major
>  Labels: libprocess
>
> Currently, libprocess decides whether a response should be compressed by the 
> following conditional:
> {noformat}
> if (response.type == http::Response::BODY &&
> response.body.length() >= GZIP_MINIMUM_BODY_LENGTH &&
> !headers.contains("Content-Encoding") &&
> request.acceptsEncoding("gzip")) {
>   [...]
> {noformat}
> However, this implies that a request sent with the header "Accept-Encoding: 
> gzip" can not rely on actually getting a gzipped response, e.g. when the 
> response size is below the threshold:
> {noformat}
> $ nc localhost 5050
> GET /tasks HTTP/1.1
> Accept-Encoding: gzip
> HTTP/1.1 200 OK
> Date: Tue, 04 Dec 2018 12:49:56 GMT
> Content-Type: application/json
> Content-Length: 12
> {"tasks":[]}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9382) mesos-gtest-runner doesn't work on systems without ulimit binary

2018-11-09 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-9382:
--

 Summary: mesos-gtest-runner doesn't work on systems without ulimit 
binary
 Key: MESOS-9382
 URL: https://issues.apache.org/jira/browse/MESOS-9382
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Ilya Pronin


{{mesos-gtest-runner.py}} fails on systems without a separate ulimit binary 
(e.g. CentOS 7).
{noformat}
/home/ipronin/mesos/build/../support/mesos-gtest-runner.py --sequential=*ROOT_* 
./mesos-tests
Could not check compatibility of ulimit settings: [Errno 2] No such file or 
directory: 'ulimit'
{noformat}
The problem arises in [this 
call|https://github.com/apache/mesos/blob/630d8938462381e8e7b0f44fa6434e47460fb178/support/mesos-gtest-runner.py#L209].
It seems that it can be fixed by passing {{shell=True}} to 
{{subprocess.check_output()}}.

Another problem is {{ROOT_*}} tests, which should be run as root. For root, 
{{ulimit -u}} will most likely return "unlimited", which will again crash the 
runner.
{noformat}
Could not check compatibility of ulimit settings: invalid literal for int() 
with base 10: b'unlimited\n'
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9118) Add port mapping isolator and network ports isolator support to CMake

2018-07-27 Thread Ilya Pronin (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560280#comment-16560280
 ] 

Ilya Pronin commented on MESOS-9118:


Duplicate of MESOS-8993? I can take this one.

> Add port mapping isolator and network ports isolator support to CMake
> -
>
> Key: MESOS-9118
> URL: https://issues.apache.org/jira/browse/MESOS-9118
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrew Schwartzmeyer
>Priority: Major
>
> These fall under the same issue because they are very similar, and both 
> require that {{libnl-3}} be checked for as a third-party dependency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-184) Log has a space leak

2018-07-27 Thread Ilya Pronin (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560273#comment-16560273
 ] 

Ilya Pronin commented on MESOS-184:
---

Discussion at the dev@ mailing list: 
https://lists.apache.org/thread.html/a0a58e42cbb8dd92dcebbedaa2556e8460a005d4a7cdb2ce1205d04a@%3Cdev.mesos.apache.org%3E

Review requests:
https://reviews.apache.org/r/68089/
https://reviews.apache.org/r/68090/
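
A minimal sketch of the {{CompactRange}}-after-truncate approach described in 
the issue below, assuming a {{leveldb::DB*}} handle and an already-encoded key 
for the truncated-to position (names are illustrative, not the actual patch):
{noformat}
// Sketch only: force a compaction of every key up to the truncation point
// so the space occupied by deleted log positions is actually reclaimed.
void compactAfterTruncate(leveldb::DB* db, const std::string& truncateToKey)
{
  leveldb::Slice end(truncateToKey);

  // A null "begin" means "from the start of the key space".
  db->CompactRange(nullptr, &end);
}
{noformat}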

> Log has a space leak
> 
>
> Key: MESOS-184
> URL: https://issues.apache.org/jira/browse/MESOS-184
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, replicated log
>Affects Versions: 0.9.0, 0.14.0, 0.14.1, 0.14.2, 0.15.0, 0.16.0, 0.17.0, 
> 0.18.0, 0.18.1, 0.18.2, 0.19.0
>Reporter: John Sirois
>Assignee: Ilya Pronin
>Priority: Minor
>  Labels: twitter
>
> In short, the access pattern of the Log of the underlying LevelDB storage is 
> such that background compactions are ineffective and a long running Log will 
> have a space leak on disk even in the presence of otherwise apparently 
> sufficient Log::Writer::truncate calls.
> It seems the right thing to do is to issue a DB::CompactRange(NULL, 
> Slice(truncateToKey)) after a replica learns a Action::TRUNCATE Record.  The 
> cost here is a synchronous compaction stall on every truncate so maybe this 
> should be a configuration option or even an explicit api.
> ===
> Snip of email explanation:
> I spent some time understanding what was going on here and our use pattern of 
> leveldb does in fact defeat the background compaction algorithm.
> The docs are here: http://leveldb.googlecode.com/svn/trunk/doc/impl.html in 
> the 'Compactions' section, but in short the gist is compaction operates on an 
> uncompacted file from a level (1 file) + all files overlapping its key range 
> in the next level.  Since we write sequential keys with no randomness at all, 
> by definition the only overlap we ever can get is in level 0 which is the 
> only level that leveldb allows for overlap in sstables in the 1st place.
> That leaves the question of why no compaction on open.  Looking there: 
> http://code.google.com/p/leveldb/source/browse/db/db_impl.cc#1376
> I see a call to MaybeScheduleCompaction, but following that trail, that just 
> leads to 
> http://code.google.com/p/leveldb/source/browse/db/version_set.cc?spec=svnbc1ee4d25e09b04e074db330a41f54ef4af0e31b=36a5f8ed7f9fb3373236d5eace4f5fea369856ee#1156
>  which implements the compaction strategy I tried to summarize above, and 
> thus background compactions for our case are limited to level0 -> level 1 
> compactions and level 1 and higher never compact automatically.
> This seems borne out by the LOG files.  For example, from smf1-prod - restarts 
> after your manual compaction fix in bold:
> [jsirois@smf1-ajb-35-sr1 ~]$ grep Compacting 
> /var/lib/mesos/scheduler_db/mesos_log/LOG.old 
> 2012/04/13-00:24:20.356673 44c1e940 Compacting 3@0 + 4@1 files
> 2012/04/13-00:24:20.490113 44c1e940 Compacting 5@1 + 281@2 files
> 2012/04/13-00:24:25.824995 44c1e940 Compacting 1@1 + 0@2 files
> 2012/04/13-00:24:26.008857 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.196877 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.312465 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.429817 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.533483 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.631044 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.733702 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.832787 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.949864 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.052502 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.164623 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.275621 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.376748 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.477728 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.611332 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:28.050275 44c1e940 Compacting 50@2 + 242@3 files
> 2012/04/13-00:24:32.455665 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:32.538566 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:32.819205 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.052064 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.198850 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.350893 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.521784 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.693531 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.847151 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:34.034277 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:34.225582 44c1e940 Compacting 1@3 + 0@4 files
> 

[jira] [Assigned] (MESOS-184) Log has a space leak

2018-07-27 Thread Ilya Pronin (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-184:
-

Assignee: Ilya Pronin

> Log has a space leak
> 
>
> Key: MESOS-184
> URL: https://issues.apache.org/jira/browse/MESOS-184
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, replicated log
>Affects Versions: 0.9.0, 0.14.0, 0.14.1, 0.14.2, 0.15.0, 0.16.0, 0.17.0, 
> 0.18.0, 0.18.1, 0.18.2, 0.19.0
>Reporter: John Sirois
>Assignee: Ilya Pronin
>Priority: Minor
>  Labels: twitter
>
> In short, the access pattern of the Log of the underlying LevelDB storage is 
> such that background compactions are ineffective and a long running Log will 
> have a space leak on disk even in the presence of otherwise apparently 
> sufficient Log::Writer::truncate calls.
> It seems the right thing to do is to issue a DB::CompactRange(NULL, 
> Slice(truncateToKey)) after a replica learns a Action::TRUNCATE Record.  The 
> cost here is a synchronous compaction stall on every truncate so maybe this 
> should be a configuration option or even an explicit api.
> ===
> Snip of email explanation:
> I spent some time understanding what was going on here and our use pattern of 
> leveldb does in fact defeat the background compaction algorithm.
> The docs are here: http://leveldb.googlecode.com/svn/trunk/doc/impl.html in 
> the 'Compactions' section, but in short the gist is compaction operates on an 
> uncompacted file from a level (1 file) + all files overlapping its key range 
> in the next level.  Since we write sequential keys with no randomness at all, 
> by definition the only overlap we ever can get is in level 0 which is the 
> only level that leveldb allows for overlap in sstables in the 1st place.
> That leaves the question of why no compaction on open.  Looking there: 
> http://code.google.com/p/leveldb/source/browse/db/db_impl.cc#1376
> I see a call to MaybeScheduleCompaction, but following that trail, that just 
> leads to 
> http://code.google.com/p/leveldb/source/browse/db/version_set.cc?spec=svnbc1ee4d25e09b04e074db330a41f54ef4af0e31b=36a5f8ed7f9fb3373236d5eace4f5fea369856ee#1156
>  which implements the compaction strategy I tried to summarize above, and 
> thus background compactions for our case are limited to level0 -> level 1 
> compactions and level 1 and higher never compact automatically.
> This seems borne out by the LOG files.  For example, from smf1-prod - restarts 
> after your manual compaction fix in bold:
> [jsirois@smf1-ajb-35-sr1 ~]$ grep Compacting 
> /var/lib/mesos/scheduler_db/mesos_log/LOG.old 
> 2012/04/13-00:24:20.356673 44c1e940 Compacting 3@0 + 4@1 files
> 2012/04/13-00:24:20.490113 44c1e940 Compacting 5@1 + 281@2 files
> 2012/04/13-00:24:25.824995 44c1e940 Compacting 1@1 + 0@2 files
> 2012/04/13-00:24:26.008857 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.196877 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.312465 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.429817 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.533483 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.631044 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.733702 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.832787 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:26.949864 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.052502 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.164623 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.275621 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.376748 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.477728 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:27.611332 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:28.050275 44c1e940 Compacting 50@2 + 242@3 files
> 2012/04/13-00:24:32.455665 44c1e940 Compacting 1@2 + 0@3 files
> 2012/04/13-00:24:32.538566 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:32.819205 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.052064 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.198850 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.350893 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.521784 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.693531 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:33.847151 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:34.034277 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:34.225582 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:34.390228 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:34.554127 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:34.715242 44c1e940 Compacting 1@3 + 0@4 files
> 2012/04/13-00:24:34.852110 44c1e940 Compacting 1@3 + 0@4 files
> 

[jira] [Comment Edited] (MESOS-9007) XFS disk isolator doesn't clean up project ID from symlinks

2018-07-23 Thread Ilya Pronin (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543695#comment-16543695
 ] 

Ilya Pronin edited comment on MESOS-9007 at 7/24/18 12:43 AM:
--

Review requests:
https://reviews.apache.org/r/67915/
https://reviews.apache.org/r/67914/
https://reviews.apache.org/r/68029/


was (Author: ipronin):
Review requests:
https://reviews.apache.org/r/67915/
https://reviews.apache.org/r/67914/

> XFS disk isolator doesn't clean up project ID from symlinks
> ---
>
> Key: MESOS-9007
> URL: https://issues.apache.org/jira/browse/MESOS-9007
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> Upon container destruction its project ID is unallocated by the isolator and 
> removed from the container work directory. However the removing function 
> skips symbolic links and because of that the project still exists until the 
> container directory is garbage collected. If the project ID is reused for a 
> new container, any lingering symlinks that still have that project ID will 
> contribute to disk usage of the new container. Typically symlinks don't take 
> much space, but still this leads to inaccuracy in disk space usage accounting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9080) Port mapping isolator leaks ephemeral ports when a container is destroyed during preparation

2018-07-16 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-9080:
--

 Summary: Port mapping isolator leaks ephemeral ports when a 
container is destroyed during preparation
 Key: MESOS-9080
 URL: https://issues.apache.org/jira/browse/MESOS-9080
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Affects Versions: 1.6.0
Reporter: Ilya Pronin
Assignee: Ilya Pronin


The {{network/port_mapping}} isolator leaks ephemeral ports during container 
cleanup if {{Isolator::isolate()}} was not called, i.e. when the container is 
being destroyed during preparation. If the isolator doesn't know the main 
container's PID, it skips both the filter cleanup (the filters should not exist 
in this case) and the ephemeral ports deallocation.
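
A minimal sketch of the intended cleanup behaviour, with hypothetical member 
names ({{infos}}, {{removeFilters()}}, {{ephemeralPortsAllocator}}), not the 
actual isolator code:
{noformat}
// Sketch only: skip the PID-dependent filter removal when the PID is
// unknown, but always return the container's ephemeral ports.
Future<Nothing> PortMappingIsolatorProcess::cleanup(
    const ContainerID& containerId)
{
  Info* info = infos[containerId];

  if (info->pid.isSome()) {
    // Filters exist only after isolate() has run for this container.
    removeFilters(info->pid.get());
  }

  if (info->ephemeralPorts.isSome()) {
    // Deallocate even if isolate() was never called.
    ephemeralPortsAllocator->deallocate(info->ephemeralPorts.get());
  }

  infos.erase(containerId);
  delete info;

  return Nothing();
}
{noformat}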



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9007) XFS disk isolator doesn't clean up project ID from symlinks

2018-07-13 Thread Ilya Pronin (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543695#comment-16543695
 ] 

Ilya Pronin commented on MESOS-9007:


Review requests:
https://reviews.apache.org/r/67915/
https://reviews.apache.org/r/67914/

> XFS disk isolator doesn't clean up project ID from symlinks
> ---
>
> Key: MESOS-9007
> URL: https://issues.apache.org/jira/browse/MESOS-9007
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> Upon container destruction its project ID is unallocated by the isolator and 
> removed from the container work directory. However the removing function 
> skips symbolic links and because of that the project still exists until the 
> container directory is garbage collected. If the project ID is reused for a 
> new container, any lingering symlinks that still have that project ID will 
> contribute to disk usage of the new container. Typically symlinks don't take 
> much space, but still this leads to inaccuracy in disk space usage accounting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9007) XFS disk isolator doesn't clean up project ID from symlinks

2018-07-03 Thread Ilya Pronin (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-9007:
--

Assignee: Ilya Pronin

> XFS disk isolator doesn't clean up project ID from symlinks
> ---
>
> Key: MESOS-9007
> URL: https://issues.apache.org/jira/browse/MESOS-9007
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> Upon container destruction its project ID is unallocated by the isolator and 
> removed from the container work directory. However the removing function 
> skips symbolic links and because of that the project still exists until the 
> container directory is garbage collected. If the project ID is reused for a 
> new container, any lingering symlinks that still have that project ID will 
> contribute to disk usage of the new container. Typically symlinks don't take 
> much space, but still this leads to inaccuracy in disk space usage accounting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9007) XFS disk isolator doesn't clean up project ID from symlinks

2018-06-18 Thread Ilya Pronin (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516486#comment-16516486
 ] 

Ilya Pronin edited comment on MESOS-9007 at 6/19/18 12:27 AM:
--

Per [discussion at the XFS mailing 
list|https://www.spinics.net/lists/linux-xfs/msg19197.html] it is not possible 
to unset the project ID from a symlink. So we need to change the approach to 
project ID deallocation. XFS project IDs are 16- or 32-bit integers, so we have 
plenty of them. We can leave project IDs on container sandboxes until they are 
GCed. The question is how to track when a directory is GCed, since there's no 
GC hook that the isolator could use. The simplest way would be to periodically 
check the work dir to see if any of the sandboxes was removed and the project 
ID associated with it can be deallocated. Or we can use the inotify mechanism. 
Or add a hook :)


was (Author: ipronin):
Per [discussion at the XFS mailing 
list|https://www.spinics.net/lists/linux-xfs/msg19197.html] it is not possible 
to unset project ID from a symlink. So we need to change the approach to 
project ID deallocation. XFS project IDs are 32bit integers, so we have plenty 
of them. We can leave project IDs on container sandboxes until they are GCed. 
The question is how track when a directory is GCed, since there's no GC hook 
that the isolator could use. The simplest way would be to periodically check 
the work dir to see if any of the sandboxes was removed and the project ID 
associated with it can be deallocated. Or we can use inotify mechanism. Or add 
a hook :)

> XFS disk isolator doesn't clean up project ID from symlinks
> ---
>
> Key: MESOS-9007
> URL: https://issues.apache.org/jira/browse/MESOS-9007
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Ilya Pronin
>Priority: Minor
>
> Upon container destruction its project ID is unallocated by the isolator and 
> removed from the container work directory. However the removing function 
> skips symbolic links and because of that the project still exists until the 
> container directory is garbage collected. If the project ID is reused for a 
> new container, any lingering symlinks that still have that project ID will 
> contribute to disk usage of the new container. Typically symlinks don't take 
> much space, but still this leads to inaccuracy in disk space usage accounting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9007) XFS disk isolator doesn't clean up project ID from symlinks

2018-06-18 Thread Ilya Pronin (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516486#comment-16516486
 ] 

Ilya Pronin commented on MESOS-9007:


Per [discussion at the XFS mailing 
list|https://www.spinics.net/lists/linux-xfs/msg19197.html] it is not possible 
to unset the project ID from a symlink. So we need to change the approach to 
project ID deallocation. XFS project IDs are 32-bit integers, so we have plenty 
of them. We can leave project IDs on container sandboxes until they are GCed. 
The question is how to track when a directory is GCed, since there's no GC hook 
that the isolator could use. The simplest way would be to periodically check 
the work dir to see if any of the sandboxes was removed and the project ID 
associated with it can be deallocated. Or we can use the inotify mechanism. Or 
add a hook :)

> XFS disk isolator doesn't clean up project ID from symlinks
> ---
>
> Key: MESOS-9007
> URL: https://issues.apache.org/jira/browse/MESOS-9007
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Ilya Pronin
>Priority: Minor
>
> Upon container destruction its project ID is unallocated by the isolator and 
> removed from the container work directory. However the removing function 
> skips symbolic links and because of that the project still exists until the 
> container directory is garbage collected. If the project ID is reused for a 
> new container, any lingering symlinks that still have that project ID will 
> contribute to disk usage of the new container. Typically symlinks don't take 
> much space, but still this leads to inaccuracy in disk space usage accounting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9007) XFS disk isolator doesn't clean up project ID from symlinks

2018-06-18 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-9007:
--

 Summary: XFS disk isolator doesn't clean up project ID from 
symlinks
 Key: MESOS-9007
 URL: https://issues.apache.org/jira/browse/MESOS-9007
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Ilya Pronin


Upon container destruction its project ID is unallocated by the isolator and 
removed from the container work directory. However the removing function skips 
symbolic links and because of that the project still exists until the container 
directory is garbage collected. If the project ID is reused for a new 
container, any lingering symlinks that still have that project ID will 
contribute to disk usage of the new container. Typically symlinks don't take 
much space, but still this leads to inaccuracy in disk space usage accounting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8993) `network/ports` isolator missing from CMake build

2018-06-14 Thread Ilya Pronin (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512788#comment-16512788
 ] 

Ilya Pronin commented on MESOS-8993:


Please note that {{network/port_mapping}} (hidden behind the 
{{\-\-with-network-isolator}} flag in the Autotools based build) and 
{{network/ports}} are two different isolators. libnl is a dependency of the 
former, and {{src/tests/containerizer/port_mapping_tests.cpp}} contains its 
tests. I can help with it.

> `network/ports` isolator missing from CMake build
> -
>
> Key: MESOS-8993
> URL: https://issues.apache.org/jira/browse/MESOS-8993
> Project: Mesos
>  Issue Type: Bug
>  Components: cmake
>Affects Versions: 1.7.0
> Environment: Linux with CMake
>Reporter: Andrew Schwartzmeyer
>Assignee: James Peach
>Priority: Major
>  Labels: cmake
>
> The `network/ports` isolator is completely missing from the CMake build. It 
> looks like it needs libnl-3 and the files 
> src/tests/containerizer/ports_isolator_tests.cpp, 
> tests/containerizer/port_mapping_tests.cpp, and 
> src/slave/containerizer/mesos/isolators/network/ports.cpp, along with a 
> configuration option network-ports-isolator.
> Note that this was discovered due to a build break in the associated tests as 
> they were missing from the build I ran as a test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7069) The linux filesystem isolator should set mode and ownership for host volumes.

2018-02-08 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357316#comment-16357316
 ] 

Ilya Pronin commented on MESOS-7069:


[~jieyu] I believe this was fixed in https://reviews.apache.org/r/61122/. 
Closing this issue.

> The linux filesystem isolator should set mode and ownership for host volumes.
> -
>
> Key: MESOS-7069
> URL: https://issues.apache.org/jira/browse/MESOS-7069
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Ilya Pronin
>Priority: Major
>  Labels: filesystem, linux, volumes
>
> If the host path is a relative path, the linux filesystem isolator should set 
> the mode and ownership for this host volume since it allows non-root user to 
> write to the volume. Note that this is the case of sharing the host 
> fileysystem (without rootfs).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-7698) Libprocess doesn't handle IP changes

2018-02-01 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349554#comment-16349554
 ] 

Ilya Pronin edited comment on MESOS-7698 at 2/2/18 12:16 AM:
-

[~greggomann], libprocess looks up the address of the host it's running on at 
process startup and remembers that address for the lifetime of the process. If 
the {{--advertise_ip}} flag is not provided, this address is used as the return 
address in inter-libprocess communication (the {{User-Agent: libprocess/*}} 
header field). When I encountered the described problem, the IP address of one 
of our hosts had changed due to network maintenance. The agent on that host 
tried to re-register with the master, telling it that it was located at addr1, 
while in reality it was at addr2. Because of that return-address logic, the 
master was sending its responses to the wrong host at addr1. I never tried to 
reproduce the problem, but I suppose it should be relatively easy to reproduce 
by changing the IP address of the interface used by the agent for communicating 
with the master.

Maybe we could make the usage of return addresses in libprocess-to-libprocess 
communication more "relaxed". If the user doesn't want libprocess to advertise 
a specific address, the sending libprocess can omit the address in the 
{{User-Agent}} field and the receiver will use the return address from the 
connection?

I can work on a patch if somebody can shepherd this work.
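
For illustration, a minimal sketch of the "relaxed" return-address idea above, 
using a hypothetical helper (not actual libprocess code):
{noformat}
// Sketch only: prefer the address advertised in the "User-Agent:
// libprocess/..." UPID, but fall back to the peer address of the
// connection the message arrived on when nothing was advertised.
network::inet::Address returnAddress(
    const Option<network::inet::Address>& advertised,  // From the UPID.
    const network::inet::Address& peer)                // From the socket.
{
  return advertised.isSome() ? advertised.get() : peer;
}
{noformat}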


was (Author: ipronin):
[~greggomann], libprocess looks up the host address its running on upon process 
startup and remembers that address for the lifetime of the process. If 
{{--advertise_ip}} flag is not provided, then this address is used as a return 
address in inter-libprocess communication ({{User-Agent: libprocess/*}} header 
field). When I encountered the described problem, the IP address of one of our 
hosts has changed due to network maintenance. The agent on that host tried to 
re-register with the master, telling him that he was located at addr1, while in 
reality he was at addr2. Because of that logic with return address, the master 
was sending his responses to a wrong host at addr1. I never tried to reproduce 
the problem, but I suppose it should be relatively easy reproduced by changing 
the IP address of the interface used by the agent for communicating with the 
master.

Maybe we could make the usage of return addresses in libprocess-libprocess 
communication more "relaxed". If the user doesn't want libprocess to advertise 
a specific address, sending libprocess can omit the address in the 
{{User-Agent}} field and the receiver will use return address from the 
connection?

> Libprocess doesn't handle IP changes
> 
>
> Key: MESOS-7698
> URL: https://issues.apache.org/jira/browse/MESOS-7698
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Priority: Major
>
> If a host IP address changes libprocess will never learn about it and will 
> continue to send messages "from" the old IP.
> This will cause weird situations. E.g. an agent will indefinitely try to 
> reregister with a master pretending that it can be reached by an old IP. The 
> master will send {{SlaveReregisteredMessage}} to the wrong host (potentially 
> a different agent), using an IP from the {{User-Agent: libprocess/*}} header.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7698) Libprocess doesn't handle IP changes

2018-02-01 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349554#comment-16349554
 ] 

Ilya Pronin commented on MESOS-7698:


[~greggomann], libprocess looks up the address of the host it's running on at 
process startup and remembers that address for the lifetime of the process. If 
the {{--advertise_ip}} flag is not provided, this address is used as the return 
address in inter-libprocess communication (the {{User-Agent: libprocess/*}} 
header field). When I encountered the described problem, the IP address of one 
of our hosts had changed due to network maintenance. The agent on that host 
tried to re-register with the master, telling it that it was located at addr1, 
while in reality it was at addr2. Because of that return-address logic, the 
master was sending its responses to the wrong host at addr1. I never tried to 
reproduce the problem, but I suppose it should be relatively easy to reproduce 
by changing the IP address of the interface used by the agent for communicating 
with the master.

Maybe we could make the usage of return addresses in libprocess-to-libprocess 
communication more "relaxed". If the user doesn't want libprocess to advertise 
a specific address, the sending libprocess can omit the address in the 
{{User-Agent}} field and the receiver will use the return address from the 
connection?

> Libprocess doesn't handle IP changes
> 
>
> Key: MESOS-7698
> URL: https://issues.apache.org/jira/browse/MESOS-7698
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Priority: Major
>
> If a host IP address changes libprocess will never learn about it and will 
> continue to send messages "from" the old IP.
> This will cause weird situations. E.g. an agent will indefinitely try to 
> reregister with a master pretending that it can be reached by an old IP. The 
> master will send {{SlaveReregisteredMessage}} to the wrong host (potentially 
> a different agent), using an IP from the {{User-Agent: libprocess/*}} header.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8493) Master task state metrics discrepancy

2018-01-25 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin updated MESOS-8493:
---
Description: 
Currently, if a task status update has no reason, we don't increment the 
{{master///}} metric. Because of that the 
{{master/task_/*}} counters may not sum up to the {{master/tasks_}} 
counter, if for example a custom executor doesn't set the reason in its status 
updates.

Since the zero value in the {{Reason}} enum is already taken, it is possible to 
just count status updates with no reason under an artificial {{reason_unknown}} 
name.

  was:
Currently if the task status update has no reason we don't increment 
{{master///}} metric. Because of that 
{{master/task_/*}} counters may not sum up to {{master/tasks_}} 
counter, if for example a custom executor doesn't set a reason to its status 
updates.

Since the zero value in {{Reason}} enum is already taken it is possible to just 
count status updates with no reason an under artificial {{reason_unknown}} name.


> Master task state metrics discrepancy
> -
>
> Key: MESOS-8493
> URL: https://issues.apache.org/jira/browse/MESOS-8493
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Priority: Trivial
>
> Currently if the task status update has no reason we don't increment 
> {{master///}} metric. Because of that 
> {{master/task_/*}} counters may not sum up to {{master/tasks_}} 
> counter, if for example a custom executor doesn't set the reason in its 
> status updates.
> Since the zero value in the {{Reason}} enum is already taken, it is possible 
> to just count status updates with no reason under an artificial 
> {{reason_unknown}} name.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (MESOS-8493) Master task state metrics discrepancy

2018-01-25 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin updated MESOS-8493:
---
Component/s: master

> Master task state metrics discrepancy
> -
>
> Key: MESOS-8493
> URL: https://issues.apache.org/jira/browse/MESOS-8493
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Priority: Trivial
>
> Currently if the task status update has no reason we don't increment 
> {{master///}} metric. Because of that 
> {{master/task_/*}} counters may not sum up to 
> {{master/tasks_}} counter, if for example a custom executor doesn't 
> set a reason to its status updates.
> Since the zero value in the {{Reason}} enum is already taken, it is possible 
> to just count status updates with no reason under an artificial 
> {{reason_unknown}} name.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8493) Master task state metrics discrepancy

2018-01-25 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-8493:
--

 Summary: Master task state metrics discrepancy
 Key: MESOS-8493
 URL: https://issues.apache.org/jira/browse/MESOS-8493
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 1.2.0
Reporter: Ilya Pronin


Currently, if a task status update has no reason, we don't increment the 
{{master///}} metric. Because of that the 
{{master/task_/*}} counters may not sum up to the {{master/tasks_}} 
counter, if for example a custom executor doesn't set a reason in its status 
updates.

Since the zero value in the {{Reason}} enum is already taken, it is possible to 
just count status updates with no reason under an artificial {{reason_unknown}} 
name.
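
A minimal sketch of the proposed counting, with a hypothetical counter map 
(not the actual master metrics code):
{noformat}
// Sketch only: attribute status updates that carry no reason to an
// artificial "reason_unknown" bucket so the per-reason counters add up
// to the per-state totals.
void Metrics::countReason(const TaskStatus& status)
{
  const std::string reason = status.has_reason()
    ? TaskStatus::Reason_Name(status.reason())
    : "REASON_UNKNOWN";

  ++reasonCounters[reason];  // Assumed: std::map<std::string, int64_t>.
}
{noformat}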



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-6985) os::getenv() can segfault

2018-01-24 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338076#comment-16338076
 ] 

Ilya Pronin commented on MESOS-6985:


[~vinodkone], sorry, I missed the comment somehow. I have a POC-like patch for 
this but didn't have time to finish it. I'll try to finish it next week. 
Feel free to reassign if somebody would like to work on it before that.

> os::getenv() can segfault
> -
>
> Key: MESOS-6985
> URL: https://issues.apache.org/jira/browse/MESOS-6985
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
> Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without 
> libevent/SSL
>Reporter: Greg Mann
>Assignee: Ilya Pronin
>Priority: Major
>  Labels: flaky-test, reliability, stout
> Attachments: 
> MasterMaintenanceTest.InverseOffersFilters-truncated.txt, 
> MasterTest.MultipleExecutors.txt
>
>
> This was observed on ASF CI. The segfault first showed up on CI on 9/20/16 
> and has been produced by the tests {{MasterTest.MultipleExecutors}} and 
> {{MasterMaintenanceTest.InverseOffersFilters}}. In both cases, 
> {{os::getenv()}} segfaults with the same stack trace:
> {code}
> *** Aborted at 1485241617 (unix time) try "date -d @1485241617" if you are 
> using GNU date ***
> PC: @ 0x2ad59e3ae82d (unknown)
> I0124 07:06:57.422080 28619 exec.cpp:162] Version: 1.2.0
> *** SIGSEGV (@0xf0) received by PID 28591 (TID 0x2ad5a7b87700) from PID 240; 
> stack trace: ***
> I0124 07:06:57.422336 28615 exec.cpp:212] Executor started at: 
> executor(75)@172.17.0.2:45752 with pid 28591
> @ 0x2ad5ab953197 (unknown)
> @ 0x2ad5ab957479 (unknown)
> @ 0x2ad59e165330 (unknown)
> @ 0x2ad59e3ae82d (unknown)
> @ 0x2ad594631358 os::getenv()
> @ 0x2ad59aba6acf mesos::internal::slave::executorEnvironment()
> @ 0x2ad59ab845c0 mesos::internal::slave::Framework::launchExecutor()
> @ 0x2ad59ab818a2 mesos::internal::slave::Slave::_run()
> @ 0x2ad59ac1ec10 
> _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_
> @ 0x2ad59ac1e6bf 
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x2ad59bce2304 std::function<>::operator()()
> @ 0x2ad59bcc9824 process::ProcessBase::visit()
> @ 0x2ad59bd4028e process::DispatchEvent::visit()
> @ 0x2ad594616df1 process::ProcessBase::serve()
> @ 0x2ad59bcc72b7 process::ProcessManager::resume()
> @ 0x2ad59bcd567c 
> process::ProcessManager::init_threads()::$_2::operator()()
> @ 0x2ad59bcd5585 
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x2ad59bcd std::_Bind_simple<>::operator()()
> @ 0x2ad59bcd552c std::thread::_Impl<>::_M_run()
> @ 0x2ad59d9e6a60 (unknown)
> @ 0x2ad59e15d184 start_thread
> @ 0x2ad59e46d37d (unknown)
> make[4]: *** [check-local] Segmentation fault
> {code}
> Find attached the full log from a failed run of 
> {{MasterTest.MultipleExecutors}} and a truncated log from a failed run of 
> {{MasterMaintenanceTest.InverseOffersFilters}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8377) RecoverTest.CatchupTruncated is flaky.

2018-01-03 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310511#comment-16310511
 ] 

Ilya Pronin edited comment on MESOS-8377 at 1/4/18 12:14 AM:
-

Review request: https://reviews.apache.org/r/64938/

I couldn't reproduce the issue on my machine with {{--gtest_repeat=1000 
--gtest_break_on_failure=1}}, but I suspect that it has something to do with 
the fact that the test uses {{Shared}}, which can probably still be retained 
by the "managed" {{CatchupProcess}} at the moment when I [try to recreate the 
replica|https://github.com/apache/mesos/blob/master/src/tests/log_tests.cpp#L2096]. 
Because of that the DB cannot be closed and LevelDB complains that the process 
still holds the DB lock. I've added code to make sure that the test code is the 
only owner of {{replica3}} before proceeding to recreate it.
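
A minimal sketch of that ownership check, assuming a {{process::Shared}} handle 
named {{shared3}} (names are illustrative, not the exact test code):
{noformat}
// Sketch only: wait until the test is the sole holder of the replica
// before destroying it, so LevelDB's lock on the DB directory is released
// and the replica can be recreated from the same path.
Future<Owned<Replica>> future = shared3.own();
AWAIT_READY(future);

Owned<Replica> replica3 = future.get();
replica3.reset();  // Closes the replica (and its DB) right here.
{noformat}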


was (Author: ipronin):
Review request: https://reviews.apache.org/r/64938/

I couldn't reproduce the issue on my machine with `--gtest_repeat=1000 
--gtest_break_on_failure=1`, but I suspect that it has something to do with the 
fact the test uses {{Shared}} which probably can still be retained by 
"managed" {{CatchupProcess}} at the moment when I [try to recreate the replica| 
https://github.com/apache/mesos/blob/master/src/tests/log_tests.cpp#L2096]. 
Because of that the DB can not be closed and LevelDB complaints that the 
process still holds the DB lock. I've added the code to make sure that the test 
code is the only owner of {{replica3}} before proceeding to recreate it.

> RecoverTest.CatchupTruncated is flaky.
> --
>
> Key: MESOS-8377
> URL: https://issues.apache.org/jira/browse/MESOS-8377
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Alexander Rukletsov
>Assignee: Ilya Pronin
>  Labels: flaky-test
> Attachments: CatchupTruncated-badrun.txt, 
> RecoverTest.CatchupTruncated-badrun2.txt
>
>
> Observing regularly in our CI. Logs attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8377) RecoverTest.CatchupTruncated is flaky.

2018-01-03 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-8377:
--

Assignee: Ilya Pronin

> RecoverTest.CatchupTruncated is flaky.
> --
>
> Key: MESOS-8377
> URL: https://issues.apache.org/jira/browse/MESOS-8377
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Reporter: Alexander Rukletsov
>Assignee: Ilya Pronin
>  Labels: flaky-test
> Attachments: CatchupTruncated-badrun.txt, 
> RecoverTest.CatchupTruncated-badrun2.txt
>
>
> Observing regularly in our CI. Logs attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7973) Non-leading VOTING replica catch-up

2018-01-03 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16310132#comment-16310132
 ] 

Ilya Pronin commented on MESOS-7973:


Documentation patches:
https://reviews.apache.org/r/64921/
https://reviews.apache.org/r/64922/
https://reviews.apache.org/r/64923/

> Non-leading VOTING replica catch-up
> ---
>
> Key: MESOS-7973
> URL: https://issues.apache.org/jira/browse/MESOS-7973
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
> Fix For: 1.5.0
>
>
> Currently it is not possible to perform consistent reads from non-leading 
> replicas due to the fact that if a non-leading replica is partitioned it may 
> miss some log positions and will not make any attempt to “fill” those holes.
> If a non-leading replica could catch-up missing log positions it would be 
> able to serve eventually consistent reads to the framework. This would make 
> it possible to do additional work on non-leading framework replicas (e.g. 
> offload some reading from a leader to standbys or reduce failover time by 
> keeping in-memory storage represented by the log “hot”).
> Design doc: 
> https://docs.google.com/document/d/1dERXJeAsi3Lnq9Akt82JGWK4pKNeJ6k7PTVCpM9ic_8/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8120) Benchmark scheduler API performance

2017-11-30 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273056#comment-16273056
 ] 

Ilya Pronin edited comment on MESOS-8120 at 11/30/17 5:54 PM:
--

Posted the benchmarks described in the doc for review: 
https://reviews.apache.org/r/64217


was (Author: ipronin):
Added the benchmarks described in the doc for review: 
https://reviews.apache.org/r/64217

> Benchmark scheduler API performance
> ---
>
> Key: MESOS-8120
> URL: https://issues.apache.org/jira/browse/MESOS-8120
> Project: Mesos
>  Issue Type: Task
>Reporter: Ilya Pronin
>Priority: Minor
> Attachments: revive.master.lockfree.8threads.svg, 
> revive.master.lockfree.svg, revive.master.svg
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8120) Benchmark scheduler API performance

2017-11-30 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273056#comment-16273056
 ] 

Ilya Pronin commented on MESOS-8120:


Added the benchmarks described in the doc for review: 
https://reviews.apache.org/r/64217

> Benchmark scheduler API performance
> ---
>
> Key: MESOS-8120
> URL: https://issues.apache.org/jira/browse/MESOS-8120
> Project: Mesos
>  Issue Type: Task
>Reporter: Ilya Pronin
>Priority: Minor
> Attachments: revive.master.lockfree.8threads.svg, 
> revive.master.lockfree.svg, revive.master.svg
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6406) Send latest status for partition-aware tasks when agent reregisters

2017-11-29 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271799#comment-16271799
 ] 

Ilya Pronin commented on MESOS-6406:


What if the agent becomes unreachable, then a master failover happens, and then 
the agent re-registers? Let's pretend that the agent's entry was GCed from the 
registry. In this case the framework will not know that the task came back, 
right?

> Send latest status for partition-aware tasks when agent reregisters
> ---
>
> Key: MESOS-6406
> URL: https://issues.apache.org/jira/browse/MESOS-6406
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Megha Sharma
>  Labels: mesosphere
>
> When an agent reregisters, we should notify frameworks about the current 
> status of any partition-aware tasks that were/are running on the agent -- 
> i.e., report the current state of the task at the agent to the framework.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-20 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259755#comment-16259755
 ] 

Ilya Pronin commented on MESOS-8185:


[~xujyan] it should, but we'll need MESOS-6406 so we don't have to rely on 
explicit reconciliation to get notified about the tasks that came back.

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>  Labels: reliability
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before becoming marked 
> unreachable while never having terminated tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # Connection from the master to the agent stops working
> # Agent doesn't see pings from the master and attempts to re-register
> # Master sends {{SlaveRegisteredMessage}} and {{ShutdownSlaveMessage}}, which 
> don't get to the agent because of the connection failure. Agent is marked 
> registered.
> # Network issue resolves, connection breaks. Agent retries re-registration.
> # Master thinks that the agent was registered since step (3) and just 
> re-sends {{SlaveRegisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks that the 
> already registered agent reports in {{ReregisterSlaveMessage}} and the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-13 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-8185:
--

Assignee: Ilya Pronin

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before being marked 
> unreachable, while never having terminated the tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # Connection from the master to the agent stops working
> # Agent doesn't see pings from the master and attempts to re-register
> # Master sends {{SlaveRegisteredMessage}} and {{ShutdownSlaveMessage}}, which 
> don't get to the agent because of the connection failure. Agent is marked 
> registered.
> # The network issue resolves, the old connection breaks, and the agent 
> retries re-registration.
> # Master thinks that the agent has been registered since step (3) and just 
> re-sends {{SlaveRegisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks that the 
> already registered agent reports in {{ReregisterSlaveMessage}} with the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-13 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249889#comment-16249889
 ] 

Ilya Pronin edited comment on MESOS-8185 at 11/13/17 6:00 PM:
--

[~bmahler] I see you've already changed it. Thanks!

Sure, let's discuss all available options. I suggested killing, because the 
current contract between the master and non-{{PARTITION_AWARE}} frameworks is 
that {{LOST}} tasks will be killed by the master.


was (Author: ipronin):
[~bmahler] I see you've already changed it.

Thanks! Sure, let's discuss all available options. I suggested killing, because 
the current contract between the master and non-{{PARTITION_AWARE}} frameworks 
is that {{LOST}} tasks will be killed by the master.

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before being marked 
> unreachable, while never having terminated the tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # Connection from the master to the agent stops working
> # Agent doesn't see pings from the master and attempts to re-register
> # Master sends {{SlaveRegisteredMessage}} and {{ShutdownSlaveMessage}}, which 
> don't get to the agent because of the connection failure. Agent is marked 
> registered.
> # The network issue resolves, the old connection breaks, and the agent 
> retries re-registration.
> # Master thinks that the agent has been registered since step (3) and just 
> re-sends {{SlaveRegisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks that the 
> already registered agent reports in {{ReregisterSlaveMessage}} with the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8185) Tasks can be known to the agent but unknown to the master.

2017-11-13 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249889#comment-16249889
 ] 

Ilya Pronin commented on MESOS-8185:


[~bmahler] I see you've already changed it.

Thanks! Sure, let's discuss all available options. I suggested killing, because 
the current contract between the master and non-{{PARTITION_AWARE}} frameworks 
is that {{LOST}} tasks will be killed by the master.

> Tasks can be known to the agent but unknown to the master.
> --
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before being marked 
> unreachable, while never having terminated the tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # Connection from the master to the agent stops working
> # Agent doesn't see pings from the master and attempts to re-register
> # Master sends {{SlaveRegisteredMessage}} and {{ShutdownSlaveMessage}}, which 
> don't get to the agent because of the connection failure. Agent is marked 
> registered.
> # The network issue resolves, the old connection breaks, and the agent 
> retries re-registration.
> # Master thinks that the agent has been registered since step (3) and just 
> re-sends {{SlaveRegisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks that the 
> already registered agent reports in {{ReregisterSlaveMessage}} with the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8185) Master should kill tasks that are unknown to it after registered agent re-registers

2017-11-08 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16245048#comment-16245048
 ] 

Ilya Pronin commented on MESOS-8185:


[~bmahler] can you shepherd this, please?

> Master should kill tasks that are unknown to it after registered agent 
> re-registers
> ---
>
> Key: MESOS-8185
> URL: https://issues.apache.org/jira/browse/MESOS-8185
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Ilya Pronin
>
> Currently, when a master re-registers an agent that was marked unreachable, 
> it shuts down all non-partition-aware frameworks on that agent. When a master 
> re-registers an agent that is already registered, it doesn't check that all 
> tasks from the slave's re-registration message are known to it.
> It is possible that due to a transient loss of connectivity an agent may miss 
> {{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
> will not kill non-partition-aware tasks. But the master will mark the agent 
> as registered and will not re-add tasks that it thought would be killed. The 
> agent may re-register again, this time successfully, before being marked 
> unreachable, while never having terminated the tasks of non-partition-aware 
> frameworks. The master will simply forget those tasks ever existed, because 
> it has "removed" them during the previous re-registration.
> Example scenario:
> # Connection from the master to the agent stops working
> # Agent doesn't see pings from the master and attempts to re-register
> # Master sends {{SlaveRegisteredMessage}} and {{ShutdownSlaveMessage}}, which 
> don't get to the agent because of the connection failure. Agent is marked 
> registered.
> # The network issue resolves, the old connection breaks, and the agent 
> retries re-registration.
> # Master thinks that the agent has been registered since step (3) and just 
> re-sends {{SlaveRegisteredMessage}}. Tasks remain running on the agent.
> One of the possible solutions would be to compare the list of tasks that the 
> already registered agent reports in {{ReregisterSlaveMessage}} with the list 
> of tasks the master has. In this case anything that the master doesn't know 
> about should not exist on the agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8185) Master should kill tasks that are unknown to it after registered agent re-registers

2017-11-08 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-8185:
--

 Summary: Master should kill tasks that are unknown to it after 
registered agent re-registers
 Key: MESOS-8185
 URL: https://issues.apache.org/jira/browse/MESOS-8185
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Ilya Pronin


Currently, when a master re-registers an agent that was marked unreachable, it 
shuts down all non-partition-aware frameworks on that agent. When a master 
re-registers an agent that is already registered, it doesn't check that all 
tasks from the slave's re-registration message are known to it.

It is possible that due to a transient loss of connectivity an agent may miss 
{{SlaveReregisteredMessage}} along with {{ShutdownFrameworkMessage}} and thus 
will not kill non-partition-aware tasks. But the master will mark the agent as 
registered and will not re-add tasks that it thought would be killed. The agent 
may re-register again, this time successfully, before being marked 
unreachable, while never having terminated the tasks of non-partition-aware 
frameworks. The master will simply forget those tasks ever existed, because it 
has "removed" them during the previous re-registration.

Example scenario:
# Connection from the master to the agent stops working
# Agent doesn't see pings from the master and attempts to re-register
# Master sends {{SlaveRegisteredMessage}} and {{ShutdownSlaveMessage}}, which 
don't get to the agent because of the connection failure. Agent is marked 
registered.
# The network issue resolves, the old connection breaks, and the agent retries 
re-registration.
# Master thinks that the agent has been registered since step (3) and just 
re-sends {{SlaveRegisteredMessage}}. Tasks remain running on the agent.

One of the possible solutions would be to compare the list of tasks that the 
already registered agent reports in {{ReregisterSlaveMessage}} with the list of 
tasks the master has. In this case anything that the master doesn't know about 
should not exist on the agent.
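
For illustration only, a minimal sketch of that comparison using plain STL 
containers rather than the actual master data structures (the task-ID sets are 
hypothetical stand-ins for what {{ReregisterSlaveMessage}} carries and what the 
master tracks):

{code}
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Anything the agent reports that the master does not know about should not
// exist on the agent and is a candidate for a kill.
std::vector<std::string> tasksUnknownToMaster(
    const std::set<std::string>& reportedByAgent,
    const std::set<std::string>& knownToMaster)
{
  std::vector<std::string> unknown;

  std::set_difference(
      reportedByAgent.begin(), reportedByAgent.end(),
      knownToMaster.begin(), knownToMaster.end(),
      std::back_inserter(unknown));

  return unknown;
}
{code}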



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8165) TASK_UNKNOWN status is ambiguous

2017-11-02 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-8165:
--

 Summary: TASK_UNKNOWN status is ambiguous
 Key: MESOS-8165
 URL: https://issues.apache.org/jira/browse/MESOS-8165
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.4.0
Reporter: Ilya Pronin
Assignee: Ilya Pronin
Priority: Major


There's an ambiguity in the definition of the {{TASK_UNKNOWN}} status. 
Currently it is sent by the master during explicit reconciliation when it 
doesn't know about the task. This covers two situations that should be handled 
differently by frameworks:
# Task is unknown, agent is unknown - we don't know the fate of the task, so it 
is reasonable for the framework to wait until the task comes back (if the SLA 
allows);
# Task is unknown, agent is registered - the task is definitely in a terminal 
state and won't come back, so the framework should reschedule it.

The second situation should produce a {{TASK_GONE}} status.
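
A rough framework-side sketch of handling the two cases differently, assuming 
the v1 scheduler API types; {{waitForTaskToComeBack()}} and {{reschedule()}} 
are hypothetical placeholders for framework-specific logic:

{code}
#include <mesos/v1/mesos.hpp>

void waitForTaskToComeBack(const mesos::v1::TaskID& taskId);  // Hypothetical.
void reschedule(const mesos::v1::TaskID& taskId);             // Hypothetical.

void handleReconciliationUpdate(const mesos::v1::TaskStatus& status)
{
  switch (status.state()) {
    case mesos::v1::TASK_UNKNOWN:
      // Task and agent are both unknown: the task may still come back,
      // so wait if the SLA allows.
      waitForTaskToComeBack(status.task_id());
      break;
    case mesos::v1::TASK_GONE:
      // The agent is registered but the task is unknown: it is terminal
      // and won't come back, so reschedule it.
      reschedule(status.task_id());
      break;
    default:
      break;
  }
}
{code}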



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8120) Benchmark scheduler API performance

2017-10-20 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin updated MESOS-8120:
---
Attachment: revive.master.lockfree.8threads.svg

> Benchmark scheduler API performance
> ---
>
> Key: MESOS-8120
> URL: https://issues.apache.org/jira/browse/MESOS-8120
> Project: Mesos
>  Issue Type: Task
>Reporter: Ilya Pronin
>Priority: Minor
> Attachments: revive.master.lockfree.8threads.svg, 
> revive.master.lockfree.svg, revive.master.svg
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-8120) Benchmark scheduler API performance

2017-10-20 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213109#comment-16213109
 ] 

Ilya Pronin edited comment on MESOS-8120 at 10/20/17 7:42 PM:
--

Results: 
https://docs.google.com/document/d/1gGxrrqC9ahN1oYFUpxpE_b27Xc5ld_04x1Bl86DROqY
Flamegraph for master during "revive" benchmark: [^revive.master.svg]
Flamegraph for master with lock-free event queue and run queue during "revive" 
benchmark: [^revive.master.lockfree.svg]


was (Author: ipronin):
Results: 
https://docs.google.com/document/d/1gGxrrqC9ahN1oYFUpxpE_b27Xc5ld_04x1Bl86DROqY

> Benchmark scheduler API performance
> ---
>
> Key: MESOS-8120
> URL: https://issues.apache.org/jira/browse/MESOS-8120
> Project: Mesos
>  Issue Type: Task
>Reporter: Ilya Pronin
>Priority: Minor
> Attachments: revive.master.lockfree.svg, revive.master.svg
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8120) Benchmark scheduler API performance

2017-10-20 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin updated MESOS-8120:
---
Attachment: revive.master.lockfree.svg
revive.master.svg

> Benchmark scheduler API performance
> ---
>
> Key: MESOS-8120
> URL: https://issues.apache.org/jira/browse/MESOS-8120
> Project: Mesos
>  Issue Type: Task
>Reporter: Ilya Pronin
>Priority: Minor
> Attachments: revive.master.lockfree.svg, revive.master.svg
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-8120) Benchmark scheduler API performance

2017-10-20 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213109#comment-16213109
 ] 

Ilya Pronin commented on MESOS-8120:


Results: 
https://docs.google.com/document/d/1gGxrrqC9ahN1oYFUpxpE_b27Xc5ld_04x1Bl86DROqY

> Benchmark scheduler API performance
> ---
>
> Key: MESOS-8120
> URL: https://issues.apache.org/jira/browse/MESOS-8120
> Project: Mesos
>  Issue Type: Task
>Reporter: Ilya Pronin
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-8120) Benchmark scheduler API performance

2017-10-20 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-8120:
--

 Summary: Benchmark scheduler API performance
 Key: MESOS-8120
 URL: https://issues.apache.org/jira/browse/MESOS-8120
 Project: Mesos
  Issue Type: Task
Reporter: Ilya Pronin
Priority: Minor


Results: 
https://docs.google.com/document/d/1gGxrrqC9ahN1oYFUpxpE_b27Xc5ld_04x1Bl86DROqY



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-8120) Benchmark scheduler API performance

2017-10-20 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin updated MESOS-8120:
---
Description: (was: Results: 
https://docs.google.com/document/d/1gGxrrqC9ahN1oYFUpxpE_b27Xc5ld_04x1Bl86DROqY)

> Benchmark scheduler API performance
> ---
>
> Key: MESOS-8120
> URL: https://issues.apache.org/jira/browse/MESOS-8120
> Project: Mesos
>  Issue Type: Task
>Reporter: Ilya Pronin
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7973) Non-leading VOTING replica catch-up

2017-10-03 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164926#comment-16164926
 ] 

Ilya Pronin edited comment on MESOS-7973 at 10/3/17 11:19 PM:
--

Review requests:
https://reviews.apache.org/r/62283/
https://reviews.apache.org/r/62284/
https://reviews.apache.org/r/62285/
https://reviews.apache.org/r/62760/
https://reviews.apache.org/r/62761/
https://reviews.apache.org/r/62286/
https://reviews.apache.org/r/62287/
https://reviews.apache.org/r/62288/


was (Author: ipronin):
Review requests:
https://reviews.apache.org/r/62283/
https://reviews.apache.org/r/62284/
https://reviews.apache.org/r/62285/
https://reviews.apache.org/r/62286/
https://reviews.apache.org/r/62287/
https://reviews.apache.org/r/62288/

> Non-leading VOTING replica catch-up
> ---
>
> Key: MESOS-7973
> URL: https://issues.apache.org/jira/browse/MESOS-7973
> Project: Mesos
>  Issue Type: Improvement
>  Components: replicated log
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>
> Currently it is not possible to perform consistent reads from non-leading 
> replicas, because a partitioned non-leading replica may miss some log 
> positions and will not make any attempt to “fill” those holes.
> If a non-leading replica could catch up on missing log positions, it would be 
> able to serve eventually consistent reads to the framework. This would make 
> it possible to do additional work on non-leading framework replicas (e.g. 
> offload some reading from a leader to standbys or reduce failover time by 
> keeping the in-memory storage represented by the log “hot”).
> Design doc: 
> https://docs.google.com/document/d/1dERXJeAsi3Lnq9Akt82JGWK4pKNeJ6k7PTVCpM9ic_8/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7973) Non-leading VOTING replica catch-up

2017-09-13 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7973:
--

 Summary: Non-leading VOTING replica catch-up
 Key: MESOS-7973
 URL: https://issues.apache.org/jira/browse/MESOS-7973
 Project: Mesos
  Issue Type: Improvement
  Components: replicated log
Reporter: Ilya Pronin
Assignee: Ilya Pronin


Currently it is not possible to perform consistent reads from non-leading 
replicas, because a partitioned non-leading replica may miss some log positions 
and will not make any attempt to “fill” those holes.

If a non-leading replica could catch up on missing log positions, it would be 
able to serve eventually consistent reads to the framework. This would make it 
possible to do additional work on non-leading framework replicas (e.g. offload 
some reading from a leader to standbys or reduce failover time by keeping the 
in-memory storage represented by the log “hot”).

Design doc: 
https://docs.google.com/document/d/1dERXJeAsi3Lnq9Akt82JGWK4pKNeJ6k7PTVCpM9ic_8/edit?usp=sharing
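
As a toy illustration of the catch-up bookkeeping (not the replicated log's 
actual types), the holes a non-leading replica would need to fill can be 
computed like this:

{code}
#include <cstdint>
#include <set>
#include <vector>

// Given the position range [begin, end] known from the leader and the
// positions this replica has already learned, return the missing positions
// that have to be caught up before consistent reads can be served.
std::vector<uint64_t> missingPositions(
    uint64_t begin,
    uint64_t end,
    const std::set<uint64_t>& learned)
{
  std::vector<uint64_t> holes;

  for (uint64_t position = begin; position <= end; ++position) {
    if (learned.count(position) == 0) {
      holes.push_back(position);
    }
  }

  return holes;
}
{code}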



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-6971) Use arena allocation to improve protobuf message passing performance.

2017-08-29 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-6971:
--

Assignee: (was: Ilya Pronin)

> Use arena allocation to improve protobuf message passing performance.
> -
>
> Key: MESOS-6971
> URL: https://issues.apache.org/jira/browse/MESOS-6971
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>  Labels: mesosphere, performance, tech-debt
>
> The protobuf message passing provided by {{ProtobufProcess}} provides const 
> access to the message and/or its fields to the handler function.
> This means that we can leverage the [arena 
> allocator|https://developers.google.com/protocol-buffers/docs/reference/arenas]
>  provided by protobuf to reduce the memory allocation cost during 
> de-serialization and improve cache efficiency.
> This would require using protobuf 3.x with "proto2" syntax (which appears to 
> be the default if unspecified) to maintain our existing "proto2" 
> requirements. The upgrade to protobuf 3.x while keeping "proto2" syntax 
> should be tackled via a separate ticket that blocks this one.
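
A small sketch of what arena-allocated de-serialization looks like; 
{{ExampleMessage}} and its header are hypothetical placeholders for a real 
generated protobuf type:

{code}
#include <string>

#include <google/protobuf/arena.h>

#include "example.pb.h"  // Hypothetical generated header for ExampleMessage.

void parseWithArena(const std::string& data)
{
  google::protobuf::Arena arena;

  // The message and its sub-objects are allocated on the arena and released
  // in bulk when `arena` goes out of scope, instead of one heap allocation
  // per field during de-serialization.
  ExampleMessage* message =
    google::protobuf::Arena::CreateMessage<ExampleMessage>(&arena);

  message->ParseFromString(data);

  // Const access can then be handed to the handler, as ProtobufProcess does.
  // Note: with "proto2" syntax the .proto file may also need
  // `option cc_enable_arenas = true;`.
}
{code}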



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7895) ZK session timeout is unconfigurable in agent and scheduler drivers

2017-08-29 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16128827#comment-16128827
 ] 

Ilya Pronin edited comment on MESOS-7895 at 8/29/17 12:51 PM:
--

[~vinodkone] can you shepherd this please?

Review requests for agents:
https://reviews.apache.org/r/61689/
https://reviews.apache.org/r/61690/
https://reviews.apache.org/r/61965/


was (Author: ipronin):
[~vinodkone] can you shepherd this please?

Review requests for agents:
https://reviews.apache.org/r/61689/
https://reviews.apache.org/r/61690/

> ZK session timeout is unconfigurable in agent and scheduler drivers
> ---
>
> Key: MESOS-7895
> URL: https://issues.apache.org/jira/browse/MESOS-7895
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{ZooKeeperMasterDetector}} in agents and scheduler drivers uses the default 
> ZK session timeout (10 secs). This timeout may have to be increased to cope 
> with long ZK upgrades or ZK GC pauses (with local ZK sessions these can cause 
> lots of {{TASK_LOST}}, because sessions expire on disconnection after 
> {{session_timeout * 2 / 3}}).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7867) Master doesn't handle scheduler driver downgrade from HTTP based to PID based

2017-08-18 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118527#comment-16118527
 ] 

Ilya Pronin edited comment on MESOS-7867 at 8/18/17 2:47 PM:
-

We can add those metrics back in {{Master::failoverFramework(Framework*, const 
UPID&)}} or just drop the metrics removal from 
{{Master::failoverFramework(Framework*, const HttpConnection&)}}. 
{{Master::addFramework()}} adds those metrics regardless of the scheduler 
driver type.

[~anandmazumdar], can you shepherd this please?


was (Author: ipronin):
We can add those metrics back in {{Master::failoverFramework(Framework*, const 
UPID&)}} or just drop the metrics removal from 
{{Master::failoverFramework(Framework*, const HttpConnection&)}}. 
{{Master::addFramework()}} adds those metrics regardless of the scheduler 
driver type.

> Master doesn't handle scheduler driver downgrade from HTTP based to PID based
> -
>
> Key: MESOS-7867
> URL: https://issues.apache.org/jira/browse/MESOS-7867
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>
> When a framework upgrades from a PID based driver to an HTTP based driver, 
> master removes its per-framework-principal metrics ({{messages_received}} and 
> {{messages_processed}}) in {{Master::failoverFramework}}. When the same 
> framework downgrades back to a PID based driver, the master doesn't reinstate 
> those metrics. This causes a crash when the master receives a message from 
> the failed over framework and increments {{messages_received}} counter in 
> {{Master::visit(const MessageEvent&)}}.
> {noformat}
> I0807 18:17:45.713220 19095 master.cpp:2916] Framework 
> 70822e80-ca38-4470-916e-e6da073a4742- (TwitterScheduler) failed over
> F0807 18:18:20.725908 19079 master.cpp:1451] Check failed: 
> metrics->frameworks.contains(principal.get())
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7896) Use std::error_code for reporting platform-dependent errors

2017-08-16 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7896:
--

 Summary: Use std::error_code for reporting platform-dependent 
errors
 Key: MESOS-7896
 URL: https://issues.apache.org/jira/browse/MESOS-7896
 Project: Mesos
  Issue Type: Improvement
Reporter: Ilya Pronin
Priority: Minor


It may be useful to return an error code from various functions to be able to 
distinguish different kinds of errors, e.g. to be able to ignore {{ENOENT}} 
from {{unlink()}}. This can be achieved by returning {{Try}}, 
but this is not portable.

Since C++11 the STL has {{std::error_code}}, which hides the platform-dependent 
error code behind a portable error condition. We can use it for error 
reporting.
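
A minimal sketch of the idea (POSIX-only; {{removeFile()}} is a made-up 
helper):

{code}
#include <cerrno>
#include <system_error>

#include <unistd.h>

// Remove a file and report failures with a portable std::error_code.
std::error_code removeFile(const char* path)
{
  if (::unlink(path) == -1) {
    return std::error_code(errno, std::generic_category());
  }
  return std::error_code();
}

int main()
{
  std::error_code ec = removeFile("/tmp/does-not-exist");

  // The caller compares against a portable error condition and never sees
  // the platform-specific value of ENOENT.
  if (ec && ec != std::errc::no_such_file_or_directory) {
    return 1;  // A "real" failure.
  }

  return 0;
}
{code}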



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7895) ZK session timeout is unconfigurable in agent and scheduler drivers

2017-08-16 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16128827#comment-16128827
 ] 

Ilya Pronin commented on MESOS-7895:


[~vinodkone] can you shepherd this please?

Review requests for agents:
https://reviews.apache.org/r/61689/
https://reviews.apache.org/r/61690/

> ZK session timeout is unconfigurable in agent and scheduler drivers
> ---
>
> Key: MESOS-7895
> URL: https://issues.apache.org/jira/browse/MESOS-7895
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{ZooKeeperMasterDetector}} in agents and scheduler drivers uses the default 
> ZK session timeout (10 secs). This timeout may have to be increased to cope 
> with long ZK upgrades or ZK GC pauses (with local ZK sessions these can cause 
> lots of {{TASK_LOST}}, because sessions expire on disconnection after 
> {{session_timeout * 2 / 3}}).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7895) ZK session timeout is unconfigurable in agent and scheduler drivers

2017-08-16 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7895:
--

 Summary: ZK session timeout is unconfigurable in agent and 
scheduler drivers
 Key: MESOS-7895
 URL: https://issues.apache.org/jira/browse/MESOS-7895
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 1.3.0
Reporter: Ilya Pronin
Assignee: Ilya Pronin
Priority: Minor


{{ZooKeeperMasterDetector}} in agents and scheduler drivers uses the default ZK 
session timeout (10 secs). This timeout may have to be increased to cope with 
long ZK upgrades or ZK GC pauses (with local ZK sessions these can cause lots 
of {{TASK_LOST}}, because sessions expire on disconnection after 
{{session_timeout * 2 / 3}}).
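
For reference, the underlying ZooKeeper C client already takes the requested 
session timeout at connection time, so making it configurable is mostly 
plumbing a flag down to this call (the 60-second value below is just an 
example):

{code}
#include <cstddef>

#include <zookeeper/zookeeper.h>

// Minimal watcher; session events would normally be handled here.
static void watcher(zhandle_t* zh, int type, int state,
                    const char* path, void* context) {}

int main()
{
  // The third argument is the requested session timeout in milliseconds; the
  // ensemble may negotiate it down. Per the description above, a client using
  // local sessions effectively expires after ~2/3 of it while disconnected.
  zhandle_t* zh = zookeeper_init(
      "zk1:2181,zk2:2181,zk3:2181", watcher, 60000, NULL, NULL, 0);

  if (zh == NULL) {
    return 1;
  }

  zookeeper_close(zh);
  return 0;
}
{code}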



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7795) Remove "latest" symlink after agent reboot

2017-08-15 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16127328#comment-16127328
 ] 

Ilya Pronin commented on MESOS-7795:


Review requests:
https://reviews.apache.org/r/61661/
https://reviews.apache.org/r/61662/

> Remove "latest" symlink after agent reboot
> --
>
> Key: MESOS-7795
> URL: https://issues.apache.org/jira/browse/MESOS-7795
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> Currently when the agent detects that the host was rebooted it doesn't 
> recover agent info. New agent info is not checkpointed until the agent 
> successfully registers with a master. If the agent crashes before 
> registering, on restart it will recover the old agent info that was 
> checkpointed before host reboot.
> This can lead to problems. E.g. the agent may flap due to incompatible agent 
> info, if its resources somehow change after reboot. Or the usage of the old 
> agent ID in reregistration process may cause crashes like MESOS-7432.
> We can remove the "latest" symlink when we detect that current boot ID is 
> different from the checkpointed one in order to prevent the agent from 
> recovering stale info after we checkpoint new boot ID. Or we can postpone 
> boot ID checkpointing until we checkpointed new agent info.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-7867) Master doesn't handle scheduler driver downgrade from HTTP based to PID based

2017-08-08 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118527#comment-16118527
 ] 

Ilya Pronin commented on MESOS-7867:


We can add those metrics back in {{Master::failoverFramework(Framework*, const 
UPID&)}} or just drop the metrics removal from 
{{Master::failoverFramework(Framework*, const HttpConnection&)}}. 
{{Master::addFramework()}} adds those metrics regardless of the scheduler 
driver type.

> Master doesn't handle scheduler driver downgrade from HTTP based to PID based
> -
>
> Key: MESOS-7867
> URL: https://issues.apache.org/jira/browse/MESOS-7867
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>
> When a framework upgrades from a PID based driver to an HTTP based driver, 
> master removes its per-framework-principal metrics ({{messages_received}} and 
> {{messages_processed}}) in {{Master::failoverFramework}}. When the same 
> framework downgrades back to a PID based driver, the master doesn't reinstate 
> those metrics. This causes a crash when the master receives a message from 
> the failed over framework and increments {{messages_received}} counter in 
> {{Master::visit(const MessageEvent&)}}.
> {noformat}
> I0807 18:17:45.713220 19095 master.cpp:2916] Framework 
> 70822e80-ca38-4470-916e-e6da073a4742- (TwitterScheduler) failed over
> F0807 18:18:20.725908 19079 master.cpp:1451] Check failed: 
> metrics->frameworks.contains(principal.get())
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7867) Master doesn't handle scheduler driver downgrade from HTTP based to PID based

2017-08-08 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7867:
--

 Summary: Master doesn't handle scheduler driver downgrade from 
HTTP based to PID based
 Key: MESOS-7867
 URL: https://issues.apache.org/jira/browse/MESOS-7867
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.3.0
Reporter: Ilya Pronin
Assignee: Ilya Pronin


When a framework upgrades from a PID based driver to an HTTP based driver, 
master removes its per-framework-principal metrics ({{messages_received}} and 
{{messages_processed}}) in {{Master::failoverFramework}}. When the same 
framework downgrades back to a PID based driver, the master doesn't reinstate 
those metrics. This causes a crash when the master receives a message from the 
failed over framework and increments {{messages_received}} counter in 
{{Master::visit(const MessageEvent&)}}.

{noformat}
I0807 18:17:45.713220 19095 master.cpp:2916] Framework 
70822e80-ca38-4470-916e-e6da073a4742- (TwitterScheduler) failed over
F0807 18:18:20.725908 19079 master.cpp:1451] Check failed: 
metrics->frameworks.contains(principal.get())
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7795) Remove "latest" symlink after agent reboot

2017-08-03 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin updated MESOS-7795:
---
Shepherd: Yan Xu

> Remove "latest" symlink after agent reboot
> --
>
> Key: MESOS-7795
> URL: https://issues.apache.org/jira/browse/MESOS-7795
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> Currently when the agent detects that the host was rebooted it doesn't 
> recover agent info. New agent info is not checkpointed until the agent 
> successfully registers with a master. If the agent crashes before 
> registering, on restart it will recover the old agent info that was 
> checkpointed before host reboot.
> This can lead to problems. E.g. the agent may flap due to incompatible agent 
> info, if its resources somehow change after reboot. Or the usage of the old 
> agent ID in reregistration process may cause crashes like MESOS-7432.
> We can remove the "latest" symlink when we detect that current boot ID is 
> different from the checkpointed one in order to prevent the agent from 
> recovering stale info after we checkpoint new boot ID. Or we can postpone 
> boot ID checkpointing until we checkpointed new agent info.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-7795) Remove "latest" symlink after agent reboot

2017-08-03 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-7795:
--

Assignee: Ilya Pronin

> Remove "latest" symlink after agent reboot
> --
>
> Key: MESOS-7795
> URL: https://issues.apache.org/jira/browse/MESOS-7795
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> Currently when the agent detects that the host was rebooted it doesn't 
> recover agent info. New agent info is not checkpointed until the agent 
> successfully registers with a master. If the agent crashes before 
> registering, on restart it will recover the old agent info that was 
> checkpointed before host reboot.
> This can lead to problems. E.g. the agent may flap due to incompatible agent 
> info, if its resources somehow change after reboot. Or the usage of the old 
> agent ID in reregistration process may cause crashes like MESOS-7432.
> We can remove the "latest" symlink when we detect that current boot ID is 
> different from the checkpointed one in order to prevent the agent from 
> recovering stale info after we checkpoint new boot ID. Or we can postpone 
> boot ID checkpointing until we checkpointed new agent info.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-1216) Attributes comparator operator should allow multiple attributes of same name and type

2017-07-26 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin updated MESOS-1216:
---
Shepherd: Anand Mazumdar

> Attributes comparator operator should allow multiple attributes of same name 
> and type
> -
>
> Key: MESOS-1216
> URL: https://issues.apache.org/jira/browse/MESOS-1216
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Ilya Pronin
>  Labels: tech-debt
>
> The fact that we currently don't support SET type in Attribute means that 
> operators might end up having multiple attributes (e.g., slave attributes) 
> with the same name and type but different value.
> But the comparator operator for Attributes expects unique (name, type) 
> Attributes. This results in slave recovery failure when comparing 
> checkpointed attributes with those set via flags.
> https://issues.apache.org/jira/browse/MESOS-1215 adds SET support, but for 
> backwards compatibility we should fix the comparator operator first.
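
A toy sketch of a comparator that tolerates repeated (name, type) pairs by 
treating the attribute lists as multisets; the {{Attr}} struct is a simplified 
stand-in for the real {{Attribute}} type:

{code}
#include <algorithm>
#include <string>
#include <tuple>
#include <vector>

// Simplified stand-in for a Mesos attribute.
struct Attr
{
  std::string name;
  std::string type;
  std::string value;

  bool operator<(const Attr& other) const
  {
    return std::tie(name, type, value) <
           std::tie(other.name, other.type, other.value);
  }

  bool operator==(const Attr& other) const
  {
    return name == other.name && type == other.type && value == other.value;
  }
};

// Equality as multisets: duplicates of the same (name, type) are fine as long
// as both sides contain the same attributes the same number of times.
bool equalAttributes(std::vector<Attr> a, std::vector<Attr> b)
{
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());
  return a == b;
}
{code}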



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7795) Remove "latest" symlink after agent reboot

2017-07-14 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin updated MESOS-7795:
---
Description: 
Currently when the agent detects that the host was rebooted it doesn't recover 
agent info. New agent info is not checkpointed until the agent successfully 
registers with a master. If the agent crashes before registering, on restart it 
will recover the old agent info that was checkpointed before host reboot.

This can lead to problems. E.g. the agent may flap due to incompatible agent 
info, if its resources somehow change after reboot. Or the usage of the old 
agent ID in reregistration process may cause crashes like MESOS-7432.

We can remove the "latest" symlink when we detect that current boot ID is 
different from the checkpointed one in order to prevent the agent from 
recovering stale info after we checkpoint new boot ID. Or we can postpone boot 
ID checkpointing until we checkpointed new agent info.

  was:
Currently when the agent detects that the host was rebooted it doesn't recover 
agent info. New agent info is not checkpointed until the agent successfully 
registers with a master. If the agent crashes before registering, on restart it 
will recover the old agent info that was checkpointed before host reboot.

This can lead to problems. E.g. the agent may flap due to incompatible agent 
info, if its resources somehow change after reboot. Or the usage of the old 
agent ID in reregistration process may cause crashes like MESOS-7432.

We can remove the "latest" symlink when we detect that current boot ID is 
different from the checkpointed one in order to prevent the agent from 
recovering stale info after we checkpoint new boot ID.


> Remove "latest" symlink after agent reboot
> --
>
> Key: MESOS-7795
> URL: https://issues.apache.org/jira/browse/MESOS-7795
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Ilya Pronin
>Priority: Minor
>
> Currently when the agent detects that the host was rebooted it doesn't 
> recover agent info. New agent info is not checkpointed until the agent 
> successfully registers with a master. If the agent crashes before 
> registering, on restart it will recover the old agent info that was 
> checkpointed before host reboot.
> This can lead to problems. E.g. the agent may flap due to incompatible agent 
> info, if its resources somehow change after reboot. Or the usage of the old 
> agent ID in reregistration process may cause crashes like MESOS-7432.
> We can remove the "latest" symlink when we detect that current boot ID is 
> different from the checkpointed one in order to prevent the agent from 
> recovering stale info after we checkpoint new boot ID. Or we can postpone 
> boot ID checkpointing until we checkpointed new agent info.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7795) Remove "latest" symlink after agent reboot

2017-07-14 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7795:
--

 Summary: Remove "latest" symlink after agent reboot
 Key: MESOS-7795
 URL: https://issues.apache.org/jira/browse/MESOS-7795
 Project: Mesos
  Issue Type: Improvement
  Components: agent
Reporter: Ilya Pronin
Priority: Minor


Currently when the agent detects that the host was rebooted it doesn't recover 
agent info. New agent info is not checkpointed until the agent successfully 
registers with a master. If the agent crashes before registering, on restart it 
will recover the old agent info that was checkpointed before host reboot.

This can lead to problems. E.g. the agent may flap due to incompatible agent 
info, if its resources somehow change after reboot. Or the usage of the old 
agent ID in reregistration process may cause crashes like MESOS-7432.

We can remove the "latest" symlink when we detect that current boot ID is 
different from the checkpointed one in order to prevent the agent from 
recovering stale info after we checkpoint new boot ID.
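
An illustrative sketch of the first option, using hypothetical paths and plain 
POSIX calls rather than the agent's actual recovery code:

{code}
#include <fstream>
#include <string>

#include <unistd.h>

// Read a single line, e.g. a checkpointed boot ID or the current one from
// /proc/sys/kernel/random/boot_id.
static std::string readLine(const std::string& path)
{
  std::ifstream in(path);
  std::string line;
  std::getline(in, line);
  return line;
}

int main()
{
  // Hypothetical locations; the real agent derives these from --work_dir.
  const std::string checkpointedBootId = "/var/lib/mesos/meta/boot_id";
  const std::string latestSymlink = "/var/lib/mesos/meta/slaves/latest";

  if (readLine(checkpointedBootId) !=
      readLine("/proc/sys/kernel/random/boot_id")) {
    // The host was rebooted: drop the "latest" symlink so that a crash before
    // re-registration cannot resurrect stale agent info.
    ::unlink(latestSymlink.c_str());
  }

  return 0;
}
{code}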



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.

2017-07-10 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-5361:
--

Assignee: (was: Ilya Pronin)

> Consider introducing TCP KeepAlive for Libprocess sockets.
> --
>
> Key: MESOS-5361
> URL: https://issues.apache.org/jira/browse/MESOS-5361
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> We currently don't use TCP keepalives when creating sockets in libprocess. 
> This might benefit master - scheduler and master - agent connections, i.e. we 
> can detect faster if any of them has failed.
> Currently, if the master process goes down and for some reason the {{RST}} 
> sequence did not reach the scheduler, the scheduler can only come to know 
> about the disconnection when it tries to do a {{send}} itself. 
> The default TCP keepalive values on Linux are of little use in a real world 
> application:
> {code}
> tcp_keepalive_time = 7200 (seconds), tcp_keepalive_intvl = 75 (seconds), 
> tcp_keepalive_probes = 9. This means that the keepalive routines wait for two 
> hours (7200 secs) before sending the first keepalive probe, and then resend 
> it every 75 seconds. If no ACK response is received for nine consecutive 
> times, the connection is marked as broken.
> {code}
> However, for long running instances of the scheduler/agent this can still be 
> beneficial. Also, operators might start tuning the values for their clusters 
> explicitly once we start supporting it.
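
A Linux-specific sketch of what enabling and tuning keepalives on a socket 
would involve; the timeout values are examples, not recommendations:

{code}
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enable keepalives on one socket and override the kernel-wide defaults.
// Returns 0 on success, -1 if any setsockopt() fails.
int enableKeepalive(int fd)
{
  const int enable = 1;     // Turn SO_KEEPALIVE on.
  const int idle = 60;      // Seconds of idle time before the first probe.
  const int interval = 10;  // Seconds between probes.
  const int count = 5;      // Unacked probes before the connection is broken.

  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable)) < 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval,
                 sizeof(interval)) < 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0) {
    return -1;
  }

  return 0;
}
{code}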



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-1216) Attributes comparator operator should allow multiple attributes of same name and type

2017-06-28 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-1216:
--

Assignee: Ilya Pronin

> Attributes comparator operator should allow multiple attributes of same name 
> and type
> -
>
> Key: MESOS-1216
> URL: https://issues.apache.org/jira/browse/MESOS-1216
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Ilya Pronin
>  Labels: tech-debt
>
> The fact that we currently don't support SET type in Attribute means that 
> operators might end up having multiple attributes (e.g., slave attributes) 
> with the same name and type but different value.
> But the comparator operator for Attributes expects unique (name, type) 
> Attributes. This results in slave recovery failure when comparing 
> checkpointed attributes with those set via flags.
> https://issues.apache.org/jira/browse/MESOS-1215 adds SET support, but for 
> backwards compatibility we should fix the comparator operator first.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-4092) Try to re-establish connection on ping timeouts with agent before removing it

2017-06-27 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065357#comment-16065357
 ] 

Ilya Pronin commented on MESOS-4092:


Looks like our problem here is that we use our health-check for detecting 
remote-peer failure and link failure, but don't distinguish them. When a 
connection breaks, libprocess issues {{ExitedEvent}} and opens a new connection 
when required. But in the case of a network problem a relatively long time may 
pass before TCP retransmissions limit is reached and the connection is declared 
dead.

One possible solution can be to try using the aforementioned "relink" 
functionality at some point during agent pinging. We can use a strategy similar 
to the one used by TCP: after N consecutive failed pings "relink" before 
sending the next ping. Plus a similar thing on the agent's side.

Another possible solution can be to use TCP keepalive mechanism tuned to 
"detect" broken connections faster than {{agent_ping_timeout * 
max_agent_ping_timeouts}}. Or we can mess with TCP user timeout, but IMO it's a 
road to hell and AFAIK user timeout is available only on Linux.
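
A very rough sketch of the first idea; {{relink}} here is a hypothetical hook 
standing in for libprocess's actual relink functionality:

{code}
#include <functional>
#include <utility>

// After `threshold` consecutive missed pongs, force a fresh connection before
// the next ping is sent, mirroring TCP's "retry, then give up" behaviour.
class PingTracker
{
public:
  PingTracker(int threshold, std::function<void()> relink)
    : threshold(threshold), relink(std::move(relink)), missed(0) {}

  void onPong() { missed = 0; }

  void onPingTimeout()
  {
    if (++missed == threshold) {
      // Re-establishing the TCP connection picks a new source port, so an
      // ECMP network will likely hash the flow onto a different uplink.
      relink();
    }
  }

private:
  const int threshold;
  const std::function<void()> relink;
  int missed;
};
{code}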

> Try to re-establish connection on ping timeouts with agent before removing it
> -
>
> Key: MESOS-4092
> URL: https://issues.apache.org/jira/browse/MESOS-4092
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Ian Downes
>
> The SlaveObserver will trigger an agent to be removed after 
> {{flags.max_slave_ping_timeouts}} timeouts of {{flags.slave_ping_timeout}}. 
> This can occur because of transient network failures, e.g., gray failures of 
> a switch uplink exhibiting heavy or total packet loss. Some network 
> architectures are designed to tolerate such gray failures and support 
> multiple paths between hosts. This can be implemented with equal-cost 
> multi-path routing (ECMP) where flows are hashed by their 5-tuple to multiple 
> possible uplinks. In such networks re-establishing a TCP connection will 
> almost certainly use a new source port and thus will likely be hashed to a 
> different uplink, avoiding the failed uplink and re-establishing connectivity 
> with the agent.
> After failing to receive pongs the SlaveObserver should next try to 
> re-establish a TCP connection (with exponential back-off) before declaring 
> the agent as lost. This can avoid significant disruption where large numbers 
> of agents reached through a single failed link could be removed unnecessarily 
> while still ensuring that agents that are truly lost are recognized as such.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-4092) Try to re-establish connection on ping timeouts with agent before removing it

2017-06-27 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065357#comment-16065357
 ] 

Ilya Pronin edited comment on MESOS-4092 at 6/27/17 7:43 PM:
-

Seems that our problem here is that we use our health-check for detecting 
remote-peer failure and link failure, but don't distinguish them. When a 
connection breaks, libprocess issues {{ExitedEvent}} and opens a new connection 
when required. But in the case of a network problem a relatively long time may 
pass before TCP retransmissions limit is reached and the connection is declared 
dead.

One possible solution can be to try using the aforementioned "relink" 
functionality at some point during agent pinging. We can use a strategy similar 
to the one used by TCP: after N consecutive failed pings "relink" before 
sending the next ping. Plus a similar thing on the agent's side.

Another possible solution can be to use TCP keepalive mechanism tuned to 
"detect" broken connections faster than {{agent_ping_timeout * 
max_agent_ping_timeouts}}. Or we can mess with TCP user timeout, but IMO it's a 
road to hell and AFAIK user timeout is available only on Linux.


was (Author: ipronin):
Looks like our problem here is that we use our health-check for detecting 
remote-peer failure and link failure, but don't distinguish them. When a 
connection breaks, libprocess issues {{ExitedEvent}} and opens a new connection 
when required. But in the case of a network problem a relatively long time may 
pass before TCP retransmissions limit is reached and the connection is declared 
dead.

One possible solution can be to try using the aforementioned "relink" 
functionality at some point during agent pinging. We can use a strategy similar 
to the one used by TCP: after N consecutive failed pings "relink" before 
sending the next ping. Plus a similar thing on the agent's side.

Another possible solution can be to use TCP keepalive mechanism tuned to 
"detect" broken connections faster than {{agent_ping_timeout * 
max_agent_ping_timeouts}}. Or we can mess with TCP user timeout, but IMO it's a 
road to hell and AFAIK user timeout is available only on Linux.

> Try to re-establish connection on ping timeouts with agent before removing it
> -
>
> Key: MESOS-4092
> URL: https://issues.apache.org/jira/browse/MESOS-4092
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 0.25.0
>Reporter: Ian Downes
>
> The SlaveObserver will trigger an agent to be removed after 
> {{flags.max_slave_ping_timeouts}} timeouts of {{flags.slave_ping_timeout}}. 
> This can occur because of transient network failures, e.g., gray failures of 
> a switch uplink exhibiting heavy or total packet loss. Some network 
> architectures are designed to tolerate such gray failures and support 
> multiple paths between hosts. This can be implemented with equal-cost 
> multi-path routing (ECMP) where flows are hashed by their 5-tuple to multiple 
> possible uplinks. In such networks re-establishing a TCP connection will 
> almost certainly use a new source port and thus will likely be hashed to a 
> different uplink, avoiding the failed uplink and re-establishing connectivity 
> with the agent.
> After failing to receive pongs the SlaveObserver should next try to 
> re-establish a TCP connection (with exponential back-off) before declaring 
> the agent as lost. This can avoid significant disruption where large numbers 
> of agents reached through a single failed link could be removed unnecessarily 
> while still ensuring that agents that are truly lost are recognized as such.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.

2017-06-26 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-5361:
--

Assignee: Ilya Pronin

> Consider introducing TCP KeepAlive for Libprocess sockets.
> --
>
> Key: MESOS-5361
> URL: https://issues.apache.org/jira/browse/MESOS-5361
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Anand Mazumdar
>Assignee: Ilya Pronin
>  Labels: mesosphere
>
> We currently don't use TCP keepalives when creating sockets in libprocess. 
> This might benefit master - scheduler and master - agent connections, i.e. we 
> can detect faster if any of them has failed.
> Currently, if the master process goes down and for some reason the {{RST}} 
> sequence did not reach the scheduler, the scheduler can only come to know 
> about the disconnection when it tries to do a {{send}} itself. 
> The default TCP keepalive values on Linux are of little use in a real world 
> application:
> {code}
> tcp_keepalive_time = 7200 (seconds), tcp_keepalive_intvl = 75 (seconds), 
> tcp_keepalive_probes = 9. This means that the keepalive routines wait for two 
> hours (7200 secs) before sending the first keepalive probe, and then resend 
> it every 75 seconds. If no ACK response is received for nine consecutive 
> times, the connection is marked as broken.
> {code}
> However, for long running instances of the scheduler/agent this can still be 
> beneficial. Also, operators might start tuning the values for their clusters 
> explicitly once we start supporting it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7688) Improve master failover performance by reducing unnecessary agent retries.

2017-06-22 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059531#comment-16059531
 ] 

Ilya Pronin edited comment on MESOS-7688 at 6/22/17 3:59 PM:
-

Attached a perf script ([^reregistration.perf.gz]) and a flamegraph 
([^reregistration.svg]) for a 2 minute sample of 33k agents reregistration 
after master (1.2.0 with [r58355|https://reviews.apache.org/r/58355/] 
backported) failover. 


was (Author: ipronin):
Attached a perf script ([^reregistration.perf.gz]) and a flamegraph 
([^reregistration.svg]) for a 2 minute sample of 33k agents reregistration 
after master failover.

> Improve master failover performance by reducing unnecessary agent retries.
> --
>
> Key: MESOS-7688
> URL: https://issues.apache.org/jira/browse/MESOS-7688
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Benjamin Mahler
>  Labels: scalability
> Attachments: 1.2.0.png, reregistration.perf.gz, reregistration.svg
>
>
> Currently, during a failover the agents will (re-)register with the master. 
> While the master is recovering, the master may drop messages from the agents, 
> and so the agents must retry registration using a backoff mechanism. For 
> large clusters, there can be a lot of overhead in processing unnecessary 
> retries from the agents, given that these messages must be deserialized and 
> contain all of the task / executor information many times over.
> In order to reduce this overhead, the idea is to avoid the need for agents to 
> blindly retry (re-)registration with the master. Two approaches for this are:
> (1) Update the MasterInfo in ZK when the master is recovered. This is a bit 
> of an abuse of MasterInfo unfortunately, but the idea is for agents to only 
> (re-)register when they see that the master reaches a recovered state. Once 
> recovered, the master will not drop messages, and therefore agents only need 
> to retry when the connection breaks.
> (2) Have the master reply with a retry message when it's in the recovering 
> state, so that agents get a clear signal that their messages were dropped. 
> The agents only retry when the connection breaks or they get a retry message. 
> This one is less optimal, because the master may have to process a lot of 
> messages and send retries, but once the master is recovered, the master will 
> process only a single (re-)registration from each agent. The number of 
> (re-)registrations that occur while the master is recovering can be reduced 
> to 1 in this approach if the master sends the retry message only after the 
> master completes recovery.
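
For context, the retries being discussed come from a capped exponential 
backoff on the agent side; a simplified sketch of that pattern (the constants 
are illustrative, not the agent's actual flags):

{code}
#include <algorithm>
#include <chrono>
#include <random>

// Delay before the given (0-based) registration attempt: a random value in
// [0, min(base * 2^attempt, cap)], so retries spread out over time.
std::chrono::milliseconds registrationBackoff(int attempt)
{
  const long long base = 500;    // Milliseconds.
  const long long cap = 60000;   // Milliseconds.

  const long long ceiling = std::min(cap, base << std::min(attempt, 20));

  static std::mt19937_64 rng(std::random_device{}());
  std::uniform_int_distribution<long long> dist(0, ceiling);

  return std::chrono::milliseconds(dist(rng));
}
{code}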



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (MESOS-7688) Improve master failover performance by reducing unnecessary agent retries.

2017-06-22 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16059531#comment-16059531
 ] 

Ilya Pronin edited comment on MESOS-7688 at 6/22/17 3:32 PM:
-

Attached a perf script ([^reregistration.perf.gz]) and a flamegraph 
([^reregistration.svg]) for a 2 minute sample of 33k agents reregistration 
after master failover.


was (Author: ipronin):
Attached a perf script ([^reregistration.perf.gz]) and a flamegraph 
([^reregistration.svg]) for a 2 minute sample of agents reregistration after 
master failover.

> Improve master failover performance by reducing unnecessary agent retries.
> --
>
> Key: MESOS-7688
> URL: https://issues.apache.org/jira/browse/MESOS-7688
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Benjamin Mahler
>  Labels: scalability
> Attachments: 1.2.0.png, reregistration.perf.gz, reregistration.svg
>
>
> Currently, during a failover the agents will (re-)register with the master. 
> While the master is recovering, the master may drop messages from the agents, 
> and so the agents must retry registration using a backoff mechanism. For 
> large clusters, there can be a lot of overhead in processing unnecessary 
> retries from the agents, given that these messages must be deserialized and 
> contain all of the task / executor information many times over.
> In order to reduce this overhead, the idea is to avoid the need for agents to 
> blindly retry (re-)registration with the master. Two approaches for this are:
> (1) Update the MasterInfo in ZK when the master is recovered. This is a bit 
> of an abuse of MasterInfo unfortunately, but the idea is for agents to only 
> (re-)register when they see that the master reaches a recovered state. Once 
> recovered, the master will not drop messages, and therefore agents only need 
> to retry when the connection breaks.
> (2) Have the master reply with a retry message when it's in the recovering 
> state, so that agents get a clear signal that their messages were dropped. 
> The agents only retry when the connection breaks or they get a retry message. 
> This one is less optimal, because the master may have to process a lot of 
> messages and send retries, but once the master is recovered, the master will 
> process only a single (re-)registration from each agent. The number of 
> (re-)registrations that occur while the master is recovering can be reduced 
> to 1 in this approach if the master sends the retry message only after the 
> master completes recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MESOS-7688) Improve master failover performance by reducing unnecessary agent retries.

2017-06-22 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin updated MESOS-7688:
---
Attachment: reregistration.svg
reregistration.perf.gz

Attached a perf script ([^reregistration.perf.gz]) and a flamegraph 
([^reregistration.svg]) for a 2 minute sample of agents reregistration after 
master failover.

> Improve master failover performance by reducing unnecessary agent retries.
> --
>
> Key: MESOS-7688
> URL: https://issues.apache.org/jira/browse/MESOS-7688
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, master
>Reporter: Benjamin Mahler
>  Labels: scalability
> Attachments: 1.2.0.png, reregistration.perf.gz, reregistration.svg
>
>
> Currently, during a failover the agents will (re-)register with the master. 
> While the master is recovering, the master may drop messages from the agents, 
> and so the agents must retry registration using a backoff mechanism. For 
> large clusters, there can be a lot of overhead in processing unnecessary 
> retries from the agents, given that these messages must be deserialized and 
> contain all of the task / executor information many times over.
> In order to reduce this overhead, the idea is to avoid the need for agents to 
> blindly retry (re-)registration with the master. Two approaches for this are:
> (1) Update the MasterInfo in ZK when the master is recovered. This is a bit 
> of an abuse of MasterInfo unfortunately, but the idea is for agents to only 
> (re-)register when they see that the master reaches a recovered state. Once 
> recovered, the master will not drop messages, and therefore agents only need 
> to retry when the connection breaks.
> (2) Have the master reply with a retry message when it's in the recovering 
> state, so that agents get a clear signal that their messages were dropped. 
> The agents only retry when the connection breaks or they get a retry message. 
> This one is less optimal, because the master may have to process a lot of 
> messages and send retries, but once the master is recovered, the master will 
> process only a single (re-)registration from each agent. The number of 
> (re-)registrations that occur while the master is recovering can be reduced 
> to 1 in this approach if the master sends the retry message only after the 
> master completes recovery.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MESOS-7698) Libprocess doesn't handle IP changes

2017-06-20 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7698:
--

 Summary: Libprocess doesn't handle IP changes
 Key: MESOS-7698
 URL: https://issues.apache.org/jira/browse/MESOS-7698
 Project: Mesos
  Issue Type: Bug
  Components: libprocess
Affects Versions: 1.2.0
Reporter: Ilya Pronin


If a host IP address changes, libprocess will never learn about it and will 
continue to send messages "from" the old IP.

This will cause weird situations. E.g. an agent will indefinitely try to 
reregister with a master pretending that it can be reached by an old IP. The 
master will send {{SlaveReregisteredMessage}} to the wrong host (potentially a 
different agent), using an IP from the {{User-Agent: libprocess/*}} header.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (MESOS-6971) Use arena allocation to improve protobuf message passing performance.

2017-05-27 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-6971:
--

Assignee: Ilya Pronin

> Use arena allocation to improve protobuf message passing performance.
> -
>
> Key: MESOS-6971
> URL: https://issues.apache.org/jira/browse/MESOS-6971
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Mahler
>Assignee: Ilya Pronin
>  Labels: tech-debt
>
> The protobuf message passing provided by {{ProtobufProcess}} provide const 
> access of the message and/or its fields to the handler function.
> This means that we can leverage the [arena 
> allocator|https://developers.google.com/protocol-buffers/docs/reference/arenas]
>  provided by protobuf to reduce the memory allocation cost during 
> de-serialization and improve cache efficiency.
> This would require using protobuf 3.x with "proto2" syntax (which appears to 
> be the default if unspecified) to maintain our existing "proto2" 
> requirements. The upgrade to protobuf 3.x while keeping "proto2" syntax 
> should be tackled via a separate ticket that blocks this one.
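
For illustration, a minimal sketch of arena-backed parsing with the protobuf
arena API; {{ExampleMessage}} and the {{example.pb.h}} header are stand-ins for
whatever generated message a handler receives, not an existing Mesos type.
{code:cpp}
#include <google/protobuf/arena.h>

#include <string>

#include "example.pb.h" // assumed generated header defining ExampleMessage

// Parse into an arena-owned message so the whole object graph is released in
// one shot when the arena is destroyed, instead of per-field heap frees.
const ExampleMessage* parseWithArena(
    google::protobuf::Arena* arena,
    const std::string& data)
{
  auto* message =
    google::protobuf::Arena::CreateMessage<ExampleMessage>(arena);

  return message->ParseFromString(data) ? message : nullptr;
}
{code}
With "proto2" syntax the .proto file also needs {{option cc_enable_arenas = 
true;}} for the generated message to actually be arena-allocated.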



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7461) balloon test and disk full framework test relies on possibly unavailable ports

2017-05-05 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998289#comment-15998289
 ] 

Ilya Pronin commented on MESOS-7461:


We solved this problem in our environment by randomizing the port number :)

> balloon test and disk full framework test relies on possibly unavailable ports
> --
>
> Key: MESOS-7461
> URL: https://issues.apache.org/jira/browse/MESOS-7461
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Reporter: Zhitao Li
>
> balloon_framework_test.sh and disk_full_framework_test.sh both have code to 
> listen directly on port {{5432}}, but in our environment that port is 
> reserved by something else.
> A possible fix is to write some utility to try to find an unused port, and 
> try to use it for the master. It's not perfect though as there could still be 
> a race condition.
> Another possible fix is to move the listen "port" to a domain socket, when 
> that's supported.
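
For what it's worth, a minimal sketch of such a "find an unused port" utility,
which binds to port 0 and lets the kernel pick; as noted above, this is still
racy between closing the socket and the master binding to the chosen port.
{code:cpp}
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <iostream>

// Ask the kernel for a currently unused TCP port on the loopback interface.
int findUnusedPort() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = 0; // let the kernel pick an ephemeral port

  socklen_t len = sizeof(addr);
  int port = -1;
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), len) == 0 &&
      getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) == 0) {
    port = ntohs(addr.sin_port);
  }

  close(fd);
  return port;
}

int main() {
  std::cout << "Picked port " << findUnusedPort() << std::endl;
}
{code}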



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7387) ZK master contender and detector don't respect zk_session_timeout option

2017-04-18 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972777#comment-15972777
 ] 

Ilya Pronin commented on MESOS-7387:


Added one more patch that updates the ZK session timeouts part of the high 
availability doc: https://reviews.apache.org/r/58506/

> ZK master contender and detector don't respect zk_session_timeout option
> 
>
> Key: MESOS-7387
> URL: https://issues.apache.org/jira/browse/MESOS-7387
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{ZooKeeperMasterContender}} and {{ZooKeeperMasterDetector}} are using 
> hardcoded ZK session timeouts ({{MASTER_CONTENDER_ZK_SESSION_TIMEOUT}} and 
> {{MASTER_DETECTOR_ZK_SESSION_TIMEOUT}}) and do not respect 
> {{--zk_session_timeout}} master option. This is unexpected and doesn't play 
> well with ZK updates that take longer than 10 secs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7387) ZK master contender and detector don't respect zk_session_timeout option

2017-04-13 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967767#comment-15967767
 ] 

Ilya Pronin commented on MESOS-7387:


Review request: https://reviews.apache.org/r/58421/

[~vinodkone], [~bmahler] can you shepherd this please?

> ZK master contender and detector don't respect zk_session_timeout option
> 
>
> Key: MESOS-7387
> URL: https://issues.apache.org/jira/browse/MESOS-7387
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{ZooKeeperMasterContender}} and {{ZooKeeperMasterDetector}} are using 
> hardcoded ZK session timeouts ({{MASTER_CONTENDER_ZK_SESSION_TIMEOUT}} and 
> {{MASTER_DETECTOR_ZK_SESSION_TIMEOUT}}) and do not respect 
> {{--zk_session_timeout}} master option. This is unexpected and doesn't play 
> well with ZK updates that take longer than 10 secs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7387) ZK master contender and detector don't respect zk_session_timeout option

2017-04-13 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7387:
--

 Summary: ZK master contender and detector don't respect 
zk_session_timeout option
 Key: MESOS-7387
 URL: https://issues.apache.org/jira/browse/MESOS-7387
 Project: Mesos
  Issue Type: Improvement
  Components: master
Affects Versions: 1.3.0
Reporter: Ilya Pronin
Assignee: Ilya Pronin
Priority: Minor


{{ZooKeeperMasterContender}} and {{ZooKeeperMasterDetector}} are using 
hardcoded ZK session timeouts ({{MASTER_CONTENDER_ZK_SESSION_TIMEOUT}} and 
{{MASTER_DETECTOR_ZK_SESSION_TIMEOUT}}) and do not respect 
{{--zk_session_timeout}} master option. This is unexpected and doesn't play 
well with ZK updates that take longer than 10 secs.
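
As a rough sketch of the intended behaviour (the flag value wins over the
hardcoded constants), with {{Contender}} as a hypothetical stand-in for
{{ZooKeeperMasterContender}}/{{ZooKeeperMasterDetector}} and a plain number of
seconds instead of Mesos' Duration strings:
{code:cpp}
#include <chrono>
#include <iostream>
#include <string>

// Hypothetical stand-in: the point is only that the session timeout is a
// constructor parameter rather than a compile-time constant.
struct Contender {
  explicit Contender(std::chrono::seconds sessionTimeout)
    : sessionTimeout(sessionTimeout) {}

  std::chrono::seconds sessionTimeout;
};

int main(int argc, char** argv) {
  // Default mirrors the 10 second constant mentioned in the ticket.
  std::chrono::seconds timeout(10);
  if (argc > 1) {
    timeout = std::chrono::seconds(std::stol(argv[1])); // e.g. "30"
  }

  Contender contender(timeout);
  std::cout << "ZK session timeout: " << contender.sessionTimeout.count()
            << "s" << std::endl;
}
{code}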



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7376) Long registry updates when the number of agents is high

2017-04-11 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964547#comment-15964547
 ] 

Ilya Pronin edited comment on MESOS-7376 at 4/11/17 3:49 PM:
-

Review request: https://reviews.apache.org/r/58355/

{{Registry}} is being copied as a part of 
{{state::protobuf::State::Variable}} or as a bound parameter in 
{{state::protobuf::State::store()}}. The former can be mitigated by adding move 
support to {{Variable}} (using {{Swap()}} for protobuf message). The latter - 
by using {{Owned}}. But that's not enough because {{Variable}} will still be 
copied in return value propagation through the {{Future}}-s chain. So in my 
patch I bypassed {{state::protobuf::State}}.
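
For illustration, a simplified stand-in (not the actual
{{state::protobuf::State::Variable}}) showing how move support via protobuf
{{Swap()}} avoids deep-copying a large {{Registry}}-like message; the
{{registry.pb.h}} header and type names are assumptions for the sketch.
{code:cpp}
#include "registry.pb.h" // assumed generated header defining Registry

// Simplified Variable-like wrapper: moves steal the parsed message via
// Swap() instead of copying the whole object graph.
class Variable {
public:
  // Callers hand over ownership, e.g. Variable v(std::move(parsedRegistry));
  explicit Variable(Registry registry) {
    registry_.Swap(&registry);
  }

  Variable(Variable&& that) noexcept {
    registry_.Swap(&that.registry_);
  }

  Variable& operator=(Variable&& that) noexcept {
    if (this != &that) {
      registry_.Clear();
      registry_.Swap(&that.registry_);
    }
    return *this;
  }

private:
  Registry registry_;
};
{code}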

[~bmahler] can you shepherd this please?

h4. Benchmark results
Before:
{noformat}
I0411 10:04:11.726016 11802 registrar.cpp:508] Successfully updated the 
registry in 89.478144ms
I0411 10:04:13.860827 11803 registrar.cpp:508] Successfully updated the 
registry in 216.688896ms
I0411 10:04:15.167768 11803 registrar.cpp:508] Successfully updated the 
registry in 1.29364096secs
I0411 10:04:18.967394 11803 registrar.cpp:508] Successfully updated the 
registry in 3.696552192secs
I0411 10:04:25.631009 11803 registrar.cpp:508] Successfully updated the 
registry in 6.267425024secs
I0411 10:04:42.625507 11803 registrar.cpp:508] Successfully updated the 
registry in 15.876419072secs
I0411 10:04:44.209377 11787 registrar_tests.cpp:1262] Admitted 5 agents in 
30.479743816secs
I0411 10:05:04.446650 11820 registrar.cpp:508] Successfully updated the 
registry in 18.338545152secs
I0411 10:05:21.171001 11820 registrar.cpp:508] Successfully updated the 
registry in 15.31903872secs
I0411 10:05:37.592319 11820 registrar.cpp:508] Successfully updated the 
registry in 14.863101952secs
I0411 10:05:39.099174 11787 registrar_tests.cpp:1276] Marked 5 agents 
reachable in 53.593596102secs
../../src/tests/registrar_tests.cpp:1287: Failure
Failed to wait 15secs for registry
{noformat}
After:
{noformat}
I0411 15:19:12.228904 40643 registrar.cpp:524] Successfully updated the 
registry in 91.262208ms
I0411 15:19:14.543190 40660 registrar.cpp:524] Successfully updated the 
registry in 377.45408ms
I0411 15:19:15.707006 40660 registrar.cpp:524] Successfully updated the 
registry in 1.138724096secs
I0411 15:19:18.267305 40660 registrar.cpp:524] Successfully updated the 
registry in 2.466145792secs
I0411 15:19:19.092073 40660 registrar.cpp:524] Successfully updated the 
registry in 523.11296ms
I0411 15:19:20.809330 40648 registrar.cpp:524] Successfully updated the 
registry in 892.141824ms
I0411 15:19:21.194135 40622 registrar_tests.cpp:1262] Admitted 5 agents in 
6.938952085secs
I0411 15:19:26.973904 40637 registrar.cpp:524] Successfully updated the 
registry in 3.938064128secs
I0411 15:19:28.631865 40637 registrar.cpp:524] Successfully updated the 
registry in 1.116326144secs
I0411 15:19:30.222944 40660 registrar.cpp:524] Successfully updated the 
registry in 911.86688ms
I0411 15:19:30.678509 40622 registrar_tests.cpp:1276] Marked 5 agents 
reachable in 8.249523305secs
I0411 15:19:35.138797 40645 registrar.cpp:524] Successfully updated the 
registry in 815.439104ms
I0411 15:19:41.783651 40622 registrar_tests.cpp:1288] Recovered 5 agents 
(8238915B) in 10.963297677secs
I0411 15:19:47.431670 40657 registrar.cpp:524] Successfully updated the 
registry in 3.960920064secs
I0411 15:20:13.769872 40657 registrar.cpp:524] Successfully updated the 
registry in 1.169234944secs
I0411 15:21:49.685801 40657 registrar.cpp:524] Successfully updated the 
registry in 264.850688ms
Removed 5 agents in 2.12256788111667mins
{noformat}

Similar picture in scale testing:
{noformat}
I0411 13:04:27.598438 41549 registrar.cpp:537] Successfully updated the 
registry in 2.68846208secs
I0411 13:04:30.716615 41552 registrar.cpp:537] Successfully updated the 
registry in 2.61457792secs
I0411 13:04:33.828133 41554 registrar.cpp:537] Successfully updated the 
registry in 2.644827904secs
I0411 13:04:37.634577 41553 registrar.cpp:537] Successfully updated the 
registry in 3.338414848secs
I0411 13:04:40.723475 41546 registrar.cpp:537] Successfully updated the 
registry in 2.629734144secs
{noformat}


was (Author: ipronin):
Review request: https://reviews.apache.org/r/58355/

{{Registry}} is being copied as a part of 
{{state::protobuf::State::Variable}} or as a bound parameter in 
{{state::protobuf::State::store()}}. The former can be mitigated by adding move 
support to {{Variable}} (using {{Swap()}} for protobuf message). The latter - 
by using {{Owned}}. But that's not enough because {{Variable}} will still be 
copied in return value propagation through the {{Future}}-s chain. So in my 
patch I bypassed {{state::protobuf::State}}.

[~bmahler] can you shepherd this please?

h4. Benchmark results
Before:
{noformat}
I0411 10:04:11.726016 11802 registrar.cpp:508] 

[jira] [Commented] (MESOS-7376) Long registry updates when the number of agents is high

2017-04-11 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964547#comment-15964547
 ] 

Ilya Pronin commented on MESOS-7376:


Review request: https://reviews.apache.org/r/58355/

{{Registry}} is being copied as a part of 
{{state::protobuf::State::Variable}} or as a bound parameter in 
{{state::protobuf::State::store()}}. The former can be mitigated by adding move 
support to {{Variable}} (using {{Swap()}} for protobuf message). The latter - 
by using {{Owned}}. But that's not enough because {{Variable}} will still be 
copied in return value propagation through the {{Future}}-s chain. So in my 
patch I bypassed {{state::protobuf::State}}.

[~bmahler] can you shepherd this please?

h4. Benchmark results
Before:
{noformat}
I0411 10:04:11.726016 11802 registrar.cpp:508] Successfully updated the 
registry in 89.478144ms
I0411 10:04:13.860827 11803 registrar.cpp:508] Successfully updated the 
registry in 216.688896ms
I0411 10:04:15.167768 11803 registrar.cpp:508] Successfully updated the 
registry in 1.29364096secs
I0411 10:04:18.967394 11803 registrar.cpp:508] Successfully updated the 
registry in 3.696552192secs
I0411 10:04:25.631009 11803 registrar.cpp:508] Successfully updated the 
registry in 6.267425024secs
I0411 10:04:42.625507 11803 registrar.cpp:508] Successfully updated the 
registry in 15.876419072secs
I0411 10:04:44.209377 11787 registrar_tests.cpp:1262] Admitted 5 agents in 
30.479743816secs
I0411 10:05:04.446650 11820 registrar.cpp:508] Successfully updated the 
registry in 18.338545152secs
I0411 10:05:21.171001 11820 registrar.cpp:508] Successfully updated the 
registry in 15.31903872secs
I0411 10:05:37.592319 11820 registrar.cpp:508] Successfully updated the 
registry in 14.863101952secs
I0411 10:05:39.099174 11787 registrar_tests.cpp:1276] Marked 5 agents 
reachable in 53.593596102secs
../../src/tests/registrar_tests.cpp:1287: Failure
Failed to wait 15secs for registry
{noformat}
After:
{noformat}
I0411 15:19:12.228904 40643 registrar.cpp:524] Successfully updated the 
registry in 91.262208ms
I0411 15:19:14.543190 40660 registrar.cpp:524] Successfully updated the 
registry in 377.45408ms
I0411 15:19:15.707006 40660 registrar.cpp:524] Successfully updated the 
registry in 1.138724096secs
I0411 15:19:18.267305 40660 registrar.cpp:524] Successfully updated the 
registry in 2.466145792secs
I0411 15:19:19.092073 40660 registrar.cpp:524] Successfully updated the 
registry in 523.11296ms
I0411 15:19:20.809330 40648 registrar.cpp:524] Successfully updated the 
registry in 892.141824ms
I0411 15:19:21.194135 40622 registrar_tests.cpp:1262] Admitted 5 agents in 
6.938952085secs
I0411 15:19:26.973904 40637 registrar.cpp:524] Successfully updated the 
registry in 3.938064128secs
I0411 15:19:28.631865 40637 registrar.cpp:524] Successfully updated the 
registry in 1.116326144secs
I0411 15:19:30.222944 40660 registrar.cpp:524] Successfully updated the 
registry in 911.86688ms
I0411 15:19:30.678509 40622 registrar_tests.cpp:1276] Marked 5 agents 
reachable in 8.249523305secs
I0411 15:19:35.138797 40645 registrar.cpp:524] Successfully updated the 
registry in 815.439104ms
I0411 15:19:41.783651 40622 registrar_tests.cpp:1288] Recovered 5 agents 
(8238915B) in 10.963297677secs
I0411 15:19:47.431670 40657 registrar.cpp:524] Successfully updated the 
registry in 3.960920064secs
I0411 15:20:13.769872 40657 registrar.cpp:524] Successfully updated the 
registry in 1.169234944secs
I0411 15:21:49.685801 40657 registrar.cpp:524] Successfully updated the 
registry in 264.850688ms
Removed 5 agents in 2.12256788111667mins
{noformat}

> Long registry updates when the number of agents is high
> ---
>
> Key: MESOS-7376
> URL: https://issues.apache.org/jira/browse/MESOS-7376
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>
> During scale testing we discovered that as the number of registered agents 
> grows the time it takes to update the registry grows to unacceptable values 
> very fast. At some point it starts exceeding {{registry_store_timeout}} which 
> doesn't fire.
> With 55k agents we saw this ({{registry_store_timeout=20secs}}):
> {noformat}
> I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in 
> 3.138843387secs; attempting to update the registry
> I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in 
> 74461ns
> I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
> I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at 
> position=6420881 in 2.41043644secs
> I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has 
> finished in 2.428189561secs (b=1)
> I0331 

[jira] [Created] (MESOS-7376) Long registry updates when the number of agents is high

2017-04-11 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7376:
--

 Summary: Long registry updates when the number of agents is high
 Key: MESOS-7376
 URL: https://issues.apache.org/jira/browse/MESOS-7376
 Project: Mesos
  Issue Type: Improvement
  Components: master
Affects Versions: 1.3.0
Reporter: Ilya Pronin
Assignee: Ilya Pronin


During scale testing we discovered that as the number of registered agents 
grows the time it takes to update the registry grows to unacceptable values 
very fast. At some point it starts exceeding {{registry_store_timeout}} which 
doesn't fire.

With 55k agents we saw this ({{registry_store_timeout=20secs}}):
{noformat}
I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in 
3.138843387secs; attempting to update the registry
I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in 
74461ns
I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at 
position=6420881 in 2.41043644secs
I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has 
finished in 2.428189561secs (b=1)
I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the 
registry in 34.971944192secs
{noformat}

This is caused by repeated {{Registry}} copying which involves copying a big 
object graph that takes roughly 0.4 sec (with 55k agents).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-5172) Registry puller cannot fetch blobs correctly from http Redirect 3xx urls.

2017-04-11 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964075#comment-15964075
 ] 

Ilya Pronin edited comment on MESOS-5172 at 4/11/17 2:36 PM:
-

This issue looks related: https://issues.apache.org/jira/browse/MESOS-6561


was (Author: ipronin):
This issue can be related: https://issues.apache.org/jira/browse/MESOS-6561

> Registry puller cannot fetch blobs correctly from http Redirect 3xx urls.
> -
>
> Key: MESOS-5172
> URL: https://issues.apache.org/jira/browse/MESOS-5172
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: containerizer, mesosphere
>
> When the registry puller is pulling a private repository from some private 
> registry (e.g., quay.io), errors may occur when fetching blobs, even though 
> fetching the manifest of the repo finishes correctly. The error message is 
> `Unexpected HTTP response '400 Bad Request' when trying to download the 
> blob`. This may arise from the logic of fetching blobs, or from an incorrect 
> URI format when requesting blobs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5172) Registry puller cannot fetch blobs correctly from http Redirect 3xx urls.

2017-04-11 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15964075#comment-15964075
 ] 

Ilya Pronin commented on MESOS-5172:


This issue can be related: https://issues.apache.org/jira/browse/MESOS-6561

> Registry puller cannot fetch blobs correctly from http Redirect 3xx urls.
> -
>
> Key: MESOS-5172
> URL: https://issues.apache.org/jira/browse/MESOS-5172
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: containerizer, mesosphere
>
> When the registry puller is pulling a private repository from some private 
> registry (e.g., quay.io), errors may occur when fetching blobs, even though 
> fetching the manifest of the repo finishes correctly. The error message is 
> `Unexpected HTTP response '400 Bad Request' when trying to download the 
> blob`. This may arise from the logic of fetching blobs, or from an incorrect 
> URI format when requesting blobs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-6127) Implement support for HTTP/2

2017-03-25 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin reassigned MESOS-6127:
--

Assignee: Ilya Pronin

> Implement support for HTTP/2
> -
>
> Key: MESOS-6127
> URL: https://issues.apache.org/jira/browse/MESOS-6127
> Project: Mesos
>  Issue Type: Epic
>  Components: HTTP API, libprocess
>Reporter: Aaron Wood
>Assignee: Ilya Pronin
>  Labels: performance
>
> HTTP/2 will allow us to take advantage of connection multiplexing, header 
> compression, streams, server push, etc. Add support for communication over 
> HTTP/2 between masters and agents, framework endpoints, etc.
> Should we support HTTP/2 without TLS? The spec allows for this but most major 
> browser vendors, libraries, and implementations aren't supporting it unless 
> TLS is used. If we do require TLS, what can be done to reduce the performance 
> hit of the TLS handshake? Might need to change more code to make sure that we 
> are taking advantage of connection sharing so that we can (ideally) only ever 
> have a one-time TLS handshake per shared connection.
> Some ideas for libs:
> https://nghttp2.org/documentation/package_README.html - Has encoders/decoders 
> supporting HPACK https://nghttp2.org/documentation/tutorial-hpack.html
> https://nghttp2.org/documentation/libnghttp2_asio.html - Currently marked as 
> experimental by the nghttp2 docs



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7281) Backwards incompatible UpdateFrameworkMessage handling

2017-03-22 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936154#comment-15936154
 ] 

Ilya Pronin edited comment on MESOS-7281 at 3/22/17 12:47 PM:
--

Review requests:
https://reviews.apache.org/r/57836/
https://reviews.apache.org/r/57838

[~mcypark] since you wrote the original patch, can you shepherd this please?


was (Author: ipronin):
Review request: https://reviews.apache.org/r/57836/

[~mcypark] since you wrote the original patch, can you shepherd this please?

> Backwards incompatible UpdateFrameworkMessage handling
> --
>
> Key: MESOS-7281
> URL: https://issues.apache.org/jira/browse/MESOS-7281
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>
> Patch in [r/57108|https://reviews.apache.org/r/57108/] introduced framework 
> info updates. Agents are using a new {{framework_info}} field without 
> checking that it's present. If a patched agent is used with an unpatched 
> master, it will get a default-initialized {{framework_info}}. This will cause 
> agent failures later, e.g. an abort on framework ID validation when it tries 
> to launch a new task for the updated framework.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7281) Backwards incompatible UpdateFrameworkMessage handling

2017-03-22 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936154#comment-15936154
 ] 

Ilya Pronin commented on MESOS-7281:


Review request: https://reviews.apache.org/r/57836/

[~mcypark] since you wrote the original patch, can you shepherd this please?

> Backwards incompatible UpdateFrameworkMessage handling
> --
>
> Key: MESOS-7281
> URL: https://issues.apache.org/jira/browse/MESOS-7281
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>
> Patch in [r/57108|https://reviews.apache.org/r/57108/] introduced framework 
> info updates. Agents are using a new {{framework_info}} field without 
> checking that it's present. If a patched agent is used with an unpatched 
> master, it will get a default-initialized {{framework_info}}. This will cause 
> agent failures later, e.g. an abort on framework ID validation when it tries 
> to launch a new task for the updated framework.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7281) Backwards incompatible UpdateFrameworkMessage handling

2017-03-22 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin updated MESOS-7281:
---
Shepherd: Michael Park

> Backwards incompatible UpdateFrameworkMessage handling
> --
>
> Key: MESOS-7281
> URL: https://issues.apache.org/jira/browse/MESOS-7281
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>
> Patch in [r/57108|https://reviews.apache.org/r/57108/] introduced framework 
> info updates. Agents are using a new {{framework_info}} field without 
> checking that it's present. If a patched agent is used with an unpatched 
> master, it will get a default-initialized {{framework_info}}. This will cause 
> agent failures later, e.g. an abort on framework ID validation when it tries 
> to launch a new task for the updated framework.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7281) Backwards incompatible UpdateFrameworkMessage handling

2017-03-22 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7281:
--

 Summary: Backwards incompatible UpdateFrameworkMessage handling
 Key: MESOS-7281
 URL: https://issues.apache.org/jira/browse/MESOS-7281
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Ilya Pronin
Assignee: Ilya Pronin


Patch in [r/57108|https://reviews.apache.org/r/57108/] introduced framework 
info updates. Agents are using a new {{framework_info}} field without checking 
that it's present. If a patched agent is used with an unpatched master, it 
will get a default-initialized {{framework_info}}. This will cause agent 
failures later, e.g. an abort on framework ID validation when it tries to 
launch a new task for the updated framework.
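
For illustration, a minimal sketch of the backwards-compatible check, using a
toy stand-in for the protobuf message (the real {{UpdateFrameworkMessage}}
would expose a generated {{has_framework_info()}} accessor for the optional
field):
{code:cpp}
#include <iostream>
#include <optional>
#include <string>

struct FrameworkInfo { std::string id; };

// Toy stand-in for the protobuf message; only the presence check matters.
struct UpdateFrameworkMessage {
  std::optional<FrameworkInfo> framework_info;
  bool has_framework_info() const { return framework_info.has_value(); }
};

void handleUpdateFramework(const UpdateFrameworkMessage& message) {
  if (message.has_framework_info()) {
    std::cout << "Applying update for framework "
              << message.framework_info->id << std::endl;
  } else {
    // Older master never set the field: don't act on a default-initialized
    // FrameworkInfo.
    std::cout << "Ignoring framework_info from a pre-update master"
              << std::endl;
  }
}

int main() {
  handleUpdateFramework(UpdateFrameworkMessage{});                    // old master
  handleUpdateFramework(UpdateFrameworkMessage{FrameworkInfo{"fw"}}); // new master
}
{code}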



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7256) Replace Boost Type Traits leftovers with STL

2017-03-16 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928008#comment-15928008
 ] 

Ilya Pronin commented on MESOS-7256:


Review requests:
https://reviews.apache.org/r/57689/
https://reviews.apache.org/r/57690/
https://reviews.apache.org/r/57691/

[~mcypark] can you shepherd this, please?

> Replace Boost Type Traits leftovers with STL
> 
>
> Key: MESOS-7256
> URL: https://issues.apache.org/jira/browse/MESOS-7256
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess, stout
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{boost::enable_if}} and {{boost::is_*}} from Boost Type Traits and Utility 
> are still being used in some places in Stout and libprocess. They can be 
> replaced with their C++11 STL counterparts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7256) Replace Boost Type Traits leftovers with STL

2017-03-16 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7256:
--

 Summary: Replace Boost Type Traits leftovers with STL
 Key: MESOS-7256
 URL: https://issues.apache.org/jira/browse/MESOS-7256
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess, stout
Reporter: Ilya Pronin
Assignee: Ilya Pronin
Priority: Minor


{{boost::enable_if}} and {{boost::is_*}} from Boost Type Traits and Utility are 
still being used in some places in Stout and libprocess. They can be replaced 
with their C++11 STL counterparts.
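
A minimal before/after sketch of the replacement; the {{twice()}} function is
only an example:
{code:cpp}
#include <type_traits>

// Boost flavour (what this ticket removes):
//   template <typename T>
//   typename boost::enable_if<boost::is_integral<T>, T>::type
//   twice(T value) { return value + value; }

// C++11 STL counterpart:
template <typename T>
typename std::enable_if<std::is_integral<T>::value, T>::type
twice(T value) { return value + value; }

int main() {
  return twice(21) == 42 ? 0 : 1;
}
{code}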



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-2824) Support pre-fetching images

2017-03-08 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901585#comment-15901585
 ] 

Ilya Pronin commented on MESOS-2824:


Review requests:
https://reviews.apache.org/r/57425/ ({{Containerizer::pull}} method)
https://reviews.apache.org/r/57426/ ({{PULL_CONTAINER_IMAGE}} agent API call)
https://reviews.apache.org/r/57427/ (authorization for {{PULL_CONTAINER_IMAGE}} 
API call)

> Support pre-fetching images
> ---
>
> Key: MESOS-2824
> URL: https://issues.apache.org/jira/browse/MESOS-2824
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Affects Versions: 0.23.0
>Reporter: Ian Downes
>Assignee: Ilya Pronin
>Priority: Minor
>  Labels: mesosphere, twitter
>
> Default container images can be specified with the --default_container_info 
> flag to the slave. This may be a large image that will take a long time to 
> initially fetch/hash/extract when the first container is provisioned. Add 
> optional support to start fetching the image when the slave starts and 
> consider not registering until the fetch is complete.
> To extend that, we should support an operator endpoint so that operators can 
> specify images to pre-fetch.
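
As a rough sketch of the "pre-fetch before registering" idea from the
description; {{Containerizer}}, {{Image}} and the {{pull()}} signature below
are hypothetical stand-ins for this example, not the API proposed in the
reviews above.
{code:cpp}
#include <future>
#include <iostream>
#include <string>
#include <vector>

struct Image { std::string name; };

// Hypothetical containerizer: pull() fetches/hashes/extracts asynchronously.
struct Containerizer {
  std::future<bool> pull(const Image& image) {
    return std::async(std::launch::async, [image]() {
      std::cout << "Fetching " << image.name << std::endl;
      return true;
    });
  }
};

int main() {
  Containerizer containerizer;
  std::vector<Image> prefetch = {{"ubuntu:16.04"}, {"registry/app:prod"}};

  // Kick off all pulls, then block until they finish; the agent would only
  // register with the master once every pre-fetch has completed.
  std::vector<std::future<bool>> pulls;
  for (const Image& image : prefetch) {
    pulls.push_back(containerizer.pull(image));
  }
  for (auto& pull : pulls) {
    if (!pull.get()) return 1;
  }

  std::cout << "All images pre-fetched; safe to register" << std::endl;
}
{code}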



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7089) Local Docker Resolver for Mesos Containerizer

2017-02-09 Thread Ilya Pronin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilya Pronin updated MESOS-7089:
---
Description: Docker’s mutable tags serve as a layer of indirection which 
can be used to point a tag to different images digests (concrete immutable 
images) at different points in time. For instance `latest` tag can point 
`digest-0` at t0 and then to `digest-1` at t1. Mesos has support for local 
docker registry, where the images are files on the local filesystem, named 
either as `repo:tag` or `repo@digest`. This approach trims the degree of 
freedom provided by the indirection mentioned above (from Docker’s mutable 
tags), which can be essential in some cases. For instance, it might be useful 
in cases where the operator of a cluster would like to roll out image updates 
without requiring customers to update their task configuration.  (was: 
Docker’s mutable tags serve as a layer of indirection which can be used to 
point a tag to different images digests (concrete immutable images) at 
different points in time. For instance `latest` tag can point `digest-0` at t0 
and then to `digest-1` at t1. Mesos has support for local docker registry, 
where the images are files on the local filesystem, named either as `repo:tag` 
or `repo:digest`. This approach trims the degree of freedom provided by the 
indirection mentioned above (from Docker’s mutable tags), which can be 
essential in some cases. For instance, it might be useful in cases, where the 
operator of a cluster would like to rollout image updates without having the 
customers to update their task configuration.)

> Local Docker Resolver for Mesos Containerizer
> -
>
> Key: MESOS-7089
> URL: https://issues.apache.org/jira/browse/MESOS-7089
> Project: Mesos
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>
> Docker’s mutable tags serve as a layer of indirection which can be used to 
> point a tag to different images digests (concrete immutable images) at 
> different points in time. For instance `latest` tag can point `digest-0` at 
> t0 and then to `digest-1` at t1. Mesos has support for local docker registry, 
> where the images are files on the local filesystem, named either as 
> `repo:tag` or `repo@digest`. This approach trims the degree of freedom 
> provided by the indirection mentioned above (from Docker’s mutable tags), 
> which can be essential in some cases. For instance, it might be useful in 
> cases where the operator of a cluster would like to roll out image updates 
> without requiring customers to update their task configuration.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7089) Local Docker Resolver for Mesos Containerizer

2017-02-09 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859778#comment-15859778
 ] 

Ilya Pronin commented on MESOS-7089:


[~santhk], I'm afraid people won't be able to access docs in our domain using 
the link. You need to post it from a personal account.

> Local Docker Resolver for Mesos Containerizer
> -
>
> Key: MESOS-7089
> URL: https://issues.apache.org/jira/browse/MESOS-7089
> Project: Mesos
>  Issue Type: Story
>Reporter: Santhosh Kumar Shanmugham
>
> Docker’s mutable tags serve as a layer of indirection which can be used to 
> point a tag to different images digests (concrete immutable images) at 
> different points in time. For instance `latest` tag can point `digest-0` at 
> t0 and then to `digest-1` at t1. Mesos has support for local docker registry, 
> where the images are files on the local filesystem, named either as 
> `repo:tag` or `repo:digest`. This approach trims the degree of freedom 
> provided by the indirection mentioned above (from Docker’s mutable tags), 
> which can be essential in some cases. For instance, it might be useful in 
> cases where the operator of a cluster would like to roll out image updates 
> without requiring customers to update their task configuration.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7069) The linux filesystem isolator should set mode and ownership for host volumes.

2017-02-08 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858068#comment-15858068
 ] 

Ilya Pronin edited comment on MESOS-7069 at 2/8/17 2:56 PM:


Internally we tried adding the same functionality that {{filesystem/shared}} 
isolator had (described in [my comment in 
MESOS-6563|https://issues.apache.org/jira/browse/MESOS-6563?focusedCommentId=15683941=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15683941]).
 This can be the first step.

Also {{Volume}} protobuf has the {{mode}} field. It can be used for setting 
permissions on the mounted host directory.


was (Author: ipronin):
Internally we added the same functionality that {{filesystem/shared}} isolator 
had (described in [my comment in 
MESOS-6563|https://issues.apache.org/jira/browse/MESOS-6563?focusedCommentId=15683941=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15683941]).
 This can be the first step.

Also {{Volume}} protobuf has the {{mode}} field. It can be used for setting 
permissions on the mounted host directory.

> The linux filesystem isolator should set mode and ownership for host volumes.
> -
>
> Key: MESOS-7069
> URL: https://issues.apache.org/jira/browse/MESOS-7069
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Gilbert Song
>  Labels: filesystem, linux, volumes
>
> If the host path is a relative path, the linux filesystem isolator should set 
> the mode and ownership for this host volume since it allows a non-root user 
> to write to the volume. Note that this is the case of sharing the host 
> filesystem (without rootfs).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7069) The linux filesystem isolator should set mode and ownership for host volumes.

2017-02-08 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15858068#comment-15858068
 ] 

Ilya Pronin commented on MESOS-7069:


Internally we added the same functionality that {{filesystem/shared}} isolator 
had (described in [my comment in 
MESOS-6563|https://issues.apache.org/jira/browse/MESOS-6563?focusedCommentId=15683941=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15683941]).
 This can be the first step.

Also {{Volume}} protobuf has the {{mode}} field. It can be used for setting 
permissions on the mounted host directory.

> The linux filesystem isolator should set mode and ownership for host volumes.
> -
>
> Key: MESOS-7069
> URL: https://issues.apache.org/jira/browse/MESOS-7069
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Gilbert Song
>  Labels: filesystem, linux, volumes
>
> If the host path is a relative path, the linux filesystem isolator should set 
> the mode and ownership for this host volume since it allows a non-root user 
> to write to the volume. Note that this is the case of sharing the host 
> filesystem (without rootfs).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7045) Skip already stored layers in local Docker puller

2017-02-07 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855894#comment-15855894
 ] 

Ilya Pronin commented on MESOS-7045:


Fixed tests in https://reviews.apache.org/r/56284/

> Skip already stored layers in local Docker puller
> -
>
> Key: MESOS-7045
> URL: https://issues.apache.org/jira/browse/MESOS-7045
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{slave::docker::LocalPuller}} can skip extracting layers that are already 
> present in the store. {{slave::docker::RegistryPuller}} already does this and 
> {{slave::docker::Store}} is ready for it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-2824) Support pre-fetching images

2017-02-06 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854053#comment-15854053
 ] 

Ilya Pronin commented on MESOS-2824:


I've created a small design doc for this feature: 
https://docs.google.com/document/d/1TdrF-EFNvxlEYou_CCmW0LnCTPoBcUHasRNALXVS3hY/edit?usp=sharing

Please, comment.

> Support pre-fetching images
> ---
>
> Key: MESOS-2824
> URL: https://issues.apache.org/jira/browse/MESOS-2824
> Project: Mesos
>  Issue Type: Improvement
>  Components: isolation
>Affects Versions: 0.23.0
>Reporter: Ian Downes
>Assignee: Ilya Pronin
>Priority: Minor
>  Labels: mesosphere, twitter
>
> Default container images can be specified with the --default_container_info 
> flag to the slave. This may be a large image that will take a long time to 
> initially fetch/hash/extract when the first container is provisioned. Add 
> optional support to start fetching the image when the slave starts and 
> consider not registering until the fetch is complete.
> To extend that, we should support an operator endpoint so that operators can 
> specify images to pre-fetch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7006) Launch docker containers with --cpus instead of cpu-shares

2017-02-06 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15854002#comment-15854002
 ] 

Ilya Pronin commented on MESOS-7006:


On Linux {{--cpus}} option uses only CFS with {{cpus.cfs_period_us}} set to 
100ms: 
https://github.com/docker/docker/blob/master/daemon/daemon_unix.go#L115-L121
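
In other words, the daemon derives the CFS quota from the requested number of
CPUs; a small sketch of that arithmetic (values below are only an example):
{code:cpp}
#include <cstdint>
#include <iostream>

int main() {
  const double cpus = 1.5;         // e.g. docker run --cpus=1.5
  const int64_t periodUs = 100000; // cpu.cfs_period_us fixed at 100ms
  const int64_t quotaUs = static_cast<int64_t>(cpus * periodUs);

  std::cout << "cpu.cfs_period_us = " << periodUs << "\n"
            << "cpu.cfs_quota_us  = " << quotaUs << std::endl; // 150000
}
{code}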

> Launch docker containers with --cpus instead of cpu-shares
> --
>
> Key: MESOS-7006
> URL: https://issues.apache.org/jira/browse/MESOS-7006
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Craig W
>Assignee: Tomasz Janiszewski
>Priority: Minor
>
> docker 1.13 was recently released and it now has a new --cpus flag which 
> allows a user to specify how many cpus a container should have. This is much 
> simpler for users to reason about.
> mesos should switch to starting a container with --cpus instead of 
> --cpu-shares, or at least make it configurable.
> https://blog.docker.com/2017/01/cpu-management-docker-1-13/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7059) Unnecessary mkdirs in ProvisionerDockerLocalStoreTest.*

2017-02-03 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7059:
--

 Summary: Unnecessary mkdirs in ProvisionerDockerLocalStoreTest.*
 Key: MESOS-7059
 URL: https://issues.apache.org/jira/browse/MESOS-7059
 Project: Mesos
  Issue Type: Bug
  Components: tests
Reporter: Ilya Pronin
Assignee: Ilya Pronin
Priority: Minor


{{ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar}} and 
{{ProvisionerDockerLocalStoreTest.PullingSameImageSimutanuously}} start with 
creating directories that were already created by {{SetUp()}} and directories 
that are not used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7059) Unnecessary mkdirs in ProvisionerDockerLocalStoreTest.*

2017-02-03 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15852093#comment-15852093
 ] 

Ilya Pronin commented on MESOS-7059:


Review request: https://reviews.apache.org/r/56291/

> Unnecessary mkdirs in ProvisionerDockerLocalStoreTest.*
> ---
>
> Key: MESOS-7059
> URL: https://issues.apache.org/jira/browse/MESOS-7059
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{ProvisionerDockerLocalStoreTest.LocalStoreTestWithTar}} and 
> {{ProvisionerDockerLocalStoreTest.PullingSameImageSimutanuously}} start with 
> creating directories that were already created by {{SetUp()}} and directories 
> that are not used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7046) Simplify AppC provisioner's cache keys comparison

2017-02-01 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848704#comment-15848704
 ] 

Ilya Pronin commented on MESOS-7046:


Review requests:
https://reviews.apache.org/r/56086/
https://reviews.apache.org/r/56087/

> Simplify AppC provisioner's cache keys comparison
> -
>
> Key: MESOS-7046
> URL: https://issues.apache.org/jira/browse/MESOS-7046
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{appc::Cache::Key::operator==()}} does manual maps comparison by looking up 
> all elements from one container in another and vice versa. 
> {{std::map::operator==()}} does that more effectively by checking sizes and 
> doing element by element comparison.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7046) Simplify AppC provisioner's cache keys comparison

2017-02-01 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7046:
--

 Summary: Simplify AppC provisioner's cache keys comparison
 Key: MESOS-7046
 URL: https://issues.apache.org/jira/browse/MESOS-7046
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Ilya Pronin
Assignee: Ilya Pronin
Priority: Minor


{{appc::Cache::Key::operator==()}} does manual maps comparison by looking up 
all elements from one container in another and vice versa. 
{{std::map::operator==()}} does that more effectively by checking sizes and 
doing element by element comparison.
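
For illustration, a minimal sketch of relying on {{std::map::operator==()}};
the {{Key}} type here is a toy stand-in for {{appc::Cache::Key}}:
{code:cpp}
#include <cassert>
#include <map>
#include <string>

struct Key {
  std::map<std::string, std::string> labels;

  // std::map::operator== compares sizes first, then elements in order, so
  // no two-way lookup is needed.
  bool operator==(const Key& other) const { return labels == other.labels; }
};

int main() {
  Key a{{{"os", "linux"}, {"arch", "amd64"}}};
  Key b{{{"arch", "amd64"}, {"os", "linux"}}}; // same contents
  assert(a == b);
  return 0;
}
{code}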



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7045) Skip already stored layers in local Docker puller

2017-02-01 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848486#comment-15848486
 ] 

Ilya Pronin commented on MESOS-7045:


Review request: https://reviews.apache.org/r/56174/

> Skip already stored layers in local Docker puller
> -
>
> Key: MESOS-7045
> URL: https://issues.apache.org/jira/browse/MESOS-7045
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{slave::docker::LocalPuller}} can skip extracting layers that are already 
> present in the store. {{slave::docker::RegistryPuller}} already does this and 
> {{slave::docker::Store}} is ready for it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7045) Skip already stored layers in local Docker puller

2017-02-01 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-7045:
--

 Summary: Skip already stored layers in local Docker puller
 Key: MESOS-7045
 URL: https://issues.apache.org/jira/browse/MESOS-7045
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Ilya Pronin
Assignee: Ilya Pronin
Priority: Minor


{{slave::docker::LocalPuller}} can skip extracting layers that are already 
present in the store. {{slave::docker::RegistryPuller}} already does this and 
{{slave::docker::Store}} is ready for it.
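
As a rough sketch of the skip-if-cached check; the store layout and the
extraction step below are hypothetical stand-ins for what {{LocalPuller}} and
{{Store}} actually do.
{code:cpp}
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Extract only the layers that are not already present in the store.
void pullLayers(const std::vector<std::string>& layerIds,
                const fs::path& storeDir) {
  for (const std::string& layerId : layerIds) {
    if (fs::exists(storeDir / layerId)) {
      // Already extracted: skip the expensive untar, mirroring what the
      // registry puller already does.
      std::cout << "Skipping cached layer " << layerId << std::endl;
      continue;
    }

    std::cout << "Extracting layer " << layerId << std::endl;
    fs::create_directories(storeDir / layerId);
  }
}

int main() {
  pullLayers({"sha256-aaa", "sha256-bbb"},
             fs::temp_directory_path() / "layer-store");
}
{code}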



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7034) Mesos agent needs to attempt overlayfs module

2017-01-31 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847197#comment-15847197
 ] 

Ilya Pronin edited comment on MESOS-7034 at 1/31/17 5:45 PM:
-

It doesn't feel 100% right to me either. But the overlay module doesn't have 
any parameters, so the only concern will be when it is loaded.

Also, it looks like Docker does the same thing, though their docs say that the 
module should already be loaded: 
https://github.com/docker/docker/blob/master/daemon/graphdriver/overlay2/overlay.go#L224


was (Author: ipronin):
It doesn't feel 100% right with me either. But "overlay" doesn't have any 
parameters so the only concern will be when it is loaded.

Also looks like Docker does the same thing though in their docs they tell that 
the module should already be loaded: 
https://github.com/docker/docker/blob/master/daemon/graphdriver/overlay2/overlay.go#L224

> Mesos agent needs to attempt overlayfs module
> -
>
> Key: MESOS-7034
> URL: https://issues.apache.org/jira/browse/MESOS-7034
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> Mesos agent reads {{/proc/filesystems}} to check if a filesystem is 
> supported. However, for optional filesystems such as {{overlayfs}}, the 
> modules are not loaded by default. Hence attempt to run a {{modprobe 
> overlayfs}} command before reading the {{/proc/filesystems}} file.
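
For illustration, a minimal sketch of that order of operations; the module is
named {{overlay}} on recent kernels ({{overlayfs}} on older ones), and the
parsing here is simplified compared to Mesos' real filesystem helpers.
{code:cpp}
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

bool overlayfsSupported() {
  // Best effort: ignore the exit status, since the authoritative answer is
  // whatever /proc/filesystems says afterwards.
  std::system("modprobe overlay 2>/dev/null");

  std::ifstream filesystems("/proc/filesystems");
  std::string line;
  while (std::getline(filesystems, line)) {
    if (line.find("overlay") != std::string::npos) {
      return true;
    }
  }
  return false;
}

int main() {
  std::cout << "overlayfs supported: "
            << (overlayfsSupported() ? "yes" : "no") << std::endl;
}
{code}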



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7034) Mesos agent needs to attempt overlayfs module

2017-01-31 Thread Ilya Pronin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15847197#comment-15847197
 ] 

Ilya Pronin commented on MESOS-7034:


It doesn't feel 100% right to me either. But "overlay" doesn't have any 
parameters, so the only concern will be when it is loaded.

Also, it looks like Docker does the same thing, though their docs say that the 
module should already be loaded: 
https://github.com/docker/docker/blob/master/daemon/graphdriver/overlay2/overlay.go#L224

> Mesos agent needs to attempt overlayfs module
> -
>
> Key: MESOS-7034
> URL: https://issues.apache.org/jira/browse/MESOS-7034
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> Mesos agent reads {{/proc/filesystems}} to check if a filesystem is 
> supported. However, for optional filesystems such as {{overlayfs}}, the 
> modules are not loaded by default. Hence attempt to run a {{modprobe 
> overlayfs}} command before reading the {{/proc/filesystems}} file.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

