Re: MKLDNN performance in CI

2018-11-22 Thread Marco de Abreu
Sure, good idea! https://github.com/apache/incubator-mxnet/pull/13379

-Marco


RE: MKLDNN performance in CI

2018-11-22 Thread Zhao, Patric
Thanks, that should be the most time-consuming part.

@Marco, could you try to disable this env var and see the performance again?
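
For anyone who wants to try this locally, here is a rough sketch (treat the
details as assumptions: it expects a CPU MKL-DNN build of MXNet to be
installed, the MXNet source tree as the working directory, and nose to be
available, which is roughly what the CI job uses):

import os
import subprocess
import time

def run_suite(debug_enabled):
    env = dict(os.environ)
    if debug_enabled:
        env["MXNET_MKLDNN_DEBUG"] = "1"
    else:
        env.pop("MXNET_MKLDNN_DEBUG", None)
    start = time.time()
    # check=False: we only care about the wall-clock duration here
    subprocess.run(["nosetests", "-v", "tests/python/unittest"],
                   env=env, check=False)
    return time.time() - start

for flag in (True, False):
    print("MXNET_MKLDNN_DEBUG=%d -> %.1f s" % (flag, run_suite(flag)))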


Re: MKLDNN performance in CI

2018-11-22 Thread Marco de Abreu
Okay, that makes sense. Thanks for the very detailed explanation!

Don't worry about speed in CI - the debug flag in particular might be the
reason here. I'm happy to leave it in, since the task of CI is accuracy
and detailed reporting, not speed :)

I just wanted to give a quick heads up in case this could be an indicator.
I'm looking forward to the 1:1 comparisons, but besides that there's no
further action required. Thanks for your efforts!

Best regards,
Marco


RE: MKLDNN performance in CI

2018-11-22 Thread Lv, Tao A
I think yes, except the cpp test.


RE: MKLDNN performance in CI

2018-11-22 Thread Zhao, Patric
Good point, Tao! 
Is this env enabled in all MKL-DNN CI? 



RE: MKLDNN performance in CI

2018-11-22 Thread Zhao, Patric
Happy Thanksgiving, everyone :)

Hi Marco,

Thanks for raising this question. We will look into the details of the CI test
cases, and Shufan will provide the 1:1 OP-level performance data.

In general, the CI tests are not performance cases; they cover many different
situations, even corner cases, for quality purposes.
Specifically, some overhead may be introduced when an MKL-DNN OP is connected
to a non-MKL-DNN OP, because an extra data-format conversion happens at the
boundary. So it is expected behaviour that performance drops in these kinds of
test cases, where the computation itself is quite tiny.
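
As a hedged illustration only (shapes are arbitrary, and which operators lack
MKL-DNN kernels can change between versions): a graph where an MKL-DNN
Convolution feeds an operator without an MKL-DNN implementation, such as slice
(mentioned further below), and then another Convolution forces the data to be
converted out of and back into the MKL-DNN layout at the boundary.

import mxnet as mx

data = mx.nd.random.uniform(shape=(1, 16, 32, 32))
w1 = mx.nd.random.uniform(shape=(16, 16, 3, 3))
b1 = mx.nd.zeros((16,))
w2 = mx.nd.random.uniform(shape=(8, 8, 3, 3))
b2 = mx.nd.zeros((8,))

conv1 = mx.nd.Convolution(data, w1, b1, kernel=(3, 3), num_filter=16)
# slice runs on the default CPU backend, so conv1's MKL-DNN-formatted output
# has to be reordered to the default layout before it can be consumed
sliced = mx.nd.slice(conv1, begin=(0, 0, 0, 0), end=(1, 8, 30, 30))
conv2 = mx.nd.Convolution(sliced, w2, b2, kernel=(3, 3), num_filter=8)
conv2.wait_to_read()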

But in real workloads these situations are not as frequent, and the total
performance of MKL-DNN will be much better than OpenBLAS.
(data in here: 
https://cwiki.apache.org/confluence/display/MXNET/MXNet+with+Intel+MKL-DNN+-+Performance+Benchmarking)

In the short term, Shufan will help to look into these test cases and figure
out a proper solution to make the CI faster.

In the medium and long term, we are working on implementing MKL-DNN support
for more ops, like reshape, slice, and split, so that less data-format
conversion will be involved.

Feel free to let me know if you have any other concerns.

BR,

--Patric
 



RE: MKLDNN performance in CI

2018-11-22 Thread Lv, Tao A
Thanks for bringing this up, Marco. It's really weird, since most of the tests
listed in "worth noting" are not related to the MKL-DNN backend.

I can understand that some tests for MKL-DNN operators may be slower because
MXNET_MKLDNN_DEBUG is enabled in the CI:
https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh#L713
 



MKLDNN performance in CI

2018-11-22 Thread Marco de Abreu
Hello,

I have noticed that our Python tests have been increasing in duration
recently. In order to analyse this further, I created the PR [1], which
makes it possible to record test durations. Please note that I did not dive
deep into these numbers and that they have to be taken with a grain of salt,
since slaves have varying resource utilization.

Please have a look at the two following logs:
Python3 CPU MKLDNN:
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-13377/runs/2/nodes/155/steps/409/log/?start=0
Python3 CPU Openblas:
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-13377/runs/2/nodes/152/steps/398/log/?start=0

If you scroll to the end (note that there are multiple test stages and
summaries being printed in these logs), you will find the following
statements:

Python3 CPU MKLDNN: "Ran 702 tests in 3042.102s"
Python3 CPU Openblas: "Ran 702 tests in 2158.458s"

This shows that the MKLDNN backend is generally about 40% slower than the
Openblas backend (3042.102s vs. 2158.458s, a factor of roughly 1.4). If we go
into the details, we can see that some tests are significantly slower:

Python3 CPU MKLDNN:

>[success] 20.78% test_random.test_shuffle: 630.7165s
>[success] 17.79% test_sparse_operator.test_elemwise_binary_ops: 540.0487s
>[success] 10.91% test_gluon_model_zoo.test_models: 331.1503s
>[success] 2.62% test_operator.test_broadcast_binary_op: 79.4556s
>[success] 2.45% test_operator.test_pick: 74.4041s
>[success] 2.39% test_metric_perf.test_metric_performance: 72.5445s
>[success] 2.38% test_random.test_negative_binomial_generator: 72.1751s
>[success] 1.84% test_operator.test_psroipooling: 55.9432s
>[success] 1.78% test_random.test_poisson_generator: 54.0104s
>[success] 1.72% test_gluon.test_slice_pooling2d_slice_pooling2d: 52.3447s
>[success] 1.60% test_contrib_control_flow.test_cond: 48.6977s
>[success] 1.41% test_random.test_random: 42.8712s
>[success] 1.03% test_operator.test_layer_norm: 31.1242s


Python3 CPU Openblas:
> [success] 26.20% test_gluon_model_zoo.test_models: 563.3366s
> [success] 4.34% test_random.test_shuffle: 93.3157s
> [success] 4.31% test_random.test_negative_binomial_generator: 92.6899s
> [success] 3.78% test_sparse_operator.test_elemwise_binary_ops: 81.2048s
>  [success] 3.30% test_operator.test_psroipooling: 70.9090s
>  [success] 3.20% test_random.test_poisson_generator: 68.7500s
>  [success] 3.10% test_metric_perf.test_metric_performance: 66.6085s
>  [success] 2.79% test_operator.test_layer_norm: 59.9566s
>  [success] 2.66% test_gluon.test_slice_pooling2d_slice_pooling2d: 57.1887s
>  [success] 2.62% test_operator.test_pick: 56.2312s
>  [success] 2.60% test_random.test_random: 55.8920s
>  [success] 2.19% test_operator.test_broadcast_binary_op: 47.1879s
> [success] 0.96% test_contrib_control_flow.test_cond: 20.6908s

Tests worth noting:
- test_random.test_shuffle: 700% increase - but I don't know how this may
be related to MKLDNN. Are we doing random number generation in either of
those backends?
- test_sparse_operator.test_elemwise_binary_ops: 700% increase
- test_gluon_model_zoo.test_models: 40% decrease - that's awesome and to be
expected :)
- test_operator.test_broadcast_binary_op: 80% increase
- test_contrib_control_flow.test_cond: 250% increase
- test_operator.test_layer_norm: 50% decrease - nice!

As I have stated previously, these numbers might not mean anything since
the CI is not a benchmarking environment (sorry if these are false
negatives), but I thought it might be worth mentioning so Intel could
follow up and dive deeper.

Does anybody here do 1:1 operator comparisons (e.g. running layer_norm
in the different backends to compare the performance) and could provide us
with those numbers?
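
In case it helps, here is a rough single-operator timing sketch (not an
official benchmark; shape and repeat count are arbitrary). Running it once
against an MKL-DNN build and once against an Openblas build would give a
crude 1:1 number for layer_norm:

import time
import mxnet as mx

def time_layer_norm(shape=(128, 1024), repeat=200):
    data = mx.nd.random.uniform(shape=shape)
    gamma = mx.nd.ones((shape[-1],))
    beta = mx.nd.zeros((shape[-1],))
    mx.nd.LayerNorm(data, gamma, beta).wait_to_read()  # warm-up
    start = time.time()
    for _ in range(repeat):
        out = mx.nd.LayerNorm(data, gamma, beta)
    out.wait_to_read()
    mx.nd.waitall()  # drain the async engine before stopping the clock
    return (time.time() - start) / repeat

print("LayerNorm: %.3f ms per call" % (time_layer_norm() * 1000))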

Best regards,
Marco

[1]: https://github.com/apache/incubator-mxnet/pull/13377


Re: CI impaired

2018-11-22 Thread Marco de Abreu
Thanks everybody, I really appreciate it!

Today was a good day: there were no incidents and everything appears to be
stable. In the meantime I did a deep dive on why we had such a significant
performance decrease in our compilation jobs - which then clogged up
the queue and resulted in 1000 jobs waiting to be scheduled.

The reason was the way how we use ccache to speed up our compilation jobs.
Usually, this yields us a huge performance improvement (CPU openblas, for
example, goes from 30 minutes down to ~3min, ARMv7 from 30 minutes down to
~1.5min, etc.). Unfortunately in this case, ccache was our limiting factor.
Here's some background about how we operate our cache:

We use EFS to have a distributed ccache between all of our
unrestricted-prod-slaves. EFS is rated for almost unlimited scalability
(being consumed by thousands of instances in parallel [1]) with a
theoretical throughput of over 10Gbps. One thing I didn't know when I
designed this approach was how throughput is granted. Similar to
T2 CPU credits, EFS uses BurstCredits to grant you higher throughput
(the default is 50MiB/s) [2]. Due to the high load, we consumed all
of our credits - here's a very interesting graph: [3].
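
For reference, the data behind that graph can also be pulled programmatically;
a small sketch with boto3 (the file system id is a placeholder):

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EFS",
    MetricName="BurstCreditBalance",
    Dimensions=[{"Name": "FileSystemId", "Value": "fs-12345678"}],
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(days=3),
    EndTime=datetime.datetime.utcnow(),
    Period=3600,
    Statistics=["Minimum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])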

To avoid similar incidents in the future, I have taken the following actions:
1. I switched EFS from burst-mode to provisioned throughput with 300MB/s
(in the graph at [3] you can see how our IO immediately increases - and
thus our CI gets faster - as soon as I added provisioned throughput).
2. I created internal follow-up tickets to add monitoring and automated
actions.

First, we should be notified if we are running low on credits, to kick off
an investigation. Second (nice to have), we could have a Lambda function
which listens for that event and automatically switches the EFS volume from
burst mode to provisioned throughput during high-load times. The required
throughput could be retrieved via CloudWatch and then multiplied by a
factor. EFS allows downgrading the throughput mode 24h after the last
change (to reduce capacity when the load is over) and always allows
upgrading the provisioned capacity (if the load goes even higher). I've been
looking for a pre-made CloudFormation template to facilitate that, but so
far, I haven't been able to find one.
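
For the record, a sketch of what those two follow-ups could look like with
boto3 (the file system id, SNS topic and alarm threshold are placeholders,
not our actual values):

import boto3

FILE_SYSTEM_ID = "fs-12345678"                  # placeholder
ALARM_TOPIC_ARN = "arn:aws:sns:region:acct:ci"  # placeholder

def create_burst_credit_alarm(threshold_bytes=1.0e12):
    """Notify us when the EFS burst credit balance drops below a threshold."""
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(
        AlarmName="efs-ccache-burst-credits-low",
        Namespace="AWS/EFS",
        MetricName="BurstCreditBalance",
        Dimensions=[{"Name": "FileSystemId", "Value": FILE_SYSTEM_ID}],
        Statistic="Minimum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=threshold_bytes,
        ComparisonOperator="LessThanThreshold",
        AlarmActions=[ALARM_TOPIC_ARN],
    )

def switch_to_provisioned(mibps=300):
    """What a Lambda listening on that alarm could do automatically."""
    efs = boto3.client("efs")
    efs.update_file_system(
        FileSystemId=FILE_SYSTEM_ID,
        ThroughputMode="provisioned",
        ProvisionedThroughputInMibps=mibps,
    )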

I'm now running additional load tests on our test CI environment to detect
other potential bottlenecks.

Thanks a lot for your support!

Best regards,
Marco

[1]: https://docs.aws.amazon.com/efs/latest/ug/performance.html
[2]:
https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes
[3]: https://i.imgur.com/nboQLOn.png

On Thu, Nov 22, 2018 at 1:40 AM Qing Lan  wrote:

> Appreciated for your effort and help to make CI a better place!
>
> Qing
>
> On 11/21/18, 4:38 PM, "Lin Yuan"  wrote:
>
> Thanks for your efforts, Marco!
>
> On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian <
> anirudh2...@gmail.com>
> wrote:
>
> > Thanks for the quick response and mitigation!
> >
> > On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu
> >  wrote:
> >
> > > Hello,
> > >
> > > today, CI had some issues and I had to cancel all jobs a few minutes ago.
> > > This was basically caused by the high load that is currently being put on
> > > our CI system due to the pre-release efforts for this Friday.
> > >
> > > It's really unfortunate that we just had outages of three core components
> > > within the last two days - sorry about that! To recap, we had the
> > > following outages (which are unrelated to the parallel refactor of the
> > > Jenkins pipeline):
> > > - (yesterday evening) The Jenkins master ran out of disk space and thus
> > > processed requests at reduced capacity
> > > - (this morning) The Jenkins master got updated which broke our
> > > autoscaling's upscaling capabilities.
> > > - (new, this evening) The Jenkins API was unresponsive: Due to the high
> > > number of jobs and a bad API design in the Jenkins REST API, the
> > > time-complexity of a simple create or delete request was quadratic, which
> > > resulted in all requests timing out (that was the current outage). This
> > > resulted in our auto scaling being unable to interface with the Jenkins
> > > master.
> > >
> > > I have now made improvements to our REST API calls which reduced the
> > > complexity from O(N^2) to O(1). The reason was an underlying redirect loop
> > > in the Jenkins createNode and deleteNode REST API in combination with
> > > unrolling the entire slave and job graph (which got quite huge during
> > > extensive load) upon every single request. Since we had about 150
> > > registered slaves and 1000 jobs in the queue, the duration for a single
> > > REST API call rose to up to 45 seconds (we execute up to a few hundred
> > > queries per auto scaling loop). This led to our auto scaling timing 

Re: Slack access

2018-11-22 Thread Steffen Rochel
added

On Thu, Nov 22, 2018 at 9:11 AM Vishaal Kapoor 
wrote:

> Switched e-mails. Thanks!
> VIshaal
>


Re: [Discussion] Remove bundled llvm OpenMP

2018-11-22 Thread Anton Chernov
Thank you for your answer, Chris.

> The whole “mixing omp libraries” is something that occurs in production
every day and certainly in everything that uses mkl.

I'm afraid this statement is wrong. Intel MKL-DNN strictly ensures that
this mixing does not happen:

"Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP runtime
library to work. As different OpenMP runtimes may not be binary compatible
it's important to ensure that only one OpenMP runtime is used throughout
the application. Having more than one OpenMP runtime initialized may lead
to undefined behavior resulting in incorrect results or crashes." [1]

That is why 2 different MKLML libraries are provided:

lib/libmklml_gnu.so  | Intel MKL small library for GNU* OpenMP runtime
lib/libmklml_intel.so | Intel MKL small library for Intel(R) OpenMP runtime
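
(For reference, a quick way to check which OpenMP runtimes a particular build
actually pulls in is to inspect the dynamic dependencies of libmxnet.so; a
small sketch, with the library path being an assumption:)

import re
import subprocess

def openmp_runtimes(shared_lib="lib/libmxnet.so"):
    ldd = subprocess.run(["ldd", shared_lib], capture_output=True, text=True)
    # seeing more than one of libomp / libgomp / libiomp5 here is exactly the
    # mixed-runtime situation described above
    return sorted(set(re.findall(r"lib(?:omp|gomp|iomp5)[\w.]*\.so[\w.]*",
                                 ldd.stdout)))

print(openmp_runtimes())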

> is the suggestion that libiomp be removed from mkl?

That is certainly not my suggestion.

> have you spoken with intel? have you consulted Intel at all?

Yes, I have asked for comments on the issue.

> “hard to debug random crash”. you’re seeing an assertion which is
probably ...

I'm seeing the result of undefined behaviour. And I want to put emphasis on
the following statement:

Regardless of whether there is a particular reason for the assert, it is
the result of behaviour that should not happen. There are valid ways to
use llvm OpenMP in MXNet, and the current way is not one of them.

> The lack of root-causing the problem and knee-jerk solution here makes me
uncomfortable.

I hope that my efforts in highlighting the problems reach you and help
mitigate your discomfort.

> if you want to see performance differences there’s an environment variable
you can set in the mxnet omp tuning code that will print overhead and
execution times for the current omp library.

I don't want to see performance differences in the current OpenMP library.
I want to remove the current OpenMP library and use the one provided by the
compiler.



Best
Anton

[1] https://github.com/intel/mkl-dnn/blame/master/README.md#L261-L265

Thu, 22 Nov 2018 at 16:50, Chris Olivier :

> Do you not work on CI mostly? My apologies for thinking that was some sort
> of team effort between you and a few others that were passionate about CI
> keeping the CI system running smoothly.
>
> You have source code, you have the line the assertion is on. If you can’t
> describe what’s going wrong that causes the assertion, then I don’t really
> have anything more to add to this conversation beyond what’s below:
>
> The whole “mixing omp libraries” is something that occurs in production
> every day and certainly in everything that uses mkl.  It may occasionally
> cause problems for some edge cases when there is super-complex linking
> strategies and dynamic loading.  But this is not one of those edge cases.
> Mostly blaming this is a red herring for other thread-related problems and
> people switch omp library and the timing of their code changes and they
> stop seeing the problem. I’ve spent my entire career doing heavily
> multithreaded c++ development and i’ve seen that a million times.  is the
> suggestion that libiomp be removed from mkl? have you spoken with intel?
> have you consulted Intel at all?
>
> and what you are seeing isn’t some “hard to debug random crash”. you’re
> seeing an assertion which is probably related to omp trying to create a
> thread pool after a fork and something was done in the mxnet code to make
> that sketchy to do. I’d suggest filing an issue with the llvm openmp just
> like you’d file with any other not-well-understood behavior in mxnet.
>
> The lack of root-causing the problem and knee-jerk solution here makes me
> uncomfortable.
>
> if you want to see performance differences there’s an environment variable
> you can set in the mxnet omp tuning code that will print overhead and
> execution times for the current omp library.
>
>
>
>
>
>
>
> On Thu, Nov 22, 2018 at 7:12 AM Anton Chernov  wrote:
>
> > Hi Chris,
> >
> > Thank you for your answer. If you have noticed, the initial email comes
> > from me, Anton Chernov (@lebeg on GitHub), and thus the proposal is not
> > from any 'CI' team that you've mentioned, but from me personally.
> >
> > You are writing:
> >
> > > someone is doing something unhealthy when they fork ...
> >
> > I'm missing any context to understand what you mean.
> >
> > > we get a lot of performance gain from OMP ...
> >
> > There is no data that would prove this statement and therefore it is a
> > random guess.
> >
> > > in many months, no investigation has occurred as to WHY the assertion
> > > is failing.
> >
> > The investigation has concluded that this is happening due to undefined
> > behaviour, which is, in my opinion, a sufficient answer that does not
> > require going any deeper.
> >
> > > the pr is vetoed until such a time that the actual root cause of the
> > > problem is known.
> >
> > And considering the statements above there is no valid reason to veto the
> > PR.
> >
> 

Re: [Discussion] MXNet CMake build - raise minimal required version

2018-11-22 Thread Chris Olivier
Yes, that flag can be removed. The profiler should always be built in.
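
For reference, a small sketch (using the mxnet.profiler Python API as far as I
know it) of how the built-in profiler stays inactive until it is explicitly
configured and started at runtime:

import mxnet as mx

mx.profiler.set_config(profile_all=True, filename="profile_output.json")
mx.profiler.set_state("run")

a = mx.nd.random.uniform(shape=(1024, 1024))
mx.nd.dot(a, a).wait_to_read()   # some work to record

mx.profiler.set_state("stop")
mx.profiler.dump()               # writes profile_output.json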

On Thu, Nov 22, 2018 at 7:44 AM Anton Chernov  wrote:

> You can find relevant information regarding the profiling flag here:
>
> https://github.com/apache/incubator-mxnet/issues/11563
>
>
> Thu, 22 Nov 2018 at 16:06, Chris Olivier :
>
> > what is meant by:
> >
> >
> > *Profiling*
> > The profiler is always on even for production release builds, because
> > MXNet cannot be built without it [2].  ?
> >
> > you mean it is always built or it is turned on (ie recording and saving
> > profiling information)?  I am not aware of it being turned on by default.
> >
> >
> > profiler has no overhead when built in but not turned on.

Re: [Discussion] MXNet CMake build - raise minimal required version

2018-11-22 Thread Pedro Larroy
Thanks Anton for putting this together and for your efforts here. I think
it's crucial that we maintain the CMake system and bring it forward. I
have spent a lot of time dealing with CMake issues on different
platforms; we would really increase developer productivity and platform
support by having a streamlined build system.
On Thu, Nov 22, 2018 at 4:06 PM Chris Olivier  wrote:
>
> what is meant by:
>
>
> *Profiling*
> The profiler is always on even for production release builds, because MXNet
> cannot be built without it [2].  ?
>
> you mean it is always built or it is turned on (ie recording and saving
> profiling information)?  I am not aware of it being turned on by default.
>
>
> profiler has no overhead when built in but not turned on.
>
>
> On Thu, Nov 22, 2018 at 2:35 AM Anton Chernov  wrote:
>
> > Dear MXNet community,
> >
> > I propose to raise the minimal required cmake version that is needed to
> > build MXNet to 3.10 which was tagged on March 16 2018 [1].
> >
> > The effort of repairing cmake scripts in general is targeting to deprecate
> > make and maintain only 1 build system.
> >
> > *Need*
> >
> > The build system is the foundation of every software project. Its quality
> > directly impacts the quality of the project. The MXNet build system is
> > fragile, partially broken and not maintained.
> >
> > Users of MXNet and developers are confused by the fact that 2 build systems
> > exist at the same time: make and CMake.
> >
> > The main functional areas which are impacted by the current state of the
> > cmake files are:
> >
> > *OpenMP*
> > The current CMake files mix OpenMP libraries from different compilers, which
> > is undefined behaviour. It leads to indeterministic crashes on some
> > platforms. Build and deployment are very hard. No evidence exists that
> > proves that there is any benefit of having llvm OpenMP library as a
> > submodule in MXNet.
> >
> > *BLAS and LAPACK*
> > Basic math library usage is mixed up. It is hard and confusing to configure
> > and the choosing logic of the most optimal library is not present. MKL and
> > OpenBLAS are intermixed in an unpredictable manner.
> >
> > *Profiling*
> > The profiler is always on even for production release builds, because MXNet
> > cannot be built without it [2].
> >
> > *CUDA*
> > CUDA is detected by 3 different files in the current cmake scripts and the
> > choice of those is based on an obscure logic which involves different
> > versions of cmake and the platforms it's building on
> >
> > * CMakeLists.txt
> > * cmake/FirstClassLangCuda.cmake
> > * 3rdparty/mshadow/cmake/Cuda.cmake
> >
> >
> > *Confusing and misleading cmake user options*
> > For example, USE_CUDA / USE_OLDCMAKECUDA. Some of them will or will not do
> > what they are supposed to, based on the cmake generator and version of cmake
> > [3].
> > There are currently more than 30 build parameters for MXNet, none of them
> > documented. Some of them are not even located in the main CMakeLists.txt
> > file, for example 'BLAS'.
> >
> >
> > *Issues*
> > There is a significant amount of github issues related to cmake or build in
> > general. New tickets are issued frequently.
> >
> > * #8702 (https://github.com/apache/incubator-mxnet/issues/8702)
> >  [DISCUSSION] Should we deprecate Makefile and only use CMake?
> > * #5079 (https://github.com/apache/incubator-mxnet/issues/5079)   troubles
> > building python interface on raspberry pi 3
> > * #1722 (https://github.com/apache/incubator-mxnet/issues/1722)   problem:
> > compile mxnet with hdfs
> > * #11549 (https://github.com/apache/incubator-mxnet/issues/11549) Pip
> > package can be much faster (OpenCV version?)
> > * #11417 (https://github.com/apache/incubator-mxnet/issues/11417)
> > libomp.so
> > dependency (need REAL fix)
> > * #8532 (https://github.com/apache/incubator-mxnet/issues/8532)
> >  mxnet-mkl
> > (v0.12.0) crash when using (conda-installed) numpy with MKL // (indirectly)
> > * #11131 (https://github.com/apache/incubator-mxnet/issues/11131)
> > mxnet-cu92 low efficiency  // (indirectly)
> > * #10743 (https://github.com/apache/incubator-mxnet/issues/10743) CUDA
> > 9.1.xx failed if not set OLDCMAKECUDA on cmake 3.10.3 with unix makefile or
> > Ninja generator
> > * #10742 (https://github.com/apache/incubator-mxnet/issues/10742) typo in
> > cpp-package/CMakeLists.txt
> > * #10737 (https://github.com/apache/incubator-mxnet/issues/10737) Cmake is
> > running again when execute make install
> > * #10543 (https://github.com/apache/incubator-mxnet/issues/10543) Failed
> > to
> > build from source when set USE_CPP_PACKAGE = 1, fatal error C1083: unabel
> > to open file: “mxnet-cpp/op.h”: No such file or directory
> > * #10217 (https://github.com/apache/incubator-mxnet/issues/10217) Building
> > with OpenCV causes link errors
> > * #10175 (https://github.com/apache/incubator-mxnet/issues/10175) MXNet
> > MKLDNN build dependency/flow discussion
> > * #10009 (https://github.com/apache/incubator-mxnet/issues/10009)
> > [CMAKE][IoT] 

Re: [Discussion] Remove bundled llvm OpenMP

2018-11-22 Thread Chris Olivier
Do you not work mostly on CI? My apologies for thinking that this was some
sort of team effort between you and a few others who are passionate about
keeping the CI system running smoothly.

You have the source code, and you have the line the assertion is on. If you
can’t describe what’s going wrong to cause the assertion, then I don’t
really have anything more to add to this conversation beyond what’s below:

The whole “mixing omp libraries” issue is something that occurs in production
every day and certainly in everything that uses mkl.  It may occasionally
cause problems for some edge cases where there are super-complex linking
strategies and dynamic loading.  But this is not one of those edge cases.
Mostly, blaming this is a red herring for other thread-related problems:
people switch the omp library, the timing of their code changes, and they
stop seeing the problem. I’ve spent my entire career doing heavily
multithreaded c++ development and i’ve seen that a million times.  is the
suggestion that libiomp be removed from mkl? have you spoken with intel?
have you consulted Intel at all?

and what you are seeing isn’t some “hard to debug random crash”. you’re
seeing an assertion which is probably related to omp trying to create a
thread pool after a fork, and something was done in the mxnet code to make
that sketchy to do. I’d suggest filing an issue with the llvm openmp project,
just like you’d file for any other not-well-understood behavior in mxnet.

The lack of root-causing of the problem and the knee-jerk solution here make
me uncomfortable.

if you want to see performance differences, there’s an environment variable
you can set in the mxnet omp tuning code that will print overhead and
execution times for the current omp library.







On Thu, Nov 22, 2018 at 7:12 AM Anton Chernov  wrote:

> Hi Chris,
>
> Thank you for your answer. If you have noticed the initial email comes from
> me, Anton Chernov (@lebeg on Github) and thus the proposal is not from any
> 'Ci' team that you've mentioned, but from me personally.
>
> You are writing:
>
> > someone is doing something unhealthy when they fork ...
>
> I'm missing any context to understand what you mean.
>
> > we get a lot of performance gain from OMP ...
>
> There is no data that would prove this statement and therefore it is a
> random guess.
>
> > in many months, no investigation has occurred as to WHY the assertion is
> failing.
>
> The investigation has concluded that this is happening due to undefined
> behaviour which is, in my opinion, a suffient answer that does not require
> to go any deeper.
>
> > the pr is vetoed until such a time that the actual root cause of the
> problem is known.
>
> And considering the statements above there is no valid reason to veto the
> PR.
>
>
> Best
> Anton
>
> Thu, 22 Nov 2018 at 15:38, Chris Olivier :
>
> > 3x less overhead*
> >
> > On Thu, Nov 22, 2018 at 6:25 AM Chris Olivier 
> > wrote:
> >
> > > someone is doing something unhealthy when they fork, which is causing
> an
> > > assertion in the openmp library. the same assertion that would fire in
> > mkl,
> > > which is linked to libiomp5 (exact same omp library). this is new
> > behavior
> > > and most likely due to an error or suboptimal approach in the forking
> > logic
> > > in mxnet.
> > >
> > > in order to circumvent the assert, the Ci team is proposing to remove
> the
> > > library completely which is equivalent to cutting off your leg to make
> > the
> > > pain from stubbing your toe go away.
> > >
> > > we get a lot of performance gain from OMP. is has about a 1/3 less
> > > overhead for entering omp regions and also supports omp regions after a
> > > fork, which libgomp does not.
> > >
> > > in many months, no investigation has occurred as to WHY the assertion
> is
> > > failing.
> > >
> > > the pr is vetoed until such a time that the actual root cause of the
> > > problem is known.
> > >
> > >
> > > thanks,
> > >
> > > -Chris.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 22, 2018 at 4:36 AM Anton Chernov 
> > wrote:
> > >
> > >> Dear MXNet community,
> > >>
> > >> I would like to drive attention to an important issue that is present
> in
> > >> the MXNet CMake build: usage of bundled llvm OpenMP library.
> > >>
> > >> I have opened a PR to remove it:
> > >> https://github.com/apache/incubator-mxnet/pull/12160
> > >>
> > >> The issue was closed, but I am strong in my oppinion that it's the
> right
> > >> thing to do.
> > >>
> > >> *Background*
> > >> If you want to use OpenMP pragmas in your code for parallelization you
> > >> would supply a special flag to the compiler:
> > >>
> > >> - Clang / -fopenmp
> > >> https://openmp.llvm.org/
> > >>
> > >> - GCC / -fopenmp
> > >> https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html
> > >>
> > >> - Intel / [Q]openmp
> > >>
> > >>
> >
> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
> > >>
> > >> - Visual Studio: /openmp (Enable OpenMP 2.0 Support)
> > >> 

Re: [Discussion] MXNet CMake build - raise minimal required version

2018-11-22 Thread Anton Chernov
You can find relevant information regarding the profiling flag here:

https://github.com/apache/incubator-mxnet/issues/11563
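
For illustration only: a minimal CMake sketch of how the profiler could be
made an opt-in feature. The option name USE_PROFILER, the define
MXNET_USE_PROFILER and the file names below are assumptions for the example,
not necessarily what the MXNet build actually uses.

cmake_minimum_required(VERSION 3.10)
project(profiler_example LANGUAGES CXX)

# Off by default: release builds do not carry the profiler hooks unless asked.
option(USE_PROFILER "Compile the engine profiler into the library" OFF)

add_library(example_engine src/engine.cc)
if(USE_PROFILER)
  # The profiler code paths are only compiled when explicitly requested,
  # e.g. with: cmake -DUSE_PROFILER=ON ..
  target_compile_definitions(example_engine PRIVATE MXNET_USE_PROFILER=1)
endif()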


Thu, 22 Nov 2018 at 16:06, Chris Olivier :

> what is meant by:
>
>
> *Profiling*
> The profiler is always on even for production release builds, because MXNet
> can not be build without it [2].  ?
>
> you mean it is always built or it is turned on (ie recording and saving
> profiling information)?  I am not aware of it being turned on by default.
>
>
> profiler has no overhead when built in but not turned on.
>
>
> On Thu, Nov 22, 2018 at 2:35 AM Anton Chernov  wrote:
>
> > Dear MXNet community,
> >
> > I propose to raise the minimal required cmake version that is needed to
> > build MXNet to 3.10 which was tagged on March 16 2018 [1].
> >
> > The effort of repairing cmake scripts in general is targeting to
> deprecate
> > make and maintain only 1 build system.
> >
> > *Need*
> >
> > The build system is the foundation of every software project. It's
> quality
> > is directly impacting the quality of the project. The MXNet build system
> is
> > fragile, partially broken and not maintained.
> >
> > Users of MXNet and developers are confused by the fact that 2 build
> systems
> > exist at the same time: make and CMake.
> >
> > The main functional areas which are impacted by the current state of the
> > cmake files are:
> >
> > *OpenMP*
> > The current CMake files mix OpenMP libraries from different compliers
> which
> > is undefined behaviour. It leads to indeterministic crashes on some
> > platforms. Build and deployment are very hard. No evidence exists that
> > proves that there is any benefit of having llvm OpenMP library as a
> > submodule in MXNet.
> >
> > *BLAS and LAPACK*
> > Basic math library usage is mixed up. It is hard and confusing to
> configure
> > and the choosing logic of the most optimal library is not present. MKL
> and
> > OpenBLAS are intermixed in an unpredictable manner.
> >
> > *Profiling*
> > The profiler is always on even for production release builds, because
> MXNet
> > can not be build without it [2].
> >
> > *CUDA*
> > CUDA is detected by 3 different files in the current cmake scripts and
> the
> > choice of those is based on a obscure logic with involves different
> > versions of cmake and platforms which it's building on
> >
> > * CMakeLists.txt
> > * cmake/FirstClassLangCuda.cmake
> > * 3rdparty/mshadow/cmake/Cuda.cmake
> >
> >
> > *Confusing and misleading cmake user options*
> > For example, USE_CUDA / USE_OLDCMAKECUDA. Some of them will do or not do
> > what they supposed to based on cmake generator version and version of
> cmake
> > [3].
> > There are currently more than 30 build parameters for MXNet none of them
> > documented. Some of them not even located in the main CMakeLists.txt
> file,
> > for example 'BLAS'.
> >
> >
> > *Issues*
> > There is a significant amount of github issues related to cmake or build
> in
> > general. New tickets are issued frequently.
> >
> > * #8702 (https://github.com/apache/incubator-mxnet/issues/8702)
> >  [DISCUSSION] Should we deprecate Makefile and only use CMake?
> > * #5079 (https://github.com/apache/incubator-mxnet/issues/5079)
>  troubles
> > building python interface on raspberry pi 3
> > * #1722 (https://github.com/apache/incubator-mxnet/issues/1722)
>  problem:
> > compile mxnet with hdfs
> > * #11549 (https://github.com/apache/incubator-mxnet/issues/11549) Pip
> > package can be much faster (OpenCV version?)
> > * #11417 (https://github.com/apache/incubator-mxnet/issues/11417)
> > libomp.so
> > dependency (need REAL fix)
> > * #8532 (https://github.com/apache/incubator-mxnet/issues/8532)
> >  mxnet-mkl
> > (v0.12.0) crash when using (conda-installed) numpy with MKL //
> (indirectly)
> > * #11131 (https://github.com/apache/incubator-mxnet/issues/11131)
> > mxnet-cu92 low efficiency  // (indirectly)
> > * #10743 (https://github.com/apache/incubator-mxnet/issues/10743) CUDA
> > 9.1.xx failed if not set OLDCMAKECUDA on cmake 3.10.3 with unix makefile
> or
> > Ninja generator
> > * #10742 (https://github.com/apache/incubator-mxnet/issues/10742) typo
> in
> > cpp-package/CMakeLists.txt
> > * #10737 (https://github.com/apache/incubator-mxnet/issues/10737) Cmake
> is
> > running again when execute make install
> > * #10543 (https://github.com/apache/incubator-mxnet/issues/10543) Failed
> > to
> > build from source when set USE_CPP_PACKAGE = 1, fatal error C1083: unabel
> > to open file: “mxnet-cpp/op.h”: No such file or directory
> > * #10217 (https://github.com/apache/incubator-mxnet/issues/10217)
> Building
> > with OpenCV causes link errors
> > * #10175 (https://github.com/apache/incubator-mxnet/issues/10175) MXNet
> > MKLDNN build dependency/flow discussion
> > * #10009 (https://github.com/apache/incubator-mxnet/issues/10009)
> > [CMAKE][IoT] Remove pthread from android_arm64 build
> > * #9944 (https://github.com/apache/incubator-mxnet/issues/9944)   MXNet
> > MinGW-w64 build error // (indirectly)
> > 

Re: [Discussion] Remove bundled llvm OpenMP

2018-11-22 Thread Alfredo Luque
The proposal here is not to eliminate the use of OpenMP but rather to use
the compiler's OpenMP implementation rather than a bundled one. I've been
bitten by issues with having multiple linked OpenMP implementations before
in another library and it was extremely difficult to debug.


It seems to me that tackling the issue with the assert is an orthogonal
issue altogether.

--Alfredo Luque

Software Engineer
Airbnb
Machine Learning Infrastructure
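
To make the alternative concrete: a minimal CMake sketch, with a hypothetical
target and source file, of linking against the OpenMP runtime that belongs to
the active compiler instead of a bundled copy. The OpenMP::OpenMP_CXX
imported target has been available since CMake 3.9, which also ties into the
minimal-version discussion elsewhere in this thread.

cmake_minimum_required(VERSION 3.10)
project(omp_example LANGUAGES CXX)

# Resolves to the runtime shipped with the compiler in use:
# libgomp for GCC, libomp for Clang, libiomp5 for ICC.
find_package(OpenMP REQUIRED)

add_library(example_ops src/ops.cc)
target_link_libraries(example_ops PRIVATE OpenMP::OpenMP_CXX)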

On Thu, Nov 22, 2018 at 10:12 AM Anton Chernov  wrote:

> Hi Chris,
>
> Thank you for your answer. If you have noticed the initial email comes from
> me, Anton Chernov (@lebeg on Github) and thus the proposal is not from any
> 'Ci' team that you've mentioned, but from me personally.
>
> You are writing:
>
> > someone is doing something unhealthy when they fork ...
>
> I'm missing any context to understand what you mean.
>
> > we get a lot of performance gain from OMP ...
>
> There is no data that would prove this statement and therefore it is a
> random guess.
>
> > in many months, no investigation has occurred as to WHY the assertion is
> failing.
>
> The investigation has concluded that this is happening due to undefined
> behaviour which is, in my opinion, a suffient answer that does not require
> to go any deeper.
>
> > the pr is vetoed until such a time that the actual root cause of the
> problem is known.
>
> And considering the statements above there is no valid reason to veto the
> PR.
>
>
> Best
> Anton
>
> Thu, 22 Nov 2018 at 15:38, Chris Olivier :
>
> > 3x less overhead*
> >
> > On Thu, Nov 22, 2018 at 6:25 AM Chris Olivier 
> > wrote:
> >
> > > someone is doing something unhealthy when they fork, which is causing
> an
> > > assertion in the openmp library. the same assertion that would fire in
> > mkl,
> > > which is linked to libiomp5 (exact same omp library). this is new
> > behavior
> > > and most likely due to an error or suboptimal approach in the forking
> > logic
> > > in mxnet.
> > >
> > > in order to circumvent the assert, the Ci team is proposing to remove
> the
> > > library completely which is equivalent to cutting off your leg to make
> > the
> > > pain from stubbing your toe go away.
> > >
> > > we get a lot of performance gain from OMP. is has about a 1/3 less
> > > overhead for entering omp regions and also supports omp regions after a
> > > fork, which libgomp does not.
> > >
> > > in many months, no investigation has occurred as to WHY the assertion
> is
> > > failing.
> > >
> > > the pr is vetoed until such a time that the actual root cause of the
> > > problem is known.
> > >
> > >
> > > thanks,
> > >
> > > -Chris.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 22, 2018 at 4:36 AM Anton Chernov 
> > wrote:
> > >
> > >> Dear MXNet community,
> > >>
> > >> I would like to drive attention to an important issue that is present
> in
> > >> the MXNet CMake build: usage of bundled llvm OpenMP library.
> > >>
> > >> I have opened a PR to remove it:
> > >> https://github.com/apache/incubator-mxnet/pull/12160
> > >>
> > >> The issue was closed, but I am strong in my oppinion that it's the
> right
> > >> thing to do.
> > >>
> > >> *Background*
> > >> If you want to use OpenMP pragmas in your code for parallelization you
> > >> would supply a special flag to the compiler:
> > >>
> > >> - Clang / -fopenmp
> > >> https://openmp.llvm.org/
> > >>
> > >> - GCC / -fopenmp
> > >> https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html
> > >>
> > >> - Intel / [Q]openmp
> > >>
> > >>
> >
> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
> > >>
> > >> - Visual Studio: /openmp (Enable OpenMP 2.0 Support)
> > >> https://msdn.microsoft.com/en-us/library/tt15eb9t.aspx
> > >>
> > >> Each of the compilers would enable the '#pragma omp' directive during
> > >> C/C++
> > >> compilation and arrange for automatic linking of the OpenMP runtime
> > >> library
> > >> supplied by each complier separately.
> > >>
> > >> Thus, to use the advantages of an OpenMP implementation one has to
> > compile
> > >> the code with the corresponding compiler.
> > >>
> > >> Currently, in MXNet CMake build scripts a bundled version of llvm
> OpenMP
> > >> is
> > >> used ([1] and [2]) to replace the OpenMP library supplied by the
> > compiler.
> > >>
> > >> I will quote here the README from the MKL-DNN (Intel(R) Math Kernel
> > >> Library
> > >> for Deep Neural Networks):
> > >>
> > >> "Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP
> > runtime
> > >> library to work. As different OpenMP runtimes may not be binary
> > compatible
> > >> it's important to ensure that only one OpenMP runtime is used
> throughout
> > >> the application. Having more than one OpenMP runtime initialized may
> > lead
> > >> to undefined behavior resulting in incorrect results or crashes." [3]
> > >>
> > >> And:
> > >>
> > >> "Using GNU compiler with -fopenmp and -liomp5 options will link the
> > >> application with both Intel and GNU OpenMP runtime 

Re: [Discussion] Remove bundled llvm OpenMP

2018-11-22 Thread Anton Chernov
Hi Chris,

Thank you for your answer. As you may have noticed, the initial email comes
from me, Anton Chernov (@lebeg on GitHub), and thus the proposal is not from
any 'CI' team that you've mentioned, but from me personally.

You are writing:

> someone is doing something unhealthy when they fork ...

I'm missing any context to understand what you mean.

> we get a lot of performance gain from OMP ...

There is no data that would prove this statement and therefore it is a
random guess.

> in many months, no investigation has occurred as to WHY the assertion is
failing.

The investigation has concluded that this is happening due to undefined
behaviour, which is, in my opinion, a sufficient answer that does not require
going any deeper.

> the pr is vetoed until such a time that the actual root cause of the
problem is known.

And considering the statements above there is no valid reason to veto the
PR.


Best
Anton

Thu, 22 Nov 2018 at 15:38, Chris Olivier :

> 3x less overhead*
>
> On Thu, Nov 22, 2018 at 6:25 AM Chris Olivier 
> wrote:
>
> > someone is doing something unhealthy when they fork, which is causing an
> > assertion in the openmp library. the same assertion that would fire in
> mkl,
> > which is linked to libiomp5 (exact same omp library). this is new
> behavior
> > and most likely due to an error or suboptimal approach in the forking
> logic
> > in mxnet.
> >
> > in order to circumvent the assert, the Ci team is proposing to remove the
> > library completely which is equivalent to cutting off your leg to make
> the
> > pain from stubbing your toe go away.
> >
> > we get a lot of performance gain from OMP. is has about a 1/3 less
> > overhead for entering omp regions and also supports omp regions after a
> > fork, which libgomp does not.
> >
> > in many months, no investigation has occurred as to WHY the assertion is
> > failing.
> >
> > the pr is vetoed until such a time that the actual root cause of the
> > problem is known.
> >
> >
> > thanks,
> >
> > -Chris.
> >
> >
> >
> >
> > On Thu, Nov 22, 2018 at 4:36 AM Anton Chernov 
> wrote:
> >
> >> Dear MXNet community,
> >>
> >> I would like to drive attention to an important issue that is present in
> >> the MXNet CMake build: usage of bundled llvm OpenMP library.
> >>
> >> I have opened a PR to remove it:
> >> https://github.com/apache/incubator-mxnet/pull/12160
> >>
> >> The issue was closed, but I am strong in my oppinion that it's the right
> >> thing to do.
> >>
> >> *Background*
> >> If you want to use OpenMP pragmas in your code for parallelization you
> >> would supply a special flag to the compiler:
> >>
> >> - Clang / -fopenmp
> >> https://openmp.llvm.org/
> >>
> >> - GCC / -fopenmp
> >> https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html
> >>
> >> - Intel / [Q]openmp
> >>
> >>
> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
> >>
> >> - Visual Studio: /openmp (Enable OpenMP 2.0 Support)
> >> https://msdn.microsoft.com/en-us/library/tt15eb9t.aspx
> >>
> >> Each of the compilers would enable the '#pragma omp' directive during
> >> C/C++
> >> compilation and arrange for automatic linking of the OpenMP runtime
> >> library
> >> supplied by each complier separately.
> >>
> >> Thus, to use the advantages of an OpenMP implementation one has to
> compile
> >> the code with the corresponding compiler.
> >>
> >> Currently, in MXNet CMake build scripts a bundled version of llvm OpenMP
> >> is
> >> used ([1] and [2]) to replace the OpenMP library supplied by the
> compiler.
> >>
> >> I will quote here the README from the MKL-DNN (Intel(R) Math Kernel
> >> Library
> >> for Deep Neural Networks):
> >>
> >> "Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP
> runtime
> >> library to work. As different OpenMP runtimes may not be binary
> compatible
> >> it's important to ensure that only one OpenMP runtime is used throughout
> >> the application. Having more than one OpenMP runtime initialized may
> lead
> >> to undefined behavior resulting in incorrect results or crashes." [3]
> >>
> >> And:
> >>
> >> "Using GNU compiler with -fopenmp and -liomp5 options will link the
> >> application with both Intel and GNU OpenMP runtime libraries. This will
> >> lead to undefined behavior of the application." [4]
> >>
> >> As can be seen from ldd for MXNet:
> >>
> >> $ ldd build/tests/mxnet_unit_tests | grep omp
> >> libomp.so => /.../mxnet/build/3rdparty/openmp/runtime/src/libomp.so
> >> (0x7f697bc55000)
> >> libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> >> (0x7f69660cd000)
> >>
> >> *Performance*
> >>
> >> The only performance data related to OpenMP in MXNet I was able to find
> is
> >> here:
> >>
> >>
> https://github.com/apache/incubator-mxnet/issues/9744#issuecomment-367711172
> >>
> >> Which in my understanding is testing imact of different environment
> >> variables for the same setup (using same bundled OpenMP library).
> >>
> >> The libraries may differ in implementation and 

Re: [Discussion] Remove bundled llvm OpenMP

2018-11-22 Thread Chris Olivier
3x less overhead*

On Thu, Nov 22, 2018 at 6:25 AM Chris Olivier  wrote:

> someone is doing something unhealthy when they fork, which is causing an
> assertion in the openmp library. the same assertion that would fire in mkl,
> which is linked to libiomp5 (exact same omp library). this is new behavior
> and most likely due to an error or suboptimal approach in the forking logic
> in mxnet.
>
> in order to circumvent the assert, the Ci team is proposing to remove the
> library completely which is equivalent to cutting off your leg to make the
> pain from stubbing your toe go away.
>
> we get a lot of performance gain from OMP. is has about a 1/3 less
> overhead for entering omp regions and also supports omp regions after a
> fork, which libgomp does not.
>
> in many months, no investigation has occurred as to WHY the assertion is
> failing.
>
> the pr is vetoed until such a time that the actual root cause of the
> problem is known.
>
>
> thanks,
>
> -Chris.
>
>
>
>
> On Thu, Nov 22, 2018 at 4:36 AM Anton Chernov  wrote:
>
>> Dear MXNet community,
>>
>> I would like to drive attention to an important issue that is present in
>> the MXNet CMake build: usage of bundled llvm OpenMP library.
>>
>> I have opened a PR to remove it:
>> https://github.com/apache/incubator-mxnet/pull/12160
>>
>> The issue was closed, but I am strong in my oppinion that it's the right
>> thing to do.
>>
>> *Background*
>> If you want to use OpenMP pragmas in your code for parallelization you
>> would supply a special flag to the compiler:
>>
>> - Clang / -fopenmp
>> https://openmp.llvm.org/
>>
>> - GCC / -fopenmp
>> https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html
>>
>> - Intel / [Q]openmp
>>
>> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
>>
>> - Visual Studio: /openmp (Enable OpenMP 2.0 Support)
>> https://msdn.microsoft.com/en-us/library/tt15eb9t.aspx
>>
>> Each of the compilers would enable the '#pragma omp' directive during
>> C/C++
>> compilation and arrange for automatic linking of the OpenMP runtime
>> library
>> supplied by each complier separately.
>>
>> Thus, to use the advantages of an OpenMP implementation one has to compile
>> the code with the corresponding compiler.
>>
>> Currently, in MXNet CMake build scripts a bundled version of llvm OpenMP
>> is
>> used ([1] and [2]) to replace the OpenMP library supplied by the compiler.
>>
>> I will quote here the README from the MKL-DNN (Intel(R) Math Kernel
>> Library
>> for Deep Neural Networks):
>>
>> "Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP runtime
>> library to work. As different OpenMP runtimes may not be binary compatible
>> it's important to ensure that only one OpenMP runtime is used throughout
>> the application. Having more than one OpenMP runtime initialized may lead
>> to undefined behavior resulting in incorrect results or crashes." [3]
>>
>> And:
>>
>> "Using GNU compiler with -fopenmp and -liomp5 options will link the
>> application with both Intel and GNU OpenMP runtime libraries. This will
>> lead to undefined behavior of the application." [4]
>>
>> As can be seen from ldd for MXNet:
>>
>> $ ldd build/tests/mxnet_unit_tests | grep omp
>> libomp.so => /.../mxnet/build/3rdparty/openmp/runtime/src/libomp.so
>> (0x7f697bc55000)
>> libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
>> (0x7f69660cd000)
>>
>> *Performance*
>>
>> The only performance data related to OpenMP in MXNet I was able to find is
>> here:
>>
>> https://github.com/apache/incubator-mxnet/issues/9744#issuecomment-367711172
>>
>> Which in my understanding is testing imact of different environment
>> variables for the same setup (using same bundled OpenMP library).
>>
>> The libraries may differ in implementation and the Thread Affinity
>> Interface [5] may have significant impact on performance.
>>
>> All compliers support it:
>>
>> - Clang / KMP_AFFINITY
>>
>> https://github.com/clang-ykt/openmp/blob/master/runtime/src/kmp_affinity.cpp
>>
>> - GCC / GOMP_CPU_AFFINITY
>>
>> https://gcc.gnu.org/onlinedocs/gcc-4.7.1/libgomp/GOMP_005fCPU_005fAFFINITY.html
>>
>> - Intel / KMP_AFFINITY
>>
>> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
>>
>> - Visual Studio / SetThreadAffinityMask
>>
>> https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-setthreadaffinitymask
>>
>> *Issues*
>>
>> Failed OpenMP assertion when loading MXNet compiled with DEBUG=1
>> https://github.com/apache/incubator-mxnet/issues/10856
>>
>> libomp.so dependency (need REAL fix)
>> https://github.com/apache/incubator-mxnet/issues/11417
>>
>> mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL
>> https://github.com/apache/incubator-mxnet/issues/8532
>>
>> Performance regression when OMP_NUM_THREADS environment variable is not
>> set
>> https://github.com/apache/incubator-mxnet/issues/9744
>>
>> Poor concat CPU performance on CUDA builds
>> 

Re: [Discussion] MXNet CMake build - raise minimal required version

2018-11-22 Thread Chris Olivier
i have not seen any proof that any crashes are due to llvm openmp usage.

On Thu, Nov 22, 2018 at 2:35 AM Anton Chernov  wrote:

> Dear MXNet community,
>
> I propose to raise the minimal required cmake version that is needed to
> build MXNet to 3.10 which was tagged on March 16 2018 [1].
>
> The effort of repairing cmake scripts in general is targeting to deprecate
> make and maintain only 1 build system.
>
> *Need*
>
> The build system is the foundation of every software project. It's quality
> is directly impacting the quality of the project. The MXNet build system is
> fragile, partially broken and not maintained.
>
> Users of MXNet and developers are confused by the fact that 2 build systems
> exist at the same time: make and CMake.
>
> The main functional areas which are impacted by the current state of the
> cmake files are:
>
> *OpenMP*
> The current CMake files mix OpenMP libraries from different compliers which
> is undefined behaviour. It leads to indeterministic crashes on some
> platforms. Build and deployment are very hard. No evidence exists that
> proves that there is any benefit of having llvm OpenMP library as a
> submodule in MXNet.
>
> *BLAS and LAPACK*
> Basic math library usage is mixed up. It is hard and confusing to configure
> and the choosing logic of the most optimal library is not present. MKL and
> OpenBLAS are intermixed in an unpredictable manner.
>
> *Profiling*
> The profiler is always on even for production release builds, because MXNet
> can not be build without it [2].
>
> *CUDA*
> CUDA is detected by 3 different files in the current cmake scripts and the
> choice of those is based on a obscure logic with involves different
> versions of cmake and platforms which it's building on
>
> * CMakeLists.txt
> * cmake/FirstClassLangCuda.cmake
> * 3rdparty/mshadow/cmake/Cuda.cmake
>
>
> *Confusing and misleading cmake user options*
> For example, USE_CUDA / USE_OLDCMAKECUDA. Some of them will do or not do
> what they supposed to based on cmake generator version and version of cmake
> [3].
> There are currently more than 30 build parameters for MXNet none of them
> documented. Some of them not even located in the main CMakeLists.txt file,
> for example 'BLAS'.
>
>
> *Issues*
> There is a significant amount of github issues related to cmake or build in
> general. New tickets are issued frequently.
>
> * #8702 (https://github.com/apache/incubator-mxnet/issues/8702)
>  [DISCUSSION] Should we deprecate Makefile and only use CMake?
> * #5079 (https://github.com/apache/incubator-mxnet/issues/5079)   troubles
> building python interface on raspberry pi 3
> * #1722 (https://github.com/apache/incubator-mxnet/issues/1722)   problem:
> compile mxnet with hdfs
> * #11549 (https://github.com/apache/incubator-mxnet/issues/11549) Pip
> package can be much faster (OpenCV version?)
> * #11417 (https://github.com/apache/incubator-mxnet/issues/11417)
> libomp.so
> dependency (need REAL fix)
> * #8532 (https://github.com/apache/incubator-mxnet/issues/8532)
>  mxnet-mkl
> (v0.12.0) crash when using (conda-installed) numpy with MKL // (indirectly)
> * #11131 (https://github.com/apache/incubator-mxnet/issues/11131)
> mxnet-cu92 low efficiency  // (indirectly)
> * #10743 (https://github.com/apache/incubator-mxnet/issues/10743) CUDA
> 9.1.xx failed if not set OLDCMAKECUDA on cmake 3.10.3 with unix makefile or
> Ninja generator
> * #10742 (https://github.com/apache/incubator-mxnet/issues/10742) typo in
> cpp-package/CMakeLists.txt
> * #10737 (https://github.com/apache/incubator-mxnet/issues/10737) Cmake is
> running again when execute make install
> * #10543 (https://github.com/apache/incubator-mxnet/issues/10543) Failed
> to
> build from source when set USE_CPP_PACKAGE = 1, fatal error C1083: unabel
> to open file: “mxnet-cpp/op.h”: No such file or directory
> * #10217 (https://github.com/apache/incubator-mxnet/issues/10217) Building
> with OpenCV causes link errors
> * #10175 (https://github.com/apache/incubator-mxnet/issues/10175) MXNet
> MKLDNN build dependency/flow discussion
> * #10009 (https://github.com/apache/incubator-mxnet/issues/10009)
> [CMAKE][IoT] Remove pthread from android_arm64 build
> * #9944 (https://github.com/apache/incubator-mxnet/issues/9944)   MXNet
> MinGW-w64 build error // (indirectly)
> * #9868 (https://github.com/apache/incubator-mxnet/issues/9868)   MKL and
> CMake
> * #9516 (https://github.com/apache/incubator-mxnet/issues/9516)   cmake
> cuda arch issues
> * #9105 (https://github.com/apache/incubator-mxnet/issues/9105)
>  libmxnet.so load path error
> * #9096 (https://github.com/apache/incubator-mxnet/issues/9096)   MXNet
> built with GPerftools crashes
> * #8786 (https://github.com/apache/incubator-mxnet/issues/8786)   Link
> failure on DEBUG=1 (static member symbol not defined) // (indirectly)
> * #8729 (https://github.com/apache/incubator-mxnet/issues/8729)   Build
> amalgamation using a docker // (indirectly)
> * #8667 

Re: [Discussion] Remove bundled llvm OpenMP

2018-11-22 Thread Chris Olivier
someone is doing something unhealthy when they fork, which is causing an
assertion in the openmp library. the same assertion that would fire in mkl,
which is linked to libiomp5 (exact same omp library). this is new behavior
and most likely due to an error or suboptimal approach in the forking logic
in mxnet.

in order to circumvent the assert, the CI team is proposing to remove the
library completely, which is equivalent to cutting off your leg to make the
pain from stubbing your toe go away.

we get a lot of performance gain from OMP. it has about 1/3 less overhead
for entering omp regions and also supports omp regions after a fork, which
libgomp does not.

in many months, no investigation has occurred as to WHY the assertion is
failing.

the pr is vetoed until such a time that the actual root cause of the
problem is known.


thanks,

-Chris.




On Thu, Nov 22, 2018 at 4:36 AM Anton Chernov  wrote:

> Dear MXNet community,
>
> I would like to drive attention to an important issue that is present in
> the MXNet CMake build: usage of bundled llvm OpenMP library.
>
> I have opened a PR to remove it:
> https://github.com/apache/incubator-mxnet/pull/12160
>
> The issue was closed, but I am strong in my oppinion that it's the right
> thing to do.
>
> *Background*
> If you want to use OpenMP pragmas in your code for parallelization you
> would supply a special flag to the compiler:
>
> - Clang / -fopenmp
> https://openmp.llvm.org/
>
> - GCC / -fopenmp
> https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html
>
> - Intel / [Q]openmp
>
> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
>
> - Visual Studio: /openmp (Enable OpenMP 2.0 Support)
> https://msdn.microsoft.com/en-us/library/tt15eb9t.aspx
>
> Each of the compilers would enable the '#pragma omp' directive during C/C++
> compilation and arrange for automatic linking of the OpenMP runtime library
> supplied by each complier separately.
>
> Thus, to use the advantages of an OpenMP implementation one has to compile
> the code with the corresponding compiler.
>
> Currently, in MXNet CMake build scripts a bundled version of llvm OpenMP is
> used ([1] and [2]) to replace the OpenMP library supplied by the compiler.
>
> I will quote here the README from the MKL-DNN (Intel(R) Math Kernel Library
> for Deep Neural Networks):
>
> "Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP runtime
> library to work. As different OpenMP runtimes may not be binary compatible
> it's important to ensure that only one OpenMP runtime is used throughout
> the application. Having more than one OpenMP runtime initialized may lead
> to undefined behavior resulting in incorrect results or crashes." [3]
>
> And:
>
> "Using GNU compiler with -fopenmp and -liomp5 options will link the
> application with both Intel and GNU OpenMP runtime libraries. This will
> lead to undefined behavior of the application." [4]
>
> As can be seen from ldd for MXNet:
>
> $ ldd build/tests/mxnet_unit_tests | grep omp
> libomp.so => /.../mxnet/build/3rdparty/openmp/runtime/src/libomp.so
> (0x7f697bc55000)
> libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
> (0x7f69660cd000)
>
> *Performance*
>
> The only performance data related to OpenMP in MXNet I was able to find is
> here:
>
> https://github.com/apache/incubator-mxnet/issues/9744#issuecomment-367711172
>
> Which in my understanding is testing imact of different environment
> variables for the same setup (using same bundled OpenMP library).
>
> The libraries may differ in implementation and the Thread Affinity
> Interface [5] may have significant impact on performance.
>
> All compliers support it:
>
> - Clang / KMP_AFFINITY
>
> https://github.com/clang-ykt/openmp/blob/master/runtime/src/kmp_affinity.cpp
>
> - GCC / GOMP_CPU_AFFINITY
>
> https://gcc.gnu.org/onlinedocs/gcc-4.7.1/libgomp/GOMP_005fCPU_005fAFFINITY.html
>
> - Intel / KMP_AFFINITY
>
> https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1
>
> - Visual Studio / SetThreadAffinityMask
>
> https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-setthreadaffinitymask
>
> *Issues*
>
> Failed OpenMP assertion when loading MXNet compiled with DEBUG=1
> https://github.com/apache/incubator-mxnet/issues/10856
>
> libomp.so dependency (need REAL fix)
> https://github.com/apache/incubator-mxnet/issues/11417
>
> mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL
> https://github.com/apache/incubator-mxnet/issues/8532
>
> Performance regression when OMP_NUM_THREADS environment variable is not set
> https://github.com/apache/incubator-mxnet/issues/9744
>
> Poor concat CPU performance on CUDA builds
> https://github.com/apache/incubator-mxnet/issues/11905
>
> I would appreciate hearing your thoughts.
>
>
> Best
> Anton
>
> [1]
>
> https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L400-L405
> [2] https://github.com/apache/incubator-mxnet/tree/master/3rdparty

RE: [Discussion] Remove bundled llvm OpenMP

2018-11-22 Thread Lv, Tao A
Thanks for the great summary, Anton. I'm curious: has anybody managed to
build mxnet successfully with ICC/ICPC?

-Original Message-
From: Anton Chernov [mailto:mecher...@gmail.com] 
Sent: Thursday, November 22, 2018 8:36 PM
To: d...@mxnet.apache.org
Subject: [Discussion] Remove bundled llvm OpenMP

Dear MXNet community,

I would like to drive attention to an important issue that is present in the 
MXNet CMake build: usage of bundled llvm OpenMP library.

I have opened a PR to remove it:
https://github.com/apache/incubator-mxnet/pull/12160

The issue was closed, but I am strong in my oppinion that it's the right thing 
to do.

*Background*
If you want to use OpenMP pragmas in your code for parallelization you would 
supply a special flag to the compiler:

- Clang / -fopenmp
https://openmp.llvm.org/

- GCC / -fopenmp
https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html

- Intel / [Q]openmp
https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1

- Visual Studio: /openmp (Enable OpenMP 2.0 Support) 
https://msdn.microsoft.com/en-us/library/tt15eb9t.aspx

Each of the compilers would enable the '#pragma omp' directive during C/C++ 
compilation and arrange for automatic linking of the OpenMP runtime library 
supplied by each complier separately.

Thus, to use the advantages of an OpenMP implementation one has to compile the 
code with the corresponding compiler.

Currently, in MXNet CMake build scripts a bundled version of llvm OpenMP is 
used ([1] and [2]) to replace the OpenMP library supplied by the compiler.

I will quote here the README from the MKL-DNN (Intel(R) Math Kernel Library for 
Deep Neural Networks):

"Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP runtime 
library to work. As different OpenMP runtimes may not be binary compatible it's 
important to ensure that only one OpenMP runtime is used throughout the 
application. Having more than one OpenMP runtime initialized may lead to 
undefined behavior resulting in incorrect results or crashes." [3]

And:

"Using GNU compiler with -fopenmp and -liomp5 options will link the application 
with both Intel and GNU OpenMP runtime libraries. This will lead to undefined 
behavior of the application." [4]

As can be seen from ldd for MXNet:

$ ldd build/tests/mxnet_unit_tests | grep omp
libomp.so => /.../mxnet/build/3rdparty/openmp/runtime/src/libomp.so
(0x7f697bc55000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x7f69660cd000)

*Performance*

The only performance data related to OpenMP in MXNet I was able to find is
here:
https://github.com/apache/incubator-mxnet/issues/9744#issuecomment-367711172

Which in my understanding is testing imact of different environment variables 
for the same setup (using same bundled OpenMP library).

The libraries may differ in implementation and the Thread Affinity Interface 
[5] may have significant impact on performance.

All compliers support it:

- Clang / KMP_AFFINITY
https://github.com/clang-ykt/openmp/blob/master/runtime/src/kmp_affinity.cpp

- GCC / GOMP_CPU_AFFINITY
https://gcc.gnu.org/onlinedocs/gcc-4.7.1/libgomp/GOMP_005fCPU_005fAFFINITY.html

- Intel / KMP_AFFINITY
https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1

- Visual Studio / SetThreadAffinityMask
https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-setthreadaffinitymask

*Issues*

Failed OpenMP assertion when loading MXNet compiled with DEBUG=1
https://github.com/apache/incubator-mxnet/issues/10856

libomp.so dependency (need REAL fix)
https://github.com/apache/incubator-mxnet/issues/11417

mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL
https://github.com/apache/incubator-mxnet/issues/8532

Performance regression when OMP_NUM_THREADS environment variable is not set
https://github.com/apache/incubator-mxnet/issues/9744

Poor concat CPU performance on CUDA builds
https://github.com/apache/incubator-mxnet/issues/11905

I would appreciate hearing your thoughts.


Best
Anton

[1]
https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L400-L405
[2] https://github.com/apache/incubator-mxnet/tree/master/3rdparty
[3] https://github.com/intel/mkl-dnn/blame/master/README.md#L261-L265
[4] https://github.com/intel/mkl-dnn/blame/master/README.md#L278-L280
[5] https://software.intel.com/en-us/node/522691


[Discussion] Remove bundled llvm OpenMP

2018-11-22 Thread Anton Chernov
Dear MXNet community,

I would like to draw attention to an important issue that is present in
the MXNet CMake build: the usage of the bundled llvm OpenMP library.

I have opened a PR to remove it:
https://github.com/apache/incubator-mxnet/pull/12160

The PR was closed, but I am firm in my opinion that it's the right
thing to do.

*Background*
If you want to use OpenMP pragmas in your code for parallelization you
would supply a special flag to the compiler:

- Clang / -fopenmp
https://openmp.llvm.org/

- GCC / -fopenmp
https://gcc.gnu.org/onlinedocs/libgomp/Enabling-OpenMP.html

- Intel / [Q]openmp
https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1

- Visual Studio: /openmp (Enable OpenMP 2.0 Support)
https://msdn.microsoft.com/en-us/library/tt15eb9t.aspx

Each of the compilers would enable the '#pragma omp' directive during C/C++
compilation and arrange for automatic linking of the OpenMP runtime library
supplied by each compiler separately.

Thus, to use the advantages of an OpenMP implementation one has to compile
the code with the corresponding compiler.

Currently, in MXNet CMake build scripts a bundled version of llvm OpenMP is
used ([1] and [2]) to replace the OpenMP library supplied by the compiler.

I will quote here the README from the MKL-DNN (Intel(R) Math Kernel Library
for Deep Neural Networks):

"Intel MKL-DNN uses OpenMP* for parallelism and requires an OpenMP runtime
library to work. As different OpenMP runtimes may not be binary compatible
it's important to ensure that only one OpenMP runtime is used throughout
the application. Having more than one OpenMP runtime initialized may lead
to undefined behavior resulting in incorrect results or crashes." [3]

And:

"Using GNU compiler with -fopenmp and -liomp5 options will link the
application with both Intel and GNU OpenMP runtime libraries. This will
lead to undefined behavior of the application." [4]

As can be seen from ldd for MXNet:

$ ldd build/tests/mxnet_unit_tests | grep omp
libomp.so => /.../mxnet/build/3rdparty/openmp/runtime/src/libomp.so
(0x7f697bc55000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1
(0x7f69660cd000)

*Performance*

The only performance data related to OpenMP in MXNet I was able to find is
here:
https://github.com/apache/incubator-mxnet/issues/9744#issuecomment-367711172

This, in my understanding, is testing the impact of different environment
variables on the same setup (using the same bundled OpenMP library).

The libraries may differ in implementation and the Thread Affinity
Interface [5] may have significant impact on performance.

All compilers support it:

- Clang / KMP_AFFINITY
https://github.com/clang-ykt/openmp/blob/master/runtime/src/kmp_affinity.cpp

- GCC / GOMP_CPU_AFFINITY
https://gcc.gnu.org/onlinedocs/gcc-4.7.1/libgomp/GOMP_005fCPU_005fAFFINITY.html

- Intel / KMP_AFFINITY
https://software.intel.com/en-us/node/522689#6E24682E-F411-4AE3-A04D-ECD81C7008D1

- Visual Studio / SetThreadAffinityMask
https://docs.microsoft.com/en-us/windows/desktop/api/winbase/nf-winbase-setthreadaffinitymask

*Issues*

Failed OpenMP assertion when loading MXNet compiled with DEBUG=1
https://github.com/apache/incubator-mxnet/issues/10856

libomp.so dependency (need REAL fix)
https://github.com/apache/incubator-mxnet/issues/11417

mxnet-mkl (v0.12.0) crash when using (conda-installed) numpy with MKL
https://github.com/apache/incubator-mxnet/issues/8532

Performance regression when OMP_NUM_THREADS environment variable is not set
https://github.com/apache/incubator-mxnet/issues/9744

Poor concat CPU performance on CUDA builds
https://github.com/apache/incubator-mxnet/issues/11905

I would appreciate hearing your thoughts.


Best
Anton

[1]
https://github.com/apache/incubator-mxnet/blob/master/CMakeLists.txt#L400-L405
[2] https://github.com/apache/incubator-mxnet/tree/master/3rdparty
[3] https://github.com/intel/mkl-dnn/blame/master/README.md#L261-L265
[4] https://github.com/intel/mkl-dnn/blame/master/README.md#L278-L280
[5] https://software.intel.com/en-us/node/522691


[Discussion] MXNet CMake build - raise minimal required version

2018-11-22 Thread Anton Chernov
Dear MXNet community,

I propose to raise the minimal required cmake version that is needed to
build MXNet to 3.10, which was tagged on March 16, 2018 [1].
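
As a sketch, the version bump itself is a single statement at the top of the
root CMakeLists.txt; the project declaration shown here is a placeholder,
not the actual MXNet one.

# Top of the root CMakeLists.txt
cmake_minimum_required(VERSION 3.10)
project(mxnet_example LANGUAGES C CXX)
# Any older CMake stops immediately with a clear error message instead of
# failing later in an obscure way.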

The broader effort of repairing the cmake scripts aims to deprecate
make and maintain only one build system.

*Need*

The build system is the foundation of every software project. Its quality
directly impacts the quality of the project. The MXNet build system is
fragile, partially broken and not maintained.

Users of MXNet and developers are confused by the fact that 2 build systems
exist at the same time: make and CMake.

The main functional areas which are impacted by the current state of the
cmake files are:

*OpenMP*
The current CMake files mix OpenMP libraries from different compilers, which
is undefined behaviour. It leads to nondeterministic crashes on some
platforms. Build and deployment are very hard. No evidence exists that
proves any benefit of having the llvm OpenMP library as a
submodule in MXNet.

*BLAS and LAPACK*
Basic math library usage is mixed up. It is hard and confusing to configure,
and there is no logic for choosing the most optimal library. MKL and
OpenBLAS are intermixed in an unpredictable manner.

*Profiling*
The profiler is always on, even for production release builds, because MXNet
cannot be built without it [2].

*CUDA*
CUDA is detected by 3 different files in the current cmake scripts, and the
choice between them is based on obscure logic which involves different
versions of cmake and the platform on which it is building (see the sketch
below the file list):

* CMakeLists.txt
* cmake/FirstClassLangCuda.cmake
* 3rdparty/mshadow/cmake/Cuda.cmake
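
A minimal sketch of what first-class CUDA support looks like with CMake >=
3.8 (and therefore with the proposed 3.10 minimum); the target and file
names are made up for the example, and a CUDA toolkit is assumed to be
installed.

cmake_minimum_required(VERSION 3.10)
project(cuda_example LANGUAGES CXX CUDA)

# nvcc and the CUDA toolkit are detected by CMake itself, so no FindCUDA,
# FirstClassLangCuda.cmake or mshadow Cuda.cmake logic is required.
add_library(example_gpu_ops src/gpu_ops.cu)
set_target_properties(example_gpu_ops PROPERTIES CUDA_SEPARABLE_COMPILATION ON)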


*Confusing and misleading cmake user options*
For example, USE_CUDA / USE_OLDCMAKECUDA. Some of them will or will not do
what they are supposed to do depending on the cmake generator and the version
of cmake [3].
There are currently more than 30 build parameters for MXNet, none of them
documented. Some of them are not even located in the main CMakeLists.txt
file, for example 'BLAS'.
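
A sketch of how such parameters could at least be documented in place, so
that 'cmake -LH' lists every switch with help text; the option names follow
the existing USE_* convention, but the defaults and the BLAS value list are
illustrative assumptions.

# Every user-facing switch carries a help string and an explicit default.
option(USE_CUDA   "Build with CUDA support"          ON)
option(USE_OPENCV "Build with OpenCV support"        ON)
option(USE_MKLDNN "Build with MKL-DNN acceleration"  OFF)

# Non-boolean parameters become documented cache variables with a
# restricted set of accepted values.
set(BLAS "Open" CACHE STRING "BLAS implementation to link against")
set_property(CACHE BLAS PROPERTY STRINGS "Open" "MKL" "Atlas" "Apple")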


*Issues*
There is a significant number of GitHub issues related to cmake or the build
in general. New tickets are filed frequently.

* #8702 (https://github.com/apache/incubator-mxnet/issues/8702)
 [DISCUSSION] Should we deprecate Makefile and only use CMake?
* #5079 (https://github.com/apache/incubator-mxnet/issues/5079)   troubles
building python interface on raspberry pi 3
* #1722 (https://github.com/apache/incubator-mxnet/issues/1722)   problem:
compile mxnet with hdfs
* #11549 (https://github.com/apache/incubator-mxnet/issues/11549) Pip
package can be much faster (OpenCV version?)
* #11417 (https://github.com/apache/incubator-mxnet/issues/11417) libomp.so
dependency (need REAL fix)
* #8532 (https://github.com/apache/incubator-mxnet/issues/8532)   mxnet-mkl
(v0.12.0) crash when using (conda-installed) numpy with MKL // (indirectly)
* #11131 (https://github.com/apache/incubator-mxnet/issues/11131)
mxnet-cu92 low efficiency  // (indirectly)
* #10743 (https://github.com/apache/incubator-mxnet/issues/10743) CUDA
9.1.xx failed if not set OLDCMAKECUDA on cmake 3.10.3 with unix makefile or
Ninja generator
* #10742 (https://github.com/apache/incubator-mxnet/issues/10742) typo in
cpp-package/CMakeLists.txt
* #10737 (https://github.com/apache/incubator-mxnet/issues/10737) Cmake is
running again when execute make install
* #10543 (https://github.com/apache/incubator-mxnet/issues/10543) Failed to
build from source when set USE_CPP_PACKAGE = 1, fatal error C1083: unabel
to open file: “mxnet-cpp/op.h”: No such file or directory
* #10217 (https://github.com/apache/incubator-mxnet/issues/10217) Building
with OpenCV causes link errors
* #10175 (https://github.com/apache/incubator-mxnet/issues/10175) MXNet
MKLDNN build dependency/flow discussion
* #10009 (https://github.com/apache/incubator-mxnet/issues/10009)
[CMAKE][IoT] Remove pthread from android_arm64 build
* #9944 (https://github.com/apache/incubator-mxnet/issues/9944)   MXNet
MinGW-w64 build error // (indirectly)
* #9868 (https://github.com/apache/incubator-mxnet/issues/9868)   MKL and
CMake
* #9516 (https://github.com/apache/incubator-mxnet/issues/9516)   cmake
cuda arch issues
* #9105 (https://github.com/apache/incubator-mxnet/issues/9105)
 libmxnet.so load path error
* #9096 (https://github.com/apache/incubator-mxnet/issues/9096)   MXNet
built with GPerftools crashes
* #8786 (https://github.com/apache/incubator-mxnet/issues/8786)   Link
failure on DEBUG=1 (static member symbol not defined) // (indirectly)
* #8729 (https://github.com/apache/incubator-mxnet/issues/8729)   Build
amalgamation using a docker // (indirectly)
* #8667 (https://github.com/apache/incubator-mxnet/issues/8667)
 Compiler/linker error while trying to build from source on Mac OSX Sierra
10.12.6
* #8295 (https://github.com/apache/incubator-mxnet/issues/8295)   Building
with cmake - error
* #7852 (https://github.com/apache/incubator-mxnet/issues/7852)   Trouble
installing MXNet on Raspberry Pi 3
* #13303