[GitHub] [incubator-mxnet] sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'

2020-04-07 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610596656
 
 
   @ptrendx I'm using a compiled version of master. Are you able to reproduce it using the script I attached at the beginning of the issue?
   ```bash
   wget https://gist.githubusercontent.com/sxjscience/0bd336c921396b3c66331354e1866886/raw/d618ba69cbecf04d3013db77af86c29d62fe0336/grad_req_addto_bug.py -O grad_req_addto_bug.py
   python grad_req_addto_bug.py --addto
   python grad_req_addto_bug.py
   ```
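   (The two runs are expected to print identical gradient norms; with the bug present, the `--addto` run prints different values, as the logs further down the thread show.)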



[GitHub] [incubator-mxnet] sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'

2020-04-07 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610595164
 
 
   @ptrendx @zhreshold @szha I tried to run with MXNet==1.0.0 but it gave me another error. The earliest version I can confirm has this issue is 1.2.0. This is really critical and impacts the very basic functionality of a DL framework, i.e., autograd.
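   For reference, a minimal sketch of the semantics that `grad_req='add'` is supposed to provide (illustrative code, not from the gist): gradients from successive backward passes should accumulate into `a.grad` instead of overwriting it.
   ```python
   import mxnet as mx

   a = mx.nd.array([1.0])
   a.attach_grad(grad_req='add')  # accumulate gradients instead of overwriting
   a.grad[:] = 0                  # the accumulation buffer must be zeroed manually
   for _ in range(2):
       with mx.autograd.record():
           y = 2 * a
       y.backward()
   # dy/da = 2 per pass, so two accumulated passes should give [4.]
   print(a.grad)
   ```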



[GitHub] [incubator-mxnet] sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'

2020-04-07 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610591835
 
 
   @ptrendx @szha @zhreshold I find that the bug also exists in 1.5.0, 1.4.0, 1.3.1, and 1.2.1. In fact, the results on both CPU and GPU are wrong in these versions. A script to reproduce it is given below (I used the legacy mx.nd API).
   
   ```python
   import mxnet as mx
   import numpy as np


   for ctx in [mx.cpu(), mx.gpu()]:
       for nrepeat in range(1, 10):
           stored_grad = dict()
           for grad_req in ['write', 'add']:
               a = mx.nd.array([1], ctx=ctx)
               b = mx.nd.array([2], ctx=ctx)
               if grad_req == 'write':
                   a.attach_grad(grad_req='write')
               elif grad_req == 'add':
                   a.attach_grad(grad_req='add')
                   a.grad[:] = 0  # zero the accumulation buffer before recording
               with mx.autograd.record():
                   for _ in range(nrepeat):
                       b = b * a
               b.backward()
               stored_grad[grad_req] = a.grad.asscalar()
           print('ctx={}, nrepeat={}, write={}, add={}'.format(
               ctx, nrepeat, stored_grad['write'], stored_grad['add']))
   ```
   
   For MXNet 1.5.0, I used `pip install mxnet-cu101==1.5.0`
   For MXNet 1.4.0, I used `pip install mxnet-cu92==1.4.0`
   For MXNet 1.3.1, I used `pip install mxnet-cu92==1.3.1`
   For MXNet 1.2.1, I used `pip install mxnet-cu92==1.2.1`
   
   Output:
   ```
   ctx=cpu(0), nrepeat=1, write=2.0, add=2.0
   ctx=cpu(0), nrepeat=2, write=4.0, add=4.0
   ctx=cpu(0), nrepeat=3, write=6.0, add=6.0
   ctx=cpu(0), nrepeat=4, write=8.0, add=8.0
   ctx=cpu(0), nrepeat=5, write=10.0, add=62.0
   ctx=cpu(0), nrepeat=6, write=12.0, add=126.0
   ctx=cpu(0), nrepeat=7, write=14.0, add=254.0
   ctx=cpu(0), nrepeat=8, write=16.0, add=16.0
   ctx=cpu(0), nrepeat=9, write=18.0, add=18.0
   ctx=gpu(0), nrepeat=1, write=2.0, add=2.0
   ctx=gpu(0), nrepeat=2, write=4.0, add=4.0
   ctx=gpu(0), nrepeat=3, write=6.0, add=6.0
   ctx=gpu(0), nrepeat=4, write=8.0, add=8.0
   ctx=gpu(0), nrepeat=5, write=10.0, add=62.0
   ctx=gpu(0), nrepeat=6, write=12.0, add=126.0
   ctx=gpu(0), nrepeat=7, write=14.0, add=254.0
   ctx=gpu(0), nrepeat=8, write=16.0, add=16.0
   ctx=gpu(0), nrepeat=9, write=18.0, add=18.0
   ```
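   A sanity check on the numbers (my own arithmetic, not from the script): for `b = b * a` repeated `n` times with `a = 1` and initial `b = 2`, the true gradient is `d(2 * a^n)/da = 2n`, which matches the `write` column. The wrong `add` values for `n = 5, 6, 7` are `62, 126, 254`, i.e. `2^(n+1) - 2 = 2 + 4 + ... + 2^n`, as if partial products from the forward pass were being summed into the gradient, though that reading is only a guess from the numbers.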



[GitHub] [incubator-mxnet] sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'

2020-04-07 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610220167
 
 
   @eric-haibin-lin @szha @szhengac @zhreshold This is the worst problem I've found and it impacts all models with `grad_req='add'`.



[GitHub] [incubator-mxnet] sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'

2020-04-07 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610219553
 
 
   Just verified that there is no problem when `ctx=mx.cpu()`. Also, I've found a simpler script that reproduces the problem:
   
   ```python
   import mxnet as mx
   import numpy as np
   mx.npx.set_np()


   for ctx in [mx.cpu(), mx.gpu()]:
       for nrepeat in range(1, 10):
           stored_grad = dict()
           for grad_req in ['write', 'add']:
               a = mx.np.array(1, ctx=ctx)
               b = mx.np.array(2, ctx=ctx)
               if grad_req == 'write':
                   a.attach_grad(grad_req='write')
               elif grad_req == 'add':
                   a.attach_grad(grad_req='add')
                   a.grad[()] = 0  # zero the accumulation buffer before recording
               with mx.autograd.record():
                   for _ in range(nrepeat):
                       b = b * a
               b.backward()
               stored_grad[grad_req] = a.grad.asnumpy()
           print('ctx={}, nrepeat={}, write={}, add={}'.format(
               ctx, nrepeat, stored_grad['write'], stored_grad['add']))
   ```
   
   Result:
   ```
   ctx=cpu(0), nrepeat=1, write=2.0, add=2.0
   ctx=cpu(0), nrepeat=2, write=4.0, add=4.0
   ctx=cpu(0), nrepeat=3, write=6.0, add=6.0
   ctx=cpu(0), nrepeat=4, write=8.0, add=8.0
   ctx=cpu(0), nrepeat=5, write=10.0, add=10.0
   ctx=cpu(0), nrepeat=6, write=12.0, add=12.0
   ctx=cpu(0), nrepeat=7, write=14.0, add=14.0
   ctx=cpu(0), nrepeat=8, write=16.0, add=16.0
   ctx=cpu(0), nrepeat=9, write=18.0, add=18.0
   ctx=gpu(0), nrepeat=1, write=2.0, add=2.0
   ctx=gpu(0), nrepeat=2, write=4.0, add=4.0
   ctx=gpu(0), nrepeat=3, write=6.0, add=6.0
   ctx=gpu(0), nrepeat=4, write=8.0, add=8.0
   ctx=gpu(0), nrepeat=5, write=10.0, add=62.0
   ctx=gpu(0), nrepeat=6, write=12.0, add=126.0
   ctx=gpu(0), nrepeat=7, write=14.0, add=254.0
   ctx=gpu(0), nrepeat=8, write=16.0, add=16.0
   ctx=gpu(0), nrepeat=9, write=18.0, add=18.0
   ```



[GitHub] [incubator-mxnet] sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'

2020-04-07 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610213687
 
 
   I discovered this bug when trying different configurations of ALBERT. In the original ALBERT, the number of layers is 12 or 24; neither of those triggers the bug, so it took me some time to isolate the issue.
   
   ```
   ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --addto --nrepeat 12
   
   foo0_dense0_weight 5.412447e-06
   foo0_dense0_bias 19.946749
   ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --nrepeat 12
   foo0_dense0_weight 5.412447e-06
   foo0_dense0_bias 19.946749
   ```
   
   ```
   ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --addto --nrepeat 24
   
   foo0_dense0_weight 5.706055e-14
   foo0_dense0_bias 19.946749
   ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --nrepeat 24
   foo0_dense0_weight 5.706055e-14
   ```
   
   Also, the bug occurs in the hybridized case.
   
   ```
   ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --addto --nrepeat 5 --hybridize
   
   foo0_dense0_weight 1.7802463
   foo0_dense0_bias 315.05945
   ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --nrepeat 5 --hybridize
   foo0_dense0_weight 0.19310227
   foo0_dense0_bias 19.947094
   ```
   
   It also appears in the legacy `mx.nd` interface:
   
   ```python
   import mxnet as mx
   from mxnet.gluon import nn, HybridBlock
   import numpy as np
   import argparse

   np.random.seed(123)
   mx.random.seed(123)


   parser = argparse.ArgumentParser(
       description='Grad req bug minimal example')
   parser.add_argument('--addto', action='store_true')
   parser.add_argument('--hybridize', action='store_true')
   parser.add_argument('--nrepeat', type=int, default=5)
   args = parser.parse_args()


   class Foo(HybridBlock):
       def __init__(self, prefix=None, params=None):
           super().__init__(prefix=prefix, params=params)
           with self.name_scope():
               self.layer = nn.Dense(16)

       def hybrid_forward(self, F, dat):
           # Reuse the same Dense layer nrepeat times.
           out = dat
           for _ in range(args.nrepeat):
               out = self.layer(out)
           return out


   foo = Foo()
   if args.hybridize:
       foo.hybridize()
   foo.initialize(ctx=mx.gpu())

   if args.addto:
       for p in foo.collect_params().values():
           p.grad_req = 'add'


   dat = mx.nd.random.normal(0, 1, (32, 16), ctx=mx.gpu())
   og = mx.nd.random.normal(0, 1, (32, 16), ctx=mx.gpu())
   with mx.autograd.record():
       out = foo(dat)
       loss = (out * og).sum()
   loss.backward()
   for k, v in foo.collect_params().items():
       print(k, mx.nd.norm(v.grad()))
   ```
   
   ```bash
   ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug_nd.py --nrepeat 5 --hybridize
   foo0_dense0_weight 
   [0.16300175]
   
   foo0_dense0_bias 
   [27.344622]
   
   ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug_nd.py --nrepeat 5 --hybridize --addto
   foo0_dense0_weight 
   [1.3425881]
   
   foo0_dense0_bias 
   [424.70026]
   ```
   



[GitHub] [incubator-mxnet] sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'

2020-04-07 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610208876
 
 
   Also, adding zero_grad before `mx.autograd.record()` doesn't solve this problem. I've revised the script in the gist and you may try the new version:
   ```bash
   wget https://gist.githubusercontent.com/sxjscience/0bd336c921396b3c66331354e1866886/raw/d618ba69cbecf04d3013db77af86c29d62fe0336/grad_req_addto_bug.py -O grad_req_addto_bug.py
   python grad_req_addto_bug.py --addto
   python grad_req_addto_bug.py
   ```
   
   
   ```log
   ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --addto
   
   foo0_dense0_weight 1.7802463
   foo0_dense0_bias 315.05945
   ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py
   foo0_dense0_weight 0.19310227
   foo0_dense0_bias 19.947094
   ```
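   For clarity, the zero_grad attempt described above looks roughly like this (a sketch reusing `foo`, `dat`, and `og` from the earlier script; `Parameter.zero_grad()` resets the gradient buffers to zero):
   ```python
   for p in foo.collect_params().values():
       p.zero_grad()  # reset the accumulated gradient buffers to zero
   with mx.autograd.record():
       out = foo(dat)
       loss = (out * og).sum()
   loss.backward()  # the accumulated gradients still come out wrong
   ```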



[GitHub] [incubator-mxnet] sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'

2020-04-07 Thread GitBox
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610206398
 
 
   Also, a deeper dive into the problem shows that the issue appears when one layer is reused five or more times:
   ```bash
   wget https://gist.githubusercontent.com/sxjscience/0bd336c921396b3c66331354e1866886/raw/80a428980fd91110455e847c1a02aef4ae2cba7f/grad_req_addto_bug.py -O grad_req_addto_bug.py
   for nrepeat in 1 2 3 4 5 6 7 8 9 10
   do
       echo "nrepeat=${nrepeat}"
       echo "with addto"
       python grad_req_addto_bug.py --addto --nrepeat ${nrepeat}
       echo "without addto"
       python grad_req_addto_bug.py --nrepeat ${nrepeat}
   done
   ```
   Result:
   ```
   nrepeat=1
   with addto
   foo0_dense0_weight 86.363464
   foo0_dense0_bias 19.548544
   without addto
   foo0_dense0_weight 86.363464
   foo0_dense0_bias 19.548544
   nrepeat=2
   with addto
   foo0_dense0_weight 21.480562
   foo0_dense0_bias 19.870453
   without addto
   foo0_dense0_weight 21.480562
   foo0_dense0_bias 19.870453
   nrepeat=3
   with addto
   foo0_dense0_weight 4.64952
   foo0_dense0_bias 19.938385
   without addto
   foo0_dense0_weight 4.64952
   foo0_dense0_bias 19.938385
   nrepeat=4
   with addto
   foo0_dense0_weight 0.94337225
   foo0_dense0_bias 19.947392
   without addto
   foo0_dense0_weight 0.94337225
   foo0_dense0_bias 19.947392
   nrepeat=5
   with addto
   foo0_dense0_weight 1.7802463
   foo0_dense0_bias 315.05945
   without addto
   foo0_dense0_weight 0.19310227
   foo0_dense0_bias 19.947094
   nrepeat=6
   with addto
   foo0_dense0_weight 0.6738244
   foo0_dense0_bias 630.11865
   without addto
   foo0_dense0_weight 0.041728128
   foo0_dense0_bias 19.946844
   nrepeat=7
   with addto
   foo0_dense0_weight 0.26325437
   foo0_dense0_bias 1260.2372
   without addto
   foo0_dense0_weight 0.009131842
   foo0_dense0_bias 19.946758
   nrepeat=8
   with addto
   foo0_dense0_weight 0.0020059107
   foo0_dense0_bias 19.946749
   without addto
   foo0_dense0_weight 0.0020059107
   foo0_dense0_bias 19.946749
   nrepeat=9
   with addto
   foo0_dense0_weight 0.00045126013
   foo0_dense0_bias 19.946749
   without addto
   foo0_dense0_weight 0.00045126013
   foo0_dense0_bias 19.946749
   nrepeat=10
   with addto
   foo0_dense0_weight 0.00010413639
   foo0_dense0_bias 19.946749
   without addto
   foo0_dense0_weight 0.00010413639
   foo0_dense0_bias 19.946749
   ```
   
   This shows that the result is only wrong when `nrepeat` is 5, 6, or 7.
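   A sketch of an automated agreement check over the same sweep (my own illustrative code, mirroring the scalar repro earlier in the thread; after zeroing, the two modes should always match):
   ```python
   import mxnet as mx

   def grad_for(mode, nrepeat, ctx):
       a = mx.nd.array([1], ctx=ctx)
       b = mx.nd.array([2], ctx=ctx)
       a.attach_grad(grad_req=mode)
       if mode == 'add':
           a.grad[:] = 0  # accumulation buffer starts from zero
       with mx.autograd.record():
           for _ in range(nrepeat):
               b = b * a
       b.backward()
       return a.grad.asscalar()

   for n in range(1, 11):
       w = grad_for('write', n, mx.gpu())
       ad = grad_for('add', n, mx.gpu())
       assert w == ad, 'mismatch at nrepeat={}: write={}, add={}'.format(n, w, ad)
   ```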

