[GitHub] [incubator-mxnet] sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'
sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610596656

@ptrendx I'm using a compiled version of master. Are you able to reproduce it using the script I attached at the beginning of the issue?

```bash
wget https://gist.githubusercontent.com/sxjscience/0bd336c921396b3c66331354e1866886/raw/d618ba69cbecf04d3013db77af86c29d62fe0336/grad_req_addto_bug.py -O grad_req_addto_bug.py
python grad_req_addto_bug.py --addto
python grad_req_addto_bug.py
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610595164

@ptrendx @zhreshold @szha I tried to run with MXNet==1.0.0, but it gives me a different error. The earliest version in which I can confirm this issue is 1.2.0. This is really critical and impacts the very basic functionality of a DL framework, i.e., autograd.
sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610591835

@ptrendx @szha @zhreshold I find that the bug also exists in 1.5.0, 1.4.0, 1.3.1, and 1.2.1. In fact, the results on both CPU and GPU are wrong in these versions. A reproducible script is given as follows (I used the legacy mx.nd):

```python
import mxnet as mx
import numpy as np

for ctx in [mx.cpu(), mx.gpu()]:
    for nrepeat in range(1, 10):
        stored_grad = dict()
        for grad_req in ['write', 'add']:
            a = mx.nd.array([1], ctx=ctx)
            b = mx.nd.array([2], ctx=ctx)
            if grad_req == 'write':
                a.attach_grad(grad_req='write')
            elif grad_req == 'add':
                a.attach_grad(grad_req='add')
                a.grad[:] = 0
            with mx.autograd.record():
                for _ in range(nrepeat):
                    b = b * a
            b.backward()
            stored_grad[grad_req] = a.grad.asscalar()
        print('ctx={}, nrepeat={}, write={}, add={}'.format(
            ctx, nrepeat, stored_grad['write'], stored_grad['add']))
```

For MXNet 1.5.0, I used `pip install mxnet-cu101==1.5.0`.
For MXNet 1.4.0, I used `pip install mxnet-cu92==1.4.0`.
For MXNet 1.3.1, I used `pip install mxnet-cu92==1.3.1`.
For MXNet 1.2.1, I used `pip install mxnet-cu92==1.2.1`.

Output:

```
ctx=cpu(0), nrepeat=1, write=2.0, add=2.0
ctx=cpu(0), nrepeat=2, write=4.0, add=4.0
ctx=cpu(0), nrepeat=3, write=6.0, add=6.0
ctx=cpu(0), nrepeat=4, write=8.0, add=8.0
ctx=cpu(0), nrepeat=5, write=10.0, add=62.0
ctx=cpu(0), nrepeat=6, write=12.0, add=126.0
ctx=cpu(0), nrepeat=7, write=14.0, add=254.0
ctx=cpu(0), nrepeat=8, write=16.0, add=16.0
ctx=cpu(0), nrepeat=9, write=18.0, add=18.0
ctx=gpu(0), nrepeat=1, write=2.0, add=2.0
ctx=gpu(0), nrepeat=2, write=4.0, add=4.0
ctx=gpu(0), nrepeat=3, write=6.0, add=6.0
ctx=gpu(0), nrepeat=4, write=8.0, add=8.0
ctx=gpu(0), nrepeat=5, write=10.0, add=62.0
ctx=gpu(0), nrepeat=6, write=12.0, add=126.0
ctx=gpu(0), nrepeat=7, write=14.0, add=254.0
ctx=gpu(0), nrepeat=8, write=16.0, add=16.0
ctx=gpu(0), nrepeat=9, write=18.0, add=18.0
```
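As a sanity check on the numbers above (my addition, not part of the original comment): the repro computes `f(a) = b0 * a**nrepeat` with `b0 = 2`, so the analytic gradient is `nrepeat * b0 * a**(nrepeat - 1)`, which equals `2 * nrepeat` at `a = 1`. A plain-NumPy-free sketch confirming this against a central finite difference:

```python
# Sanity check (my addition): for f(a) = b0 * a**n, df/da = n * b0 * a**(n-1).
# At a = 1, b0 = 2 this is 2 * n for every nrepeat, matching the 'write'
# column above -- so the 'add' values 62/126/254 at nrepeat = 5, 6, 7 are wrong.
def analytic_grad(a, b0, n):
    return n * b0 * a ** (n - 1)

def numeric_grad(a, b0, n, eps=1e-6):
    # Central finite difference of f(a) = b0 * a**n.
    f = lambda x: b0 * x ** n
    return (f(a + eps) - f(a - eps)) / (2 * eps)

for n in range(1, 10):
    expected = analytic_grad(1.0, 2.0, n)
    assert expected == 2.0 * n
    assert abs(expected - numeric_grad(1.0, 2.0, n)) < 1e-3
```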
sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610220167

@eric-haibin-lin @szha @szhengac @zhreshold This is the most serious problem I've found, and it impacts all models trained with `grad_req='add'`.
sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610219553

Just verified that there is no problem when `ctx=mx.cpu()`. Also, I've found a simpler script to reproduce the problem:

```python
import mxnet as mx
import numpy as np

mx.npx.set_np()

for ctx in [mx.cpu(), mx.gpu()]:
    for nrepeat in range(1, 10):
        stored_grad = dict()
        for grad_req in ['write', 'add']:
            a = mx.np.array(1, ctx=ctx)
            b = mx.np.array(2, ctx=ctx)
            if grad_req == 'write':
                a.attach_grad(grad_req='write')
            elif grad_req == 'add':
                a.attach_grad(grad_req='add')
                a.grad[()] = 0
            with mx.autograd.record():
                for _ in range(nrepeat):
                    b = b * a
            b.backward()
            stored_grad[grad_req] = a.grad.asnumpy()
        print('ctx={}, nrepeat={}, write={}, add={}'.format(
            ctx, nrepeat, stored_grad['write'], stored_grad['add']))
```

Result:

```
ctx=cpu(0), nrepeat=1, write=2.0, add=2.0
ctx=cpu(0), nrepeat=2, write=4.0, add=4.0
ctx=cpu(0), nrepeat=3, write=6.0, add=6.0
ctx=cpu(0), nrepeat=4, write=8.0, add=8.0
ctx=cpu(0), nrepeat=5, write=10.0, add=10.0
ctx=cpu(0), nrepeat=6, write=12.0, add=12.0
ctx=cpu(0), nrepeat=7, write=14.0, add=14.0
ctx=cpu(0), nrepeat=8, write=16.0, add=16.0
ctx=cpu(0), nrepeat=9, write=18.0, add=18.0
ctx=gpu(0), nrepeat=1, write=2.0, add=2.0
ctx=gpu(0), nrepeat=2, write=4.0, add=4.0
ctx=gpu(0), nrepeat=3, write=6.0, add=6.0
ctx=gpu(0), nrepeat=4, write=8.0, add=8.0
ctx=gpu(0), nrepeat=5, write=10.0, add=62.0
ctx=gpu(0), nrepeat=6, write=12.0, add=126.0
ctx=gpu(0), nrepeat=7, write=14.0, add=254.0
ctx=gpu(0), nrepeat=8, write=16.0, add=16.0
ctx=gpu(0), nrepeat=9, write=18.0, add=18.0
```
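One arithmetic pattern in the wrong values may be worth noting (my observation, not a confirmed diagnosis): the correct gradient in this scalar repro is `2 * nrepeat`, but the buggy `add` results at nrepeat = 5, 6, 7 are 62, 126, 254, i.e. `2**(nrepeat + 1) - 2`, the sum of all powers of two from 2 up to `2**nrepeat`:

```python
# Observation only (not a confirmed root cause): the wrong 'add' values
# equal the geometric sum 2 + 4 + ... + 2**n = 2**(n + 1) - 2, rather than
# the correct gradient 2 * n.
for n, wrong in [(5, 62.0), (6, 126.0), (7, 254.0)]:
    assert 2 ** (n + 1) - 2 == wrong                       # the buggy value
    assert sum(2 ** k for k in range(1, n + 1)) == wrong   # same as the geometric sum
    assert 2 * n != wrong                                  # not the correct gradient
```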
sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610213687

I discovered this bug when trying different hyperparameters for ALBERT. In the original ALBERT, the number of layers is 12 or 24. Neither of these triggers the bug, so it took me some time to localize the issue.

```
ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --addto --nrepeat 12
foo0_dense0_weight 5.412447e-06
foo0_dense0_bias 19.946749
ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --nrepeat 12
foo0_dense0_weight 5.412447e-06
foo0_dense0_bias 19.946749
```

```
ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --addto --nrepeat 24
foo0_dense0_weight 5.706055e-14
foo0_dense0_bias 19.946749
ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --nrepeat 24
foo0_dense0_weight 5.706055e-14
```

Also, the bug occurs in the hybridized case:

```
ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --addto --nrepeat 5 --hybridize
foo0_dense0_weight 1.7802463
foo0_dense0_bias 315.05945
ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --nrepeat 5 --hybridize
foo0_dense0_weight 0.19310227
foo0_dense0_bias 19.947094
```

It also appears in the legacy `mx.nd` interface:

```python
import mxnet as mx
from mxnet.gluon import nn, HybridBlock
import numpy as np
import argparse

np.random.seed(123)
mx.random.seed(123)

parser = argparse.ArgumentParser(description='Grad req bug minimal example')
parser.add_argument('--addto', action='store_true')
parser.add_argument('--hybridize', action='store_true')
parser.add_argument('--nrepeat', type=int, default=5)
args = parser.parse_args()

class Foo(HybridBlock):
    def __init__(self, prefix=None, params=None):
        super().__init__(prefix=prefix, params=params)
        with self.name_scope():
            self.layer = nn.Dense(16)

    def hybrid_forward(self, F, dat):
        out = dat
        for _ in range(args.nrepeat):
            out = self.layer(out)
        return out

foo = Foo()
if args.hybridize:
    foo.hybridize()
foo.initialize(ctx=mx.gpu())
if args.addto:
    for p in foo.collect_params().values():
        p.grad_req = 'add'
dat = mx.nd.random.normal(0, 1, (32, 16), ctx=mx.gpu())
og = mx.nd.random.normal(0, 1, (32, 16), ctx=mx.gpu())
with mx.autograd.record():
    out = foo(dat)
    loss = (out * og).sum()
loss.backward()
for k, v in foo.collect_params().items():
    print(k, mx.nd.norm(v.grad()))
```

```bash
ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug_nd.py --nrepeat 5 --hybridize
foo0_dense0_weight [0.16300175]
foo0_dense0_bias [27.344622]
ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug_nd.py --nrepeat 5 --hybridize --addto
foo0_dense0_weight [1.3425881]
foo0_dense0_bias [424.70026]
```
sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610208876

Also, calling zero_grad before `mx.autograd.record()` does not solve this problem. I've revised the script in the gist and you may try the new version:

```bash
wget https://gist.githubusercontent.com/sxjscience/0bd336c921396b3c66331354e1866886/raw/d618ba69cbecf04d3013db77af86c29d62fe0336/grad_req_addto_bug.py -O grad_req_addto_bug.py
python grad_req_addto_bug.py --addto
python grad_req_addto_bug.py
```

```log
ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py --addto
foo0_dense0_weight 1.7802463
foo0_dense0_bias 315.05945
ubuntu@ip-172-31-27-255:~$ python grad_req_addto_bug.py
foo0_dense0_weight 0.19310227
foo0_dense0_bias 19.947094
```
sxjscience commented on issue #17989: [Gradient Addto] Very serious bug of grad_req='add'
URL: https://github.com/apache/incubator-mxnet/issues/17989#issuecomment-610206398

Also, a deeper dive into the problem shows that the issue appears when one layer is reused >= 5 times:

```bash
wget https://gist.githubusercontent.com/sxjscience/0bd336c921396b3c66331354e1866886/raw/80a428980fd91110455e847c1a02aef4ae2cba7f/grad_req_addto_bug.py -O grad_req_addto_bug.py
for nrepeat in 1 2 3 4 5 6 7 8 9 10
do
    echo "nrepeat=${nrepeat}"
    echo "with addto"
    python grad_req_addto_bug.py --addto --nrepeat ${nrepeat}
    echo "without addto"
    python grad_req_addto_bug.py --nrepeat ${nrepeat}
done
```

Result:

```
nrepeat=1
with addto
foo0_dense0_weight 86.363464
foo0_dense0_bias 19.548544
without addto
foo0_dense0_weight 86.363464
foo0_dense0_bias 19.548544
nrepeat=2
with addto
foo0_dense0_weight 21.480562
foo0_dense0_bias 19.870453
without addto
foo0_dense0_weight 21.480562
foo0_dense0_bias 19.870453
nrepeat=3
with addto
foo0_dense0_weight 4.64952
foo0_dense0_bias 19.938385
without addto
foo0_dense0_weight 4.64952
foo0_dense0_bias 19.938385
nrepeat=4
with addto
foo0_dense0_weight 0.94337225
foo0_dense0_bias 19.947392
without addto
foo0_dense0_weight 0.94337225
foo0_dense0_bias 19.947392
nrepeat=5
with addto
foo0_dense0_weight 1.7802463
foo0_dense0_bias 315.05945
without addto
foo0_dense0_weight 0.19310227
foo0_dense0_bias 19.947094
nrepeat=6
with addto
foo0_dense0_weight 0.6738244
foo0_dense0_bias 630.11865
without addto
foo0_dense0_weight 0.041728128
foo0_dense0_bias 19.946844
nrepeat=7
with addto
foo0_dense0_weight 0.26325437
foo0_dense0_bias 1260.2372
without addto
foo0_dense0_weight 0.009131842
foo0_dense0_bias 19.946758
nrepeat=8
with addto
foo0_dense0_weight 0.0020059107
foo0_dense0_bias 19.946749
without addto
foo0_dense0_weight 0.0020059107
foo0_dense0_bias 19.946749
nrepeat=9
with addto
foo0_dense0_weight 0.00045126013
foo0_dense0_bias 19.946749
without addto
foo0_dense0_weight 0.00045126013
foo0_dense0_bias 19.946749
nrepeat=10
with addto
foo0_dense0_weight 0.00010413639
foo0_dense0_bias 19.946749
without addto
foo0_dense0_weight 0.00010413639
foo0_dense0_bias 19.946749
```

This shows that the result is only wrong when `nrepeat` is 5, 6, or 7.
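A quick arithmetic note on the norms in the sweep above (my observation, not from the original comments): in the broken range, each extra repeat doubles the `--addto` bias-gradient norm (roughly 315 -> 630 -> 1260), while the correct norm stays essentially flat around 19.95:

```python
# Doubling pattern in the buggy --addto bias norms (observation only).
wrong_bias = [315.05945, 630.11865, 1260.2372]   # nrepeat = 5, 6, 7 with --addto
right_bias = [19.947094, 19.946844, 19.946758]   # nrepeat = 5, 6, 7 without --addto

# Each wrong norm is (almost exactly) twice the previous one.
for a, b in zip(wrong_bias, wrong_bias[1:]):
    assert abs(b / a - 2.0) < 1e-3

# The correct norms barely change with nrepeat.
for r in right_bias:
    assert abs(r - 19.95) < 0.01
```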