Sure, the cuda-memcheck log is full of this:

========= CUDA-MEMCHECK
========= Invalid __global__ write of size 4
=========     at 0x00000310 in cudnn::maxwell::gemm::computeOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)
=========     by thread (95,0,0) in block (63,0,0)
=========     Address 0x2316ffa5bc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x204205]
=========     Host Frame:/usr/local/cuda/lib64/libcudnn.so.5 [0x4ca501]
=========     Host Frame:/usr/local/cuda/lib64/libcudnn.so.5 [0x4e68d3]
=========     Host Frame:/usr/local/cuda/lib64/libcudnn.so.5 [0xf959e]
=========     Host Frame:/usr/local/cuda/lib64/libcudnn.so.5 [0xa6883]
=========     Host Frame:/usr/local/cuda/lib64/libcudnn.so.5 (cudnnConvolutionForward + 0x7a9) [0x3a4b9]
=========     Host Frame:/home/facenx/.theano/compiledir_Linux-4.4--generic-x86_64-with-Ubuntu-16.04-xenial-x86_64-3.5.2-64/tmpofz7z19g/m79b38ce26ae216596dbaccfe67469d8b.so [0x2b9b]
=========     Host Frame:/home/facenx/.theano/compiledir_Linux-4.4--generic-x86_64-with-Ubuntu-16.04-xenial-x86_64-3.5.2-64/lazylinker_ext/lazylinker_ext.so [0x3d5c]
=========     Host Frame:/home/facenx/.theano/compiledir_Linux-4.4--generic-x86_64-with-Ubuntu-16.04-xenial-x86_64-3.5.2-64/lazylinker_ext/lazylinker_ext.so [0x47c7]

There are about 1K of such messages, with the out-of-bounds address increasing by 4 each time.

Regards,
Sergey.
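A minimal sketch of one way to localize the faulting kernel, for reference: forcing synchronous kernel launches makes an illegal access surface at the op that actually launched the kernel, instead of at a later, unrelated call (the traceback below blames an elemwise call). CUDA_LAUNCH_BLOCKING is a standard CUDA environment variable; setting it from Python before the theano import is an assumption about how the script is structured, not something taken from this thread.

import os

# Force every kernel launch to block until the kernel completes, so the
# invalid write is reported at its true launch site. This must run before
# the CUDA context is created, i.e. before importing theano.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import theano  # the dev1/dev2 contexts are initialized here via THEANO_FLAGS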
On Saturday, April 8, 2017 at 3:45:10 AM UTC+3, Pascal Lamblin wrote:
>
> Could you try running the whole thing inside cuda-memcheck?
>
> On Wed, Apr 05, 2017, Sergey Ovcharenko wrote:
> > Hi,
> >
> > I'm struggling to get a Theano graph spread over two GPUs working, but I
> > keep encountering the GpuArrayException: b'an illegal memory access was
> > encountered' error (the full traceback is at the end of this email).
> > The basic idea is to do a forward pass through two neural networks, each
> > located on a separate device, and combine the outputs.
> >
> > I'm using the latest Theano, libgpuarray and Lasagne to build the
> > networks, and have hacked Lasagne a bit to be able to pass
> > target='device' to the shared variable constructor during weights
> > initialization.
> >
> > I have THEANO_FLAGS="contexts=dev1->cuda1;dev2->cuda2" and the output
> > after the theano import is:
> >
> > Using cuDNN version 5005 on context None
> > Mapped name None to device cuda: GeForce GTX 980 (0000:0A:00.0)
> > Using cuDNN version 5005 on context dev1
> > Mapped name dev1 to device cuda1: GeForce GTX 980 (0000:09:00.0)
> > Using cuDNN version 5005 on context dev2
> > Mapped name dev2 to device cuda2: GeForce GTX 980 (0000:06:00.0)
> >
> > The network definitions are quite lengthy (and the problem doesn't
> > always reproduce on toy graphs), so I'm providing a simplified example
> > of what I'm doing:
> >
> > inp_0 = T.tensor4('inp0')
> > r0 = build_model('dev1', input_var=inp_0)
> > inp_1 = T.tensor4('inp1')
> > r1 = build_model('dev2', input_var=inp_1)
> >
> > r0_out = lasagne.layers.get_output(r0['fc6'], deterministic=False)
> > r1_out = lasagne.layers.get_output(r1['fc6'], deterministic=False)
> >
> > train_r0 = theano.function(
> >     [inp_0, inp_1],
> >     [r0_out, r1_out]
> > )
> >
> > result0 = train_r0(x, x2)
> >
> > This code fails with the aforementioned error.
> >
> > I've also tried compiling a separate function for each of the networks,
> > like:
> >
> > train_r0 = theano.function(
> >     [inp_0],
> >     [r0_out]
> > )
> >
> > train_r1 = theano.function(
> >     [inp_1],
> >     [r1_out]
> > )
> >
> > Running either train_r0 or train_r1 then fails as well. But compiling
> > and running a single function (whether train_r0 or train_r1) works just
> > fine. Could someone help me debug this? Please let me know if I should
> > provide additional code/info.
> >
> > Thanks,
> > Sergey.
> >
> > The full traceback:
> >
> > RuntimeError                              Traceback (most recent call last)
> > /home/facenx/.virtualenvs/multitheano/lib/python3.5/site-packages/theano/compile/function_module.py in __call__(self, *args, **kwargs)
> >     883                 outputs =\
> > --> 884                     self.fn() if output_subset is None else\
> >     885                     self.fn(output_subset=output_subset)
> >
> > RuntimeError: Error in the elemwise call
> >
> > During handling of the above exception, another exception occurred:
> >
> > GpuArrayException                         Traceback (most recent call last)
> > <ipython-input-11-902c3b4617f7> in <module>()
> > ----> 1 result0 = train_r0(x, x2)
> >       2 #result1 = train_r1(x2)
> >
> > /home/facenx/.virtualenvs/multitheano/lib/python3.5/site-packages/theano/compile/function_module.py in __call__(self, *args, **kwargs)
> >     896                         node=self.fn.nodes[self.fn.position_of_error],
> >     897                         thunk=thunk,
> > --> 898                         storage_map=getattr(self.fn, 'storage_map', None))
> >     899                 else:
> >     900                     # old-style linkers raise their own exceptions
> >
> > /home/facenx/.virtualenvs/multitheano/lib/python3.5/site-packages/theano/gof/link.py in raise_with_op(node, thunk, exc_info, storage_map)
> >     139
> >     140     hints = []
> > --> 141     detailed_err_msg = "\nApply node that caused the error: " + str(node)
> >     142     if exc_value.__applynode_index__ is not None:
> >     143         detailed_err_msg += "\nToposort index: %d" % node_index
> >
> > /home/facenx/.virtualenvs/multitheano/lib/python3.5/site-packages/theano/gof/graph.py in __str__(self)
> >     178
> >     179     def __str__(self):
> > --> 180         return op_as_string(self.inputs, self)
> >     181
> >     182     def __repr__(self):
> >
> > /home/facenx/.virtualenvs/multitheano/lib/python3.5/site-packages/theano/gof/graph.py in op_as_string(i, op, leaf_formatter, node_formatter)
> >    1256     between i and o
> >    1257     """
> > -> 1258     strs = as_string(i, op.inputs, leaf_formatter, node_formatter)
> >    1259     return node_formatter(op, strs)
> >    1260
> >
> > /home/facenx/.virtualenvs/multitheano/lib/python3.5/site-packages/theano/gof/graph.py in as_string(i, o, leaf_formatter, node_formatter)
> >    1336         return leaf_formatter(r)
> >    1337
> > -> 1338     return [describe(output) for output in o]
> >    1339
> >    1340
> >
> > /home/facenx/.virtualenvs/multitheano/lib/python3.5/site-packages/theano/gof/graph.py in <listcomp>(.0)
> >    1336         return leaf_formatter(r)
> >    1337
> > -> 1338     return [describe(output) for output in o]
> >    1339
> >    1340
> >
> > /home/facenx/.virtualenvs/multitheano/lib/python3.5/site-packages/theano/gof/graph.py in describe(r)
> >    1334             return s
> >    1335         else:
> > -> 1336             return leaf_formatter(r)
> >    1337
> >    1338     return [describe(output) for output in o]
> >
> > /home/facenx/.virtualenvs/multitheano/lib/python3.5/site-packages/theano/gpuarray/type.py in __str__(self)
> >     604         except gpuarray.GpuArrayException:
> >     605             np_data = self.data
> > --> 606         return "GpuArrayConstant{%s}" % np_data
> >     607
> >     608
> >
> > pygpu/gpuarray.pyx in pygpu.gpuarray.GpuArray.__str__ (pygpu/gpuarray.c:28703)()
> >
> > /home/facenx/.virtualenvs/multitheano/lib/python3.5/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
> >     529
> >     530     """
> > --> 531     return array(a, dtype, copy=False, order=order)
> >     532
> >     533
> >
> > pygpu/gpuarray.pyx in pygpu.gpuarray.GpuArray.__array__ (pygpu/gpuarray.c:21616)()
> >
> > pygpu/gpuarray.pyx in pygpu.gpuarray._pygpu_as_ndarray (pygpu/gpuarray.c:18322)()
> >
> > pygpu/gpuarray.pyx in pygpu.gpuarray.array_read (pygpu/gpuarray.c:6923)()
> > GpuArrayException: b'an illegal memory access was encountered'
>
> --
> Pascal
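For reference, the documented libgpuarray multi-GPU pattern that the quoted message approximates (shared variables pinned to named contexts via target=, one compiled function computing on both devices) looks roughly like the minimal sketch below. It assumes the same THEANO_FLAGS mapping quoted above (contexts=dev1->cuda1;dev2->cuda2); the variable names and shapes are invented for illustration.

import numpy
import theano
import theano.tensor as T

# Weights pinned to each named context; this target= keyword is the
# documented multi-GPU API that the Lasagne hack in the thread emulates.
w1 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                   target='dev1')
w2 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                   target='dev2')

x1 = T.matrix('x1')
x2 = T.matrix('x2')

# Each dot product runs on the context that owns its weights; since the
# outputs are ordinary TensorType results, each one is transferred back
# to the host when the function returns.
f = theano.function([x1, x2], [T.dot(x1, w1), T.dot(x2, w2)])

out1, out2 = f(numpy.ones((8, 1024), dtype='float32'),
               numpy.ones((8, 1024), dtype='float32'))

If the two outputs need to be combined on one device rather than on the host, recent Theano versions also expose a transfer method on variables (e.g. something like r0_out.transfer('dev2')), but combining on the CPU as above is the simplest way to rule an inter-GPU copy out while debugging.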