On the GPU, not all indexing is fast. Slices are fast (they are just a view).
For advanced indexing, only this form has been well optimized:
a_tensor[a_vector_of_int]
From memory, the vector_of_int can index any of the dimensions, but it is
definitely supported on the first dimension.
We have code that
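As a minimal sketch of that optimized pattern (the shared variable, its shape,
and the function below are assumptions for illustration, not from the original
post):

    import numpy as np
    import theano
    import theano.tensor as T

    # Data held on the GPU as a float32 shared variable (assumed shape).
    data = theano.shared(np.random.rand(10000, 784).astype('float32'),
                         name='data')

    # Advanced indexing with an int vector on the first dimension:
    # this is the case that is well optimized on the GPU.
    idx = T.ivector('idx')
    minibatch = data[idx]

    # Reduce on the GPU so only a scalar is transferred back to the host.
    f = theano.function([idx], minibatch.sum())
    print(f(np.random.randint(0, 10000, size=128).astype('int32')))

With this form the whole minibatch is gathered in a single advanced-indexing
op on the GPU, rather than one transfer per index.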
Hi,
I am holding an array on the GPU (in a shared variable), and I'm sampling
random minibatches from it, but it seems there is a call to HostFromGpu at
every index, which causes significant delay. Is there a way to avoid this?
Here is a minimal code example, plus the debug and profiling output:
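The original example and profile are not reproduced here; as a rough sketch
(all names, shapes, and the per-row function are assumptions), one pattern
that produces a HostFromGpu call per sampled index is fetching rows one at a
time:

    import numpy as np
    import theano
    import theano.tensor as T

    # Dataset stored on the GPU in a shared variable (assumed shape).
    data = theano.shared(np.random.rand(10000, 784).astype('float32'),
                         name='data')

    # One scalar index per call: each call copies the selected row back
    # to the host, so profiling shows a HostFromGpu per sampled index.
    i = T.iscalar('i')
    get_row = theano.function([i], data[i])

    indices = np.random.randint(0, 10000, size=128).astype('int32')
    minibatch = np.array([get_row(j) for j in indices])

Indexing with a whole int vector at once, as in the reply above, avoids the
per-index transfers.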