I would like to compute the result of i*j for a number of i's and j's, and I would like to do so concurrently. If I use the scan function over my sequence of i's and j's, I get the desired result, but the operations are not performed concurrently. If I have 100 cores in my single GPU, I would like there to be 100 asynchronous computations (technically more, since each core has multiple threads), each doing one multiplication and writing its result into a single vector that is returned. This is similar to how multiprocessing works in base Python with CPU cores. The Theano tutorial claims that it uses the GPU's asynchronous capabilities, but I am not sure of that: I have run scan functions on the GPU, and they seem to run about as fast as, or slower than, the same code on the CPU.
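To make the request concrete, here is a minimal CPU-only sketch of the pattern I have in mind. It is not Theano code. The function names `products_sequential` and `products_pooled` are made up for illustration, and a thread pool from the standard library stands in for real parallel hardware, so it shows the structure I want rather than any actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import mul


def products_sequential(i_values, j_values):
    """Compute one product per step, the way a scan walks the (i, j) pairs."""
    out = []
    for i, j in zip(i_values, j_values):
        out.append(i * j)
    return out


def products_pooled(i_values, j_values, workers=4):
    """Hand each independent product to a pool of workers.

    Hypothetical stand-in for the GPU case: every i*j is independent,
    so the products can be computed in any order and then collected
    into one result list in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(mul, i_values, j_values))


if __name__ == "__main__":
    i_values = [1, 2, 3, 4]
    j_values = [10, 20, 30, 40]
    # Both versions should give the same vector of products.
    print(products_sequential(i_values, j_values))
    print(products_pooled(i_values, j_values))
```

The point is that no product depends on any other, so nothing forces them to be computed one after another as in the first function.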
Should I not use scan? Can this even be done in Theano? Do I have to use PyCUDA?
