Hi Karl,
This looks great! Thank you very much for this effort. I will attempt to
implement around this next week.
Cheers
Chris
On 25 April 2017 at 09:25, Karl Rupp <[email protected]> wrote:
> Hi Chris,
>
> the copy-CTOR for compressed_matrix is now implemented:
> https://github.com/viennacl/viennacl-dev/commit/0d62d8e0fb9a3eefc37aa225b5eb7195256181c9
>
> You should get the desired behavior of just updating numerical values on
> the GPU with code similar to the following:
>
> viennacl::context host_ctx(viennacl::MAIN_MEMORY);
> viennacl::compressed_matrix<T> A(N,N, host_ctx); //your 'host matrix'
> /* fill A here */
>
> viennacl::compressed_matrix<T> B(A); //create copy of A
> viennacl::context gpu_ctx(viennacl::CUDA_MEMORY);
> B.switch_memory_context(gpu_ctx); //migrate B to CUDA memory
>
> // write to B, starting at offset 0, copy 'nnz' elements
> // use host data from nonzero floating point values of A
> viennacl::backend::memory_write(B.handle(), 0, sizeof(T) * A.nnz(),
> A.handle().ram_handle().get());
>
> Just repeat the last line every time you need to update the numerical
> values on the GPU.
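[For readers of the archive: the pattern above (fixed sparsity, values-only update each time step) can be sketched independently of ViennaCL roughly as follows. `CsrValues` and `update_values` are made-up names for illustration, not ViennaCL API; the `std::copy` here plays the role of `viennacl::backend::memory_write` copying `nnz` elements into B's handle.]

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch only: stand-in for the device-side value array of a
// compressed_matrix whose sparsity pattern is already fixed.
// row_ptr and cols never change between time steps, so an update
// touches only the nnz numerical entries.
struct CsrValues {
    std::vector<double> vals; // the nnz stored values
};

// Per-time-step update: overwrite the existing values with the freshly
// assembled host values. No reallocation, no pattern rebuild.
inline void update_values(CsrValues& device, const std::vector<double>& host) {
    assert(device.vals.size() == host.size()); // same sparsity pattern
    std::copy(host.begin(), host.end(), device.vals.begin());
}
```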
>
> Please let me know how this turns out.
>
> Best regards,
> Karli
>
>
> On 04/21/2017 09:06 PM, Chris Marsh wrote:
>
>> Karl,
>>
>> No problem, the copy-constructor sounds like a perfect solution. Thanks
>> for doing this.
>>
>> How big is your system?
>>
>> The sparse matrix has approx 10^10 entries, with about 1 million
>> total non-zero elements.
>>
>>
>> 2.5min for 5 time steps sounds a lot to me.
>>
>> I should have been more clear, sorry. The 2.5 min includes a bunch of
>> other routines that run during the time step, so it is more than just
>> the matrix solve. However, that 12 s is entirely attributable to the
>> difference between the STL approach and the copy plus operator()
>> access. Also, this is running on a single laptop core instead of a
>> cluster like it should be!
>>
>> However, one still has to compare against the available column indices
>>
>> Makes sense. In my case, I think I can just say I need the 3rd or 4th
>> non-zero item in a row, as I "know" where things are. But that's a
>> non-generic case.
>>
>> Cheers
>> Chris
>>
>>
>>
>> On 21 April 2017 at 04:34, Karl Rupp <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>> Hi Chris,
>>
>> please excuse my late reply.
>>
>>
>> This is a local search operation
>>
>>
>> Oh, that isn't at all what I expected. I assumed that with the row
>> and column offset it could just index the CSR array directly?
>>
>> When you call operator(), you pass the row and column index. The row
>> index jumps to the beginning of the nonzeros for that row in the CSR
>> array. However, one still has to compare against the available
>> column indices to finally pick the correct entry (or create a new
>> one...). Only for dense matrices can you locate the respective entry
>> in the matrix directly.
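[For readers of the archive: the lookup described above can be sketched with a minimal self-contained CSR container. This is illustrative code, not ViennaCL internals; `Csr` and `get` are hypothetical names.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR container (not ViennaCL): row_ptr has one entry per row
// plus a terminator; cols/vals hold the column index and value of each
// stored nonzero, row by row.
struct Csr {
    std::vector<std::size_t> row_ptr;
    std::vector<std::size_t> cols;
    std::vector<double> vals;

    // operator()-style read access: the row pointer jumps straight to
    // the start of row i's nonzeros, but column j must still be found
    // by scanning (or binary-searching) that row's column indices.
    double get(std::size_t i, std::size_t j) const {
        for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            if (cols[k] == j)
                return vals[k];
        return 0.0; // not stored, i.e. a structural zero
    }
};
```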
>>
>>
>>
>> By how much does your code slow down?
>>
>>
>> The "optimization"? Over 5 time steps or so it was 12 s slower, out
>> of a total of 2.5 min or so. So enough that when I run it for 15000
>> time steps it adds up!
>>
>>
>> So it's 10 percent. How big is your system? 2.5min for 5 time steps
>> sounds a lot to me.
>>
>>
>> Also, do you fill the CSR matrix by increasing row index, or is
>> your code filling rows at random?
>>
>>
>> I'm filling the CSR via operator(), and that is by increasing row
>> index.
>>
>>
>> Ok, this should be acceptable in terms of performance.
>>
>>
>> However, when it is run in parallel with openmp, it will
>> effectively be random.
>>
>>
>> In parallel you should really fill the CSR array directly (possibly
>> with the exception of the first time step, where you build the
>> sparsity pattern).
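[For readers of the archive: a hedged sketch of that idea in plain C++/OpenMP, not ViennaCL. `entry_value` is a hypothetical stand-in for whatever computes each coefficient; the point is that once `row_ptr`/`cols` are fixed, every row owns a disjoint slice of `vals`, so rows can be assembled in parallel without locking.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-entry assembly routine standing in for whatever
// computes the coefficient at (row, col) in a given time step.
inline double entry_value(std::size_t row, std::size_t col) {
    return static_cast<double>(row * 10 + col); // placeholder formula
}

// Fill the CSR value array directly. Because the sparsity pattern
// (row_ptr, cols) is fixed after the first time step, each row writes
// to a disjoint range of vals, so the outer loop parallelizes safely.
inline void fill_values(const std::vector<std::size_t>& row_ptr,
                        const std::vector<std::size_t>& cols,
                        std::vector<double>& vals) {
    #pragma omp parallel for
    for (long long i = 0; i + 1 < static_cast<long long>(row_ptr.size()); ++i)
        for (std::size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            vals[k] = entry_value(static_cast<std::size_t>(i), cols[k]);
}
```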
>>
>>
>> What are you trying to accomplish?
>>
>>
>> With an OpenMP backend, I want to avoid the copy from STL ->
>> compressed_matrix. So my idea is to pre-allocate A, a
>> compressed_matrix on the host, regardless of which backend I'm using
>> (instead of the STL variant). Then I want to either solve directly
>> using A, or copy A to a GPU and solve it on the GPU if configured.
>> For the former, this is currently working well, barring the
>> operator() issues we are discussing above. The problem arises with
>> the 2nd case. I could do the context change, but once it's been
>> copied to the GPU I have to copy it *back* to take advantage of the
>> pre-allocated matrix. That is, I'd like to avoid any additional
>> memory allocations. I would like to just copy(A, gpu_A) when a GPU
>> is available. However, there is no copy from compressed_matrix to
>> compressed_matrix.
>>
>>
>> Thanks, that helps me with understanding the setting better. Let me
>> add a copy-constructor for compressed_matrix for you, so you can
>> avoid the unnecessary copy back to the host. Copying the numerical
>> entries for a fixed sparsity pattern can be done efficiently; I'll
>> send you a code snippet when I'm done with the copy-constructor.
>>
>> Best regards,
>> Karli
>>
>>
>>
_______________________________________________
ViennaCL-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-support