Hi Chris,

the copy constructor for compressed_matrix is now implemented:
https://github.com/viennacl/viennacl-dev/commit/0d62d8e0fb9a3eefc37aa225b5eb7195256181c9

You should get the desired behavior of just updating numerical values on 
the GPU with code similar to the following:

  viennacl::context host_ctx(viennacl::MAIN_MEMORY);
  viennacl::compressed_matrix<T> A(N,N, host_ctx); //your 'host matrix'
  /* fill A here */

  viennacl::compressed_matrix<T> B(A);   //create copy of A
  viennacl::context gpu_ctx(viennacl::CUDA_MEMORY);
  B.switch_memory_context(gpu_ctx);      //migrate B to CUDA memory

  // write to B: starting at byte offset 0, copy all 'nnz' nonzero
  // values, taking the host data from A's RAM handle
  viennacl::backend::memory_write(B.handle(), 0, sizeof(T) * A.nnz(),
                                  A.handle().ram_handle().get());

Just repeat that memory_write() call every time you need to update the 
numerical values on the GPU.
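In case it helps to see the mechanics outside of ViennaCL: a single 
memory_write() of sizeof(T) * nnz bytes suffices because CSR keeps the 
sparsity pattern (row offsets, column indices) separate from the values 
array. Here is a minimal self-contained sketch of that pattern; the 
csr_matrix struct and update_values() below are hypothetical 
illustrations, not ViennaCL API:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical minimal CSR container, for illustration only.
// ViennaCL's compressed_matrix stores the same three buffers internally.
struct csr_matrix {
  std::vector<int>    row_start; // size rows+1, offsets into cols/values
  std::vector<int>    cols;      // column index of each nonzero
  std::vector<double> values;    // numerical value of each nonzero
};

// As long as the sparsity pattern (row_start, cols) is unchanged,
// refreshing the numerical entries is one contiguous copy of
// nnz * sizeof(double) bytes -- the host-side analogue of the
// memory_write() call above.
void update_values(csr_matrix &dst, const csr_matrix &src) {
  assert(dst.values.size() == src.values.size());
  std::memcpy(dst.values.data(), src.values.data(),
              sizeof(double) * src.values.size());
}
```

The same split applies on the device side: the pattern buffers are 
uploaded once, and only the values buffer needs rewriting per time step.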

Please let me know how this turns out.

Best regards,
Karli


On 04/21/2017 09:06 PM, Chris Marsh wrote:
> Karl,
>
> No problem, the copy-constructor sounds like a perfect solution. Thanks
> for doing this.
>
>     How big is your system?
>
> The sparse matrix has approx 10^10 entries, with about 1 million total non-zero
> elements.
>
>
>     2.5min for 5 time steps sounds a lot to me.
>
> I should have been more clear, sorry. The 2.5min includes a bunch of
> other routines that are being run for the timestep, so it is more than
> just the matrix solve. However, that 12s is entirely attributable to the
> difference between STL and the copy and the operator() access. Also,
> running on a single laptop core instead of a cluster like it should be!
>
>     However, one still has to compare against the available column indices
>
> Makes sense. In my case, I think I can just say I need the 3rd or 4th
> non-zero item in a row, as I "know" where things are. But that's a non-generic
> case.
>
> Cheers
> Chris
>
>
>
> On 21 April 2017 at 04:34, Karl Rupp <[email protected]
> <mailto:[email protected]>> wrote:
>
>     Hi Chris,
>
>     please excuse my late reply.
>
>
>                 This is a local search operation
>
>
>             Oh, that isn't at all what I expected. I assumed with the
>         row, col
>             offset it could just index the CSR array directly?
>
>
>     when you call operator(), you pass the row and column index. The row
>     index jumps at the beginning of nonzeros for that row in the CSR
>     array. However, one still has to compare against the available
>     column indices to finally pick the correct entry (or create a new
>     one...). Only for dense matrices can you locate the respective entry
>     in the matrix directly.
>
>
>
>                  By how much does your code slow down?
>
>
>             The "optimization"? Over 5 time steps or so it was 12 s
>         slower, out
>             of a total of 2.5min or so. So enough that when I run it for
>         15000
>             time steps it adds up!
>
>
>     So it's 10 percent. How big is your system? 2.5min for 5 time steps
>     sounds a lot to me.
>
>
>                  Also, do you fill the CSR matrix by increasing row
>         index, or is
>                 your code filling rows at random?
>
>
>             I'm filling the CSR via operator(), and that is by
>         increasing row
>             index.
>
>
>     Ok, this should be acceptable in terms of performance.
>
>
>         However, when it is run in parallel with OpenMP, it will
>             effectively be random.
>
>
>     In parallel you should really fill the CSR array directly (possibly
>     with the exception of the first time step, where you build the
>     sparsity pattern)
>
>
>                  What are you trying to accomplish?
>
>
>             With an OpenMP backend, I want to avoid the copy from STL ->
>             compressed_matrix. So my idea is to pre-allocate A, a
>             compressed_matrix on the host, regardless of what backend
>         I'm using
>             (instead of the STL variant). Then I want to either solve
>         directly
>             using A, or I want to copy A to a GPU and solve it on the GPU if
>             configured. For the former, this is currently working well,
>         barring
>             the operator() issues we are discussing above.  The problem
>         arises
>             with the 2nd case. I could do the context change, but once
>         it's been
>             copied to the GPU I have to copy it *back* to take advantage
>         of the
>             pre-allocated matrix. That is, I'd like to avoid any additional
>             memory allocations. I would like to just copy(A,gpu_A) when
>         gpu is
>             available. However, there is no copy for compressed_matrix to
>             compressed_matrix.
>
>
>     Thanks, that helps me with understanding the setting better. Let me
>     add a copy-constructor for compressed_matrix for you, so you can
>     avoid the unnecessary copy back to the host. Copying the numerical
>     entries for a fixed sparsity pattern can be done efficiently; I'll
>     send you a code snippet when I'm done with the copy-constructor.
>
>     Best regards,
>     Karli
>
>

_______________________________________________
ViennaCL-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-support
