[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-26 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment. In D86376#2236704, @tra wrote: > > It's still suspiciously high. AFAICT, config/push/pull is just an std::vector > push/pop. It should not take *that* long. Few function calls should not lead > to microseconds of overhead,

[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-25 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment. In D86376#2236501, @yaxunl wrote: > My previous measurements did not include a warm-up, which caused some one-time > overhead due to device initialization and loading of device binary. With warm > up, the call of

[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-25 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment. In D86376#2234824, @tra wrote: > In D86376#2234719, @yaxunl wrote: > >>> This patch appears to be somewhere in the gray area to me. My prior >>> experience with CUDA suggests that it will

[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-24 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment. In D86376#2234719, @yaxunl wrote: >> This patch appears to be somewhere in the gray area to me. My prior >> experience with CUDA suggests that it will make little to no difference. On >> the other hand, AMD GPUs may be different

[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-24 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment. In D86376#2234547, @tra wrote: > I'm OK with how the patch is implemented. > I'm still on the fence regarding whether it should be implemented. > > In D86376#2234458, @yaxunl wrote: > >>

[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-24 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment. I'm OK with how the patch is implemented. I'm still on the fence regarding whether it should be implemented. In D86376#2234458, @yaxunl wrote: > `__hipPushConfiguration`/`__hipPopConfiguration` and kernel stub can cause 40 > ns

[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-24 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl added a comment. In D86376#2234259, @tra wrote: > How much does this inlining buy you in practice? I.e. what's a typical launch > latency before/after the patch? For CUDA, config push/pop is negligible > compared to the cost of actually

[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-24 Thread Artem Belevich via Phabricator via cfe-commits
tra added a comment. How much does this inlining buy you in practice? I.e. what's a typical launch latency before/after the patch? For CUDA, config push/pop is negligible compared to the cost of actually launching the kernel on the GPU. It is measurable if the launch is asynchronous, but

[PATCH] D86376: [HIP] Improve kernel launching latency

2020-08-21 Thread Yaxun Liu via Phabricator via cfe-commits
yaxunl created this revision. yaxunl added reviewers: tra, rjmccall. yaxunl requested review of this revision. Currently clang emits the following code for a triple chevron kernel call for HIP: __hipPushCallConfiguration(grids, blocks, shmem, stream); kernel_stub(); whereas for each