yzh119 edited a comment on pull request #10207:
URL: https://github.com/apache/tvm/pull/10207#issuecomment-1035993696
> Looks like the perf improvement isn't very much? Only when n = 4 the
> shuffle-down implementation is better than the shared memory implementation.
My typo, I have
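For context on the comparison above: a shuffle-down reduction combines a warp's lanes in log2(warpSize) register-to-register steps, while a shared-memory reduction round-trips through memory at each step. Below is a minimal plain-Python model of that tree reduction; it is illustrative only (not TVM or CUDA code), and the function name is hypothetical.

```python
def shuffle_down_reduce(lane_values):
    """Simulate a __shfl_down_sync-style tree reduction over one warp.

    At each step, lane i adds in the value held by lane i + offset,
    halving the offset until lane 0 holds the full sum.
    """
    vals = list(lane_values)
    offset = len(vals) // 2
    while offset > 0:
        for lane in range(len(vals)):
            if lane + offset < len(vals):
                vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]  # lane 0 ends up with the reduction result

print(shuffle_down_reduce(range(32)))  # sum of 0..31 = 496
```

With 32 lanes this takes 5 combining steps, which is why the advantage over a shared-memory loop shrinks as the reduce extent n gets small.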
yzh119 edited a comment on pull request #10207:
URL: https://github.com/apache/tvm/pull/10207#issuecomment-1034535980
Sure, below is the measured time of the kernel:
```python
@T.prim_func
def reduce(a: T.handle, b: T.handle, n: T.int32) -> None:
    A = T.match_buffer(a,
```
yzh119 edited a comment on pull request #10207:
URL: https://github.com/apache/tvm/pull/10207#issuecomment-1035995100
> BTW do we have this requirement in the codebase now?
@MasterJH5574 Yes, there is a notion of `group_extent` and `reduce_extent`.
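As a sketch of how those two extents might relate, a flattened thread index can be split into a group part and a reduce part, where threads sharing a group index cooperate on one reduction. This is an illustrative assumption about the indexing scheme, not TVM's actual implementation, and the helper name is hypothetical.

```python
def split_thread_index(tid, reduce_extent):
    """Decompose a flat thread id into (group index, reduce-lane index).

    Threads with the same group index work on the same reduction;
    reduce_extent is the number of threads per reduction.
    """
    group_idx = tid // reduce_extent
    reduce_idx = tid % reduce_extent
    return group_idx, reduce_idx

# With reduce_extent = 4, threads 0-3 form group 0 and threads 4-7 form group 1:
print(split_thread_index(6, 4))  # (1, 2)
```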
yzh119 edited a comment on pull request #10207:
URL: https://github.com/apache/tvm/pull/10207#issuecomment-1034575574
Some other notes. In the following case:
```python
@T.prim_func
def reduce(a: T.handle, b: T.handle, n: T.int32) -> None:
    A = T.match_buffer(a, [1,
```
yzh119 edited a comment on pull request #10207:
URL: https://github.com/apache/tvm/pull/10207#issuecomment-1034575574
There are some issues still to be solved. In the following case:
```python
@T.prim_func
def reduce(a: T.handle, b: T.handle, n: T.int32) -> None:
    A =
```