Re: [x265] [PATCH] RISCV64: add copy_cnt assembly optimization

chen Tue, 08 Jul 2025 07:35:37 -0700

Hi Changsheng,




Thank you for the detailed information. I am glad to see that the RISC-V V 
extension is gradually maturing.




I ran a simple experiment with GCC, which can now generate vectorized code 
automatically. However, the output contained an unfamiliar instruction, 
vlseg4e32.v, and after consulting the RISC-V V documentation I was left with 
many doubts.

For example, the destination registers are (v8, v9, v10, v11) in the document, 
but (v8, v10, v12, v14) in the GCC output.

Also, many details of the assembly instructions exist only in fragmented slide 
decks from different users and companies, with no unified official document.




These issues may lead to repeated code rework in the future, so I still do not 
recommend accepting RISC-V V now.

If possible, please help the RISC-V community improve the RISC-V V ISA 
documentation. I would be pleased to accept RISC-V V as one of the target 
platforms in the future.




Regards,
Chen




Code

int test1(int *x, int N)
{
    int sum = 0;
    if (__builtin_expect(N % 4, 0))
    {
        for(int i = 0; i < N; i+=4)
        {
            sum += (x[i+0] + x[i+1] + x[i+2] + x[i+3]);
        }
    }
    else
        __builtin_unreachable();
    return sum;
}




GCC output

test1(int*, int):
        ble     a1,zero,.L4
        vsetvli a5,zero,e32,m2,ta,ma
        addiw   a4,a1,-1
        vmv.v.i v4,0
        srliw   a4,a4,2
        addiw   a4,a4,1
.L3:
        vsetvli a5,a4,e32,m2,tu,ma
        vlseg4e32.v     v8,(a0)
        slli    a3,a5,4
        sub     a4,a4,a5
        add     a0,a0,a3
        vadd.vv v2,v10,v8
        vadd.vv v2,v2,v12
        vadd.vv v2,v2,v14
        vadd.vv v4,v4,v2
        bne     a4,zero,.L3
        vsetvli a5,zero,e32,m2,ta,ma
        vmv.s.x v1,zero
        vredsum.vs      v4,v4,v1
        vmv.x.s a0,v4
        ret
.L4:
        li      a0,0
        ret
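
As a cross-check on the register question, the following is a minimal scalar 
sketch in C of the unit-stride segment load vlseg4e32.v as the V 1.0 
specification appears to define it. It assumes the rule that each field 
occupies one register group of LMUL registers, so with m2 the four fields 
start at v8, v10, v12 and v14, while with m1 they would be v8 to v11. The 
names model_vlseg4e32 and field_base_reg are hypothetical, and masking and 
tail handling are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NFIELDS 4 /* nf in vlseg4e32.v */

/* Scalar model (an assumption, not vendor code) of a unit-stride segment
 * load of 4 x 32-bit fields: fields[k][i] receives field k of segment i,
 * i.e. mem[4*i + k]. In the GCC listing, fields[0..3] would correspond to
 * the register groups starting at v8, v10, v12 and v14 (LMUL=2). */
void model_vlseg4e32(int32_t *fields[NFIELDS], const int32_t *mem, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        for (size_t k = 0; k < NFIELDS; k++)
            fields[k][i] = mem[NFIELDS * i + k];
}

/* First register number used by field k when the destination is vd and
 * each field occupies a group of lmul registers (lmul >= 1). */
unsigned field_base_reg(unsigned vd, unsigned k, unsigned lmul)
{
    assert(lmul >= 1 && vd % lmul == 0); /* register groups are aligned */
    return vd + k * lmul;
}
```

Under these assumptions, field_base_reg(8, k, 2) gives 8, 10, 12, 14 for 
k = 0..3, matching the operands of the vadd.vv instructions in the listing, 
and field_base_reg(8, k, 1) gives 8, 9, 10, 11.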




RISC-V Document
7.8.2. Vector Strided Segment Loads and Stores

Vector strided segment loads and stores move contiguous segments where each 
segment is separated by the byte-stride offset given in the rs2 GPR argument.

Note: Negative and zero strides are supported.

    # Format
    vlsseg<nf>e<eew>.v vd, (rs1), rs2, vm          # Strided segment loads
    vssseg<nf>e<eew>.v vs3, (rs1), rs2, vm         # Strided segment stores

    # Examples
    vsetvli a1, t0, e8, ta, ma
    vlsseg3e8.v v4, (x5), x6   # Load bytes at addresses x5+i*x6   into v4[i],
                              #  and bytes at addresses x5+i*x6+1 into v5[i],
                              #  and bytes at addresses x5+i*x6+2 into v6[i].

    # Examples
    vsetvli a1, t0, e32, ta, ma
    vssseg2e32.v v2, (x5), x6   # Store words from v2[i] to address x5+i*x6
                                #   and words from v3[i] to address x5+i*x6+4
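
The first example in the excerpt above can be written out as a scalar loop. 
This is only a sketch of the documented behavior of vlsseg3e8.v under the 
assumption that all vl elements are active; the function name 
model_vlsseg3e8 and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (an assumption) of vlsseg3e8.v v4, (x5), x6:
 * segment i starts at base + i*stride (stride in bytes, may be zero or
 * negative), and its three bytes go to fields[0][i], fields[1][i] and
 * fields[2][i], which stand for v4[i], v5[i] and v6[i]. */
void model_vlsseg3e8(uint8_t *fields[3], const uint8_t *base,
                     ptrdiff_t stride, size_t vl)
{
    assert(fields != NULL && base != NULL);
    for (size_t i = 0; i < vl; i++) {
        const uint8_t *seg = base + (ptrdiff_t)i * stride;
        for (size_t k = 0; k < 3; k++)
            fields[k][i] = seg[k];
    }
}
```

With a stride of 4 over a buffer holding 0, 1, 2, ..., the three fields 
receive {0, 4, 8}, {1, 5, 9} and {2, 6, 10}; a stride of 0 repeats the same 
segment in every element.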



At 2025-07-07 15:24:16, wu.changsh...@sanechips.com.cn wrote:

Hi Chen,




Thank you for your previous feedback.




I'd like to supplement some information about RISC-V Vector V1.0 and hope you 
can reconsider x265 support for the RISC-V architecture.   




1. The RISC-V community considers Vector V1.0 a stable version. The RISC-V 
Vector V1.0 was officially approved and released in 2021. The server profile 
RVA23, released in October 2024, also specifies Vector V1.0, and the RISC-V 
Instruction Set Manual Volume published the same year adopts Vector V1.0 as 
well.

2. Many chip manufacturers already support RISC-V Vector Extension V1.0, such 
as the already released SiFive P670/P470, Andes NX27V, Alibaba C920, and 
SpaceMIT X100 CPUs. In the next year or two, many more vendors will launch 
chips supporting Vector V1.0.

3. GCC experimentally introduced RISC-V Vector support in GCC 12 (May 2022) and 
officially supported RISC-V Vector V1.0 in GCC 14 (May 2024).

4. The Linux kernel merged support for RISC-V Vector V1.0 in June 2023 and 
released it in the LTS 6.21 version.  

5. Our company has already planned to deploy RISC-V servers in data centers, 
with x265 video encoding being one of the key business scenarios. We will 
continue contributing RISC-V architecture patches.




RISC-V has garnered widespread attention and strong investment, leading to 
rapid development. I believe it will become another mainstream architecture 
following x86 and Arm. RISC-V is now commercially viable and deserves adoption 
by the x265 community.







Best wishes!

Changsheng Wu

M: +86 13776570034

E: wu.changsh...@sanechips.com.cn

SANECHIPS TECHNOLOGY CO.,LTD.




Original
From: chen <chenm...@163.com>
To: 吴昌盛0318004250;
Cc: x265-devel@videolan.org 
<x265-devel@videolan.org>;mah...@multicorewareinc.com 
<mah...@multicorewareinc.com>;pavan.ta...@multicorewareinc.com 
<pavan.ta...@multicorewareinc.com>;沈显来0318003851;袁佳0318004243;吴昌盛0318004250;
Date: 2025-07-07 03:28
Subject: Re:[x265] [PATCH] RISCV64: add copy_cnt assembly optimization

Hi Changsheng,




Thanks for the patches.




However, I don't think the RISC-V V extension is stable enough yet:

v1.0 was frozen in September 2021;

v1.1 went to public review in May 2023;

there has been no further update as of July 2025.



Also, most instructions have no description of their behavior.




For example, vredsum.vs in the patch

vredsum.vs  vd, vs2, vs1, vm   # vd[0] =  sum( vs1[0] , vs2[*] )




I can only guess that it means
vd[0] =  vs1[0] + sum(vs2[*])
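
That reading can be written as a scalar loop. The sketch below assumes 
32-bit elements, wrap-around arithmetic, no masking, and vl > 0; the 
function name model_vredsum_vs is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model (an assumption) of vredsum.vs vd, vs2, vs1 at SEW=32:
 * returns the value written to vd[0], i.e. vs1[0] plus the sum of
 * vs2[0..vl-1], wrapping modulo 2^32. */
int32_t model_vredsum_vs(const int32_t *vs2, int32_t vs1_0, size_t vl)
{
    assert(vl > 0); /* with vl == 0 the instruction leaves vd unchanged */
    uint32_t acc = (uint32_t)vs1_0;
    for (size_t i = 0; i < vl; i++)
        acc += (uint32_t)vs2[i];
    return (int32_t)acc;
}
```

For vs2 = {1, 2, 3, 4} and vs1[0] = 10, this model yields 20.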




Another example is vlse8.v:

I would guess it is equivalent to x86 PSHUFB or ARM VTBL.




The above are only guesses. I have not been able to confirm my understanding 
over the past couple of years, and there are too many similar problems in the 
RISC-V V extension.

So I suggest not integrating RISC-V patches until the specification becomes 
stable enough.




Regards,

Chen

At 2025-07-06 10:08:25, wu.changsh...@sanechips.com.cn wrote:

From 7562e3a834a6a5ea76ab1b97acf915e095646cd5 Mon Sep 17 00:00:00 2001


From: Changsheng Wu <wu.changsh...@sanechips.com.cn>

Date: Sat, 5 Jul 2025 23:09:14 +0800

Subject: [PATCH] RISCV64: add copy_cnt assembly optimization




TestBench test result:

  copy_cnt[4x4] |        1.34x |          123.12   |      165.06
  copy_cnt[8x8] |        2.64x |          214.07   |      564.26
copy_cnt[16x16] |        3.96x |          563.83   |      2232.00
copy_cnt[32x32] |        7.44x |          2144.80  |      15954.42

_______________________________________________
x265-devel mailing list
x265-devel@videolan.org
https://mailman.videolan.org/listinfo/x265-devel
