On 5/29/2023 11:29 AM, Wu, Fei wrote:
> On 5/28/2023 1:06 AM, Petr Pavlu wrote:
>> On 21. Apr 23 17:25, Jojo R wrote:
>>> We are considering adding the RVV/Vector [1] feature to Valgrind, and
>>> there are some challenges.
>>> RVV follows the same programming model as ARM's SVE [2]: it is
>>> scalable/VLA, meaning the vector length is agnostic.
>>> ARM's SVE is not supported in Valgrind :(
>>>
>>> There are three major issues in implementing the RVV instruction set
>>> in Valgrind:
>>>
>>> 1. Scalable vector register width VLENB
>>> 2. Runtime changing property of LMUL and SEW
>>> 3. Lack of proper VEX IR to represent all vector operations
>>>
>>> We propose applicable methods to solve 1 and 2. As for 3, we explore
>>> several possible but perhaps imperfect approaches to handle different
>>> cases.

I did a very basic prototype for vlen Vector-IR, particularly for the
RISC-V Vector extension (RVV):
* Define new iops such as Iop_VAdd8/16/32/64; the difference from the
  existing SIMD versions is that no element count is specified, unlike
  Iop_Add8x32.
* Define a new IR type Ity_VLen alongside existing types such as Ity_I64
  and Ity_V256.
* Define a new class HRcVecVLen in HRegClass for vlen vector registers.

The real length is embedded in both IROp and IRType for vlen ops/types.
It is runtime-decided and already known when handling an insn such as
vadd, which allows more flexibility, e.g. the backend can issue an extra
vsetvl if necessary.

With the above, an RVV instruction in the guest can be passed from the
frontend, to memcheck, to the backend, and the final RVV insn can be
generated during host isel; a very basic testcase has been tested.

Now here come the complexities:

1. RVV has the concept of LMUL, which groups multiple (or partial)
   vector registers, e.g. when LMUL==2, v2 means the real v2+v3. This
   complicates register allocation.
2. RVV uses the "implicit" v0 for the mask: its content must be loaded
   into the exact "v0" register instead of any other one if host isel
   wants to leverage RVV insns. This implicitness in the ISA requires
   more explicitness in the Valgrind implementation.

For #1 LMUL, a new register allocation algorithm could be added; it
would be great if someone is willing to try that, though I'm not sure
how much effort it would take. The other way is to split the op into
multiple ops which each take only one vector register. Taking vadd as an
example, one vadd with LMUL=2 becomes two vadds run with LMUL=1. This
still works for the widening insns, and most of the arithmetic insns can
be covered this way. The exception could be the register gather insn
vrgather, for which we can resort to other means, e.g. scalar code or a
helper.
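To make the splitting idea concrete, here is a minimal sketch in plain C
of how one LMUL=2 vadd could be lowered to two LMUL=1 vadds. This is a
hypothetical model, not actual Valgrind or VEX code; VLENB=16 bytes and
8-bit elements are assumptions picked purely for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* VLENB (bytes per vector register) is an assumed, illustrative value;
 * real hardware reports it at runtime. */
enum { VLENB = 16 };

/* One LMUL=1 vadd: operates on exactly one vector register's worth of
 * 8-bit elements. */
static void vadd8_lmul1(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    for (size_t i = 0; i < VLENB; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}

/* One LMUL=2 vadd over a register group (e.g. v2 meaning v2+v3),
 * lowered to two back-to-back LMUL=1 vadds on the two consecutive
 * registers of the group. */
static void vadd8_lmul2(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    vadd8_lmul1(dst,         a,         b);
    vadd8_lmul1(dst + VLENB, a + VLENB, b + VLENB);
}
```

This decomposition is valid for elementwise ops because no element
crosses a register boundary; vrgather is exactly the case where elements
do cross boundaries, which is why it needs a different treatment.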
For #2 v0 mask, one way is to handle the mask at the very beginning in
guest_riscv64_toIR.c, similar to what the AVX port does:

a) Read the whole dest register without the mask.
b) Generate the unmasked result by running the op without the mask.
c) Apply the mask to a) and b) to generate the final dest.

By doing this, an insn with a mask is converted into non-masked ones.
Although more insns are generated, the performance should be acceptable.
There are still exceptions, e.g. vadc (Add-with-Carry), where v0 is used
not as a mask but as a carry, but as mentioned above, it's okay to use
other means for a few insns. Eventually, we can pass the v0 mask down to
the backend if that proves a better solution.

This approach will introduce a bunch of new vlen Vector IRs, especially
arithmetic IRs such as vadd. My goal is a good solution which takes
reasonable time to reach usable status, yet is still able to evolve and
is generic enough for other vector ISAs.

Any comments?

Best Regards,
Fei.

>>> We start from 1. As each guest register should be described in the
>>> VEXGuestState struct, the vector registers with scalable width of
>>> VLENB can be added into VEXGuestState as arrays using an allowable
>>> maximum length like 2048/4096.
>>
>> The size of VexGuestRISCV64State is currently 592 bytes. Adding these
>> large vector registers will bump it by 32*2048/8=8192 bytes.
>>
> Yes, that's the reason the vlen is set to 128 in my RFC patches;
> that's the largest room for vector registers in the current design.
>
>> The baseblock layout in VEX is: the guest state, two equally sized
>> areas for the shadow state, and then a spill area. The RISC-V port
>> accesses the baseblock in generated code via x8/s0. The register is
>> set to the address of the baseblock+2048 (file
>> coregrind/m_dispatch/dispatch-riscv64-linux.S). The extra offset is
>> a small optimization to utilize the fact that load/store instructions
>> in RVI have a signed offset in the range [-2048,2047].
>> The end result is that it is possible to access the baseblock data
>> using only a single instruction.
>>
> Nice design.
>
>> Adding the new vector registers will mean that more instructions are
>> necessary. For instance, accessing any shadow guest state would
>> naively require a sequence of LUI+ADDI+LOAD/STORE.
>>
>> I suspect this could affect performance quite a bit and might need
>> some optimizing.
>>
> Yes. Can we separate the vector registers from the other ones, i.e. is
> it possible to use two baseblocks? Or we can do some experiments to
> measure the overhead.
>
>>>
>>> The actual available access range can be determined at Valgrind
>>> startup time by querying the CPU for its vector capability or some
>>> suitable setup steps.
>>
>> Something to consider is that the virtual CPU provided by Valgrind
>> does not necessarily need to match the host CPU. For instance, VEX
>> could hardcode that its vector registers are only 128 bits in size.
>>
>> I was originally hoping that this is how support for the V extension
>> could be added, but the LMUL grouping looks to break this model.
>>
> Originally I had the same idea, but 128-vlen hardware cannot run
> software built for a larger vlen, e.g. clang has the option
> -riscv-v-vector-bits-min; if it is set to 256, then clang assumes the
> underlying hardware has a vlen of at least 256.
>
>>>
>>> To solve problem 2, we are inspired by already-proven techniques in
>>> QEMU, where translation blocks are broken up when certain critical
>>> CSRs are set. Because the guest-code-to-IR translation relies on the
>>> precise values of LMUL/SEW, and they may change within a basic
>>> block, we can break up the basic block each time a vsetvl{i}
>>> instruction is encountered and return to the scheduler to execute
>>> the translated code and update LMUL/SEW. Accordingly, translation
>>> cache management should be refactored to detect changes of LMUL/SEW
>>> and invalidate the outdated code cache.
>>> Without loss of generality, the LMUL/SEW should be encoded into a
>>> ULong flag such that other architectures can leverage this flag to
>>> store their arch-dependent information. The TTentry struct should
>>> also take the flag into account for both insertion and deletion. By
>>> doing this, the flag carries the newest LMUL/SEW throughout the
>>> simulation and can be passed to the disassemble functions via the
>>> VEXArchInfo struct, such that we can get the real and newest values
>>> of LMUL and SEW to facilitate our translation.
>>>
>>> Also, some architecture-related code needs to be taken care of. In
>>> the m_dispatch part, the disp_cp_xindir function looks up the code
>>> cache using hardcoded assembly by checking only the requested guest
>>> state IP and the translation cache entry address, with no further
>>> constraints. Many other modules should be checked to ensure the
>>> in-time update of LMUL/SEW is instantly visible to the essential
>>> parts of Valgrind.
>>>
>>>
>>> The last remaining big issue is 3, for which we introduce some
>>> ad-hoc approaches. We summarize these approaches into three types:
>>>
>>> 1. Break down a vector instruction into scalar VEX IR ops.
>>> 2. Break down a vector instruction into fixed-length VEX IR ops.
>>> 3. Use dirty helpers to realize vector instructions.
>>
>> I would also look at adding new VEX IR ops for scalable vector
>> instructions. In particular, if it could be shown that RVV and SVE
>> can use the same new ops, then that would make a good argument for
>> adding them.
>>
>> Perhaps interesting is whether such new scalable vector ops could
>> also represent fixed-length operations on other architectures, but
>> that is just me thinking out loud.
>>
> It's a good idea to consolidate all vector/simd handling together; the
> challenge is to verify its feasibility and to speed up the adaptation
> progress, as it's expected to take more effort and a longer time.
> Is there anyone with knowledge or experience of other ISAs such as
> avx/sve on Valgrind who can share the pain and gain, or can we do a
> quick prototype?
>
> Thanks,
> Fei.
>
>>> [...]
>>> In summary, we are far from reaching a truly applicable solution for
>>> adding vector extensions to Valgrind. We need to do detailed and
>>> comprehensive estimations of the different vector instruction
>>> categories.
>>>
>>> Any feedback is welcome on GitHub [3] as well.
>>>
>>>
>>> [1] https://github.com/riscv/riscv-v-spec
>>>
>>> [2]
>>> https://community.arm.com/arm-research/b/articles/posts/the-arm-scalable-vector-extension-sve
>>>
>>> [3] https://github.com/petrpavlu/valgrind-riscv64/issues/17
>>
>> Sorry for not being more helpful at this point. As mentioned in the
>> GitHub issue, I still need to get myself more familiar with RVV and
>> how Valgrind handles vector instructions.
>>
>> Thanks,
>> Petr
>>
>>
>>
>> _______________________________________________
>> Valgrind-developers mailing list
>> valgrind-develop...@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/valgrind-developers


_______________________________________________
Valgrind-users mailing list
Valgrind-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/valgrind-users