Re: UncheckedArray conversion
> I am aware of this. I am just scaffolding the library. Just to avoid doing > something fundamentally wrong (like reaching 80fps instead of >1000fps ;oP). > In the future, the library should be more general. In this case, I take > advance that it is always 1 byte per sample. To be honest I don't know why it > is using float in the C++ version. The API states that both the read and > write pointer are uint8. Doing floating point vector is something that I > should try in the future. I'm pretty sure it's supposed to be `void *` and the `sampleType` and `bitsPerSample` are telling you what to cast it to. Anyway, very nice to hear your improvements. Arraymancer indexing via `t[i, j]` would be slow because: 1. You are copying the data to the tensor. I have a PR pending to allow zero-copy tensors over pointers but still pending 2. Arraymancer has slightly more overhead on iteration that might suggestion because it can handles N-dimensional tensors not just (Plane, Height, Width) images. It can handle (Frame, Plane, Depth, Height, Width) for a 3D video for example. There are iterators for simple case but convolution is something I'm leaving for a proper compiler. I.e. hopefully in the future the way to specify a convolution would be similar to Halide ([https://halide-lang.org/assets/lectures/Halide_CVPR_intro.pdf](https://halide-lang.org/assets/lectures/Halide_CVPR_intro.pdf)) and that would lead to optimal code: proc convolve(A, Filter: Function): Function = # Generator for the convolution function var i, j: Domain var convH, convW: Function # Definition of the result function # Filter F = [1, 2, 1] # So 2D application is: # [[1, 2, 1], # [2, 4, 2], # [1, 2, 1]] # convH[i, j] = A[i-1, j] * F[0] + A[i, j] * F[1] + A[i+1, j] * F[2] convW[i, j] = convH[i, j-1] * F[0] + convH[i, j] * F[1] + convH[i, j] * F[2] # Optimization of the definition - completely optional convH.compute_at(convW, j) # Compute the required region of convH before convW start (i.e. merged) # convH.compute_root() # Alternatively do the height convolve before the width convolve (i.e. 2 steps) convW.tile(i, ii, 64) # Split i in tile of size 64, with ii as the tile iterator convW.parallel(i) # Iterating on tiles is done in parallel convW.tile(j, jj, 64) # Split j in tile of size 64 as well with jj as the tile iterator convW.vectorize(jj, 8) # Iterating on jj is done with size 8 vectors if possible (i.e. AVX on x86) return convW generate convolve: proc foobar(a: Tensor[float32], filter: array[3, float32]): Tensor[float32] Run (From Lux: [https://github.com/numforge/laser/blob/d1e6ae61/laser/lux_compiler/lux_dsl.nim#L43-L58](https://github.com/numforge/laser/blob/d1e6ae61/laser/lux_compiler/lux_dsl.nim#L43-L58)) The importance of loop tiling (also called loop blocking) is highlighted in this simple transposition benchmark: [https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/transpose/transpose_bench.nim](https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/transpose/transpose_bench.nim) There is a factor 3x between simple parallel transposition (`||` is the OpenMP parallel for loop) proc benchNaive(a: seq[float32], nb_samples: int) = var output = newSeq[float32](out_size) withCompilerOptimHints() let pa{.restrict.} = cast[ptr UncheckedArray[float32]](a[0].unsafeAddr) let po{.restrict.} = cast[ptr UncheckedArray[float32]](output[0].addr) bench("Naive transpose"): discard do: for i in `||`(0, M-1): for j in `||`(0, N-1, "simd"): # This only add "#pragma omp simd" po[i+j*M] = pa[j+i*N] Run 1D tiling proc benchCacheBlocking(a: seq[float32], nb_samples: int) = var output = newSeq[float32](out_size) withCompilerOptimHints() let pa{.restrict.} = a[0].unsafeAddr let po{.restrict.} = output[0].addr const blck = 64 bench("Cache blocking"): discard do: {.emit: """ // No min function in C ... #define min(a,b) (((a)<(b))?(a):(b)) #pragma omp parallel for for (int i = 0; i < `M`; i+=`blck`) for (int j = 0; j < `N`; ++j) #pragma omp simd for (int ii = i; ii < min(i+`blck`,`M`); ++ii) `po`[ii+j*`M`] = `pa`[j+ii*`N`]; """.} Run And 2D tiling proc bench2Dtiling(a: seq[float32], nb_samples: int) = var output = newSeq[float32](out_size) withCompilerOptimHints() let pa{.restrict.} = a[0].unsafeAddr let po{.restrict.} = output[0].addr const blck = 128 bench("2D Tiling"): discard do: {.emit: """
Re: UncheckedArray conversion
Well now I am amazed but on the other side of the pic. I need to check if I have done something really wrong, but I have managed to get: 6074fps using float32 (I was expecting 3000fps) and 10700fps using int32, just by taking advance of multithreading.
Re: UncheckedArray conversion
Lot to digest here. 1. I am aware of this. I am just scaffolding the library. Just to avoid doing something fundamentally wrong (like reaching 80fps instead of >1000fps ;oP). In the future, the library should be more general. In this case, I take advance that it is always 1 byte per sample. To be honest I don't know why it is using float in the C++ version. The [API](http://www.vapoursynth.com/doc/api/vapoursynth.h.html#getreadptr) states that both the read and write pointer are uint8. Doing floating point vector is something that I should try in the future. 2. I bit like chinese for me :O/. I understand that the left column is the C generated from Nim, and the rigth column is the assembler generated. The bottleneck is because of the 22s for the row1? I am surprised. At the end of the day, I first iterate on rows and then in columns. I used both strategies as can be seen here: proc `[]`*(frame:ptr VSFrameRef, plane:cint ):Plane = let ini = frame.getPtr(plane) let stride = frame.stride(plane) return Plane(ini:ini,stride:stride) Run I will give it another shot as you suggest. Regarding the triple iteration, I realized that, but I don't want to change that (yet) until I get similar performance to the C++ version with similar algorithm. I tried to perform a filter with [ArrayMancer](https://github.com/mantielero/VapourSynth.nim/blob/master/src/filters/Mancer.nim) but I got worst. My short term objective is trying to reach in the order of the 3000fps using float32 and multithreading (keeping similar to the C++ algorithm). Next step, to eliminate the triple loop on the rows and to see if I can reduce the number of times that I call getStride and getPtr. In the long term I'd like to try vectorization.
Re: UncheckedArray conversion
I'm having a look at the moment, here are things that I find sketchy, note that I don't know the low-level VapourSynth API. 1\. You are not using `sampleType` or `bitsPerSample` in your example. > From what I understood, those format fields are supposed to tell you what > your `ptr UncheckedArray[byte]` really is, if you are like in the C++ case, > it's a type-erased float pointer so you should use the same. Note that > float32 vs int might be very relevant as floating point vector operations are > way more optimized than integers (for matrix multiplication the ratio is > between 10x to 100x depending on SSE4.1 AVX2 / AVX512 and fused multiply add > support) 2\. Profiling shows that row0 access is a bottleneck > 3\. Your iteration strategy seems problematic with repeated query to the API > to get the stride and pointer in vsframe.nim proc getPtr(frame:ptr VSFrameRef, plane:cint):uint = ## Retrieves an `uint` rather than the pointer in order to perform pointer arithmetic with it cast[uint]( API.getWritePtr(frame, plane) ) proc stride(frame:ptr VSFrameRef, plane:cint):uint = API.getStride( frame, plane ).uint #type # Row = ptr UncheckedArray[uint8] proc `[]`*(frame:ptr VSFrameRef, plane:cint, row:uint):ptr UncheckedArray[uint8] = let ini:uint = frame.getPtr(plane)# Don't use other thing here let stride:uint = frame.getStride( plane ) result = cast[ptr UncheckedArray[uint8]]( ini + row * stride ) #assert(result != nil) proc `[]`*(frame:ptr VSFrameRef, plane:cint, row:cint):ptr UncheckedArray[uint8] = let ini:uint = frame.getPtr(plane)# Don't use other thing here let stride:uint = frame.getStride( plane ) result = cast[ptr UncheckedArray[uint8]]( ini + row.uint * stride ) proc `[]`*(frame:ptr VSFrameRef, plane:cint ):Plane = let ini = frame.getPtr(plane) let stride = frame.stride(plane) return Plane(ini:ini,stride:stride) proc `[]`*(plane:Plane, row:cint):Row = #ptr UncheckedArray[uint8] = cast[ptr UncheckedArray[uint8]]( plane.ini + row.uint * plane.stride ) Run Instead I suggest you use a scheme like this, only get the plane/query the API once and use the plane object that holds the metadata. This way the compiler can do inlining and loop hoisting on all those accesses. The main issue with calling the VapourSynth API repeatedly is that all those function calls are just polluting your cache, requires the compiler to move the function parameter and result to and from the stack in an inner loop. import ../src/vapoursynth type Plane*[T: SomeNumber] = object # T depends on the frame format: # - sample_type (uint or float) # - bitsPerSample (8, 16, 32, 64) # The plane must be instantiated with # the proper uint8 ... 64, float32, float64 data: ptr UncheckedArray[T] rows: int cols: int rowStride: int func getPlane*(frame: ptr VSFrameRef, plane: int, T: typedesc[SomeInteger]): Plane[T] {.inline.} = ## Extract a plane result.data = cast[ptr UncheckedArray[T]](API.getWritePtr(frame, plane)) result.rows = API.getFrameHeight(frame, plane) result.cols = API.getFrameWidth(frame, plane) result.rowStride = API.getStride(frame, plane) template `[]`*[T](plane: Plane[T], row: int, col: int): T = plane.data[plane.row * plane.rowStride + plane.col] template `[]=`*[T](plane: Plane[T], row: int, col: int, val: T) = plane.data[plane.row * plane.rowStride + plane.col] = val Run Regarding iteration strategy, your scheme is doing 3 times more memory access than a basic scheme for each row and each column (total 9x). Instead you should iterate once per row/column but in-place accumulate with `dst[dstIdx] += src[srcIdx] * filter[filterIdx]` A starter point could be my basic convolution scheme in Laser [https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73](https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73) It's for batch of images with potentially different plane numbers (called channels) between in and out (for example RGB --> Grayscale) but the iteration should be faster than yours: Note that here I tracked the indices manually instead of using an intermediate `Plane` object but both methods should be the same with an optimizing compiler. proc conv2d_direct*[T]( oim: var Tensor[T], # Output images iim: Tensor[T], # Input images ishape: TensorShape,# Shape of a single image kernel: Tensor[T], # filter kernel kshape: KernelShape,# kernel shape (should be const) padding: Padding, # Padding (should be const) strides: Strides# Strides (should be
Re: UncheckedArray conversion
**C++ stuff** In order to test the C++ version: 1. The code can be downloaded from [here](https://github.com/IFeelBloated/test_c_filters). 2. I compiled it like: $ g++ -Wall -O3 -shared -fPIC -I. -o libfilter.so GaussBlur.cxx 3\. Create a VapourSynth python filter like test_filter.vpy: import vapoursynth as vs core = vs.get_core() core.std.LoadPlugin(path='./libfilter.so') core.std.SetMaxCPU('none') clip = core.std.BlankClip(format=vs.GRAYS, length=10, fpsnum=24000, fpsden=1001, keep=True) clip = core.testc.GaussBlur(clip) clip.set_output() Run 4\. Testing: $ vspipe test_filter.vpy /dev/null Output 10 frames in 29.53 seconds (3386.27 fps) So I am just getting 3386fps with the pure C++ version while I am getting 1200fps with the Nim version. I am not comparing apples to apples here: 1. The Nim version uses int32 while the C++ version uses float. 2. The Nim version is not using getFrameAsync, so it is not taking advantage of multithreading as discussed [here](https://github.com/vapoursynth/vapoursynth/issues/551#issuecomment-610473633).
Re: UncheckedArray conversion
**Nim stuff** I have updated the [repository](https://github.com/mantielero/VapourSynth.nim). I have a local package.json with: [ { "name": "vapoursynth", "url": "https://github.com/mantielero/VapourSynth.nim;, "method": "git", "tags": [ "video", "vapoursynth" ], "description": "Video processing based on Vapoursynth", "license": "BSD" } ] With that you only need: nimble install vapoursynth The files to play with are [custom_filter.nim](https://github.com/mantielero/VapourSynth.nim/blob/master/test/custom_filter.nim) which makes use of [DrawFrame.nim](https://github.com/mantielero/VapourSynth.nim/blob/master/test/DrawFrame.nim). So you need these two files in the same folder and compile them like: nim c -d:release -d:danger custom_filter And in order to get the performance, just run it: $ ./custom_filter I will check how to do the same in C++.
Re: UncheckedArray conversion
If you have instructions to reproduce the benchmark in Nim and C++ I can help. I just need a repo to clone, the scripts to run and the dataset. I already have Vapoursynth working. Ideally you have a profiler like Intel Instruments or Apple VTune to dive into assembly. For example this is my approach in debugging performance issue: [https://github.com/nim-lang/Nim/issues/9514](https://github.com/nim-lang/Nim/issues/9514) For memory bottlenecks it's a bit different, I use the roofline model as mentioned in my convolution optimization resources: [https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md#computational-complexity](https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md#computational-complexity) For example I know that matrix multiplication and convolution can reach 90% of the peak CPU GFlop because their arithmetic intensity is high (i.e. you do over 10 operations (add/mul) per byte) so if you don't reach that perf it's because you are moving bytes to much instead of computing with them. The theoretical peak of your CPU is easy to compute: * single threaded: > `CpuGhz * VectorWidth * InstrCycle * FlopInstr` > for a CPU that supports AVX2 on float32 (so packing 8 float32) that can issue > 2 Fused-Multiply-Add per cycle at 3GHz we have `3 (GHz) * 8 (packed float32 in AVX) * 2 (FMA per cycle) * 2 (FMA = 1 add + 1 mul)` `= 96 GFlops` * multithreaded: Just multiply the single result by the number of cores. For example 10 cores would be 960 GFlops or 0.9 TFlops And then the usual way to benchmark numerical algorithm is, you know the number of operations required by your algorithm, you divide that by the time spent to do them and you have your actual flops. And you compare your actual Flops with the theoretical peak. If you only reach 20% of the peak, you have a memory bottleneck and probably need to repack before processing to optimize cache usage, if not you need to look into SIMD vectorization, prefetching, ... All of that is quite complex so what I can do is reach the naive C++ implementation performance. Going beyond is something that I want to do but it's time-consuming and I feel that it would be better to spend my time on an image processing compiler similar to what's discussed here: [https://github.com/mratsim/Arraymancer/issues/347#issuecomment-459351890](https://github.com/mratsim/Arraymancer/issues/347#issuecomment-459351890) and with a proof of concept here: * [https://github.com/numforge/laser/tree/master/laser/lux_compiler](https://github.com/numforge/laser/tree/master/laser/lux_compiler) * [https://github.com/numforge/laser/tree/master/laser/lux_compiler/core](https://github.com/numforge/laser/tree/master/laser/lux_compiler/core)
Re: UncheckedArray conversion
You are probably right. For me is the first time doing this sort of stuff. But the feeling that I have is that even small modifications in the algorithm (avoiding conversions as much as possible, for example) has a significant impact. I think that the reason is that I am interating on 100_000 frames of 640x480 pixels. For instance I have just reached 1200fps just by replacing how the rows numbers are calculated. Changing this: template clamp(val:cint, max_val:cint):untyped = min( max(val, 0), max_val).uint . for row1 in 0..(vsapi->getReadPtr(frame, plane)); auto dstp = reinterpret_cast(vsapi->getWritePtr(dst, plane)); auto h = vsapi->getFrameHeight(frame, plane); auto w = vsapi->getFrameWidth(frame, plane); auto gauss = [&](auto y, auto x) { auto clamp = [&](auto val, auto bound) { return std::min(std::max(val, 0), bound - 1); }; auto above = srcp + clamp(y - 1, h) * w; auto current = srcp + y * w; auto below = srcp + clamp(y + 1, h) * w; auto conv = above[clamp(x - 1, w)] + above[x] * 2 + above[clamp(x + 1, w)] + current[clamp(x - 1, w)] * 2 + current[x] * 4 + current[clamp(x + 1, w)] * 2 + below[clamp(x - 1, w)] + below[x] * 2 + below[clamp(x + 1, w)]; return conv / 16; }; for (auto y = 0; y < h; y++) for (auto x = 0; x < w; x++) (dstp + y * w)[x] = gauss(y, x); Run When you mention memory bottleneck, what do you mean? Where should I look at? Thanks a lot
Re: UncheckedArray conversion
Type conversion from uint8 to uint32 (or uint64) is technically free because registers have a size of 32-bit or 64-bit. If anything it's less costly because at a low-level loading a 8-bit value requires movzb (mov to register and zero extend the byte) which has a slightly high latency than plain mov for uint32 and uint64. The issue in your convolutions is memory bottleneck not compute.
Re: UncheckedArray conversion
This is part of a convolution filter. The result is in the same order of magnitude (uint8): let value:int32 = r0[col0].int32 + r0[col1].int32 * 2 + r0[col2].int32 + r1[col0].int32 * 2 + r1[col1].int32 * 4 + r1[col2].int32 * 2 + r2[col0].int32 + r2[col1].int32 * 2 + r2[col2].int32 w1[col1] = (value / 16.0).uint8 Run I am looking how to improve performance-wise without entering into SIMD stuff (that I have never used by the way). I think that all those type conversions are killing the performance that I should achieve.
Re: UncheckedArray conversion
out of curiosity, isn't this prone to integer overflow or are you using value in fact as an uint32 ?
Re: UncheckedArray conversion
`# rename to whatever, col can also be typed template get(col): int32 = r0[col].int32 let value:int32 = get(col0) + get(col1) * 2 + ... ` Run
UncheckedArray conversion
I have something like the following: > let value:int32 = r0[col0].int32 + r0[col1].int32 * 2 + ... where r0 has type ptr UncheckedArray[uint8]. How can I avoid adding .int32 each time I use one item from the unckeckedarray?