Re: UncheckedArray conversion

2020-04-08 Thread mratsim
> I am aware of this. I am just scaffolding the library. Just to avoid doing 
> something fundamentally wrong (like reaching 80fps instead of >1000fps ;oP). 
> In the future, the library should be more general. In this case, I take 
> advantage of the fact that it is always 1 byte per sample. To be honest I 
> don't know why it is using float in the C++ version. The API states that both 
> the read and write pointers are uint8. Doing floating point vectorization is 
> something that I should try in the future.

I'm pretty sure it's supposed to be `void *` and the `sampleType` and 
`bitsPerSample` are telling you what to cast it to.
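
For illustration, a minimal sketch of that dispatch (assuming the Nim wrapper exposes `getFrameFormat` and the `VSFormat` fields `sampleType`/`bitsPerSample` like the C API does; `processPlane` is a hypothetical generic helper, not part of the wrapper):


proc dispatchPlane(frame: ptr VSFrameRef, plane: cint) =
  # sampleType is 0 (stInteger) or 1 (stFloat) in the C API
  let fmt = API.getFrameFormat(frame)
  let p = cast[pointer](API.getWritePtr(frame, plane))
  if fmt.sampleType == 0:            # integer samples
    case fmt.bitsPerSample
    of 8:  processPlane(cast[ptr UncheckedArray[uint8]](p), frame, plane)
    of 16: processPlane(cast[ptr UncheckedArray[uint16]](p), frame, plane)
    else:  discard
  else:                              # float samples
    case fmt.bitsPerSample
    of 32: processPlane(cast[ptr UncheckedArray[float32]](p), frame, plane)
    else:  discard

Run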

Anyway, very nice to hear your improvements.

Arraymancer indexing via `t[i, j]` would be slow because:

  1. You are copying the data into the tensor. I have a PR to allow zero-copy 
tensors over pointers, but it is still pending.
  2. Arraymancer has slightly more iteration overhead than you might expect 
because it can handle N-dimensional tensors, not just (Plane, Height, Width) 
images. It can handle (Frame, Plane, Depth, Height, Width) for a 3D video, for 
example. There are iterators for the simple cases, but convolution is something 
I'm leaving for a proper compiler.



I.e. hopefully in the future the way to specify a convolution would be similar 
to Halide 
([https://halide-lang.org/assets/lectures/Halide_CVPR_intro.pdf](https://halide-lang.org/assets/lectures/Halide_CVPR_intro.pdf))
 and that would lead to optimal code:


proc convolve(A, Filter: Function): Function =
  # Generator for the convolution function
  var i, j: Domain
  var convH, convW: Function
  
  # Definition of the result function
  # Filter = [1, 2, 1]
  # So the 2D application is:
  # [[1, 2, 1],
  #  [2, 4, 2],
  #  [1, 2, 1]]
  #
  convH[i, j] = A[i-1, j] * Filter[0] + A[i, j] * Filter[1] + A[i+1, j] * Filter[2]
  convW[i, j] = convH[i, j-1] * Filter[0] + convH[i, j] * Filter[1] + convH[i, j+1] * Filter[2]
  
  # Optimization of the definition - completely optional
  convH.compute_at(convW, j) # Compute the required region of convH before convW starts (i.e. merged)
  # convH.compute_root()     # Alternatively do the height convolve before the width convolve (i.e. 2 steps)
  
  convW.tile(i, ii, 64)  # Split i in tiles of size 64, with ii as the tile iterator
  convW.parallel(i)      # Iterating on tiles is done in parallel
  convW.tile(j, jj, 64)  # Split j in tiles of size 64 as well, with jj as the tile iterator
  convW.vectorize(jj, 8) # Iterating on jj is done with size-8 vectors if possible (i.e. AVX on x86)
  
  return convW

generate convolve:
  proc foobar(a: Tensor[float32], filter: array[3, float32]): Tensor[float32]


Run

(From Lux: 
[https://github.com/numforge/laser/blob/d1e6ae61/laser/lux_compiler/lux_dsl.nim#L43-L58](https://github.com/numforge/laser/blob/d1e6ae61/laser/lux_compiler/lux_dsl.nim#L43-L58))

The importance of loop tiling (also called loop blocking) is highlighted in 
this simple transposition benchmark: 
[https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/transpose/transpose_bench.nim](https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/transpose/transpose_bench.nim)

There is a factor of 3x between the simple parallel transposition (`||` is the 
OpenMP parallel for loop) and the tiled versions below.


proc benchNaive(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  
  let pa{.restrict.} = cast[ptr UncheckedArray[float32]](a[0].unsafeAddr)
  let po{.restrict.} = cast[ptr UncheckedArray[float32]](output[0].addr)
  
  bench("Naive transpose"):
    discard
  do:
    for i in `||`(0, M-1):
      for j in `||`(0, N-1, "simd"): # This only adds "#pragma omp simd"
        po[i+j*M] = pa[j+i*N]


Run

1D tiling


proc benchCacheBlocking(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  
  let pa{.restrict.} = a[0].unsafeAddr
  let po{.restrict.} = output[0].addr
  const blck = 64
  
  bench("Cache blocking"):
    discard
  do:
    {.emit: """
// No min function in C ...
#define min(a,b) (((a)<(b))?(a):(b))

#pragma omp parallel for
for (int i = 0; i < `M`; i+=`blck`)
  for (int j = 0; j < `N`; ++j)
    #pragma omp simd
    for (int ii = i; ii < min(i+`blck`,`M`); ++ii)
      `po`[ii+j*`M`] = `pa`[j+ii*`N`];
""".}


Run

And 2D tiling 


proc bench2Dtiling(a: seq[float32], nb_samples: int) =
  var output = newSeq[float32](out_size)
  withCompilerOptimHints()
  
  let pa{.restrict.} = a[0].unsafeAddr
  let po{.restrict.} = output[0].addr
  const blck = 128
  
  bench("2D Tiling"):
discard
  do:
{.emit: """

Re: UncheckedArray conversion

2020-04-08 Thread mantielero
Well, now I am amazed, but from the other side of the picture.

I need to check whether I have done something really wrong, but I have managed 
to get 6074fps using float32 (I was expecting 3000fps) and 10700fps using 
int32, just by taking advantage of multithreading. 


Re: UncheckedArray conversion

2020-04-08 Thread mantielero
A lot to digest here.

  1. I am aware of this. I am just scaffolding the library. Just to avoid doing 
something fundamentally wrong (like reaching 80fps instead of >1000fps ;oP). In 
the future, the library should be more general. In this case, I take advantage 
of the fact that it is always 1 byte per sample. To be honest I don't know why 
it is using float in the C++ version. The 
[API](http://www.vapoursynth.com/doc/api/vapoursynth.h.html#getreadptr) states 
that both the read and write pointers are uint8. Doing floating point 
vectorization is something that I should try in the future.
  2. A bit like Chinese for me :O/. I understand that the left column is the C 
generated from Nim, and the right column is the generated assembly. Is the 
bottleneck because of the 22s for row1? I am surprised. At the end of the day, 
I first iterate on rows and then on columns.



I used both strategies as can be seen here:


proc `[]`*(frame:ptr VSFrameRef, plane:cint ):Plane =
  let ini = frame.getPtr(plane)
  let stride = frame.stride(plane)
  return Plane(ini:ini,stride:stride)

Run

I will give it another shot as you suggest.

Regarding the triple iteration, I realized that, but I don't want to change it 
(yet) until I get performance similar to the C++ version with a similar 
algorithm.

I tried to perform a filter with 
[ArrayMancer](https://github.com/mantielero/VapourSynth.nim/blob/master/src/filters/Mancer.nim)
 but I got worse results.

My short-term objective is to reach on the order of 3000fps using float32 and 
multithreading (keeping the algorithm similar to the C++ one).

Next step: eliminate the triple loop on the rows and see if I can reduce the 
number of times I call getStride and getPtr.

In the long term I'd like to try vectorization.


Re: UncheckedArray conversion

2020-04-07 Thread mratsim
I'm having a look at the moment; here are the things that I find sketchy. Note 
that I don't know the low-level VapourSynth API.

1\. You are not using `sampleType` or `bitsPerSample` in your example.

> From what I understood, those format fields are supposed to tell you what 
> your `ptr UncheckedArray[byte]` really is. If you are in the same situation 
> as the C++ case, it's a type-erased float pointer, so you should use the 
> same. Note that float32 vs int might be very relevant, as floating point 
> vector operations are way more optimized than integer ones (for matrix 
> multiplication the ratio is between 10x and 100x depending on SSE4.1 / AVX2 / 
> AVX512 and fused multiply-add support).

2\. Profiling shows that row0 access is a bottleneck.

3\. Your iteration strategy seems problematic, with repeated queries to the API 
to get the stride and pointer in vsframe.nim:


proc getPtr(frame:ptr VSFrameRef, plane:cint):uint =
  ## Retrieves an `uint` rather than the pointer in order to perform pointer arithmetic with it
  cast[uint]( API.getWritePtr(frame, plane) )

proc stride(frame:ptr VSFrameRef, plane:cint):uint =
  API.getStride( frame, plane ).uint

#type
#  Row = ptr UncheckedArray[uint8]

proc `[]`*(frame:ptr VSFrameRef, plane:cint, row:uint):ptr UncheckedArray[uint8] =
  let ini:uint = frame.getPtr(plane)          # Don't use anything else here
  let stride:uint = frame.getStride( plane )
  result = cast[ptr UncheckedArray[uint8]]( ini + row * stride )
  #assert(result != nil)

proc `[]`*(frame:ptr VSFrameRef, plane:cint, row:cint):ptr UncheckedArray[uint8] =
  let ini:uint = frame.getPtr(plane)          # Don't use anything else here
  let stride:uint = frame.getStride( plane )
  result = cast[ptr UncheckedArray[uint8]]( ini + row.uint * stride )

proc `[]`*(frame:ptr VSFrameRef, plane:cint ):Plane =
  let ini = frame.getPtr(plane)
  let stride = frame.stride(plane)
  return Plane(ini:ini,stride:stride)

proc `[]`*(plane:Plane, row:cint):Row = # ptr UncheckedArray[uint8]
  cast[ptr UncheckedArray[uint8]]( plane.ini + row.uint * plane.stride )


Run

Instead I suggest you use a scheme like this: get the plane / query the API 
only once, and use a plane object that holds the metadata. This way the 
compiler can do inlining and loop hoisting on all those accesses.

The main issue with calling the VapourSynth API repeatedly is that all those 
function calls pollute your cache and require the compiler to move the function 
parameters and results to and from the stack in an inner loop.


import ../src/vapoursynth

type Plane*[T: SomeNumber] = object
  # T depends on the frame format:
  # - sample_type (uint or float)
  # - bitsPerSample (8, 16, 32, 64)
  # The plane must be instantiated with
  # the proper uint8 ... 64, float32, float64
  data: ptr UncheckedArray[T]
  rows: int
  cols: int
  rowStride: int

func getPlane*(frame: ptr VSFrameRef, plane: int, T: typedesc[SomeNumber]): Plane[T] {.inline.} =
  ## Extract a plane
  result.data = cast[ptr UncheckedArray[T]](API.getWritePtr(frame, plane))
  result.rows = API.getFrameHeight(frame, plane)
  result.cols = API.getFrameWidth(frame, plane)
  result.rowStride = API.getStride(frame, plane)

template `[]`*[T](plane: Plane[T], row: int, col: int): T =
  plane.data[row * plane.rowStride + col]

template `[]=`*[T](plane: Plane[T], row: int, col: int, val: T) =
  plane.data[row * plane.rowStride + col] = val


Run

Regarding the iteration strategy, your scheme is doing 3 times more memory 
accesses than a basic scheme for each row and each column (9x in total). 
Instead you should iterate once per row/column and accumulate in place with 
`dst[dstIdx] += src[srcIdx] * filter[filterIdx]`.
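
As a rough illustration of that accumulate-in-place pattern (a sketch only: it reuses the hypothetical `Plane[T]` accessors above, assumes its fields are visible, and `clampIdx`/`convolve3x3` are made-up names, not existing APIs):


func clampIdx(v, hi: int): int {.inline.} =
  min(max(v, 0), hi - 1)

proc convolve3x3[T](dst, src: Plane[T], filter: array[3, array[3, T]]) =
  # One pass over the output: each pixel accumulates its 3x3 neighborhood
  # into a single local variable instead of re-deriving three row pointers.
  for r in 0 ..< src.rows:
    for c in 0 ..< src.cols:
      var acc = T(0)
      for fr in -1 .. 1:
        for fc in -1 .. 1:
          acc += src[clampIdx(r + fr, src.rows), clampIdx(c + fc, src.cols)] *
                 filter[fr + 1][fc + 1]
      dst[r, c] = acc

Run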

A starting point could be my basic convolution scheme in Laser 
[https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73](https://github.com/numforge/laser/blob/d1e6ae61/benchmarks/convolution/conv2d_direct_convolution.nim#L8-L73)

It's for batches of images with potentially different plane counts (called 
channels) between input and output (for example RGB --> grayscale), but the 
iteration should be faster than yours:

Note that here I tracked the indices manually instead of using an intermediate 
`Plane` object, but both methods should be equivalent with an optimizing compiler.


proc conv2d_direct*[T](
    oim: var Tensor[T],     # Output images
    iim: Tensor[T],         # Input images
    ishape: TensorShape,    # Shape of a single image
    kernel: Tensor[T],      # Filter kernel
    kshape: KernelShape,    # Kernel shape (should be const)
    padding: Padding,       # Padding (should be const)
    strides: Strides        # Strides (should be const)

Re: UncheckedArray conversion

2020-04-07 Thread mantielero
**C++ stuff**

In order to test the C++ version:

  1. The code can be downloaded from 
[here](https://github.com/IFeelBloated/test_c_filters).
  2. I compiled it like:

$ g++ -Wall -O3 -shared -fPIC -I. -o libfilter.so GaussBlur.cxx




3\. Create a VapourSynth python filter like test_filter.vpy:


import vapoursynth as vs
core = vs.get_core()
core.std.LoadPlugin(path='./libfilter.so')
core.std.SetMaxCPU('none')
clip = core.std.BlankClip(format=vs.GRAYS, length=100000, fpsnum=24000, fpsden=1001, keep=True)
clip = core.testc.GaussBlur(clip)
clip.set_output()

Run

4\. Testing:


$ vspipe test_filter.vpy /dev/null
Output 100000 frames in 29.53 seconds (3386.27 fps)

So I am just getting 3386fps with the pure C++ version while I am getting 
1200fps with the Nim version.

I am not comparing apples to apples here:

  1. The Nim version uses int32 while the C++ version uses float.
  2. The Nim version is not using getFrameAsync, so it is not taking advantage 
of multithreading as discussed 
[here](https://github.com/vapoursynth/vapoursynth/issues/551#issuecomment-610473633).




Re: UncheckedArray conversion

2020-04-07 Thread mantielero
**Nim stuff**

I have updated the [repository](https://github.com/mantielero/VapourSynth.nim). 
I have a local package.json with:


[
    {
        "name": "vapoursynth",
        "url": "https://github.com/mantielero/VapourSynth.nim",
        "method": "git",
        "tags": [
            "video",
            "vapoursynth"
        ],
        "description": "Video processing based on Vapoursynth",
        "license": "BSD"
    }
]

With that you only need:


nimble install vapoursynth

The files to play with are 
[custom_filter.nim](https://github.com/mantielero/VapourSynth.nim/blob/master/test/custom_filter.nim)
 which makes use of 
[DrawFrame.nim](https://github.com/mantielero/VapourSynth.nim/blob/master/test/DrawFrame.nim).
 So you need these two files in the same folder and compile them like:


nim c -d:release -d:danger custom_filter

And in order to get the performance, just run it:


$ ./custom_filter

I will check how to do the same in C++. 


Re: UncheckedArray conversion

2020-04-07 Thread mratsim
If you have instructions to reproduce the benchmark in Nim and C++ I can help.

I just need a repo to clone, the scripts to run and the dataset. I already have 
Vapoursynth working.

Ideally you have a profiler like Intel VTune or Apple Instruments to dive into 
the assembly.

For example this is my approach in debugging performance issue: 
[https://github.com/nim-lang/Nim/issues/9514](https://github.com/nim-lang/Nim/issues/9514)

For memory bottlenecks it's a bit different: I use the roofline model as 
mentioned in my convolution optimization resources: 
[https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md#computational-complexity](https://github.com/numforge/laser/blob/master/research/convolution_optimisation_resources.md#computational-complexity)

For example, I know that matrix multiplication and convolution can reach 90% of 
the peak CPU GFlops because their arithmetic intensity is high (i.e. you do 
over 10 operations (add/mul) per byte), so if you don't reach that perf it's 
because you are moving bytes around too much instead of computing with them.

The theoretical peak of your CPU is easy to compute:

  * single threaded: `CpuGhz * VectorWidth * InstrCycle * FlopInstr`. For a CPU 
that supports AVX2 on float32 (so packing 8 float32) and that can issue 2 
fused multiply-adds per cycle at 3GHz, we have 
`3 (GHz) * 8 (packed float32 in AVX) * 2 (FMA per cycle) * 2 (FMA = 1 add + 1 mul) = 96 GFlops`.

  * multithreaded: just multiply the single-threaded result by the number of 
cores. For example 10 cores would be 960 GFlops, or about 0.96 TFlops.



And then the usual way to benchmark a numerical algorithm is: you know the 
number of operations required by your algorithm, you divide that by the time 
spent doing them, and you have your actual Flops. Then you compare your actual 
Flops with the theoretical peak. If you only reach 20% of the peak, you have a 
memory bottleneck and probably need to repack data before processing to 
optimize cache usage; if not, you need to look into SIMD vectorization, 
prefetching, ...
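
To make that concrete for the Gaussian blur discussed in this thread, a rough back-of-envelope (the ~17 ops per pixel is an assumption, the 3386fps and 640x480 figures are the ones quoted elsewhere in the thread, and comparing against the single-core peak only holds for a single-threaded run):


let
  opsPerPixel = 17.0              # ~9 mul + 8 add for the 3x3 kernel, ignoring the /16
  pixelsPerFrame = 640.0 * 480.0
  fps = 3386.0                    # figure quoted for the C++ version
  achievedGFlops = opsPerPixel * pixelsPerFrame * fps / 1e9
echo achievedGFlops               # ~17.7 GFlops
echo achievedGFlops / 96.0        # ~18% of the 96 GFlops peak computed above

Run

Landing under the 20% mark is consistent with the memory-bottleneck diagnosis above.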

All of that is quite complex, so what I can do is help you reach the 
performance of the naive C++ implementation.

Going beyond is something that I want to do but it's time-consuming and I feel 
that it would be better to spend my time on an image processing compiler 
similar to what's discussed here: 
[https://github.com/mratsim/Arraymancer/issues/347#issuecomment-459351890](https://github.com/mratsim/Arraymancer/issues/347#issuecomment-459351890)
 and with a proof of concept here:

  * 
[https://github.com/numforge/laser/tree/master/laser/lux_compiler](https://github.com/numforge/laser/tree/master/laser/lux_compiler)
  * 
[https://github.com/numforge/laser/tree/master/laser/lux_compiler/core](https://github.com/numforge/laser/tree/master/laser/lux_compiler/core)




Re: UncheckedArray conversion

2020-04-07 Thread mantielero
You are probably right. It's the first time I'm doing this sort of stuff. But 
the feeling I have is that even small modifications to the algorithm (avoiding 
conversions as much as possible, for example) have a significant impact. I 
think the reason is that I am iterating on 100_000 frames of 640x480 pixels.

For instance, I have just reached 1200fps just by replacing how the row numbers 
are calculated. Changing this:


template clamp(val:cint, max_val:cint):untyped =
   min( max(val, 0), max_val).uint

for row1 in 0..<...


Run

For reference, the corresponding part of the C++ GaussBlur.cxx is:


auto srcp = reinterpret_cast<const float *>(vsapi->getReadPtr(frame, plane));
auto dstp = reinterpret_cast<float *>(vsapi->getWritePtr(dst, plane));
auto h = vsapi->getFrameHeight(frame, plane);
auto w = vsapi->getFrameWidth(frame, plane);

auto gauss = [&](auto y, auto x) {

    auto clamp = [&](auto val, auto bound) {
        return std::min(std::max(val, 0), bound - 1);
    };

    auto above = srcp + clamp(y - 1, h) * w;
    auto current = srcp + y * w;
    auto below = srcp + clamp(y + 1, h) * w;

    auto conv = above[clamp(x - 1, w)] + above[x] * 2 + above[clamp(x + 1, w)] +
                current[clamp(x - 1, w)] * 2 + current[x] * 4 + current[clamp(x + 1, w)] * 2 +
                below[clamp(x - 1, w)] + below[x] * 2 + below[clamp(x + 1, w)];
    return conv / 16;
};

for (auto y = 0; y < h; y++)
    for (auto x = 0; x < w; x++)
        (dstp + y * w)[x] = gauss(y, x);

Run

When you mention a memory bottleneck, what do you mean? Where should I look?

Thanks a lot


Re: UncheckedArray conversion

2020-04-07 Thread mratsim
Type conversion from uint8 to uint32 (or uint64) is technically free, because 
registers have a size of 32 or 64 bits.

If anything it's less costly because, at a low level, loading an 8-bit value 
requires movzx (mov to register and zero-extend the byte), which has a slightly 
higher latency than a plain mov for uint32 and uint64.
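
As a tiny Nim illustration (the codegen notes in the comments describe typical optimized x86-64 output, not something the language guarantees):


var bytes = newSeq[uint8](16)
var words = newSeq[int32](16)
let b = cast[ptr UncheckedArray[uint8]](bytes[0].addr)
let w = cast[ptr UncheckedArray[int32]](words[0].addr)
# The uint8 load below typically becomes a single movzx (load + zero-extend),
# the int32 load a plain mov: the `.int32` widening costs no extra instruction.
let x = b[0].int32
let y = w[0]
echo x + y

Run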

The issue in your convolutions is memory bottleneck not compute.


Re: UncheckedArray conversion

2020-04-07 Thread mantielero
This is part of a convolution filter. The result is in the same order of 
magnitude (uint8):


let value:int32 = r0[col0].int32 + r0[col1].int32 * 2 + r0[col2].int32 +
                  r1[col0].int32 * 2 + r1[col1].int32 * 4 + r1[col2].int32 * 2 +
                  r2[col0].int32 + r2[col1].int32 * 2 + r2[col2].int32
w1[col1] = (value / 16.0).uint8

Run

I am looking at how to improve this performance-wise without getting into SIMD 
stuff (which I have never used, by the way). I think that all those type 
conversions are killing the performance I should be achieving. 


Re: UncheckedArray conversion

2020-04-07 Thread lucian
Out of curiosity, isn't this prone to integer overflow, or are you in fact 
using value as an uint32?


Re: UncheckedArray conversion

2020-04-06 Thread Hlaaftana
# rename to whatever, col can also be typed
template get(col): int32 = r0[col].int32

let value:int32 = get(col0) + get(col1) * 2 + ...

Run


UncheckedArray conversion

2020-04-06 Thread mantielero
I have something like the following:

> let value:int32 = r0[col0].int32 + r0[col1].int32 * 2 + ...

where r0 has type `ptr UncheckedArray[uint8]`.

How can I avoid adding `.int32` each time I use one item from the UncheckedArray?
How can I avoid adding .int32 each time I use one item from the unckeckedarray?