Re: [brlcad-devel] bool_eval()

2017-08-14 Thread Vasco Alexandre da Silva Costa
Ok, I ran those tests on a machine with an AMD FX 8350 CPU and an NVIDIA
GTX TITAN GPU.

Results attached. It's up to 2x faster on the GPU than the CPU but the
speedup is less interesting in goliath.g, because of the depth complexity
of the scene I think.

We basically need to do a breadth-first rather than a depth-first ray tree
traversal like the ANSI C code in trunk/ does if we want to have improved
render performance in these high depth complexity scenes.

On Fri, Aug 11, 2017 at 11:33 PM, Marco Domingues <
marcodomingue...@gmail.com> wrote:

> Hello,
>
> Today I gathered some new times to compare the ANSI C boolean evaluation
> with the OpenCL implementation, and also to identify possible new
> optimizations to make on the code. Now using release builds and increasing
> the ray complexity (-s1024).
>
> I couldn’t figure out a way to track the time spent in each kernel yet,
> but I will keep looking for a way to do this. In the attached document,
> there are some comparisons between the current ANSI C code, and some
> variations of it (tracing the ray till the end, and without performing
> boolean operations), and also with the OpenCL implementations when running
> the code over the AMD/Intel OpenCL SDK on the CPU.
>
> I’ve also added side by side image comparisons in the document to show the
> current state of the OpenCL boolean implementation. There are still some
> shading differences, but the geometry seems correct. You may also notice
> missing primitives; these are primitives that are not supported in OpenCL
> yet, i.e. the pipes in goliath.g.
>
> In the document you can see that the current OpenCL implementation is
> slower than the ANSI C code, when running on the same hardware. But to be
> fair, the OpenCL version calculates intersections for the entire ray, and
> some major changes to the rendering loop had to be done to replicate the
> current behaviour of the ANSI C code, where ray intersections and boolean
> evaluations are done in a partial fashion.
>
> Finally, I’ve also committed the changes to the bool_eval() function that
> follows the behaviour of the current bool_eval() function in the trunk.
> Here you can see a comparison between the previous code (bool_eval() using
> the RPN tree) and the new tree representation:
> https://brlcad.org/wiki/User:Marco-domingues/GSoC17/Log#10_August
>
> Cheers!
> Marco
>
>
>
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> BRL-CAD Developer mailing list
> brlcad-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/brlcad-devel
>
>


-- 
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal
=== hardware ===
$ cat /proc/cpuinfo 
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 21
model   : 2
model name  : AMD FX(tm)-8350 Eight-Core Processor
stepping: 0
microcode   : 0x600084f
cpu MHz : 4013.421
cache size  : 2048 KB
physical id : 0
siblings: 8
core id : 0
cpu cores   : 4
apicid  : 16
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb 
rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni 
pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c 
lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch 
osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core 
perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save 
tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
bugs: fxsave_leak sysret_ss_attrs
bogomips: 8026.84
TLB size: 1536 4K pages
clflush size: 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro


Re: [brlcad-devel] bool_eval()

2017-08-10 Thread Vasco Alexandre da Silva Costa
On Fri, Aug 11, 2017 at 12:40 AM, Vasco Alexandre da Silva Costa <
vasco.co...@gmail.com> wrote:
...

> The other issue that should be causing slowdowns is that the ANSI C code
> in rt_shootray() intersects one solid at a time and calls rt_boolfinal() on
> the partial results before continuing. In scenes with high depth complexity
> this means it will terminate earlier. So if you want an apples to apples
> comparison you'll have to force the ANSI C code to also process all the
> hits.
>

src/librt/shoot.c:rt_shootray():
"If a_onehit == 0 and a_ray_length <= 0, then the ray is traced to
+infinity."

So we'll want to test trunk/ with default settings and with those settings
above to have a more apples to apples comparison vs the current OpenCL
algorithm which basically does that.

-- 
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal


Re: [brlcad-devel] bool_eval()

2017-08-10 Thread Vasco Alexandre da Silva Costa
On Fri, Aug 11, 2017 at 12:40 AM, Vasco Alexandre da Silva Costa <
vasco.co...@gmail.com> wrote:

> PS: If you really wanna know what I think is happening, I suspect that in
> complex scenes, where the bitvectors are really sparse, we would have been
> better off using lists like the ANSI C code does instead. But the thing is,
> I don't like the fact that the ANSI C code uses bu_ptbl_ins_unique()
> either. I suspect we are better off emitting all the entries in a list and
> removing the duplicates later. The other issue that should be causing
> slowdowns is that the ANSI C code in rt_shootray() intersects one solid at
> a time and calls rt_boolfinal() on the partial results before continuing.
> In scenes with high depth complexity this means it will terminate earlier.
> So if you want an apples to apples comparison you'll have to force the ANSI
> C code to also process all the hits.
>
> To solve the first problem means we'll need more compute kernels which we
> don't have currently (e.g. prefix sum, reduce, radix sort) and to bound the
> memory used. To solve the second problem means we need to rethink how to
> process the primitives and change the rendering loop.
>

PPS: Needless to say this is outside the scope of the current GSoC. However
it would be nice if we could identify the bottlenecks so this can be solved
later on. One way would be to measure the size of these intermediate
lists, with and without duplicates,  the size of the bitvectors and the
sparsity of the same bitvectors.
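
As a rough illustration of the sparsity measurement suggested above, the following C sketch counts the set bits of a bitvector and reports the fraction that is set. The 32-bit word layout and the function names are assumptions made for this example, not the layout used by the OpenCL branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one 32-bit word (portable, no compiler builtins). */
static unsigned popcount32(uint32_t w)
{
    unsigned n = 0;
    while (w) {
        w &= w - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Number of set bits in a bitvector stored as nwords 32-bit words
 * (hypothetical layout, chosen only for this sketch). */
static size_t bitvector_count(const uint32_t *words, size_t nwords)
{
    size_t total = 0;
    size_t i;
    for (i = 0; i < nwords; i++)
        total += popcount32(words[i]);
    return total;
}

/* Fraction of set bits in [0, 1]; a low value means a sparse bitvector. */
static double bitvector_density(const uint32_t *words, size_t nwords)
{
    assert(nwords > 0);
    return (double)bitvector_count(words, nwords) / (double)(nwords * 32);
}
```

Logging this density per ray next to the length of the equivalent list (which is simply the set-bit count) would show how much memory and work the bitvector spends on empty entries.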

The current bitvector code is a lot simpler than code that uses lists
because it has much simpler and more predictable memory usage and
compute algorithms. Whether it's a win in performance terms, though,
depends on a lot of factors related to the type of scenes being processed.
If we used lists with sorts, those could also end up being quite slow,
while something that uses a linear time list insertion function, like the
current code, should not be exactly fast either.

-- 
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal


Re: [brlcad-devel] bool_eval()

2017-08-10 Thread Vasco Alexandre da Silva Costa
On Fri, Aug 11, 2017 at 12:20 AM, Vasco Alexandre da Silva Costa <
vasco.co...@gmail.com> wrote:

> On Thu, Aug 10, 2017 at 11:36 PM, Marco Domingues <
> marcodomingue...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks for reviewing my code and making the adjustments, Vasco! I’ve
>> integrated the changes in my patch.
>>
>> I’ve finished the port of the new bool_eval() function to OpenCL, and
>> despite the improved performance, it wasn’t enough to outperform the ANSI
>> C code with the Release build.
>>
>> For the havoc scene, I got 1.56sec now vs 2.10sec before, when running
>> the OpenCL code on my GPU. (command ‘rt -z1 -l5 -s1024’). For reference,
>> the same scene renders in 0.63sec with the ANSI C code currently in the
>> trunk.
>>
>> Despite that, when I ran the OpenCL code in my CPU, I got 0.64sec now vs
>> 2.79sec before. (command ‘rt -z1 -l5 -s1024’).
>>
>
> So let me get this straight. The OpenCL backend is slower in your GPU than
> the CPU based trunk/ ANSI C backend. That's not totally unexpected. You
> have a consumer GPU with nerfed DP FP.
>
> What I want you to do tomorrow is to compare the trunk/ ANSI C backend
> with the OpenCL backend over your CPU with the AMD and Intel OpenCL
> implementations. I also want you to time the results with the single-hit
> mode if you have the time for that.
>
> Why are the 'rt -s1024' times in your July 27 post different from the
> times in your August 7 post?
>
>
>> I am a little intrigued with this, because smaller scenes like the
>> operators.g are clearly faster when using the GPU, (0.06 sec gpu vs 0.16sec
>> cpu). Any explanation?
>>
>
> Those scenes are fillrate limited with little depth or scene complexity.
>
>
>> Another thing that caught my attention was how close the lines RTFM and
>> wallclock from the ‘rt’ output are when running the OpenCL code on the CPU,
>> compared with the same lines from running the OpenCL code on the GPU (i.e.
>> 0.60 and 0.65 sec on the CPU vs 0.32 and 1.65 sec on the GPU).
>> Couldn’t the big difference on the GPU side be caused from transfers
>> between CPU-GPU and not by performing ray-intersections, boolean evaluation
>> and shading operations? Is there a way to investigate this?
>>
>
> The best way is to use a profiler like the ones I mentioned before.
> Alternatively one can time the transfers vs the computations by timing the
> appropriate CL calls making sure to clFinish() the queue before you measure
> the time.
>
>
>> Tomorrow I will update the previous tables that I shared before on my
>> document, now using Release builds. And will also include side by side
>> image comparisons between the ANSI C and OpenCL results, for each scene.
>>
>
> Ok. Make sure to try with both the AMD and Intel OpenCL SDKs over the CPU.
> I'm not interested in your GPU results right now as it would only
> complicate the comparison.
>

PS: If you really wanna know what I think is happening, I suspect that in
complex scenes, where the bitvectors are really sparse, we would have been
better off using lists like the ANSI C code does instead. But the thing is,
I don't like the fact that the ANSI C code uses bu_ptbl_ins_unique()
either. I suspect we are better off emitting all the entries in a list and
removing the duplicates later. The other issue that should be causing
slowdowns is that the ANSI C code in rt_shootray() intersects one solid at
a time and calls rt_boolfinal() on the partial results before continuing.
In scenes with high depth complexity this means it will terminate earlier.
So if you want an apples to apples comparison you'll have to force the ANSI
C code to also process all the hits.

To solve the first problem means we'll need more compute kernels which we
don't have currently (e.g. prefix sum, reduce, radix sort) and to bound the
memory used. To solve the second problem means we need to rethink how to
process the primitives and change the rendering loop.
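
The "emit everything, remove duplicates later" idea can be sketched sequentially in C as below. This is only an illustration: it uses qsort() where a GPU version would need the radix sort and prefix sum kernels mentioned above, and the integer ids and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Three-way comparison of two ints for qsort(). */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sort the emitted ids in place, then compact out adjacent duplicates.
 * Returns the number of unique ids left at the front of the array. */
static size_t sort_and_dedupe(int *ids, size_t n)
{
    size_t out = 0;
    size_t i;
    if (n == 0)
        return 0;
    qsort(ids, n, sizeof ids[0], cmp_int);
    for (i = 1; i < n; i++) {
        if (ids[i] != ids[out])
            ids[++out] = ids[i]; /* keep the first id of each run */
    }
    return out + 1;
}
```

Appending is constant time per entry and the cleanup is one sort plus one linear pass, whereas an insert-if-absent list pays a linear scan on every insertion.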

-- 
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal


Re: [brlcad-devel] bool_eval()

2017-08-10 Thread Marco Domingues

> On 11 Aug 2017, at 00:20, Vasco Alexandre da Silva Costa wrote:
> 
> On Thu, Aug 10, 2017 at 11:36 PM, Marco Domingues wrote:
> Hi,
> 
> Thanks for reviewing my code and making the adjustments, Vasco! I’ve 
> integrated the changes in my patch.
> 
> I’ve finished the port of the new bool_eval() function to OpenCL, and 
> despite the improved performance, it wasn’t enough to outperform the ANSI C 
> code with the Release build.
> 
> For the havoc scene, I got 1.56sec now vs 2.10sec before, when running the 
> OpenCL code on my GPU. (command ‘rt -z1 -l5 -s1024’). For reference, the same 
> scene renders in 0.63sec with the ANSI C code currently in the trunk.
> 
> Despite that, when I ran the OpenCL code in my CPU, I got 0.64sec now vs 
> 2.79sec before. (command ‘rt -z1 -l5 -s1024’).
> 
> So let me get this straight. The OpenCL backend is slower in your GPU than 
> the CPU based trunk/ ANSI C backend. That's not totally unexpected. You have 
> a consumer GPU with nerfed DP FP.

Yes that is right.

> 
> What I want you to do tomorrow is to compare the trunk/ ANSI C backend with 
> the OpenCL backend over your CPU with the AMD and Intel OpenCL 
> implementations. I also want you to time the results with the single-hit mode 
> if you have the time for that.
> 
> Why are the 'rt -s1024' times in your July 27 post different from the times 
> in your August 7 post?

At the time I was running ‘rt -s1024’ over a Debug build, and also running 
it with the ANSI C code from the OpenCL branch, which used the RPN tree 
representation. That is considerably slower than running the code with a 
Release build and with the bool_eval() function currently in the trunk.

>  
> I am a little intrigued with this, because smaller scenes like the 
> operators.g are clearly faster when using the GPU, (0.06 sec gpu vs 0.16sec 
> cpu). Any explanation?
> 
> Those scenes are fillrate limited with little depth or scene complexity.
>  
> Another thing that caught my attention was how close the lines RTFM and 
> wallclock from the ‘rt’ output are when running the OpenCL code on the CPU, 
> compared with the same lines from running the OpenCL code on the GPU (i.e. 
> 0.60 and 0.65 sec on the CPU vs 0.32 and 1.65 sec on the GPU).
> Couldn’t the big difference on the GPU side be caused from transfers between 
> CPU-GPU and not by performing ray-intersections, boolean evaluation and 
> shading operations? Is there a way to investigate this?
> 
> The best way is to use a profiler like the ones I mentioned before. 
> Alternatively one can time the transfers vs the computations by timing the 
> appropriate CL calls making sure to clFinish() the queue before you measure 
> the time.
>  
> Tomorrow I will update the previous tables that I shared before on my 
> document, now using Release builds. And will also include side by side image 
> comparisons between the ANSI C and OpenCL results, for each scene.
> 
> Ok. Make sure to try with both the AMD and Intel OpenCL SDKs over the CPU. 
> I'm not interested in your GPU results right now as it would only complicate 
> the comparison.

Okay, from now on I will run the OpenCL code on my CPU! Most of the previous 
times in my tables were using the GPU. Sorry about that.

Cheers!
Marco

> 
> Regards,
> 
> -- 
> Vasco Alexandre da Silva Costa
> PhD in Computer Engineering (Computer Graphics)
> Instituto Superior Técnico/University of Lisbon, Portugal


Re: [brlcad-devel] bool_eval()

2017-08-10 Thread Vasco Alexandre da Silva Costa
On Thu, Aug 10, 2017 at 11:36 PM, Marco Domingues <
marcodomingue...@gmail.com> wrote:

> Hi,
>
> Thanks for reviewing my code and making the adjustments, Vasco! I’ve
> integrated the changes in my patch.
>
> I’ve finished the port of the new bool_eval() function to OpenCL, and
> despite the improved performance, it wasn’t enough to outperform the ANSI
> C code with the Release build.
>
> For the havoc scene, I got 1.56sec now vs 2.10sec before, when running the
> OpenCL code on my GPU. (command ‘rt -z1 -l5 -s1024’). For reference, the
> same scene renders in 0.63sec with the ANSI C code currently in the trunk.
>
> Despite that, when I ran the OpenCL code in my CPU, I got 0.64sec now vs
> 2.79sec before. (command ‘rt -z1 -l5 -s1024’).
>

So let me get this straight. The OpenCL backend is slower in your GPU than
the CPU based trunk/ ANSI C backend. That's not totally unexpected. You
have a consumer GPU with nerfed DP FP.

What I want you to do tomorrow is to compare the trunk/ ANSI C backend with
the OpenCL backend over your CPU with the AMD and Intel OpenCL
implementations. I also want you to time the results with the single-hit
mode if you have the time for that.

Why are the 'rt -s1024' times in your July 27 post different from the times
in your August 7 post?


> I am a little intrigued with this, because smaller scenes like the
> operators.g are clearly faster when using the GPU, (0.06 sec gpu vs 0.16sec
> cpu). Any explanation?
>

Those scenes are fillrate limited with little depth or scene complexity.


> Another thing that caught my attention was how close the lines RTFM and
> wallclock from the ‘rt’ output are when running the OpenCL code on the CPU,
> compared with the same lines from running the OpenCL code on the GPU (i.e.
> 0.60 and 0.65 sec on the CPU vs 0.32 and 1.65 sec on the GPU).
> Couldn’t the big difference on the GPU side be caused from transfers
> between CPU-GPU and not by performing ray-intersections, boolean evaluation
> and shading operations? Is there a way to investigate this?
>

The best way is to use a profiler like the ones I mentioned before.
Alternatively one can time the transfers vs the computations by timing the
appropriate CL calls making sure to clFinish() the queue before you measure
the time.
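
A minimal C sketch of that timing pattern follows. To keep it compilable without an OpenCL SDK, the enqueue and wait steps are passed in as callbacks; in real code the wait callback would call clFinish() on the command queue and the enqueue callback would issue the transfer or kernel being measured. The names time_stage, stage_fn and the demo_* stand-ins are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Hypothetical callback type: 'enqueue' submits work, 'finish' blocks
 * until all submitted work is done (clFinish() in an OpenCL program). */
typedef void (*stage_fn)(void *ctx);

/* Wall-clock time in seconds from a monotonic clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Time one stage (a transfer or a kernel launch) in isolation. */
static double time_stage(stage_fn enqueue, stage_fn finish, void *ctx)
{
    double start, end;
    assert(enqueue && finish);
    finish(ctx);            /* drain earlier work so it is not billed here */
    start = now_seconds();
    enqueue(ctx);           /* asynchronous: returns before the work is done */
    finish(ctx);            /* wait for completion before reading the clock */
    end = now_seconds();
    return end - start;
}

/* Stand-in stage for demonstration: counts how often each callback ran. */
struct demo_ctx { int enqueued; int finished; };
static void demo_enqueue(void *ctx) { ((struct demo_ctx *)ctx)->enqueued++; }
static void demo_finish(void *ctx)  { ((struct demo_ctx *)ctx)->finished++; }
```

Timing the buffer writes, the kernels and the buffer reads as separate stages would show whether the gap between the RTFM and wallclock lines on the GPU comes from the transfers.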


> Tomorrow I will update the previous tables that I shared before on my
> document, now using Release builds. And will also include side by side
> image comparisons between the ANSI C and OpenCL results, for each scene.
>

Ok. Make sure to try with both the AMD and Intel OpenCL SDKs over the CPU.
I'm not interested in your GPU results right now as it would only
complicate the comparison.

Regards,

-- 
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal


Re: [brlcad-devel] bool_eval()

2017-08-10 Thread Marco Domingues
Hi,

Thanks for reviewing my code and making the adjustments, Vasco! I’ve integrated 
the changes in my patch.

I’ve finished the port of the new bool_eval() function to OpenCL, and despite 
the improved performance, it wasn’t enough to outperform the ANSI C code with 
the Release build. 

For the havoc scene, I got 1.56sec now vs 2.10sec before, when running the 
OpenCL code on my GPU. (command ‘rt -z1 -l5 -s1024’). For reference, the same 
scene renders in 0.63sec with the ANSI C code currently in the trunk.

Despite that, when I ran the OpenCL code in my CPU, I got 0.64sec now vs 
2.79sec before. (command ‘rt -z1 -l5 -s1024’).

I am a little intrigued with this, because smaller scenes like the operators.g 
are clearly faster when using the GPU, (0.06 sec gpu vs 0.16sec cpu). Any 
explanation?

Another thing that caught my attention was how close the lines RTFM and wallclock 
from the ‘rt’ output are when running the OpenCL code on the CPU, compared with 
the same lines from running the OpenCL code on the GPU (i.e. 0.60 and 0.65 sec 
on the CPU vs 0.32 and 1.65 sec on the GPU).
Couldn’t the big difference on the GPU side be caused from transfers between 
CPU-GPU and not by performing ray-intersections, boolean evaluation and shading 
operations? Is there a way to investigate this?

Tomorrow I will update the previous tables that I shared before on my document, 
now using Release builds. And will also include side by side image comparisons 
between the ANSI C and OpenCL results, for each scene. 

Regards,
Marco



new_bool_eval.patch
Description: Binary data





Re: [brlcad-devel] bool_eval()

2017-08-10 Thread Vasco Alexandre da Silva Costa
On Thu, Aug 10, 2017 at 5:15 PM, Christopher Sean Morrison 
wrote:

> ...
>
> Related to these specific changes, these numbers pass a sanity check.  If
> you ran a profile (e.g., perf), you’d see that only a fraction of time is
> spent in boolean code (10-30% of time, depending on the model).  The other
> 2/3rds are roughly in traversal and intersection code.  It would be
> interesting to have isolated timings of just the boolean evaluation, but
> we’d still want to run the benchmark to see what the big picture impact is
> (and verify we’re still getting valid results).
>

Yes, this is something we need to do. First we made a relatively simple
prototype that demonstrated the functionality in OpenCL. Then we started
changing it so that its worst case complexity is no worse than that of the
code in trunk/. But we must do proper code profiling eventually. The time
spent on each stage is important, as are the processor utilization rate and
memory footprint.

We need to have the profile data to guide future work on optimizing the
code. I wouldn't be surprised if when Marco is done with these changes the
OpenCL backend will be 2-3x faster than the existing code. But I'm not sure
how much faster we can go without major algorithmic changes. Does the
current code use one thread per physical processor or does it use one
thread per virtual processor (i.e. Hyperthreading)? If it does use SMT
perhaps the 8x I thought were possible are actually impossible to achieve
and a ~4x performance increase is the best we can hope for with the current
solution.

I have some ideas on how to reduce thread divergence in the OpenCL pipeline
to further improve performance but those go beyond the scope and time frame
of this GSoC. If the 2-3x gains in performance over the same CPU hardware
do materialize though that is certainly nothing to sneeze at.

-- 
Vasco Alexandre da Silva Costa
PhD in Computer Engineering (Computer Graphics)
Instituto Superior Técnico/University of Lisbon, Portugal


Re: [brlcad-devel] bool_eval()

2017-08-10 Thread Christopher Sean Morrison

> Yes, at the time I ran three different benchmarks over the non-opencl version 
> because I was trying to compare three different versions of the ANSI C code. 
> These were: the ANSI C code currently in the trunk, the code in the trunk with 
> the patch #473 applied (patch from Vasco that removed the gotos from 
> bool_eval()), and the ANSI C code in the opencl branch, which uses a boolean 
> tree in postfix notation. But perhaps this is not the best approach to 
> compare the performance. Well, at least I couldn’t really understand which 
> number in the benchmark log I should look at to compare. Is it the “Total 
> testing time elapsed” or the “VGR” metric? The “VGR” number didn’t make much 
> sense, because the version that got higher VGR is the code in which the ‘rt’ 
> command takes more time to run.

Ah, that explains things.  All three results were within volatility tolerance, 
effectively identical timings (within 1% deviation).  More importantly, they 
are “RIGHT” aka valid results, which speaks well to the refactoring cleanup.  
Performance may be in the noise, but correctness means it could be committed.

The benchmark is actually just right for understanding the real impact of these 
types of changes.  It’s a validation and timing framework.  As for the timings, 
just pay attention to the summary VGR metric (e.g., 32189).  All the numbers 
tell different things, but that’s a linear one relevant to tracking performance.

If you made a change and VGR goes from 32k to 35k, that means you sped things 
up about 10%.  It’s a simple linear metric.
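
Since the metric is linear, the relative change is a one-line calculation. The C sketch below (function name chosen for this example) works it out for the figures above: going from 32000 to 35000 is (35000 - 32000) / 32000 = 9.375%, i.e. roughly the 10% quoted.

```c
#include <assert.h>

/* Relative change between two VGR scores, in percent.
 * Positive means the second run was faster. */
static double vgr_change_percent(double vgr_before, double vgr_after)
{
    assert(vgr_before > 0.0);
    return (vgr_after - vgr_before) / vgr_before * 100.0;
}
```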

The “total testing time” is irrelevant for what you’re doing.  That is 
measuring convergence rate, which is intrinsically volatile.  All that is 
telling is how long it took to see stable performance numbers to within a +-1% 
confidence interval (DEVIATION=1), which is going to take a little while given 
the TIMEFRAME=60 is also telling it to make sure there’s at least 1 minute of 
ray tracing work per frame.  In general, it’s going to need to run 2-5 frames 
before the performance is stable. 

Related to these specific changes, these numbers pass a sanity check.  If you 
ran a profile (e.g., perf), you’d see that only a fraction of time is spent in 
boolean code (10-30% of time, depending on the model).  The other 2/3rds are 
roughly in traversal and intersection code.  It would be interesting to have 
isolated timings of just the boolean evaluation, but we’d still want to run the 
benchmark to see what the big picture impact is (and verify we’re still getting 
valid results).
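
To see why isolated boolean timings and the overall benchmark can tell different stories, Amdahl's law gives the whole-run speedup when only one stage gets faster. The C sketch below applies it; the function name is made up for this example, and the fractions plugged in are simply the 10-30% range mentioned above.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a stage that takes 'bool_fraction'
 * of the total run time is made 'bool_speedup' times faster. */
static double overall_speedup(double bool_fraction, double bool_speedup)
{
    assert(bool_fraction >= 0.0 && bool_fraction <= 1.0);
    assert(bool_speedup > 0.0);
    return 1.0 / ((1.0 - bool_fraction) + bool_fraction / bool_speedup);
}
```

With a 30% share, doubling the speed of the boolean code gives 1 / (0.7 + 0.15), about 1.18x overall, and even an infinitely fast boolean stage could not exceed 1 / 0.7, about 1.43x. With a 10% share the ceiling is about 1.11x.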

Cheers!
Sean



Re: [brlcad-devel] bool_eval()

2017-08-10 Thread Marco Domingues
Hi,

> On 10 Aug 2017, at 10:13, Christopher Sean Morrison  wrote:
> 
> Marco,
> 
> I saw your benchmark logs and it looks like you ran the non-opencl version 
> three times, which is almost certainly why the performance rankings were 
> nearly identical.  You’ll want to run “TIMEFRAME=60 DEVIATION=1 benchmark 
> run” for the non-ocl path and “TIMEFRAME=60 DEVIATION=1 benchmark run -z1” to 
> get the ocl path.  It should show that additional -z1 rt option in the 
> summary log.
> 

Yes, at the time I ran three different benchmarks over the non-opencl version 
because I was trying to compare three different versions of the ANSI C code. 
These were: the ANSI C code currently in the trunk, the code in the trunk with 
the patch #473 applied (patch from Vasco that removed the gotos from 
bool_eval()), and the ANSI C code in the opencl branch, which uses a boolean 
tree in postfix notation. But perhaps this is not the best approach to compare 
the performance. Well, at least I couldn’t really understand which number in 
the benchmark log I should look at to compare. Is it the “Total testing time 
elapsed” or the “VGR” metric? The “VGR” number didn’t make much sense, because 
the version that got higher VGR is the code in which the ‘rt’ command takes 
more time to run. 
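
For readers unfamiliar with the postfix (RPN) form of a boolean tree, here is a minimal C sketch of how such an expression can be evaluated with a small stack. It is a generic illustration under assumed types (rpn_node, rpn_op) and an assumed "inside" array; it is not the data layout or the bool_eval() code of the opencl branch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds for a CSG expression in postfix order. */
enum rpn_op { RPN_SOLID, RPN_UNION, RPN_INTERSECT, RPN_SUBTRACT };

struct rpn_node {
    enum rpn_op op;
    unsigned solid;   /* index of the solid; used only when op == RPN_SOLID */
};

#define RPN_STACK_MAX 64

/* inside[i] is nonzero when the ray segment lies inside solid i.
 * Returns 1 if the segment is inside the region, 0 if it is not,
 * and -1 if the expression is malformed or too deep. */
static int rpn_bool_eval(const struct rpn_node *expr, size_t n,
                         const unsigned char *inside)
{
    int stack[RPN_STACK_MAX];
    size_t sp = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (expr[i].op == RPN_SOLID) {
            if (sp == RPN_STACK_MAX)
                return -1;
            stack[sp++] = inside[expr[i].solid] != 0;
        } else {
            int rhs, lhs;
            if (sp < 2)
                return -1;
            rhs = stack[--sp];   /* operands come off in reverse order */
            lhs = stack[--sp];
            switch (expr[i].op) {
            case RPN_UNION:     stack[sp++] = lhs || rhs;  break;
            case RPN_INTERSECT: stack[sp++] = lhs && rhs;  break;
            case RPN_SUBTRACT:  stack[sp++] = lhs && !rhs; break;
            default:            return -1;
            }
        }
    }
    return sp == 1 ? stack[0] : -1;
}
```

For example, the region (A - B) u C is written in postfix as "A B - C u". Note that this form always visits every node, so it cannot skip a subtree whose result is already decided.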

Cheers!
Marco

> Vasco,
> 
> Awesome TrueReg realization… I dug through the history and that surprisingly 
> goes all the way back, nearly to the beginning.  Half-guessing as to the 
> intention but there’s a hint that it was going to be used to hand back more 
> information to callers for XOR and overlap cases where there is ambiguity.  
> Instead, it looks like a decision was made to simply rewrite the nodes and 
> handle them in other ways. Running a performance impact check now.
> 
> Cheers!
> Sean
> 
> 


Re: [brlcad-devel] bool_eval()

2017-08-10 Thread Christopher Sean Morrison
Marco,

I saw your benchmark logs and it looks like you ran the non-opencl version 
three times, which is almost certainly why the performance rankings were nearly 
identical.  You’ll want to run “TIMEFRAME=60 DEVIATION=1 benchmark run” for the 
non-ocl path and “TIMEFRAME=60 DEVIATION=1 benchmark run -z1” to get the ocl 
path.  It should show that additional -z1 rt option in the summary log.

Vasco,

Awesome TrueReg realization… I dug through the history and that surprisingly 
goes all the way back, nearly to the beginning.  Half-guessing as to the 
intention but there’s a hint that it was going to be used to hand back more 
information to callers for XOR and overlap cases where there is ambiguity.  
Instead, it looks like a decision was made to simply rewrite the nodes and 
handle them in other ways. Running a performance impact check now.

Cheers!
Sean

