Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-23 Thread Christopher Sean Morrison

On Mar 21, 2015, at 12:55 PM, Vasco Alexandre da Silva Costa 
vasco.co...@gmail.com wrote:

 I looked some more at the code and yeah the weave booleans step is branch 
 heavy and has gotos in it. So this is most likely a bad candidate for GPU 
 acceleration since a GPU will have like a quarter of the clockspeed and worse 
 branch prediction hardware than the CPU. We want something that is compute 
 heavy and has low thread divergence.

I wouldn’t discredit it because of how it’s currently implemented.  Similar to 
the various research demonstrations years ago that showed how to encode a 
(branch-heavy) kd-tree efficiently on the GPU, it should be possible to do the 
same with Boolean weaving.  The algorithm is very similar to depth buffering 
and depth peeling.  Like I said, you could almost certainly land a Siggraph or 
JCGT paper on this particular topic.  It’s an under-explored area of research.
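
For anyone unfamiliar with the idea, here is a minimal, hedged sketch of segment
weaving expressed as a sweep over sorted hit distances.  This is the generic
textbook formulation, not the tuned goto-heavy code in librt/bool.c, and every
name in it is hypothetical:

    /* Generic CSG "weaving": leaf segments come from per-primitive shot
     * routines; final partitions are the spans whose midpoints satisfy the
     * boolean tree.  All names hypothetical. */
    #include <stdlib.h>

    enum op { OP_LEAF, OP_UNION, OP_INTERSECT, OP_SUBTRACT };

    struct node {
        enum op op;
        struct node *left, *right;   /* children for non-leaf nodes */
        const double *in, *out;      /* sorted entry/exit distances for a leaf */
        size_t nseg;
    };

    /* Is distance t along the ray inside the solid described by the tree? */
    static int inside(const struct node *n, double t)
    {
        if (n->op == OP_LEAF) {
            for (size_t i = 0; i < n->nseg; i++)
                if (t >= n->in[i] && t <= n->out[i])
                    return 1;
            return 0;
        }
        int l = inside(n->left, t), r = inside(n->right, t);
        switch (n->op) {
            case OP_UNION:     return l || r;
            case OP_INTERSECT: return l && r;
            case OP_SUBTRACT:  return l && !r;
            default:           return 0;
        }
    }

    static int cmp_double(const void *a, const void *b)
    {
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
    }

    /* Weave: sort all entry/exit distances, then keep each span whose midpoint
     * is inside the tree.  Returns the number of partitions written. */
    size_t weave(const struct node *tree, double *events, size_t nev,
                 double *part_in, double *part_out)
    {
        size_t n = 0;
        qsort(events, nev, sizeof(double), cmp_double);
        for (size_t i = 0; i + 1 < nev; i++) {
            double mid = 0.5 * (events[i] + events[i + 1]);
            if (inside(tree, mid)) {
                part_in[n] = events[i];
                part_out[n] = events[i + 1];
                n++;
            }
        }
        return n;
    }

Formulated this way, the per-ray work is a sort plus a data-parallel sweep, which
is why the depth-peeling analogy holds and why a batched GPU version seems
plausible (adjacent partitions would still need to be merged, which this sketch
omits).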

 So I think it is best to focus on ray dispatch+traversal at this time. 
 Generate rays in bundles, traverse some sort of acceleration structure, 
 gather hit candidate ids, and compute hit intersections on CPU. Not having 
 the primitive tests on the GPU will reduce the performance advantage since we 
 will spend time transmitting data back and forth though.

This also sounds reasonable to me.  This will almost certainly involve more 
modifications to our front-end application logic (and obviously replaced 
back-end traversal logic).  Considerably more code that you’ll have to become 
familiar with compared with Boolean weaving, but not nearly as complicated or 
validation-sensitive.

In the end, it's probably a tradeoff wash: less code but a harder algorithm vs. more 
code but an easier algorithm.  Both are highly valuable, so propose whichever piques your interest!
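
To make the candidate-gathering split concrete, an OpenCL kernel along these
lines is roughly what I'd picture: a sketch only, with a brute-force AABB pass
standing in for a real grid/BVH walk, and all names hypothetical:

    typedef struct { float3 pt, dir_inv; } cl_ray;   /* dir_inv = 1/dir, precomputed */
    typedef struct { float3 lo, hi; } cl_aabb;

    inline int hit_aabb(cl_ray r, cl_aabb b)
    {
        /* slab test against the primitive's bounding box */
        float3 t0 = (b.lo - r.pt) * r.dir_inv;
        float3 t1 = (b.hi - r.pt) * r.dir_inv;
        float3 tmin = fmin(t0, t1), tmax = fmax(t0, t1);
        float tnear = fmax(fmax(tmin.x, tmin.y), tmin.z);
        float tfar  = fmin(fmin(tmax.x, tmax.y), tmax.z);
        return tnear <= tfar && tfar >= 0.0f;
    }

    __kernel void gather_candidates(__global const cl_ray *rays,
                                    __global const cl_aabb *boxes,
                                    const uint nprims,
                                    const uint max_cand,
                                    __global uint *cand_ids,   /* ray-major lists */
                                    __global uint *cand_count)
    {
        uint i = get_global_id(0);     /* one work-item per ray in the bundle */
        cl_ray r = rays[i];
        uint n = 0;
        for (uint p = 0; p < nprims && n < max_cand; p++)
            if (hit_aabb(r, boxes[p]))
                cand_ids[i * max_cand + n++] = p;
        cand_count[i] = n;             /* CPU then runs the exact shot() routines */
    }

The round trip for the candidate lists is exactly the transfer cost mentioned
above, so whether this wins in practice would need to be measured.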

 The shading pass also seems like low hanging fruit. We need to send the image 
 data to the GPU to display anyway so why not let it do some of the work. But 
 this depends on how the code is organized and output is displayed.

For some shaders, this is certainly possible, but I’m not married to our shader 
system.  More inclined to leverage someone else’s rendering infrastructure 
(e.g., OSL and/or Appleseed) for anything other than simple phong/flat shading.

 The main issue is that we would like that the 
 shootOneRay->traverseScene->evalHits->weaveBooleans->colorizePixel steps 
 would be done in clearly separated stages for all tested rays, in order to 
 reduce the number of CL kernel calls and maximize coherence, but from what I 
 understand everything currently happens in one giant megakernel. So perhaps a 
 lot of effort should be spent changing this way of things before doing any CL 
 coding at all.

Almost certainly.  There’s a front-end layer (rt) that dispatches to a back end 
layer (librt) for traversed intersections, then the front-end sends the results 
to another layer (liboptical) for coloring/effects, and that all is then 
composited by the front-end into an image (via libicv/libfb).  The application 
layer (used by dozens of applications) is somewhat relevant and described in 
detail here: http://brlcad.org/wiki/Developing_applications
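
For anyone following along who hasn't seen it, the application layer boils down
to the pattern on that wiki page; a condensed sketch (error handling omitted,
details may vary slightly from the current API):

    #include "vmath.h"
    #include "raytrace.h"

    static int hit(struct application *ap, struct partition *PartHeadp, struct seg *segs)
    {
        /* PartHeadp is the weaved Boolean result: in/out partitions along the ray */
        struct partition *pp;
        (void)segs;
        for (pp = PartHeadp->pt_forw; pp != PartHeadp; pp = pp->pt_forw)
            bu_log("%s: in %g out %g\n", pp->pt_regionp->reg_name,
                   pp->pt_inhit->hit_dist, pp->pt_outhit->hit_dist);
        return 1;
    }

    static int miss(struct application *ap)
    {
        (void)ap;
        return 0;
    }

    int main(int argc, char **argv)
    {
        struct application ap;
        char idbuf[132];
        struct rt_i *rtip = rt_dirbuild(argv[1], idbuf, sizeof(idbuf));
        (void)argc;

        rt_gettree(rtip, argv[2]);    /* load one top-level object from the .g file */
        rt_prep_parallel(rtip, 1);    /* per-primitive prep; a natural CL setup hook */

        RT_APPLICATION_INIT(&ap);
        ap.a_rt_i = rtip;
        ap.a_hit  = hit;
        ap.a_miss = miss;
        VSET(ap.a_ray.r_pt,  0.0, 0.0, 10000.0);
        VSET(ap.a_ray.r_dir, 0.0, 0.0, -1.0);
        rt_shootray(&ap);             /* dispatch -> traverse -> shot -> weave -> hit() */
        return 0;
    }

Anything that restructures the pipeline has to preserve this callback contract,
since dozens of applications depend on it.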

Cheers!
Sean




Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-21 Thread Christopher Sean Morrison

On Mar 20, 2015, at 11:29 PM, Vasco Alexandre da Silva Costa 
vasco.co...@gmail.com wrote:

 I see. Then I need to examine this part better and propose a plan focused on 
 this. I only have CS grad student like knowledge of CSG since most of my 
 experience in RT is with acceleration structures and ray processing so I am 
 interested in seeing how you do the weaving. If there are any specific 
 optimizations, etc.

Old school optimizations.  Complex goto branching and highly tuned tolerance 
testing/merging/splitting.  For a first take, though, a simpler approach could 
be taken since the performance considerations will be considerably different 
with coherency and/or GPU involved.
  
 So the only planning concern is that this isn’t an incremental plan.  It’s 
 always a big red flag when a proposal has a big “test and document” at the 
 end.  If anything goes wrong or takes a couple weeks longer than planned with 
 one of your early steps, we usually end up with something incomplete that 
 cannot be shipped to users.
  
 I'm used to doing incremental development with continuous integration. I 
 don't know how to do it any other way. I always test every step in order to 
 not break the code. But in my experience something always needs fine tuning 
 or debugging in the end. Also the docs always lag behind everything. Not 
 dedicating time specifically in the plan for docs usually means next to no 
 docs will be produced.

Coding complete isn’t about incremental development or continuous testing.  
Those are good too.  Taken to extreme, complete means you develop in a manner 
such that even if tragedy struck (e.g., you got hit by a bus tomorrow), there’s 
actually nothing anyone would have to undo/fix/finish/revert.  You focus on 
making sure your code is always “done”, in a production-ready state.

If writing code were a graph traversal, it’s about going depth-first instead of 
breadth-first.  Not leaving TODO and FIXME notes for issues directly related to 
your implementation, making sure your code is always fully compliant to 
standards, not getting something “good enough” before moving on to the next 
piece, etc.  If you’re complete, docs never lag, implementations are verified 
as they’re implemented, and reverified as they become integrated with other 
components.

If a callout is warranted in the schedule, there should be several instances of 
“test, debug, document” throughout your timeline, but still not one big 2.5 
week one at the end.  That’d be breadth-first waterfall-style cleanup.  It’s 
just not acceptable from a risk-management perspective, and essentially 
something we’ve learned to not even allow for GSoC.

 I was an OSS project maintainer (Freeciv) for close to 4 years. You never 
 ever break the code with a commit. BTW what's your process for patch code 
 reviews? You discuss patches on the mailing-list?

Ah, a fellow game dev with OSS experience — fantastic.  Please read our 
developer HACKING guide (in source tree).  It covers patches, commit 
expectations, code reviews.  It’s an informal process primarily in the hands of 
existing devs to decide what to review.  What is deemed acceptable is discussed 
throughout HACKING, automated tests, and core developer mindshare.

As for commits breaking code, HACKING also talks about that topic.  We actually 
allow a (temporarily) unstable trunk.  We have STABLE and RELEASE branches that 
have much different quality and commit requirements.

 I basically wanted to have a whole (optional) accelerated rendering path from 
 end to end so we could then work on the primitives afterwards when this thing 
 ends. But if weaving is that compute intensive in BRL-CAD you are correct 
 that it may make more sense to initially focus on just that. I have not 
 profiled BRL-CAD librt to know where the perf bottlenecks are but you guys 
 have actual experience with this.

I completely understood why you proposed it that way. :)

The problem is that approach means that you could be 100% successful in your 
project, even ahead of schedule, and users could end up with no benefit when 
GSoC is over until enough primitives have a shot routine transcoded, tested, 
and verified.  Some of them will be exceptionally complex to get working in 
OpenCL, months of effort each.  There are about 10 primitives that are in 
pervasive use — some incredibly complex.  Why would we want optional 
acceleration?  By simply shifting the order/focus around a little bit, we can 
avoid making it optional.  It will ensure the focus remains on the user, not the status 
of implementation.

If you actually accomplish accelerating the whole pipeline by the end, that 
would be fantastic.  If anything takes longer than expected, and it will, 
everything will still be fine.  Remember to account for time throughout to 
provably demonstrate consistency with the existing implementation, to document 
progress as you go, to engage in discussions, to answer questions, etc. 
 

As for performance, 

Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-21 Thread Christopher Sean Morrison

On Mar 21, 2015, at 10:11 AM, Vasco Alexandre da Silva Costa 
vasco.co...@gmail.com wrote:

 On Sat, Mar 21, 2015 at 7:19 AM, Christopher Sean Morrison brl...@mac.com 
 wrote:
 
 If a callout is warranted in the schedule, there should be several instances 
 of “test, debug, document” throughout your timeline, but still not one big 
 2.5 week one at the end.  That’d be breadth-first waterfall-style cleanup.  
 It’s just not acceptable from a risk-management perspective, and essentially 
 something we’ve learned to not even allow for GSoC.
 
 Not necessarily. For example if you issue a code freeze before doing a 
 version bump where you only accept bugfixes and don't accept any feature 
 improvements it doesn't mean you are using a waterfall model. Is the Linux 
 kernel development model a waterfall model? I would say it isn’t.

I would agree.  When you have ongoing continuous integration cycles and 
depending on the length of the cycles relative to the overall effort invested, 
it’s arguably not waterfall.  When it’s only one cycle, about 17% of the 
timeline, and at the end — it fits the classic verification/debugging 
characterization rather well.  Regardless, it really doesn’t matter what we 
call it.  It’s highly “undesirable” for GSoC projects.  :)

 I will change the planning timetable to work in the fashion you describe.

Appreciated.  For everyone following the thread, I will add that discussions 
like these are good to see.  We expect give and take compromises and 
collaborative discussions throughout development from everyone.  We certainly 
won’t always agree and contrary decisions will suck.  It takes time and 
commitment to earn a meritocratic position (which I’m sure, Vasco, you're well 
aware of given your OSS experience).  With that, however, also comes the 
ability to influence change, take BRL-CAD in directions we weren’t even 
considering (looking at all you web devs). :)  Networking and building working 
relationships are really the most important aspects of GSoC.

Cheers!
Sean



Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-21 Thread Vasco Alexandre da Silva Costa
On Sat, Mar 21, 2015 at 12:30 AM, Christopher Sean Morrison brl...@mac.com
...

 Something to consider, you could even propose *only* focusing on this
 aspect for GSoC.  What I mean is that you could spend time restructuring
 the rt dispatch logic, which is something like
 forAllPixels->shootOneRay->traverseScene->evalHits->weaveBooleans->colorizePixel
 that iterates over all pixels depth-first pipeline style.  You’d
 restructure it into something like phase1:
 forAllPixels->shootOneRay->traverseScene->evalHits then phase2:
 CLweaveAllBooleans->CLcolorizePixels.

...

 That’s why I’d suggest either focusing only on boolean weaving, or on
 bundled ray dispatch+traversal, or hit+result gathering, etc — something
 that could be put into immediate use, even if it’s not going to give the
 10x speedup until the rest of the pipeline is converted.  Changing
 BRL-CAD’s render pipeline to support this style of evaluation is going to
 be a lot of work.


I looked some more at the code and yeah the weave booleans step is branch
heavy and has gotos in it. So this is most likely a bad candidate for GPU
acceleration since a GPU will have like a quarter of the clockspeed and
worse branch prediction hardware than the CPU. We want something that is
compute heavy and has low thread divergence.

So I think it is best to focus on ray dispatch+traversal at this time.
Generate rays in bundles, traverse some sort of acceleration structure,
gather hit candidate ids, and compute hit intersections on CPU. Not having
the primitive tests on the GPU will reduce the performance advantage since
we will spend time transmitting data back and forth though.

On Sat, Mar 21, 2015 at 7:19 AM, Christopher Sean Morrison brl...@mac.com
wrote:

 As for performance, weaving can be a big or small factor, as it’s heavily
 dependent on the geometry being rendered, number of primitives, number of
 regions, depth of hierarchy, number of operations, etc.  A traditional
 model like havoc.g or m35.g in our sample db directory should benefit
 greatly while a pure NURBS model (e.g., no Boolean operations) won’t
 benefit nearly as much.  We’re not going to see huge performance until the
 whole pipeline is connected, and that’s okay if users get speedups for some
 geometry situations.  Generally speaking, dispatched traversal is usually
 5-15%, prep is 0-20% (except brep), shot intersection testing is
 30-70%, weaving is 5-30%, and shading is 0-25%.  Models with complex
 entities spend more time in prep and shot; models with complex hierarchies
 spend more time in weaving and traversal.


The shading pass also seems like low hanging fruit. We need to send the
image data to the GPU to display anyway so why not let it do some of the
work. But this depends on how the code is organized and output is displayed.

The main issue is that we would like that the
shootOneRay->traverseScene->evalHits->weaveBooleans->colorizePixel
steps would be done in clearly separated stages for all tested rays, in
order to reduce the number of CL kernel calls and maximize coherence, but
from what I understand everything currently happens in one giant
megakernel. So perhaps a lot of effort should be spent changing this way of
things before doing any CL coding at all.

-- 
Vasco Alexandre da Silva Costa
PhD Student at Department of Information Systems and Computer Science
Instituto Superior Técnico/University of Lisbon, Portugal


Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-21 Thread Vasco Alexandre da Silva Costa
On Sat, Mar 21, 2015 at 7:19 AM, Christopher Sean Morrison brl...@mac.com
wrote:


 If a callout is warranted in the schedule, there should be several
 instances of “test, debug, document” throughout your timeline, but still
 not one big 2.5 week one at the end.  That’d be breadth-first
 waterfall-style cleanup.  It’s just not acceptable from a risk-management
 perspective, and essentially something we’ve learned to not even allow for
 GSoC.

 Not necessarily. For example if you issue a code freeze before doing a
version bump where you only accept bugfixes and don't accept any feature
improvements it doesn't mean you are using a waterfall model. Is the Linux
kernel development model a waterfall model? I would say it isn't.

I will change the planning timetable to work in the fashion you describe.

Vasco Alexandre da Silva Costa
PhD Student at Department of Information Systems and Computer Science
Instituto Superior Técnico/University of Lisbon, Portugal


Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-20 Thread Christopher Sean Morrison

On Mar 20, 2015, at 7:17 PM, Vasco Alexandre da Silva Costa 
vasco.co...@gmail.com wrote:

 I came up with this tentative work plan for improving the OpenCL (CL) RT code 
 on BRL-CAD:

Excellent!

 - Hook into shot and db load routines (hooks disabled by default) in order to 
 capture ellipsoid primitives and shots into the CL side. Get familiar with 
 these top level interfaces of the code. (2 weeks)

This might require a little more inspection.  For places where you might “get 
opencl ready”, I would expect you to hook in when a raytrace context is created 
(e.g., rt_dirbuild()) or during geometry prep, which every object implements 
and is called during directory building.  If you see the rt_ell_prep() function 
in the opencl branch, that’s effectively where this is already happening for 
our quick test.
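
As a hedged illustration of what "hooking in at prep" might look like on the host
side; the struct layout and the clt_ell_prep() name are hypothetical, and only the
OpenCL calls themselves are real:

    #include <CL/cl.h>

    struct clt_ell {            /* device-side ellipsoid record (illustrative layout) */
        cl_double V[3];         /* center */
        cl_double mat[9];       /* whatever precomputed matrix prep produces */
    };

    static cl_context ctx;      /* created once when the raytrace context is built */
    static cl_mem ell_buf;      /* persists for the life of the rt_i, not per shot */

    int clt_ell_prep(const struct clt_ell *ells, size_t count)
    {
        cl_int err;
        ell_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                 count * sizeof(struct clt_ell),
                                 (void *)ells, &err);
        return err == CL_SUCCESS ? 0 : -1;
    }

    /* clt_shot() would then only set kernel args and enqueue work, e.g.
     *   clSetKernelArg(kernel, 0, sizeof(cl_mem), &ell_buf);
     *   clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &nrays, NULL, 0, NULL, NULL);
     * instead of re-allocating and re-uploading geometry on every intersection call. */

That moves the upload from per-ray to per-prep, which is the main overhead the
current proof-of-concept deliberately ignores.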

 - CL megakernel for boolean operations. i need to check how involved this 
 will be to have a more accurate time estimate. Probably requires storing some 
 primitive hierarchy on CL. Integration with C++ side of things for other  
 primitives may be problematic. (2 weeks?)

This may be a large task, possibly the bulk of your timeframe, as boolean 
weaving is one of the most complex aspects in BRL-CAD.  Understanding it is 
almost certainly going to take you some time.  However, you’ll probably be able 
to publish a paper on it when you’re done! :)

Something to consider, you could even propose *only* focusing on this aspect 
for GSoC.  What I mean is that you could spend time restructuring the rt 
dispatch logic, which is something like 
forAllPixels->shootOneRay->traverseScene->evalHits->weaveBooleans->colorizePixel
 that iterates over all pixels depth-first pipeline style.  You’d restructure 
it into something like phase1: 
forAllPixels->shootOneRay->traverseScene->evalHits then phase2: 
CLweaveAllBooleans->CLcolorizePixels.
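
Structurally, the split might look something like this hedged sketch; the batch
type and the cl_*() calls are hypothetical stand-ins, and the real change would
live in src/rt/worker.c and friends rather than in a single function:

    #include <stddef.h>

    struct hitbatch { double in, out; int prim_id; };   /* per-pixel raw hit data */

    static void shoot_and_eval(size_t pixel, struct hitbatch *out) { (void)pixel; (void)out; }
    static void cl_weave_all(struct hitbatch *b, size_t n) { (void)b; (void)n; }
    static void cl_colorize_all(const struct hitbatch *b, size_t n, unsigned char *rgb)
    { (void)b; (void)n; (void)rgb; }

    void render(size_t npixels, struct hitbatch *batch, unsigned char *rgb)
    {
        /* phase 1: forAllPixels -> shootOneRay -> traverseScene -> evalHits,
         * accumulating raw hit data instead of weaving/shading per pixel */
        for (size_t i = 0; i < npixels; i++)
            shoot_and_eval(i, &batch[i]);

        /* phase 2: weave all Booleans and colorize all pixels in bulk on the
         * CL device, one kernel launch per stage instead of work per ray */
        cl_weave_all(batch, npixels);
        cl_colorize_all(batch, npixels, rgb);
    }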

If you got it working, you’d speed up ray tracing somewhere between 25-50% (as 
that is about how much time Boolean weaving takes) for all geometry, not just 
ellipsoids, and it could go into immediate production use.  It’d be in users’ 
hands.  More on this later.

Time permitting, you could then work on the rest of the pipeline, like 
implementing the top-level bundled dispatch and scene traversal acceleration, 
and primitive shot() routines.

 - Implement CL rectilinear grids [1] (improvement on the Mike Gigante Nugrid 
 currently used in BRL-CAD) spatial partitioning construction and traversal. 
 Should reuse most of the regular grid construction code but requires some 
 extra construction steps and has different traversal scheme (2.5 weeks)

Sounds good, but know that we don’t actually use Nugrid.  It’s faster for some 
scenes, slower for others — overall a wash.  I think there’s also a bug 
somewhere in there.  We use a custom BSP traversal for production work that is 
much more thoroughly tested and robust.  Grids generally do much better on the 
GPU than they do on the CPU.
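
For reference, the kind of traversal that maps well to the GPU is the classic
Amanatides & Woo 3D-DDA over a regular grid; a hedged OpenCL sketch follows (a
rectilinear grid adds per-axis boundary lookups on top of this, and all names
are hypothetical):

    __kernel void grid_walk(__global const float3 *origins,
                            __global const float3 *dirs,
                            const float3 grid_lo,            /* grid lower corner */
                            const float3 cell_size,
                            const int3 res,                  /* cells per axis */
                            __global const uint *cell_start, /* CSR-style per-cell lists */
                            __global const uint *cell_prims,
                            const uint max_out,
                            __global uint *out_ids,
                            __global uint *out_count)
    {
        uint r = get_global_id(0);
        float3 o = origins[r], d = dirs[r];

        /* assume the origin is already inside the grid, for brevity */
        int3 cell = convert_int3((o - grid_lo) / cell_size);
        int3 step = (int3)(d.x >= 0 ? 1 : -1, d.y >= 0 ? 1 : -1, d.z >= 0 ? 1 : -1);
        float3 next = grid_lo + convert_float3(cell + max(step, (int3)(0))) * cell_size;
        float3 tmax = (next - o) / d;          /* distance to the next cell boundary */
        float3 tdelta = fabs(cell_size / d);

        uint n = 0;
        while (cell.x >= 0 && cell.x < res.x &&
               cell.y >= 0 && cell.y < res.y &&
               cell.z >= 0 && cell.z < res.z && n < max_out) {
            uint c = (uint)((cell.z * res.y + cell.y) * res.x + cell.x);
            for (uint i = cell_start[c]; i < cell_start[c + 1] && n < max_out; i++)
                out_ids[r * max_out + n++] = cell_prims[i];  /* candidate primitives */

            if (tmax.x < tmax.y && tmax.x < tmax.z) { cell.x += step.x; tmax.x += tdelta.x; }
            else if (tmax.y < tmax.z)               { cell.y += step.y; tmax.y += tdelta.y; }
            else                                    { cell.z += step.z; tmax.z += tdelta.z; }
        }
        out_count[r] = n;
    }

The inner loop has no recursion and a fixed, predictable memory pattern, which is
the property that tends to favor grids over tree structures on the GPU.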

 - Cleanups, bugfixes, final tests, docs. (2.5 weeks)

So the only planning concern is that this isn’t an incremental plan.  It’s 
always a big red flag when a proposal has a big “test and document” at the end. 
 If anything goes wrong or takes a couple weeks longer than planned with one of 
your early steps, we usually end up with something incomplete that cannot be 
shipped to users.

Assume something will go wrong, that it will take longer, or your computer will 
explode next week.  How would you change your approach so that users still get 
something?  If you had to immediately stop working, there should be something 
usable (without requiring additional work by you or others to make it usable). 
This is an aspect of “coding complete” mentioned on 
http://brlcad.org/wiki/Google_Summer_of_Code/Acceptance#Write_complete_code

Basically, this means thinking through how you can incrementally break up the 
task into phases where each phase gets tested and could be put into production 
use without detriment.  It not only means you’re continually cleaning up, 
testing, and updating docs, it means you’re continually focused on the user and 
their benefit instead of your development comfort.  If that means getting a 
little less done, that’s okay.  Users will love you for it.

My other concern was that your objective doesn’t result in a feature that 
affects users.  Yay, devs can raytrace ellipsoids MUCH faster … and users see 
nothing.  That’s why I’d suggest either focusing only on boolean weaving, or on 
bundled ray dispatch+traversal, or hit+result gathering, etc — something that 
could be put into immediate use, even if it’s not going to give the 10x speedup 
until the rest of the pipeline is converted.  Changing BRL-CAD’s render 
pipeline to support this style of evaluation is going to be a lot of work. 

Still, outstanding proposal progress.  This and other proposals are making me 
pretty darn excited!

Cheers!
Sean

p.s. Some technical libraries to consider in lieu of directly binding to OpenCL:


Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-20 Thread Vasco Alexandre da Silva Costa
On Sat, Mar 21, 2015 at 12:30 AM, Christopher Sean Morrison brl...@mac.com
wrote:
...

 This may be a large task, possibly the bulk of your timeframe, as boolean
 weaving is one of the most complex aspects in BRL-CAD.  Understanding it is
 almost certainly going to take you some time.  However, you’ll probably be
 able to publish a paper on it when you’re done! :)

 Something to consider, you could even propose *only* focusing on this
 aspect for GSoC.  What I mean is that you could spend time restructuring
 the rt dispatch logic, which is something like
 forAllPixels->shootOneRay->traverseScene->evalHits->weaveBooleans->colorizePixel
 that iterates over all pixels depth-first pipeline style.  You’d
 restructure it into something like phase1:
 forAllPixels->shootOneRay->traverseScene->evalHits then phase2:
 CLweaveAllBooleans->CLcolorizePixels.


I see. Then I need to examine this part better and propose a plan focused
on this. I only have CS grad student like knowledge of CSG since most of my
experience in RT is with acceleration structures and ray processing so I am
interested in seeing how you do the weaving. If there are any specific
optimizations, etc.


 If you got it working, you’d speed up ray tracing somewhere between 25-50%
 (as that is about how much time Boolean weaving takes) for all geometry,
 not just ellipsoids, and it could go into immediate production use.  It’d be
 in users’ hands.  More on this later.


Yeah this seems to be a good idea. I wanted to do something that would
leave an actual performance improvement for users in the end. If this is
the critical compute intensive task then it is what needs to be addressed.


 With time-permitting, you could then work on the rest of the pipeline like
 implementing the top-level bundled dispatch and scene traversal
 acceleration, and primitive shot() routines.

...

 Sounds good, but know that we don’t actually use Nugrid.  It’s faster for
 some scenes, slower for others — overall a wash.  I think there’s also a
 bug somewhere in there.  We use a custom BSP traversal for production work
 that is much more thoroughly tested and robust.  Grids generally do much
 better on the GPU than they do on the CPU.


That is news to me. When I glanced at librt/shoot.c I saw a lot of
references to 'nugrid'. There's only one match for 'nubsp' and it is in a
comment. I guess I need to start peeling this and see what is actually
being called.

 - Cleanups, bugfixes, final tests, docs. (2.5 weeks)


 So the only planning concern is that this isn’t an incremental plan.  It’s
 always a big red flag when a proposal has a big “test and document” at the
 end.  If anything goes wrong or takes a couple weeks longer than planned
 with one of your early steps, we usually end up with something incomplete
 that cannot be shipped to users.


I'm used to doing incremental development with continuous integration. I
don't know how to do it any other way. I always test every step in order to
not break the code. But in my experience something always needs fine tuning
or debugging in the end. Also the docs always lag behind everything. Not
dedicating time specifically in the plan for docs usually means next to no
docs will be produced.


 Assume something will go wrong, that it will take longer, or your computer
 will explode next week.  How would you change your approach so that users
 still get something?  If you had to immediately stop working, there should
 be something usable (without requiring additional work by you or others to
 make it usable). This is an aspect of “coding complete” mentioned on
 http://brlcad.org/wiki/Google_Summer_of_Code/Acceptance#Write_complete_code


I was an OSS project maintainer (Freeciv) for close to 4 years. You never
ever break the code with a commit. BTW what's your process for patch code
reviews? You discuss patches on the mailing-list?


 Basically, this means thinking through how you can incrementally break up
 the task into phases where each phase gets tested and could be put into
 production use without detriment.  It not only means you’re continually
 cleaning up, testing, and updating docs, it means you’re continually
 focused on the user and their benefit instead of your development comfort.
 If that means getting a little less done, that’s okay.  Users will love you
 for it.

 My other concern was that your objective doesn’t result in a feature that
 affects users.  Yay, devs can raytrace ellipsoids MUCH faster … and users
 see nothing.  That’s why I’d suggest either focusing only on boolean
 weaving, or on bundled ray dispatch+traversal, or hit+result gathering, etc
 — something that could be put into immediate use, even if it’s not going to
 give the 10x speedup until the rest of the pipeline is converted.  Changing
 BRL-CAD’s render pipeline to support this style of evaluation is going to
 be a lot of work.


I basically wanted to have a whole (optional) accelerated rendering path
from end to end so we could then work on the primitives 

Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-06 Thread benson chepkwony
You mentioned that:
There is some OpenCL prototype code in there for testing intersections with 
ellipsoids. But this kind of architecture is not going to work performance 
wise. It is allocating buffers and sending geometry data to the device (GPU or 
whatever) on every single ray/primitive intersection call 
(src/librt/primitives/sph/sph.c:clt_shot). I also checked the main intersection 
code and it uses a function table to figure out which ray/primitive 
intersection routine to call (src/librt/primitives/table.c)
 
Since this subject talks about performance, I was going to ask:
1. First, do you have a manual or guide book for BRL-CAD?
2. Is the main ray tracing written with the OpenCL library/code? And does this 
affect the following files:
src/other/opennurbs
src/librt/primitives/brep 
src/librt/opennurbs_ext.* 
src/proc-db/csgbrep.cpp
src/proc-db/brep* 

3. Can we modify/edit these libraries (OpenCL, openNURBS, and the Rhino3D code), 
or should we write a ray tracer from scratch, say if we need to optimize it?
4. Are NURBS used to find ray intersections on objects? And what about the OpenCL 
code: does it do the same thing as NURBS? Which one is better?

I am interested in working on the NURBS Optimization and Cleanup and I would 
like to optimize its algorithm as this often plays a pivotal role in improving 
performance. This is just one classic place to optimize. 
To optimize the ray tracer, I am planning on conducting a couple of these steps:
  1. Implementing a kd-tree structure.
  2. Implementing the Bezier clipping method for NURBS surfaces.
  3. Implementing BVH ray traversal.
  4. Implementing a bounding volume hierarchy.
  5. Fine-tuning a NURBS curve by specifying different weights with the knot vector.

Bounding Volume Hierarchy:
In order to reduce intersection cost, a bounding volume should be drawn around 
each polygon.

You also mentioned that:
The way ray tracing currently occurs, there’s even room to speed up primitives 
on the host (CPU) by an order of magnitude or better by ensuring better data 
coherency and eliminating branch logic.
Because you mentioned it, I would like to look further into this area.

From,
Benson



Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-06 Thread Vasco Alexandre da Silva Costa
On Fri, Mar 6, 2015 at 5:03 PM, benson chepkwony bchepk...@att.net wrote:

 3. Can we modify/edit these libraries (OpenCL, openNURBS, and the Rhino3D
 code), or should we write a ray tracer from scratch, say if we need to
 optimize it?
 4. Are NURBS used to find ray intersections on objects? And what about the
 OpenCL code: does it do the same thing as NURBS? Which one is better?


NURBS are just another primitive. The BRL-CAD rendering architecture uses
the ray tracing algorithm. So it has an implementation of a ray/NURB
intersection algorithm in order to render NURBS.

OpenCL is just a C-like programming language with vector processing
extensions.


  I am interested in working on the NURBS Optimization and Cleanup and I
 would like to optimize its algorithm as this often plays a pivotal role in
 improving performance. This is just one classic place to optimize.
 To optimize the ray tracer, I am planning on conducting a couple of these
 steps:
   1. Implementing a kd-tree structure.
   2. Implementing the Bezier clipping method for NURBS surfaces.
   3. Implementing BVH ray traversal.
   4. Implementing a bounding volume hierarchy.
   5. Fine-tuning a NURBS curve by specifying different weights with the knot vector.

 Bounding Volume Hierarchy:
 In order to reduce intersection cost, a bounding volume should be drawn
 around each polygon.


You seem to be assuming BRL-CAD triangulates the NURBS surfaces before the
ray-tracing step but I am not sure that is what they are doing.


 You also mentioned that:
 The way ray tracing currently occurs, there’s even room to speed up
 primitives on the host (CPU) by an order of magnitude or better by ensuring
 better data coherency and eliminating branch logic.
 Because you mentioned it, I would like to look further in this area.

 If you want to work on that check out these papers:
Design for Parallel Interactive Ray Tracing Systems
An Application of Scalable Massive Model Interaction using Shared-Memory
Systems

which describe the implementation of the Manta ray-tracer. Manta is an
open-source parallel ray tracer which has ray bundle optimizations.


Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-06 Thread Christopher Sean Morrison

On Mar 6, 2015, at 12:58 PM, Vasco Alexandre da Silva Costa 
vasco.co...@gmail.com wrote:

 You seem to be assuming BRL-CAD triangulates the NURBS surfaces before the 
 ray-tracing step but I am not sure that is what they are doing.

That is most definitely not what we do.

 If you want to work on that check out these papers:
 Design for Parallel Interactive Ray Tracing Systems
 An Application of Scalable Massive Model Interaction using Shared-Memory 
 Systems
 
 which describe the implementation of the Manta ray-tracer. Manta is an 
 open-source parallel ray tracer which has ray bundle optimizations.

Excellent references, thanks for sharing.  Another is Intel’s Embree ray 
tracer: 
https://software.intel.com/en-us/articles/embree-photo-realistic-ray-tracing-kernels

Cheers!
Sean




Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-05 Thread Christopher Sean Morrison

On Mar 5, 2015, at 9:27 AM, Vasco Alexandre da Silva Costa 
vasco.co...@gmail.com wrote:

 I see. But double precision floating point math is a lot slower than single 
 precision on certain architectures like the lower end NVIDIA cards. e.g. the 
 GeForce GTX 780 Ti has 5048 SP GFLOPS but only 210 DP GFLOPS. Even on the 
 higher end NVIDIA cards the DP FLOPS are like a third of the SP FLOPS. The 
 rendering system should be tunable for either speed or accuracy. That kind of 
 complicates things. As a first approach we can work only on getting the 
 double precision to work but the performance will suffer a great deal.

Absolutely understood, but it’s not just speed vs accuracy.  If a ray that 
should mathematically hit an object reports a miss, we will produce incorrect 
analysis results, which could lead to incorrect decisions that affect 
manufacturing designs or even impact lives.

Imagine a calculator that will report 2+2=[5,4,3] VERY quickly, much faster 
than one that reports 2+2=4 every time.  Would that be very useful?  Perhaps 
for some situations, but not for most math problems.  That’s kind of our 
situation here.  We don’t do raytracing for pretty pictures.  Fast and wrong is 
useless for our purposes.  

If single precision can be made to match results, that will obviously be more 
desirable and faster.  Frankly, I don’t care if it’s single or double precision, 
but the behavior of a ray that should hit or miss must be demonstrably 
mathematically correct or at least consistent with current behavior.

Cheers!
Sean




Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-05 Thread Vasco Alexandre da Silva Costa
On Thu, Mar 5, 2015 at 6:24 AM, Christopher Sean Morrison brl...@mac.com
wrote:


 Astute observations.  The OpenCL code was intentionally structured that
 way, not at all for performance obviously, but so we could perform a 1-1
 comparison to validate that the implementation behavior was correct.  Our
 librt ray tracing logic is integrated in a number of scientific analysis
 applications, so we go to great lengths to ensure we consistently report
 hits/misses, that the ray intersection implementation is correct especially
 given inaccurate floating point math.

 Indeed, the first OpenCL translation of the ellipsoid logic — which seemed
 100% identical to the C code — was wrong.  It resulted in different
 calculations.  The issues were sorted out a few days later, but this level
 of rigor and validation is necessary.  Of course, this is not the desired
 end-state, but was done to easily test the implementation.  The inefficient
 buffer allocation overhead was also intentional (and also temporary) just
 to get quick numbers on OpenCL’s overhead costs.


I see. But double precision floating point math is a lot slower than single
precision on certain architectures like the lower end NVIDIA cards. e.g.
the GeForce GTX 780 Ti has 5048 SP GFLOPS but only 210 DP GFLOPS. Even on
the higher end NVIDIA cards the DP FLOPS are like a third of the SP FLOPS.
The rendering system should be tunable for either speed or accuracy. That
kind of complicates things. As a first approach we can work only on getting
the double precision to work but the performance will suffer a great deal.

-- 
Vasco Alexandre da Silva Costa
PhD Student at Department of Information Systems and Computer Science
Instituto Superior Técnico/University of Lisbon, Portugal


Re: [brlcad-devel] Real-Time Ray Tracing

2015-03-04 Thread Christopher Sean Morrison

On Mar 4, 2015, at 6:06 PM, Vasco Alexandre da Silva Costa 
vasco.co...@gmail.com wrote:

 I have been reading the 'librt' source code.

Excellent!

 There is some OpenCL prototype code in there for testing intersections with 
 ellipsoids. But this kind of architecture is not going to work performance 
 wise. It is allocating buffers and sending geometry data to the device (GPU 
 or whatever) on every single ray/primitive intersection call 
 (src/librt/primitives/sph/sph.c:clt_shot). I also checked the main 
 intersection code and it uses a function table to figure out which 
 ray/primitive intersection routine to call (src/librt/primitives/table.c)…

Astute observations.  The OpenCL code was intentionally structured that way, 
not at all for performance obviously, but so we could perform a 1-1 comparison 
to validate that the implementation behavior was correct.  Our librt ray 
tracing logic is integrated in a number of scientific analysis applications, so 
we go to great lengths to ensure we consistently report hits/misses, that the 
ray intersection implementation is correct especially given inaccurate floating 
point math.
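
In spirit, the 1-1 comparison is nothing more exotic than shooting the same ray
through both paths and requiring the distances to agree within tolerance; a
hedged sketch (the tolerance value and function name here are illustrative, not
the actual librt tolerances):

    #include <math.h>
    #include <stdio.h>

    #define DIST_TOL 0.0005   /* illustrative distance tolerance, in mm */

    /* Compare the C and OpenCL in/out hit distances for one shot. */
    int check_shot(double c_in, double c_out, double cl_in, double cl_out)
    {
        if (fabs(c_in - cl_in) > DIST_TOL || fabs(c_out - cl_out) > DIST_TOL) {
            fprintf(stderr, "mismatch: C [%g, %g] vs CL [%g, %g]\n",
                    c_in, c_out, cl_in, cl_out);
            return 0;
        }
        return 1;
    }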

Indeed, the first OpenCL translation of the ellipsoid logic — which seemed 100% 
identical to the C code — was wrong.  It resulted in different calculations.  
The issues were sorted out a few days later, but this level of rigor and 
validation is necessary.  Of course, this is not the desired end-state, but was 
done to easily test the implementation.  The inefficient buffer allocation 
overhead was also intentional (and also temporary) just to get quick numbers on 
OpenCL’s overhead costs.

Basically, it was a simple proof-of-concept and it was successful.  The next 
steps are to create a new intersection pipeline that is coherent from start to 
end with bundles of rays getting dispatched and evaluated against primitives 
without excessive branching and cache misses.
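
A coherent pipeline of that kind usually starts with the data layout; a hedged
sketch of a structure-of-arrays ray bundle (field names hypothetical) would be:

    #include <stdlib.h>

    struct ray_bundle {
        size_t  count;
        double *org_x, *org_y, *org_z;   /* ray origins, one array per component */
        double *dir_x, *dir_y, *dir_z;   /* ray directions */
        double *hit_in, *hit_out;        /* per-ray segment results */
        int    *prim_id;                 /* which primitive produced the hit */
    };

    struct ray_bundle *bundle_alloc(size_t n)
    {
        struct ray_bundle *b = calloc(1, sizeof(*b));
        b->count = n;
        b->org_x = malloc(n * sizeof(double));
        b->org_y = malloc(n * sizeof(double));
        b->org_z = malloc(n * sizeof(double));
        b->dir_x = malloc(n * sizeof(double));
        b->dir_y = malloc(n * sizeof(double));
        b->dir_z = malloc(n * sizeof(double));
        b->hit_in  = malloc(n * sizeof(double));
        b->hit_out = malloc(n * sizeof(double));
        b->prim_id = malloc(n * sizeof(int));
        return b;
    }

Each stage (dispatch, traversal, shot, weaving) then streams over contiguous
arrays, which is what keeps caches, vector units, and GPU transfers happy.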

 Then I did some  'wc -l' runs to see how much effort it would take to recode 
 parts of the pipeline:
 src/librt/*.c  - 33516 lines
 src/librt/primitives/*.c   - 4653 lines
 src/librt/primitives/*/*.c - 141351 lines

This is all the back-end portions of the pipeline.  You’ll also need to 
consider the front-end where ray dispatching occurs in 
src/rt/(do|main|opt|view|worker).c

That said, it’s both better and worse than those numbers make it seem.  You’ve 
encompassed far too much code in librt, much of it related to .g file 
management.  You’re also not looking at the code that some of those primitives 
rely on.  For example, the “brep” primitive is built on src/libbrep and 
src/other/openNURBS, and some of the primitives rely on our src/libbn math 
library extensively for polynomial root solving.

 Most code is ray/primitive intersection routines. I have to read some more to 
 see how the librt object database is stored, updated, and accessed. But I 
 think the most viable architecture is probably going to be something like... 
 we store some accelerated primitives on the device (GPU), and unaccelerated 
 primitives on the host (CPU). Then when we do the ray shooting we perform it 
 on both databases and merge the results. In theory the more accelerated 
 primitives (GPU) we have the smaller the unaccelerated primitives database 
 (CPU) will be and the faster the ray shooting will be. As a proof of concept 
 we can reuse the accelerated ellipsoid intersection code over this 
 architecture.

The way ray tracing currently occurs, there’s even room to speed up primitives 
on the host (CPU) by an order of magnitude or better by ensuring better data 
coherency and eliminating branch logic.  This will require modifying the 
dispatcher (src/rt/worker.c), the shot routine (src/librt/primitives/*/*.c), 
the boolean evaluator (src/librt/bool.c), and the optical renderer 
(src/rt/view.c and src/rt/do.c).

On a minor related note, the implementation should not ever assume or require 
the availability of a GPU.  It can and should leverage GPU cores when they are 
available.  Again, there are substantial performance gains to be had even on 
the CPU.

 We need to perform some kind of analysis on which primitives people are most 
 likely to use and work on accelerating them. So basically we need some 
 statistics on a corpus of scenes i.e. how many primitives each model total 
 has and how many primitives of each type each scene has as a percentage.

Also, while we have 24+ different primitives and intersection routines, the 
most important are just these six: ell, tgc, arb8, tor, bot, and brep.  For 
GSoC, I would recommend focusing on only the first four or even just ell since 
it’s shot routine is already complete and validated, so you could focus on 
getting the rest of the pipeline working.
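
For a sense of scale, the ell case in isolation is small; a generic ray/ellipsoid
test in OpenCL C, done the usual map-to-unit-sphere way, is only a few lines.
This is the textbook formulation for an axis-aligned ellipsoid (center c, radii
r), not a transcription of rt_ell_shot():

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable   /* doubles, where supported */

    inline int ell_shot(double3 o, double3 d, double3 c, double3 r,
                        double *t_in, double *t_out)
    {
        double3 oo = (o - c) / r;          /* ray in unit-sphere space */
        double3 dd = d / r;

        double A = dot(dd, dd);
        double B = 2.0 * dot(oo, dd);
        double C = dot(oo, oo) - 1.0;
        double disc = B * B - 4.0 * A * C;
        if (disc < 0.0)
            return 0;                      /* miss */

        double s = sqrt(disc);
        *t_in  = (-B - s) / (2.0 * A);     /* entry distance */
        *t_out = (-B + s) / (2.0 * A);     /* exit distance */
        return 1;
    }

The real work, as noted above, is everything around it: prep, dispatch, weaving,
and proving the results match the C implementation.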

 I haven't checked if this is implemented yet but it would be nice to have 
 some built-in statistics like framerate or geometrical complexity that could 
 be displayed on user demand via the UI.


Re: [brlcad-devel] Real-Time Ray Tracing

2015-02-18 Thread Christopher Sean Morrison

On Feb 19, 2015, at 12:14 AM, Vasco Alexandre da Silva Costa 
vasco.co...@gmail.com wrote:

 I was reading your mailing-list archives and noticed you guys are interested 
 in a real-time RT visualization mode. I am currently doing my PhD on that 
 topic and it just so happens that I ended up bumping into BRL-CAD. I have not 
 made a code review to see what needs to be done.

Hello and welcome!  Glad to see the interest.  If you need help finding your 
way around the code, please don’t hesitate to ask here or on our IRC channel 
(if you’re willing to wait for a reply).  The ray tracing is almost entirely 
concentrated into two directories: src/rt for the front-end ray dispatcher and 
src/librt containing the rest.  Digging down into src/librt/primitives is where 
you’ll find individual entity logic (the familiar shot/intersection functions 
for example).

 AFAIK BRL-CAD uses CSG. So I assume you guys use a classical RT approach 
 where you have a bunch of primitive types and support complex object 
 hierarchies.

It’s a hybrid representation system with different intersection behavior based 
on the geometry representation type (implicit, triangle, polygonal, NURBS, 
volumetric, etc), but you basically have it right.  We do support Boolean 
combinations of any solid entity type against any other.  This involves 
calculating ray intersections and then “weaving” segment partitions together 
based on the CSG operations.   All highly expressive with modern coherency and 
vectorization techniques, but a rather woefully unexplored area of research 
within the academic community. 

 Most modern real-time RT systems only support triangles as a primitive for 
 code simplification and performance reasons. It is not impossible to support 
 multiple primitive types but I do not know how good the performance would be 
 in a system like that. Is your OpenGL backend (I'm assuming you have one for 
 the modeling stage) totally different?

Ever since real-time RT really hit the research scene, circa 2002-2004 with 
Ingo’s thesis and Reshetov’s tracer, I’ve strongly asserted that the exact 
same techniques can apply to a primitive-based system like we have in BRL-CAD 
and that the results would be even more impressive.  We’ve just not had anyone 
try.  The downside is as you describe — that you have to tune a dispatcher, a 
dozen intersection algorithms, a lighting model, and a boolean weaver instead 
of just a dispatcher, one intersection routine, and a lighting model.  This is 
not an intractable difference, though.

In fact, one of the tenets of real-time is actually encouraged by using 
implicit geometry with CSG operations — data coherency.  Implicit CSG geometry 
requires approximately two orders of magnitude less data to express the same 
shape compared to a roughly equivalent triangle mesh.  CSG implicits are also, 
on average, spatially larger than your average triangle and often with fewer 
intersection calculations required (e.g., all the quadratics — cones, 
cylinders, ellipsoids, etc).
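
As a rough back-of-the-envelope illustration (ballpark numbers, not measured
ones): an implicit ellipsoid is essentially a center point plus three axis
vectors, i.e. 12 doubles or roughly 100 bytes, while a mesh faceted finely
enough to look smooth easily runs to a thousand or more triangles, meaning
hundreds of vertices plus face indices and tens of kilobytes of data.  That is
where the two-orders-of-magnitude figure comes from.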

Cumulatively, this means that not only should the same real-time performance 
improvements be realizable, but the results will likely exceed triangle tracing.

For what it’s worth, as a mildly related aside, it’s fun to note that BRL-CAD’s 
ray tracing library is actually credited [1] as being the world’s *first* real 
time ray tracing implementation.  Circa 1987, librt and our remrt/rtsrv tools 
achieved 5-10fps rendering across a network of ridiculously slow machines by 
today’s standards. :)

Cheers!
Sean



[1] http://en.wikipedia.org/wiki/Ray_tracing_(graphics)#In_real_time