Re: [brlcad-devel] Real-Time Ray Tracing
On Mar 21, 2015, at 12:55 PM, Vasco Alexandre da Silva Costa vasco.co...@gmail.com wrote:

> I looked some more at the code and yeah the weave booleans step is branch heavy and has gotos in it. So this is most likely a bad candidate for GPU acceleration since a GPU will have like a quarter of the clockspeed and worse branch prediction hardware than the CPU. We want something that is compute heavy and has low thread divergence.

I wouldn’t discredit it because of how it’s currently implemented. Similar to the various research demonstrations years ago that showed how to encode a (branch-heavy) kd-tree efficiently on the GPU, it should be possible to do the same with Boolean weaving. The algorithm is very similar to depth buffering and depth peeling. Like I said, you could almost certainly land a SIGGRAPH or JCGT paper on this particular topic. It’s an under-explored area of research.

> So I think it is best to focus on ray dispatch+traversal at this time. Generate rays in bundles, traverse some sort of acceleration structure, gather hit candidate ids, and compute hit intersections on the CPU. Not having the primitive tests on the GPU will reduce the performance advantage since we will spend time transmitting data back and forth though.

This also sounds reasonable to me. This will almost certainly involve more modifications to our front-end application logic (and obviously replaced back-end traversal logic). Considerably more code that you’ll have to become familiar with compared with Boolean weaving, but not nearly as complicated or validation-sensitive. In the end, probably a tradeoff wash: less code, harder algorithm vs. more code, easier algorithm. Both highly valuable, so propose whatever piques your interest!

> The shading pass also seems like low hanging fruit. We need to send the image data to the GPU to display anyway so why not let it do some of the work. But this depends on how the code is organized and output is displayed.
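At its core, Boolean weaving amounts to boolean set operations on sorted hit intervals along each ray, which is what makes the depth-peeling analogy apt. A minimal hedged sketch (toy types and names, not librt's actual rt_boolweave() partition structures) of subtracting one solid's in/out span from another's:

```c
#include <stddef.h>

/* Hypothetical sketch: CSG weaving reduced to 1-D interval arithmetic
 * along a ray.  Each solid's hits form [in,out] spans; subtraction
 * A - B clips B's span out of A's.  Illustrative only. */
typedef struct { double in, out; } seg_t;

/* Subtract span b from span a; writes 0..2 surviving spans into out.
 * Returns the number of spans produced. */
static size_t seg_subtract(seg_t a, seg_t b, seg_t out[2])
{
    size_t n = 0;
    if (b.out <= a.in || b.in >= a.out) {      /* disjoint: a survives whole */
        out[n++] = a;
        return n;
    }
    if (b.in > a.in)   out[n++] = (seg_t){ a.in, b.in };   /* left remainder */
    if (b.out < a.out) out[n++] = (seg_t){ b.out, a.out }; /* right remainder */
    return n;
}
```

Union and intersection follow the same pattern over sorted span lists, which is why the control flow can in principle be made as regular (and GPU-friendly) as depth peeling, despite the goto-heavy form of the current C implementation.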
For some shaders, this is certainly possible, but I’m not married to our shader system. More inclined to leverage someone else’s rendering infrastructure (e.g., OSL and/or Appleseed) for anything other than simple Phong/flat shading.

> The main issue is that we would like the shootOneRay-traverseScene-evalHits-weaveBooleans-colorizePixel steps to be done in clearly separated stages for all tested rays, in order to reduce the number of CL kernel calls and maximize coherence, but from what I understand everything currently happens in one giant megakernel. So perhaps a lot of effort should be spent changing this way of things before doing any CL coding at all.

Almost certainly. There’s a front-end layer (rt) that dispatches to a back-end layer (librt) for traversed intersections, then the front-end sends the results to another layer (liboptical) for coloring/effects, and that all is then composited by the front-end into an image (via libicv/libfb). The application layer (used by dozens of applications) is somewhat relevant and described in detail here: http://brlcad.org/wiki/Developing_applications

Cheers!
Sean

--
Dive into the World of Parallel Programming The Go Parallel Website, sponsored by Intel and developed in partnership with Slashdot Media, is your hub for all things parallel software development, from weekly thought leadership blogs to news, videos, case studies, tutorials and more. Take a look and join the conversation now. http://goparallel.sourceforge.net/
___
BRL-CAD Developer mailing list
brlcad-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/brlcad-devel
Re: [brlcad-devel] Real-Time Ray Tracing
On Mar 20, 2015, at 11:29 PM, Vasco Alexandre da Silva Costa vasco.co...@gmail.com wrote:

> I see. Then I need to examine this part better and propose a plan focused on this. I only have CS grad student like knowledge of CSG since most of my experience in RT is with acceleration structures and ray processing, so I am interested in seeing how you do the weaving. If there are any specific optimizations, etc.

Old school optimizations. Complex goto branching and highly tuned tolerance testing/merging/splitting. For a first take, though, a simpler approach could be taken since the performance considerations will be considerably different with coherency and/or a GPU involved.

>> So the only planning concern is that this isn’t an incremental plan. It’s always a big red flag when a proposal has a big “test and document” at the end. If anything goes wrong or takes a couple weeks longer than planned with one of your early steps, we usually end up with something incomplete that cannot be shipped to users.

> I'm used to doing incremental development with continuous integration. I don't know how to do it any other way. I always test every step in order to not break the code. But in my experience something always needs fine tuning or debugging in the end. Also the docs always lag behind everything. Not dedicating time specifically in the plan for docs usually means next to no docs will be produced.

Coding complete isn’t about incremental development or continuous testing. Those are good too. Taken to the extreme, complete means you develop in a manner such that even if tragedy struck (e.g., you got hit by a bus tomorrow), there’s actually nothing anyone would have to undo/fix/finish/revert. You focus on making sure your code is always “done”, in a production-ready state. If writing code were a graph traversal, it’s about going depth-first instead of breadth-first.
Not leaving TODO and FIXME notes for issues directly related to your implementation, making sure your code is always fully compliant with standards, not getting something “good enough” before moving on to the next piece, etc. If you’re complete, docs never lag, implementations are verified as they’re implemented, and reverified as they become integrated with other components. If a callout is warranted in the schedule, there should be several instances of “test, debug, document” throughout your timeline, but still not one big 2.5-week one at the end. That’d be breadth-first waterfall-style cleanup. It’s just not acceptable from a risk-management perspective, and essentially something we’ve learned to not even allow for GSoC.

> I was an OSS project maintainer (Freeciv) for close to 4 years. You never ever break the code with a commit. BTW what's your process for patch code reviews? You discuss patches on the mailing-list?

Ah, a fellow game dev and with OSS experience — fantastic. Please read our developer HACKING guide (in the source tree). It covers patches, commit expectations, and code reviews. It’s an informal process primarily in the hands of existing devs to decide what to review. What is deemed acceptable is discussed throughout HACKING, automated tests, and core developer mindshare. As for commits breaking code, HACKING also talks about that topic. We actually allow a (temporarily) unstable trunk. We have STABLE and RELEASE branches that have much different quality and commit requirements.

> I basically wanted to have a whole (optional) accelerated rendering path from end to end so we could then work on the primitives afterwards when this thing ends. But if weaving is that compute intensive in BRL-CAD you are correct that it may make more sense to initially focus on just that. I have not profiled BRL-CAD librt to know where the perf bottlenecks are but you guys have actual experience with this.

I completely understood why you proposed it that way.
:) The problem is that that approach means you could be 100% successful in your project, even ahead of schedule, and users could end up with no benefit when GSoC is over until enough primitives have a shot routine transcoded, tested, and verified. Some of them will be exceptionally complex to get working in OpenCL, months of effort each. There are about 10 primitives that are in pervasive use — some incredibly complex.

Why would we want optional acceleration? By simply shifting the order/focus around a little bit, we can avoid "optional". It will ensure the focus remains on the user, not the status of the implementation. If you actually accomplish accelerating the whole pipeline by the end, that would be fantastic. If anything takes longer than expected, and it will, everything will still be fine. Remember to account for time throughout to provably demonstrate consistency with the existing implementation, to document progress as you go, to engage in discussions, to answer questions, etc.

As for performance,
Re: [brlcad-devel] Real-Time Ray Tracing
On Mar 21, 2015, at 10:11 AM, Vasco Alexandre da Silva Costa vasco.co...@gmail.com wrote:

> On Sat, Mar 21, 2015 at 7:19 AM, Christopher Sean Morrison brl...@mac.com wrote:
>> If a callout is warranted in the schedule, there should be several instances of “test, debug, document” throughout your timeline, but still not one big 2.5-week one at the end. That’d be breadth-first waterfall-style cleanup. It’s just not acceptable from a risk-management perspective, and essentially something we’ve learned to not even allow for GSoC.
>
> Not necessarily. For example if you issue a code freeze before doing a version bump where you only accept bugfixes and don't accept any feature improvements it doesn't mean you are using a waterfall model. Is the Linux kernel development model a waterfall model? I would say it isn’t.

I would agree. When you have ongoing continuous integration cycles, and depending on the length of the cycles relative to the overall effort invested, it’s arguably not waterfall. When it’s only one cycle, about 17% of the timeline, and at the end — it fits the classic verification/debugging characterization rather well. Regardless, it really doesn’t matter what we call it. It’s highly “undesirable” for GSoC projects. :)

> I will change the planning timetable to work in the fashion you describe.

Appreciated. For everyone following the thread, I will add that discussions like these are good to see. We expect give-and-take compromises and collaborative discussions throughout development from everyone. We certainly won’t always agree, and contrary decisions will suck. It takes time and commitment to earn a meritocratic position (which I’m sure, Vasco, you're well aware of given your OSS experience). With that, however, also comes the ability to influence change and take BRL-CAD in directions we weren’t even considering (looking at all you web devs). :) Networking and building working relationships are really the most important aspects of GSoC. Cheers!
Sean
Re: [brlcad-devel] Real-Time Ray Tracing
On Sat, Mar 21, 2015 at 12:30 AM, Christopher Sean Morrison brl...@mac.com wrote:

> ... Something to consider, you could even propose *only* focusing on this aspect for GSoC. What I mean is that you could spend time restructuring the rt dispatch logic, which is something like forAllPixels-shootOneRay-traverseScene-evalHits-weaveBooleans-colorizePixel that iterates over all pixels depth-first pipeline style. You’d restructure it into something like phase1: forAllPixels-shootOneRay-traverseScene-evalHits then phase2: CLweaveAllBooleans-CLcolorizePixels. ... That’s why I’d suggest either focusing only on boolean weaving, or on bundled ray dispatch+traversal, or hit+result gathering, etc — something that could be put into immediate use, even if it’s not going to give the 10x speedup until the rest of the pipeline is converted. Changing BRL-CAD’s render pipeline to support this style of evaluation is going to be a lot of work.

I looked some more at the code and yeah the weave booleans step is branch heavy and has gotos in it. So this is most likely a bad candidate for GPU acceleration, since a GPU will have like a quarter of the clock speed and worse branch prediction hardware than the CPU. We want something that is compute heavy and has low thread divergence.

So I think it is best to focus on ray dispatch+traversal at this time. Generate rays in bundles, traverse some sort of acceleration structure, gather hit candidate ids, and compute hit intersections on the CPU. Not having the primitive tests on the GPU will reduce the performance advantage, since we will spend time transmitting data back and forth, though.

On Sat, Mar 21, 2015 at 7:19 AM, Christopher Sean Morrison brl...@mac.com wrote:

> As for performance, weaving can be a big or small factor, as it’s heavily dependent on the geometry being rendered, number of primitives, number of regions, depth of hierarchy, number of operations, etc.
> A traditional model like havoc.g or m35.g in our sample db directory should benefit greatly while a pure NURBS model (e.g., no Boolean operations) won’t benefit nearly as much. We’re not going to see huge performance until the whole pipeline is connected, and that’s okay if users get speedups for some geometry situations. Generally speaking, dispatched traversal is usually roughly 5-15%, prep is 0-20% (except brep), shot intersection testing is 30-70%, weaving is 5-30%, and shading is 0-25%. Models with complex entities spend more time in prep and shot; models with complex hierarchies spend more time in weaving and traversal.

The shading pass also seems like low hanging fruit. We need to send the image data to the GPU to display anyway, so why not let it do some of the work. But this depends on how the code is organized and how output is displayed.

The main issue is that we would like the shootOneRay-traverseScene-evalHits-weaveBooleans-colorizePixel steps to be done in clearly separated stages for all tested rays, in order to reduce the number of CL kernel calls and maximize coherence, but from what I understand everything currently happens in one giant megakernel. So perhaps a lot of effort should be spent changing this way of things before doing any CL coding at all.

--
Vasco Alexandre da Silva Costa
PhD Student at Department of Information Systems and Computer Science
Instituto Superior Técnico/University of Lisbon, Portugal
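The separation into clearly delineated stages discussed in this thread (a "phase 1 intersect everything, phase 2 weave/shade everything" split rather than one per-pixel megakernel) can be sketched in miniature. This is a toy illustration under stated assumptions: the "scene" is a stub hit predicate and the types are invented, not librt's actual structures; the point is only the flat hit buffer passed between coherent passes, each of which would map to one CL kernel launch:

```c
#include <stddef.h>

/* One record per ray; a miss is marked with prim_id = -1. */
typedef struct { double t; int prim_id; } hitrec_t;

/* Phase 1: intersect every ray, write results to a flat buffer.
 * Toy scene: "hit at distance x" whenever the ray parameter x > 0. */
static void phase1_intersect(const double *ray_x, size_t n, hitrec_t *hits)
{
    for (size_t i = 0; i < n; i++) {
        hits[i].t       = (ray_x[i] > 0.0) ? ray_x[i] : -1.0;
        hits[i].prim_id = (ray_x[i] > 0.0) ? 0 : -1;
    }
}

/* Phase 2: consume the whole buffer in one branch-light pass
 * (the stage that would weave/colorize on the device). */
static size_t phase2_shade(const hitrec_t *hits, size_t n, unsigned char *pix)
{
    size_t shaded = 0;
    for (size_t i = 0; i < n; i++) {
        pix[i] = (hits[i].prim_id >= 0) ? 255 : 0;
        shaded += (hits[i].prim_id >= 0);
    }
    return shaded;
}
```

The design point is that each phase touches all rays before the next begins, so kernel launches amortize over the whole frame and memory access stays coherent, at the cost of buffering intermediate hit records.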
Re: [brlcad-devel] Real-Time Ray Tracing
On Sat, Mar 21, 2015 at 7:19 AM, Christopher Sean Morrison brl...@mac.com wrote:

> If a callout is warranted in the schedule, there should be several instances of “test, debug, document” throughout your timeline, but still not one big 2.5-week one at the end. That’d be breadth-first waterfall-style cleanup. It’s just not acceptable from a risk-management perspective, and essentially something we’ve learned to not even allow for GSoC.

Not necessarily. For example if you issue a code freeze before doing a version bump where you only accept bugfixes and don't accept any feature improvements it doesn't mean you are using a waterfall model. Is the Linux kernel development model a waterfall model? I would say it isn't.

I will change the planning timetable to work in the fashion you describe.

--
Vasco Alexandre da Silva Costa
PhD Student at Department of Information Systems and Computer Science
Instituto Superior Técnico/University of Lisbon, Portugal
Re: [brlcad-devel] Real-Time Ray Tracing
On Mar 20, 2015, at 7:17 PM, Vasco Alexandre da Silva Costa vasco.co...@gmail.com wrote:

> I came up with this tentative work plan for improving the OpenCL (CL) RT code on BRL-CAD:

Excellent!

> - Hook into shot and db load routines (hooks disabled by default) in order to capture ellipsoid primitives and shots into the CL side. Get familiar with these top level interfaces of the code. (2 weeks)

This might require a little more inspection. For places where you might “get opencl ready”, I would expect you to hook in when a raytrace context is created (e.g., rt_dirbuild()) or during geometry prep, which every object implements and which is called during directory building. If you see the rt_ell_prep() function in the opencl branch, that’s effectively where this is already happening for our quick test.

> - CL megakernel for boolean operations. I need to check how involved this will be to have a more accurate time estimate. Probably requires storing some primitive hierarchy on CL. Integration with the C++ side of things for other primitives may be problematic. (2 weeks?)

This may be a large task, possibly the bulk of your timeframe, as boolean weaving is one of the most complex aspects in BRL-CAD. Understanding it is almost certainly going to take you some time. However, you’ll probably be able to publish a paper on it when you’re done! :)

Something to consider: you could even propose *only* focusing on this aspect for GSoC. What I mean is that you could spend time restructuring the rt dispatch logic, which is something like forAllPixels-shootOneRay-traverseScene-evalHits-weaveBooleans-colorizePixel and iterates over all pixels depth-first pipeline style. You’d restructure it into something like phase1: forAllPixels-shootOneRay-traverseScene-evalHits then phase2: CLweaveAllBooleans-CLcolorizePixels.
If you got it working, you’d speed up ray tracing somewhere between 25-50% (as that is about how much time Boolean weaving takes) for all geometry, not just ellipsoids, and it could go into immediate production use. It’d be in users’ hands. More on this later. Time permitting, you could then work on the rest of the pipeline, like implementing the top-level bundled dispatch, scene traversal acceleration, and primitive shot() routines.

> - Implement CL rectilinear grids [1] (an improvement on the Mike Gigante NUgrid currently used in BRL-CAD): spatial partitioning construction and traversal. Should reuse most of the regular grid construction code, but requires some extra construction steps and has a different traversal scheme. (2.5 weeks)

Sounds good, but know that we don’t actually use NUgrid. It’s faster for some scenes, slower for others — overall a wash. I think there’s also a bug somewhere in there. We use a custom BSP traversal for production work that is much more thoroughly tested and robust. Grids generally do much better on the GPU than they do on the CPU.

> - Cleanups, bugfixes, final tests, docs. (2.5 weeks)

So the only planning concern is that this isn’t an incremental plan. It’s always a big red flag when a proposal has a big “test and document” at the end. If anything goes wrong or takes a couple weeks longer than planned with one of your early steps, we usually end up with something incomplete that cannot be shipped to users. Assume something will go wrong, that it will take longer, or your computer will explode next week. How would you change your approach so that users still get something? If you had to immediately stop working, there should be something usable (without requiring additional work by you or others to make it usable).
This is an aspect of “coding complete” mentioned on http://brlcad.org/wiki/Google_Summer_of_Code/Acceptance#Write_complete_code

Basically, this means thinking through how you can incrementally break up the task into phases where each phase gets tested and could be put into production use without detriment. It not only means you’re continually cleaning up, testing, and updating docs; it means you’re continually focused on the user and their benefit instead of your development comfort. If that means getting a little less done, that’s okay. Users will love you for it.

My other concern was that your objective doesn’t result in a feature that affects users. Yay, devs can raytrace ellipsoids MUCH faster … and users see nothing. That’s why I’d suggest either focusing only on boolean weaving, or on bundled ray dispatch+traversal, or hit+result gathering, etc. — something that could be put into immediate use, even if it’s not going to give the 10x speedup until the rest of the pipeline is converted. Changing BRL-CAD’s render pipeline to support this style of evaluation is going to be a lot of work.

Still, outstanding proposal progress. This and other proposals are making me pretty darn excited! Cheers!
Sean

p.s. Some technical libraries to consider in lieu of directly binding to OpenCL:
Re: [brlcad-devel] Real-Time Ray Tracing
On Sat, Mar 21, 2015 at 12:30 AM, Christopher Sean Morrison brl...@mac.com wrote:

> ... This may be a large task, possibly the bulk of your timeframe, as boolean weaving is one of the most complex aspects in BRL-CAD. Understanding it is almost certainly going to take you some time. However, you’ll probably be able to publish a paper on it when you’re done! :) Something to consider, you could even propose *only* focusing on this aspect for GSoC. What I mean is that you could spend time restructuring the rt dispatch logic, which is something like forAllPixels-shootOneRay-traverseScene-evalHits-weaveBooleans-colorizePixel that iterates over all pixels depth-first pipeline style. You’d restructure it into something like phase1: forAllPixels-shootOneRay-traverseScene-evalHits then phase2: CLweaveAllBooleans-CLcolorizePixels.

I see. Then I need to examine this part better and propose a plan focused on this. I only have CS grad student like knowledge of CSG since most of my experience in RT is with acceleration structures and ray processing, so I am interested in seeing how you do the weaving. If there are any specific optimizations, etc.

> If you got it working, you’d speed up ray tracing somewhere between 25-50% (as that is about how much time Boolean weaving takes) for all geometry, not just ellipsoids and it could go into immediate production use. It’d be in users hands. More on this later.

Yeah this seems to be a good idea. I wanted to do something that would leave an actual performance improvement for users in the end. If this is the critical compute-intensive task then it is what needs to be addressed.

> With time-permitting, you could then work on the rest of the pipeline like implementing the top-level bundled dispatch and scene traversal acceleration, and primitive shot() routines. ... Sounds good, but know that we don’t actually use NUgrid. It’s faster for some scenes, slower for others — overall a wash. I think there’s also a bug somewhere in there.
> We use a custom BSP traversal for production work that is much more thoroughly tested and robust. Grids generally do much better on the GPU than they do on the CPU.

That is news to me. When I glanced at librt/shoot.c I saw a lot of references to 'nugrid'. There's only one match for 'nubsp' and it is in a comment. I guess I need to start peeling this and see what is actually being called.

>> - Cleanups, bugfixes, final tests, docs. (2.5 weeks)
>
> So the only planning concern is that this isn’t an incremental plan. It’s always a big red flag when a proposal has a big “test and document” at the end. If anything goes wrong or takes a couple weeks longer than planned with one of your early steps, we usually end up with something incomplete that cannot be shipped to users.

I'm used to doing incremental development with continuous integration. I don't know how to do it any other way. I always test every step in order to not break the code. But in my experience something always needs fine tuning or debugging in the end. Also the docs always lag behind everything. Not dedicating time specifically in the plan for docs usually means next to no docs will be produced.

> Assume something will go wrong, that it will take longer, or your computer will explode next week. How would you change your approach so that users still get something? If you had to immediately stop working, there should be something usable (without requiring additional work by you or others to make it usable). This is an aspect of “coding complete” mentioned on http://brlcad.org/wiki/Google_Summer_of_Code/Acceptance#Write_complete_code

I was an OSS project maintainer (Freeciv) for close to 4 years. You never ever break the code with a commit. BTW what's your process for patch code reviews? You discuss patches on the mailing-list?

> Basically, this means thinking through how you can incrementally break up the task into phases where each phase gets tested and could be put into production use without detriment.
> It not only means you’re continually cleaning up, testing, and updating docs, it means you’re continually focused on the user and their benefit instead of your development comfort. If that means getting a little less done, that’s okay. Users will love you for it. My other concern was that your objective doesn’t result in a feature that affects users. Yay, devs can raytrace ellipsoids MUCH faster … and users see nothing. That’s why I’d suggest either focusing only on boolean weaving, or on bundled ray dispatch+traversal, or hit+result gathering, etc — something that could be put into immediate use, even if it’s not going to give the 10x speedup until the rest of the pipeline is converted. Changing BRL-CAD’s render pipeline to support this style of evaluation is going to be a lot of work.

I basically wanted to have a whole (optional) accelerated rendering path from end to end so we could then work on the primitives
Re: [brlcad-devel] Real-Time Ray Tracing
You mentioned that:

> There is some OpenCL prototype code in there for testing intersections with ellipsoids. But this kind of architecture is not going to work performance wise. It is allocating buffers and sending geometry data to the device (GPU or whatever) on every single ray/primitive intersection call (src/librt/primitives/sph/sph.c:clt_shot). I also checked the main intersection code and it uses a function table to figure out which ray/primitive intersection routine to call (src/librt/primitives/table.c)

Since this subject talks about performance I was going to ask:

1. First, do you have a BRL-CAD manual or guide book?
2. Is the main ray tracing written from OpenCL library/code? And does this affect the following files: src/other/opennurbs, src/librt/primitives/brep, src/librt/opennurbs_ext.*, src/proc-db/csgbrep.cpp, src/proc-db/brep*
3. Can we modify/edit these libraries, such as OpenCL, openNURBS, and Rhino3D's, or should we write a ray tracer from scratch, say if we need to optimize it?
4. Is NURBS used to find ray intersections on objects? And what about the OpenCL code, does it do the same thing as NURBS? Which one is better?

I am interested in working on the NURBS Optimization and Cleanup and I would like to optimize its algorithm, as this often plays a pivotal role in improving performance. This is just one classic place to optimize. To optimize the ray tracer, I am planning on conducting a couple of these steps:

1. Implementing a kd-tree structure.
2. Implementing the Bezier clipping method for NURBS surfaces.
3. Implementing BVH ray traversal.
4. Implementing a bounding volume hierarchy.
5. Fine-tuning a NURBS curve by specifying different weights with the knot vector.

Bounding Volume Hierarchy: In order to reduce intersection cost, a bounding volume should be drawn around each polygon.
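Whatever hierarchy is used (kd-tree or BVH), traversal bottoms out in a ray-versus-axis-aligned-box test. A hedged sketch of the standard "slab" method (illustrative code, not librt's actual box test); precomputing the reciprocal direction lets axis-aligned rays fall out naturally via infinities:

```c
#include <math.h>

/* Slab-method ray/AABB intersection.
 * o     = ray origin, inv_d = componentwise 1/direction,
 * lo,hi = box corners.  Returns 1 on hit, 0 on miss. */
static int ray_aabb(const double o[3], const double inv_d[3],
                    const double lo[3], const double hi[3])
{
    double tmin = 0.0, tmax = INFINITY;
    for (int a = 0; a < 3; a++) {
        double t0 = (lo[a] - o[a]) * inv_d[a];
        double t1 = (hi[a] - o[a]) * inv_d[a];
        if (t0 > t1) { double tmp = t0; t0 = t1; t1 = tmp; }
        if (t0 > tmin) tmin = t0;          /* latest entry   */
        if (t1 < tmax) tmax = t1;          /* earliest exit  */
        if (tmax < tmin) return 0;         /* slabs disjoint: miss */
    }
    return 1;
}
```

This branch-light inner loop is precisely why BVH and grid traversal map so well to GPUs, in contrast to the goto-heavy weaving code discussed earlier in the thread.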
You also mentioned that:

> The way ray tracing currently occurs, there’s even room to speed up primitives on the host (CPU) by an order of magnitude or better by ensuring better data coherency and eliminating branch logic.

Because you mentioned it, I would like to look further in this area.

From,
Benson

On Wednesday, March 4, 2015 11:25 PM, Christopher Sean Morrison brl...@mac.com wrote:

On Mar 4, 2015, at 6:06 PM, Vasco Alexandre da Silva Costa vasco.co...@gmail.com wrote:

> I have been reading the 'librt' source code.

Excellent!

> There is some OpenCL prototype code in there for testing intersections with ellipsoids. But this kind of architecture is not going to work performance wise. It is allocating buffers and sending geometry data to the device (GPU or whatever) on every single ray/primitive intersection call (src/librt/primitives/sph/sph.c:clt_shot). I also checked the main intersection code and it uses a function table to figure out which ray/primitive intersection routine to call (src/librt/primitives/table.c)…

Astute observations. The OpenCL code was intentionally structured that way, not at all for performance obviously, but so we could perform a 1-to-1 comparison to validate that the implementation behavior was correct. Our librt ray tracing logic is integrated in a number of scientific analysis applications, so we go to great lengths to ensure we consistently report hits/misses and that the ray intersection implementation is correct, especially given inaccurate floating point math. Indeed, the first OpenCL translation of the ellipsoid logic — which seemed 100% identical to the C code — was wrong. It resulted in different calculations. The issues were sorted out a few days later, but this level of rigor and validation is necessary. Of course, this is not the desired end-state, but it was done to easily test the implementation. The inefficient buffer allocation overhead was also intentional (and also temporary), just to get quick numbers on OpenCL’s overhead costs.
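The function-table dispatch referred to above (src/librt/primitives/table.c) follows a common C pattern: each primitive type exposes a shot() routine through a table of function pointers, and the tracer calls through it per intersection. A toy sketch of the pattern, with invented stub entries (only the table idea mirrors librt; none of these routines are real):

```c
#include <stddef.h>

typedef struct ray { double o[3], d[3]; } ray_t;

/* One shot() routine per primitive type; returns nonzero on a hit. */
typedef int (*shot_fn)(const ray_t *rp);

/* Stub implementations standing in for real per-primitive tests. */
static int sph_shot(const ray_t *rp) { (void)rp; return 1; }
static int tor_shot(const ray_t *rp) { (void)rp; return 0; }

/* Dispatch table indexed by primitive type id. */
static const shot_fn shot_table[] = { sph_shot, tor_shot };

static int shoot(int prim_type, const ray_t *rp)
{
    /* The indirect call is cheap on a CPU but divergent on a GPU,
     * which is one motivation for sorting/bucketing rays by primitive
     * type before a bundled device-side dispatch. */
    return shot_table[prim_type](rp);
}
```

On the device side, this per-ray indirection is typically replaced by batching: gather all candidate (ray, primitive) pairs of one type, then run that type's kernel over the whole batch.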
Basically, it was a simple proof-of-concept and it was successful. The next steps are to create a new intersection pipeline that is coherent from start to end, with bundles of rays getting dispatched and evaluated against primitives without excessive branching and cache misses.

> Then I did some 'wc -l' runs to see how much effort it would take to recode parts of the pipeline:
> src/librt/*.c - 33516 lines
> src/librt/primitives/*.c - 4653 lines
> src/librt/primitives/*/*.c - 141351 lines

This is all the back-end portion of the pipeline. You’ll also need to consider the front-end where ray dispatching occurs in src/rt/(do|main|opt|view|worker).c That said, it’s both better and worse than those numbers make it seem. You’ve encompassed far too much code in librt, much of it related to .g file management. You’re also not looking at the code that some of those primitives rely on. For example, the “brep” primitive is built on src/libbrep and src/other/openNURBS, and some of the primitives rely on
Re: [brlcad-devel] Real-Time Ray Tracing
On Fri, Mar 6, 2015 at 5:03 PM, benson chepkwony bchepk...@att.net wrote:

> 3. Can we Modify/Edit these libraries such as: OpenCL, Opennurbs and Rhino3D folks or should we write a ray tracer from scratch, let say if we need to optimize it?
> 4. Is NURBS used to find ray intersection on objects? and what about OpenCL Code does it also do the same thing as NURBS? which one is better?

NURBS are just another primitive. The BRL-CAD rendering architecture uses the ray tracing algorithm, so it has an implementation of a ray/NURBS intersection algorithm in order to render NURBS. OpenCL is just a C-like programming language with vector processing extensions.

> I am interested in working on the NURBS Optimization and Cleanup and I would like to optimize its algorithm as this often plays a pivotal role in improving performance. This is just one classic place to optimize. To optimize ray tracer, I am planning on conducting a couple of these steps: 1. Implementing Kd-tree structure. 2. Implementing Bezier clipping method for NURBS Surface 3. Implementing BVH Ray traversal 4. Implementing Bounding Volume Hierarchy 5. Fine-tuning a NURBS-curve by specifying different weight with knot-vector Bounding Volume Hierarchy: In order to reduce intersection cost, a bounding volume should be drawn around each polygon.

You seem to be assuming BRL-CAD triangulates the NURBS surfaces before the ray-tracing step, but I am not sure that is what they are doing.

> You also mentioned that: The way ray tracing currently occurs, there’s even room to speed up primitives on the host (CPU) by an order of magnitude or better ensuring better data coherency and eliminating branch logic. Because you mentioned it, I would like to look further in this area.

If you want to work on that, check out these papers:

- Design for Parallel Interactive Ray Tracing Systems
- An Application of Scalable Massive Model Interaction using Shared-Memory Systems

which describe the implementation of the Manta ray tracer.
Manta is an open-source parallel ray tracer with ray bundle optimizations. -- Dive into the World of Parallel Programming The Go Parallel Website, sponsored by Intel and developed in partnership with Slashdot Media, is your hub for all things parallel software development, from weekly thought leadership blogs to news, videos, case studies, tutorials and more. Take a look and join the conversation now. http://goparallel.sourceforge.net/ ___ BRL-CAD Developer mailing list brlcad-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/brlcad-devel
Re: [brlcad-devel] Real-Time Ray Tracing
On Mar 6, 2015, at 12:58 PM, Vasco Alexandre da Silva Costa vasco.co...@gmail.com wrote:

You seem to be assuming BRL-CAD triangulates the NURBS surfaces before the ray-tracing step but I am not sure that is what they are doing.

That is most definitely not what we do.

If you want to work on that check out these papers: Design for Parallel Interactive Ray Tracing Systems; An Application of Scalable Massive Model Interaction using Shared-Memory Systems; which describe the implementation of the Manta ray tracer. Manta is an open-source parallel ray tracer with ray bundle optimizations.

Excellent references, thanks for sharing. Another is Intel's Embree ray tracer: https://software.intel.com/en-us/articles/embree-photo-realistic-ray-tracing-kernels

Cheers! Sean
Re: [brlcad-devel] Real-Time Ray Tracing
On Mar 5, 2015, at 9:27 AM, Vasco Alexandre da Silva Costa vasco.co...@gmail.com wrote:

I see. But double precision floating point math is a lot slower than single precision on certain architectures like the lower end NVIDIA cards. e.g. the GeForce GTX 780 Ti has 5048 SP GFLOPS but only 210 DP GFLOPS. Even on the higher end NVIDIA cards the DP FLOPS are like a third of the SP FLOPS. The rendering system should be tunable for either speed or accuracy. That kind of complicates things. As a first approach we can work only on getting the double precision to work but the performance will suffer a great deal.

Absolutely understood, but it's not just speed vs accuracy. If a ray that should mathematically hit an object reports a miss, we will produce incorrect analysis results, which could lead to incorrect decisions that affect manufacturing designs or even impact lives. Imagine a calculator that will report 2+2=[5,4,3] VERY quickly, much faster than one that reports 2+2=4 every time. Would that be very useful? Perhaps for some situations, but not for most math problems. That's kind of our situation here. We don't do ray tracing for pretty pictures. Fast and wrong is useless for our purposes.

If single precision can be made to match results, that will obviously be more desirable and faster. Frankly, I don't care if it's single or double precision, but the behavior of a ray that should hit or miss must be demonstrably mathematically correct, or at least consistent with current behavior.

Cheers! Sean
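The hit-vs-miss consistency concern above can be made concrete. In a ray/sphere test, hit or miss comes down to the sign of a quadratic discriminant, and for a near-grazing ray single and double precision can disagree about that sign. A contrived sketch with invented helper names (not librt code):

```c
#include <assert.h>

/* Illustrative sketch, not librt code: the discriminant of the
 * ray/sphere quadratic, b*b - 4*a*c, evaluated in single vs double
 * precision.  For a near-grazing ray the two can disagree on the
 * sign of the discriminant, i.e. on hit vs miss. */
float disc_sp(float a, float b, float c)
{
    return b * b - 4.0f * a * c;
}

double disc_dp(double a, double b, double c)
{
    return b * b - 4.0 * a * c;
}
```

With a = 1, b = 1e4, c = 25000000.5, the double-precision discriminant is exactly -2 (a miss), while in single precision c rounds to 25000000, the subtraction cancels exactly to 0, and the ray grazes (a hit). This is why matching or bounding the existing double-precision behavior matters more than raw GFLOPS.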
Re: [brlcad-devel] Real-Time Ray Tracing
On Thu, Mar 5, 2015 at 6:24 AM, Christopher Sean Morrison brl...@mac.com wrote:

Astute observations. The OpenCL code was intentionally structured that way, not at all for performance obviously, but so we could perform a 1-1 comparison to validate that the implementation behavior was correct. Our librt ray tracing logic is integrated in a number of scientific analysis applications, so we go to great lengths to ensure we consistently report hits/misses, that the ray intersection implementation is correct especially given inaccurate floating point math. Indeed, the first OpenCL translation of the ellipsoid logic — which seemed 100% identical to the C code — was wrong. It resulted in different calculations. The issues were sorted out a few days later, but this level of rigor and validation is necessary. Of course, this is not the desired end-state, but was done to easily test the implementation. The inefficient buffer allocation overhead was also intentional (and also temporary) just to get quick numbers on OpenCL's overhead costs.

I see. But double precision floating point math is a lot slower than single precision on certain architectures like the lower end NVIDIA cards. e.g. the GeForce GTX 780 Ti has 5048 SP GFLOPS but only 210 DP GFLOPS. Even on the higher end NVIDIA cards the DP FLOPS are like a third of the SP FLOPS. The rendering system should be tunable for either speed or accuracy. That kind of complicates things. As a first approach we can work only on getting the double precision to work but the performance will suffer a great deal.
-- Vasco Alexandre da Silva Costa PhD Student at Department of Information Systems and Computer Science Instituto Superior Técnico/University of Lisbon, Portugal
Re: [brlcad-devel] Real-Time Ray Tracing
On Mar 4, 2015, at 6:06 PM, Vasco Alexandre da Silva Costa vasco.co...@gmail.com wrote:

I have been reading the 'librt' source code.

Excellent!

There is some OpenCL prototype code in there for testing intersections with ellipsoids. But this kind of architecture is not going to work performance wise. It is allocating buffers and sending geometry data to the device (GPU or whatever) on every single ray/primitive intersection call (src/librt/primitives/sph/sph.c:clt_shot). I also checked the main intersection code and it uses a function table to figure out which ray/primitive intersection routine to call (src/librt/primitives/table.c)…

Astute observations. The OpenCL code was intentionally structured that way, not at all for performance obviously, but so we could perform a 1-1 comparison to validate that the implementation behavior was correct. Our librt ray tracing logic is integrated in a number of scientific analysis applications, so we go to great lengths to ensure we consistently report hits/misses, that the ray intersection implementation is correct especially given inaccurate floating point math. Indeed, the first OpenCL translation of the ellipsoid logic — which seemed 100% identical to the C code — was wrong. It resulted in different calculations. The issues were sorted out a few days later, but this level of rigor and validation is necessary. Of course, this is not the desired end-state, but was done to easily test the implementation. The inefficient buffer allocation overhead was also intentional (and also temporary) just to get quick numbers on OpenCL's overhead costs.

Basically, it was a simple proof-of-concept and it was successful. The next steps are to create a new intersection pipeline that is coherent from start to end, with bundles of rays getting dispatched and evaluated against primitives without excessive branching and cache misses.
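The function-table dispatch noted above (src/librt/primitives/table.c) follows the classic per-primitive pattern sketched below. The names here are invented for illustration; librt's real table has many more entries and a different layout:

```c
#include <assert.h>

/* Hypothetical sketch of per-primitive dispatch in the spirit of
 * src/librt/primitives/table.c (names invented, not librt's layout).
 * Each primitive type registers its own shot routine and the tracer
 * dispatches through the table. */
struct ray { double org[3], dir[3]; };

typedef int (*shot_fn)(const struct ray *r, const void *prim);

/* Stub shot routines standing in for real intersection code. */
int sph_shot(const struct ray *r, const void *prim) { (void)r; (void)prim; return 1; }
int tor_shot(const struct ray *r, const void *prim) { (void)r; (void)prim; return 0; }

const shot_fn shot_table[] = { sph_shot, tor_shot };

/* One indirect call per ray/primitive pair: exactly the branchy,
 * cache-unfriendly pattern the thread proposes to replace with
 * batched, type-sorted evaluation. */
int shoot(const struct ray *r, int prim_type, const void *prim)
{
    return shot_table[prim_type](r, prim);
}
```

Sorting candidate intersections by primitive type before evaluation would turn this scattered indirect call into tight per-type loops, which is what a coherent GPU pipeline needs.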
Then I did some 'wc -l' runs to see how much effort it would take to recode parts of the pipeline:
src/librt/*.c - 33516 lines
src/librt/primitives/*.c - 4653 lines
src/librt/primitives/*/*.c - 141351 lines

This is all the back-end portion of the pipeline. You'll also need to consider the front-end where ray dispatching occurs in src/rt/(do|main|opt|view|worker).c

That said, it's both better and worse than those numbers make it seem. You've encompassed far too much code in librt, much of it related to .g file management. You're also not looking at the code that some of those primitives rely on. For example, the "brep" primitive is built on src/libbrep and src/other/openNURBS, and some of the primitives rely on our src/libbn math library extensively for polynomial root solving.

Most code is ray/primitive intersection routines. I have to read some more to see how the librt object database is stored, updated, and accessed. But I think the most viable architecture is probably going to be something like this: we store some accelerated primitives on the device (GPU) and unaccelerated primitives on the host (CPU). Then when we do the ray shooting we perform it on both databases and merge the results. In theory, the more accelerated primitives (GPU) we have, the smaller the unaccelerated primitive database (CPU) will be, and the faster the ray shooting will be. As a proof of concept we can reuse the accelerated ellipsoid intersection code over this architecture.

The way ray tracing currently occurs, there's even room to speed up primitives on the host (CPU) by an order of magnitude or better by ensuring better data coherency and eliminating branch logic. This will require modifying the dispatcher (src/rt/worker.c), the shot routines (src/librt/primitives/*/*.c), the boolean evaluator (src/librt/bool.c), and the optical renderer (src/rt/view.c and src/rt/do.c).

On a minor related note, the implementation should not ever assume or require the availability of a GPU.
It can and should leverage GPU cores when they are available. Again, there are substantial performance gains to be had even on the CPU.

We need to perform some kind of analysis on which primitives people are most likely to use and work on accelerating them. So basically we need some statistics on a corpus of scenes, i.e. how many total primitives each model has and how many primitives of each type each scene has as a percentage.

Also, while we have 24+ different primitives and intersection routines, the most important are just these six: ell, tgc, arb8, tor, bot, and brep. For GSoC, I would recommend focusing on only the first four, or even just ell since its shot routine is already complete and validated, so you could focus on getting the rest of the pipeline working.

I haven't checked if this is implemented yet, but it would be nice to have some built-in statistics like framerate or geometrical complexity that could be displayed on user demand via the UI.
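The hybrid device/host split proposed above needs a merge step once both databases have been shot. For a simple first-hit query it could look like the sketch below; all names are hypothetical, and a full CSG pipeline would merge whole segment lists from both sides rather than single hits:

```c
#include <assert.h>

/* Sketch of the merge step for the proposed hybrid shoot: trace the
 * ray against the GPU-resident (accelerated) primitives and the
 * CPU-resident (unaccelerated) ones, then keep the nearer hit.
 * Names are hypothetical, not librt's. */
struct hit {
    double dist;   /* distance along the ray; < 0.0 encodes a miss */
    int prim_id;   /* which primitive was hit */
};

struct hit merge_hits(struct hit gpu, struct hit cpu)
{
    if (gpu.dist < 0.0) return cpu;             /* GPU set missed */
    if (cpu.dist < 0.0) return gpu;             /* CPU set missed */
    return (gpu.dist <= cpu.dist) ? gpu : cpu;  /* keep the nearer hit */
}
```

As the proportion of accelerated primitive types grows, the CPU-side list shrinks and the merge increasingly just passes the GPU result through.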
Re: [brlcad-devel] Real-Time Ray Tracing
On Feb 19, 2015, at 12:14 AM, Vasco Alexandre da Silva Costa vasco.co...@gmail.com wrote:

I was reading your mailing-list archives and noticed you guys are interested in a real-time RT visualization mode. I am currently doing my PhD on that topic and it just so happens that I ended up bumping into BRL-CAD. I have not yet done a code review to see what needs to be done.

Hello and welcome! Glad to see the interest. If you need help finding your way around the code, please don't hesitate to ask here or on our IRC channel (if you're willing to wait for a reply). The ray tracing is almost entirely concentrated in two directories: src/rt for the front-end ray dispatcher and src/librt containing the rest. Digging down into src/librt/primitives is where you'll find the individual entity logic (the familiar shot/intersection functions, for example).

AFAIK BRL-CAD uses CSG. So I assume you guys use a classical RT approach where you have a bunch of primitive types and support complex object hierarchies.

It's a hybrid representation system with different intersection behavior based on the geometry representation type (implicit, triangle, polygonal, NURBS, volumetric, etc.), but you basically have it right. We do support Boolean combinations of any solid entity type against any other. This involves calculating ray intersections and then "weaving" segment partitions together based on the CSG operations. All highly expressive with modern coherency and vectorization techniques, but a rather woefully unexplored area of research within the academic community.

Most modern real-time RT systems only support triangles as a primitive, for code simplification and performance reasons. It is not impossible to support multiple primitive types, but I do not know how good the performance would be in a system like that. Is your OpenGL backend (I'm assuming you have one for the modeling stage) totally different?
Ever since real-time RT really hit the research scene, circa 2002-2004 with Ingo's thesis and Reshetov's tracer, I've strongly asserted that the exact same techniques can apply to a primitive-based system like we have in BRL-CAD and that the results would be even more impressive. We've just not had anyone try. The downside is as you describe — that you have to tune a dispatcher, a dozen intersection algorithms, a lighting model, and a boolean weaver instead of just a dispatcher, one intersection routine, and a lighting model. This is not an intractable difference though.

In fact, one of the tenets of real-time, data coherency, is actually encouraged by using implicit geometry with CSG operations. Implicit CSG geometry requires approximately two orders of magnitude less data to express the same shape compared to a roughly equivalent triangle mesh. CSG implicits are also, on average, spatially larger than your average triangle, often with fewer intersection calculations required (e.g., all the quadratics: cones, cylinders, ellipsoids, etc.). Cumulatively, this means that not only should the same real-time performance improvements be realizable, but the results will likely exceed triangle tracing.

For what it's worth, as a mildly related aside, it's fun to note that BRL-CAD's ray tracing library is actually credited [1] as being the world's *first* real-time ray tracing implementation. Circa 1987, librt and our remrt/rtsrv tools achieved 5-10fps rendering across a network of machines that are ridiculously slow by today's standards. :)

Cheers! Sean

[1] http://en.wikipedia.org/wiki/Ray_tracing_(graphics)#In_real_time
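As an aside on the boolean weaver mentioned in this thread: in one dimension, "weaving" reduces to Boolean operations on the in/out intervals each solid produces along a ray. Below is a minimal sketch of just the two-segment INTERSECT core with invented names; librt's actual bool.c weaves entire partition lists for arbitrary Boolean trees:

```c
#include <assert.h>

/* One-dimensional sketch of Boolean "weaving": each solid yields
 * in/out intervals (segments) along the ray, and the CSG operators
 * combine them.  This is only the two-segment INTERSECT core, with
 * invented names; librt's bool.c handles full partition lists.
 * Returns 1 if the resulting segment is non-empty. */
struct seg {
    double in, out;   /* ray distances where the solid is entered/exited */
};

int seg_intersect(struct seg a, struct seg b, struct seg *res)
{
    res->in  = (a.in  > b.in)  ? a.in  : b.in;    /* later entry wins  */
    res->out = (a.out < b.out) ? a.out : b.out;   /* earlier exit wins */
    return res->in < res->out;
}
```

UNION and SUBTRACT follow the same interval-arithmetic shape, which is why the step resembles depth peeling and should be expressible without the branch-heavy goto structure of the current CPU code.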