Brecht van Lommel posted his highlights from SIGGRAPH. As this may be a sign of things to come in Blender, I thought you'd be interested in reading them.
Some notes I took during SIGGRAPH, mostly technical and render related stuff that I found interesting. For more in-depth stuff check the collection of links here.
Pixar is open sourcing OpenSubdiv mainly to push it as *the* subdivision surface standard, and will be proposing to include it in OpenGL later this year. They have a lot invested in their workflow, e.g. modeling with creases and fewer edge loops, and think it's worth the investment to ensure they do not have to switch to some other standard that could win later on.
There are still some things missing in the library that we need for Blender, in particular non-manifold meshes and smooth UVs. These are planned to be added by Dreamworks, along with some performance optimizations.
For Blender we also still have to see how this all fits with displacement and multires sculpting. My guess is that for display in sculpt mode itself OpenSubdiv is not so useful, as you are editing the tessellated vertex positions directly there and retessellating them each time will not help. You'd need to bake down the displacement each time which is just going to make things slower. Outside of sculpt mode, doing the subdivision surfaces on the GPU with or without displacement should be quite nice, especially for animators to see realtime playback of the full subdivided and displaced mesh.
SUBD AND RENDERING
There is an interesting difference between the workflows for Pixar Renderman and Arnold. It seems that Pixar is relying a lot on having good lower poly subdivision surfaces with detail added through subdivision and displacement. Ptex needs such base meshes to do mipmapping well. When the base mesh is super high poly and every primitive is a separate Ptex face, it may not be as good at texture filtering. The Renderman multiresolution shading cache also needs a base mesh to store the coarse level of the cache, and if that's too fine performance would drop. On the other hand it was mentioned that Arnold is happy to just get tessellated very high poly meshes, partially because the subd isn't as good but also because it doesn't do caching anyway. In Arnold reducing shading cost for secondary bounces seems to be mostly done by having a simpler version of the shader / ray differentials.
So this is an interesting difference that we might want to think about for Blender/Cycles algorithm choices too. Things like dynamic topology sculpting suit the Arnold workflow, whereas multires sculpting fits the Pixar workflow. For the Pixar workflow with dynamic topology sculpts it seems that good retopology and remeshing is needed. Is it worth adding special optimizations for subd surfaces all the way into the rendering kernel, or do we focus on handling triangle/quad soups really well with compression, SIMD, etc? We can look into a few higher level primitives to support in the BVH, like triangle strips, quad strips and grids, but try to make those things work also when not using subd.
Further, automatically picking the right subdivision level for global illumination is still an unsolved problem. There is usually a choice per object to do it with a fixed user specified subd level or world space subd refinement, or to do it adaptively based on screen projection for certain special cases. It's basically up to the artists to do this well. Automatic tessellation in screen space in the style of Renderman may be becoming a bit less useful with GI. I would like an automatic solution here but couldn't find anyone who had something like that.
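To make the screen projection option concrete, here is a minimal sketch of how a per-object subd level could be picked from the projected edge size. The function names, the 1 pixel target and the level cap are assumptions for illustration, not taken from Renderman, Arnold or Cycles. It only considers the camera, which is exactly why it says nothing useful about geometry that is mainly seen through GI bounces.

```python
import math

def projected_edge_pixels(edge_length, distance, fov_y_radians, image_height_px):
    """Approximate on-screen length in pixels of an edge facing the camera."""
    # Height of the view frustum at this distance, in world units.
    frustum_height = 2.0 * distance * math.tan(fov_y_radians / 2.0)
    return edge_length / frustum_height * image_height_px

def screen_space_subd_level(edge_length, distance, fov_y_radians,
                            image_height_px, target_px=1.0, max_level=6):
    """Smallest level whose subdivided edges are at most target_px long."""
    pixels = projected_edge_pixels(edge_length, distance,
                                   fov_y_radians, image_height_px)
    if pixels <= target_px:
        return 0
    # Each subdivision level halves the edge length.
    level = math.ceil(math.log2(pixels / target_px))
    return min(level, max_level)
```

An object that moves away from the camera gets a lower level with this rule, even if it still contributes a lot of indirect light, so the artist would still have to override it in those cases.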
Also interesting is that for some studios the base meshes already have enough resolution to render mostly without subdivision. Sometimes the detail is needed in the base mesh for physics simulation to show enough detail. It depends a lot on the modelling and rigging workflow, some people like to add lots of edge loops and others not.
PIXAR LIGHT SAMPLING
The difference between Renderman and Arnold style sampling is quite interesting, and you can't easily switch between the two. With Arnold every AA sample will result in relatively few diffuse or glossy samples, whereas for Renderman it will take enough samples to be noise free for each point on the micropolygon grid.
Renderman can do this because they decouple visibility from shading and use a shading cache for indirect bounces. The downside is that you're potentially doing too much shading, especially if your base meshes are quite fine. The upside is that you can use tricks like importance resampling to reduce the number of shadow rays, and control variates to get an analytic noise free result for area lights if there is no shadowing. The control variates results look quite impressive but I wonder how much it really helps to get unshadowed areas noise free if your shadowed areas still have a lot of noise. Maybe with adaptive sampling, but Pixar isn't using that as far as I know.
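The control variates idea can be illustrated with a toy 1D integral. This is one common way to set it up and not necessarily what Pixar does: the unshadowed lighting is taken from a known analytic value, and only the occluded part is estimated with samples. All names here are hypothetical, `f` stands in for the unshadowed light contribution and `visibility` for the shadow ray result.

```python
def estimate_with_control_variate(f, visibility, analytic_unshadowed, samples):
    """Estimate the integral of f(x) * visibility(x) over [0, 1).

    analytic_unshadowed is the exact integral of f over [0, 1).
    samples are points in [0, 1), assumed uniformly distributed.
    """
    # Monte Carlo estimate of the light that is blocked by occluders.
    blocked = sum(f(x) * (1.0 - visibility(x)) for x in samples) / len(samples)
    # Exact unshadowed result minus the noisy correction.
    return analytic_unshadowed - blocked
```

If `visibility` returns 1 everywhere, the correction term is exactly zero and the result is the analytic value regardless of the sample count. Where there is shadowing, the noise comes back through the correction term, which matches the concern above about shadowed areas.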
Note also that with the caching the diffuse bounces are cheaper than the glossy ones, because the latter are view dependent and so generally can't be cached. Pixar uses tricks like selectively disabling glossy in secondary bounces when they aren't needed. Those are technically caustics anyway.
A nice trick they showed is to give some basic texture to area lights, like a white area light with a blue border. It's simple but gives nice shading variation.
Simple OSL shaders have quite a bit of overhead, and shader compile time is quite long. There are some things that we can do to improve things, like making our internal struct match the ShaderGlobals struct. In general SPI have the same sort of issue with e.g. volume rendering where you have lots of small shading operations, and they are also looking into this.
SPI is interested in OSL on the GPU but unsure about the right choice of implementation and target to use (llvm nvidia/amd backends, opencl, cuda, glsl, hlsl, ..). This was the second big reason to start OSL at SPI, to have it e.g. seamlessly display in render and viewport without manually writing GLSL shaders that match nodes. Some other companies might be interested in working on this, discussion will happen on osl-dev mailing list. There seems to be a consensus that a single backend target isn't going to satisfy everyone, and that there will be some system to easily plug in multiple backends.
There were also some mentions that OIIO image texture lookup can be slow, as it's really designed for high quality texture filtering. It's possible to improve things for the simpler cases but Autodesk Beast and V-Ray just used their own texture filtering code. We could have an option to use our quick/stupid SVM texture filtering code too, or look into OIIO optimizations for common settings.
Solid Angle / SPI had a talk on BSSRDF sampling. Their method seems to have less noise than the line sampling we use now, mainly because it can stratify samples better.
Cycles does some extra things though, which perhaps we should drop? We use some tricks to normalize the influence to avoid light leaking and color shifts. Other render engines don't do this, so perhaps it's acceptable. Cycles also does multiple tries to reduce variance. Both the line sampling and the new technique will have about 50% of samples miss, which is quite a lot. My implementation of this is quite weak though, and I suspect it is actually wrong. It may not be possible to do this properly in an unbiased way.
The multiple importance sampling they use between falloff curves for different color channels should also help. This color noise is something I couldn't figure out when implementing SSS but their solution should not be so hard to add. Especially for things like the skin model with a sum of 6 gaussians this can be very useful.
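A minimal sketch of the idea, assuming one normalized 2D Gaussian falloff per color channel (a simplification of the sum of Gaussians skin model) and a one-sample balance heuristic. This is not the Solid Angle / SPI implementation, and the function names are made up for illustration. A channel is picked at random, a radius is sampled from that channel's falloff, and the sample is weighted by the average pdf of all channels instead of only the one that was picked.

```python
import math
import random

def gaussian_profile(r, variance):
    """2D Gaussian falloff, normalized to integrate to 1 over the plane."""
    return math.exp(-r * r / (2.0 * variance)) / (2.0 * math.pi * variance)

def sample_radius(u, variance):
    """Invert the radial CDF of the 2D Gaussian for u in [0, 1)."""
    return math.sqrt(-2.0 * variance * math.log(1.0 - u))

def sample_sss_weight(variances, rng):
    """Return (radius, per-channel weights) for one sample."""
    # Pick one color channel uniformly and sample from its falloff.
    channel = rng.randrange(len(variances))
    r = sample_radius(rng.random(), variances[channel])
    # Each normalized profile is also its own pdf over the plane.
    pdfs = [gaussian_profile(r, v) for v in variances]
    # Balance heuristic: divide by the average pdf over all channels.
    mix_pdf = sum(pdfs) / len(pdfs)
    weights = [p / mix_pdf for p in pdfs]
    return r, weights
```

With this weighting no channel weight can exceed the number of channels, whereas dividing by only the picked channel's pdf gives very large weights when a narrow falloff is used to sample a wide one. That bounded weight is what removes most of the color noise.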
Everyone seems to be using some variation of the Marschner model for hair. The Pixar/Disney importance sampling method is published, the SPI/Arnold one by Alejandro Conty is not. The former uses full raytracing of the hair, no deep shadow maps anymore. They do skip the TT term and use some trickery to compensate for the missing multiple scattering (using surface normals for short hair and point cloud blurring for long hair). The SPI/Arnold method includes multiple scattering and presumably does not rely on any precomputation but I don't know how it works.
The volume rendering in The Croods and Wizard of Oz is quite impressive, things have really gotten more advanced this year.
Camera frustum volume shapes seem to be quite important in getting as much detail as possible into the render. Tracing rays through such a frustum turns out to be not so simple.
There were already some papers published by SPI/SolidAngle for single scattering, now there is also a trick to emulate multiple scattering. The idea is to combine multiple 'octaves' with different volume settings. Each octave halves the density to let light scatter further with just one bounce. This is only a trick but combining these octaves lets light scatter far enough while still preserving detail, and it's entirely unbiased with no need for precomputation.
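A rough sketch of the octave idea for the light attenuation along a shadow ray. Only the halving of density per octave comes from the talk as described above. The per-octave `gain` weight, the number of octaves and the function name are assumptions for illustration.

```python
import math

def octave_light_attenuation(optical_depth, octaves=4,
                             density_scale=0.5, gain=0.5):
    """Sum light attenuation over octaves with progressively thinner density.

    optical_depth is density integrated along the shadow ray for octave 0.
    """
    total = 0.0
    for i in range(octaves):
        # Octave i sees the same volume with density scaled by density_scale**i.
        attenuation = math.exp(-optical_depth * density_scale ** i)
        # gain**i is a hypothetical weight for how much octave i contributes.
        total += gain ** i * attenuation
    return total
```

With a single octave this reduces to plain single scattering attenuation. The extra octaves let light reach deeper into dense parts of the volume, while the first octave still carries the sharp detail.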
The OpenVDB file format and data structure is quite cool in that you can store volumes with no predefined bounding boxes, the volume data can grow as needed without the users needing to worry about setting bounds. Even better though are the tools provided along with the library. They've got production quality tools for conversion to/from meshes and particles, various volume manipulations, etc. If someone is interested in implementing a native Volume datablock in Blender this sounds like a great way to do it as much of the important stuff is already there, with more planned to be added.
One example they showed is converting a mesh to a volume, giving the walls some thickness with dilation, fracturing the mesh, and converting back to an adaptive tessellated mesh. All while preserving mesh attributes like UV maps. Another example was cloud modelling by quickly deforming and placing some spheres, converting to a volume and adding procedural noise.
Embree is now using ISPC and has kernels that work with ray packets. Restructuring the code to take advantage of that sort of thing is hard, using ISPC to compile the kernel might help. It's unclear if the extra memory usage of computing 4/8/16 ray paths at the same time is really more efficient in the end for real use cases, you can then optimally use CPU FLOPS but you're doing a lot more memory access which is usually a bottleneck already.
At the same time for optimal GPU usage we should split our megakernel into smaller parts, this will help getting OpenCL to work on AMD, but will also benefit NVidia cards in more complex scenes with many different materials. This is quite challenging to do in practice though, especially with a kernel as complex as Cycles. NVidia showed how to do this for a simple path tracer but Cycles is quite a bit more complicated. What you need to do is to turn the code into a kind of state machine. We should try this at some point.
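To show what "a kind of state machine" means here, below is a toy sketch in Python rather than kernel code. Each path stores which stage it is waiting for, and every loop iteration runs one small kernel per stage over all paths in that stage. The stages, the fake intersection and shading logic and all names are hypothetical. A real split of Cycles would have many more stages and would keep the queues on the GPU.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Stage(Enum):
    INTERSECT = auto()
    SHADE = auto()
    DONE = auto()

@dataclass
class PathState:
    pixel: int
    bounce: int = 0
    stage: Stage = Stage.INTERSECT
    radiance: float = 0.0

def intersect_kernel(path, max_bounces):
    # Toy stand-in for BVH traversal: the path hits until max_bounces.
    hit = path.bounce < max_bounces
    path.stage = Stage.SHADE if hit else Stage.DONE

def shade_kernel(path, max_bounces):
    # Toy stand-in for shader evaluation with halving throughput.
    path.radiance += 0.5 ** path.bounce
    path.bounce += 1
    path.stage = Stage.INTERSECT

KERNELS = {Stage.INTERSECT: intersect_kernel, Stage.SHADE: shade_kernel}

def render(num_pixels, max_bounces=3):
    paths = [PathState(pixel=i) for i in range(num_pixels)]
    launches = []  # (stage, batch size) for every small kernel launch
    while any(p.stage is not Stage.DONE for p in paths):
        for stage, kernel in KERNELS.items():
            # Gather all paths waiting for this stage into one batch.
            batch = [p for p in paths if p.stage is stage]
            if batch:
                launches.append((stage, len(batch)))
                for p in batch:
                    kernel(p, max_bounces)
    return paths, launches
```

The point of the restructuring is that each batch runs the same small piece of code, so threads in a batch do not diverge across different materials the way they do inside one megakernel.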
In general it's sort of interesting that when you look at the open sourced libraries and talks from the studios, they're not actually using SIMD that often, it's generally a pain to get working and adapt your code to it. For a raytracing kernel it's quite important though.
MULTITHREADED DEPENDENCY GRAPH
Pixar right now uses basically one character per thread, then background caches frames for playback to keep cores occupied. It's kind of a workaround, but ensuring fast playback this way is quite nice for animators. They showed their Presto animation software, with fast playback, opensubdiv and hair deformation on the GPU, baked ptex applied to the mesh and lighting and shadows from the key light. All looked quite nice.
They are looking into finer granularity too but don't have it yet. Dreamworks is very granular, graphs with 50k-150k nodes. Requires careful design of rigs, but they have very good scaling. Overhead from granularity is reduced by using TBB, and letting threads handle chains of nodes without going through the task system.
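The chain trick can be sketched as a preprocessing step on the graph: nodes that simply follow one another are grouped, so a thread runs the whole group as one task instead of scheduling each node separately. This is an illustration under assumptions, not the Dreamworks implementation. `deps` is a hypothetical dict that maps every node to the list of nodes it depends on, and the graph is assumed to be acyclic.

```python
def collapse_chains(deps):
    """Group the nodes of a dependency graph into linear chains."""
    # Invert the graph: for each node, which nodes use its output.
    users = {node: [] for node in deps}
    for node, inputs in deps.items():
        for inp in inputs:
            users[inp].append(node)

    def continues_chain(node):
        # True if node has a single input that has no other users.
        inputs = deps[node]
        return len(inputs) == 1 and len(users[inputs[0]]) == 1

    chains = []
    for node in deps:
        if continues_chain(node):
            continue  # Picked up when walking forward from the chain head.
        chain = [node]
        while (len(users[chain[-1]]) == 1
               and continues_chain(users[chain[-1]][0])):
            chain.append(users[chain[-1]][0])
        chains.append(chain)
    return chains
```

For example, a graph where `b` depends on `a`, `c` on `b`, both `d` and `e` on `c`, and `f` on `d` and `e` collapses to the chains `[a, b, c]`, `[d]`, `[e]` and `[f]`, so only four tasks go through the task system instead of six.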
Pixar uses a system where there is a very clear separation between output data and the depsgraph, for evaluating multiple frames at the same time and to reduce locking. This is something we want in Blender too. They also compile the depsgraph in advance, and Dreamworks caches networks for changing various values. It's unclear if this will help in Blender, maybe with quite complex rigs. I think it's best to leave this as an optimization to solve when it shows up as a performance problem.