This is more a piece-of-mind thing than anything else. But it
also lets us make the number of LDS dimensions lower without
worrying, which in turn makes the code smaller.
After implementation, it does appear to make rendering slower
by a noticable bit compared to what I was doing before. At very
low sampling rates it does provide a bit of visual improvement,
but by the time you get to even just 16 samples per pixel its
benefits seem to disappear.
Due to the slow down and the minimal gains, I'll be removing
this in the next commit. But I want to commit it so I don't
lose the code, since it was an interesting experiment with
some promising results.
I couldn't make the BVH4 faster than the BVH, and the bitstack
was bloating the AccelRay struct. Removing the bitstack gives
a small but noticable speedup in rendering.
Specifically, LightPath is now significantly smaller, and
resultingly faster to process.
Also finally fixed the bug where without light sources the light
from the sky wouldn't affect surfaces.
If the average surface area of all the time samples is close enough
to the surface area of their union, just take the union and use that.
This both makes the BVH smaller in memory (time samples don't
propigate up the tree beyond their usefulness) and makes it
faster since traversal can avoid interpolating BBoxes when there's
only one BBox for a node.
Reduced from 64 to 42. This still allows each BVH to hold 4.4
trillion elements, but it guarantees that the accel ray's
traversal bitstack can accommodate at least two nested max-depth
trees.
In practice it worked fine, but only by accident. NaN's were
being passed to the lerp_slice function, which led to the
correct result in this case but is icky and dependant
on how lerp_slice is implemented.
The BVH building code is now largely split out into a separate
type, BVHBase. The intent is that this will also be used by
the BVH4 when I get around to it.
The BVH itself now uses references instead of indexes, allocating
and pointing directly into the MemArena. This allows the nodes
to all be right next to their bounding boxes in memory.
This seems to work more nicely than a fixed block size, because
it adapts to how much memory is being requested over-all. For
example, a small scene won't allocate large amounts of RAM,
but a large scene with large data won't be penalized with a
lot of tiny allocations.