NVIDIA researchers’ landmark achievement in machine learning uses multiresolution hash encoding: Digital Photography Review

January 19, 2022

Researchers from NVIDIA have developed a method for training neural graphics primitives very quickly on a single GPU. Neural graphics primitives have traditionally relied on fully connected neural networks that are challenging, time-consuming and expensive to train and evaluate.

The research team, made up of Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller, has created a new input encoding method that significantly reduces the number of floating point and memory access operations. Further, the team has augmented its small neural network with a multiresolution hash table, which simplifies the overall architecture and leads to significant optimizations. The method allows high-quality neural graphics primitives to be trained in mere seconds on a single, less powerful device rather than on expansive networks of many expensive computers. This means super-resolution style upscaling for photos and other images could be done quickly, on the fly, without the need for racks of computer systems and GPUs.

There are a lot of complicated terms and ideas at play, but the gist is that by reducing the number of parameters required for the parametric encoding technique being used, and making the data structure itself easier for GPUs to handle, neural network training is made significantly faster. The authors write, ‘We reduce this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations: a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs.’ The GPU being used for the work is an NVIDIA RTX 3090, which, while not inexpensive at $1,500, is within the reach of many.

‘Fig. 1. We demonstrate instant training of neural graphics primitives on a single GPU for multiple tasks. In Gigapixel image we represent a gigapixel image by a neural network. SDF learns a signed distance function in 3D space whose zero level-set represents a 2D surface. Neural radiance caching (NRC) [Müller et al. 2021] employs a neural network that is trained in real-time to cache costly lighting calculations. Lastly, NeRF [Mildenhall et al. 2020] uses 2D images and their camera poses to reconstruct a volumetric radiance-and-density field that is visualized using ray marching. In all tasks, our encoding and its efficient implementation provide clear benefits: rapid training, high quality, and simplicity. Our encoding is task-agnostic: we use the same implementation and hyperparameters across all tasks and only vary the hash table size which trades off quality and performance. Photograph ©Trevor Dobson (CC BY-NC-ND 2.0)’

Figure and caption credit: Müller, Evans, Schied, and Keller.

Graphics primitives are ‘represented by mathematical functions that parameterize appearance.’ The goal is to have high-quality, detailed graphics that are also fast and compact. The finer a grid of data, the more detailed the resulting graphics, but also the more costly it is to store and process. ‘Functions represented by multi-layer perceptrons (MLPs), used as neural graphics primitives, have been shown to match these criteria (to varying degree), for example as representations of shape [Martel et al. 2021; Park et al. 2019] and radiance fields [Liu et al. 2020; Mildenhall et al. 2020; Müller et al. 2020, 2021],’ says the new research paper.

The potential issue with earlier approaches that pair MLPs with auxiliary data structures is that those data structures can require structural modifications, like pruning, splitting or merging, which can make the training process more resource- and time-intensive. The team has addressed these concerns through its multiresolution hash encoding. The encoding is highly adaptable and is configured by only two values: the number of parameters and the desired finest resolution. Part of what makes the method particularly fast is that the hash table, a data structure that stores values in an array and locates them by an index computed from a key, can be queried across all resolutions in parallel. The neural network is trained iteratively across multiple resolutions at the same time.
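To make the idea of one lookup per resolution more concrete, here is a minimal 2D sketch in plain Python. It is not the authors' implementation: the function names, the geometric spacing of grid resolutions, the rounding, and the prime-based XOR hash are assumptions made for illustration, and the levels are visited in a simple loop rather than in parallel on a GPU. Training, which would adjust the table entries by gradient descent, is left out.

```python
import math
import random

# Assumed hash constant: a large prime for the second coordinate
# (the first coordinate is used unscaled).
PRIME_Y = 2654435761

def level_resolutions(n_levels, coarsest, finest):
    """Grid resolutions spaced geometrically from coarsest to finest."""
    if n_levels == 1:
        return [finest]
    growth = math.exp((math.log(finest) - math.log(coarsest)) / (n_levels - 1))
    # round() keeps the sketch robust against floating point error.
    return [round(coarsest * growth ** i) for i in range(n_levels)]

def hash_index(ix, iy, table_size):
    """Map an integer grid vertex to a slot in a fixed-size table."""
    return (ix ^ (iy * PRIME_Y)) % table_size

def make_tables(n_levels, table_size, n_features, seed=0):
    """One table of small random feature vectors per resolution level."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1e-4, 1e-4) for _ in range(n_features)]
             for _ in range(table_size)]
            for _ in range(n_levels)]

def encode(x, y, tables, resolutions):
    """Encode a point in [0, 1]^2 as concatenated, interpolated features."""
    table_size = len(tables[0])
    encoded = []
    for table, res in zip(tables, resolutions):
        fx, fy = x * res, y * res
        x0, y0 = math.floor(fx), math.floor(fy)
        wx, wy = fx - x0, fy - y0
        # Bilinear weights for the four corners of the enclosing grid cell.
        corners = (
            (0, 0, (1 - wx) * (1 - wy)),
            (1, 0, wx * (1 - wy)),
            (0, 1, (1 - wx) * wy),
            (1, 1, wx * wy),
        )
        features = [0.0] * len(table[0])
        for dx, dy, weight in corners:
            entry = table[hash_index(x0 + dx, y0 + dy, table_size)]
            for k, value in enumerate(entry):
                features[k] += weight * value
        encoded.extend(features)
    return encoded

resolutions = level_resolutions(n_levels=4, coarsest=4, finest=32)
tables = make_tables(n_levels=4, table_size=2 ** 10, n_features=2)
vector = encode(0.3, 0.7, tables, resolutions)
print(resolutions, len(vector))
```

The resulting vector, with a few features per level, is what a small MLP would receive as input in place of the raw coordinates. Because each level is looked up independently, the per-level work has no dependencies between levels, which is the property that makes parallel querying possible.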

Hash tables allow for fast lookups regardless of the size of the data because a hash function computes each value’s index directly from its key. If you know the index of the data you wish to retrieve, the operation is very fast. When performing training operations, no structural updates to the data structures are required. Further, the hash tables automatically prioritize ‘the sparse areas with the most important fine scale detail.’
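The authors' own description mentions hash collisions, so it is worth seeing where they come from. In a fixed-size table, a level whose grid has more vertices than the table has slots must send several vertices to the same slot. The sketch below counts this for a small 2D example. The hash function, the table size and the helper name `slot_usage` are assumptions for illustration; as far as we understand the paper, coarse levels that fit in the table are mapped one-to-one, whereas this sketch hashes every level for simplicity.

```python
def hash_index(ix, iy, table_size):
    """Assumed spatial hash: XOR of scaled coordinates, wrapped to the table."""
    return (ix ^ (iy * 2654435761)) % table_size

def slot_usage(resolution, table_size):
    """Return (vertex count, distinct slots used) for a 2D grid."""
    slots = set()
    vertices = 0
    for iy in range(resolution + 1):
        for ix in range(resolution + 1):
            slots.add(hash_index(ix, iy, table_size))
            vertices += 1
    return vertices, len(slots)

TABLE_SIZE = 2 ** 10  # hypothetical, far smaller than the sizes in the paper

for res in (8, 16, 32, 64, 128):
    vertices, used = slot_usage(res, TABLE_SIZE)
    # vertices - used is the number of vertices that share a slot with another.
    print(f"resolution {res}: {vertices} vertices, {used} slots used, "
          f"{vertices - used} shared")
```

Once the vertex count passes the table size, sharing is unavoidable no matter how good the hash is. The table itself never grows or gets reorganized, which is why no structural updates are needed during training.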

This is important because you don’t want to spend time and computational resources on empty spaces or spaces with less detail. For example, an area of an image with coarser detail will not be queried repeatedly across unnecessarily fine resolutions, resulting in more efficient and faster training and rendering. It’s also important to encode an input at multiple resolutions because doing so ensures that a neural network is not just trained faster and more efficiently, but that in areas of a 2D or 3D graphic that include high levels of detail, the appropriate level of detail is learned and you achieve high-quality results.

There are massive performance gains realized with the new input encoding method. The research paper shows that a NeRF, or Neural Radiance Field, can be trained in just five seconds. Per this Reddit thread, training a NeRF on a single scene could take up to 12 hours just a couple of years ago. The new multiresolution hash encoding algorithm has reduced this to five seconds, and it delivers not just a trained scene but real-time rendering. Not only is the iterative, adaptive encoding method significantly faster, but it can also be performed on a single high-end GPU that can be purchased by anyone, rather than an expensive network of super-powerful computers.

‘Fig. 6. Approximating an RGB image of resolution 20,000 x 23,466 (469M RGB pixels) with our multiresolution hash encoding’ with different table sizes. The painting is ‘Girl With a Pearl Earring’ renovation by Koorosh Orooj (CC BY-SA 4.0).

The full research paper includes numerous experimental examples of the multiresolution hash encoding method being used. For example, a neural network was used to approximate an RGB image with a resolution of 20,000 x 23,466 (469M RGB pixels). With a hash table size of T = 2^22, the neural network was trained for 2.5 minutes and achieved a peak signal-to-noise ratio (PSNR) similar to what ACORN (Adaptive Coordinate Networks for Neural Scene Representation) achieved after 36.9 hours of training.
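The figures in that example are easy to check by hand. The short Python sketch below works through the pixel count, the table size and the ratio between the two training times, and includes the standard definition of peak signal-to-noise ratio. The `psnr` helper and its `peak=1.0` default (pixel values scaled to the range 0 to 1) are assumptions for illustration. The article does not give the number of resolution levels or features per entry, so no total parameter count is computed.

```python
import math

width, height = 20_000, 23_466
pixels = width * height              # 469,320,000, the "469M" in the caption
table_entries = 2 ** 22              # T = 4,194,304

acorn_hours = 36.9
hash_minutes = 2.5
# Ratio of training times, roughly 886x.
speedup = acorn_hours * 60 / hash_minutes

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in decibels between two equal-length signals."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)

print(pixels, table_entries, round(speedup, 1))
```

A higher PSNR means the approximation is closer to the original image, so matching ACORN's PSNR in a small fraction of the training time is the point of the comparison.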

The consequences of the new research may be huge. In technology, advancements often focus on speed or quality, but rarely are both achieved simultaneously in a way that also reduces the required computational overhead. Considering the hardware this new research was run on, it’s not out of the realm of possibility that we could see similar tech used in post-processing programs in the near future, opening up a whole new world of image enhancement technologies. For the process in extensive detail, read the full research paper.
