thrown together by David Tamés, v.1, December 14, 2023
This presentation is a first draft; comments and suggestions are welcome. Please contact me at: d (dot) tames (at) northeastern (dot) edu.
For copyright and acknowledgments please refer to the last slide.
In this 90-minute demo, you will be introduced to the Ambisonics Recording Kit that will soon be available from the CAMD Immersive Media Lab. We will begin with an introduction to the fundamentals of Ambisonics and the components of the kit; you'll then have an opportunity to record sound with the kit, listen to the results, and learn about postproduction options for working with Ambisonics sources in immersive audio compositions, 360 videos, and VR projects. I would also like to discuss what additional resources we need to support creative research that incorporates immersive audio. The MSO has not yet determined the actual name of the kit, and the deployment, inventory, and availability of post-production software resources are still being finalized. If you are interested in using Ambisonics recording or postproduction for a course or creative research, please get in touch with Jonny Ouk.
Ambisonics is a sound encoding/decoding standard for full-sphere surround sound. Ambisonics is flexible, future-proof, and captures a high degree of realism. Despite being around since the 1970s, it is still new to a lot of people and, like every technique, it has a bit of a learning curve. Audio materials encoded in the Ambisonics format represent sound sources along the horizontal plane and above and below the listener. Unlike traditional multi-channel surround formats (e.g., 5.1 and double MS), Ambisonics implements a loudspeaker-independent representation of a sound field (or soundfield) that can be decoded to a wide range of loudspeaker setups (e.g., mono, binaural, stereo, 5.1, 7.1, cube, octahedral, etc.).
Ambisonics allows creators to work with sound sources based on directions instead of speaker positions and offers flexibility regarding the playback environment. Many sound artists and designers, including Mark Mangini (Dune, Blade Runner 2049), advocate for recording sound effects and ambiences in Ambisonics due to the post-production flexibility it provides. Therefore, 360 video and VR are only two of the many use cases for Ambisonics in the contemporary media production landscape. You don't need to record Ambisonics sources for an Ambisonics mix; many creators build their immersive soundscapes from stereo and mono sources that have been encoded to Ambisonics for mixing. The future of audio is object-based, which releases us from the limits of specific speaker configurations and makes it easier to future-proof our projects and release them in a variety of delivery formats.
Image: Illustration by Yoshi Sodeoka from Spatial Audio, New York Times, October 22, 2021, https://rd.nytimes.com/brief/spatial-audio
Ambisonics was developed in the mid-1970s by a group of British academics, notably Michael Gerzon (Mathematical Institute, Oxford) and Peter Fellgett (University of Reading). The system was designed to reproduce recordings made with a “Soundfield” microphone and mixed using their own technology, and reproduced with a minimum of four speakers. The significant innovation of this project was the creation of an object-based audio format that encoded the direction, distance, and height of recorded sound without specific reference to a loudspeaker channel. The technology is a cousin of the M-S stereo recording technique developed in the early 1930s by Alan Dower Blumlein, an English scientist who was also responsible for the 45/45 stereo record cutting technique. The M-S configuration with a cardioid or line-cardioid microphone for the middle channel and a figure-of-eight microphone for the side channels has been demonstrated to achieve the highest accuracy of stereo imaging with two loudspeakers and is easily decoded to stereo (and stereo is easily encoded into M-S). Double M-S adds an additional line-cardioid or cardioid middle channel pointing to the rear for a full 360 surround recording. But the moment you move or turn your head, the spatial realism vanishes. Ambisonics to the rescue.
Curiously, if you are working with AmbiX files, which use SN3D normalization and standard ACN (Ambisonic Channel Numbering, WYZX), the first two channels of the file are the W (omnidirectional) and Y (left/right) signals, identical to the mid and side signals of M-S stereo, which you can decode with the mid/side decoder available in most DAWs. The M-S heritage is right there in channels 1 and 2!
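The mid/side decode mentioned above is just a sum and difference. Here's a minimal sketch (the function name and the 0.5 headroom gain are my own choices, not part of any standard):

```python
import numpy as np

def ambix_ws_to_stereo(w, y, gain=0.5):
    """Decode the first two AmbiX channels (W = mid, Y = side) to L/R stereo.

    Classic mid/side sum-and-difference decode; the gain is just headroom
    to avoid clipping the summed channel.
    """
    w = np.asarray(w, dtype=float)
    y = np.asarray(y, dtype=float)
    left = gain * (w + y)
    right = gain * (w - y)
    return left, right

# A source panned hard left contributes equally to W and Y (SN3D),
# so the right channel cancels to silence:
w = np.array([0.5, 1.0])
y = np.array([0.5, 1.0])
left, right = ambix_ws_to_stereo(w, y)
```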
References
Images: Double MS (B9 Audio); Triple Calec Microphones (Into The Soundfield Michael Gerzon and Ambisonics at Oxford)
While developed in the 70s, Ambisonics did not gain traction until YouTube, Facebook (now Meta), and Unity adopted it as a standard for 360-degree videos. Ambisonics is now widely used not only as a 360-degree spatial audio standard for VR and 360 video but also for immersive installations and field recording due to the flexibility of the format. For example, Reeps One: Does Not Exist (VR Beatbox with 3D sound) by Mill+ and Aurelia Soundworks was composed to take advantage of the 360 audiovisual space, creating a new style of music video; watch on YouTube, https://youtu.be/OMLgliKYqaI and read more at https://www.aureliasoundworks.com/project/reeps-one-does-not-exist/
The listener can locate the sound coming from any direction in space. It works on the principle that if you can record the sound field at the boundary of a region, you can reproduce the field inside it using differential field equations (the Kirchhoff-Helmholtz integral theorem). An Ambisonics encoder takes the azimuth and elevation of the sound source as input, and when the audio material is decoded, the listener perceives the sound source at that particular position.
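As a concrete illustration of that azimuth/elevation encoding, here is a first-order AmbiX (ACN/SN3D) panner sketched from the spherical-harmonic gains; the function name is mine, and this ignores distance cues and near-field effects:

```python
import numpy as np

def encode_foa(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal to first-order AmbiX (ACN order W, Y, Z, X; SN3D).

    Channel gains are the real spherical harmonics up to first order:
      W = 1, Y = sin(az)cos(el), Z = sin(el), X = cos(az)cos(el)
    with azimuth measured counterclockwise from straight ahead.
    """
    az = np.radians(azimuth_deg)
    el = np.radians(elevation_deg)
    s = np.asarray(signal, dtype=float)
    w = s                               # omnidirectional component
    y = s * np.sin(az) * np.cos(el)     # left/right figure-of-eight
    z = s * np.sin(el)                  # up/down figure-of-eight
    x = s * np.cos(az) * np.cos(el)     # front/back figure-of-eight
    return np.stack([w, y, z, x])

# A source straight ahead (az=0, el=0) lands only in W and X:
b = encode_foa([1.0], azimuth_deg=0, elevation_deg=0)
```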
In an Ambisonics rendering system you would see loudspeakers not just at the listener's ear level but also above and below it (periphony). If the listener is using headphones, binaural processing based on HRTF filtering can provide the spatial effect, and if head tracking is implemented, the sound field will remain stable in space rather than turning with the listener's head.
For an example, see “The Rise Of Immersive Audio: Progressum Unifies Vivid House” by Katinka Allender, LIVE DESIGN, August 18, 2022, https://www.livedesignonline.com/news/rise-immersive-audio-progressum-unifies-vivid-house-vivid-sydney-2022
The 5th order Ambisonic Dome at the School of Arts, Media and Engineering (DAME) at Arizona State University is an example of a state-of-the-art space to work with spatial audio and the home of Ambisonic Dome Concerts https://asuevents.asu.edu/event/ambisonic-dome-concert. The dome contains 45 speakers, allowing for the precise placement of sounds anywhere along a three-dimensional space, giving artists the opportunity to create immersive audio experiences and explore new frontiers of sonic art, virtual reality and multimedia art. The Dome was designed by Garth Paine with technical director Peter Weisman.
There's a wide range of microphones available for field recording and several plug-in libraries for major DAWs supporting Ambisonics mixing and rendering to different exhibition formats; we'll cover some of these today.
Unlike traditional surround formats such as quadraphonic, 5.1, and 7.1, Ambisonics covers sources above and below the listener in addition to the horizontal plane. Transmission channels do not carry speaker signals; instead, they contain a speaker-independent representation of the sound field (called B-format). For playback, material encoded in B-format is decoded to a specific speaker configuration. The same source material can be decoded for playback through stereo speakers, binaural or stereo headphones, a four-speaker setup, a multi-speaker dome, etc. Media makers can think in terms of sound source directions rather than loudspeaker positions, and the format provides a high degree of flexibility in terms of speaker layout. It is also the most effective method to present spatial audio in VR and 360 video applications: the scene can be rotated to match the participant's head orientation and then decoded as binaural stereo.
For the most immersive effect, the number of loudspeakers for playback should exceed the number of B-format channels, preferably by a few! The number of Ambisonics-encoded channels is equal to (order + 1)². Higher-order Ambisonics provides better localization of sound sources, but it requires additional channels: 1st order requires four encoded channels, 2nd order nine, and 3rd order 16. You would also need additional loudspeakers to take advantage of the higher orders.
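The (order + 1)² relationship is worth internalizing, since it drives both file sizes and speaker counts. A one-line sketch:

```python
def ambisonic_channels(order: int) -> int:
    """Number of B-format channels for a given Ambisonic order: (order + 1)**2."""
    return (order + 1) ** 2

# 1OA -> 4 channels, 2OA -> 9, 3OA -> 16, and 7OA -> 64
counts = {order: ambisonic_channels(order) for order in (1, 2, 3, 7)}
```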
Higher order Ambisonics (HOA) are used to reconstruct a plane wave by decomposing the sound field into spherical harmonics. This process is known as encoding. Encoding creates a set of signals that depend on the position of the sound source, with the channels weighted depending on the source direction. The functions become more and more complex as the HOA order increases. The spherical harmonics are shown here up to third-order. These third-order signals include, as a subset, the omnidirectional zeroth-order and the first-order figure-of-eights. Depending on the source direction and the channel, the signal can also have its polarity inverted (the darker lobes).
The various channels in an Ambisonics B-format file may be visualized as virtual microphones with increasingly complex pick-up patterns. As you add channels, you increase the spatial resolution and the size of the sweet spot. Remember, channels do not correspond to loudspeakers; they correspond to spherical harmonics. Here we see a visualization of Ambisonics spherical harmonics for orders up to three. 1st order Ambisonics (abbreviated 1OA or FOA) includes the top two rows of basis functions (four channels), and 3OA includes all four rows (16 channels). Note that ACN channels 2, 6, and 12 contain only vertical components.
The AmbiX file format is the standard in most widespread use at this time. The format defines the channel ordering and the normalization, both of which are related to the spherical harmonics, the mathematics behind Ambisonics. The channel ordering used in AmbiX files is called ACN, which stands for Ambisonic Channel Number. In contrast to FuMa, an older B-format standard, the channels are no longer ordered alphabetically: the first-order components in FuMa are WXYZ, while in ACN they are WYZX, where W is the omnidirectional signal and X, Y, and Z are the figure-of-eights in the x, y, and z directions, respectively. SN3D normalization is used in the AmbiX format because it ensures that when you encode a source, the levels of the other channels will not exceed the first (omni, W) channel, which is quite handy for avoiding clipping in your DAW. It is important to know the file format when working with Ambisonics sources recorded in the past, as there were many different configurations in use before the AmbiX standard. A neat property of AmbiX files: with SN3D normalization and standard ACN ordering (WYZX), the first two channels of your Ambisonics signal (W and Y) hold mid/side signals, which you can decode with a simple mid/side decoder, like the one that comes with Reaper. So a very simple stereo decoder would be exactly that. Neat, right? The M-S heritage is right there in channels 1 and 2!
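For first-order material, converting legacy FuMa recordings to AmbiX is just a channel reorder plus a gain on W (FuMa records W at -3 dB, so we scale it back up by √2). A sketch under those assumptions; the function name is mine, and higher orders would need additional per-channel scaling this does not handle:

```python
import numpy as np

def fuma_to_ambix_foa(fuma):
    """Convert first-order FuMa (W, X, Y, Z) to AmbiX (ACN/SN3D: W, Y, Z, X).

    Reorders the channels and multiplies W by sqrt(2) to undo FuMa's
    -3 dB W convention. First order only.
    """
    w, x, y, z = np.asarray(fuma, dtype=float)
    return np.stack([w * np.sqrt(2.0), y, z, x])

# One sample per channel: a FuMa frame with W at the -3 dB level
fuma = np.array([[1 / np.sqrt(2.0)], [0.1], [0.2], [0.3]])
ambix = fuma_to_ambix_foa(fuma)
```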
Reference
Since the number of loudspeakers has to at least match the number of HOA channels, expense and physical limitations are significant factors. How many venues can provide the 64 speakers needed for 7OA playback? So why encode things to a high order in postproduction if we are limited to lower-order playback? There are two reasons why postproduction is often done at orders much higher than the source material: (1) future-proofing and (2) better binaural rendering (and given that headphones are ubiquitous, mixing in higher orders makes sense even if participants are only going to listen with headphones).
One of the best features of Ambisonics as a postproduction format is that you can work with a subset of channels for a lower-order rendering. The first four channels of a 3OA mix are exactly the same as the four channels of a 1OA mix! We can ignore the higher-order channels without having to do any approximate down-mixing. By encoding at a higher order than might be feasible for the current deployment configuration, you remain ready for another configuration of loudspeakers in the future. For example, you might mix for a binaural preview, then mix for a 24-speaker dome for a museum installation, use the binaural mix for documentation of the experience, and use a 1OA mix for deployment to a VR headset. If the limiting factors of HOA are cost and loudspeaker placement, what if we use headphones instead? A binaural rendering uses headphones to place a set of virtual loudspeakers around the listener. Now our rendering is limited only by the number of channels our PC/laptop/smartphone can handle at any one time (and the quality of the HRTF).
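Because the first (order + 1)² ACN channels of a higher-order mix are exactly the lower-order mix, order reduction is a pure channel slice. A minimal sketch (the function name is mine):

```python
import numpy as np

def truncate_order(b_format, target_order):
    """Keep only the channels of an ACN-ordered HOA signal up to target_order.

    No down-mix math is needed: the lower-order mix is literally the
    first (target_order + 1)**2 channels of the higher-order one.
    """
    n = (target_order + 1) ** 2
    return b_format[:n]

# A 3OA mix (16 channels x 480 samples) reduced to 1OA (4 channels):
hoa = np.zeros((16, 480))
foa = truncate_order(hoa, target_order=1)
```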
1st order Ambisonics (1OA) microphones require only four-channel recorders, at the expense of reduced spatial cues and a smaller sweet spot compared to 2OA and 3OA microphones. The Sennheiser AMBEO VR microphone in the CAMD kit is a 1OA microphone.
Configuration currently a work-in-progress.
A-format is the term used for the unprocessed signals from the four capsules of a tetrahedral sound field microphone, consisting of four sub-cardioid capsules (a polar response slightly more omni than cardioid) mounted on the surface of a tetrahedron. The capsule outputs can be electronically equalized to some degree so that they appear to be coincident up to a certain frequency. Basic sum-and-difference processing of the A-format outputs generates the B-format components: W = 0.5(LF + LB + RF + RB), X = 0.5((LF - LB) + (RF - RB)), Y = 0.5((LF - RB) - (RF - LB)), and Z = 0.5((LF - LB) + (RB - RF)). The resulting B-format outputs are carefully equalized to compensate for level differences; for example, the W output may lack low frequencies as it is derived from velocity capsules that can lack bass. Because the characteristics of the capsules vary between different microphone designs, the exact specification of the A-format signals is not fixed, and each microphone has a procedure (implemented in hardware or software) for converting from A-format to B-format for further processing. Usually we are working with four channels (1OA) from field recordings, but higher orders are also used.
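The sum-and-difference equations above translate directly into code. This sketch implements exactly those equations and nothing more; as noted, a real converter also applies per-capsule equalization specific to the microphone model:

```python
import numpy as np

def a_to_b_format(lf, rf, lb, rb):
    """Convert tetrahedral A-format capsule signals (LF, RF, LB, RB) to B-format.

    Pure sum-and-difference processing; omits the microphone-specific
    equalization a real A-to-B converter would apply.
    """
    lf, rf, lb, rb = (np.asarray(c, dtype=float) for c in (lf, rf, lb, rb))
    w = 0.5 * (lf + lb + rf + rb)        # omnidirectional (pressure) component
    x = 0.5 * ((lf - lb) + (rf - rb))    # front/back
    y = 0.5 * ((lf - rb) - (rf - lb))    # left/right
    z = 0.5 * ((lf - lb) + (rb - rf))    # up/down
    return w, x, y, z

# Identical signal at all four capsules: pure pressure, no directional component
w, x, y, z = a_to_b_format([1.0], [1.0], [1.0], [1.0])
```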
The basic format used for the storage and manipulation of Ambisonics is B-format. AmbiX is the contemporary B-format standard that has been widely adopted by distribution platforms such as YouTube; it orders the channels W-Y-Z-X. B-format consists of the spherical harmonics of the sound field up to the order being considered. For first-order Ambisonics there is one signal of 0th order, known as W, and three of 1st order, known as X, Y, and Z. These signals correspond conceptually to the outputs of one omnidirectional microphone and three orthogonal figure-of-eight microphones placed at the same point. This four-channel signal allows the manipulations required to generate speaker signals, rotate the sound field, and perform various other transformations with simple mathematics, and therefore this format is used for storage, manipulation, and transmission of Ambisonic material.
For a more comprehensive description of Ambisonics channels, see
An Ambisonics decoder can decode the B-format signals to any loudspeaker setup, or to binaural in the case of headphones. One of the advantages of Ambisonics is that the format is independent of loudspeaker configuration. Traditional surround sound formats have separate channels for each loudspeaker (placed at the front, center, back, etc.). Ambisonics co-exists well with existing mono, stereo, and 5.1 setups, but also opens up the possibilities of more sophisticated surround sound arrangements. No wonder sound designers like Mark Mangini now advocate for recording ambiences in Ambisonics for maximum flexibility in post production. While many sound effects are better recorded using a coincident stereo format like M-S stereo and dialogue is traditionally recorded in mono, these can be placed anywhere in the sound field in post production. The Ambisonic B-format WXYZ signals define what the listener should hear. How these signals are presented to the listener depends on the number of speakers and their locations. Ambisonics treats directions where no speakers are placed with as much importance as speaker positions. UHJ is the matrixing scheme associated with Ambisonics; the two-channel version was a compromise between the BBC's Matrix H and the NRDC's 45J. The BBC's had been chosen from a range of possibilities by listening tests, and the NRDC's was based on theoretical principles.
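To make the idea of decoding concrete, here is a deliberately naive first-order "sampling" decode to an equally spaced horizontal ring of speakers: each feed is W plus the X/Y figure-of-eights projected onto that speaker's direction. Real decoders add order-dependent weighting (e.g., max-rE) and careful normalization that this sketch, with its made-up function name, leaves out:

```python
import numpy as np

def decode_foa_ring(w, x, y, num_speakers=4):
    """Naive first-order decode to num_speakers equally spaced around a ring.

    Speaker i at angle theta_i gets: W + X*cos(theta_i) + Y*sin(theta_i),
    scaled by 1/num_speakers so the total energy doesn't grow with the
    speaker count.
    """
    angles = 2 * np.pi * np.arange(num_speakers) / num_speakers
    w, x, y = (np.asarray(c, dtype=float) for c in (w, x, y))
    feeds = [w + x * np.cos(a) + y * np.sin(a) for a in angles]
    return np.stack(feeds) / num_speakers

# A source straight ahead (all energy in W and X) drives the front
# speaker hardest and the rear speaker not at all:
feeds = decode_foa_ring(np.array([1.0]), np.array([1.0]), np.array([0.0]))
```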
B-format Ambisonics can be decoded into almost any playback format, including: mono (without "sum to mono" phase cancellation issues); stereo; binaural, fixed-head or headtracked, using individualized or generic HRTF information; four speakers arranged as a square or rectangle; six speakers arranged as a regular or irregular hexagon; 5.1 (ITU); 7.1; 10.1; Dolby Atmos; or any of these formats plus height information (e.g., two hexagonal arrays of speakers, one above the listener and one below); and many more.
Time for the hands-on portion of the demo! We'll make a recording, listen to it, and talk about some postproduction considerations.
I would be happy to offer a follow-up workshop covering editing, mixing, and playback options for Ambisonics projects, let me know if you are interested.
Reaper (https://www.reaper.fm/) is a DAW widely used for Ambisonics mixing due to its ability to accommodate up to 64 channels per track, enough for working with HOA up to 7OA.
The most extensive suite of free plug-ins for working with Ambisonics materials is the IEM Plug-in Suite, https://plugins.iem.at/, it includes Ambisonic plug-ins up to 7OA, it was created by staff and students of the Institute of Electronic Music and Acoustics (IEM) (https://iem.kug.ac.at/). Below is a summary of the plug-ins in the suite and what each is used for:
I've also used the a1/a3 Bundle of Ambisonics plug-ins from SSA Plugins, https://www.ssa-plugins.com/, along with Harpex-X, https://harpex.net/, for extracting virtual microphones and binaural decoding.
SSA offers the following plug-ins:
I've been using the free dearVR AMBI MICRO, https://www.dear-reality.com/products/dearvr-ambi-micro, for optimized A-to-B conversion for the Sennheiser AMBEO VR microphone. dearVR MICRO, https://www.dear-reality.com/products/dearvr-micro, provides another free tool for binaural rendering, HRTF selection, panning, etc. That said, I prefer Harpex-X for binaural rendering and A-format to B-format conversion, but it's not free.
If your goal is to produce a 360 video or an immersive installation without computer-based interactivity, or an audio-only binaural experience, Reaper or ProTools are good options with support for Ambisonics editing and mixing.
On the other hand, if you intend to create an interactive installation or a VR experience, you'll want to consider Wwise from Audiokinetic, a suite of design and development tools tailor-made for prototyping and deploying interactive audio experiences. My own experience is limited to Reaper and Unity, but Wwise looks intriguing for interactive works.
It is preferable to mix and deliver the soundtrack as 3OA because this provides better spatial resolution (a more immersive audio experience for the participant). Even if your Ambisonics field recordings are 1OA, any mono or stereo sound sources encoded to Ambisonics will benefit from better spatial resolution if you encode them to 3OA instead of 1OA. Another benefit of mixing in 3OA is better decoding to binaural.
Wwise, Reaper, and ProTools all support working with Ambisonics up to 3OA. Reaper and ProTools Ultimate can support up to 7OA (64 channels), which is required when working with large speaker-array installations. However, for many VR and installation use cases, 3OA (16 channels) provides a good balance between computational resource requirements and the spatial resolution of the end result.
Most playback platforms only support 1OA, including YouTube, the most popular platform for 360 content at this time. However, working in 3OA offers more deployment flexibility in the future and better binauralization right now for sources that originated as mono or stereo.
For my current project I'm using Reaper with a 3OA mix, which I've determined is good enough for my use case. I have been using the Harpex-X plug-in for up-conversion from 1OA recordings made with my Sennheiser AMBEO VR microphone to 3OA with good results. You can mix 1OA straight into a 3OA mix, since the first four channels of any HOA B-format make up the first order. However, based on my experience so far, I get slightly better results up-converting 1OA material to 3OA first, though this could very well be a placebo effect.
By the way, 7OA mixing might be overkill. For many art installations and VR experiences, I believe that 3OA (16 channels) is good enough when working with a variety of sources, including stereo sound effects, mono dialogue, and 1OA field recordings (that's the format I'm using for the sound design of my current project). However, the advantage of 7OA is very precise placement of mono and stereo sources in the sound field, which may be required for a large dome installation and also leads to better binaural decoding.
I disagree with the New York Times research report on spatial audio, which advocates that Dolby Atmos and Pro Tools are the way to go for mixing immersive audio. There is a lot of hype surrounding Dolby Atmos at the moment and the dust has yet to settle; for installations and binaural deployments it's a waste of energy to even think about Dolby Atmos, and I believe the most prudent thing to do is to avoid it. See "Gaslighting Your Fans w/ Dolby ATMOS™" and "ATMOS doesn't make sense" below.
Resources
additional materials we did not have time for
offers significant improvements over 1st-order microphones; these improvements are particularly attractive for three reasons: (1) 2OA microphones are much better at preserving the perceptual cues necessary for a listener to precisely locate sound sources; (2) 2OA microphones provide a larger sweet spot for listeners, which is particularly valuable in dome or room installations (while 1OA microphones have a sweet spot around the size of a human head, a 2OA microphone can accommodate multiple listeners without degrading a recording's sound-location perceptual cues); and (3) 2OA microphones can be used 50% farther away from the sound source while maintaining the same directivity index. An example of a 2OA microphone is the Core Sound OctoMic; see https://www.core-sound.com/products/octomic for more information.
offers improvements over 2OA microphones, providing excellent spatial cues and a large sweet spot for more convincing being-there experiences suitable for domes and small venues. The Zylia ZM-1 is an example of a 3OA microphone. It sports 19 digital MEMS sensors distributed over a sphere and is part of a system providing direct recording to a laptop or tablet via USB, or to the ZR-1 portable recorder. The ZYLIA ZR-1 records 22 channels in total: 19 from the microphone capsules, plus two channels of stereo (pseudo-binaural) and one channel of timecode. See https://www.zylia.co/zylia-pro.html for more information.
Copyright 2023 by David Tamés, some rights reserved. This presentation (not the copyrighted images and figures) is released under a Creative Commons Attribution-ShareAlike license.
Figures have been reproduced from various sources under several licenses; please check their caption for source information. Copyrighted images and figures are used for educational purposes under fair use guidelines. Uncredited images of products are from their respective vendors. Images credited with “d.t.” are by the author and are released under the same license as the presentation.