Immersive monitoring: A perceptive perspective
Localisation makes use of some of the most energy-consuming and fastest-firing synapses in the brain, a strong indication that the capability has been important for survival. Hearing, balance/acceleration and proprioception are our main look-ahead senses; without them, the roughly 0.4 s latency of conscious perception could get us hurt many times a day, for instance if we had to rely on vision alone.
Hard-wired reflexes from the fast senses therefore play a crucial role, also when sound is accompanied by picture, conveying dimensionality, suspense and surprise. One of the first things a baby does is to localise, quickly and automatically turning its eyes towards a sound. Up until adolescence, we continue to learn and refine localisation using a system still under construction: the ear canals and the visible structures of the outer ear (the pinnae) grow and reshape, constantly modifying our spherical hearing, as we reach out and experience a fascinating world in return.
Pinnae remain entirely personal. To some extent, they also keep developing throughout life, though the rate of change slows in adults. Sound is coloured by the pinnae depending on its direction of arrival (azimuth and elevation), a highly important feature. Expert listeners use it constantly in combination with head movements, not only when evaluating immersive content but also to distinguish direct sound from room reflections.
Personal head-related transfer functions (HRTFs) drive localisation at frequencies above roughly 700 Hz, the range where interaural level difference (ILD) is the primary cue. From 50 Hz to 700 Hz, however, fast-firing synapses in the brainstem are responsible for localisation, employed in a phase-locking structure to determine interaural time difference (ITD). Humans can localise at even lower frequencies, but we will come back to that in a dedicated ultra-low-frequency blog.
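To get a feel for the magnitudes involved, the classic Woodworth spherical-head approximation relates azimuth to ITD. Here is a minimal Python sketch, assuming an average head radius of 8.75 cm (real heads vary, which is exactly why personal HRTFs matter):

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, in air at roughly 20 degrees C
HEAD_RADIUS = 0.0875     # m, a commonly assumed average; real heads vary

def woodworth_itd(azimuth_deg: float) -> float:
    """Approximate interaural time difference (seconds) for a source
    at the given azimuth, using the classic Woodworth spherical-head
    model: ITD = (a / c) * (theta + sin(theta))."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

for az in (15, 45, 90):
    print(f"{az:3d} deg -> {woodworth_itd(az) * 1e6:4.0f} us")
```

At 90 degrees the model gives about 0.66 ms, which is already close to half a period at 700 Hz (about 0.71 ms); that is one intuition for why phase-locked ITD cues lose reliability above that range.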
The ability to position sound sources precisely anywhere on the sphere is a key benefit of immersive systems. Another is the ability to influence the sense of space in human listeners. For the latter, the lowest two octaves of the ITD range (i.e. 50-200 Hz) play an essential role, but this range may be compromised in multiple ways: microphones with insufficient physical spacing during pick-up, synthesised reverb without the right kind of decorrelation, lossy codecs that collapse channel differences, loudspeakers with limited LF capability, bass management, and so on.
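To illustrate one way such degradation can be checked, here is a minimal Python sketch (assuming NumPy and SciPy are available; the function name is illustrative) that measures the correlation between two channels in the 50-200 Hz band. Comparing the figure before and after, say, a lossy codec round trip indicates whether channel differences in this band survive:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def low_band_correlation(left: np.ndarray, right: np.ndarray,
                         fs: float, band=(50.0, 200.0)) -> float:
    """Pearson correlation between two channels after band-passing
    to the lowest two octaves of the ITD range (50-200 Hz default).
    Values near +1 mean the channels have collapsed to near-mono in
    this band; lower values indicate preserved decorrelation."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    l = sosfiltfilt(sos, left)   # zero-phase filtering, no added delay
    r = sosfiltfilt(sos, right)
    return float(np.corrcoef(l, r)[0, 1])
```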
So where does all this lead when it comes to immersive reference monitoring? A well-aligned loudspeaker system in a fine room has the best chance of translating well to a variety of immersive playback situations. The sound engineer can make full use of outer ear features and head movements, with listener fatigue and "cyber sickness" minimised.
Headphone-based immersive monitoring needs to incorporate precise, personal HRTFs and head tracking around an n-channel virtual reference room. Even so, any static or temporal imperfection can lead to listener fatigue, and head movements during production are unlikely to produce anywhere near the same results as they do during reproduction across platforms.
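For illustration, the signal flow of such a renderer can be reduced to a few lines. The following Python sketch is deliberately simplified, assuming a pre-measured HRIR set indexed per whole degree of azimuth, head tracking reduced to yaw only, equal-length signals, and offline processing; a real renderer would add elevation, HRIR interpolation and low-latency block processing:

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(channels: dict[float, np.ndarray],
                    hrir: dict[int, tuple[np.ndarray, np.ndarray]],
                    head_yaw_deg: float) -> np.ndarray:
    """Render virtual loudspeakers to two ears. `channels` maps each
    speaker's azimuth (degrees) to its signal; `hrir` maps a rounded
    azimuth to a (left, right) head-related impulse response pair.
    Rotating the head by +yaw is equivalent to rotating every virtual
    speaker by -yaw, which is all the head tracking done here."""
    out_l, out_r = None, None
    for speaker_az, sig in channels.items():
        rel_az = round(speaker_az - head_yaw_deg) % 360
        h_l, h_r = hrir[rel_az]            # nearest-degree lookup
        l = fftconvolve(sig, h_l)          # equal lengths assumed,
        r = fftconvolve(sig, h_r)          # so the sums align
        out_l = l if out_l is None else out_l + l
        out_r = r if out_r is None else out_r + r
    return np.stack([out_l, out_r])
```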
About the Author
Thomas is one of the fathers of the loudness and peak-level measurement standards used widely across music production, streaming, broadcast and OTT. Perception has been at the centre of his professional life, first as a physician and later in pro audio research. Thomas has written a number of papers; he is a senior technologist at Genelec, and is the convenor of an EU expert group tasked with the prevention of recreational hearing loss.
Main image: Jungle Studio's Steven Boardman. Credit: Rob Jarvis.