The Future of Stereo (part 2) by Floyd E. Toole (May 1997)


By Floyd E. Toole [Floyd E. Toole is Corporate Vice President of Engineering for Harman International. He is a past president of the Audio Engineering Society and a Silver Medal Award winner. Prior to his move to the United States, he spent 25 years with the National Research Council of Canada as a scientist and psychoacoustician. His Ph.D. thesis dealt with stereo localization and binaural hearing.]

Also see: The Future of Stereo (part 1) by Floyd E. Toole (May 1997)

Two-speaker techniques for 3D sound.

The capture, storage, and reproduction of musical and other acoustical events remains a challenge, even after decades of technological developments. Last issue, in discussing multichannel approaches, I concluded that genuine progress is being made in bringing directionally and spatially enriched listening experiences to multiple as well as individual listeners. In this issue, I will explore the alternatives available when we attempt to imitate natural hearing.

It seems that every decade or so, binaural sound enjoys a revival. The technique was first demonstrated in the late 19th century, and one can only imagine how bad it sounded. Since then, acquired knowledge and technology have led to improvements, but even today binaural sound is not widely known or understood. Yet soon it will be popular, and many of us--and most of our kids--will experience binaural "3-D" audio in interactive computer games. Others will enjoy multichannel sound reproduced through five phantom speakers but actually generated by just two real speakers or a pair of headphones.

In Part I of this article, I discussed Ambisonics, a system that attempts to capture a three-dimensional sound field and then immerse the listener in a facsimile of that sound field as reconstructed by numerous speakers. Binaural techniques attempt to capture the spatially encoded sounds that enter the ear canals of an artificial "listener" at a live event and then deliver those same sounds to the ears of real listeners, thereby reconstructing the perception of the original three-dimensional sound field at different times and places.

Binaural means two ears. When you listen with your two ears, you are hearing in three dimensions--in fact, in perfect 3-D!

All of the acoustical information needed for 3-D auditory illusions is contained in the sounds arriving at our ears. Therefore, if we could encode recorded sounds in the appropriate manner and reproduce them for each of our ears, we should be able to reproduce 3-D audio experiences.

It has long been acknowledged that, in theory, the most accurate recording technique is binaural: The ears of an accurately modeled mannequin or dummy head are fitted with microphones, and the left- and right-ear signals are recorded and subsequently replayed through headphones to the ears of a listener (Fig. 1). A Bell Labs study of auditory perspective came to that conclusion in 1934 [1]. Ideally, the listener should experience an auditory illusion identical to that which would have occurred if he had taken the place of the mannequin at the original performance. As it turns out, although the binaural illusion works very well, it is not perfect. Listeners usually report a pleasantly spacious illusion, but sounds that should be perceived to be far out in front of the listener are instead localized inside, very close to, or even behind the head. Because most sounds of interest are outside and in front of us, within our field of vision, this is a serious problem.

In the '70s and '80s, numerous binaural recordings were made. Some involved whispering in the left or right ear and noises that sounded like a barber's scissors at the nape of the neck. Heard through headphones, these demonstrations sounded very realistic. Distant sounds to the side and rear were also convincing. However, although some people were persuaded that voices and noises moved convincingly outside and to the front of the head, for most listeners it was a disappointment.

These perceptual errors have been attributed to a number of factors: the lack of a visual confirmation of what is heard, the fact that the auditory illusion tracks head movements, the fact that the mannequin's ears probably are not exact replicas of the listener's, headphone performance errors, and so on. In a static listening situation, the eye/ear/brain system is not fooled. Adding correlated visual cues and a dynamic head position tracking system, with appropriate DSP corrections, is a great improvement, as has been demonstrated in the best virtual reality (VR) systems. Personalizing the system to match the listener's ears is another possibility. But with all of that, it has to be said that this form of binaural reproduction is probably not yet a solution for the masses.

Crosstalk Cancelation

What is really needed is a delivery system that makes the sound sources convincingly external and in front. We need to be able to reproduce binaural signals through speakers. But the problem is that the sound from each speaker travels to both ears; consequently, there is crosstalk, or leakage, of sounds from the left speaker into the right ear and vice versa.

However, if we know where the speakers are located and where the listener is located relative to them, it is possible to calculate or measure the unwanted crosstalk components. Then, in a component upstream of the speakers, the sounds can be processed so that when they arrive at the ears an acoustical "algebra" occurs, resulting in the left speaker communicating only with the left ear and the right speaker with the right ear (Fig. 2). The crosstalk-cancelation concept was patented by Atal and Schroeder/Bell Telephone Laboratories in 1966 [2] and used by them to study concert hall acoustics. In that study, binaural recordings made in different venues were reproduced through speakers in an anechoic chamber. The system did work, but it required an anechoic chamber and a listener locked into a sweet spot.
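The acoustical "algebra" can be illustrated at a single frequency. The sketch below is not the Atal and Schroeder circuit; it assumes an idealized, symmetric setup in which each speaker-to-ear path is just a gain and a delay (the 0.7 gain and 0.25 ms delay for the far ear are made-up values). The four paths form a 2x2 matrix, and feeding the speakers through its inverse delivers each binaural signal to its intended ear only.

```python
import cmath

def path(gain, delay_s, freq_hz):
    """Idealized speaker-to-ear path at one frequency: a gain and a pure delay."""
    return gain * cmath.exp(-2j * cmath.pi * freq_hz * delay_s)

def invert_2x2(m):
    """Invert a 2x2 complex matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("paths cannot be separated at this frequency")
    return ((d / det, -b / det), (-c / det, a / det))

def matvec(m, v):
    """Multiply a 2x2 matrix by a 2-element vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])

FREQ_HZ = 1000.0

# Assumed values: the crosstalk path to the far ear is weaker and later.
same_side = path(1.0, 0.0, FREQ_HZ)
crosstalk = path(0.7, 0.00025, FREQ_HZ)

# Rows are ears (left, right); columns are speakers (left, right).
acoustic_paths = ((same_side, crosstalk),
                  (crosstalk, same_side))

canceler = invert_2x2(acoustic_paths)

# A binaural signal meant for the left ear only.
binaural = (1.0 + 0j, 0.0 + 0j)
speaker_feeds = matvec(canceler, binaural)
at_ears = matvec(acoustic_paths, speaker_feeds)
```

Note that the right speaker is not silent: it emits the signal that cancels the left speaker's leakage at the right ear. A practical canceler would repeat this at every frequency using measured paths, and would have to deal with frequencies where the matrix is nearly singular.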

Commercial versions of crosstalk cancelation have appeared over the years, mainly as methods to expand the perceived soundstage yielded by two-channel stereo. Of these, probably the best known are Carver's Sonic Holography [3] and Polk's SDA speakers [4]. Lexicon's CP and DC series of surround processors also include a binaural "Panorama" mode that is a crosstalk canceler.

In the 1980s, Duane Cooper and Jerry Bauck focused on the original problem of accurate binaural playback and developed a series of improvements that made speaker-based listening more practical and economical [5]. These patented innovations yielded a technique that is simpler to implement than the Atal and Schroeder model and less demanding of the listening environment. Further, it's more tolerant of head movements, and it degrades "gracefully" as the listener moves out of the sweet spot. The Cooper-Bauck Transaural technology provided the basis for Harman's recent VMAx (Virtual Multi-Axis) system. In the best systems, the sweet spot, or, more accurately, the sweet region, is about the same as it has been in stereo for the past 40-odd years: long, tall, and narrow. The difference is in the auditory reward. In stereo we get to hear the featured artist floating midway between the speakers. But in speaker-based binaural listening, we can be transported to another three-dimensional world.

Fig. 1--In traditional 3-D binaural recording, mikes fitted in the ears of a dummy head record sounds that are reproduced, via headphones, in the ears of the listener.

Fig. 2--Interaural crosstalk cancelation enables playback of 3-D binaural audio over speakers by canceling the leakage of left or right speaker sound to the opposite ear.

This system can work remarkably well. Obviously, it works best when the listener is in the predominantly direct sound field of the speakers. This means that close listening, as at a computer workstation, is likely to work well. At greater distances, one must pay attention to reflected sounds, which can be done by controlling the speakers' directivity or the absorption characteristics of the room's reflecting surfaces. Best performance will always be achieved when the listening geometry matches that for which the crosstalk-cancelation filters have been designed.

Since the sounds come from speakers, head movements simply confirm that the sounds originate outside and in front of the listener.

Whereas in headphone reproduction of binaural signals it is difficult to create convincing illusions of sounds originating in front of the listener, with speakers it is difficult to create illusions behind him. However, in practice, when the sound images are in motion and if there are visual cues that correlate with the sound movements, most people drift into a susceptible frame of mind in which even these front/back uncertainties disappear.

That there should be these front/back reversal problems in headphone and speaker reproduction of binaural programs is not surprising. The problem is the location of our ears and the front/back symmetry that exists. Auditory cues alone are not enough for us to make a completely reliable front or back identification. In tests of listeners' natural localization capabilities, front/back reversals are frequently observed. In the course of our everyday lives we rely on head movements and visual cues to keep things straight. Removing or altering those normally reliable cues makes things go perceptually wrong. Since much of our natural auditory localization relies on plausibility, we must conclude that the perception of a sound source outside and in front of the head is less plausible in headphones than it is through crosstalk-canceled speakers. The reverse is apparently true for sound sources behind us. In-head localization must therefore occur when nothing else seems plausible [6, 7, 8].
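The front/back symmetry can be shown with a deliberately crude model. The sketch below assumes the ears are two points in free space, 0.18 m apart, with no head or outer ear between them (both the spacing and the model are simplifying assumptions, not measured data). Under that assumption the interaural time difference depends only on the sine of the azimuth, so a source 30° to the front right and one 30° to the rear right produce the same value.

```python
import math

EAR_SPACING_M = 0.18      # assumed distance between the two ears
SPEED_OF_SOUND = 343.0    # metres per second in room-temperature air

def interaural_time_difference_s(azimuth_deg):
    """Arrival-time difference between the ears for a distant source.

    Azimuth 0 is straight ahead, 90 is directly to the right,
    180 is directly behind. Two-point ear model, no head shadow.
    """
    return EAR_SPACING_M / SPEED_OF_SOUND * math.sin(math.radians(azimuth_deg))

front_right = interaural_time_difference_s(30.0)
rear_right = interaural_time_difference_s(150.0)   # mirror image behind the ears
```

Real ears break this tie only partially, through the direction-dependent filtering of the outer ear, which is why head movements and vision carry so much of the load.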

Fig. 3--By digitally processing head-related transfer functions (HRTFs), it's possible to electronically synthesize the left- and right-ear signals appropriate for any direction.

Binaural Steering

Because the music industry is committed to multitrack, multimike recording methods, 3-D will not be popular so long as it depends on dummy-head recording techniques. Using DSP techniques, however, it is now possible, in real time, to electronically synthesize the left- and right-ear signals appropriate for any direction (Fig. 3). To accomplish this we must know, for every direction we wish to synthesize, how the sound is modified on its way to the ears. This is determined by positioning a known sound source at various points in space around a head and measuring the head-related transfer functions (HRTFs) that correspond to each of the ears. The HRTFs are measured as impulse responses (amplitude versus time) or as the Fourier equivalent, amplitude and phase versus frequency.
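The equivalence of the two HRTF representations can be sketched with a plain discrete Fourier transform. The four-sample impulse response below is invented for illustration (a half-amplitude click delayed by one sample), not a measured HRTF; the function names are likewise hypothetical.

```python
import cmath

def dft(samples):
    """Direct discrete Fourier transform of a short real or complex sequence."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]

def amplitude_and_phase(impulse_response):
    """Convert amplitude-versus-time into (amplitude, phase) per frequency bin."""
    return [(abs(v), cmath.phase(v)) for v in dft(impulse_response)]

# Made-up impulse response: gain 0.5, delayed by one sample.
impulse_response = [0.0, 0.5, 0.0, 0.0]
response = amplitude_and_phase(impulse_response)
```

For this toy case the amplitude is 0.5 in every bin and the phase falls steadily with frequency, which is simply the frequency-domain picture of a gain and a delay. A measured HRTF would show direction-dependent peaks and notches instead of a flat amplitude.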

With all this information stored away, the binaural directional synthesizer can alter any single-channel signal to the left- and right-ear signals appropriate to the chosen direction, a process known as binaural steering.
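In signal-processing terms, steering a single-channel signal amounts to filtering it once per ear with the impulse responses stored for the chosen direction. The following is a minimal sketch of that idea, assuming a tiny hand-written table of three-sample impulse responses; the table, its values, and the function names are all hypothetical stand-ins for a measured HRTF set.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two short sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

# Hypothetical table: azimuth in degrees -> (left-ear response, right-ear response).
# For a source at the right, the left ear hears it later and weaker.
IMPULSE_RESPONSES = {
    0: ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),
    90: ([0.0, 0.0, 0.4], [1.0, 0.0, 0.0]),
    -90: ([1.0, 0.0, 0.0], [0.0, 0.0, 0.4]),
}

def steer(mono, azimuth_deg):
    """Return (left_ear, right_ear) signals for a mono input and a stored direction."""
    left_ir, right_ir = IMPULSE_RESPONSES[azimuth_deg]
    return convolve(mono, left_ir), convolve(mono, right_ir)

mono = [1.0, -0.5, 0.25]
left_ear, right_ear = steer(mono, 90)
```

A real-time synthesizer would use much longer responses, interpolate between measured directions, and probably filter in the frequency domain for speed, but the structure is the same: one input, two direction-specific filters, two outputs.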

This process has been widely used since about 1980, when affordable computers of sufficient speed and power became available. Among the best-known endeavors of this sort is the collaboration of Elizabeth Wenzel at the NASA-Ames Research Center, Fred Wightman and Doris Kistler at the University of Wisconsin, and Scott Foster at Crystal River Engineering, who focused on headphone reproduction for the military and virtual-reality applications of the technology. Another pioneering venture was that of Bo Gehring of Focal Point 3D Audio. Both of these systems, used with head position tracking, provided quite convincing 360° localizations. Several other processors now exist that can do this, and thanks to the MIT Media Lab, a set of HRTFs is available on the World Wide Web.

Durand Begault documents this subject in detail in his recent book [9]. So, there it is. We have two ways to create binaural signals, dummy-head microphones and electronic synthesis, and two ways to reproduce them, headphones and crosstalk-canceled speakers.

The Nature of the Sweet Spot

In any system involving binaural image steering through loudspeakers, there is a sweet spot: the location where the 3-D sound "picture" is most sharply in focus.

Systems that claim to have a broad sweet spot do so at the expense of localization precision; a "fuzzy" sweet spot has "fuzzy" localization. For some recordings or movies, that may be acceptable; for others it won't be. In general, listeners will find that they have considerable latitude in terms of front/back or vertical movement (the latter determined mainly by the speakers' vertical directivity). Systems will differ, however, in their tolerance of head rotation and of movement from side to side, off the axis of symmetry. Ideally, angular variations of ±20° or more should not dramatically change the illusion. Small movements from side to side should cause the soundstage to distort in an "elastic" fashion. Larger movements away from the axis of symmetry should result in a smooth degradation from a 3-D illusion to an illusion of fewer dimensions. At no time should the listener be aware of a "pulling" sensation or obvious phasiness from normal, small lateral head movements.

The concept of a sweet spot is not new to us; there has always been one for conventional stereo. That most stereo listeners ignore the sweet spot is a measure of the marginal reward for sitting there. In crosstalk-canceled speaker 3-D audio, the rewards are enormous if the technique is done well.

Fig. 4--When speakers are too close to yield a realistic soundstage, binaural synthesis permits replacing real speakers with phantom speakers having increased angular separation and apparent distance.

Speaker-Based Binaural Effects

There are a number of useful effects that can be achieved by means of binaural techniques applied to loudspeakers.

1. Speaker Spreading. In circumstances where the speakers are too close together to yield a realistic stereo soundstage, binaural synthesis makes it possible to replace the real speakers with phantom speakers having an increased angular separation and apparent distance (Fig. 4). Done well, the effect is so convincing that little or no sound is perceived to come from the real speakers. The angular separation can be varied according to the listener's taste.

2. Crosstalk Cancelation. Used alone, acoustical crosstalk cancelation permits the left speaker to communicate with the listener's left ear and the right speaker with the listener's right ear. In this sense the delivery system is not unlike headphones, but there is an important advantage: The sounds are perceived to be outside the listener's ears and in front of him.

Binaural (e.g., dummy-head) recordings can be played directly through such a system, and the result can be a remarkably convincing sense of three-dimensional space. Incidentally, not all binaural recordings are good. Differences among the artificial-head microphones, post-processing, and positioning of the head vis-à-vis the performers affect the relative quality of the few existing commercial examples.

Conventional stereophonic recordings also can be played through a crosstalk-cancelation system. Results will vary, depending on the microphone techniques and signal processing used in making the recording. My experience has been that a high percentage of recordings, especially popular ones, are enhanced in very interesting ways, with the stereo image sometimes expanding to fill the entire front hemisphere. Some recordings exhibit very dramatic and pleasant three-dimensional illusions, much more engaging than is possible with conventional stereo reproduction.

Nowadays, a few recordings are preprocessed with 3-D effects. These may or may not be compatible with speaker-based binaural playback because of double processing and its unpredictability.

3. Phantom Home Theater. If we take the outputs from a surround processor, connect them to a five-axis binaural steering device, steer the channels to the appropriate locations around a listener (say, 30° left and right for the main channels, 0° for the center channel, and 90° left and right for the surround channels), and use signal conditioning to account for peculiarities of the headphones, then the listener can experience a multichannel simulation through headphones. All of the preceding provisos for headphone listening apply, so it may or may not be a very realistic experience. However, it is likely to be better than conventional stereo.
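The headphone version of this arrangement can be sketched as five steering operations whose outputs are summed into two ear signals. The code below uses the channel angles given above, but everything else is assumed for illustration: the "HRTFs" are a toy model in which the far ear simply receives a delayed, attenuated copy, and the 30-sample maximum delay is an arbitrary figure. Headphone signal conditioning is omitted.

```python
import math

# Channel directions from the text; negative azimuth is to the listener's left.
CHANNEL_AZIMUTHS_DEG = {
    "left": -30, "center": 0, "right": 30,
    "left_surround": -90, "right_surround": 90,
}
MAX_DELAY_SAMPLES = 30   # assumed far-ear delay for a source fully to one side

def toy_response_pair(azimuth_deg):
    """Stand-in for measured HRTFs: returns (left_ear, right_ear) impulse responses."""
    lateral = math.sin(math.radians(azimuth_deg))
    near = [1.0] + [0.0] * MAX_DELAY_SAMPLES
    far = [0.0] * (MAX_DELAY_SAMPLES + 1)
    far[round(abs(lateral) * MAX_DELAY_SAMPLES)] = 1.0 - 0.5 * abs(lateral)
    # A source on the right makes the left ear the far ear, and vice versa.
    return (far, near) if lateral > 0 else (near, far)

def convolve(signal, impulse_response):
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def binaural_downmix(channels):
    """Steer each named channel to its angle and sum the results per ear."""
    length = max(len(s) for s in channels.values()) + MAX_DELAY_SAMPLES
    left, right = [0.0] * length, [0.0] * length
    for name, samples in channels.items():
        left_ir, right_ir = toy_response_pair(CHANNEL_AZIMUTHS_DEG[name])
        for ear, ir in ((left, left_ir), (right, right_ir)):
            for i, value in enumerate(convolve(samples, ir)):
                ear[i] += value
    return left, right

# A single click in the right surround channel.
left_ear, right_ear = binaural_downmix({"right_surround": [1.0]})
```

Because every channel passes through the same kind of filter, the five phantom sources differ only in their direction cues, which is the point of the approach.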

If we take the next step and create a crosstalk-canceled speaker version of the system, then the listener can have a multichannel experience from any source through just a single pair of speakers (Fig. 5). With discrete multichannel digital formats like Dolby Digital (AC-3), all main channels (five, in the case of Dolby Digital) are full bandwidth and separate. Thus, it is no longer possible to get away with lesser performance in any of the channels. Each of them can be expected to be used for a convincing directional effect. Speaker-based binaural synthesis can serve as a cost-saving alternative to an appropriate multi-speaker system by creating a phantom multichannel system in which all channels are identical in sound quality and each can be addressed independently.

Moreover, the listener can use more of his budget for two good speakers and a stereo amp rather than apportion it to many lesser components.

These synthesized multichannel techniques are not gimmicks; they are not adaptations of, or modifications to, two-channel stereo, Dolby Pro Logic, Dolby Digital, or any other multichannel system that is selected. Decoding is done before the image steering, using processing approved by the system's developer. All that is done is to synthesize phantom speakers to replace real ones. Even with these capabilities added, all of the native features of the base multichannel processors are still available. Usually, most of these are employed to enhance the spatial characteristics of two-channel stereo recordings. In the future, the "auralization" of listening spaces could become an accessory feature in such processors. We will be able to design our own "virtual" listening rooms.

4. Games and Interactive Entertainment. The ability to binaurally steer specific sounds to specific locations can significantly enhance interactive games. Good guys and bad guys can be audibly tracked and chased. Full interactivity ensures that, as the player alters the visual perspective, sounds will track correspondingly to the correct locations. This has been demonstrated over headphones in various helmet-based virtual-reality games that have used a head position tracking device to provide spatial interactivity. Speaker-based binaurally synthesized sound provides a parallel experience in computer workstations and similar situations.
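Keeping sounds in the correct locations as the perspective changes comes down to re-computing each source's direction relative to where the player is currently facing, then steering to that angle. Here is a minimal sketch, assuming horizontal-plane angles in degrees measured clockwise from a fixed "north" in the game world; the function name and angle convention are illustrative choices, not taken from any particular game engine.

```python
def relative_azimuth_deg(source_azimuth_deg, view_yaw_deg):
    """Direction of a source relative to the player's facing direction.

    Both inputs are world angles in degrees. The result is wrapped to the
    range [-180, 180): 0 is straight ahead, positive is to the right.
    """
    return (source_azimuth_deg - view_yaw_deg + 180.0) % 360.0 - 180.0

# The player turns to face a source that was at the right: it moves to dead ahead.
ahead = relative_azimuth_deg(90.0, 90.0)

# The player turns right while the source stays put: it swings to the left.
to_the_left = relative_azimuth_deg(0.0, 90.0)

# Angles either side of the rear wrap correctly instead of jumping by 360.
across_the_rear = relative_azimuth_deg(170.0, -170.0)
```

The resulting angle would then select or interpolate the HRTF pair used for that sound on each update.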

Fig. 5--Speaker-based binaural synthesis can create multichannel sound reproduced through five phantom speakers but actually generated by just two real speakers.

Challenging Conventional Wisdom

Obviously, the quality of our auditory perceptions is dictated by the integrity of the sounds arriving at our ears. Conventional stereo and multichannel sound systems use several speakers to locate sounds and to energize the reflective sound field in rooms. The room is very much a part of the sound reproduction system and, as such, represents a substantial uncontrolled variable in the critical final phase of the audio delivery system. The physical positioning of speakers and the acoustical treatment of listening rooms have become part of the personal art of audiophile stereo systems. In the end, they are all different. Close-up listening to small speakers (often called near-field listening) has become increasingly popular in professional recording as a means of reducing the influence of variable room characteristics prevailing at different studios. It gives engineers a better absolute reference for sound quality than the traditional large monitors that are subject to the variability of rooms. It is a trend that also fits nicely with the audio/video workstations that are becoming increasingly common in production and post-production tasks in film, television, multimedia, and computer games.

In multichannel surround sound systems, a persistent and common problem is the mismatch in the timbral signatures of the various speakers. Some of this may be caused by real differences between the speakers, but even if the speakers are identical, there will be differences attributable to their various positions in the room. With crosstalk-canceled speaker systems, all of the sound comes from two speakers; by definition, therefore, the phantom speakers in virtual home theater systems differ in timbre only by virtue of the differences in the HRTFs associated with their different locations. In other words, it is like listening to five perfectly matched speakers in a perfect room.

A further attribute of this form of listening is a remarkable sense of distance and depth. With the crosstalk canceled, the listener has no information by which to judge the distance of the real sources of sound--the speakers. Impressions of distance, then, are derived exclusively from cues in the recordings. It is captivating to see speakers but not to "hear" them, to be aware only of sounds occupying positions or areas in a perceptual space that extends far beyond the walls of the room.

So let us adjust our mindsets slightly. Let us not think of small speakers in close listening situations as poor substitutes for the traditional professional and hi-fi products.

Let us view them as legitimate alternatives, which in some important ways have the potential of being even better.

Convincingly good 3-D audio requires that there be a well-defined acoustical link from each of the speakers to each of the listener's ears. It is naive to think that any pair of speakers stuffed into any convenient location will work. Disappointments will abound. Fortunately, with attention to the right details, it is not difficult to achieve success.

First, the loudspeakers must be closely matched in performance and the electrical signal paths balanced.

Second, the listeners must be in a predominantly direct sound field. Sounds reflected from nearby objects or surfaces--walls, tables, or the workstation (monitor, computer, keyboard, printer, and the like)--will corrupt the illusion. Controlling the directivity of the speakers to avoid reflections is a further benefit. In the latter case, controlling horizontal as well as vertical directivity would be advantageous.

Third, the left and right acoustical signal paths must be the same length. The listener must be close to the axis of symmetry of the speakers, equidistant from both. Because of the position of the user's chair, the computer workstation environment lends itself to this type of system. However, with appropriate attention given to the design of speakers and to the listening environment, it is possible to provide excellent experiences at much greater distances.
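The sensitivity to sideways movement can be estimated with simple geometry. The sketch below assumes two speakers on a line, a listener's head treated as a single point, and a workstation-like layout with the speakers 0.6 m apart and the listener 0.6 m back (all illustrative values). It computes how much later the left speaker's sound arrives than the right's as the listener shifts off the axis of symmetry.

```python
import math

SPEED_OF_SOUND = 343.0   # metres per second in room-temperature air

def arrival_difference_s(speaker_spacing_m, listener_distance_m, lateral_offset_m):
    """Left-path minus right-path arrival time at the listener's head center.

    The speakers sit at -spacing/2 and +spacing/2 on one line; the listener is
    listener_distance_m in front of that line and lateral_offset_m to the right
    of the axis of symmetry (negative values mean to the left).
    """
    half = speaker_spacing_m / 2.0
    left_path = math.hypot(lateral_offset_m + half, listener_distance_m)
    right_path = math.hypot(lateral_offset_m - half, listener_distance_m)
    return (left_path - right_path) / SPEED_OF_SOUND

on_axis = arrival_difference_s(0.6, 0.6, 0.0)
shifted_right = arrival_difference_s(0.6, 0.6, 0.1)
same_shift_far_away = arrival_difference_s(0.6, 3.0, 0.1)
```

On the axis the difference is zero. The same 10 cm shift produces a smaller timing error at 3 m than at 0.6 m, which is one reason greater listening distances can work if reflections are kept under control.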

In all instances, it is assumed that the listener has normal hearing in both ears. If this is not the case, the system breaks down.

It is also assumed that the listener's head and ears have acoustical properties similar to those embodied in the HRTFs used for the binaural steering and crosstalk cancelation. While there are powerful features that all people have in common, there are certain to be instances in which different people hear slightly different effects from these systems. No one listener, therefore, can be the arbiter of absolute quality. But experience does show that it is possible to design a system that can provide remarkably convincing performances for most listeners.

So, there it is--my version of where we are and where we might be going. I have attempted to distinguish between those systems that are social, in that they allow for multiple listeners, and those that are designed to satisfy only a single listener. There are clearly times and places for both types.

Technologies now exist that can give us a choice, and certainly more will come.

I find the whole trend very exciting, because at last we have broken the two-channel stereo stalemate. We knew long ago that there were better ways. As a stopgap (a 40-year one!), stereo has been very enjoyable.

It is time to move on, though. Are you ready?


REFERENCES

1. Steinberg, J. C. and W. B. Snow, "Auditory Perspective--Physical Factors," Electrical Engineering, January 1934 (pp. 12-17).

2. "Apparent Sound Source Translator," U.S. Patent No. 3,236,949, granted to B. S. Atal and M. R. Schroeder, assignors to Bell Telephone Laboratories, Inc., 1966 (filed 1962).

3. Carver, Robert W., "Sonic Holography," Audio, March 1982.

4. Polk, Matthew, "Polk's SDA Speakers: Designed-In Stereo," Audio, June 1984.

5. Cooper, D. and J. Bauck, "Prospects for Transaural Recording," Journal of the Audio Engineering Society, January/February 1989 (pp. 3-19).

6. Toole, F. E., "The Acoustics and Psychoacoustics of Headphones," AES Preprint No. C1006, 2nd International Conference, May 1984.

7. Toole, F. E., "Binaural Record/Reproduction Systems and Their Use in Psychoacoustic Investigations," AES Preprint No. 3179, 91st Convention, October 1991.

8. Toole, F. E., "In-Head Localization of Acoustic Images," Journal of the Acoustical Society of America, October 1970 (Vol. 48, pp. 943-949).

9. Begault, Durand, 3-D Sound for Virtual Reality and Multimedia, AP Professional (Chestnut Hill, Mass., 1994).

10. Web sites with comprehensive bibliographies and other useful information on Ambisonics and related subjects:

www.omb.unb.cd-mleese/

www.aber.ac.uk/~dgw/3daudio.htm

Adapted from Audio magazine (June 1997).
