Flying Blind--The case against long-term listening (March 1997)

by Tom Nousaine

We use our ears all day, every day, so listening doesn't seem difficult--until, that is, we are asked to judge the sound of audio equipment. Making observations that accurately describe the sound of a speaker, and that are specific enough to help a designer improve the product, is never easy. Consistent, reliable data can be elusive. Yet many manufacturers use open, uncontrolled listening comparisons when they're designing hi-fi components, and practically all retailers rely heavily on such methods to sell products.

Indeed, an attitude shared by many reviewers, professionals, and audiophiles is that experimental controls are both intrusive and unnecessary--that open, extended listening, over periods longer than one gets in any store or in any single listening session, is mandatory to uncover subtle aspects of sonic performance.

The opinions that flow from long-term listening tests are an important part of many product reviews and sales pitches, and they underlie the listening strategies often recommended. The standard advice calls for a listener to relax, put up his feet, open his mind, listen over an extended interval, and trust his ears. He is often warned that switching components or making direct comparisons of different audio gear during the session will interfere with his connection to the product, thereby generating stress and reducing hearing sensitivity. None of this is news to experienced enthusiasts, and though it sounds rational enough to most people, this advice seems at odds with what we know about the human sensory system.

Humans, and nearly all other animals, are most sensitive to a stimulus when first exposed to it. We are consciously aware of the fan only when it is turned on or off; with continued exposure, its drone tends to disappear into the background. Further, common sense tells us that we are most sensitive to a stimulus when we are at attention. How does your dog or cat react to the sound of a can opener or to a strange sound? He pricks up his ears, snaps to attention, and turns his head to localize and face the sound. Why? This response helps him gather the maximum amount of information, the most detail.

Let's take the example a little further. Have you noticed how acutely you respond during a moment of urgency, when you have just received a kick of adrenaline? Why does this happen? During an especially stressful or potentially dangerous situation you quite naturally reach a state of maximum alertness and heightened awareness; your ability to assimilate new information is elevated. Did you ever get an adrenaline blast from closing your eyes and drifting off to Pachelbel's Canon in D Major? No, because your senses are at minimum sensitivity and your mind contributes more to the experience than the environment does.

We also tend to be most sensitive to differences on direct comparison. The differences between off-white paint chips, for instance, are most apparent when the chips are viewed side by side.

However, if you put them in separate rooms, you may have trouble telling them apart. At the very least, any differences between them will seem diminished. Minor differences may ultimately prove to be inconsequential, but they will always be highlighted by a direct comparison.

On the other hand, it is clear that training and experience can increase listener sensitivity and reliability. For example, many people have to learn to hear stereo imaging, because they are not aware that recorded sound has a spatial character or that this might be an important aspect of audio system performance. In fact, it's highly desirable to use well-trained listeners in most research dealing with subjective evaluation of such things as data-reduction schemes like those used in Dolby Digital (AC-3) and MiniDisc (ATRAC).

Some researchers have indicated that one well-trained, experienced listener may be worth as much to their studies as eight untrained subjects.


In the late 1980s, David L. Clark (of DLC Design) and Larry Greenhill conducted an experiment that compared the efficiency of long-term, single-unit listening to shorter-term, switched, side-by-side comparisons. In the single-unit experiment, they used identical-looking black boxes that contained either a simple straight-wire bypass or a circuit that introduced 2.5% distortion to the music signal fed through it. (The distortion circuit generated harmonic distortion that remained a fixed percentage of the output at all signal levels.) Each test subject was given one box, which could be either the bypass model or the distortion box.

Each subject connected the box in the tape loop of the preamplifier of his own audio system and took as long as he wanted to determine if the box he was given produced distortion or was clean.

Listeners were encouraged to use any listening technique (except opening the box) to reach a decision. They didn't have to identify the type of distortion or make any determination except whether the box was "clean" or "dirty" (i.e., whether it produced clean or distorted sound). When the results were tabulated, it turned out that members of two different audiophile clubs had been unable to reliably identify whether they had been given the bypass box or the distortion box.

In a subsequent experiment, however, which included a 45-minute training session and switched double-blind comparisons, the subjects were able to reliably hear the distortion. During this test, subjects were first exposed to music with 13% distortion, a level that yields plainly garbled sound, like a bad AM radio. After the initial training period, the distortion was reduced to 4% and 2% with music and then 0.4% with a sine-wave test signal. Using an ABX double-blind switchbox, subjects were able to reliably identify all these levels of distortion. (With the ABX Comparator, a subject has unlimited, at-will access to two signals, "A" and "B," and to an "X" signal that's identical to either A or B. The ABX circuit randomly assigns A or B to the X position at the beginning of each trial and keeps track of which signal was used as X each time.) In this experiment, switched direct comparisons proved to be more sensitive at revealing distortion than long-term open listening to a known (and relatively high) level of distortion. However, this was just one experiment, and the program material used in the at-home, long-term test was not controlled. Moreover, more than half a decade has elapsed since it was conducted. Many audio enthusiasts believe there have been significant advances in audio equipment since that time and that the debate over listening methods has never been settled. Clearly, it was time to repeat the experiment originally done by Clark and Greenhill.
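The ABX procedure described above can be modeled in a few lines of Python. This is only a sketch of the logic, not the circuitry of the actual ABX Comparator: the class name, the method names, and the two simulated listeners are all hypothetical, introduced here for illustration.

```python
import random


class ABXComparator:
    """Hypothetical software model of an ABX switchbox (illustrative names)."""

    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.log = []      # one (hidden X assignment, listener's answer) pair per trial
        self._x = None

    def start_trial(self):
        # Randomly assign A or B to the X position; the listener is not told which.
        self._x = self.rng.choice("AB")

    def listen(self, position):
        # Report which source is actually playing at the selected switch position.
        return {"A": "A", "B": "B", "X": self._x}[position]

    def answer(self, guess):
        # Keep track of which signal was used as X, alongside the listener's answer.
        self.log.append((self._x, guess))

    def score(self):
        return sum(1 for x, guess in self.log if x == guess)


def perfect_listener(box):
    # Assumption: this listener always hears which source X matches.
    return box.listen("X")


def guessing_listener(box):
    # Assumption: this listener hears no difference and answers at random.
    return random.choice("AB")


def run_session(listener, trials=16, seed=None):
    box = ABXComparator(random.Random(seed))
    for _ in range(trials):
        box.start_trial()
        box.answer(listener(box))
    return box.score()


print(run_session(perfect_listener, seed=1))   # 16
print(run_session(guessing_listener))          # varies from run to run
```

A listener who truly hears the difference scores 16 of 16 in this model, while a guesser averages about 8, which is why the number of correct answers can be tested against chance.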


For this experiment, I prepared CD-R versions of Joan Baez singing "Diamonds and Rust" that either contained 4% of the same distortion used in the prior experiment or were clean, bit-for-bit digital copies of the original CD (The Best of Joan Baez, A&M CD3234). The distorted versions were taken from a DAT recording of the track made from the analog output of David Clark's Audio Chamber of Horrors box, a device that generates calibrated amounts of different types of distortion. I used what Clark calls "grunge" (the same distortion used in the earlier study) and transferred the distorted versions to CD-R.

Clean samples were made via a direct digital transfer of the original CD from a Marantz CD-63 player to a Marantz CDR-610 CD recorder through a 1-meter AudioQuest Quartz cable. Each CD-R was labeled with a coded serial number written on the face of the disc with a water-based marker.

Sixteen audiophile subjects (see Table) from Illinois, California, and Canada were given discs and a score sheet. Five CD-Rs were used in all. Three had added grunge, and two were clean. The assignment of discs and subjects was determined by coin flips. Subjects were told that either a "certain level of harmonic distortion" was added to the disc or the disc was clean and free of any additional processing. They were asked to take as long as they needed to decide whether the disc was clean or dirty, mark the score sheet, and return the disc. The subjects were told they could use any listening methods they wished except a direct comparison to another CD of the song played on a second CD player. And they were forbidden to discuss the results with other participants who had not completed the test. The initial five discs were assigned in May 1996, and the discs were reissued to other subjects as they were returned. The final subject finished the test on October 28, 1996.

The subjects were intensely interested in the results and wanted to know immediately after submitting their score sheets whether they had correctly identified their disc. One subject shouted "Yes," with a sharply raised fist, when told he had correctly identified his disc as clean. Respondents nearly always expressed surprise and disappointment with an incorrect answer. The longest any subject kept a disc was 13 weeks.

One subject returned a disc after a single day, but the average participant took three weeks to complete the listening assignment.

After all scores were tabulated, only 10 of the 16 subjects had correctly identified their discs as clean or dirty. Five of seven subjects correctly identified a clean disc, and five of nine correctly identified a dirty one.

(These results are not statistically significant at the 95% criterion level, using a one-tailed test of significance.) Speaking in statistical terms, we do not reject the null hypothesis that the results are attributable to chance alone and not to a systematic factor--because subjects were unable to prove they could hear the difference between the clean and dirty discs. Twelve of 16 correct answers are required to confirm that subjects were not just guessing. So, over the long term, subjects were not reliably able to tell when 4% distortion was added to a recording, even when they listened with their own audio systems and faced no time limits.
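The 12-of-16 threshold follows from the binomial distribution, assuming each answer is an independent 50/50 guess under the null hypothesis. The short Python sketch below computes the one-tailed probabilities; the function names are illustrative and are not taken from the original study.

```python
from math import comb


def p_at_least(k, n, p=0.5):
    """One-tailed probability of k or more correct answers out of n by pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_correct(n, alpha=0.05):
    """Smallest score whose one-tailed probability under guessing falls below alpha."""
    for k in range(n + 1):
        if p_at_least(k, n) < alpha:
            return k
    return None


print(min_correct(16))                # 12
print(round(p_at_least(10, 16), 3))   # 0.227
print(round(p_at_least(12, 16), 3))   # 0.038
```

Under these assumptions, a score of 10 of 16 would turn up by guessing alone more than one time in five, well short of the 95% criterion.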

[Table: Heat 1--Long-term listening. Results by auditioning period.]


For the following session, I asked one of the subjects, who had kept his CD-R for the longest time and who had incorrectly concluded that a dirty disc was clean, to participate in a single-listener, double-blind, switched ABX comparison. For this experiment, a second CD-R of the same Joan Baez song ("Diamonds and Rust") was made in mono, this time with distortion added to the left channel while the right channel was kept clean. This facilitated comparisons of clean and dirty mono programs with a single CD-R loaded in the Marantz CD-63 player, enabling me to avoid timing discrepancies that sometimes occur between two simultaneously running CD players. With cheap RCA Y adaptors, the distorted left channel was connected to the A (for Awful) inputs and the clean right channel to the B (for Beautiful) inputs of the ABX box.

Thus, the signal at the selected switch position appeared in both earpieces of Etymotic Research ER-4S in-the-ear earphones. The 'phones were driven directly from the ABX box's output jacks, and the volume was adjusted from the CD player's remote control.

The subject was first given a 10-minute trial run, with 13% distortion added to one channel. Next, he participated in a 16-trial ABX test that compared 4% distortion levels versus a clean signal. He could choose any switching protocol he wanted.

Running the disc straight through and switching in any fashion he desired, the subject identified clean and dirty signals correctly only seven of 16 times, indicating he was unable to reliably identify the distortion. Next, I used the CD player's A-B repeat to define a 25-second interval on the disc. Listening only to this segment, the subject identified 12 out of 16 correctly--the first statistically significant positive result of the experiment. This confirmed that the subject was able to hear the difference between the clean and dirty signals, with less than a 1-in-20 chance that guessing alone would produce such a score. When I shortened the A-B repeat interval to 6 seconds, the listener scored 16 out of 16 correct, leaving virtually no doubt that he could hear the distortion. (I should note that the total test sequence, including the warm-up, took about an hour; the final two sessions consumed less than half that.)
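The three ABX scores can be put on a common footing by asking how often a pure guesser would do at least as well. The Python sketch below does that arithmetic with exact fractions, assuming independent 50/50 guesses on each of the 16 trials; the function name and the run labels are illustrative.

```python
from fractions import Fraction
from math import comb


def guess_odds(correct, trials):
    """Exact one-tailed probability of scoring `correct` or better by guessing."""
    tail = sum(comb(trials, i) for i in range(correct, trials + 1))
    return Fraction(tail, 2**trials)


runs = [("full track", 7), ("25-second loop", 12), ("6-second loop", 16)]
for label, correct in runs:
    p = guess_odds(correct, 16)
    print(f"{label}: {correct}/16 correct, p = {p} (about {float(p):.4f})")
```

Under these assumptions, 7 of 16 is fully consistent with guessing, 12 of 16 would arise by chance about once in 26 attempts, and 16 of 16 only once in 65,536.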


My results confirm my initial description of the human sensory system. Humans are most sensitive to a stimulus when first exposed to it and can more reliably discern differences in sound quality by immediate comparisons than through long-term exposure. As test signals, I used common program material (pop music)--clean, and with distortion added of a type and amount previously shown to be audible. The shorter the comparison interval, and the more rapid and direct the comparisons, the better the results were.

In other words, to highlight differences in the sound of audio components, direct A/B comparisons provide maximum listening acuity. Shorter and more similar comparative periods maximize sensitivity (praise be for the player's A-B repeat function). However, I would emphasize that direct comparisons seldom have the level of experimental control used in these listening sessions, in which blind ABX presentations and precise level matching played integral roles.

You might question whether the poor long-term listening results could be attributed to low-quality audio systems. But I chose subjects who were audio enthusiasts, people who were already familiar with the concept of distortion and who were interested in the ability to hear differences between hi-fi components. They tended to own above-average audio gear. At least four of the participants had what I would call exotic systems, and two were audio-store salesmen who had easy access to high-end equipment.

The evidence of these experiments, combined with a common-sense approach to human hearing, tells us that the most sensitive listener is the one who is alert, who is at attention, and who uses direct, side-by-side comparisons. The cat-and-can-opener model provides the best operating blueprint for a high-acuity listener. The only thing that would improve results would be to train the cat.

So listen with your ears and brain when assessing audio gear. And listen with your heart for enjoyment--after you've checked out the gear.

Adapted from: Audio magazine, Mar. 1997

Also see:

A/B/Xing DCC (Apr. 1992)

Do CDs Sound Different? (Nov. 1987)

Updated: Sunday, 2018-09-09 8:32 PST