
Issues of Reliability and Validity in Subjective Audio Equipment Criticism

by Laurence L. Greenhill, M.D.

(The Audio Amateur, Issue One, 1979)

Scientific techniques which might improve listening tests.

SEVERAL YEARS AGO I tackled the thorny intellectual problem of assembling an excellent audio component system out of the more expensive lines of equipment. I quickly discovered that the logical decision-tree process which might apply to other consumer items becomes exceedingly difficult in the area of high-end audio equipment. Myriad pieces of similar equipment all sport excellent specifications, so a judgment cannot be made on numbers alone. The reputation of a company or a given product is often very unstable among audiophiles and dealers, and not a good basis for a thousand-dollar purchase.

So I turned to the various high fidelity publications which were chock-full of equipment reviews. As Moncrieff (1978) has indicated, the data from these reviews were not always helpful.

Some critics (test bench reviewers) simply validated manufacturers' equipment specification sheets, while others wrote poetic, subjective descriptions so personalized that the critic really seemed to be reporting on his inner emotional state at the time. I felt hopelessly lost.

I decided to become my own audio critic. Local audio salons kindly allowed me to borrow equipment to audition in my own home. Into my living room for subjective evaluation came large numbers of stereo components, including 14 power amplifiers, six preamplifiers, three pre-preamplifiers, three crossover networks, nine speaker systems, and three turntable-arm combinations. I purchased many items, lived with them, and later traded or sold them for other pieces of equipment. I am certain many audiophiles experiment and learn about equipment in the same manner.

I put a great deal of time, feeling, and thought into the choices I made, and in the process developed my listening abilities for the often elusive qualities described in the subjective audio reviews.

In the end I possess a system that gives me great pleasure, but I am left with some nagging intellectual concerns about the state of the reviewing art.

------------

ABOUT THE AUTHOR Laurence Greenhill, M.D., is a child psychiatrist engaged in full-time academic pharmacology research at the New York State Psychiatric Institute. He currently holds a Career Development Research Scientist Award which supports his work in the area of treating emotionally troubled children with psychotropic drugs. He has maintained a strong interest in electronics and in music for many years, and has held a general class amateur radio license for the past twenty-two years. Many of the ideas which appear in this article were developed and originally applied in his scientific work, which involves the measurement and statistical interpretation of changes in children's mood and behavior.

 

----------------

 

 

Many respected critics have strongly urged that a holistic, unified methodological approach to audio equipment criticism be developed (Moncrieff, 1978; Meyer, 1978; Heyser, 1978). All agree the split between the "test bench reviewer" and the "golden ear subjective reviewer" is currently irreconcilable, for each utilizes different methodologies and languages. Two general solutions have been advocated.

The first lies in the direction of finding laboratory measures which better detect transient distortions, picked up by the ear and experienced as fatigue, that are not now described by static distortion tests (Jung, 1977b). Critics now correlate the new transient distortion lab tests with the subjective experience of listening in order to compare several pieces of similar equipment from different manufacturers. Recent studies of power amplifiers using subjective reports coupled with laboratory tests of IM distortion are a good example of this first approach.

The second solution has not yet been put into practice, but has been discussed by Richard Heyser (1978). He points out that the goal of modern sound reproduction is the creation of an "acceptable illusion in the mind of the listener." If we can develop a new meta-language based on the rules of human perception to translate between certain objective and subjective descriptions of the same event, we will be able to correlate what we measure with what we hear.

Excellent reasons explain the absence of a quantified approach to the subjective experience of listening. Young (in Meyer, 1978) has pointed out that the repeatability of experiments, which forms the foundation of the scientific method, is inexorably complicated by the inner subjective experience of listening. Feelings, memories, and training combine with a listener's active work to align the auditory cues from the stereo into an illusion. These are all unstable, highly individual, mostly unrepeatable personal experiences. Yet these highly idiosyncratic elements can explain the divergent reports from different subjective critics on the same piece of audio equipment.

Although much of this highly personal process must remain private, we can reduce a good deal of variance among observers by applying certain key methodological limitations on the measurement of the auditory perceptual processes involved.

Current psychological research methods have dealt with the reduction of such noise in testing systems through techniques of prolonged baseline observations, the use of operational definitions of qualities being measured, statistical measures of repeatability, and the careful control of all independent variables in the system before testing the dependent variable (the new piece of audio equipment). A careful discussion of these concepts will illustrate how their application to subjective reviewing could narrow the gap between what we hear and what we measure.

Concepts in Testing Perceptual Processes

1. Reliability and Stability of Baseline Measures.

Reliability refers to the reproducibility of observations. Inter-test reliability is a statement about the similarity of repeated tests on the same equipment by the same observer. Will a given reviewer consistently prefer the Beveridge "System 2SW" over the Dayton Wright XG-8 MK3? Inter-rater reliability describes the amount of agreement among different raters making observations on the same equipment at the same time. Would all five listeners in a panel simultaneously prefer the Beveridge speaker over the Dayton Wright, and for the same reasons? Reliability can be approached in several ways. First of all, one must know the stability of baseline observations on a reference listening session. One can measure this stability by making sure the listener or listeners rate the session with the aid of a fixed schedule of rating questions with intensity scales and operational definitions. Thus a rater would answer a typical question as follows:

BOX A:

Please rate your listening fatigue in relation to the component being tested over a 15-minute period while playing (given record, volume settings specified). You would rate your fatigue as follows:

Please Choose One:

1. None: can listen with pleasure.

2. Slight: immediately think of other units which you feel are less strident.

3. Moderate: think of turning down volume.

4. Marked: think of turning down volume over five times.

Notice how each answer choice has its own numerical criterion spelled out operationally. Such specificity helps observers stabilize and regulate the accuracy of their observations.
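A minimal sketch, in Python with hypothetical rater and unit names (none of this appears in the original article), shows how such a rating form can be encoded so that every observation becomes a number tied to its operational definition:

# A hypothetical encoding of the Box A fatigue question: each numerical
# score is bound to the operational criterion a rater must apply.
FATIGUE_SCALE = {
    1: "None: can listen with pleasure.",
    2: "Slight: immediately think of other units which seem less strident.",
    3: "Moderate: think of turning down volume.",
    4: "Marked: think of turning down volume over five times.",
}

def record_rating(rater, unit, score):
    """Store one operationally defined fatigue rating as a data point."""
    if score not in FATIGUE_SCALE:
        raise ValueError("score must be an integer from 1 to 4")
    return {"rater": rater, "unit": unit, "fatigue": score,
            "criterion": FATIGUE_SCALE[score]}

# Example: a critic rates a hypothetical "Unit A" as 'Slight' fatigue.
print(record_rating("critic 1", "Unit A", 2))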

Furthermore, such schedules of ratings produce numbers which can be tested statistically. These statistics give a concise statement about reliability, usually in the form of a Pearson correlation coefficient, expressed as a number r. The value of the correlation coefficient is that one can make a statement about the probability (from statistical probability tables) that the critics' agreement and their observations are highly stable and not likely to be due to chance. If two critics have made independent listening assessments on six different units, their answers to a fixed schedule of rating questions (which includes intensity scales and operational definitions) can be compared by a standard formula, as listed in a standard statistical text such as Roscoe (1969).

If their answers correlate well, and the statistical formula applied to these observations gives a high numerical score (such as a Pearson r of 0.707), the reader can be confident that the agreement between these critics is due to consistent perceptual auditory trends. A statistical probability table states that for a Pearson r of 0.707, such agreement on six different observations would occur by chance only five percent of the time. Such a statistic allows the reader to judge the degree to which chance enters into any two critics' agreement on a given piece of equipment.
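To make the arithmetic concrete, here is a minimal sketch, assuming Python with the scipy library and using invented ratings, of how two critics' scores on six units yield both the Pearson r and the probability that their agreement is a chance result:

# Inter-rater reliability as a Pearson correlation (invented data).
from scipy.stats import pearsonr

critic_1 = [1, 2, 2, 3, 4, 1]  # fatigue ratings of six units by critic 1
critic_2 = [1, 2, 3, 3, 4, 2]  # ratings of the same six units by critic 2

r, p = pearsonr(critic_1, critic_2)
print("inter-rater reliability r = %.3f" % r)
print("probability of agreement this strong by chance: p = %.3f" % p)
# With only six observations, a high r is needed before the agreement
# becomes unlikely (p < 0.05) to be due to chance.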

Why is reliability important? It gives a quantitative idea of the stability of observations, so real differences in equipment (e.g., a ringing tweeter) measured by electronic means can be correlated with a perceptual response in the listening observers. This helps to increase validity, which is the amount of agreement among different types of tests (e.g., distortion measurements, listening measures, and measures of the listener's emotional state) used to determine the truth of an experimental listening observation.

Some critics have based their whole critical reviewing style on the use of laboratory measures to validate subjective auditory experiences, but unfortunately none has published reliability figures. The "hardness" of these listening data could be increased by publishing inter-test and inter-rater reliability statistics.

2. Control of All Independent Variables in the Test Situation

Many critics have strongly advised controlling test-situation variables (Moncrieff, 1978), but little data is consistently published describing the actual experimental listening equipment, sources, room, etc. Moncrieff (1978) states this indirectly in his article "The M Rule," in which he writes, "No evaluation of a device can be scientific if that evaluation is carried out through other devices that are imperfect." In practice, controlling the experimental situation means that the critic should list for the reader some of the following variables:

a. Description of the listening room: the volume, shape, furnishings, reverberation time, and listening position must be listed. Such basic technical and structural information, of the kind used by acoustical engineers, would be helpful.

b. Description of the listener: it would be helpful to know the age and hearing ability of the listener. I strongly believe all audio critics who advocate purchases of expensive equipment should have their own ears' frequency response listed.

The emotional state of the observer is a second factor. Colloms (1978) has indicated that his listening panel's reports were highly influenced by the anxiety level of the panel. Although anxiety level is hard to describe and quantify, a simple self-scored mood rating scale, such as the Profile of Mood States described by McNair (1971), would provide valuable quantitative information.

I know that my personal positive or negative reaction to both sources and to equipment can be highly influenced by my emotional state.

The listening measures' validity could be increased by introducing a panel where known differences exist. Such a panel might include a young critic (under 20) whose hearing and tastes would be different from those of an older, more experienced critic. A woman should definitely be on all listening panels because of known differences in female hearing.

c. Description of subjective responses to the visual appearance of the new equipment in relation to the fixed associated equipment:

It would be helpful to know how much the equipment's appearance influences the critic. I believe the packaging of a consumer electronic item is one of its most important features and can bias the listening response. The visual appearance can be rated on the structured rating form.

d. Description of the independent equipment variables: this is essentially a list of all other sources, equipment, and sound levels (in dB) at the critic's listening position. Some standards could be established among critics for the age of equipment, when the equipment last had routine distortion-level tests, and the age and condition of the source material (that treasured record may have had its high-frequency content worn away). Such technical details can be listed in smaller typeface under a heading of "methods," as in articles on psychological research. Such a "methods section" might appear as in Box B.

This "methods section" can be made far more exhaustive.

The hypothetical test conditions are described to illustrate how one can control some of the independent variables involved in a listening test. Secondly, the likelihood that a given electronic distortion test will correlate with listening tests will increase greatly if the listening tests are stabilized, controlled, and reproducible.

Pearson (1978) comes closest to this suggested methodological approach to audio listening tests. His magazine, The Absolute Sound, routinely publishes a short methods section, under "The Reference Systems," for each critic. Although reliability data do not appear, inter-rater agreements and disagreements are accessible to the reader. Units are tested by at least two critics who openly discuss their individual findings, emotional responses, and individual audio equipment tastes. A debate between two critics is most informative, for each examines the unit under test for a variety of important consumer-oriented qualities, including the unit's reliability over time, ease of installation, and unit-to-unit reliability. The critics cite names of records and tapes, types of associated equipment, and comparisons between the new unit and certain equipment "reference standards."

BOX B: Methods Section

Listening tests were performed in a room which measures 18' x 11' x 28', has a reverberation time of 50ms, and is sparsely furnished. Equipment used as independent variables included a Denon 1035 cartridge (stylus wear 105 hours), a J.H. Formula 4 arm, Technics 1100A turntable, Netronics sound-absorbing platform, platter pad, Verion cables, DW535 pre-preamplifier (serial number 0696), GAS Thoebe preamplifier (serial number 512019), Dahlquist DQ-10A speakers (serial numbers 13322 and 13323, with mylar capacitors), and Janis W-1 woofer (serial number 1191). Speakers were put up along the room's short wall, spaced three feet apart from each other in a row, two feet out from the wall. All equipment met specifications one month before testing.

Listening was done 10:00 a.m. to 1:00 p.m. on four successive days. Line voltage averaged 120 plus or minus 0.5 volts at that time. Record sources included Sheffield records numbers 3, 5, and 8, previously played for three hours, and Horizon label number 702 (C. Haden, Closeness), previously played for five hours. Two critics listened for three 45-minute sessions with two 20-minute breaks. Both critics' hearing was judged normal within the last six months by audiogram tests. Critic 1 is a 37-year-old male and critic 2 is a 34-year-old female. Critic 1 listened at an average sound level of 95dB ±10dB at 15 feet from the speakers, while critic 2 listened at a level of 87dB ±8dB at the same position. Sound levels were confirmed with an Ivie 10-A spectrum analyzer.

Units under test were two power amplifiers. Unit A is a 100-Watt-per-channel amplifier, the Threshold 400A, serial number 77086, and Unit B is a 200-Watt-per-channel amplifier, the Threshold 4000, serial number F7806007. Units were stacked and tested after being turned on for 30 minutes, and were randomly chosen by a "double-blind box" to feed either the speaker or a dummy load. Input levels were adjusted so that both amplifiers delivered 25 Watts into the dummy load when fed a 1000Hz sine wave. Switching periods were regulated to five pairs of five segments, with ratings filled out at the end of each period.

Inter-rater reliability on the listening rating form, previously found for a reference amplifier (Audio Research D150) tested by the same two raters, had run r = 0.79, and inter-test reliability for each critic's ratings over a four-day run on the D150 was r = 0.73 for critic 1 and r = 0.86 for critic 2.

Thanks to this careful approach, another critic, Walt Jung (1977a), working in another listening laboratory, was able to replicate Pearson's listening test findings (harsh high-frequency sound) on a particular amplifier (the Double Dyna 400), both with electronic tests and with listening measures. Such a double-laboratory replication serves to validate the listening test and is an important advance in the science of audio reviewing. It narrows the gap between what is perceived and what is measured.

Listening Test Bias

Bias creeps into listening tests when comparisons are not standardized. One method of reducing rater bias is to compare several components "blind," so the rater is unaware which component he is hearing. Blind testing is laborious, requiring excellent switching devices (like one developed by David Hodley of AGI), attention to level matching, and a mounting rack which will visually disguise the identity of the unit being played. Critics have objected to such a focus on methodology, preferring to listen openly to a component in what has been called a "long-term listening test." Again defending their resistance to commercial pressures, several critics have pointed out that double-blind testing could only control for the bias of a rater who might be immature enough to want a particular unit to sound better.

Is experimenter bias a matter of maturity? Many investigators in the field of human perception worry about judgments based on the human senses, influenced as they are by emotions and mood at any age. These critics' own listening tests show the effects of such experimenter bias. Judgments of audio equipment are defended with colorful, poetic, emotion-laden adjectives, suggesting the critics' listening decisions are highly influenced by emotion as well as by the sound they hear. When a piece of equipment misses top ranking, certain critics might describe its sonic qualities in terms of excessively rich food; such an imaginative leap from one sense modality to another often leaves the reader behind. On the other hand, if another unit finally achieves top ranking, the review becomes flooded with global, emotion-laden adjectives such as "frightening," "scary," or "supernatural." Moncrieff (1978) has written an excellent critique of subjective audio reviewers' poor use of colorful adjectives. The subjective audio critic, like the wine taster, develops a special vocabulary to describe his experiences. This sonic vocabulary relies on other senses, such as taste and touch, to communicate the inner feeling of listening to the illusion of music. New components have been described as having a grey velvet midrange or a chocolate-y bass.

Moncrieff points out that a reader is probably more interested in, and better served by, a description of the sonic qualities of the unit. If I am about to spend $2,000 on a Mark Levinson ML-1, I want to know about the clarity of the midrange, not whether its flavor is chocolate or vanilla.

Furthermore, the subjective audio critic often asserts he is frank and unbiased by commercial interests. Several critics refuse to accept advertising from audio manufacturers. Though their financial integrity is pure, their vague, often contradictory sonic vocabulary is wide open to ambiguity and personal bias. Since no specified methodology (with operational definitions) spells out how much of a certain quality the audio device must possess to create the desirable sonic effect, the critic drifts anchorless in a sea of feelings, prejudices, and half-remembered audio impressions. Moncrieff notes that such methods lead to "simple minded" impressions that are "too childishly vituperative" and paint a "black and white picture, when in fact there are a number of pros and cons among competing audio designs." It is my impression that these critics often develop strong preferences for certain manufacturers while frequently rejecting products from another company. Loyalty to company A and general dislike of company B become a real rater bias which influences listening tests.

Non-standardized listening techniques can lead the critic astray even when his likes and dislikes are not so strong. These raters have tried to link up standardized electronic tests, such as distortion measures, with open, non-standardized listening procedures. In one case, two amplifiers sounded very close in excellent initial listening tests but later performed differently in the lab. After the lab test, the critic decided that amplifier B began to sound inferior to amplifier A, the unit with lower distortion measures. Had the electronic tests biased the listening response?

The Removal of Bias through "Blind Testing"

Pontis (1978) has briefly described the controversy over the use of controlled "blind" listening tests. In double-blind testing, one critic switches an electromechanical device which randomly chooses between two similar pieces of equipment while the other critic listens and makes judgments. The term "double-blind" means that neither the "switcher" nor the "rater" knows which of the two units under test is being heard. The rater must make choices and assign a rank order of preference to the equipment solely on what he hears.

Advocates of double-blind testing emphasize that one's listening judgments are thus unbiased by other factors (the unit's appearance, cost, prestige). Furthermore, each listening period can be fixed in length to optimize the listener's short-term auditory memory, for the ear is more accurate in resolving differences when listening periods are short. Proponents of long-term listening trials oppose this approach, for they feel subtle differences between units often emerge only after long periods of time, and that an audiophile's appreciation of a piece of equipment involves many variables that might be excluded by a controlled blind test.

I believe the ideal equipment listening method should include a blind test in addition to other types of measures. Again, the purpose is to focus on one variable: the ability of the piece of equipment to process sound. As more laboratories gather blind data, we will increase our overall knowledge about the complex psychoacoustical processes involved in the perception of reproduced sound. Blind comparisons should be made between a new unit and an arbitrarily chosen reference piece of equipment that the critic can return to repeatedly.

Often the use of a standard is routine but not mentioned consistently. New tuners are matched against a "classic" such as the Marantz 10B tuner, and new preamplifiers are compared to the Mark Levinson ML-1. Each critic possesses a reference system, and it would be more helpful if these reference components were routinely used for comparison in standardized listening tests.

Summary

Audio equipment reviewing is a lively, active, and controversial field. The different methodologies and languages used by the two different types of audio critics, test bench reviewers and "golden ear" subjectivist reviewers, result in data which are difficult to interpret and useless in making purchase decisions. Differences between units that might be heard on standardized listening tests are obscured by the open, uncontrolled listening procedures now in use. The introduction of psychological research methods (reliability testing, the control of all independent variables, double-blind controlled tests, and reference standards) could greatly reduce the great variability that now exists in listening test reports.

The validity of the listening critics' findings will be greatly increased through the exact replication of their findings by other critics and by the correlation of carefully structured listening tests with electronic distortion measures.

REFERENCES

Colloms, Martin (1978): "Amplifier Tests on Test: 2. The Panel Game." Hi-Fi News and Record Review, November, pp. 114-117.

Heyser, R. (1978): "Hearing vs. Measurement." Audio, Vol. 62, No. 3: 46-50.

Jung, W. (1977a): "Super Power Amplifiers: An Aural Comparison." The Audio Amateur, Vol. VIII, No. 2: 62-70.

Jung, W. (1977b): "Slewing Induced Distortion: Part 4, Phase IV: Listening Tests for SID." The Audio Amateur, Vol. VIII, No. 4: 22-28.

Moncrieff, P. (1978): "The M Rule." International Audio Review, No. 33: 36. (2449 Dwight Way, Berkeley, CA 94794)

Meyer, B. (1978): "Young versus the Subjectivists" (from the report on the April 1978 B.A.S. meeting). The B.A.S. Speaker, Vol. 6, No. 8: 16-18.

Otala, M., and Ensomaa, R. (1974): "Transient Intermodulation Distortion in Commercial Audio Amplifiers." Journal of the AES, Vol. 22, No. 4.

McNair, D.M., Lorr, M., and Droppleman, L.F. (1971): EITS Manual for the Profile of Mood States. Educational and Industrial Testing Service, San Diego, CA 92107, pp. 5-10.

Pearson, H. (1978): "The Reference Systems." The Absolute Sound, Vol. 3, No. 11: 395-396. (P.O. Box 233, Sea Cliff, NY 11579)

Pontis, G.D. (1978): "Audio General Model 511A Preamplifier," in "Equipment Profiles" (see Editor's Note). Audio, Vol. 62, No. 6: 94.

Roscoe, J.T. (1969): Fundamental Research Statistics for the Behavioral Sciences. Holt, Rinehart & Winston, New York. See Table II, Appendix, p. 301.

SEE ALSO

Hope, Adrian (1978): "Amplifier Tests on Test: 1. Without Prejudice." Hi-Fi News and Record Review, November, pp. 110-114.

Moir, James (1978): "Valves versus Transistors: The Results of a Comparison of Three Different Amplifiers." Wireless World, July, pp. 55-58.

 

 
