Digital Audio--Principles & Concepts: Low Bit-Rate Coding: Codec Design (part 2)





MP3 Stereo Coding

To take advantage of redundancies between stereo channels, and to exploit limitations in human spatial listening, Layer III allows a choice of stereo coding methods, with four basic modes: normal stereo mode with independent left and right channels; M/S stereo mode in which the entire spectrum is coded with M/S; intensity stereo mode in which the lower spectral range is coded as left/right and the upper spectral range is coded as intensity; and the intensity and M/S mode in which the lower spectral range is coded as M/S and the upper spectral range is coded as intensity. Each frame may have a different mode.

The partition between upper and lower spectral modes can be changed dynamically in units of scale factor bands.

Layer III supports both M/S (middle/side) stereo coding and intensity stereo coding. In M/S coding, certain frequency ranges of the left and right channels are mixed as sum (middle) and difference (side) signals before quantization. In this way, stereo unmasking can be avoided. In addition, when there is high correlation between the left and right channels, the difference signal is further reduced to conserve bits. In intensity stereo coding, the left and right channels of upper-frequency subbands are not coded individually. Instead, one summed signal is transmitted along with individual left- and right-channel scale factors indicating position in the stereo panorama. This method retains one spectral shape for both channels in upper subbands, but scales the magnitudes. This is effective for stationary signals, but less effective for transient signals because they may have different envelopes in different channels. Intensity coding may lead to artifacts such as changes in stereo imaging, particularly for transient signals. It is used primarily at low bit rates.
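The arithmetic behind the two methods is simple. The following is a minimal sketch, assuming NumPy arrays of spectral values and a 1/sqrt(2) normalization (an illustrative choice, not a value quoted from the standard): M/S as a lossless sum/difference transform, and intensity coding as one shared spectrum plus per-channel scale factors.

```python
import numpy as np

def ms_encode(left, right):
    """Mix L/R spectra into middle (sum) and side (difference) signals."""
    mid = (left + right) / np.sqrt(2.0)
    side = (left - right) / np.sqrt(2.0)
    return mid, side

def ms_decode(mid, side):
    """Invert the M/S mix; lossless before quantization."""
    return (mid + side) / np.sqrt(2.0), (mid - side) / np.sqrt(2.0)

def intensity_encode(left, right):
    """Keep one summed spectrum plus scale factors for the stereo position."""
    summed = left + right
    e_l, e_r = float(np.sum(left**2)), float(np.sum(right**2))
    total = max(e_l + e_r, 1e-12)
    return summed, np.sqrt(e_l / total), np.sqrt(e_r / total)

def intensity_decode(summed, sf_l, sf_r):
    """Both channels reuse one spectral shape, scaled in magnitude."""
    return sf_l * summed, sf_r * summed

left = np.array([1.0, 0.5, 0.2])
right = np.array([0.9, 0.4, 0.1])
mid, side = ms_encode(left, right)
l2, r2 = ms_decode(mid, side)
print(np.allclose(l2, left) and np.allclose(r2, right))  # True
```

As the printed check shows, M/S is exactly invertible before quantization; intensity coding, by contrast, discards per-channel spectral detail, which is why it suits stationary signals and low bit rates.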

MP3 Decoder Optimization

MP3 files can be decoded with dedicated hardware chips or software programs. To optimize operation and decrease computation, some software decoders implement special features. Calculation of the hybrid synthesis filter bank is the most computationally complex aspect of the decoder.

The process can be simplified by implementing a stereo downmix to monaural in the frequency domain, before the filter bank, so that only one filter operation must be performed. Downmixing can be accomplished with a simple weighted sum of the left and right channels.

However, this is not optimal because, for example, an M/S stereo or intensity-stereo signal already contains a sum signal. More efficiently, built-in downmixing routines can calculate the sum signal only for those scale factor bands that are coded in left/right stereo. For M/S- and intensity-coded scale factor bands, only scaling operations are needed.
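A sketch of that per-band logic follows, under an assumed band layout; the mode names here are hypothetical labels, not bitstream fields.

```python
import numpy as np

def downmix_band(mode, a, b=None):
    """Return the mono spectrum for one scale factor band.

    mode: 'lr' (a=left, b=right), 'ms' (a=middle, b=side),
    or 'intensity' (a=shared spectrum).
    """
    if mode == "lr":
        return 0.5 * (a + b)       # the weighted sum must still be computed
    if mode == "ms":
        return a / np.sqrt(2.0)    # the sum (middle) signal already exists
    if mode == "intensity":
        return a                   # one shared spectrum; no mixing needed
    raise ValueError(mode)
```

Only the left/right-coded bands need an addition per spectral line; the M/S and intensity bands reduce to a scaling operation or no work at all.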

To further reduce computational complexity, the hybrid filter bank can be optimized. The filter bank consists of IMDCT and polyphase filter bank sections. As noted, the IMDCT is executed 32 times for 18 spectral values each to transform the spectrum of 576 values into 18 consecutive spectra of length 32. These spectra are converted into the time domain by executing a polyphase synthesis filter bank 18 times. The polyphase filter bank contains a frequency mapping operation (such as matrix multiplication) and an FIR filter with 512 coefficients. The FIR calculation can be simplified by reducing the number of coefficients: the impulse response can be truncated at its ends, or modeled with fewer coefficients. Experiments have suggested that the filter length can be reduced by 25% without yielding additional audible artifacts. More directly, computation can be reduced by limiting the output audio bandwidth. The high-frequency spectral values can be set to zero, and an IMDCT with all input samples set to zero does not have to be calculated. If only the lower halves of the IMDCTs are calculated, the audio bandwidth is limited. The output can then be downsampled by a factor of 2, so that computation for every second output value can be skipped, cutting the FIR calculation in half.
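A sketch of the all-zero shortcut follows, using a direct (slow) textbook IMDCT for clarity; real decoders use fast transforms, and the indexing convention here is one common form of the 18-to-36 long-block IMDCT rather than a quote from the standard.

```python
import numpy as np

N_OUT = 36  # 18 spectral values -> 36 windowed time samples per long block

def imdct18(spec):
    """Direct-form IMDCT for one subband's 18 long-block coefficients."""
    n = np.arange(N_OUT)[:, None]
    k = np.arange(18)[None, :]
    phase = (np.pi / (2 * N_OUT)) * (2 * n + 1 + N_OUT / 2) * (2 * k + 1)
    return (spec[None, :] * np.cos(phase)).sum(axis=1)

def transform_subbands(subband_spectra, keep_subbands=32):
    """Run the IMDCT per subband, skipping zeroed (or discarded) subbands."""
    out = np.zeros((32, N_OUT))
    for sb in range(32):
        spec = np.asarray(subband_spectra[sb], dtype=float)
        if sb >= keep_subbands or not np.any(spec):
            continue  # all-zero input yields all-zero output: skip the work
        out[sb] = imdct18(spec)
    return out
```

Setting keep_subbands below 32 implements the bandwidth-limiting shortcut; the all-zero test implements the IMDCT-skipping one.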

There are many nonstandard codecs that produce MP3-compliant bitstreams; they vary greatly in performance quality. LAME is an example of a fast, high-quality, royalty-free codec that produces an MP3-compliant bitstream. LAME is open-source, but using it may require a patent license in some countries; it is available at sourceforge.net. MP3 Internet applications are discussed in Section 15.

MPEG-1 Psychoacoustic Model 1

The MPEG-1 standard suggests two psychoacoustic models that determine the minimum masking threshold for inaudibility. The models are only informative in the standard; their use is not mandated. The models are used only in the encoder. In both cases, the difference between the maximum signal level and the masking threshold is used by the bit allocator to set the quantization levels.

Generally, model 1 is applied to Layers I and II, and model 2 is applied to Layer III.

Psychoacoustic model 1 proposes a low-complexity method to analyze spectral data and output signal-to-mask ratios. Model 1 performs these nine steps (a code sketch of steps 1 and 4 follows the list):

1. Perform FFT analysis: A 512- or 1024-point fast Fourier transform is used to transform time-aligned time-domain data to the frequency domain; a Hann window with adjacent overlapping of 32 or 64 samples, respectively, reduces edge effects. An appropriate delay is applied to time-align the psychoacoustic model's output. The signal is normalized to a maximum value of 96 dB SPL, calibrating the signal's minimum value to the absolute threshold of hearing.

2. Determine the sound pressure level: The maximum SPL is calculated for each subband by choosing the greater of the maximum amplitude spectral line in the subband or the maximum scale factor that accounts for low-level spectral lines in the subband.

3. Consider the threshold in quiet: An absolute hearing threshold in the absence of any signal is given; this forms the lower masking bound. An offset is applied depending on the bit rate.

4. Find tonal and nontonal components: Tonal (sinusoidal) and nontonal (noise-like) components in the signal are identified. First, local maxima in the spectral components are identified relative to bandwidths of varying size. Components that stand out from their critical-band neighbors by at least 7 dB are labeled tonal, and their sound pressure level is calculated. The intensities of the remaining components within each critical band, assumed to be nontonal, are summed and their SPL is calculated for each critical band. The nontonal maskers are centered in each critical band.

5. Decimate tonal and nontonal masking components: The number of maskers is reduced to retain only the relevant ones: those whose magnitude exceeds the threshold in quiet and, when two tonal components lie within 1/2 Bark of each other, only the stronger.

6. Calculate individual masking thresholds: The total number of masker frequency bins is reduced (for example, in Layer I at 48 kHz, 256 bins are reduced to 102) and maskers are relocated. Noise-masking thresholds for each subband, accounting for tonal and nontonal components and their different downward shifts, are determined by applying a masking (spreading) function to the signal. Calculations use a masking index and masking function to describe masking effects on adjacent frequencies. The masking index is an attenuation factor based on critical-band rate. The piecewise masking function is an attenuation factor with different lower and upper slopes between -3 and +8 Bark, varying with the distance to the masking component and the component's magnitude.

When the subband is wide compared to the critical band, the spectral model can select a minimum threshold; when it is narrow, the model averages the thresholds covering the subband.

7. Calculate the global masking threshold: The powers corresponding to the upper and lower slopes of individual subband masking curves, as well as a given threshold of hearing (threshold in quiet), are summed to form a composite global masking contour. The final global masking threshold is thus a signal-dependent modification of the absolute threshold of hearing as affected by tonal and nontonal masking components across the basilar membrane.

8. Determine the minimum masking threshold: The minimum masking level is calculated for each subband.

9. Calculate the signal-to-mask ratio: Signal-to-mask ratios are determined for each subband, based on the global masking threshold. The difference between the maximum SPL levels and the minimum masking threshold values determines the SMR value in each subband; this value is supplied to the bit allocator.
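The following minimal sketch illustrates steps 1 and 4 for a 512-point analysis. The +7-dB prominence test and the fixed two-bin neighbor span are simplifications; in the standard, the neighbor ranges widen with frequency.

```python
import numpy as np

FFT_SIZE = 512

def spectrum_db(block):
    """Step 1: Hann-window a 512-sample block; normalize peak to 96 dB SPL."""
    win = np.hanning(FFT_SIZE)
    mag = np.abs(np.fft.rfft(block * win))
    db = 20.0 * np.log10(mag + 1e-12)
    return db - db.max() + 96.0

def find_tonal(db, span=2):
    """Step 4: flag local maxima that stand at least 7 dB above neighbors."""
    tonal = []
    for k in range(span, len(db) - span):
        if db[k] <= db[k - 1] or db[k] < db[k + 1]:
            continue  # not a local maximum
        neighbors = [db[k + j] for j in range(-span, span + 1) if abs(j) > 1]
        if all(db[k] - v >= 7.0 for v in neighbors):
            tonal.append(k)
    return tonal

# Example: a 1-kHz tone in low-level noise at 44.1 kHz
t = np.arange(FFT_SIZE) / 44100.0
x = np.sin(2 * np.pi * 1000 * t) + 0.001 * np.random.randn(FFT_SIZE)
print(find_tonal(spectrum_db(x)))  # bin indices flagged as tonal
```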

The principal steps in the operation of model 1 can be illustrated with a test signal that contains a band of noise, as well as prominent tonal components. The model analyzes one block of the 16-bit test signal sampled at 44.1 kHz. FIG. 11A shows the audio signal as output by the FFT; the model has identified the local maxima. The figure also shows the absolute threshold of hearing used in this particular example (offset by -12 dB). FIG. 11B shows tonal components marked with a "+" and nontonal components marked with a "o." FIG. 11C shows the masking functions assigned to tonal maskers after decimation. The peak SMR (about 14.5 dB) corresponds to that used for tonal maskers. FIG. 11D shows the masking functions assigned to nontonal maskers after decimation. The peak SMR (about 5 dB) corresponds to that used for nontonal maskers. FIG. 11E shows the final global masking curve obtained by combining the individual masking thresholds. The higher of the global masking curve and the absolute threshold of hearing is used as the final global masking curve. FIG. 11F shows the minimum masking threshold. From this, SMR values can be calculated in each subband.



FIG. 11 Operation of MPEG-1 model 1 is illustrated using a test signal. A. Local maxima and absolute threshold. B. Tonal and nontonal components. C. Tonal masking. D. Nontonal masking. E. Masking threshold. F. Minimum masking threshold.

To further explain the operation of model 1, additional comments are given here. The delay in the 512-point analysis filter bank is 256 samples and centering the data in the 512-point Hann window adds 64 samples. An offset of 320 samples (256 + (512 - 384)/2 = 320) is needed to time-align the model's 384 samples.

The spreading function used in model 1 is described in terms of piecewise slopes (in dB):

vf = 17(dz + 1) - (0.4X[z(j)] + 6) for -3 <= dz < -1 Bark
vf = (0.4X[z(j)] + 6)dz for -1 <= dz < 0 Bark
vf = -17dz for 0 <= dz < 1 Bark
vf = -(dz - 1)(17 - 0.15X[z(j)]) - 17 for 1 <= dz < 8 Bark

where dz = z(i) - z(j) is the distance in Bark between the maskee and masker frequencies; i and j are index values of the spectral lines of the maskee and masker, respectively. X[z(j)] is the sound pressure level of the jth masking component in dB. Values outside -3 and +8 Bark are not considered in this model.
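Transcribed as code, a direct reading of the piecewise function above (a sketch; dz and X follow the definitions just given):

```python
def spreading_db(dz, X):
    """Model 1 spreading function in dB; dz in Bark, X = masker SPL in dB."""
    if -3.0 <= dz < -1.0:
        return 17.0 * (dz + 1.0) - (0.4 * X + 6.0)
    if -1.0 <= dz < 0.0:
        return (0.4 * X + 6.0) * dz
    if 0.0 <= dz < 1.0:
        return -17.0 * dz
    if 1.0 <= dz < 8.0:
        return -(dz - 1.0) * (17.0 - 0.15 * X) - 17.0
    return None  # outside -3..+8 Bark: not considered by the model
```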

Model 1 uses this general approach to detect and characterize tonality in audio signals: An FFT is applied to 512 or 1024 samples, and the components of the spectrum analysis are considered. Local maxima in the spectrum are identified as having more energy than adjacent components. These components are decimated such that a tonal component closer than 1/2 Bark to a stronger tonal component is discarded. Tonal components below the threshold of hearing are discarded as well. The energies of groups of remaining components are summed to represent tonal components in the signal; other components are summed and marked as nontonal. A binary designation is given: tonal components are assigned 1, and nontonal components are assigned 0. This information is presented to the bit allocation algorithm. Specifically, in model 1, tonality is determined by detecting local maxima that stand 7 dB above their neighbors in the audio spectrum. To derive the masking threshold relative to the masker, a level shift is applied; the shift depends on whether the masker is tonal or nontonal:

ΔT(z) = -6.025 - 0.275z dB

ΔN(z) = -2.025 - 0.175z dB

where z is the frequency of the masker in Bark.
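As code, the two masking-index shifts are one-liners:

```python
def index_tonal(z):
    """dB shift below a tonal masker; z = critical-band rate in Bark."""
    return -6.025 - 0.275 * z

def index_nontonal(z):
    """dB shift below a nontonal masker; z = critical-band rate in Bark."""
    return -2.025 - 0.175 * z
```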

Model 1 considers all the nontonal components in a critical band and represents them with one value at one frequency. This is appropriate at low frequencies, where subbands and critical bands correspond well, but can be inefficient at high frequencies, where a wide critical band spans several subbands. A subband that lies apart from the frequency assigned to the nontonal component in its critical band may not receive a correct nontonal evaluation.

MPEG-1 Psychoacoustic Model 2

Psychoacoustic model 2 performs a more detailed analysis than model 1, at the expense of greater computational complexity. It is designed for lower bit rates than model 1.

As in model 1, model 2 outputs a signal-to-mask ratio for each subband; however, its approach is significantly different. It contours the noise floor of the signal represented by many spectral coefficients in a way that is more accurate than that allowed by coarse subband coding. Also, the model uses an unpredictability measure to examine the side-chain data for tonal or nontonal qualities. Model 2 performs these 14 steps:

1. Reconstruct input samples: A set of 1024 input samples is assembled.

2. Calculate the complex spectrum: The time-aligned input signal is windowed with a 1024-point Hann window (alternatively, a shorter window may be used). An FFT is computed and its output is represented in magnitude and phase.

3. Calculate the predicted magnitude and phase: The predicted magnitude and phase are determined by extrapolation from the two preceding threshold blocks.

4. Calculate the unpredictability measure: The unpredictability measure is computed using the Euclidean distance between the predicted and actual values in the magnitude/phase domain (see the sketch after this list). To reduce complexity, the measure may be computed only for lower frequencies and assumed constant for higher frequencies.

5. Calculate the energy and unpredictability in the partitions: The energy magnitude and the weighted unpredictability measure in each threshold calculation partition are calculated. A partition has a resolution of one spectral line (at low frequencies) or 1/3 critical band (at high frequencies), whichever is wider.

6. Convolve energy and unpredictability with the spreading function: The energy and the unpredictability measure in threshold calculation partitions are each convolved with a cochlea spreading function. Values are renormalized.

7. Derive tonality index: The unpredictability measures are converted to tonality indices ranging from 0 (high unpredictability) to 1 (low unpredictability). This determines the relative tonality of the maskers in each threshold calculation partition.

8. Calculate the required signal-to-noise ratio: An SNR is calculated for each threshold calculation partition, using the tonality index to interpolate an attenuation shift between the noise-masking-tone (NMT) and tone-masking-noise (TMN) cases; the interpolated shift ranges upward from 5.5 dB for NMT. The final shift is the higher of the interpolated value and a frequency-dependent minimum value.

9. Calculate power ratio: The power ratio of the SNR is calculated for each threshold calculation partition.

10. Calculate energy threshold: The actual energy threshold is calculated for each threshold calculation partition.

11. Spread threshold energy: The masking threshold energy is spread over FFT lines corresponding to threshold calculation partitions to represent the masking in the frequency domain.

12. Calculate final energy threshold of audibility: The spread threshold energy is compared to values in absolute threshold of quiet tables, and the higher value is used (not the sum) as the energy threshold of audibility. This is because it is wasteful to specify a noise threshold lower than the level that can be heard.

13. Calculate pre-echo control: A narrow-band pre-echo control used in the Layer III encoder is calculated to prevent audibility of the error signal spread in time by the synthesis filter. The calculation lowers the masking threshold after a quiet signal, taking the minimum of the current threshold and the scaled thresholds of the two previous blocks.

14. Calculate signal-to-mask ratios: Threshold calculation partitions are converted to codec partitions (scale factor bands). The SMR (energy in each scale factor band divided by noise level in each scale factor band) is calculated for each partition and expressed in decibels.

The SMR values are forwarded to the allocation algorithm.
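A minimal sketch of steps 3 and 4 follows, assuming straight-line extrapolation of magnitude and phase from the two previous blocks, as in the informative model:

```python
import numpy as np

def unpredictability(r2, f2, r1, f1, r0, f0):
    """Per-bin unpredictability measure.

    r/f = magnitude/phase arrays; index 0 is the current block,
    1 and 2 are the two previous blocks.
    """
    r_pred = 2.0 * r1 - r2          # extrapolated magnitude
    f_pred = 2.0 * f1 - f2          # extrapolated phase
    # Euclidean distance between predicted and actual complex values,
    # normalized so the result lies between 0 (predicted) and 1.
    dist = np.hypot(r0 * np.cos(f0) - r_pred * np.cos(f_pred),
                    r0 * np.sin(f0) - r_pred * np.sin(f_pred))
    return dist / (r0 + np.abs(r_pred) + 1e-12)
```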

The principal steps in the operation of model 2 can be illustrated with a test signal that contains three prominent tonal components. The model analyzes a set of 1024 input samples of the 16-bit test signal sampled at 44.1 kHz.

FIG. 12A shows the magnitude of the audio signal as output by the FFT; the phase is also computed. Following prediction of magnitude and phase, the unpredictability measure is computed, as shown in FIG. 12B, using the Euclidean distance between the predicted and actual values in the magnitude/phase domain. When the measure equals 0, the current value is completely predicted. FIG. 12C shows the energy magnitude in each partition and the spreading functions that are applied. FIG. 12D shows the tonality index derived from the unpredictability measure; the tonality index ranges from 0 (high unpredictability and noise-like) to 1 (low unpredictability and tonal). FIG. 12E shows the spread masking threshold energy in the frequency domain and the absolute threshold of quiet; the higher value is used to find the energy threshold of inaudibility. FIG. 12F shows signal-to-mask ratios (energy in each scale factor band divided by noise level in each scale factor band) in codec partitions.

To further explain the operation of model 2, additional comments are given here. The spreading function used in model 2 is:

10 log10 SF(dz) = 15.8111389 + 7.5(1.05dz + 0.474) - 17.5[1.0 + (1.05dz + 0.474)^2]^(1/2) + 8 MIN[(1.05dz - 0.5)^2 - 2(1.05dz - 0.5), 0] dB

where dz is the distance in Bark between the maskee and masker frequency.
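Transcribed directly into code, with dz allowed to be a scalar or an array:

```python
import numpy as np

def spread_db(dz):
    """Model 2 spreading function in dB; dz = maskee-masker distance in Bark."""
    t = 1.05 * dz + 0.474
    u = 1.05 * dz - 0.5
    return (15.8111389 + 7.5 * t - 17.5 * np.sqrt(1.0 + t * t)
            + 8.0 * np.minimum(u * u - 2.0 * u, 0.0))
```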

The spectral flatness measure (SFM), devised by James Johnston, measures the average or global tonality of the segment. SFM is the ratio of the geometric mean of the power spectrum to its arithmetic mean. The value is converted to decibels and referenced to -60 dB to provide a coefficient of tonality ranging continuously from 0 (nontonal) to 1 (tonal). This coefficient can be used to interpolate between TMN and NMT models. SFM leads to very conservative masking decisions for nontonal parts of a signal. More efficiently, specific tonal and nontonal regions within a segment can be identified. This local tonality can be measured as the normalized Euclidean distance between the actual and predicted amplitude and phase values over two successive segments. On this basis, unpredictability can be computed for narrow frequency partitions and used to create tonality metrics that interpolate between tone and noise models.
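A compact sketch of the SFM calculation described above, assuming a power spectrum as input; the -60-dB reference is the one given in the text.

```python
import numpy as np

def tonality_coefficient(power_spectrum):
    """Coefficient of tonality from SFM: 0 = nontonal, 1 = tonal."""
    p = np.asarray(power_spectrum, dtype=float) + 1e-30
    geo_mean = np.exp(np.mean(np.log(p)))          # geometric mean
    sfm_db = 10.0 * np.log10(geo_mean / np.mean(p))  # <= 0 dB
    return min(sfm_db / -60.0, 1.0)
```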



FIG. 12 Operation of MPEG-1 model 2 is illustrated using a test signal. A. Magnitude of FFT. B. Unpredictability measure. C. Energy and spreading functions. D. Tonality index. E. Threshold energy and absolute threshold. F. Signal-to-mask ratios. (Boley and Rao, 2004)

Specifically, in model 2, a tonality index is created based on the predictability of the audio signal's spectral components in a partition over two successive frames. Tonal components are more accurately predicted. Amplitude and phase are predicted to form an unpredictability measure C. When C = 0, the current value is completely predicted; when C = 1, the predicted value differs entirely from the actual value. This yields the tonality index T, ranging from 0 (high unpredictability, noise-like) to 1 (low unpredictability, tonal). For example, the audio signal's strongly tonal and nontonal areas are evident in FIG. 12D. The tonality index is used to calculate an attenuation shift Δ(z), for example, interpolating values from 6 dB (nontonal) to 29 dB (tonal).
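The interpolation itself is a single line; a sketch using the 6-dB and 29-dB endpoints mentioned above:

```python
def masking_shift_db(T):
    """Interpolate the shift between the nontonal (6 dB) and tonal (29 dB)
    endpoints from the tonality index T in [0, 1]."""
    return T * 29.0 + (1.0 - T) * 6.0
```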

When used in a Layer III encoder, model 2 is modified. The model is executed twice, once with a long block and once with a short 256-sample block; these values are used in the unpredictability measure calculation. A slightly different spreading function is used. The NMT shift is changed to 6.0 dB and a fixed TMN shift of 29.0 dB is used. As noted, a pre-echo control is calculated.

Perceptual entropy is calculated as the logarithm of the geometric mean of the normalized spectral energy in a partition. This predicts the minimum number of bits needed for transparency. High values are used to identify transient attacks, and thus to determine block size in the encoder. In addition, model 2 accepts the minimum masking threshold at low frequencies where there is good correspondence between subbands and critical bands, and it uses the average of the thresholds at higher frequencies where subbands are narrow compared to critical bands.
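A hedged sketch of a perceptual-entropy estimate in the spirit of this description follows: partitions whose energy stands far above the masking threshold need more bits. The log base and the clamping below threshold are assumptions here, not values from the standard.

```python
import numpy as np

def perceptual_entropy(energies, thresholds, widths):
    """Bits-like measure per block; inputs are per-partition arrays,
    with widths = number of spectral lines in each partition."""
    e = np.asarray(energies, dtype=float)
    t = np.asarray(thresholds, dtype=float)
    w = np.asarray(widths, dtype=float)
    ratio = np.maximum(e / np.maximum(t, 1e-30), 1.0)  # clamp below threshold
    return float(np.sum(w * np.log2(ratio)))
```

A sudden jump in this value from one block to the next is the kind of cue an encoder can use to switch to short blocks.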

Much research has been done since the informative model 2 was published in the MPEG-1 standard. Thus, most practical encoders use models that offer better performance, even if they are based on the informative model. An encoder that follows the informative documentation literally will not provide good results compared to more sophisticated implementations.

MPEG-2 Audio Standard

The MPEG-2 audio standard was designed for applications ranging from Internet downloading to high-definition digital television (HDTV) transmission. It provides a backward-compatible path to multichannel sound and a low-sampling-frequency provision, as well as a non-backward-compatible multichannel format known as Advanced Audio Coding (AAC). The MPEG-2 audio standard encompasses the MPEG-1 audio standard of Layers I, II, and III, using the same encoding and decoding principles as MPEG-1. In many cases, the same layer algorithms developed for MPEG-1 applications are used for MPEG-2 applications. Multichannel MPEG-2 audio is backward compatible with MPEG-1: an MPEG-2 decoder will accept an MPEG-1 bitstream, and an MPEG-1 decoder can derive a stereo signal from an MPEG-2 bitstream.

However, MPEG-2 also permits use of incompatible audio codecs.

One part of the MPEG-2 standard provides multichannel sound at sampling frequencies of 32, 44.1, and 48 kHz. Because it is backward compatible with MPEG-1, it is designated BC (backward compatible), that is, MPEG-2 BC. Because there is more redundancy among six channels than between two, greater coding efficiency is achieved; overall, 5.1 channels can be successfully coded at rates from 384 kbps to 640 kbps. MPEG-2 also supports monaural and stereo coding at sampling frequencies of 16, 22.05, and 24 kHz, using Layers I, II, and III. The MPEG-1 and -2 audio coding family is shown in FIG. 13. The MPEG-2 audio standard was approved by the MPEG committee in November 1994 and is specified in ISO/IEC 13818-3.


FIG. 13 The MPEG-2 audio standard adds monaural/stereo coding at low sampling frequencies, multichannel coding, and AAC. The three MPEG-1 layers are supported.

The multichannel MPEG-2 BC format uses a five-channel approach sometimes referred to as 3/2 + 1 stereo (three front and two surround channels, plus subwoofer). The low-frequency effects (LFE) subwoofer channel is optional, providing an audio range up to 120 Hz. A hierarchy of formats is created in which 3/2 may be downmixed to 3/1, 3/0, 2/2, 2/1, 2/0, and 1/0. The format uses an encoder matrix that allows a two-channel decoder to decode a compatible two-channel signal that is a subset of the multichannel bitstream. The multiple channels of MPEG-2 are matrixed to form compatible MPEG-1 left/right channels, as well as other MPEG-2 channels, as shown in FIG. 14. The MPEG-1 left and right channels are replaced by matrixed MPEG-2 left and right channels, and these are encoded into backward-compatible MPEG frames with an MPEG-1 encoder. Additional multichannel data is placed in the expanded ancillary data field.
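A hedged sketch of the matrixing idea follows; the 0.7071 weights and the normalization are common choices for such compatibility downmixes, not values quoted from the standard.

```python
import numpy as np

def matrix_lo_ro(L, R, C, Ls, Rs, w=0.7071):
    """Fold five channels into a compatible Lo/Ro pair that an MPEG-1
    decoder can play; the remaining channels travel as ancillary data."""
    norm = 1.0 / (1.0 + 2.0 * w)          # keep the downmix from clipping
    Lo = norm * (L + w * C + w * Ls)
    Ro = norm * (R + w * C + w * Rs)
    return Lo, Ro
```

The multichannel decoder applies the inverse (dematrixing) operation to recover the original channels, which is where the quantization-noise unmasking artifact described below can arise.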


FIG. 14 The MPEG-2 audio encoder and decoder showing how a 5.1-channel surround format can be achieved with backward compatibility with MPEG-1.

To efficiently code multiple channels, MPEG-2 BC uses techniques such as dynamic crosstalk reduction, adaptive interchannel prediction, and center channel phantom image coding. With dynamic crosstalk reduction, as with intensity coding, multichannel high-frequency information is combined and conveyed along with scale factors to direct levels to different playback channels. In adaptive prediction, a prediction error signal is conveyed for the center and surround channels. The high-frequency information in the center channel can be conveyed through the front left and right channels as a phantom image.

MPEG-2 BC can achieve a combined bit rate of 384 kbps using Layer II at a 48-kHz sampling frequency. MPEG-2 allows for audio bit rates up to 1066 kbps. To accommodate this, the MPEG-2 frame is divided into two parts. The first part is an MPEG-1-compatible stereo section with Layer I data up to 448 kbps, Layer II data up to 384 kbps, or Layer III data up to 320 kbps. The MPEG-2 extension part contains all other surround data.

A standard two-channel MPEG-1 decoder ignores the ancillary information, and reproduces the front main channels. In some cases, the dematrixing procedure in the decoder can yield an artifact in which the sound in a channel is mainly phase canceled but the quantization noise is not, and thus becomes audible. This limitation of spatial unmasking in MPEG-2 BC is a direct result of the matrixing used to achieve backward compatibility with the original two-channel MPEG standard. In part, it can be addressed by increasing the bit rate of the coded signals.

MPEG-2 also specifies Layers I, II, and III at low sampling frequencies (LSF) of 16, 22.05, and 24 kHz. This extension is not backward compatible with MPEG-1 codecs. This portion of the standard is known as MPEG-2 LSF. At these low bit rates, Layer III generally shows the best performance. Only minor changes in the MPEG-1 bit rate and bit allocation tables are needed to adapt this LSF format. The relative improvement in quality stems from the improved frequency resolution of the polyphase filter bank in low- and mid-frequency regions, which allows more efficient application of masking. Layers I and II benefit more than Layer III in these applications because Layer III already has good frequency resolution. The bitstream is unchanged in the LSF mode and the same frame format is used. For 24-kHz sampling, the frame length is 16 ms for Layer I and 48 ms for Layer II; the frame length of Layer III is decreased relative to that of MPEG-1. In addition, the "MPEG-2.5" standard supports sampling frequencies of 8, 11.025, and 12 kHz with a corresponding decrease in audio bandwidth; implementations use Layer III as the codec. Many MP3 codecs support the original MPEG-1 Layer III codec as well as the MPEG-2 and MPEG-2.5 extensions for lower sampling frequencies.
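The quoted frame durations follow directly from the frame sizes in samples; a quick check (the 576-sample Layer III LSF frame, half the MPEG-1 granule pair, is the usual explanation for its shorter duration):

```python
# Frame sizes in samples: 384 (Layer I), 1152 (Layer II), and 576 for
# Layer III in LSF mode.
FS = 24000  # Hz
for layer, samples in (("I", 384), ("II", 1152), ("III, LSF", 576)):
    print(f"Layer {layer}: {1000 * samples / FS:.0f} ms")
# Layer I: 16 ms; Layer II: 48 ms; Layer III, LSF: 24 ms
```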

The menu of data rates, fidelity levels, and layer compatibilities provided by MPEG is useful in a wide variety of applications such as computer multimedia, CD-ROM, DVD-Video, computer disks, local area networks, studio recording and editing, multichannel disk recording, ISDN transmission, digital audio broadcasting, and multichannel digital television. Numerous C and C++ programs performing MPEG-1 and -2 audio coding and decoding can be downloaded from a number of Internet file sites and executed on personal computers. The backward-compatible format, using Layer II coding, is used for the soundtracks of some DVD-Video discs. However, a matrix approach to surround sound does not preserve spatial fidelity as well as discrete channel coding.


