DVD Technical Guide - DVD Audio Format
1.1 Introduction

In February 1999, the DVD Forum formally approved the release of DVD-Audio Ver. 1.0, as a new format to handle next-generation audio. This was a result of three years of discussion in the DVD Forum's Working Group 4, a technical working group comprised of many members of the hardware industry, music industry, and computer industry, as well as the ISC (International Steering Committee, composed of IFPI, RIAA, and RIAJ representatives of the world's music industry). This makes the DVD-Audio specification truly a global standard.

1.2 Overview of the DVD-Audio Specification

DVD-Audio Design Concept
• Pure Audio: Linear PCM and Packed PCM (lossless encoding)
- extremely high quality (192 and 176.4 kHz sampling frequencies, in stereo)
- multi-channel (scalable, up to six channels)
• Maintains compatibility with the DVD-Video format
• Many added features
- still image features
- video features (a subset of the DVD-Video format)
- centralized text, real-time text
• Access features suitable for audio systems
- access units follow the familiar hierarchy of conventional audio media: album (Volume), group (Title Group), track (song), index (Cell)
- continuation of the basic concepts behind DVD-Video
- TOC-style access method (two-channel content)

1.2.1 DVD-Audio and DVD-Video formats

The DVD-Audio specification was established as the second ROM application format for the DVD family. In formulating the design concept, compatibility with the DVD-Video specification was given high priority, in addition to striving to meet the needs of next-generation audio discs. To this end, the physical format and file format were made common with the DVD-Video standard, but this is not all; the DVD-Audio specification also aims to share much in common with DVD-Video in its application format. The specification for audio data, the core of playback data for DVD-Audio, is comprised of portions compliant with the DVD-Video specification, portions which are extensions of the DVD-Video specification, and portions which were newly defined for DVD-Audio. The specifications for compressed audio and video are a subset of the DVD-Video specification (complying with that specification but with some additional restrictions added), thus maintaining complete data compatibility. Some portions of the specification for still images, menus, and text data follow the DVD-Video specification, but most portions are newly defined to provide more appropriate features for audio discs.

1.2.2 Discs and players

With the emergence of the DVD-Audio specification, it is anticipated that the DVD-Video players on the market will be joined by Audio-only players, which focus on providing the best possible sound quality; compatible players, which can play both DVD-Audio and DVD-Video discs; and many other specialized players to meet users' needs, such as portable players and car audio devices. The DVD-Audio specification provides the capability for discs to include not only pure audio content but also video or still image content which can be played along with the audio content. The specification allows the audio portion of such video content to be played on Audio-only players, while the video can also be reproduced by pre-existing DVD-Video players.

1.3 Audio Specification

• Major Features

• Super Hi-Fi stereo audio
• Scalable multi-channel capability
• Downmix control features
• Multi-channel to two-channel conversion
• Bit shift capability
• Audio selection features
• Attributes modifiable on a per-track (song) basis

1.3.1 Super Hi-Fi stereo audio

As a result of discussions with the music industry, the highest priority in the development of the DVD-Audio specification was given to the ability to perfectly reproduce the musical characteristics desired by music creators. That is, the producers of music strongly requested the features of complete transparency (that is, the playback sound quality is the same as the production sound quality), complete compatibility with the signal processing, editing, etc. capabilities of current and future studio equipment, and the ability to record past musical assets. As a result, linear PCM was chosen as the encoding scheme for the core audio content. Linear PCM is also used in DVD-Video, but as DVD-Audio gives even more weight to audio quality, the capabilities of DVD-Audio are considerably expanded to provide a 96 kHz bandwidth and a 144 dB dynamic range.
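
The two figures quoted above follow from standard sampling arithmetic: the usable bandwidth is half the sampling frequency, and the theoretical dynamic range of linear PCM is roughly 6 dB per bit. The following is a minimal Python sketch of that arithmetic; the function names are illustrative and are not taken from the specification.

```python
import math

def bandwidth_hz(sampling_hz):
    """Nyquist limit: the highest representable frequency is half the sampling rate."""
    return sampling_hz / 2

def dynamic_range_db(bits):
    """Ratio between full scale and one quantization step, expressed in dB."""
    return 20 * math.log10(2 ** bits)

print(bandwidth_hz(192_000))        # 96000.0 (Hz)
print(round(dynamic_range_db(24)))  # 144 (dB)
```

These are theoretical limits for the format; the performance of real converters and playback chains is outside the scope of this calculation.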

1.3.2 Audio specification details

Audio Specifications

                              Audio Object                        Video Object
Encoding methods (mandatory)  Linear PCM (Scalable),              Linear PCM,
                              Packed PCM (lossless encoding)      Dolby AC3
Encoding methods (optional)   none                                MPEG Audio

Audio specifications for Linear PCM and Packed PCM encoding schemes

                              Audio Object                        Video Object
Sampling frequency            48/96/192 kHz,                      48/96 kHz
                              44.1/88.2/176.4 kHz
Quantization depth            16/20/24 bits                       16/20/24 bits
Maximum number of channels    6ch (fs: 48/96/44.1/88.2 kHz),      (2ch for Stereo + 6ch for Multi channel)
                              2ch (fs: 192/176.4 kHz)
Maximum bit rate              9.6 Mbps                            6.144 Mbps
                              (Linear PCM / Packed PCM)           (Linear PCM)
Frame rate                    1200 Hz (fs: 48/96/192 kHz),        (fs: 48/96 kHz)
                              (fs: 44.1/88.2/176.4 kHz)

As with DVD-Video, audio data in DVD-Audio is combined with header information and management information to form audio packets, which are then combined into 2048-byte packs. These packs are multiplexed with other packs to form Objects and are recorded to the disc. Two types of objects are specified by DVD-Audio. Audio Objects (AOB) are intended for use in main audio playback, while Video Objects (VOB) are used for playback of images and audio. The encoding methods defined for the two kinds of objects are different. The encoding method for VOB follows and is identical to the DVD-Video specification, to maintain compatibility with that format. The AOB format, however, uses a new Linear PCM (Scalable) and Packed PCM to provide higher audio quality.

In order to satisfy various requests from the music industry, the audio specifications for Audio Objects are based on the DVD-Video specification, with extensions to provide further capabilities.

Sampling rates that are multiples of 44.1 kHz were added to allow the recording of currently-existing music assets with perfect transparency and no processing required. Sampling rates of 192 kHz and 176.4 kHz were added to meet the ultra-high bandwidth demands of next-generation audio discs. The bit rate was expanded to 9.6 Mbps to support specification extensions for multi-channel audio, in addition to providing ultra-high bandwidth.

Lossless compression (Packed PCM) was added as a means to achieve 96 kHz, 24-bit, 6-channel (13.824 Mbps) recording and a greater than 74 minute recording length.

1.3.3 Data rate and recording time

DVD-Audio takes advantage of the large (4.7 GB) capacity and high (10.08 Mbps) transfer rate of the DVD format to make it possible to record extremely high quality audio content and multi-channel audio content that was not possible with previous media. On the other hand, DVD-Audio also makes it possible to record over 400 minutes of CD-quality audio. The use of lossless compression further expands the recording time and effective transfer rate (to a maximum of 13.824 Mbps before compression).
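
The figures in this section can be checked with simple arithmetic. The sketch below is a rough estimate only: it assumes a disc capacity of exactly 4.7 x 10^9 bytes and ignores all pack, navigation, and file system overhead. The function and constant names are illustrative.

```python
DISC_BYTES = 4_700_000_000   # assumption: 4.7 GB counted in decimal bytes
MAX_AOB_BPS = 9_600_000      # maximum bit rate for Audio Objects (9.6 Mbps)

def lpcm_bit_rate(sampling_hz, bits, channels):
    """Raw Linear PCM bit rate in bits per second."""
    return sampling_hz * bits * channels

def recording_minutes(capacity_bytes, bit_rate_bps):
    """Playing time in minutes if the whole disc held audio at this rate."""
    return capacity_bytes * 8 / bit_rate_bps / 60

cd_quality = lpcm_bit_rate(44_100, 16, 2)      # 1,411,200 bps
hi_fi_stereo = lpcm_bit_rate(192_000, 24, 2)   # 9,216,000 bps
surround = lpcm_bit_rate(96_000, 24, 6)        # 13,824,000 bps

print(round(recording_minutes(DISC_BYTES, cd_quality)))  # 444
print(hi_fi_stereo <= MAX_AOB_BPS)                       # True
print(surround <= MAX_AOB_BPS)                           # False
```

The last line shows why 96 kHz, 24-bit, 6-channel audio cannot be stored as plain Linear PCM: its raw rate of 13.824 Mbps exceeds the 9.6 Mbps limit, so it relies on Packed PCM.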

1.3.4 Scalable multi-channel

One major benefit provided by the Linear PCM encoding used in the DVD-Audio specification is scalability. Even in actual multi-channel recording, the surround channel signals, which consist primarily of echo, typically have much lower levels than the front channel signals and also require less bandwidth. In such cases, it is possible to improve efficiency by recording the surround channel signals with lower sampling frequencies and quantization bit depths.
In this context, scalability is the concept of grouping the channels into multiple channel groups according to various parameters of the source data, and then setting the optimal sampling frequency and quantization bit depth for each channel group. A channel group is simply a set of channels which are encoded with a common sampling frequency and bit depth. For example, the front three channels could be encoded at 96 kHz, 24 bits per sample, while the rear two channels and LFE (Low Frequency Effect) channel could be encoded at 48 kHz and 16 bits per sample. Or front right and left channels and rear channels could be encoded at 96 kHz, 20 bits per sample, while the center and LFE channels could be encoded at 48 kHz with 20 bits per sample. The following restrictions apply to channel group audio specifications in DVD-Audio.

• There may be at most two channel groups.
• 192 kHz and 176.4 kHz sampling frequencies may not be used in conjunction with scalability.
• The sampling frequencies must have common factors.
• The sampling frequency and quantization bit depth for channel group 2 must be less than or equal to those of channel group 1.

There are 21 allowable configurations of channels assigned to channel groups, and the relationship between channels, groups, and speaker position is also specified. Even if the sampling frequency and bit depth are common to all channels, the channels must be assigned to two channel groups if there are three or more channels.

Note 1: must include left and right front channels
Note 2: channel group 1 only, two channels max.
Note 3: if the sampling frequency differs between the channel groups, sample timing must be synchronized to the timing of the lower-frequency channel.

Figure: channel groups for LPCM audio sources. Two groups max. (9.6 Mbps). The sampling frequency and quantization bit depth of channel group 1 and of channel group 2 may be changed on a per-track (song) basis.
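
The channel group restrictions listed above can be expressed as a small validity check. This is a sketch under stated assumptions, not the specification's own procedure: the "common factors" rule is interpreted here as requiring the group 1 sampling frequency to be an integer multiple of the group 2 frequency (both from the 48 kHz family or both from the 44.1 kHz family), and the 9.6 Mbps limit is applied to the raw Linear PCM rate. The function name and tuple layout are hypothetical.

```python
MAX_BIT_RATE = 9_600_000
# 192 kHz and 176.4 kHz are excluded because they may not be used with scalability.
SCALABLE_RATES = {44_100, 48_000, 88_200, 96_000}

def check_channel_groups(group1, group2):
    """Each group is (sampling_hz, bits, channel_count).

    Returns the total raw bit rate, or raises ValueError if a rule is broken.
    """
    fs1, bits1, n1 = group1
    fs2, bits2, n2 = group2
    if fs1 not in SCALABLE_RATES or fs2 not in SCALABLE_RATES:
        raise ValueError("sampling frequency not usable with two channel groups")
    if fs2 > fs1 or bits2 > bits1:
        raise ValueError("group 2 must not exceed group 1 in frequency or bit depth")
    if fs1 % fs2 != 0:
        # Assumed reading of the "common factors" rule: same frequency family.
        raise ValueError("sampling frequencies are not from the same family")
    if n1 + n2 > 6:
        raise ValueError("at most six channels in total")
    total = fs1 * bits1 * n1 + fs2 * bits2 * n2
    if total > MAX_BIT_RATE:
        raise ValueError("total bit rate exceeds 9.6 Mbps")
    return total

# The two examples from the text:
print(check_channel_groups((96_000, 24, 3), (48_000, 16, 3)))  # 9216000
print(check_channel_groups((96_000, 20, 4), (48_000, 20, 2)))  # 9600000
```

Both examples from the text fit within the 9.6 Mbps limit, the second one exactly.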

1.3.5 Downmixing

In addition to conventional stereo playback, DVD-Audio also supports multi-channel audio to provide new kinds of sound stages. These new sound stages, wherein each channel provides audio quality which far surpasses that of CD audio, should make music a more powerful experience for the listener than ever before possible. Practically speaking, however, all users may not have an environment which allows them to reproduce multi-channel sound. Further, there will be different kinds of listening environments, such as listening outdoors. For these reasons, the DVD-Audio specification was designed to provide robust support for both multi-channel audio and two-channel stereo, and thus provides for various ways to reproduce multi-channel content in two-channel environments.

One of these methods is called downmixing. Each disc which contains multi-channel audio content can have recorded on the disc the relationship between the various channels and the left and right channels (Lmix and Rmix) of a mixed-down two-channel audio stream. When such content is played, the player performs the downmix processing according to the following formulas.

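\[ L_{mix} = \pm\alpha_0 L_f \pm \alpha_1 R_f \pm \alpha_2 C \pm \alpha_3 L_s \pm \alpha_4 R_s \pm \alpha_5 \mathit{LFE} \]
\[ R_{mix} = \pm\beta_0 L_f \pm \beta_1 R_f \pm \beta_2 C \pm \beta_3 L_s \pm \beta_4 R_s \pm \beta_5 \mathit{LFE} \]

Here the ± sign indicates the phase of each term (a minus sign corresponds to a 180° phase inversion).

The downmix coefficients α and β may only be set for Linear PCM data in the AOBs, and the values may be different for each track. The coefficients may be set in extremely fine steps, with a minimum step size of 0.2 dB. This allows the artist to provide audio playback with exactly the desired feel.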
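
The downmix described above is a signed, weighted sum of the six input channels for each output channel. The following Python sketch illustrates the idea. The way coefficients are stored on the disc is not described here, so the representation used below (a sign plus an attenuation counted in 0.2 dB steps) and the example coefficient tables are assumptions made purely for illustration.

```python
CHANNELS = ("Lf", "Rf", "C", "Ls", "Rs", "LFE")
STEPS_PER_DB = 5  # one step = 0.2 dB, the minimum step size

def gain(sign, steps):
    """Linear gain for an attenuation of `steps` x 0.2 dB.

    sign is +1 (in phase) or -1 (180 degree phase inversion).
    """
    return sign * 10 ** (-(steps / STEPS_PER_DB) / 20)

def downmix(sample, left, right):
    """Mix one multi-channel sample down to (Lmix, Rmix).

    sample maps channel name -> value; left and right map channel
    name -> (sign, steps). Channels absent from a table contribute nothing.
    """
    lmix = sum(gain(*left[ch]) * sample[ch] for ch in CHANNELS if ch in left)
    rmix = sum(gain(*right[ch]) * sample[ch] for ch in CHANNELS if ch in right)
    return lmix, rmix

# Hypothetical per-track coefficient tables: center and surround at -3 dB.
LEFT = {"Lf": (+1, 0), "C": (+1, 15), "Ls": (+1, 15)}
RIGHT = {"Rf": (+1, 0), "C": (+1, 15), "Rs": (+1, 15)}

sample = {"Lf": 0.5, "Rf": 0.25, "C": 0.0, "Ls": 0.0, "Rs": 0.0, "LFE": 0.0}
print(downmix(sample, LEFT, RIGHT))  # (0.5, 0.25)
```

Because the tables are plain data, a different pair of tables can be supplied for each track, which mirrors the per-track setting described in the text.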

1.3.6 Bit shifting

When the quantization bit depths of the channel groups differ in a multi-channel audio stream, the player treats both digital full scale values as the same signal level. As a result, when reproducing the multi-channel audio, the channels with the smaller bit depth will have relatively greater quantization noise, which has a larger influence on the total dynamic range. The DVD-Audio specification incorporates a method called bit shifting to reduce this influence.

For instance, if the peak signal level for channel group 2 is less than -12 dB, the upper three bits of each sample will always have the same value as the MSB. This means that the signal can be shifted upward by two bits without overflow, so that a 16-bit recording retains 18 bits of the original 20-bit precision. After the channel group 2 signals are shifted upward, data for channel group 1 is recorded to the disc at 20 bits per sample, while the lower four bits of the channel group 2 signals are truncated and the signals are recorded at 16 bits per sample. At the same time, information indicating that channel group 2 has been recorded with a two-bit upshift is also recorded to the disc.

During playback, the channel group 2 signals are downshifted by two bits. That is, the MSB of the 16-bit data is expanded to add two high-order bits that are the same as the MSB, while two bits of zeros are added to the low-order end of the samples, producing 20-bit data for playback. This allows the precision of an 18-bit sample to be preserved, thereby increasing the dynamic range by 12 dB over what would be obtained by simply truncating the channel group 2 signals to 16-bit samples, and thus reducing the overall noise level.

Bit shifting is done to increase the efficiency of sample usage and expand the dynamic range of Linear PCM multi-channel audio. It cannot be used with Packed PCM, and may only be applied to Linear PCM channel group 2 signals. Samples may be shifted by one to four bits, and the shift amount may be changed on a per-track basis. Bit shifting may not be used when both channel groups are using the same quantization bit depth. In combination with the scalability features, bit shifting provides efficient data transfer while maintaining maximum bit resolution for channel group 2 signals.
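
The two-bit example above can be traced with integer arithmetic. The sketch below is illustrative only: it models samples as signed Python integers, assumes a 20-bit source recorded at 16 bits as in the example, and uses hypothetical function names.

```python
def encode_shifted(sample20, shift):
    """Shift a signed 20-bit sample up by `shift` bits, then drop the low
    four bits to obtain the 16-bit value that is recorded."""
    shifted = sample20 << shift
    if not -(1 << 19) <= shifted < (1 << 19):
        raise ValueError("sample is too loud for this shift amount")
    return shifted >> 4

def decode_shifted(sample16, shift):
    """Restore a 20-bit sample: pad four zero bits at the low end, then shift
    back down (Python's >> keeps the sign, which acts as MSB extension)."""
    return (sample16 << 4) >> shift

def truncate_only(sample20):
    """Plain truncation to 16 bits and back, for comparison."""
    return (sample20 >> 4) << 4

x = 100_015                      # below -12 dB: |x| < 2**17
stored = encode_shifted(x, 2)    # 25003
print(decode_shifted(stored, 2)) # 100012, error of 3
print(truncate_only(x))          # 100000, error of 15
```

With a two-bit shift the worst-case error drops from 15 to 3 quantization steps of the 20-bit scale, a factor of four in amplitude, which corresponds to the 12 dB improvement stated in the text.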

1.3.7 Audio Selection

Audio selection playback example
AVTT = Audio Video Title (an audio title recorded in Video Format)
AOTT = Audio Only Title (an audio title recorded in the new Audio Format; LPCM, Packed PCM)

The DVD-Audio specification defines a feature called Audio Selection which allows support for audio data in two different formats in the same track. The user first sets up his system according to the playback capabilities of his player and playback system, and the player can then automatically select the correct audio data to play back between the two formats. The user may also specify the data to be played back, at playback time. Audio Selection is applicable to the following combinations of the two types of data (objects or streams).
• A combination of different numbers of channels (stereo, multi-channel)
• A combination of different encoding methods
• A combination of different numbers of channels and different encoding methods

1.4 Additional Information

The DVD-Audio specification allows reproduction of a wide variety of audio content, from Super Hi-Fi stereo to multi-channel audio (5.1ch), which provides a powerful concert hall experience. But that's not all. DVD-Audio also defines various types of additional information which expands the world of audio entertainment, making it a specification truly worthy of the "next-generation audio disc" appellation. In consultation with content creators, the details of this additional information were refined again and again throughout the process of defining the format, as the specification evolved to its current form. DVD-Audio defines three main types of additional information: 1) still image information; 2) video information; and 3) text information.

1.4.1 Still image information

The still image feature of DVD-Audio allows still images to be recorded to a DVD-Audio disc and then shown on a display while the audio is being played back, under the control of navigation data. When playing audio content which has associated still images, a data structure called an ASVU (Audio Still Video Unit), which contains the data for multiple still images, is first read from the disc and stored in buffer memory. After this is done, the audio data begins to play. As long as playback remains within the ASVU range (the range of content covered by the one ASVU in the buffer), the system can guarantee continuity of audio reproduction with no skips in the audio stream. That is, there is no need to stop the audio in order to read images from the disc. The specification defines several playback modes for still image reproduction, covering display timing, display order, and a variety of visual effects. The content creator can select these modes to provide the user with the most effective possible presentation of the content. The following playback modes are defined by the specification.

(1) Display timing
a. Slide show mode (still images are displayed one after the other, with timing defined by the content creator, as the audio is played)
b. Browsable mode (still image display is independent from the audio playback, and the user can control the timing of still image display)
(2) Display order
a. Sequential mode (still images are displayed in an order determined by the content creator)
b. Random / shuffle mode (still images are displayed in an arbitrary order)
(3) Visual effects (optional)
a. Cut in / out
b. Fade in / out
c. Dissolve
d. Wipe

Sequential slide show

This example illustrates a sequential slide show, which is a combination of sequential mode and slide show mode. In this mode, the still images are displayed in an order and timing to fit perfectly with the audio content, as determined by the content creator. While listening to the audio, the user can also view the images, which are matched to the audio, on a monitor.

Sequential browsable

This example illustrates sequential browsable mode, which is a combination of sequential mode and browsable mode. In this mode, the display order of the still images is determined by the content creator. The time at which the images are displayed is completely up to the user, who can move through the images like turning the pages of a book, moving at his own pace. In this example the basic order and timing are determined by the content creator, and if the user does nothing the images will be displayed in the predetermined order, and the display will move to the next image after a set amount of time passes. However, the user can specify "return to previous" or "go to next" at any time, and the previous or next image will be displayed in response.

1.5 Text Information

Optional for discs, optional for players

Two types of text information
Centralized recording: static text information
• When text information is recorded, it must include album name, group name, and track names
• Other information, such as release date, artist names, etc., may also be recorded
• Allows easy content search, jacket-like text menu
• No limitation on language units
Real-time: dynamic text information
• Recorded in the audio data recording region
• Displayed simultaneously with audio playback; smallest display granularity is one second
• Useful primarily for display of lyrics, liner notes
• Up to eight languages, defined when the centralized data is written, and limited to the languages contained therein
• Character set codes
- ISO8859-1 (Roman alphabet), Music Shift JIS (Japanese)

1.6 Access system similar to conventional audio discs

The DVD-Audio specification defines a hierarchical structure and units of access, with the focus of continuing with the same style of operations used with familiar previous audio discs. As in the past, one song corresponds to one track. However, since one 12 cm single-layer disc holds 4.7 GB and thus has the capacity to hold multiple CD albums, it is possible to record many tracks on a single DVD. To improve accessibility, the DVD-Audio specification combines multiple tracks which should be played sequentially into a newly-defined access unit called a Group. A Group is a logical unit, and can form a large structure of one or more sequential Audio Titles (ATT) as a single playback unit. This logical unit can be played continuously with a single operation.

Audio Title is a logical structure which is equivalent to a Title in DVD-Video, and is comprised of the presentation data to be played and the navigation data which determine the playback sequence. Unlike its DVD-Video counterpart, Audio Title is a unit of access that the user will not be concerned with. DVD-Audio does not allow complex playback sequences. However, adopting the Audio Title and Group as logical structures allows the construction of large works containing multiple units of content with differing parameters, such as different encoding methods (Linear PCM or Packed PCM), sampling frequencies, bit depths, multi-channel configurations, with or without video data, etc. Up to nine Groups can be contained within a Volume, and up to 99 ATTs can be contained within each Group. However, the maximum number of ATTs per disc Volume is also limited to 99.

Group and Track numbers are the units accessible by users from their DVD-Audio players. Further, a smaller, optional access unit called an Index is also defined. An Index is a playback unit which may be comprised of one or more cells. The Index was defined primarily to specify numbers of logical units within a track. Index numbers start from 1 within each Track.
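
The numeric limits stated in this section can be summarized as a simple consistency check. This is only a sketch of the limits as described here (nine Groups per Volume, 99 ATTs per Group, 99 ATTs per Volume); the function name and the list-of-counts representation are hypothetical and do not reflect how the disc stores this information.

```python
MAX_GROUPS_PER_VOLUME = 9
MAX_ATTS_PER_GROUP = 99
MAX_ATTS_PER_VOLUME = 99

def check_volume(atts_per_group):
    """atts_per_group lists the number of Audio Titles in each Group.

    Returns the total number of ATTs, or raises ValueError if a limit
    described in the text is exceeded.
    """
    if len(atts_per_group) > MAX_GROUPS_PER_VOLUME:
        raise ValueError("too many Groups in the Volume")
    if any(n > MAX_ATTS_PER_GROUP for n in atts_per_group):
        raise ValueError("too many ATTs in one Group")
    total = sum(atts_per_group)
    if total > MAX_ATTS_PER_VOLUME:
        raise ValueError("too many ATTs in the Volume")
    return total

print(check_volume([12, 10, 14]))  # 36
```

Note that the per-Volume limit is the binding one in practice: two Groups of 60 ATTs each satisfy the per-Group limit but exceed the Volume limit.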

2008 - 04 -14