summaryrefslogtreecommitdiffhomepage
path: root/vorbis/doc/01-introduction.tex
diff options
context:
space:
mode:
Diffstat (limited to 'vorbis/doc/01-introduction.tex')
-rw-r--r--vorbis/doc/01-introduction.tex528
1 files changed, 0 insertions, 528 deletions
diff --git a/vorbis/doc/01-introduction.tex b/vorbis/doc/01-introduction.tex
deleted file mode 100644
index d7767df..0000000
--- a/vorbis/doc/01-introduction.tex
+++ /dev/null
@@ -1,528 +0,0 @@
-% -*- mode: latex; TeX-master: "Vorbis_I_spec"; -*-
-%!TEX root = Vorbis_I_spec.tex
-\section{Introduction and Description} \label{vorbis:spec:intro}
-
-\subsection{Overview}
-
-This document provides a high level description of the Vorbis codec's
-construction. A bit-by-bit specification appears beginning in
-\xref{vorbis:spec:codec}.
-The later sections assume a high-level
-understanding of the Vorbis decode process, which is
-provided here.
-
-\subsubsection{Application}
-Vorbis is a general purpose perceptual audio CODEC intended to allow
-maximum encoder flexibility, thus allowing it to scale competitively
-over an exceptionally wide range of bitrates. At the high
-quality/bitrate end of the scale (CD or DAT rate stereo, 16/24 bits)
-it is in the same league as MPEG-2 and MPC. Similarly, the 1.0
-encoder can encode high-quality CD and DAT rate stereo at below 48kbps
-without resampling to a lower rate. Vorbis is also intended for
-lower and higher sample rates (from 8kHz telephony to 192kHz digital
-masters) and a range of channel representations (monaural,
-polyphonic, stereo, quadraphonic, 5.1, ambisonic, or up to 255
-discrete channels).
-
-
-\subsubsection{Classification}
-Vorbis I is a forward-adaptive monolithic transform CODEC based on the
-Modified Discrete Cosine Transform. The codec is structured to allow
-addition of a hybrid wavelet filterbank in Vorbis II to offer better
-transient response and reproduction using a transform better suited to
-localized time events.
-
-
-\subsubsection{Assumptions}
-
-The Vorbis CODEC design assumes a complex, psychoacoustically-aware
-encoder and simple, low-complexity decoder. Vorbis decode is
-computationally simpler than mp3, although it does require more
-working memory as Vorbis has no static probability model; the vector
-codebooks used in the first stage of decoding from the bitstream are
-packed in their entirety into the Vorbis bitstream headers. In
-packed form, these codebooks occupy only a few kilobytes; the extent
-to which they are pre-decoded into a cache is the dominant factor in
-decoder memory usage.
-
-
-Vorbis provides none of its own framing, synchronization or protection
-against errors; it is solely a method of accepting input audio,
-dividing it into individual frames and compressing these frames into
-raw, unformatted 'packets'. The decoder then accepts these raw
-packets in sequence, decodes them, synthesizes audio frames from
-them, and reassembles the frames into a facsimile of the original
-audio stream. Vorbis is a free-form variable bit rate (VBR) codec and packets have no
-minimum size, maximum size, or fixed/expected size. Packets
-are designed that they may be truncated (or padded) and remain
-decodable; this is not to be considered an error condition and is used
-extensively in bitrate management in peeling. Both the transport
-mechanism and decoder must allow that a packet may be any size, or
-end before or after packet decode expects.
-
-Vorbis packets are thus intended to be used with a transport mechanism
-that provides free-form framing, sync, positioning and error correction
-in accordance with these design assumptions, such as Ogg (for file
-transport) or RTP (for network multicast). For purposes of a few
-examples in this document, we will assume that Vorbis is to be
-embedded in an Ogg stream specifically, although this is by no means a
-requirement or fundamental assumption in the Vorbis design.
-
-The specification for embedding Vorbis into
-an Ogg transport stream is in \xref{vorbis:over:ogg}.
-
-
-
-\subsubsection{Codec Setup and Probability Model}
-
-Vorbis' heritage is as a research CODEC and its current design
-reflects a desire to allow multiple decades of continuous encoder
-improvement before running out of room within the codec specification.
-For these reasons, configurable aspects of codec setup intentionally
-lean toward the extreme of forward adaptive.
-
-The single most controversial design decision in Vorbis (and the most
-unusual for a Vorbis developer to keep in mind) is that the entire
-probability model of the codec, the Huffman and VQ codebooks, is
-packed into the bitstream header along with extensive CODEC setup
-parameters (often several hundred fields). This makes it impossible,
-as it would be with MPEG audio layers, to embed a simple frame type
-flag in each audio packet, or begin decode at any frame in the stream
-without having previously fetched the codec setup header.
-
-
-\begin{note}
-Vorbis \emph{can} initiate decode at any arbitrary packet within a
-bitstream so long as the codec has been initialized/setup with the
-setup headers.
-\end{note}
-
-Thus, Vorbis headers are both required for decode to begin and
-relatively large as bitstream headers go. The header size is
-unbounded, although for streaming a rule-of-thumb of 4kB or less is
-recommended (and Xiph.Org's Vorbis encoder follows this suggestion).
-
-Our own design work indicates the primary liability of the
-required header is in mindshare; it is an unusual design and thus
-causes some amount of complaint among engineers as this runs against
-current design trends (and also points out limitations in some
-existing software/interface designs, such as Windows' ACM codec
-framework). However, we find that it does not fundamentally limit
-Vorbis' suitable application space.
-
-
-\subsubsection{Format Specification}
-The Vorbis format is well-defined by its decode specification; any
-encoder that produces packets that are correctly decoded by the
-reference Vorbis decoder described below may be considered a proper
-Vorbis encoder. A decoder must faithfully and completely implement
-the specification defined below (except where noted) to be considered
-a proper Vorbis decoder.
-
-\subsubsection{Hardware Profile}
-Although Vorbis decode is computationally simple, it may still run
-into specific limitations of an embedded design. For this reason,
-embedded designs are allowed to deviate in limited ways from the
-`full' decode specification yet still be certified compliant. These
-optional omissions are labelled in the spec where relevant.
-
-
-\subsection{Decoder Configuration}
-
-Decoder setup consists of configuration of multiple, self-contained
-component abstractions that perform specific functions in the decode
-pipeline. Each different component instance of a specific type is
-semantically interchangeable; decoder configuration consists both of
-internal component configuration, as well as arrangement of specific
-instances into a decode pipeline. Componentry arrangement is roughly
-as follows:
-
-\begin{center}
-\includegraphics[width=\textwidth]{components}
-\captionof{figure}{decoder pipeline configuration}
-\end{center}
-
-\subsubsection{Global Config}
-Global codec configuration consists of a few audio related fields
-(sample rate, channels), Vorbis version (always '0' in Vorbis I),
-bitrate hints, and the lists of component instances. All other
-configuration is in the context of specific components.
-
-\subsubsection{Mode}
-
-Each Vorbis frame is coded according to a master 'mode'. A bitstream
-may use one or many modes.
-
-The mode mechanism is used to encode a frame according to one of
-multiple possible methods with the intention of choosing a method best
-suited to that frame. Different modes are, e.g. how frame size
-is changed from frame to frame. The mode number of a frame serves as a
-top level configuration switch for all other specific aspects of frame
-decode.
-
-A 'mode' configuration consists of a frame size setting, window type
-(always 0, the Vorbis window, in Vorbis I), transform type (always
-type 0, the MDCT, in Vorbis I) and a mapping number. The mapping
-number specifies which mapping configuration instance to use for
-low-level packet decode and synthesis.
-
-
-\subsubsection{Mapping}
-
-A mapping contains a channel coupling description and a list of
-'submaps' that bundle sets of channel vectors together for grouped
-encoding and decoding. These submaps are not references to external
-components; the submap list is internal and specific to a mapping.
-
-A 'submap' is a configuration/grouping that applies to a subset of
-floor and residue vectors within a mapping. The submap functions as a
-last layer of indirection such that specific special floor or residue
-settings can be applied not only to all the vectors in a given mode,
-but also specific vectors in a specific mode. Each submap specifies
-the proper floor and residue instance number to use for decoding that
-submap's spectral floor and spectral residue vectors.
-
-As an example:
-
-Assume a Vorbis stream that contains six channels in the standard 5.1
-format. The sixth channel, as is normal in 5.1, is bass only.
-Therefore it would be wasteful to encode a full-spectrum version of it
-as with the other channels. The submapping mechanism can be used to
-apply a full range floor and residue encoding to channels 0 through 4,
-and a bass-only representation to the bass channel, thus saving space.
-In this example, channels 0-4 belong to submap 0 (which indicates use
-of a full-range floor) and channel 5 belongs to submap 1, which uses a
-bass-only representation.
-
-
-\subsubsection{Floor}
-
-Vorbis encodes a spectral 'floor' vector for each PCM channel. This
-vector is a low-resolution representation of the audio spectrum for
-the given channel in the current frame, generally used akin to a
-whitening filter. It is named a 'floor' because the Xiph.Org
-reference encoder has historically used it as a unit-baseline for
-spectral resolution.
-
-A floor encoding may be of two types. Floor 0 uses a packed LSP
-representation on a dB amplitude scale and Bark frequency scale.
-Floor 1 represents the curve as a piecewise linear interpolated
-representation on a dB amplitude scale and linear frequency scale.
-The two floors are semantically interchangeable in
-encoding/decoding. However, floor type 1 provides more stable
-inter-frame behavior, and so is the preferred choice in all
-coupled-stereo and high bitrate modes. Floor 1 is also considerably
-less expensive to decode than floor 0.
-
-Floor 0 is not to be considered deprecated, but it is of limited
-modern use. No known Vorbis encoder past Xiph.Org's own beta 4 makes
-use of floor 0.
-
-The values coded/decoded by a floor are both compactly formatted and
-make use of entropy coding to save space. For this reason, a floor
-configuration generally refers to multiple codebooks in the codebook
-component list. Entropy coding is thus provided as an abstraction,
-and each floor instance may choose from any and all available
-codebooks when coding/decoding.
-
-
-\subsubsection{Residue}
-The spectral residue is the fine structure of the audio spectrum
-once the floor curve has been subtracted out. In simplest terms, it
-is coded in the bitstream using cascaded (multi-pass) vector
-quantization according to one of three specific packing/coding
-algorithms numbered 0 through 2. The packing algorithm details are
-configured by residue instance. As with the floor components, the
-final VQ/entropy encoding is provided by external codebook instances
-and each residue instance may choose from any and all available
-codebooks.
-
-\subsubsection{Codebooks}
-
-Codebooks are a self-contained abstraction that perform entropy
-decoding and, optionally, use the entropy-decoded integer value as an
-offset into an index of output value vectors, returning the indicated
-vector of values.
-
-The entropy coding in a Vorbis I codebook is provided by a standard
-Huffman binary tree representation. This tree is tightly packed using
-one of several methods, depending on whether codeword lengths are
-ordered or unordered, or the tree is sparse.
-
-The codebook vector index is similarly packed according to index
-characteristic. Most commonly, the vector index is encoded as a
-single list of values of possible values that are then permuted into
-a list of n-dimensional rows (lattice VQ).
-
-
-
-\subsection{High-level Decode Process}
-
-\subsubsection{Decode Setup}
-
-Before decoding can begin, a decoder must initialize using the
-bitstream headers matching the stream to be decoded. Vorbis uses
-three header packets; all are required, in-order, by this
-specification. Once set up, decode may begin at any audio packet
-belonging to the Vorbis stream. In Vorbis I, all packets after the
-three initial headers are audio packets.
-
-The header packets are, in order, the identification
-header, the comments header, and the setup header.
-
-\paragraph{Identification Header}
-The identification header identifies the bitstream as Vorbis, Vorbis
-version, and the simple audio characteristics of the stream such as
-sample rate and number of channels.
-
-\paragraph{Comment Header}
-The comment header includes user text comments (``tags'') and a vendor
-string for the application/library that produced the bitstream. The
-encoding and proper use of the comment header is described in \xref{vorbis:spec:comment}.
-
-\paragraph{Setup Header}
-The setup header includes extensive CODEC setup information as well as
-the complete VQ and Huffman codebooks needed for decode.
-
-
-\subsubsection{Decode Procedure}
-
-The decoding and synthesis procedure for all audio packets is
-fundamentally the same.
-\begin{enumerate}
-\item decode packet type flag
-\item decode mode number
-\item decode window shape (long windows only)
-\item decode floor
-\item decode residue into residue vectors
-\item inverse channel coupling of residue vectors
-\item generate floor curve from decoded floor data
-\item compute dot product of floor and residue, producing audio spectrum vector
-\item inverse monolithic transform of audio spectrum vector, always an MDCT in Vorbis I
-\item overlap/add left-hand output of transform with right-hand output of previous frame
-\item store right hand-data from transform of current frame for future lapping
-\item if not first frame, return results of overlap/add as audio result of current frame
-\end{enumerate}
-
-Note that clever rearrangement of the synthesis arithmetic is
-possible; as an example, one can take advantage of symmetries in the
-MDCT to store the right-hand transform data of a partial MDCT for a
-50\% inter-frame buffer space savings, and then complete the transform
-later before overlap/add with the next frame. This optimization
-produces entirely equivalent output and is naturally perfectly legal.
-The decoder must be \emph{entirely mathematically equivalent} to the
-specification, it need not be a literal semantic implementation.
-
-\paragraph{Packet type decode}
-
-Vorbis I uses four packet types. The first three packet types mark each
-of the three Vorbis headers described above. The fourth packet type
-marks an audio packet. All other packet types are reserved; packets
-marked with a reserved type should be ignored.
-
-Following the three header packets, all packets in a Vorbis I stream
-are audio. The first step of audio packet decode is to read and
-verify the packet type; \emph{a non-audio packet when audio is expected
-indicates stream corruption or a non-compliant stream. The decoder
-must ignore the packet and not attempt decoding it to
-audio}.
-
-
-
-
-\paragraph{Mode decode}
-Vorbis allows an encoder to set up multiple, numbered packet 'modes',
-as described earlier, all of which may be used in a given Vorbis
-stream. The mode is encoded as an integer used as a direct offset into
-the mode instance index.
-
-
-\paragraph{Window shape decode (long windows only)} \label{vorbis:spec:window}
-
-Vorbis frames may be one of two PCM sample sizes specified during
-codec setup. In Vorbis I, legal frame sizes are powers of two from 64
-to 8192 samples. Aside from coupling, Vorbis handles channels as
-independent vectors and these frame sizes are in samples per channel.
-
-Vorbis uses an overlapping transform, namely the MDCT, to blend one
-frame into the next, avoiding most inter-frame block boundary
-artifacts. The MDCT output of one frame is windowed according to MDCT
-requirements, overlapped 50\% with the output of the previous frame and
-added. The window shape assures seamless reconstruction.
-
-This is easy to visualize in the case of equal sized-windows:
-
-\begin{center}
-\includegraphics[width=\textwidth]{window1}
-\captionof{figure}{overlap of two equal-sized windows}
-\end{center}
-
-And slightly more complex in the case of overlapping unequal sized
-windows:
-
-\begin{center}
-\includegraphics[width=\textwidth]{window2}
-\captionof{figure}{overlap of a long and a short window}
-\end{center}
-
-In the unequal-sized window case, the window shape of the long window
-must be modified for seamless lapping as above. It is possible to
-correctly infer window shape to be applied to the current window from
-knowing the sizes of the current, previous and next window. It is
-legal for a decoder to use this method. However, in the case of a long
-window (short windows require no modification), Vorbis also codes two
-flag bits to specify pre- and post- window shape. Although not
-strictly necessary for function, this minor redundancy allows a packet
-to be fully decoded to the point of lapping entirely independently of
-any other packet, allowing easier abstraction of decode layers as well
-as allowing a greater level of easy parallelism in encode and
-decode.
-
-A description of valid window functions for use with an inverse MDCT
-can be found in \cite{Sporer/Brandenburg/Edler}. Vorbis windows
-all use the slope function
-\[ y = \sin(.5*\pi \, \sin^2((x+.5)/n*\pi)) . \]
-
-
-
-\paragraph{floor decode}
-Each floor is encoded/decoded in channel order, however each floor
-belongs to a 'submap' that specifies which floor configuration to
-use. All floors are decoded before residue decode begins.
-
-
-\paragraph{residue decode}
-
-Although the number of residue vectors equals the number of channels,
-channel coupling may mean that the raw residue vectors extracted
-during decode do not map directly to specific channels. When channel
-coupling is in use, some vectors will correspond to coupled magnitude
-or angle. The coupling relationships are described in the codec setup
-and may differ from frame to frame, due to different mode numbers.
-
-Vorbis codes residue vectors in groups by submap; the coding is done
-in submap order from submap 0 through n-1. This differs from floors
-which are coded using a configuration provided by submap number, but
-are coded individually in channel order.
-
-
-
-\paragraph{inverse channel coupling}
-
-A detailed discussion of stereo in the Vorbis codec can be found in
-the document \href{stereo.html}{Stereo Channel Coupling in the
-Vorbis CODEC}. Vorbis is not limited to only stereo coupling, but
-the stereo document also gives a good overview of the generic coupling
-mechanism.
-
-Vorbis coupling applies to pairs of residue vectors at a time;
-decoupling is done in-place a pair at a time in the order and using
-the vectors specified in the current mapping configuration. The
-decoupling operation is the same for all pairs, converting square
-polar representation (where one vector is magnitude and the second
-angle) back to Cartesian representation.
-
-After decoupling, in order, each pair of vectors on the coupling list,
-the resulting residue vectors represent the fine spectral detail
-of each output channel.
-
-
-
-\paragraph{generate floor curve}
-
-The decoder may choose to generate the floor curve at any appropriate
-time. It is reasonable to generate the output curve when the floor
-data is decoded from the raw packet, or it can be generated after
-inverse coupling and applied to the spectral residue directly,
-combining generation and the dot product into one step and eliminating
-some working space.
-
-Both floor 0 and floor 1 generate a linear-range, linear-domain output
-vector to be multiplied (dot product) by the linear-range,
-linear-domain spectral residue.
-
-
-
-\paragraph{compute floor/residue dot product}
-
-This step is straightforward; for each output channel, the decoder
-multiplies the floor curve and residue vectors element by element,
-producing the finished audio spectrum of each channel.
-
-% TODO/FIXME: The following two paragraphs have identical twins
-% in section 4 (under "dot product")
-One point is worth mentioning about this dot product; a common mistake
-in a fixed point implementation might be to assume that a 32 bit
-fixed-point representation for floor and residue and direct
-multiplication of the vectors is sufficient for acceptable spectral
-depth in all cases because it happens to mostly work with the current
-Xiph.Org reference encoder.
-
-However, floor vector values can span \~{}140dB (\~{}24 bits unsigned), and
-the audio spectrum vector should represent a minimum of 120dB (\~{}21
-bits with sign), even when output is to a 16 bit PCM device. For the
-residue vector to represent full scale if the floor is nailed to
-$-140$dB, it must be able to span 0 to $+140$dB. For the residue vector
-to reach full scale if the floor is nailed at 0dB, it must be able to
-represent $-140$dB to $+0$dB. Thus, in order to handle full range
-dynamics, a residue vector may span $-140$dB to $+140$dB entirely within
-spec. A 280dB range is approximately 48 bits with sign; thus the
-residue vector must be able to represent a 48 bit range and the dot
-product must be able to handle an effective 48 bit times 24 bit
-multiplication. This range may be achieved using large (64 bit or
-larger) integers, or implementing a movable binary point
-representation.
-
-
-
-\paragraph{inverse monolithic transform (MDCT)}
-
-The audio spectrum is converted back into time domain PCM audio via an
-inverse Modified Discrete Cosine Transform (MDCT). A detailed
-description of the MDCT is available in \cite{Sporer/Brandenburg/Edler}.
-
-Note that the PCM produced directly from the MDCT is not yet finished
-audio; it must be lapped with surrounding frames using an appropriate
-window (such as the Vorbis window) before the MDCT can be considered
-orthogonal.
-
-
-
-\paragraph{overlap/add data}
-Windowed MDCT output is overlapped and added with the right hand data
-of the previous window such that the 3/4 point of the previous window
-is aligned with the 1/4 point of the current window (as illustrated in
-the window overlap diagram). At this point, the audio data between the
-center of the previous frame and the center of the current frame is
-now finished and ready to be returned.
-
-
-\paragraph{cache right hand data}
-The decoder must cache the right hand portion of the current frame to
-be lapped with the left hand portion of the next frame.
-
-
-
-\paragraph{return finished audio data}
-
-The overlapped portion produced from overlapping the previous and
-current frame data is finished data to be returned by the decoder.
-This data spans from the center of the previous window to the center
-of the current window. In the case of same-sized windows, the amount
-of data to return is one-half block consisting of and only of the
-overlapped portions. When overlapping a short and long window, much of
-the returned range is not actually overlap. This does not damage
-transform orthogonality. Pay attention however to returning the
-correct data range; the amount of data to be returned is:
-
-\begin{Verbatim}[commandchars=\\\{\}]
-window\_blocksize(previous\_window)/4+window\_blocksize(current\_window)/4
-\end{Verbatim}
-
-from the center of the previous window to the center of the current
-window.
-
-Data is not returned from the first frame; it must be used to 'prime'
-the decode engine. The encoder accounts for this priming when
-calculating PCM offsets; after the first frame, the proper PCM output
-offset is '0' (as no data has been returned yet).