Docsoft, Incorporated

Media File Formats

Author: Yury Delendik, Docsoft Inc.
Created: 6.26.2009


This article gives a basic understanding of digital media that represents audio and video content, and of the popular formats used to store digital media information. Digital media information is represented by sequences of “0” and “1”. This lets the information be processed by computers and easily transmitted and stored.

Digital Media

Usually, an analog signal is captured and then converted to digital data by sampling. During the sampling process, the analog (continuous) signal is converted to a sequence of fragments called samples or frames. In practice, the time between samples is selected so that the listener or viewer can easily fill in the missing pieces of information. For example, if a nature movie is sampled at a rate of 25 frames per second, the viewer’s brain can still grasp the movement of the water in a river. The longer the time interval between samples, the more information is missing, and the harder the listener’s or viewer’s brain has to work to process the information.

Sound Data Sampling

A sound is a vibration (a change in pressure) that is transmitted through air, liquids or solid materials. A sample of a sound is a measurement of the pressure. The value of the pressure can be a positive or negative number. Since a computer cannot operate on arbitrary real numbers, the value is rounded to a rational or integer number in a specific interval. That introduces a further loss of information.
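The rounding step described above can be sketched in a few lines. This is a minimal illustration, not any particular codec's method: the function names `quantize` and `dequantize`, the 16-bit width and the assumption that pressure is normalized to the interval [-1.0, 1.0] are all choices made for the example.

```python
def quantize(pressure: float, bits: int = 16) -> int:
    """Map a pressure value in [-1.0, 1.0] to a signed integer level."""
    max_level = 2 ** (bits - 1) - 1           # 32767 for 16 bits
    clipped = max(-1.0, min(1.0, pressure))   # keep the value inside the interval
    return round(clipped * max_level)


def dequantize(level: int, bits: int = 16) -> float:
    """Convert a stored integer level back to an approximate pressure value."""
    max_level = 2 ** (bits - 1) - 1
    return level / max_level


original = 0.123456789              # hypothetical measured pressure
stored = quantize(original)         # the integer that is actually kept
restored = dequantize(stored)       # what playback can reconstruct
error = abs(original - restored)    # information lost by rounding
```

The error is small but not zero, and it can never be recovered from the stored integer alone.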

A sound can be characterized by its frequency, the number of times the same pressure pattern repeats per second, measured in hertz (Hz). Human ears can hear sounds with frequencies between 12 Hz and 20,000 Hz. The sampling interval has to be selected carefully: to capture a frequency, the signal must be sampled at more than twice that frequency, so for human hearing the sampling rate has to exceed 40,000 Hz. An audio CD stores samples taken at a rate of 44,100 Hz.
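A short sketch can show why the sampling rate has to exceed twice the highest frequency of interest. The helper names `nyquist_rate` and `sample_tone` are made up for this example, and the 30,000 Hz tone is a hypothetical input chosen because it lies above half of the CD sampling rate.

```python
import math


def nyquist_rate(max_frequency_hz: float) -> float:
    """Lower bound that the sampling rate has to exceed."""
    return 2 * max_frequency_hz


def sample_tone(frequency_hz: float, sampling_rate_hz: int, count: int) -> list:
    """Take `count` samples of a pure sine tone."""
    return [
        math.sin(2 * math.pi * frequency_hz * n / sampling_rate_hz)
        for n in range(count)
    ]


CD_RATE = 44_100

# A 30,000 Hz tone is too high for a 44,100 Hz sampling rate. Its samples
# match those of a 14,100 Hz tone (with the sign flipped), so the two tones
# cannot be told apart after sampling.
too_high = sample_tone(30_000, CD_RATE, 8)
alias = sample_tone(14_100, CD_RATE, 8)
indistinguishable = all(
    math.isclose(h, -a, abs_tol=1e-9) for h, a in zip(too_high, alias)
)
```

Frequencies up to 20,000 Hz stay below half of 44,100 Hz, so they do not suffer from this ambiguity.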

Analog signal with digital samples

Figure 1. The samples produced from the audio signal

After the sound is sampled, it can be represented as a sequence of numbers. For example, the numbers +35, -41, -6, +4, -4, +11, -3, 0, +4 represent the audio signal from the figure above. These numbers are used to process the signal on a computer or to play the sound back to the listener.

Digital signal

Figure 2. The digital audio signal playback

To convey the sense of sound coming from different directions, the sound has to be captured at different locations. The sound data from each location is called a sound channel. Stereo sound is captured from two places to be played back later to the left and right ears.

Image and Video Data Sampling

A digital image is represented as a grid of colored cells (similar to cross-stitch work). The color of each grid cell is chosen to be close to the average color of the part of the image that corresponds to the cell.

Images with sampling grid

Figure 3. The sampling grid over the image
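The averaging of each grid cell can be sketched as follows. This is an illustrative example only: the function names, the representation of a color as an (r, g, b) tuple of levels between 0.0 and 1.0, and the tiny 2x4 input image are assumptions made for the sketch.

```python
def average_color(pixels):
    """Average a list of (r, g, b) tuples component by component."""
    count = len(pixels)
    return tuple(sum(p[i] for p in pixels) / count for i in range(3))


def sample_grid(image, cell_size):
    """Replace each cell_size x cell_size block of the image by its average color."""
    rows, cols = len(image), len(image[0])
    grid = []
    for top in range(0, rows, cell_size):
        grid_row = []
        for left in range(0, cols, cell_size):
            block = [
                image[y][x]
                for y in range(top, min(top + cell_size, rows))
                for x in range(left, min(left + cell_size, cols))
            ]
            grid_row.append(average_color(block))
        grid.append(grid_row)
    return grid


RED, GREEN, BLUE = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)

# Hypothetical 2x4 image: the left half is red, the right half mixes green and blue.
image = [
    [RED, RED, GREEN, BLUE],
    [RED, RED, BLUE, GREEN],
]
cells = sample_grid(image, 2)   # one row of two averaged cells
```

The right-hand cell ends up as an even mix of green and blue, even though no single source pixel had that color.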

Every color is encoded by three or four real numbers that correspond to the levels of the color components (channels). The most popular color model for representing images on the Web is RGB, which combines levels of red, green and blue to represent a wide variety of colors. For example, to get the color “Spring Green” you add 100% of green and 50% of blue; in digital media it is represented by the triplet of numbers 0.0, 1.0 and 0.5. To capture the picture fragment from the figure above, you need 90 numbers (6 columns by 5 rows, with 3 numbers per cell).
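As a small worked example, the triplet for “Spring Green” can be converted to the 8-bit-per-channel form commonly used on the Web, and the count of 90 numbers can be checked. The helper names and the round-half-up rule are assumptions of this sketch; other tools may round 0.5 differently.

```python
def to_8bit(level: float) -> int:
    """Scale a component level from [0.0, 1.0] to an integer in [0, 255]."""
    return int(level * 255 + 0.5)   # round half up (an assumption of this sketch)


def to_hex(color) -> str:
    """Format an (r, g, b) triplet of levels as a #RRGGBB string."""
    return "#" + "".join(f"{to_8bit(c):02X}" for c in color)


SPRING_GREEN = (0.0, 1.0, 0.5)      # red, green, blue levels from the text
hex_code = to_hex(SPRING_GREEN)

# Numbers needed for the 6-by-5 fragment with three channels per cell.
columns, rows, channels = 6, 5, 3
numbers_needed = columns * rows * channels   # 90
```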

A video is a sequence of still images (frames) that represents a dynamic scene. Digital video is stored as a sequence of sampled images taken at equal time intervals.

Reducing the Data Size

Sampled data requires a vast amount of storage. For example, storing 10 minutes of stereo sound captured at 44,100 Hz, with each sample stored as a 2-byte (16-bit) number, requires 105,840,000 bytes; storing 10 minutes of color video with a resolution of 720x480 pixels at 25 frames per second, at 3 bytes per pixel, requires 15,552,000,000 bytes.
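The two storage figures can be reproduced with simple arithmetic. The function names are invented for this sketch, and the value of 3 bytes per pixel (one byte per RGB channel) is an assumption; it is the value that yields the video figure quoted above.

```python
def audio_bytes(seconds, sampling_rate_hz, channels, bytes_per_sample):
    """Size of uncompressed audio."""
    return seconds * sampling_rate_hz * channels * bytes_per_sample


def video_bytes(seconds, width, height, frames_per_second, bytes_per_pixel):
    """Size of uncompressed video."""
    return seconds * frames_per_second * width * height * bytes_per_pixel


TEN_MINUTES = 10 * 60   # seconds

audio = audio_bytes(TEN_MINUTES, 44_100, channels=2, bytes_per_sample=2)
video = video_bytes(TEN_MINUTES, 720, 480, frames_per_second=25, bytes_per_pixel=3)

print(f"{audio:,} bytes of audio")   # 105,840,000 bytes of audio
print(f"{video:,} bytes of video")   # 15,552,000,000 bytes of video
```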

In most cases, the sampled data is intended to be transmitted to viewers or listeners via the Web. The capacity to transmit a specific amount of data over time is called bandwidth and is expressed in bits per second (bps). At the time of writing, the average internet user in the US has 1.5 Mbps of bandwidth for downloading content from the web; transmitting the 10 minutes of video and audio content from the example above would take more than 23 hours.
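The transfer time follows from the sizes in the previous example and the bandwidth. The sketch below assumes that 1.5 Mbps means 1,500,000 bits per second and ignores any protocol overhead.

```python
AUDIO_BYTES = 105_840_000        # 10 minutes of stereo audio, from the example above
VIDEO_BYTES = 15_552_000_000     # 10 minutes of video, from the example above
BANDWIDTH_BPS = 1_500_000        # 1.5 Mbps, assumed to mean 1,500,000 bits per second

total_bits = (AUDIO_BYTES + VIDEO_BYTES) * 8   # bandwidth is measured in bits
seconds = total_bits / BANDWIDTH_BPS
hours = seconds / 3600

print(f"{hours:.1f} hours")      # 23.2 hours
```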

To reduce the amount of memory required to store or transmit the data, specialized algorithms are used. These algorithms take the nature of the data into account, which helps to increase the compression ratio. There are two classes of compression algorithms: lossless and lossy. A lossless compression algorithm preserves all digital samples exactly as they were. A lossy compression algorithm alters the samples, discarding some of the media information in exchange for a smaller size. Most lossy algorithms allow setting parameters that limit the output size. This helps to keep the overall size of the digital audio or video data reasonable for transmission over the Web.
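The difference between the two classes can be illustrated with a deliberately simple sketch. The lossless half uses the general-purpose `zlib` module from the Python standard library, which is not a media codec. The lossy half is only a crude stand-in (coarser rounding with a hypothetical `step` parameter); real lossy codecs are far more sophisticated.

```python
import struct
import zlib

# The sample values from the sound example, repeated to give zlib something to compress.
samples = [35, -41, -6, 4, -4, 11, -3, 0, 4] * 100

# Store each sample as a 2-byte little-endian signed integer.
raw = struct.pack(f"<{len(samples)}h", *samples)

# Lossless: every sample comes back exactly as it was.
packed = zlib.compress(raw)
restored = list(struct.unpack(f"<{len(samples)}h", zlib.decompress(packed)))

# Lossy stand-in: keep only multiples of `step`. A larger step means fewer
# distinct values to store and a larger error after reconstruction.
step = 8
coarse = [round(s / step) for s in samples]
approximate = [c * step for c in coarse]
worst_error = max(abs(a - s) for a, s in zip(approximate, samples))
```

Here `restored` equals `samples`, while `approximate` only stays within half a step of the original values.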

A computer program that encodes or decodes digital media data is often called a codec (from coder-decoder). A codec implements a lossy or lossless algorithm that converts samples to a data stream suitable for storage or streaming. Sometimes a codec communicates directly with the analog device that performs sampling or playback, or is even designed to be part of the device, to reduce the amount of data passed through computer memory.

Storing the Media Data

The sampled and compressed data can be stored as one big continuous stream of bits in a file on a computer hard drive. To play back such data, however, the whole file has to be read and processed from beginning to end.

The nature of digital media often requires delivery of a specific fragment rather than the whole file content. In practice, the media data is divided into small chunks (packets) that can each be played. The duration of a chunk is a fraction of a second. Normally the packets are delivered in sequence, but if some data is damaged or lost, it is still easy to recover at least a portion of the media. Dividing the data into packets also makes it possible to read a fragment at a specific position without reading the whole data stream, which enables playback from a random position and fast rewind or fast-forward with preview.
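Packetizing and seeking can be sketched as below. The packet duration of 100 milliseconds, the fixed packet size and the function names are assumptions for illustration; real formats often use variable-size packets and keep an explicit index.

```python
BYTES_PER_SECOND = 44_100 * 2 * 2   # stereo, 2 bytes per sample
PACKET_MS = 100                     # hypothetical packet duration: a tenth of a second
PACKET_SIZE = BYTES_PER_SECOND * PACKET_MS // 1000   # 17,640 bytes


def packetize(data: bytes, packet_size: int) -> list:
    """Split a continuous stream into fixed-size packets."""
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


def packet_index_at(position_ms: int, packet_ms: int = PACKET_MS) -> int:
    """Find the packet that contains the given playback position."""
    return position_ms // packet_ms


stream = bytes(BYTES_PER_SECOND * 3)      # three seconds of silence
packets = packetize(stream, PACKET_SIZE)

# Start playback at 1.5 seconds without touching the earlier packets.
start = packet_index_at(1500)
remaining = packets[start:]
```

Positions are kept as integer milliseconds so that the division is exact.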

A single digital media file or stream may contain several sources of information, e.g. a video track and a sound track. The data is stored in a special format called a multimedia container format. A container format makes it possible to store and index multiple packetized media streams for simultaneous playback. A multimedia container format can store audio and video information as well as a media description, alternative content, text captions or still images.

Figure 4. A file-container with audio and video data
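The idea of a container that stores and indexes several packetized streams can be sketched with a toy model. The `Packet` class, the `mux` and `build_index` functions and the timestamps are all hypothetical; no real container format is laid out this way byte for byte.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    stream: str         # which track the packet belongs to
    timestamp_ms: int   # when the packet should start playing
    payload: bytes      # the compressed media data


def mux(*streams):
    """Interleave packets from several streams in timestamp order."""
    merged = [packet for stream in streams for packet in stream]
    return sorted(merged, key=lambda p: p.timestamp_ms)   # the sort is stable


def build_index(container):
    """Record, for each stream, where its packets sit in the container."""
    index = {}
    for position, packet in enumerate(container):
        index.setdefault(packet.stream, []).append(position)
    return index


video = [Packet("video", t, b"V") for t in (0, 40, 80)]   # one frame every 40 ms
audio = [Packet("audio", t, b"A") for t in (0, 50)]

container = mux(video, audio)
index = build_index(container)
```

Because packets that play at about the same time sit next to each other, a player can read the container sequentially and keep audio and video in step.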

The Popular Media Formats

This section lists popular container formats that may contain audio and/or video streams, along with popular audio and video formats.

Table 1. The popular media container formats

| Name | File Extensions |
|---|---|
| MPEG-4 Part 14 | .mp4, .m4v, .m4a, .mov, .f4v |
| Advanced Systems Format | .asf, .wmv, .wma |
| Ogg | .ogv, .oga, .ogx, .ogg, .spx |
| Flash Video | .flv |
| RealMedia | .rm, .ra |
| Audio Video Interleave | .avi |
| MPEG-1 Part 1 | .mpg, .mpeg, .mp2, .mp3, .m2v, .m2a |


Table 2. The popular audio formats

| Name | Found in Containers |
|---|---|
| MPEG-1 Audio Layer 3 | MPEG-1, MPEG-4, ASF, Flash Video, AVI |
| Advanced Audio Coding (AAC) | MPEG-1, MPEG-4, ASF |
| Windows Media Audio | ASF |
| Vorbis | Ogg |
| Speex | Ogg |
| Waveform audio format | .wav file |
| Free Lossless Audio Codec | Ogg, .flac file |
| RealAudio | RealMedia |


Table 3. The popular video formats

| Name | Found in Containers |
|---|---|
| Windows Media Video | ASF |
| H.263 | Flash Video, RealMedia |
| H.262/MPEG-2 Video | MPEG-1, ASF, AVI |
| Theora | Ogg |
| RealVideo | RealMedia |