Sustainability of Digital Formats: Planning for Library of Congress Collections


Moving Images >> Quality and Functionality Factors


Scope
This discussion concerns a variety of media-independent digital moving image formats and their implementations. Some formats, e.g., QuickTime and MPEG-4, allow for a very wide range of implementations compared to, say, MPEG-2, an encoding format whose possible implementations are relatively more constrained. The following discussion proceeds in terms of two broad groupings:
• Format implementations for end-user applications
• Format implementations for specialized professional applications
Readers are cautioned that these two groups overlap and that many formats are implemented in both groups.

Format implementations for end-user applications are intended for home or classroom viewing, training, and the like. In effect, these implementations represent a successor to VHS tapes and a media-independent alternative to DVDs. The stream of imagery is often bitmapped and organized as GOPs (groups of pictures) to which temporal compression has been applied; one frame is fully reproduced while "difference information" is recorded for adjacent frames. In some implementations, the sound and picture data is interleaved: the stream carries one unit of picture, followed by one unit of sound, then the next unit of picture, and so on. Meanwhile, some formats support object-based implementations, i.e., a given file contains one or more picture objects, one or more sound objects, and a timeline that carries instructions about sequence and synchronization.
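The GOP and interleaving ideas just described can be sketched schematically. The short Python example below is purely illustrative: the 15-frame display-order pattern and the interleaving of picture and sound units are hypothetical and do not reproduce the layout of any particular format.

# Illustrative sketch of a GOP (group of pictures) and of audio/video
# interleaving. The 15-frame pattern and the chunking are hypothetical.

GOP_PATTERN = "IBBPBBPBBPBBPBB"  # one common display-order pattern

def describe_gop(pattern):
    for n, frame_type in enumerate(pattern):
        note = {
            "I": "intra-coded: the frame is fully reproduced",
            "P": "predicted from an earlier frame (difference information)",
            "B": "predicted from earlier and later frames (difference information)",
        }[frame_type]
        print(f"frame {n:2d}: {frame_type} - {note}")

def interleave(video_units, audio_units):
    """Alternate one unit of picture with one unit of sound, as in an interleaved stream."""
    stream = []
    for v, a in zip(video_units, audio_units):
        stream.append(("video", v))
        stream.append(("audio", a))
    return stream

describe_gop(GOP_PATTERN)
print(interleave(range(3), range(3)))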

Any moving image format may support animation in a broad sense, i.e., files may contain a sequence of artist-drawn frames. There is, however, a special group of end-user-oriented formats that dynamically generate animations and/or interactive programs for end-users. The leading example is Flash SWF_7, a format for animated shorts for web delivery or for playback on personal computers. At this Web site, animations that consist of only a few frames, e.g., an animated GIF_89a, are covered under the heading of Still Images.

Format implementations for specialized professional applications support the production, post-production, and distribution of broadcast content or theatrical motion pictures. Roughly speaking, production means shooting, while post-production means editing. These format implementations may contain temporally compressed bitstreams, e.g., the most common variants of MPEG-2. Many professional implementations, however, maintain the integrity of individual frames in a stream, important for the maintenance of quality, since temporal compression is ruled out by definition. Maintenance of frame integrity also supports editorial reuse of footage, i.e., when end-users wish to cut frame-accurate segments from a video file and use them in a new production. Formats like DPX_2 and MJP2_FF (Motion JPEG 2000 file format) encode individual frames as separate files or as distinct entities within a wrapper. The frame-integrity approach is central to such activities as the Digital Cinema specification (see DCDM_1_0) for the distribution of digital movies to theaters. Meanwhile, other activities in industry and in public archives have begun to explore the use of frame-integrity approaches for video and film preservation. Some frame-integrity formats are object-based and incorporate sound elements as objects, while others require that sound elements be managed separately.

The dichotomy between the two implementation groups is weak or, to put it another way, some formats (like MPEG-4_FF_2) may be used in both end-user applications, characterized by low-resolution linear streams or interactive programs, and in specialized professional applications, characterized by high quality, frame-integrity files suitable for archiving. Similarly, images encoded within both groups may be created via cameras, graphic arts technology, or a combination of the two. Some formats can support both stream and frame-integrity data; the MPEG-2 encoding format, for example, may contain bitstreams that employ temporal compression or that contain "all I-frame data" in which each frame is a distinct entity. Some formats are limited as to picture size; others may be capable of embracing high definition television (up to 1920 samples by 1080 lines), Digital Cinema's 4K images (4096 samples by 2160 lines), or images of "arbitrary" shape and size.
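One practical way to see whether a given file carries an "all I-frame" bitstream or a temporally compressed one is to tally picture types with a stream-analysis tool. The sketch below uses the ffprobe utility from the FFmpeg project (assumed to be installed; the file name is a hypothetical example): an all I-frame stream reports no P or B pictures, while a temporally compressed stream reports mostly P and B pictures.

# Count I/P/B picture types in the first video stream of a file with
# ffprobe (FFmpeg). Assumes ffprobe is on the PATH; "example.mpg" is a
# hypothetical file name.

import subprocess
from collections import Counter

def count_picture_types(path):
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-select_streams", "v:0",
         "-show_frames",
         "-show_entries", "frame=pict_type",
         "-of", "csv=p=0",
         path],
        capture_output=True, text=True, check=True,
    )
    return Counter(line for line in result.stdout.split() if line)

print(count_picture_types("example.mpg"))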

This is written at a time of profound change for the industries that produce moving image content. Broadcasters are in the throes of the FCC-mandated switch to Advanced Television Systems Committee (ATSC) digital transmissions, as cable and satellite distribution entities make parallel changes. At the same time, moving image production and post-production (shooting and editing) are moving from film and tape to disk, server, and workstation, while digital formats will be employed for theatrical distribution in the foreseeable future.

Computer games form a category of their own; at the moment, these are not collected by the Library of Congress.

Normal rendering for moving images
Normal rendering for moving images is associated with end-user implementations, and consists of playback of a single image stream with accompanying sound in mono or stereo through one or two speakers (or equivalent headphones). Player software provides user control over some picture elements (brightness, hue, contrast), some sound elements (volume, tone, balance), and navigation (fast forward, go-to-segment, etc.). Normal rendering would also allow playback through software that allows the analysis and excerpting of picture and sound. Normal rendering must not be limited to specific hardware models or devices and must be feasible for current users and future users and scholars.

For formats implemented in specialized professional applications, the same type of normal rendering does not obtain. Some professional authoring or editing systems, e.g., those used in non-linear video editing, permit playback in a manner comparable to that described in the preceding paragraph. But in other contexts, for example, working with the DPX format's picture-only sets of frame images, normal playback will only occur "downstream," i.e., from a newly made file that has been derived from the DPX source. This is especially true when the image information represents an extended range of brightnesses or color values (what we sometimes call "rich data") that exceeds the capabilities of existing display devices. For many specialized professional implementations, normal playback will be afforded by, say, the MPEG file that has been produced from the DPX picture information and separate soundtrack data.

Clarity (support for high image resolution)
Clarity refers to the degree to which "high image resolution" content may be reproduced within this format. Generally speaking, this factor pertains to bitmapped representations and not to vector-based animations like Flash SWF_7 files, which are inherently scalable and often employ color in a fully managed way. In this document, the term clarity is meant broadly, referring to the factors that will influence a careful (even expert) viewing experience. A real test of clarity occurs when the reproduction is repurposed, e.g., when selected footage is edited into a new video program.

One key factor in the clarity of moving image files is picture size and, in general, greater picture size means greater clarity. Today, many digital video files are derived from standard definition television signals,1 with an aspect ratio of 4:3, and most producers create files at "quarter screen" or "full screen" resolution.2 These files may be oriented for display on traditional television monitors (non-square pixels) or digital-television or computer monitors (square pixels). Picture sizes and aspect ratios encountered in digital file formats will change as high definition television (HDTV) comes into use as a source for media-independent digital files. The ATSC digital television standard is focused on terrestrial, cable, and satellite broadcast and permits as many as eighteen configurations, both standard and high definition, many of which have a 16:9 aspect ratio.
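The relationship among stored picture size, pixel shape, and displayed aspect ratio is simple arithmetic: the display aspect ratio is the stored width multiplied by the pixel aspect ratio, divided by the stored height. The sketch below is illustrative only; the 10/11 pixel aspect ratio shown is the value commonly cited for 4:3 NTSC-derived material, and practice varies by production chain.

# Display aspect ratio = (stored width x pixel aspect ratio) / stored height.

def display_aspect(width, height, pixel_aspect):
    return width * pixel_aspect / height

print(display_aspect(640, 480, 1.0))      # square-pixel full screen: 1.333... (4:3)
print(display_aspect(720, 480, 10 / 11))  # ~1.36; the 704 active samples give exactly 4:3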

It is worth noting that the largest number of moving image files encountered on the Web or, for that matter, on DVDs and other disk media, have been derived from pre-existing or newly created television signals. When such a video stream serves as the source, then the resulting digital file inherits the source's aspect ratio and maximum possible picture size, as well as the form of the frame scan, i.e., interlaced or progressive. Video produced in the past (and much being produced today) is interlaced, although the implementation of ATSC digital television will result in increasing use of progressive imaging. An interlaced video frame consists of two fields, each containing one-half of the image information, captured about 1/60th of a second apart. This time offset allows for some subject movement between the two field exposures and thus blur within a single video frame. In contrast, a progressive scan captures all of the information in a single exposure, which provides greater clarity but requires more processing power to render, since all of the lines must be displayed more or less instantaneously. Video Streams as Sources for Files discusses the influence of video source materials on clarity.
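A minimal sketch of how two interlaced fields weave together into one frame appears below (Python with NumPy). It is illustrative only: real deinterlacing must also cope with the subject movement that occurs during the roughly 1/60th of a second between the two field exposures, which a simple weave does not address.

# Weave two interlaced fields into a single frame.

import numpy as np

def weave(top_field, bottom_field):
    lines, samples = top_field.shape
    frame = np.empty((lines * 2, samples), dtype=top_field.dtype)
    frame[0::2] = top_field      # even lines from the first field
    frame[1::2] = bottom_field   # odd lines from the second field
    return frame

top = np.zeros((240, 720), dtype=np.uint8)     # half the lines of a 480-line frame
bottom = np.ones((240, 720), dtype=np.uint8)
print(weave(top, bottom).shape)                # (480, 720)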

In the case of specialized professional applications, video streams may be the source for the imagery but other, higher resolution sources may also be used, e.g., scanned 35mm motion picture film, especially in implementations that maintain the integrity of frames. For content derived from motion picture film, a desirable outcome is the creation of a stream of coherent frames, even when interlacing is employed.3 Another clarity element that applies with greater force for specialized professional applications has to do with bit depth and dynamic range. Image sources like actual scenes (shot with a digital camera) or scanned film have an extended dynamic range (a wide range of brightnesses). When the representation of the dynamic range is important, as it often is in professional applications, the stakes are raised for a format's ability to encode extended bit depth and to support linear as well as logarithmic representations of brightness information.
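The appeal of logarithmic encoding for extended dynamic range can be shown with a simplified model. The sketch below assumes a 10-stop scene range and 1024 (10-bit) code values; the figures are illustrative only and do not correspond to any particular specification. A linear encoding spends half of its code values on the brightest stop and almost none on the darkest, while a logarithmic encoding allocates codes evenly across the stops.

# Simplified comparison of linear vs. logarithmic 10-bit encodings.
# Assumes a 10-stop scene range; purely illustrative.

STOPS = 10
CODES = 1024

def codes_per_stop_linear():
    # Code value proportional to light: each stop down receives half as many codes.
    return [round(CODES * 2 ** -s - CODES * 2 ** -(s + 1)) for s in range(STOPS)]

def codes_per_stop_log():
    # Code value proportional to the logarithm of light: each stop receives equal codes.
    return [CODES // STOPS] * STOPS

print("linear:", codes_per_stop_linear())   # [512, 256, 128, 64, ...]
print("log:   ", codes_per_stop_log())      # [102, 102, 102, ...]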

Clarity in a moving image file is also affected by the presence of the visible artifacts that may result from the application of compression or watermarking. Although it is possible to have a media-independent file containing an uncompressed or losslessly compressed video stream,4 most files produced today are compressed, within each bitmap (in a JPEG manner) and temporally across a group of pictures. Typical compression encodings for video stream formats conform to specifications from the Moving Picture Experts Group (MPEG-1, MPEG-2, MPEG-4), various implementations of Apple QuickTime, Motion JPEG, Microsoft Windows Audio Video Interleaved, and Windows Media Video. Each format allows for variations in how compression is accomplished, and these may affect clarity. For example, for the MPEG-2 formats, a producer may select between options for applying types of compression to GOPs (sets of frames) or whether to encode at a fixed or variable bit rate. Some moving image formats are wrappers, e.g., QuickTime, that permit creators to select from a number of encoding options, e.g., Cinepak and Sorenson.
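Because a wrapper such as QuickTime may carry any of several encodings, it is often useful to inspect what a particular file actually contains. The sketch below again relies on ffprobe from the FFmpeg project (assumed to be installed; the file name is hypothetical) and simply lists the codec, picture size, and bit rate reported for each stream inside the wrapper.

# List the streams inside a wrapper file and the encodings they use.

import json
import subprocess

def probe_streams(path):
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "stream=index,codec_type,codec_name,width,height,bit_rate",
         "-of", "json",
         path],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)["streams"]

for stream in probe_streams("example.mov"):
    print(stream)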

Clarity (in a temporal sense) is influenced by frame rate. In order to reduce the amount of data, some files are made at frame rates lower than the nominal 30 frames per second for video and 24 for film. Individual bitmaps are not affected, but fewer frames are shown and the overall viewing experience may be diminished by jerkiness.

The MPEG-2 and MPEG-4 standards (and possibly others) include conformance points that define bitstreams with particular characteristics, summarized as levels and profiles. Levels have to do with picture size and data rates, key factors for quality, while the profiles have to do with the encoding tools and the resulting complexity of the compression. Manufacturers of decoders indicate the performance of their product in terms of these conformance points. Meanwhile, compression encodings for frame-integrity formats, still an emergent technology, favor still-image algorithms, e.g., JPEG and JPEG 2000.
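A few of the widely cited MPEG-2 profile-and-level conformance points illustrate how levels bound picture size and data rate. The figures below are the maximums commonly quoted for the standard and are given for illustration; the specification itself is the authoritative source.

# Commonly quoted MPEG-2 profile@level limits (illustrative).

MPEG2_CONFORMANCE_POINTS = {
    "MP@ML":   {"max_size": (720, 576),   "max_mbit_per_s": 15},  # main profile at main level
    "MP@H-14": {"max_size": (1440, 1152), "max_mbit_per_s": 60},  # high-1440 level
    "MP@HL":   {"max_size": (1920, 1152), "max_mbit_per_s": 80},  # high level
}

for point, limits in MPEG2_CONFORMANCE_POINTS.items():
    width, height = limits["max_size"]
    print(f"{point:8s} up to {width}x{height}, {limits['max_mbit_per_s']} Mbit/s")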

For video stream formats, decisions made about picture size, frame rate, interlacing, encoding, and other variables yield files with different data rates (or average data rates, if variable) for the resulting bitstream. A rule of thumb is that, for a given picture size and compression codec, the higher the data rate, the greater the clarity of the image. Some encoding types and/or codecs provide greater compression (and thus lower bit rates) together with higher quality than a competing encoding type and/or codec.
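The arithmetic behind this rule of thumb is straightforward. The sketch below computes the uncompressed data rate for a standard-definition stream sampled at 8-bit 4:2:2 (on average two bytes per pixel) and the rate that results from an assumed compression ratio; both figures are illustrative.

# Back-of-the-envelope data rates for a standard-definition video stream.

def uncompressed_mbps(width, height, fps, bytes_per_pixel=2):
    return width * height * fps * bytes_per_pixel * 8 / 1_000_000

def compressed_mbps(width, height, fps, compression_ratio):
    return uncompressed_mbps(width, height, fps) / compression_ratio

print(round(uncompressed_mbps(720, 480, 30), 1))      # ~165.9 Mbit/s uncompressed
print(round(compressed_mbps(720, 480, 30, 30), 1))    # ~5.5 Mbit/s at an assumed 30:1 ratio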

Fidelity (support for high audio resolution)
Fidelity refers to the degree to which "high fidelity" sound may be reproduced within this format. In this context, the term fidelity is meant broadly, referring to the factors that will influence a careful (even expert) listening experience. Strictly speaking, this factor is limited to formats that reproduce sound waveforms, where a real test of fidelity occurs when the reproduction is repurposed, e.g., a "master file" is used as the basis for the master for a new audio-CD music release.

The quality of the audio in the video source material will influence fidelity just as incoming picture quality influences clarity. And just as image compression is a fact of life for the current generation of media-independent digital video files, so too is the compression of the accompanying audio. Production practices generally keep audio quality in step with picture quality, and the generalization offered above applies here as well: the higher the bit rate, the greater the fidelity of the sound.
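The same arithmetic applies to the accompanying audio. The figures below are illustrative: uncompressed 48 kHz, 16-bit stereo runs at 1,536 kbit/s, while the 192 kbit/s figure is offered only as an example of a typical perceptually coded stereo rate, a reduction of roughly 8:1.

# Audio bit-rate arithmetic, for comparison with the video figures above.

def pcm_kbps(sample_rate_hz, bits_per_sample, channels):
    return sample_rate_hz * bits_per_sample * channels / 1000

uncompressed = pcm_kbps(48_000, 16, 2)   # 1536.0 kbit/s for 48 kHz 16-bit stereo
compressed_example = 192                 # kbit/s, an example perceptually coded rate
print(uncompressed, round(uncompressed / compressed_example))   # 1536.0, 8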

For more information on sound in digital formats, see also Sound: Quality and Functionality Factors.

Support for multiple sound channels
Multiple channel support refers to the degree to which formats may represent sound in terms of aural space or as complementary streams. Engineers often refer to representations of aural space as sound field, e.g., as stereo or surround sound. Many media-independent video formats offer surround sound, a typical feature of media-dependent DVD distribution of moving image content. Multiple channel sound may also feature two or more complementary signal streams that provide alternate or supplemental content, e.g., narration in French and German, commentary from film directors or actors, sound effects separate from music, karaoke content, or the like.

Functionality beyond normal video rendering
Various video image formats offer functional features beyond those mentioned above in Normal Rendering. The following paragraphs highlight some of the most interesting of these.

Scalability is a significant feature of formats like RealVideo and MPEG-4. In a moving image context, scalability refers to the format's ability to adjust to the particulars of a given file-streaming event by, for example, varying the number of frames an end-user receives each second (as with RealVideo's Scalable Video Technology) or varying individual frames in terms of picture quality (as with MPEG-4's spatial scalability). These kinds of capabilities permit the use of streaming protocols and server software that dynamically judge the Quality of Service (QOS) during delivery to an end-user and then adjust the stream to keep the video accurate to time. As the QOS decreases on a slow network, the quality of the delivered stream will also decrease, e.g., because a few frames are omitted each second or because lower-quality images are presented.
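The kind of decision such a server makes can be sketched conceptually: when measured throughput falls, drop to a variant with a lower frame rate or lower picture quality so that delivery stays accurate to time. The variant ladder and thresholds below are hypothetical and for illustration only.

# Conceptual sketch of QOS-driven scaling during streaming delivery.

VARIANTS = [  # (label, required throughput in kbit/s, frames per second)
    ("full quality", 1500, 30),
    ("reduced frame rate", 900, 15),
    ("reduced picture quality", 400, 15),
]

def choose_variant(measured_kbps):
    """Pick the best variant the measured throughput can sustain."""
    for label, required_kbps, fps in VARIANTS:
        if measured_kbps >= required_kbps:
            return label, fps
    return VARIANTS[-1][0], VARIANTS[-1][2]   # fall back to the lowest rung

print(choose_variant(1200))   # ('reduced frame rate', 15)
print(choose_variant(300))    # ('reduced picture quality', 15)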

Another beyond-normal-rendering feature is interactivity for end-users, offered by formats like Macromedia's Flash SWF_7, widely used to produce animations that can be played in Macromedia's proprietary software or saved by the producer in other digital video formats like QuickTime.

Interactivity and many other elaborations are found in the emerging MPEG-4 standard, although it remains to be seen which of these will be fully implemented and widely adopted. For example, the MPEG-4 specification provides for alpha channels in its video codec and a special scene description language (BIFS, for Binary Format for Scenes) that permit video to be composited after decoding in real time. This feature allows producers to separate background and foreground image information, which in turn supports end-users' interaction with objects, e.g., changing the color of a house in a scene, or tagging a soccer player in order to watch all of her moves. MPEG-4 also offers support for multiple media objects, e.g., a map (which may even represent three-dimensional space) or a picture-in-picture window, all of which can be navigated by the end-user. Several commentators call attention to the influence of Virtual Reality Modeling Language on the development of MPEG-4. MPEG-4 also includes support for stereoscopic scenes, arbitrarily shaped objects, and stream and object synchronization. The latter feature will permit the presentation of user-selected audio streams, like the film director's or actor's commentary in a multi-audio-track DVD.

1 Standard definition in this context refers to signals that meet the specifications of the two American standards bodies: the currently prevalent National Television System Committee (NTSC) broadcast signal or the new Advanced Television Systems Committee (ATSC) digital standard. ATSC standard definition pictures may be interlaced or progressive scan (e.g., 480i or 480p).

2 Some industry jargon is derived from teleconferencing standards: QCIF for Quarter Common Intermediate Format (176 non-square-pixels by 120 or 144 lines), CIF for Common Intermediate Format (352 non-square-pixels by 240 or 288 lines). Full screen in a 4:3 aspect ratio is sometimes referred to as CCIR 601 (525/60) (720 non-square-pixels by 480 lines) while HDTV is High Definition Television (various, including 1920 pixels by 1080 lines), typically with 16:9 aspect ratio. In a square-pixel environment, quarter-screen picture size is 320x240, while full screen is 640x480 pixels. These sizes are for the United States NTSC and ATSC standards. In European nations as well as some others, standard definition will yield a quarter-screen image of 352x288 non-square-pixels and a full screen image of 720x576. See http://www.iki.fi/znark/video/conversion/ for an interesting discussion of this and related topics.

3 Sophisticated technology is required to render content that has been transferred from film to video, especially in the context of conventional interlaced video. The technology must compensate for the fact that 24 fps (frames per second) film is transferred to 30 fps video by using what is called 3-2 pulldown. For each second of play time, 12 of the total of 24 film frames are reproduced three-video-fields-apiece (a video frame and a half), and the other 12 two-video-fields-apiece, thus producing the one-second-long total of 60 video fields. (Just to be precise, in the United States NTSC standard the frame rate is 29.97 frames per second, so 60 video fields, i.e., 30 video frames, actually last slightly longer than one second.) Things are easier in Europe, where both the film and video frame rates are 25. One advantage of ATSC digital television is that video can be recorded at 24 fps, which eliminates the need for 3-2 pulldown and will yield file sizes smaller than if the video frame rate were 30.
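The 3-2 cadence described in this note can be written out directly. The sketch below simply expands 24 film frames into 60 video fields by repeating frames alternately three and two fields apiece; it is illustrative only and ignores the 29.97 timing adjustment mentioned above.

# Expand one second of 24 fps film into 60 interlaced video fields (3-2 pulldown).

def pulldown_fields(film_frames=24):
    fields = []
    for n in range(film_frames):
        repeat = 3 if n % 2 == 0 else 2   # alternate three fields, two fields
        fields.extend([f"film frame {n}"] * repeat)
    return fields

print(len(pulldown_fields()))   # 60 fields, i.e., 30 interlaced video frames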

4 Uncompressed video streams may be placed in the QuickTime wrapper or in proprietary OMF (Open Media Framework) files associated with Avid editing systems, and the encoding employed in the frame-based Motion JPEG 2000 format may be used to produce lossless (or lossy) compressed bitstreams. Other options also exist.



Last Updated: 01/21/2022