The motivations for the choice of audiovisual standards are described in detail in the MICE evaluation report, and so will only be summarised here. In the traditional circuit-switched videoconferencing world, the set of standards called p x 64 (also known as H.320) is by far the most widely used.
The H.320 standards [lio] cover video, voice and data; they include the H.261 standard for compressed video, several standards for voice, and H.221 framing for multimedia streams over serial links. Usually the video and voice data are multiplexed within an H.221 stream, which is isochronous, for transmission over serial links. However, the raw H.261 video coding is not isochronous, which makes it suitable for use over packet-switched networks, and workstation-based software implementations therefore treat H.261 video and audio separately. This also allows us to route video and audio separately, for instance sending the audio via direct Internet multicast and the video via a CMMC, or to prioritise them differently, so that sites on low-bandwidth links receive only the audio. Any inter-stream synchronisation must then be achieved by adapting the playout buffers of the individual streams.
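As an illustration of such playout adaptation, the sketch below keeps a per-stream playout delay estimated from observed packet timing, and aligns two streams by delaying the earlier one to match the later. The field names, smoothing constant and the assumption of a common clock are illustrative, not part of any MICE or H.261 specification.

    # A minimal sketch of inter-stream synchronisation via playout-buffer
    # adaptation. Field names, the smoothing factor and the assumption
    # that media and arrival timestamps share a clock are illustrative.

    class PlayoutBuffer:
        """Per-stream playout buffer with an adaptive target delay."""

        def __init__(self, alpha=0.998):
            self.alpha = alpha      # smoothing factor for the delay estimate
            self.delay = 0.0        # current target playout delay (seconds)
            self.queue = []         # (media_timestamp, payload) pairs

        def on_packet(self, media_ts, arrival_ts, payload):
            # Track the transit time with an exponentially weighted average;
            # the playout delay adapts slowly to the observed network delay.
            transit = arrival_ts - media_ts
            self.delay = self.alpha * self.delay + (1 - self.alpha) * transit
            self.queue.append((media_ts, payload))

        def playout_time(self, media_ts, extra=0.0):
            # A packet is rendered at its media timestamp plus the adaptive
            # delay, plus any extra delay imposed for inter-stream alignment.
            return media_ts + self.delay + extra

    def synchronise(audio: PlayoutBuffer, video: PlayoutBuffer):
        """Push both playout points out to the larger of the two delays,
        so packets carrying the same media timestamp are rendered together.
        Returns the extra delay to impose on each stream."""
        target = max(audio.delay, video.delay)
        return target - audio.delay, target - video.delay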
Encoding the video is much more onerous than decoding it: whilst only a powerful workstation can be considered for encoding, decoding requires significantly less processing power. Software implementations of H.261 can reduce the encoding load by omitting the search for motion vectors; this results in slightly lower-quality video, but brings the processing requirements within the bounds set by today's workstation technology.
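The following sketch illustrates why dropping the motion search helps: a full encoder would evaluate many candidate displacements per macroblock, whereas here each block is compared only against the co-located block of the previous frame (in effect a zero motion vector). The block size matches H.261 macroblocks, but the threshold and decision logic are illustrative assumptions.

    # A minimal sketch of encoding without a motion-vector search. Instead
    # of searching a window of candidate displacements for each macroblock,
    # only the co-located block of the previous frame is considered.
    # The skip threshold and frame representation are illustrative.

    import numpy as np

    BLOCK = 16              # H.261 motion compensation uses 16x16 macroblocks
    SKIP_THRESHOLD = 500.0  # illustrative value, not from the standard

    def code_frame(prev, cur):
        """Classify each macroblock of `cur` (a 2-D luminance array):
        skipped, or coded as a difference against the co-located block."""
        decisions = []
        h, w = cur.shape
        for y in range(0, h, BLOCK):
            for x in range(0, w, BLOCK):
                c = cur[y:y+BLOCK, x:x+BLOCK].astype(np.int32)
                p = prev[y:y+BLOCK, x:x+BLOCK].astype(np.int32)
                # One comparison per block, versus hundreds in a full search.
                sad = np.abs(c - p).sum()
                if sad < SKIP_THRESHOLD:
                    decisions.append((y, x, "skip"))        # send nothing
                else:
                    decisions.append((y, x, "inter-diff"))  # code c - p, MV = (0,0)
        return decisions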
For a point-to-point conference over a circuit-switched network, it makes sense to integrate the voice with the video (typically using H.221 framing). For multi-way conferencing over a packet-switched network, this is less suitable: voice is much more sensitive to delay and jitter than video, and it can be compressed to use much less bandwidth (between 4.8 and 16 Kbps). By using silence suppression, and exploiting the fact that usually only one person is talking at any one time in a conference, it is possible to serve a large audience with voice for a modest outlay of bandwidth, provided multicast is used. The last five IETF (Internet Engineering Task Force) meetings have demonstrated that audio can be multicast to audiences worldwide (500 sites in 17 countries on 5 continents), though they have also demonstrated that some networks and routers may suffer from very bad delay variation and occasional periodic loss bursts. Very often this turns out to be due to bad router configuration (typically traffic being processed in the router's main CPU), and can be resolved once the trouble spot has been located.
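A minimal sketch of silence suppression follows, assuming frames of 16-bit linear PCM samples; the energy threshold and "hangover" length (frames kept after speech ends, so word endings are not clipped) are illustrative and would need tuning in practice.

    # A minimal sketch of silence suppression: only frames whose energy
    # exceeds a threshold are transmitted, with a short hangover so that
    # the tails of words are not clipped. All constants are illustrative.

    def frame_energy(samples):
        """Mean squared amplitude of a non-empty frame of 16-bit PCM samples."""
        return sum(s * s for s in samples) / len(samples)

    def suppress_silence(frames, threshold=1_000_000.0, hangover=5):
        """Yield only the frames worth sending; silence between talkspurts
        is dropped entirely, saving bandwidth on the multicast tree."""
        active = 0
        for frame in frames:
            if frame_energy(frame) >= threshold:
                active = hangover      # speech detected: reset hangover timer
            if active > 0:
                active -= 1
                yield frame            # transmit during speech and hangover
            # otherwise: silence, and nothing is sent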
It is also possible to implement voice relays which mediate between sites using different coding schemes; heavier compression is often adopted where only reduced communication bandwidth is available. Such a relay can be one of the facilities of the MICE CMMC. While the standard method for encoding voice is 64 Kb/s PCM, there are several coding standards for reducing its bandwidth; examples are CELP [che], ADPCM [ben], LPC [ata] and GSM [var].
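As a sketch of how such a relay might be structured, the fragment below decodes each inbound packet to linear PCM and re-encodes it for the outbound coding. The Codec interface is hypothetical, standing in for real CELP, ADPCM, LPC or GSM codec implementations.

    # A minimal sketch of a voice relay mediating between two coding
    # schemes. The Codec interface is hypothetical; a real relay would
    # plug in actual codec implementations behind it.

    from typing import Protocol

    class Codec(Protocol):
        def decode(self, payload: bytes) -> bytes: ...  # to 16-bit linear PCM
        def encode(self, pcm: bytes) -> bytes: ...      # from 16-bit linear PCM

    def relay_packet(payload: bytes, inbound: Codec, outbound: Codec) -> bytes:
        """Transcode one voice packet from the inbound to the outbound coding."""
        pcm = inbound.decode(payload)   # e.g. 64 Kb/s PCM from a well-connected site
        return outbound.encode(pcm)     # e.g. GSM towards a low-bandwidth site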