
CU-SeeMe Desktop VideoConferencing Software

by Tim Dorcey, Cornell University.

From Connexions, Volume 9, No. 3, March 1995

Introduction

     CU-SeeMe is a desktop videoconferencing system designed for use on the
Internet or other IP networks. It runs on Macintosh and Windows platforms
and requires no special equipment for video reception beyond a network
connection and gray-scale monitor. Video transmission requires a camera and
digitizer, a combined version of which can be purchased for the Macintosh
for under $100. CU-SeeMe video is represented on a 16-level gray scale and
is provided in either of two resolutions: 320 x 240 pixels (half the linear
dimensions of NTSC television) or 160 x 120 pixels. At this writing, audio
is available in the Macintosh version only (ed. note: audio for Windows was
released in August 1995), with audio processing adapted from the "Maven"
program, written by Charlie Kline at the University of Illinois.
When network conditions or equipment deficiencies prohibit reliable audio,
ordinary telephone connections can be employed. In addition to basic
audio/video services, CU-SeeMe offers crude white board capabilities in the
form of a "slide window" that transmits full-size 8-bit gray scale still
images and allows for remote pointer control. A plug-in architecture has
also been developed to allow third parties to write binary modules that
extend the capabilities of CU-SeeMe. Two-party CU-SeeMe conferences can be
initiated by one participant connecting directly to the other, whereas
larger conferences require that each participant connect to a "CU-SeeMe
reflector," a Unix computer running CU-SeeMe reflector software that serves
to replicate and redistribute the packet streams.


     The main objective in the development of CU-SeeMe was to produce an
inexpensive videoconferencing tool that would be usable today. As well as
providing direct benefit to its users, we expected that valuable lessons
could be learned about how videoconferencing actually works in practice,
how the experience should be organized, what features are necessary to
support multi-party conferencing, and so on. While others worked to advance
the state of the art in video compression, high-speed networking, and other
low-level technologies necessary to support high quality videoconferencing,
we hoped to facilitate the accumulation of experience that would provide
impetus for those efforts and guide their direction.


     Similar efforts have focused on Unix workstations, for which several
tools are currently available, including "nv" [1], "ivs" [2] and "vic" [3].
In fact, it was Paul Milazzo's [4] demonstration of such a tool in 1991
that inspired development of CU-SeeMe.  However, it is our belief that the
value of a communication tool is largely determined by the number of people
that can be reached with it.  We sought to increase the accessibility of
videoconferencing by focusing on low-end, widely available, computing
platforms. Currently, CU-SeeMe can be found in places ranging from the
grade school to the national laboratories--often with a connection between
them. It has appeared in over 40 countries [5] and on every
continent--including Antarctica [6]. This article presents a brief overview
of two central components of the CU-SeeMe software: Conference Control and
Video Encoding.   


Conference Control

     Each participant in a CU-SeeMe conference periodically transmits a
single packet that advertises their interests with respect to all of the
other participants. These advertisements are termed "OpenContinue" packets,
in recognition of the fact that in a connectionless protocol the
information necessary to open a connection is no different than that used
to continue it. The OpenContinue packet consists of a standard header that
is common to all CU-SeeMe packets, followed by a section of general
information about the issuing participant. Then, for each other participant
that the sender is aware of, follows a collection of variables that express
the sender's interests with respect to that other participant (e.g., I want
their video, I want their audio, I want their slides, etc.). Reflectors
examine OpenContinue packets to develop source-specific distribution lists
for the various media involved in a conference and then forward the packets
to all other participants. Because the protocol requires each participant to
process dynamic status information for every other conference participant,
it does not scale well to large conferences, say larger than about 30. 
However, it does provide considerable control, in a robust fashion, over
the details of smaller conferences. Furthermore, various possibilities,
beyond the scope of this discussion, exist for extending the protocol to
larger conferences.
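
     As a rough illustration of the idea, and not the actual CU-SeeMe wire
format, the following Python sketch models an OpenContinue advertisement as
a record of per-participant interests and shows how a reflector might derive
source-specific distribution lists from a set of them. The names
OpenContinue, Interest and build_distribution_lists, and the field layout,
are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical model: field names are illustrative, not the CU-SeeMe wire format.
@dataclass
class Interest:
    want_video: bool = False
    want_audio: bool = False
    want_slides: bool = False

@dataclass
class OpenContinue:
    sender: str                                     # issuing participant
    interests: dict = field(default_factory=dict)   # other participant -> Interest

def build_distribution_lists(adverts):
    """Map (source, medium) -> set of recipients that asked for that stream."""
    lists = {}
    for advert in adverts:
        for source, interest in advert.interests.items():
            for medium in ("video", "audio", "slides"):
                if getattr(interest, "want_" + medium):
                    lists.setdefault((source, medium), set()).add(advert.sender)
    return lists

adverts = [
    OpenContinue("alice", {"bob": Interest(want_video=True, want_audio=True)}),
    OpenContinue("bob", {"alice": Interest(want_video=True)}),
    OpenContinue("carol", {"alice": Interest(want_audio=True),
                           "bob": Interest(want_video=True)}),
]
lists = build_distribution_lists(adverts)
```

In this model every advertisement carries one entry per other participant,
so a conference of N participants involves N x (N - 1) entries in total (870
for N = 30), which is consistent with the scaling limit described above.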


    The primary motivation for developing the reflector software was the
absence of multicast capabilities on the Macintosh. We have therefore been
careful about extending its role beyond the replication and distribution of
packets, allowing it to add value where it can, but avoiding dependence on
it. We have, however, increasingly come to appreciate the degree to which
the reflector architecture allows for fine tuning of the data streams sent
to each recipient, and we do not expect this to become any less important
when multicast becomes more widely available.


Video Encoding

     The predominant objective here was to devise algorithms that would be
fast enough for real time videoconferencing on the typical Macintosh
platforms that were available in mid-1992, which mainly consisted of
68020- and 68030-based machines. The decoding algorithm, in particular, needed to
be extremely efficient in order to support multiple incoming video streams.
The main technique for achieving these goals was to begin with a massive
decimation of the input video signal, and then to process what remained in
a manner that took maximal advantage of the capabilities of the target
processors, as described below. For simplicity, discussion will focus on
the smaller size video format, which has become most popular in practice.
Video processing proceeds in three basic steps: 1) decimation, 2) change
detection, and 3) spatial compression.


     The first step in the video encoding is to decimate the captured 640 x
480 pixel video frame down to 160 x 120, with each pixel represented on a
4-bit gray scale. In comparison to full size, 16-bit color, this represents
a 64:1 reduction in the amount of data to be handled by subsequent
processing. The user is provided with brightness and contrast controls to
adjust the mapping of input intensities to the 16-level gray scale. With
proper adjustment and reasonable lighting conditions, surprisingly good
picture quality can be achieved.  
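
     The decimation step and the 64:1 figure can be sketched as follows.
This is a minimal Python illustration under stated assumptions: the input is
taken to be an 8-bit gray frame, decimation is done by plain subsampling
(every 4th pixel in each direction), and the brightness/contrast mapping is
a simple linear one. The article does not say whether CU-SeeMe subsamples or
averages, nor what form its mapping takes, so the function decimate and its
parameters are hypothetical.

```python
def decimate(frame, brightness=0, contrast=1.0, step=4):
    """Subsample an 8-bit gray frame by `step` and quantize to 16 levels (0..15)."""
    out = []
    for y in range(0, len(frame), step):
        row = []
        for x in range(0, len(frame[0]), step):
            # Assumed linear mapping: scale around mid-gray, then shift.
            v = (frame[y][x] - 128) * contrast + 128 + brightness
            v = min(255, max(0, int(v)))   # clamp to the 8-bit input range
            row.append(v >> 4)             # keep the top 4 bits: 16 gray levels
        out.append(row)
    return out

# Synthetic 640 x 480 test frame (a diagonal gradient).
frame = [[(x + y) % 256 for x in range(640)] for y in range(480)]
small = decimate(frame)

full_bits = 640 * 480 * 16     # full size, 16-bit color
small_bits = 160 * 120 * 4     # decimated, 4-bit gray
ratio = full_bits // small_bits
```

The ratio works out to 64 because the pixel count falls by a factor of 16
(4 in each direction) and the bits per pixel by a factor of 4.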


     Next, the video frame is subdivided into 8x8 pixel squares, and a
square is selected for transmission if it differs sufficiently from the
version of it that was transmitted most recently. The index used to measure
square similarity is the sum of the absolute values of all 64 pixel
differences, with an additional multiplicative penalty for pixel
differences that occur near one another. Inclusion of the
multiplicative penalty was based on the assumption that changes in adjacent
pixels are more visually significant than isolated changes. Its exact form
was dictated by computational convenience, devised so as not to introduce
any additional computational burden except during the initialization of a
look-up table at program load time. To account for the possibility that
updates may be lost in transit, transmission is also triggered if a square
has not been transmitted for a specified number of frames (refresh
interval). This ensures that a lost update will not corrupt the image into
the indefinite future. 
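
     A minimal Python sketch of this selection rule follows. The exact form
of the penalty used in CU-SeeMe is not given in this article, so the version
here (doubling the contribution of any changed pixel that has a changed
neighbor in the same row) is only an assumed stand-in, computed directly
rather than through a look-up table. The constants PENALTY, THRESHOLD and
REFRESH_INTERVAL are likewise hypothetical.

```python
SQUARE = 8
PENALTY = 2              # hypothetical multiplier for changes next to other changes
THRESHOLD = 40           # hypothetical change threshold
REFRESH_INTERVAL = 100   # hypothetical: force a resend after this many frames

def square_difference(new, old):
    """Sum of absolute pixel differences over an 8x8 square, with changes
    that sit next to another change in the same row weighted more heavily."""
    total = 0
    for new_row, old_row in zip(new, old):
        diffs = [abs(a - b) for a, b in zip(new_row, old_row)]
        for i, d in enumerate(diffs):
            left = i > 0 and diffs[i - 1] > 0
            right = i + 1 < SQUARE and diffs[i + 1] > 0
            total += d * PENALTY if (left or right) else d
    return total

def should_send(new, last_sent, frames_since_sent):
    """Send on a significant change, or when the refresh interval has expired."""
    if frames_since_sent >= REFRESH_INTERVAL:
        return True
    return square_difference(new, last_sent) > THRESHOLD

flat = [[0] * SQUARE for _ in range(SQUARE)]

isolated = [row[:] for row in flat]
isolated[0][0] = 5
isolated[4][4] = 5       # two changed pixels, far apart

adjacent = [row[:] for row in flat]
adjacent[0][0] = 5
adjacent[0][1] = 5       # the same two changes, side by side
```

With these assumed values, the two isolated changes score 10 while the same
two changes placed side by side score 20, so clustered changes reach the
threshold sooner.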


     Once a square has been selected for transmission, a locally developed
lossless compression algorithm is applied. The most interesting feature of
this algorithm is the degree of parallelism it achieves by manipulating
rows of eight 4-bit pixels as 32-bit words. This allows for high-speed
performance on a 32-bit processor, but also complicates exposition of the
algorithm. The basic idea is that a square row is often similar to the row
above it, and, when it is different, it is likely to be different in a
consistent way across columns. Letting r[i] represent a 32-bit word
containing the ith row of pixels in a square, compression is based upon the
representation:  r[i] = r[i-1] + d[i], where d[i] is constructed from
either a 4-, 12-, 20- or 36-bit code. If d[i] is thought of as being composed
of eight 4-bit pixel differences, then spatial redundancy in the vertical
direction suggests that the differences will all be near to 0, whereas
correlation in the horizontal direction suggests that they will be near in
value to each other. Under those assumptions, the sorts of d[i] that are
most likely to occur can be predicted and a scheme devised to represent
them using a relatively small number of bits. Roughly speaking, for each
d[i], a 4-bit code is used to specify a) a common component of all pixel
differences (restricted to the range [-2,2]) and b) whether there are
0, 8, 16 or 32 bits of additional data to represent deviations around that
constant component. In reality, of course, d[i] is not composed of
individual pixel differences, since carry bits can occur in the 32-bit
arithmetic, but the technique still seems to work reasonably well,
achieving around 40% compression (compressed size is approximately 60% of
original). Although a 40% compression ratio may not appear impressive,
recall that this is for images that have already undergone a 64:1
decimation from the original, and that the information discarded at the
outset was that most suitable for compression.
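
     The row representation can be illustrated with a short Python sketch.
It covers only the packing of eight 4-bit pixels into a 32-bit word and the
word-level differencing r[i] = r[i-1] + d[i]; the variable-length code table
used to encode each d[i] is not specified in this article and is not
reproduced here. Treating the first row as differenced against zero, and
placing the leftmost pixel in the high-order nibble, are assumptions, as are
the function names.

```python
MASK32 = 0xFFFFFFFF

def pack_row(pixels):
    """Pack eight 4-bit pixels into one 32-bit word, leftmost pixel highest."""
    word = 0
    for p in pixels:
        word = (word << 4) | (p & 0xF)
    return word

def unpack_row(word):
    """Inverse of pack_row: split a 32-bit word into eight 4-bit pixels."""
    return [(word >> shift) & 0xF for shift in range(28, -1, -4)]

def row_deltas(rows):
    """d[i] = r[i] - r[i-1] (mod 2**32), so that r[i] = r[i-1] + d[i].
    The first row is differenced against zero (an assumption)."""
    deltas, prev = [], 0
    for row in rows:
        word = pack_row(row)
        deltas.append((word - prev) & MASK32)
        prev = word
    return deltas

def rebuild_rows(deltas):
    """Recover the pixel rows by accumulating the deltas."""
    rows, prev = [], 0
    for d in deltas:
        prev = (prev + d) & MASK32
        rows.append(unpack_row(prev))
    return rows

# Possible code lengths: a 4-bit selector plus 0, 8, 16 or 32 extra bits.
CODE_SIZES = [4 + extra for extra in (0, 8, 16, 32)]

rows = [[3] * 8, [4] * 8, [4] * 8]
deltas = row_deltas(rows)

# One pixel drops by 1: the borrow propagates across the whole word.
borrow = row_deltas([[0, 0, 0, 0, 0, 0, 0, 1], [0] * 8])[1]
```

Here a row that is uniformly one level brighter than the row above gives
d[i] = 0x11111111 and an identical row gives d[i] = 0, the two cases the
scheme is built to represent cheaply. The borrow example gives 0xFFFFFFFF
even though only one pixel changed, which shows why d[i] is not strictly a
set of independent pixel differences.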


     The CU-SeeMe video encoding has proven to be surprisingly robust
against packet loss when the subject matter is a typical talking head.
Often, the only observable effect of high packet loss is a reduction in
frame rate. This can be explained as follows. After decimation and
compression, it is almost always the case that the information required to
update a frame will fit within a single (less than 1500 byte) packet. Hence
a lost packet corresponds to a lost frame update, rather than a partial
frame update. Secondly, when the subject is a talking head, most squares
are either changing every frame or not at all. Hence a square update that
is lost was likely to be replaced in the next frame anyway. Only when a
square changes, but then does not change in the next frame, will corruption
occur.
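
     The single-packet claim can be checked with a back-of-the-envelope
calculation from the figures quoted in this article. The Python sketch below
ignores packet headers and per-square addressing overhead, which is an
assumption made for simplicity, so the result is an upper estimate.

```python
SQUARES_PER_FRAME = (160 // 8) * (120 // 8)   # 20 x 15 = 300 squares
RAW_BYTES_PER_SQUARE = 8 * 8 * 4 // 8         # 64 pixels at 4 bits = 32 bytes
PACKET_BYTES = 1500                           # upper bound quoted in the text
COMPRESSED_PERCENT = 60                       # approximate figure from the text

# Squares that fit in one packet at the quoted compression,
# ignoring headers and per-square overhead (assumption).
squares_per_packet = PACKET_BYTES * 100 // (RAW_BYTES_PER_SQUARE * COMPRESSED_PERCENT)
fraction_of_frame = squares_per_packet / SQUARES_PER_FRAME
```

Under these assumptions roughly 78 of the 300 squares, or about a quarter of
the picture, can change in one frame before the update spills into a second
packet.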


     This suggests a simple method for embedding several frame rates within
a single video stream. Say, every 3rd frame, transmit a square if it is
different from what it was in the preceding frame OR if it is different from
what it was 3 frames ago. Recipients who desire a slower frame rate could accept
every 3rd update and still get a clean video image. The observations
regarding packet loss suggest that this would not introduce a great deal of
additional traffic; i.e., a square that differs from 3 frames ago is also
likely to differ from the preceding frame. Some variation on this scheme
will be implemented in an upcoming version of CU-SeeMe to better support
conferences involving participants with differing network capacity.
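
     The proposed scheme can be sketched in Python as follows. This is an
illustration of the rule as stated above, not the implementation planned for
CU-SeeMe: squares are reduced to single characters, exact inequality stands
in for the thresholded difference test, and both receivers are assumed to
start with a correct copy of frame 0. The function names are hypothetical.

```python
def squares_to_send(frame_index, current, previous, three_ago, period=3):
    """Indices of squares to transmit for this frame. On every `period`-th
    frame a square is also sent if it differs from `period` frames ago."""
    anchor = frame_index % period == 0
    send = []
    for i, square in enumerate(current):
        changed = square != previous[i]
        if anchor and square != three_ago[i]:
            changed = True
        if changed:
            send.append(i)
    return send

def simulate(frames, period=3):
    """Return the final images held by a full-rate receiver and by a
    receiver that applies only every `period`-th update."""
    fast = list(frames[0])
    slow = list(frames[0])
    for n in range(1, len(frames)):
        base = frames[n - period] if n >= period else frames[0]
        update = squares_to_send(n, frames[n], frames[n - 1], base, period)
        for i in update:
            fast[i] = frames[n][i]
        if n % period == 0:            # slow receiver skips the other updates
            for i in update:
                slow[i] = frames[n][i]
    return fast, slow

# Each string is one frame of four squares; a letter is a square's content.
frames = ["AAAA", "BAAA", "BAAA", "BAAC", "BDAC", "BAAC", "BAAC"]
fast, slow = simulate(frames)
anchor_update = squares_to_send(3, frames[3], frames[2], frames[0])
```

At frame 3 only the last square differs from frame 2, but the first square
is also sent because it differs from frame 0. That extra square is what
keeps the slow receiver's image clean, since it never saw the update sent at
frame 1.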


Conclusions

     CU-SeeMe employs a conference control protocol that has proven to be
quite robust and allows for the expression of detailed state regarding the
relations of each conference participant to each other participant. In
conjunction with the reflector software it allows for customized
distribution of conference media, so that nothing is transmitted unless it
will be used. The protocol is limited in the size of conference that it
can serve, but it can be extended.
     

     CU-SeeMe video is encoded in an ad hoc format that was designed for a
particular family of desktop machines that were widespread in mid-1992. What
it lacks in mathematical elegance, it makes up for in quickness. As
computing power increases, it will eventually become obsolete. Nonetheless,
it played an essential role in making CU-SeeMe interesting enough to
warrant further work, and it, or its derivatives, will continue to play an
important role for some time.

CU-SeeMe can be obtained from ftp://cu-seeme.cornell.edu/pub/video.


Notes and References

[1]  Frederick, Ron.  "Experiences with real-time software video
compression."  In Proceedings of the Sixth International Workshop on Packet
Video.  Portland, 1994.

[2]  Turletti, Thierry.  "The INRIA Videoconferencing System (IVS)." 
Connexions, Vol. 8, no. 10, 1994.

[3]  McCanne, Steve & Jacobson, Van.  vic man page.  Available at
ftp://ftp.ee.lbl.com/conferencing/vic as of December, 1994.

[4]  Milazzo, Paul.  Informal demonstration of dvc at the December 1991
meeting of the IETF in Santa Fe.

[5]  This estimate is based on subscribers to the CU-SeeMe mailing list as
of January 1995.

[6]  "Tomorrow's TV Today." Time, October 10, 1994, p. 24.