The Lack of PC-based Video-Telephony:
How I Lost A Technology Bet
Gordon Bell
9 May 2002
Technical Report
Microsoft Research
Microsoft Corporation
The payment of a June 1996 bet with
“By
· 1-10 frames per second,
· videophone with
· telephone quality voice.
At least 20% of the video-enabled PCs (i.e. 10% of the total shipments) shipped
during the preceding quarter will use a videophone at least once a day or at
least 5-times per week to participate in a teleconference. The connection will
be via any kind of telephony, cable modem, LAN, or WAN.
Not included in the platform count are: Network Computers[1] (NCs);
telecomputers (combined phone and PC); game computers; and television
computers (PCs embedded in television sets).
Winner (Gray) treats loser (
Loser (
The question remains: how and when will computer-assisted communication, including video, become as ubiquitous as email or surfing the web? It is just a matter of time, but …
Unlike my perfect record of winning bets against optimists (Table 1), I lost this one because I was the optimist. Clearly, the bet was based on “wishful thinking” that video ubiquity could and perhaps should occur, not on analysis. CU-SeeMe was a product that college students used to keep in touch. Camera support in NetMeeting, PowerPoint, and NetShow would surely stimulate ubiquity within five years. Pornographic sites demonstrated that people liked video contact. $20 camera assemblies were being incorporated into cameras and portables. Intel had introduced cards and cameras for video conferencing that would stimulate the market. I was caught up in the enthusiasm of the .com build-up, with startups introducing video-mail that might “kick start” the bandwagon. All these things happened. In 2002, a large number of camera products can be easily connected to three- to five-year-old PCs. Still, the ubiquity has not happened.
It was also a defiant bet against naysayers
– “videophones were tried between NY and
To help establish our research, we observed why tele-enabled communication, meetings, and conferences were failing. In 2001, Gemmell wrote “How to Fail at Video Conferencing[4]”, which gave what we believe is the recipe for failure. While we are certain that its prescriptions almost guarantee failure, we will have to wait until computer-enhanced communication, with or without video, succeeds before we understand the heuristics for success.
1. Voice quality must be at least competitive with telephony. This implies
low latency and negligible jitter. Furthermore, good microphones are required.
The ITU latency guideline for IP telephony is 150 ms for voice. In 2002, we
measured Instant Messenger’s voice latency on a LAN to be under 130 ms.
However, Internet latencies are unpredictable, and software may add latency
(we measured NetMeeting’s delay to be 290 ms). High latencies will drive users
to the telephone for audio, meaning two calls (video + audio) have to be made[5].
To avoid echoes, the user must have a headset, handset, or echo-canceling
speakerphone. None of these is expensive or difficult to install, but they are
not part of the typical PC package, and no network effect exists to induce
customers to buy or install them.
Today, the millions of teleconferences have poor audio quality, so it is a low
bar to surpass. Still, the IP-based products are inferior today. Poor audio
makes attending teleconferences much more fatiguing, as the listeners strain to
recognize and understand the speakers. We believe that computer-enhanced audio
quality can be substantially better than a telephone or commonly used
conference phones (it could be CD-quality with surround sound).
In essence, the telephone, which uses about 8 Kbps of its 64 Kbps channel, is
just good enough for two-way and conference communication. Until video provides
some additional value at virtually zero cost, it is likely to remain a curiosity.
2. Video technology should increase, not decrease, “presence”. Four attributes
of current videoconferencing systems detract from presence: (1) postage-stamp
size, (2) low frame rates that produce jerky images, (3) low-fidelity, fuzzy or
fragmented images, and (4) lack of eye-to-eye communication, called gaze
awareness[6] (the camera is not looking at you and you do not appear to be
looking at participants).
Improved cameras and displays, together with increased bandwidth and processing
power, will ameliorate all four problems. Image size, frame rate, and fidelity
are a function of processing speeds, bandwidth, and displays. The “actual” image
is limited by display size. If “life size” is required, we’ll simply have to
wait for larger displays, and more of them, to provide more pixels.
Gemmell and others have shown[7] that gaze can be adjusted by tracking the head
and correcting the eye location so that the participants “appear” to be looking
at one another. In 2002, gaze correction is still limited by vision research.
3. Setting up a videoconference has to be as easy as making a POTS telephone
call. The interface, directories, and keyboard have to be competitive with
one-button or 10+ digit phone dialing. After a fairly complex installation,
current video conference systems depend on complex directory systems and have
call setup times of 20 seconds to buffer and synchronize.
Broadband links to the home are also essential because computers must always be
on and connected to minimize call setup time.
Messaging services (e.g., ICQ, AIM, Windows Messenger) facilitate call setup.
Again, video conferencing could be simpler and faster than telephones. Setting
up a Windows Messenger (WM) IP telephone call is faster than dialing when
others are online. When many numbers are called, the computer can find and dial
a number faster than a person can look it up.
4. Video conferencing must be as ubiquitous as telephones. The network
effect[8] needs to take hold so that every caller can assume that the callee
is video-enabled. While corporate users report a 70% use of telephonic
conferencing, only 16% and 20% use video and web conferencing, respectively.
AT&T’s original Picturephone Meeting Service was introduced in 1979 in a dozen
cities. While the service was discontinued, it set a high-water mark for video
conferencing, using $500K room installations and dedicated 384 Kbps links, that
no other system attained. In 2001, only 75K group video conferencing units and
40K desktop video units were sold, according to Wainhouse Research.
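The latency figures quoted in item 1 can be checked against the ITU guideline with a few lines of code. The following is a minimal Python sketch, not part of the original measurements: the constant, the dictionary labels, and the function names are assumptions introduced here for illustration, and the "under 130 ms" Instant Messenger figure is treated as its 130 ms upper bound.

```python
# Compare the voice latencies quoted in item 1 with the ITU guideline.
# All names below are illustrative assumptions, not from the report.

ITU_GUIDELINE_MS = 150  # ITU latency guideline for IP telephony voice (from the text)

# Measurements quoted in the text (2002).
MEASURED_LATENCY_MS = {
    "Instant Messenger (LAN)": 130,  # reported as "under 130 ms"; upper bound used
    "NetMeeting": 290,
}


def latency_margin_ms(latency_ms, guideline_ms=ITU_GUIDELINE_MS):
    """Return headroom under the guideline; a negative value means it is exceeded."""
    return guideline_ms - latency_ms


def meets_guideline(latency_ms, guideline_ms=ITU_GUIDELINE_MS):
    """Return True when the latency is at or below the guideline."""
    return latency_ms <= guideline_ms


if __name__ == "__main__":
    for name, latency in MEASURED_LATENCY_MS.items():
        verdict = "within" if meets_guideline(latency) else "exceeds"
        margin = latency_margin_ms(latency)
        print(f"{name}: {latency} ms, {verdict} guideline (margin {margin:+d} ms)")
```

Running the script prints each measurement with its margin relative to the guideline; a negative margin marks a configuration that would likely push users back to the telephone for audio.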
Increased processing speed and better connectivity reduce the impediments to both IP telephony and video-telephony. Whether gaze awareness, larger screens, better microphones, cameras, and displays are the impediments to ubiquity is unclear. Social issues may ultimately be the impediment. However, it is clear that video telephony cannot occur until IP telephony is commonplace.
Three technologies will open pathways to computer-based personal video communication:
1. IP telephony must penetrate the voice market to some degree (>10%), providing proof that voice-over-IP quality of service and convenience rival circuit-switched telephony.
2. Instant messaging with voice and video (e.g., AOL’s Instant Messenger, Windows Messenger), as used in corporations and by children, must achieve wide-scale use. This implies more standardization than AOL is willing to provide. The facility and services available via Windows XP and IP phone providers are being fought by SBC because they disintermediate both local and long-distance telephony. Others believe such systems invade privacy.
3. 2.5G and 3G mobile services offering video phones could drive the need for
both stationary and computer video telephony, provided adequate bandwidth is
allocated.
All of these technologies are certain to aid in closing the technology gap for computer-based voice and video telephony. In addition to solving the technical problems, simplicity of installation and use has to be comparable to the phone while providing significant benefit. Still, social issues may overwhelm the technology. The “vain” argue that users have to be prepared to be always “on camera”. The “over-committed” argue that users can run a parallel process during a video call, such as handling email or reading. Indeed, the most useful benefits of video-assisted communication are probably (1) a few sensory feedback bits from receiver to transmitter, and (2) forcing an attendee to be “present” and appear to be connected.
Now, I still believe that all the forms of personal video
communication will be regularly in use within the next decade along the same
lines as the bet – but don’t ask me to bet on it.
Table 1.
When | What | Who
1987-Present |  | 3 annual teams
1990 | AT&T would either destroy or divest itself of NCR within 5 years | Wilmot
1991 | By 1996 supercomputing would be done predominately with >1000 processors. | Hillis*
1994 | In 1994 a Microunity multimedia processor would be delivered, | Mousouris*
1993 | By 1996 video-on-demand would have 500,000 subscribers. | Reddy et al
1993 | In 2003 AI would be thought more important than the transistor. | Reddy et al
1993 | In 2003 Cars that drive themselves would be available by sale. | Reddy
1996 | By 2001, video cameras would be shipped and in use on most PCs. |
1997 | By 2002, 10,000 workstations would communicate at Gbits/sec. | Reddy et al*
1998 | By 2001, and 2002 One billion Internet users worldwide. | Negroponte*
1999 | By 2004, Light Emitting Polymer units would exceed Liquid Crystal Displays | Hauser
1999 | In 2004, Electronic Ink will out-ship LCDs as measured by unit area. | Wilcox
*Yet to be paid.
[1] At the time of the bet, there was much industry discussion, most notably by Larry Ellison and Scott McNealy, that thin-client, fixed-function, browser-centric computers would soon dominate. This has yet to happen.
[2] AT&T introduced PMS (PicturePhone Meeting Service) in 1979. Intel and Digital used it between
[3] Gemmell,
J. and
[4] Private communication.
[5] The results were obtained by measuring the difference between the sound source and its arrival.
[6] Finn, K.E., A.J. Sellen, and S.B. Wilbur, editors, Video-Mediated Communication, L. Erlbaum Associates, Publishers, Mahwah, NJ, 1997.
[7] Gemmell, Jim, Zitnick, Larry, Kang, Thomas, and Toyama, Kentaro, Software-enabled Gaze-aware Videoconferencing, IEEE Multimedia, Vol. 7, No. 4, Oct-Dec 2000, pp. 26-35
[8] Metcalfe’s Law says the value to a subscriber is equal to the number of subscribers, N; hence the network value to all subscribers is N**2.