CyberAll: A Personal Store for Everything

 

Abstract

CyberAll is a project to archive all my personal and professional information content including that which has been computer generated (since the mid 70s), scanned and recognized, and recorded on VHS tapes.  The archive includes books, correspondence (i.e. letters, memos, and email), transactions, papers, photos and photo albums, and video taped lectures.  In 2000, only 10 gigabytes, costing $100 incrementally, are required, and the accumulation rate is projected to be 1-2 gigabytes per year.  Encoding, indexing, and data-management costs swamp storage costs – by 1000:1 or more. The clear challenge is to automate the capture, search, and retrieval so that it comes close to the storage cost.  It is inconceivable to think of manually managing or purging this electronic file since the storage costs are only $100.  Indeed, copies are stored in 2 or 3 locations for redundancy.

Introduction

Michael Lesk (Lesk, 1997) provides a comprehensive view of the problem of storing everything at a national or international scale, including the problems of encoding existing and evolving libraries of all types.   In March 2000, Brewster Kahle’s non-profit organization, www.archive.org, archived the 1 billion web pages in 14 Terabytes. It is beginning to archive the output of 20 television channels.

In contrast, CyberAll is aimed at the personal scale.  It is my sole store for all personal documents, photos, music, and videos as described by Bush over 50 years ago (Bush, 1947) and more recently by Gates (Gates, 1997). 

CyberAll holds personal reference articles e.g. Amdahl’s Law, special computer manuals e.g. Digital PDP-1, CDC 6600[1], and magazines and clipped news articles e.g. Economist graphs that heretofore would be stored in files or on shelves. At present, only books remain in “atomic” form; but it will include all books as soon as they become e-books.  Already, three books I authored have been scanned and are on my website (http://research.microsoft.com/~gbell).

Within the next decade personal computers will store a terabyte. In 2000, 40 gigabyte drives costing $400 are more than adequate to hold the content for most of a professional’s lifetime reading, presentations, and audio recordings.  A CD encoded at 128 kilobits per second can be stored at a cost of $0.60.  A typical user’s CD collection requires about the same space as the scanned and OCR’d versions of all his paper-based files.

The next phase of CyberAll will deal with voice capture of conversations, interviews, meetings, and presentations.  Recording all the audio conversations in one’s personal and professional lives would require over a terabyte when encoded at 8 kilobits per second.  Since a terabyte costs about $10,000 now and should be $1,000 in 5 years, recording conversations seems like a reasonable near-term goal. Clearly a ubiquitous, high-quality 360 degree camera/microphone that would attach to a personal computer would be a useful and welcome device.

Video is more challenging.  For home use, a terabyte holds only 500 hours of DVD quality videos and 1500 CDs, but more compression increases the content by a factor of at least 10. Recording a lifetime of everything seen via video requires 100 terabytes.   Doing this economically is still a decade or more away – now it would cost more than $10,000 per year.  But in two decades, it should cost only $100 per year.  

This paper presents the decisions, logistics, time, and costs to CyberAll my documents.  Nearly all of the basic technologies for cyberization are improving at a rate approximating Moore’s Law: getting two times better every 18 months.  There is extraordinary progress in all areas, ranging from processor speed, storage capacity, scanner speed and accuracy, camera resolution and software, OCR accuracy and capability (e.g. scan to HTML), audio and video encoding, printing and display, and standards. Thus, one can always rationalize waiting for a better system or standard, since things will be SO much better in 18 months.  However, the cost of content capture is increasing also, so it is important to start now, especially with the compelling economics.
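The price projection quoted earlier (a terabyte at about $10,000 now and about $1,000 in 5 years) can be checked against the 18-month doubling rate with a short calculation. This is only a sketch: the function name `projected_price` is hypothetical, and it assumes that "two times better every 18 months" translates directly into a halving of price per byte.

```python
def projected_price(price_now: float, years: float,
                    doubling_months: float = 18.0) -> float:
    """Price after `years`, assuming cost per byte halves every `doubling_months`.

    The 18-month figure is the Moore's Law rate quoted in the text; applying it
    to storage prices is an assumption made for illustration.
    """
    halvings = years * 12.0 / doubling_months
    return price_now / (2.0 ** halvings)


# A terabyte costs about $10,000 in 2000 (figure from the text).
price_in_5_years = projected_price(10_000.0, years=5)
print(f"Projected terabyte price after 5 years: ${price_in_5_years:,.0f}")
```

Five years is a little over three doubling periods, so the price falls by roughly a factor of ten, in line with the $1,000 figure.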

CyberAll raises questions about:

Longevity and Long-Term Retrievability – Paper and film can last for centuries (although most of our 50+ year old film and photos show fading), while current digitized formats are almost certain to be unreadable in 10, 20, or 50 years because of media, platform/file, and application obsolescence.  So, digital content requires frequent conversion to new media and often to new formats (because the old formats are no longer supported).  Historically, these format conversions have been lossy.  ASCII is the only format that has stood the test of time, but it carries no semantics or application behavior.  Automatic and failsafe backup is critical.  CyberAll requires that digital documents never be lost and be preserved forever.

Access and Access Control – Access to personal information must be easily controlled by the owner. Privacy suggests that, by default, others should not have access to the content. However, those of us with public web sites need to be able to map information in our CyberAll more simply into a variety of increasingly public sites, rather than having to maintain an array of separate sites.  In essence, the more public sites are cached slaves of CyberAll.

Databases And Retrieval Tools For Non-Textual Information – Handling photos, photo albums, conversations, audio, and video is a fertile, new product area.  Current products have a long way to go to satisfy the very wide range of CyberAll users.

Usability – Building and using CyberAll today is tedious and requires technical skill.  Just setting up CyberAll is a major problem.  New products, standards, and services are needed to make using it a painless process so that everyone in a family could easily store items that would be forever retained.  Storing items needs to be as easy as discarding them; in fact, storage is just one step away from the recycling bin.

Motivation

The motivation for CyberAll ranges from the technical challenges (i.e., “because we can” or will soon be able to) to a desire to provide an archive for our progeny.  High on the list is simply coping with the exponential increase in the amount of information (e.g. web pages, pictures, audio, and video) that is becoming part of our personal and professional lives.  Given the tools to easily produce documents en masse, we are well on our way to converting ourselves into a world of filing clerks!  This cycle has to stop.

CyberAll is consistent with or parallel to Nathan’s Laws of Software: (1.) Software is a gas that expands to fill the container it is in. (2.) Software grows until it is limited by Moore’s Law. (3.) Software makes Moore’s Law possible. (4.) Software is only limited by human ambition and expectation.  One could replace the word “software” with the word “data” and get Nathan’s four Laws of Data.

Many share my “pack rat” mentality that wants to store everything in case we need it to remind us, or in case we need to remind others. This is a strong motivation that creates an infinite storage appetite.  In essence, CyberAll is an almost infinite attic that can store anything that could conceivably be used to answer some future question or to help explain to others (e.g. our progeny) what it was like when.  It is both a memory aid and a device to help tell stories.  For some, this might mean storing everything from second grade spelling tests and grade cards to home videos.

Co-existing with Paper

The notion of the paperless office has been out of fashion for several decades. Rather, we have built ever more productive tools to generate paper. Surprisingly, the amount of paper and file folders only grows with inflation, while printer capacity continues to grow at a 20% annual rate!  File storage capacity and area devoted to paper storage grow slowly with population as people seem to retain a constant amount of paper.

CyberAll aims to eliminate paper that is used for storage and transmission, but not for certain viewing applications where paper’s advantages are well known. CyberAll’s near-term goal is to reduce the need for paper document filing while appropriately handling the transactions that would have required paper for transmission, reading, and permanent storage.  The two-year goal is to eliminate all paper except documents that represent money i.e.  plain old money, notes, stock, and unfortunately, cancelled checks.  Tragically, the financial community – hiding behind “user resistance” – is decades behind in their thinking or ability to electronically deal with all of these items, except money!

In order to replace paper for reading, screens may need a resolution of 200 dpi and higher contrast ratios. Paper is also lighter and more portable for small documents.  Still, there is extraordinary progress in display resolution, size, price, and weight.

The advent of a standard image format will have the most impact on document archiving and use because it will provide a single and universal format for storing documents, including images and recognized text for searching.  In this way, it will no longer be necessary to store or transmit paper documents.  The next generation TIF standard that can hold images and recognized text could eliminate the need to store or transmit paper.  PDF, MIME, MHT, and DjVu are also candidates for such a standard.

At last, electronic filing cabinets such as Ricoh’s eCabinet (Ricoh, 1999) are being introduced that can accept both computer generated and scanned documents and know all of the words in the documents they hold! Of course, existing filing systems (e.g. Windows 2000, Office) include the ability to index their documents.  However, scanned documents first need a recognized form. 

Table 1 shows the various kinds of content that occur in an individual’s personal and professional lives for archival (mainly reference) and daily (working) use, e.g. cancelled checks, email, and music.  It also shows some of the uses of the content that arise in these contexts. The content ranges from encoded legacy content (e.g. papers, photos, audio and video tapes) to computer generated papers, presentations, JPEG images, “ripped” CDs, and video tapes.

Table 1. Data-types and use for timeliness and user context.

User Context / Timeliness | Personal | Professional (job related)
Archival (historical reference) | Documents, photos, music, video; memory-aid, entertainment, medical history, progeny | Books, papers, reference documents; memory-aid
Working (daily use) | Documents, email, photos, audio (CDs), video; communication, entertainment, finance, record | Documents, email, content for professional use; communication

Storage Size and Cost

Table 2 gives the storage requirements and costs for holding various kinds of data items of potential interest.  It is clear that all written information and photographs cost nearly zero to store and these will reside in everyone’s cyberspace within the next decade.  Also, the risk of deleting a potentially useful file is much higher than the space savings; hence, storing everything costs substantially less than any alternative.

It should also be noted that the cross-over for storing encoded CDs is about 1/20th the cost of the original CD, not counting the time to attend to the encoding. Unless the encoding can be done in parallel with some other task, the encoding time and cost swamp the cost of the CD.  Emerging music storage appliances and personal computers will likely change the entire music distribution system.  MP3.com sells recorded music via the web and also offers a service that transmits content to an owner of a CD, thereby reducing the user’s encoding cost.

Table 2. Storage requirements and cost for common data items

Item | Size (Bytes) | Encoded size | Items/GByte | Cost ($)/item*
page (b/w) fax | 100 K | 4 K | 10 - 250 K | 0.00004 - 0.001
page (color) | 6 M | 0.3 M (JPEG) | 160 - 3,500 | 0.003 - 0.06
business card | 5 K | 500 | 200 K | 0.00005
Photograph | 3 M | 25-400 K | 10,000 | 0.001
book 350 pp | 25 M | 1-2 M | 40-750 | 0.01 - 0.25

CD (1 hr) | 640 M | 60 M | 1.5 - 16 | 0.60

LowQ video/hr | 50-300 Kb/s | 20-300 M | 3.3 - 50 | 0.002 - 3.30
Mpeg video/hr | 1.5 Mb/s | 670 M | 1.5 | 6.70
HiQ video/hr | DVD 4 Mb/s | 1.8 G | 0.6 | 18

*2000 system prices of $10,000 per terabyte or $10 per Gigabyte
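The items-per-gigabyte and cost-per-item columns follow from the encoded size and the $10 per gigabyte system price in the footnote. A minimal sketch of that arithmetic is given here; the function names are hypothetical, and a decimal gigabyte (10^9 bytes) is assumed.

```python
DOLLARS_PER_GB = 10.0  # 2000 system price, from the table footnote
BYTES_PER_GB = 1e9     # decimal gigabyte; an assumption for this sketch


def items_per_gb(encoded_bytes: float) -> float:
    """How many items of the given encoded size fit in one gigabyte."""
    return BYTES_PER_GB / encoded_bytes


def cost_per_item(encoded_bytes: float) -> float:
    """Incremental storage cost in dollars for one encoded item."""
    return encoded_bytes / BYTES_PER_GB * DOLLARS_PER_GB


# Encoded sizes taken from the table rows.
for label, size in [("CD (1 hr)", 60e6),
                    ("Mpeg video/hr", 670e6),
                    ("HiQ video/hr", 1.8e9)]:
    print(f"{label}: {items_per_gb(size):.1f} items/GB, "
          f"${cost_per_item(size):.2f} per item")
```

Only the three rows with a single encoded size are evaluated; rows that give a range of sizes would produce a range of costs in the same way.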

 

Table 3 estimates the storage requirements for storing various types of content arising in an individual’s life.  It is clear that an individual will be able to record all of the information accumulated in one’s entire personal and professional life in a few terabytes, including everything spoken, but not including anything captured via video recording.  Certainly this archive would include all home videos for most families, hopefully with editing.  The table shows the various jumps in storage required going from recording lifetime text, transcribed or encoded speech, and video.  The need to recognize and only handle transcribed speech is clear based on storage and on the ability to search.

 

Table 3. Size for storing everything read/written, heard/spoken, photographed and seen (via video)

Data-types | Rate (Bytes/hour) | Per day / per 3 year | Lifetime amount
read text, few pictures | 200 K | 2-10 M/G | 60-300 G
Email, papers, written text |  | 0.5 M/G | 15 G
photos w/voice @100KB | 200 K | 2 M/G | 60 G
photos @200 KB | Ten photos/day | 2M/2G | 150 G

spoken text @120wpm | 43 K | 0.5 M/G | 15 G
Spoken text @8Kbps | 3.6 M | 40M/40G | 1.2 T

video-lite 50Kb/s POTS | 22 M | 0.25 G/T | 25 T
video 200Kb/s VHS-lite | 90 M | 1 G/T | 100 T
DVD video 4.3Mb/s | 1.8 G | 20 G/T | 1 P
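The hourly rates for the speech and video rows of Table 3 follow directly from the encoding bit rates. The sketch below shows the conversion; the function names are hypothetical, and decimal units are assumed. Note that the 1.8 G per hour figure corresponds to the 4 Mb/s DVD rate used in Table 2.

```python
def bytes_per_hour(bits_per_second: float) -> float:
    """Convert an encoding bit rate to storage consumed per hour of content."""
    return bits_per_second / 8.0 * 3600.0


def hours_per_terabyte(bits_per_second: float) -> float:
    """Hours of content that fit in one decimal terabyte (10**12 bytes)."""
    return 1e12 / bytes_per_hour(bits_per_second)


# Bit rates taken from the table rows.
for label, rate in [("speech @8 Kb/s", 8_000),
                    ("video-lite @50 Kb/s", 50_000),
                    ("VHS-lite @200 Kb/s", 200_000),
                    ("DVD @4 Mb/s", 4_000_000)]:
    print(f"{label}: {bytes_per_hour(rate) / 1e6:,.1f} MB/hour, "
          f"{hours_per_terabyte(rate):,.0f} hours per terabyte")
```

The DVD result of a little over 550 hours per terabyte is roughly consistent with the "500 hours of DVD quality videos" quoted in the Introduction.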

 

The actual amount of storage used (Table 4) is considerably less than the lifetime estimate, because until recently the author purged files to stay within file cabinet constraints.  Only a few documents were preserved.

The author has a number of albums that archive family and trips, some of which have been posted on a website (Bell, 2000).  Typical albums occupy 3-5 Mbytes, consisting of 30 pages of JPEG photos encoded at 150 KB per page.

Table 4. Author’s document, photograph, videotape; and 150 CD archive

What | Files | Size (MB) | MB/file | GB/Yr
Archive of scanned TIF & PDF | 2,897* | 4,665 | 1.6 |
Computer files 10 yr archive (3K) & working | 5,927 | 712 | 0.2 |
GB books (4 encoded) | 2,027 | 494 |  |
Photos: digital | 997 | 158 | 0.2 |
Photos: scanned albums, pictures, slides | 1,730 | 480 | 0.3 |
Mail (last 2 years only) | 4 | 236 |  | 200
GB Videos (lectures, 8mm family movies) | 20 | 4,000 | 200.0 |
Total personal/prof. archive & working | 10,705 | 10,745 |  |
150 CDs MS WMA multimedia encoding @16 KBps | 1200 | 8,640 | 57.6 | 1000
Grand Total | 11,905 | 19,385 |  |

 

Encoding Formats and Cost for Legacy Data

Table 5 lists the items that one might want to cyberize and the potential formats to use.  Legacy data-types, e.g. paper, photos, and videotapes, have stood the test of time. There are various kinds of “players” that allow them to be converted to computer readable form to exist in Cyberspace. For computer created data, the application program that created the data is often no longer available, so the document is essentially lost. Over the long term, complex programs like databases, word processors, and computer games can no longer run on new systems.  This means the information about the various documents, i.e. meta-data, might appear within the files.  In the future, I would anticipate that systems should be able to deduce much of the meta-data about a document (e.g. type, title, author, keywords, creation date). The document creation date is probably the second most useful meta-data, and is often missing.  Information must be held in as few golden, primitive forms as possible.

This golden data format problem will be discussed in the following section. 

 

Table 5.  Taxonomy of legacy and computer data item types and storage formats

Information | Encoding
Legacy (non-computer generated to encode) |
    Paper: b/w, color, mixed | B/W TIF, PDF, DOC/RTF, HTML
    Voice including phone | MP3
    Photos, slides, overhead transparencies | JPEG (future TIF standard will encode n-photos)
    Photo albums, slide shows, slide talks | JPEG folder, PDF, DOC/RTF, PPT, HTML thicket, MHT (HTML thicket as a single file)
    Music: CDs, tapes, and records | MP3
    Videotapes and film | MPEG-j

Computer generated |
    “golden” formats: TXT, TIF, JPEG, MP3, MPEG-j |
    Files & containers (DOC, RTF, PPT, HTML, PDF, XLS) |
        Databases (e.g. Access, dbase II, etc.); Email databases (e.g. Eudora, Outlook) | Questionable long-term access!  Unreadable indexes!  Eudora: TXT database!
            Applications (e.g. Money, Quicken) | Annual versions that may have to be upgraded. Reports convert to TXT!

 

Encoded documents are stored in two formats to increase the likelihood of reading the document in the distant future. Black and white documents are retained in their primitive scanned TIF formats and, in addition, converted to either PDF, DOC/RTF, or HTML to enable the document to be searched, viewed on a screen, printed at a high quality level to allow the recipient to recreate the same feel as the original, and quite possibly recreated so it can be edited.  For photographs, retrieval by content is an unsolved problem, although systems such as the Altavista search exist to attempt to find images using various attributes, e.g. color, people, or buildings and then to find similar photos with those attributes. 

Some documents (mixed -- black/white text, color figures, and photos) hold the color images, original, and recognized text in one or more files.   For example, a scanned copy of the 1889, 13-page Hollerith patent TIF file requires 700 Kbytes and 79 Mbytes for black and white and color, respectively.  Storing the color scan as JPEG images in containers such as Word, PowerPoint, or PDF, requires about 2 Mbytes.  This file produces a near likeness of the original, aged document.  Depending on how the document was scanned, it can be OCR’d, but “on-screen” viewing is difficult.   The black and white image stored in a PDF file occupies 950 KBytes and contains the original image for limited on-screen viewing, printing, and the OCR’d text for searching. DjVu stored color documents appear to encode compound color and text documents in half the size of other formats.

Document Scanning

One of the most difficult tasks is to cut a relatively rare bound book, paper, or report apart for scanning and then to discard it (Bell, 2000).  Some content (e.g. engineering notebooks and handwritten notes) is not being captured at this time due to the inability to recognize the material and the difficulty of reading low contrast material.

We used the HP Digital Sender (a scan server connected to Ethernet) to scan to either black and white or color TIF or PDF. Adobe Circulate converts among the various data types (e.g. PDF, TIF, and JPEG).  Several other programs, e.g. Caere's PageKeeper, ScanSoft's Pagis and PaperPort, scan to alternative, proprietary TIF format dialects. They also recognize text and build search indices for retrieval.  The author uses PaperPort for holding temporary, working professional documents; if a document is likely to be preserved, it is converted to TIF or PDF.

TIF format is the basis for virtually all OCR and page input programs:

Document → Scan → TIF (future versions of TIF include OCR’d text)
                    |→ Acrobat → PDF (with OCR’d text)
                    |→ e.g. Omnipage + manual effort → DOC | HTML & Σimages

 

TIF is a golden format because it is a non-proprietary and evolving standard that has a huge installed base.  Future, proposed versions of TIF contain the OCR’d text.  Currently, a TIF is converted via Acrobat to PDF[2] with the image recognized so that the document can be searched and easily distributed. 

Alternatively, an OCR program converts the image into a Word document in a near likeness of the original that provides for repurposing. In this way, a new “original” document is created for subsequent use.  As a further alternative, a program converts the document into an HTML page, including the file of images (also known as the HTML thicket) for on-screen viewing.  A new format, MHT, derived from the MIME encoding of a mailable HTML page, holds the thicket in a single file and is possibly a future standards competitor.

Future TIF standards include the image plus the OCR’d text to enable searching as well as meta-data (e.g. creation date, scan date, author, document type, and keywords that further describe the document).  As previously discussed, future retrieval systems should be able to deduce many of the attributes. With time, scanners will evolve to include the encoding software.  Scanners that directly connect to a personal computer usually just provide bitmap images to the computer and, depending on the interface software, images can be stored in a variety of formats.   In the future, scanners will continue to have more capability for scanning images and converting these into a variety of other forms, e.g. TIF with encoding, and JPEG. 

Finally, the evolution of TIF and HTML-XML to be able to hold different image encodings, including the recognized text, will make scanning more economical by allowing all scanned documents to be recognized and indexed within a single archive.  These two capabilities are critical to eliminating the need to store or transmit paper.

Capturing Photos and Albums

We encoded non-digital or legacy photos in two ways. The photos are scanned separately and put into a folder that is containerized by PowerPoint into a .PPT document that holds the set of photos as an album. PowerPoint has a special plug-in to collect a set of images to build the album.  Depending on the expected use, either folders or PowerPoint holds the photos.

ΣPhotos → Scan → ΣJPEG → PowerPoint → .PPT

 

The second method is to directly scan an album (i.e. just a collection of pages of photos) into a single PDF document that holds the various JPEG images of each page. The PDF document can also be unstacked and each page converted into a folder of JPEG images.  The process can continue on to create a single PowerPoint document or Word file (.doc) for storage or display.  For web hosting an album, an HTML document is an alternative storage container.

Album → Scan → PDF(JPEG) → Circulate → ΣJPEG → PowerPoint → .PPT | .DOC

 

Note that a TIF is not used as the intermediate images format because of size. Virtually all images for personal use are JPEG encoded.  Alternatively, I send 35mm slides, negatives, and photos of virtually any size to Kodak for conversion at a cost of $1/image.  The Kodak Photo CD holds 100, 5 Mbyte images in multiple JPEG resolutions in its .PCD format.

Formats

Table 6 gives the advantages and disadvantages of holding the documents in various formats.  It is important to note that none are ideal, but PDF comes relatively close because it can act as a container for virtually any data-type (e.g. TIF, GIF, JPEG), although extracting the data in the correct format using the labyrinth of Adobe tools is a great challenge.  In addition, the Acrobat 4 OCR facility, Capture, provides recognition so that documents can be searched while retaining the original document as an image.

Table 6 also shows the need of future standards that can retain printed images as well as the recognized text for searching, together with the ability to display the images on a variety of screens.

 

Table 6. Characteristics of various scanned document formats

Format | Advantages | Limitations
TIF | A “golden format” from scanning.  Evolving standard will hold images and recognized TXT. | Must be OCR’d to search.  Color files are very large.
PDF | Defined to carry both image and recognized text.  Holds all data-types. | Sole sourced tools, not editable, poor on screen viewing
DOC/RTF | Many editors and viewers, well-defined, container for all data-types | Separate software for OCR
HTML->XML | Open standard, editing tools, on screen via browsers, most universal | Compound documents create an “HTML Thicket” of files; this is solved with MHT

 

Table 7 gives times and/or costs to scan and encode various legacy documents.  As a rule, each item (page, photo, or slide) costs about $1 from commercial services.  For legacy documents, using Acrobat to create PDF for its indexing capability saves an incredible amount of time versus having to recognize and recreate a perfect copy of the document.  In certain cases, one may want to recognize the document and convert it to a Word document (i.e. DOC/RTF) or an HTML document for web viewing. This requires “perfect” recognition together with the need to format the document exactly like the original.  In essence, a document is being re-published. Scanning, recognizing, and editing a page to create a formatted document that is suitable for repurposed use can easily require 10 minutes.

TIF is constantly being evolved by Caere, Kodak, Scansoft, Xerox, etc. to hold compound documents: text, tables, and TIF and JPEG images.  However, like PDF, all of the company formats are unique, proprietary, and constantly being evolved.  Provided a relatively high-resolution image can be regenerated from the TIF components, character recognition can be done on the document to create searchable text.

 

Table 7. Time or cost to encode legacy documents, photos, and CDs

Task | Time (min) | Cost ($)
Page scan | 1 | 0.10-1.00
10 page paper scan (HP Sender) | 2 |
TIF > PDF per page (@400 MHz)* | 0.3 |
TIF > Word or HTML per page | 1-10 |
Photo scan | 2 | 1
35mm slide scan with feeder > JPEG | ? | 1
Images > RTF/DOC, PPT/PDF album | ? | ?
Encode CD > MP3; encode 33 RPM record | ? | 20

*Batch process with OCR for black & white docs

Longevity, aka the “8-Track Tape” Problem

The most serious impediment to a true archive that will last more than a few years is the fast evolution of media, platforms, formats, and the applications that create them.   Unique, proprietary, and constantly evolving data-formats, e.g. Microsoft Word or Outlook, Acrobat-j, Quicken 20xx, Kodak .fpx, MPEG-j, etc. suggest that “old” formats may soon be discarded, and these formats will soon be old.  The new programs will probably not read legacy data on legacy platforms forever.  The basic question is: “How will the data be readable in 10, 20 or 50 years?”

Over time, applications evolve and they simply don’t recognize data that they once helped create.  Ideally, they should provide eternal support in the context of storing everything.  But some apps or the company that created them disappear. Is it expecting too much for 20-something-year-old data to be interpretable by its creating app (e.g. Acrobat, DB2, Draw, Eudora, Office, Quicken, or Real Networks)?  Based on history, it seems most data will be un-interpretable within 20 to 50 years without extra-ordinary effort aimed at keeping them current and interpretable by some current platform. Apps will move to other platforms, or evolve to be more Internet or next-big-thing centric. 

Since CyberAll will store all personal information, e.g. documents, photos, and videos, this data needs to be valid and hence understood in an indeterminate future! High quality paper will hold information for a millennium (or at least several centuries), and film is sometimes rated at several hundred years (if you keep it very cold).  A CD is likely to be readable in 50 years, but finding the CD reader/computer and file system/app to read it will clearly be impossible if history is a guide[3].  Is paper the only true long-term store? 

Digital documents are committed to a conversion treadmill. With each generation of media (e.g. 8” floppy disks c1978), the computer system (e.g. CPM), and the application (e.g. Wordstar), a conversion is required.  This happens about once a decade, if you pick your formats carefully.  If you do not choose carefully, the conversion may not be possible.  An app that encoded video just two years ago has gone away, leaving data useless. That was because of the evolving nature of proprietary formats coming out of the format wars and the need to abandon a pioneering standard. Any one of the MPEGs might have been a better choice.  For plain documents, the alternative is storing 10-foot paper stacks of personal information in file cabinets as the compulsive info pack rats do today, versus a single DVD that a computer can search!

Are there a few basic formats, e.g. TIF, that will be forever interpretable so that one doesn’t have to print and store stacks of irretrievable paper waiting to be encoded or to be otherwise found? Probably not, but there are some things that are likely to last for 10 to 20 more years.  Like JPEG, which is constrained by camera equipment and the need to interoperate, TIF is constrained because of its use as a fax standard. Gary Starkweather (2000), the inventor of the laser printer, scans all documents (photos, papers, books, and journals) into TIF at 400 dpi (a multiple of fax resolutions), using a 70 page per minute scanner.

For one thing, we have learned that for data to be understood in the future, it cannot be subject to highly volatile apps that change every year, such that a particular version has to be executed in order for the data to be understood, e.g. Quicken 95…2000.[4]  As apps evolve, either the data must be kept together with the version of the app that created it, or all past data associated with a named app has to be converted forward. This is also an issue, and perhaps a failure, of object technology that runs on a single, universal machine.

Alternatively, one way to ensure interpretability is to transform an app’s progeny, i.e. its data, into a simple, generic form that one has very long-term confidence in.  This assumes there are a few golden, generic formats that will live indefinitely.  ASCII text is probably the only proven long-term data type.  JPEG is becoming golden due to the plethora of digital cameras.  It is too early to tell how long HTML will be a golden format given the several billion web pages, or if it will be replaced by multiple versions of XML. PDF captures almost all paper documents and even a collection of HTML pages in a single document, but can it prove it has a commitment to longevity?

One solution to longevity is to have just a few data-types that have wide acceptance and standardization, that data can be transformed into, and that are not subject to the whims of rapidly evolving apps.  Forget about data held in complex applications such as drawing programs or databases, e.g. DB2 or Outlook[5].  What golden formats will exist in addition to ASCII, and hopefully TIF?  How long will data held in RTF, PDF, JPEG, various MPEGs, and MP3 be interpretable[6]?  Given the vast amount of data in Microsoft’s Office apps[7], what commitment will these apps make to their data?
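As an illustration of the golden-format idea, an archive listing could be checked for files that are not held in one of the golden formats named in Table 5 (TXT, TIF, JPEG, MP3, MPEG) and are therefore candidates for conversion. This is only a sketch: the mapping of formats to file extensions, the function name `needs_conversion`, and the example file names are all assumptions, not part of the author's tooling.

```python
from pathlib import PurePath

# Assumed file extensions for the "golden" formats listed in Table 5.
GOLDEN_EXTENSIONS = {".txt", ".tif", ".tiff", ".jpg", ".jpeg",
                     ".mp3", ".mpg", ".mpeg"}


def needs_conversion(filenames):
    """Return the names whose extension is not in the golden set."""
    return [name for name in filenames
            if PurePath(name).suffix.lower() not in GOLDEN_EXTENSIONS]


# Hypothetical archive listing.
archive = ["patent.tif", "budget.xls", "talk.PPT", "notes.txt"]
print(needs_conversion(archive))
```

A check of this kind only looks at file names; it says nothing about whether a file's contents are actually readable.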

Using CyberAll

CyberAll is for personal use.  Over time, as searching techniques improve and with the addition of user or system created meta-data about items not implicit in the items or known to the user, others could use it.  CyberAll is meant to operate in a COMOHO (commercial office, mobile office, and home office) environment, i.e. computing anywhere, anytime. The main desktop office machine (CO) holds all files.  This machine is backed up locally in the San Francisco BARC lab and at Redmond. The author’s portable computer, depending on disk size, contains a working subset or cache of the CyberAll CO.  It is the principal computer and is most often used in a MOHO environment with larger displays, keyboards, and remote access.  In the MO location, modems, hotel LANs, etc. communicate via the corporate network to CO via RAS (remote access services) and PPTP (point-to-point tunneling protocol) respectively for absent documents. In the HO location, ADSL and cable modems provide the final links to the corporate network and CO.

Files, Folders, Meta-Data, and Document Databases

A decision made initially was to not invest time in the creation of a database to hold, i.e. point to, the various items or files, much to the disappointment of my database colleagues at our laboratory.  This decision was based on the variation of document types, the time to create the various columns or meta-data for a useful database, the inflexibility of moving or modifying files in an established database, the concern that databases are not golden data-types and hence are likely to be unusable in several decades, the ability of ordinary search programs to serve most of the needs, and finally because of the belief that programs should increasingly be able to automatically extract the relevant meta-data.

The files (documents, photos, etc.) are stored in a relatively flat 2- or 3-level folder hierarchy with 24 folders in the first level and an average of 4 folders in the second level.  Photos are a special case, and there is no attempt to address the recorded conversation or video retrieval problem at this time.  A plethora of specialized music database programs, aka jukeboxes to encode, organize, and play CDs, are available to download.

Over the last several decades, the author has used long, descriptive file names to be able to retrieve information.  Thus, a name may include the subject, organization, keyword, and even the creation date as a part of its file name.  In this way, just searching the file names is likely to find the document.  For articles, the file name is in essence the bibliographic reference.  Given search engines, including the automatic indexing of Office documents using the Windows 2K file system, it is no longer necessary to put the meta-file information in the file name.
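Retrieval by descriptive file name amounts to a keyword match over the names alone.  A minimal sketch, assuming nothing about how the names are laid out; the function names and the example file name are hypothetical:

```python
import os

def name_matches(name, keywords):
    """True if the file name contains every keyword, ignoring case."""
    lowered = name.lower()
    return all(keyword.lower() in lowered for keyword in keywords)

def find_by_name(root, *keywords):
    """Return sorted paths under root whose file names match all keywords."""
    matches = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name_matches(name, keywords):
                matches.append(os.path.join(folder, name))
    return sorted(matches)
```

For example, `find_by_name(root, "amdahl", "1967")` would find a file named "Amdahl Validity of Single Processor Approach AFIPS 1967.pdf" without opening any file.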

 

Table 8. Test meta-data that need be associated with various data-types

Data-type         Meta-data
TIF               title, dates (created, modified, used, etc.)
Photo             JPEG2000 includes title, subject, location, description, category, dates (taken, modified, etc.), camera information
Article           Standard bibliographic reference fields, keywords, and abstract (to aid “off line” use)
Correspondence    Date, to, cc, subject, keywords
Audio             Recognized script identifying the speakers
Video             Photo information plus pointers to various scenes, recognized text

 

The author believes that all the information of or about a file should eventually be contained in the file, especially if the file is to have archival value.  Currently, not all file systems provide a search on file creation date, and this is probably one of the most valuable searching attributes.

Currently, documents are retrieved using “find file” where it may be necessary to search all the words in a file.  However, for a personal database, this is essentially instantaneous using Personal Altavista or the Windows 2000 file system.  The space cost is minimal.
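Searching all the words is fast because such tools consult a prebuilt index rather than rescanning every file.  The sketch below shows the general idea with a small in-memory inverted index; it is not a description of how Personal Altavista or the Windows 2000 file system is implemented, and the function names are hypothetical.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")

def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in WORD.findall(text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Return sorted names of documents that contain every query word."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)
```

The index stores only words and document names, which is consistent with the small space cost noted above.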

Storing and Retrieving Scanned and Digital Photos

Photos and photo albums are stored in personal and professional folders.  The personal photos are organized by subject folders, and each subject folder may have two more levels of folders that go down to a folder that holds a set of photos about a particular time, sub-subject, event, etc.  A folder of photos or individual photos may also be included in other albums; otherwise, photos are put into albums that are just containers for a set of images, and the individual images can be discarded since they can always be recovered from the albums.

Ideally, each photo has all the possible attributes about it; but alas, filling them in is a time-consuming process that most of us are unwilling to invest in.  JPEG has over a dozen attributes, e.g. subject, location, description, and keywords, all of which can be used as text keys for retrieval.  Unfortunately, most users don’t take the time to fill them out, and even more unfortunately, the key dates (time taken, transferred to the computer, modified, and used) are not handled automatically.  Given that many photos are scanned, often the most valuable search parameter, the creation date, is currently useless!

For digital photos, software can easily improve to provide all the dates that can be used in retrieval.  In addition, one of the most useful types of meta-data for digital photos could be a few seconds of recorded voice information about the photo.  This could be automatically recognized for searching.  Some cameras, e.g. from Kodak and Sony, provide audio recording; all but the simplest and lowest-cost digital cameras could do so, recording the annotation at the time the photo is taken.  Arcsoft’s PhotoBase provides the ability to record audio with each image held in its database.

Photos are retrieved in several ways depending on whether the search is for a photo or photo within an album:

·        Image using Windows Explorer thumbnail view.

·        Image file names and any text attribute in the JPEG image that text retrieval finds.

·        Image or attributes using the master search database that holds filed photos and attributes, i.e. meta-data about the photo.  In addition, special photo database programs (e.g. Photobase) can display “contact sheets” in the named database album.  Three attributes can be added per photo. 

·        Album text that may include photo titles, comments, etc. as part of PPT album.

Questions that CyberAll Should be Able to Answer

By keeping every bit of information, CyberAll should be able to answer any type of question associated with this corpus and flow.  Some of the facts that might be useful in the future:

·        Recall the Paris hotels that I have been in during the last ten years. 

·        Recall a restaurant or wine from a dinner in Chicago about four years ago.

·        Show the figures from papers I wrote on supercomputers during 1980-1990.

·        Show the photos from a trip to Spain. Or, taken during July of 1999.

·        Find the articles, papers, etc. that mention Amdahl’s Law. Find the article on Amdahl’s Law.  

·        List all of the letters, recommendations, and papers written in 1989.

·        Recall the email and letters to or about X about five years ago.  
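Several of the questions above reduce to filtering items by type, date range, and keyword once the meta-data exists.  A minimal sketch, assuming each item carries a small record; the `Item` fields are illustrative, loosely following Table 8, and keywords are assumed to be stored in lowercase.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Item:
    """One archived item; the fields are illustrative, not a fixed schema."""
    path: str
    kind: str                      # e.g. "photo", "article", "correspondence"
    created: date
    keywords: set = field(default_factory=set)

def query(items, kind=None, start=None, end=None, keyword=None):
    """Return items matching every criterion supplied; None means 'any'."""
    results = []
    for item in items:
        if kind is not None and item.kind != kind:
            continue
        if start is not None and item.created < start:
            continue
        if end is not None and item.created > end:
            continue
        if keyword is not None and keyword.lower() not in item.keywords:
            continue
        results.append(item)
    return results
```

“Show the photos taken during July of 1999” then becomes `query(items, kind="photo", start=date(1999, 7, 1), end=date(1999, 7, 31))`.  The hard part, as the rest of the paper argues, is obtaining a trustworthy `created` date and keywords in the first place.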

Future

Deciding among the array of products using mostly proprietary data formats and developing a process to deal with the encoding of documents for personal/professional and archive/working use is probably the most difficult task in building one’s CyberAll.   Once it is built, carrying the formats forward is a daunting task.  One strategy is to wait for the ideal solution and stability that is certain to come within the next five years.  Alternatively, keeping data in the most primitive, scanned or encoded form, i.e. TIF and JPEG, allows for future flexibility, including being able to utilize better OCR and more standardized encoding.

TIF v6 is being proposed as an open standard.  It is a document container, like PDF, that can hold scanned and encoded images (e.g. JPEG, GIF) together with the recognized text of the document.  Such a container is also capable of holding an HTML page or set of pages in a single file. 

The entire area of photo storage and retrieval continues to improve.  Camera acquisition software needs to handle dates.  All cameras should include audio recording that could be used for voice annotation, meta-data, and improving the value of the image with a bit of sound or voice. One problem is retrieving images of people, especially at different ages.  It is difficult for humans to differentiate among their own children at an early age, let alone distant relatives, so it is unlikely that a program can find a photo of Sally at age 1, 2, or 6 among pictures of Sally and other children.  Altavista and other image search programs (e.g. IBM’s query-by-image) can search for like images, e.g. buildings, people, or sunsets provided their spectra, various shapes, and other attributes are similar.

Already, the project has convinced me that a goal of paperless storage and transmission is attainable now for everything except books and items that represent money. 

The next phase will include larger scale use e.g. working on the Computer Museum History Center CyberMuseum archives where similar items are kept, as well as the artifacts themselves.  Also, it is necessary to widen the scope to include other family members including young children in order to get a better handle on everyday archiving for non-computer users.

Finally, it is essential to include voice conversations and address video. It is clear that a device that is better suited to recording the audio and video of interviews and meetings would be a welcome and necessary device for this phase. Video will be the focus for many users’ CyberAll.  Even the most avid Videographer can build a system with just today’s 100 Gigabyte disks. 

Conclusions

In 2000, the cost to store all personal and professional information, whether computer generated or originally on paper, including photographs, is nil.  By contrast, it is quite costly to load a personal store and to maintain it for the indefinite future.  One strategy is to retain information in multiple formats to increase the likelihood that information can be retrieved. 

Scanners that produce TIF with recognized text will make scanning everything as easy as discarding it.  Thus, a state of paperless storage and transmission is likely to exist in the future.  Standards and ease of use are the keys.

In the next five years, anyone will be able to have a personal computer that retains everything they’ve read, written, and presented via video that came from a computer or legacy source such as paper or videotape.  This would include all of the transactions for a family (e.g. correspondence, and every conceivable medical and financial record).  In ten years, systems should be able to recall every personal lifetime conversation.  Currently, significant effort is required to build and utilize such systems if they involve the entry of legacy documents, e.g. books, papers, photos, and videotape.  In the future, a system like CyberAll should be a “killer app” for personal computers.  Certainly the computers and software components are likely to exist. 

Acknowledgements

The author is indebted to colleagues at Microsoft Research, and especially the Bay Area Research Center (BARC) for assistance and ideas.  Robert Eberl made various scanners, including one that was withdrawn from the market, operational. Jim Gray deserves special thanks for reading and revising the paper. Vicki Rozycki did most of the scanning.

References

Bell, C.G. Current web page with selected books, papers, talks, videotaped lectures, trip albums, discussion of a CyberMuseum http://research.microsoft.com/~gbell

Bell, G. “Dear Appy, How Committed are you? Signed Lost and Forgotten Data”, ACM Ubiquity, 21 February 2000, Issue 1. http://www.acm.org/ubiquity/views/g_bell_1.html

Bush, V.  As We May Think, Atlantic Monthly, July 1945.

DjVu www.DjVu.com

ECabinet – Ricoh Corporation, Product Brochure; www.ricoh-usa.com  Press Release (November 1999)

Gates, B. The Road Ahead. Penguin Books, 1996.

Kahle, Brewster (2000).  http://www.archive.org

Lesk, M.  Practical Digital Libraries. San Francisco: Morgan Kaufmann Publishers, 1997.

PhotoBase - www.arcsoft.com/

Starkweather, G. Private Communication, February 2000

 

 



[1] The project also is working on helping prototype the transfer of physical documents into cyberspace as part of a CyberMuseum project at the Computer Museum History Center, Moffett Field, California.

[2] PDF is a current, significant de facto standard claiming roughly a billion documents implying a total capacity of at least 100 Terabytes.

[3] A friend has converted 3000 of the author’s c1975 documents stored on 8” floppy disks from the Digital PDP-8 Word Processing System (WPS 8) format into Word Perfect format using a PDP-8 emulator running WPS 8 software.

[4] Data written in 1990 on a MAC and converted forward to a more recent version can be used to generate an almost accurate report of the transactions.  Data written on a MAC cannot be converted across, i.e. read on a PC, without inordinate effort.  MacDraw, MacDraw Professional, and Draw (MacDraw for the PC) have essentially the same characteristics.

[5] Eudora creates a single ASCII file of messages for each folder; hence is almost certain to be readable.

[6] Having hardware devices such as cameras helps create format inertia, but doesn’t guarantee longevity!

[7] Data written in the early 1980s can be converted forward from MAC versions 4.0 and converted across to the PC Office 2000 standards for Excel, PowerPoint, and Word.  PowerPoint can be converted to JPEG.

