While authors will initially use existing information storage formats, over time, new formats and transmission strategies will evolve to minimize storage and optimize transmission bandwidth. Each of the standard information formats is discussed below:
Most textual information has more structure than just a linear sequence of words (e.g. paragraph breaks, enumerated lists, headings, footnotes, cross-references, etc.) Some standard will be needed to encode this formatting information. It is desirable that the standard preserve the ability to make hypertext links to sub-spans of text (see below about hypertext links), let the reader select the font style in which the text is to be rendered, and get away from page-oriented display (after all, this new form of "electronic paper" does not really have the concept of the end of a page). Thus, while PostScript comes to mind almost immediately, it is not a perfect match for the desired standard.
Using these two tricks, the viewer software will be able to quickly render an image of acceptable quality to the end-user even over relatively slow modems.
The first portion of the document is an attribute-value list, which lists standard information about the document as a sequence of attribute-value pairs. The information in the attribute-value pairs is:
(More about documents go here.)
Both deliberate and accidental copyright infringement will occur with electronic publishing. Whenever two repositories hold two reasonably long copyrighted bit spans that exactly match, there is a copyright infringement problem. It is not the responsibility of the repositories to identify infringement; instead, they may rely on other people to discover it. The procedures for resolving a copyright infringement case should basically be as follows:
The courts can demand that material be removed from publication for libel reasons. This is accomplished by marking the published material as being unavailable whenever it is asked for.
The search tree needs to be linearized into a document. The primary concern during the linearization process is to minimize the amount of I/O needed to search down to a leaf node in the search tree. For example, if the repository's most convenient blocking factor only contains three tree nodes, the access tree in Figure 12 would be linearized in triples as (Holly, Deborah, Molly), (Betty, Ashley, Cathy), (Felice, Elsa, Gertrude), (Katherine, Jasmine, Louise), and (Opel, Nadine, Ruby); if, instead, the most convenient blocking factor is seven to eight tree nodes, it would be linearized as (Holly, Deborah, Molly, Betty, Felice, Katherine, Opel), (Ashley, Cathy, Elsa, Gertrude, Jasmine, Louise, Nadine, Ruby). When a search tree is published, the repository should tell the publisher what the current blocking factor is. Since the appropriate blocking factor will change as hardware evolves, the viewer software should not be directly exposed to the actual blocking factor.
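The blocked linearization above can be sketched in code. The following is a minimal illustration, not the author's algorithm: it assumes a complete binary tree (the node names and shape are inferred from the example, since Figure 12 is not reproduced here), cuts the tree into whole top-slices that fit in a block, and greedily packs slices from the same tree level into blocks of at most the blocking factor.

```python
def subtree_slice(tree, root, levels):
    """Return `root` plus its descendants down `levels` levels, in BFS order,
    along with the frontier of nodes just below the slice."""
    out, frontier = [], [root]
    for _ in range(levels):
        out.extend(frontier)
        frontier = [c for n in frontier for c in tree.get(n, [])]
    return out, frontier

def linearize(tree, root, block_factor):
    """Cut the tree into top-slice chunks, then pack consecutive chunks from
    the same depth into blocks of at most `block_factor` nodes."""
    # Deepest complete top-slice of a binary tree that still fits in a block:
    levels = 1
    while 2 ** (levels + 1) - 1 <= block_factor:
        levels += 1
    chunks, todo = [], [(root, 0)]
    while todo:
        node, depth = todo.pop(0)
        chunk, frontier = subtree_slice(tree, node, levels)
        chunks.append((chunk, depth))
        todo.extend((f, depth + levels) for f in frontier)
    blocks, current, cur_depth = [], [], None
    for chunk, depth in chunks:
        if current and (depth != cur_depth or
                        len(current) + len(chunk) > block_factor):
            blocks.append(current)
            current = []
        current += chunk
        cur_depth = depth
    if current:
        blocks.append(current)
    return blocks

# The fifteen-name tree assumed from the example (parent -> children):
TREE = {
    "Holly": ["Deborah", "Molly"],
    "Deborah": ["Betty", "Felice"], "Molly": ["Katherine", "Opel"],
    "Betty": ["Ashley", "Cathy"], "Felice": ["Elsa", "Gertrude"],
    "Katherine": ["Jasmine", "Louise"], "Opel": ["Nadine", "Ruby"],
}
```

With a blocking factor of three this sketch reproduces the five triples above, and with a blocking factor of eight it reproduces the seven-node block followed by the eight-leaf block.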
Since indices need to be updated over time, they are not static. For example, the author/subject/title indices need to be updated as new books are published. One strategy is to store the access tree on read/write storage and update the tree in place as new entries are added. An alternative strategy is to publish an index as a sequence of sub-indices that are searched as a whole; under this strategy, an index might be updated weekly; every quarter, the preceding thirteen weekly indices would be consolidated into a quarterly index; every year, the preceding three quarterly indices would be consolidated; and so on. A single consolidated search tree would be huge, but it would still be quicker to search than a sequence of smaller sub-trees, each of which must be searched separately. The advantage of having a number of properly organized smaller search trees is that they can reduce search time; conversely, if the trees are organized poorly, they can increase it.
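The consolidation strategy can be sketched briefly. This is an assumed model, not the author's design: each sub-index is a sorted list of (key, document) entries, consolidation merges several sorted sub-indices into one, and a lookup must visit every sub-index currently in force.

```python
import bisect
import heapq

def consolidate(sub_indices):
    """Merge several sorted sub-indices into a single sorted index,
    e.g. thirteen weekly indices into one quarterly index."""
    return list(heapq.merge(*sub_indices))

def search(sub_indices, key):
    """Look up `key` in every sub-index; each lookup is a binary search,
    so total cost grows with the number of separate sub-indices."""
    hits = []
    for index in sub_indices:
        keys = [k for k, _ in index]
        i = bisect.bisect_left(keys, key)
        while i < len(index) and index[i][0] == key:
            hits.append(index[i][1])
            i += 1
    return hits
```

After consolidation, a search touches one index instead of thirteen, which is the point of the periodic merging schedule.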
Given that indices will be updated over time, it probably makes sense to structure search trees as 2-3 trees so that standard tree balancing algorithms can be applied. If search trees are only added to and never deleted, it is possible to store them on WORM (Write-Once/Read-Many) storage; otherwise, indices need to be stored on read/write storage.
In practice, there will probably be a separate index for each language (e.g. English, French, Spanish, etc.) and each language index will be updated over time.
The entry for an overlay hypertext link consists of a textual description (plus length) followed by an embedded hypertext link. Whenever the viewer software displays a span of text from a document, it will search each of the hypertext overlay indices specified by the user, looking for hypertext links to display with the text. If the user has specified M indices where each index has an average of N entries, the search time will be O(M log N). Most hypertext overlay indices will adopt the convention of charging very little for reading the hypertext link textual description and charging most of the cost of traversing a hypertext link when the embedded hypertext link is read.
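The O(M log N) search can be illustrated as follows. This is a hypothetical sketch: the entry layout (offset, length, description, link) is an assumption, and a real viewer would binary-search each overlay's block-structured tree directly rather than rebuild an offset list per query.

```python
import bisect

def links_in_span(overlay_indices, span_start, span_length):
    """Find overlay entries whose offsets fall inside the displayed span.
    Each of the M indices is sorted by document offset and costs one
    binary search plus a scan over matching entries."""
    span_end = span_start + span_length
    found = []
    for index in overlay_indices:                    # M overlay indices
        offsets = [entry[0] for entry in index]
        i = bisect.bisect_left(offsets, span_start)  # O(log N) per index
        while i < len(index) and index[i][0] < span_end:
            found.append(index[i])
            i += 1
    return found
```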
Each mask entry specifies a document offset and length. For restrictive masks, the viewer software will search the mask tree to see whether there are any applicable restrictions before actually displaying a text span. For highlight masks, the masks can be searched after the text span has been displayed.
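The restrictive-mask check is a simple interval test. The sketch below is an assumed implementation over a flat sorted list of (offset, length) entries; the text's mask tree would perform the same test block by block.

```python
import bisect

def span_restricted(masks, start, length):
    """masks: list of (offset, length) restrictions sorted by offset.
    Return True if the requested text span overlaps any masked region."""
    end = start + length
    i = bisect.bisect_right([m[0] for m in masks], start)
    # The last mask starting at or before `start` may extend into the span:
    if i > 0 and masks[i - 1][0] + masks[i - 1][1] > start:
        return True
    # Otherwise, any later mask that begins before the span ends overlaps it:
    return i < len(masks) and masks[i][0] < end
```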
Each royalty entry specifies a document offset, length, and royalty amount. Royalty entries are stored in writable storage so that they can be modified over time.
The first operation a cache performs is to establish a connection with a repository. The cache needs to send an electronic check to the repository to cover the costs of other operations. The response back from the repository is returned as an attribute-value list which contains information such as:
The fundamental operation that viewer software performs is the read operation. The read operation specifies a document number, a starting bit offset, and the number of bits to be fetched, and returns a list of rate/sub-span pairs. When a cache receives a read request, it first sees whether it can honor the request from its local cache; if not, the cache establishes a connection to the repository and forwards the request. Upon receiving the request, the repository looks up the desired information and returns it back to the cache. After the request has been fulfilled, the appropriate amount of royalty money is transferred from the cache account to the appropriate author's account as described in the section on electronic banking.
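The read path above can be sketched in a few lines. All names and interfaces here are assumptions for illustration, not the author's protocol, and the royalty handling is simplified to a flat per-bit fee charged when the repository itself is read.

```python
class Repository:
    def __init__(self, documents, royalty_per_bit, accounts):
        self.documents = documents            # document number -> bit string
        self.royalty_per_bit = royalty_per_bit  # document number -> fee/bit
        self.accounts = accounts              # account name -> balance

    def read(self, doc, start, nbits, payer):
        """Look up the span and transfer royalties to the author."""
        span = self.documents[doc][start:start + nbits]
        fee = len(span) * self.royalty_per_bit[doc]
        self.accounts[payer] -= fee
        self.accounts["author"] += fee
        return span

class Cache:
    def __init__(self, repository):
        self.repository = repository
        self.local = {}                       # (doc, start, nbits) -> bits

    def read(self, doc, start, nbits):
        """Serve from the local cache if possible; otherwise forward the
        request to the repository and remember the answer."""
        key = (doc, start, nbits)
        if key not in self.local:
            self.local[key] = self.repository.read(doc, start, nbits, "cache")
        return self.local[key]
```

In this toy model a repeated read is satisfied locally and incurs no further royalty transfer; how cached re-reads are actually charged is a policy question the sketch does not settle.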
Since read requests may incur a royalty cost, there needs to be a mechanism for discovering the cost of information before actually reading it. This is done by simply querying the repository's royalty index.
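Given the royalty entries described earlier (offset, length, amount), a cost query might look like the sketch below; the assumption that the fee is charged per bit actually covered by each entry is ours, not the author's.

```python
def read_cost(royalty_entries, start, length):
    """royalty_entries: (offset, length, amount-per-bit) triples sorted by
    offset.  Return the royalty owed for reading the requested span,
    summing each entry's fee over the bits it actually covers."""
    end = start + length
    cost = 0
    for off, ln, amount in royalty_entries:
        overlap = min(end, off + ln) - max(start, off)
        if overlap > 0:
            cost += overlap * amount
    return cost
```

The viewer can call this against the royalty index and show the user the price before issuing the actual read.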
Repositories will require a great deal of network I/O bandwidth. There need to be enough network connections to serve each cache that needs access. Initial repositories will need tens to hundreds of modems. It does not take much imagination to picture a heavily utilized repository in the future having thousands of network connections.
Repositories will also require a great deal of storage I/O bandwidth. Electronic publishing will only work if the latency from the time that a user asks for information until it shows up is measured in seconds. The way to reduce latency is to cache commonly accessed information in memory and spread the rest of the information across a fairly large number of disk drives. By spreading the information across a large number of drives, the average request queue length for each drive should remain fairly short.
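One simple way to spread information across drives is to hash document blocks onto them, so that concurrent requests tend to land on different drives and each drive's queue stays short. The block size, drive count, and mixing constants below are illustrative assumptions only.

```python
BLOCK_BITS = 8192   # assumed storage block size, in bits
NUM_DRIVES = 16     # assumed number of disk drives

def drive_for(doc_number, bit_offset, num_drives=NUM_DRIVES):
    """Assign the block containing `bit_offset` of a document to a drive.
    A multiplicative mix scatters adjacent blocks across drives."""
    block = bit_offset // BLOCK_BITS
    return (doc_number * 2654435761 + block * 40503) % num_drives
```

Because consecutive blocks of one document map to different drives, a long sequential read is also spread across spindles rather than queued on a single drive.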
While initial repositories will be able to get away with implementations based on a single processor, as they get larger, multiple processors will be required to keep up with the vast amount of needed I/O. Ultimately, repositories may need specialized hardware for routing an information request from the network connection to the processor connected to the drive containing the information. Thus, large repositories in the future may have scaled-down telephone switching networks to interconnect the network connections to the information storage devices.