Evaluating electronic texts in the humanities - Libraries and the Internet: Education, Practice & Policy
Library Trends, Spring, 1994 by Susan Hockey
Many different markup schemes have been developed for humanities electronic texts over the last forty years. Of these, the most notable are COCOA and its variants. COCOA was first devised for an Archive of Old Scots Texts in Edinburgh in the early 1960s (Aitken & Bratley, 1967) and is described fully in the Micro-OCP manual. It provides a way of encoding the canonical referencing structure of a text including parallel referencing schemes and can also be used for other features such as stage directions, editorial comment, and so on. It is used by most of the major text-analysis programs in current use in the humanities, notably the Oxford Concordance Program (OCP) and (in extended form) TACT. The Thesaurus Linguae Graecae developed its own markup scheme, called beta code, which has also been used by other projects in classics and religious studies. The retrieval program WordCruncher also has its own markup scheme. Many existing humanities electronic texts are encoded for use by these programs.
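The way a COCOA-style scheme carries a canonical referencing structure can be illustrated with a short sketch. It assumes the common convention of references written in angle brackets as a category letter followed by a value, each remaining in force until the same category is set again. The category letters used here (T for title, A for act, S for scene, P for page) and the function name `index_lines` are illustrative assumptions, not taken from the Micro-OCP manual.

```python
import re

# A COCOA-style reference: <CATEGORY value>, e.g. <T Hamlet> or <A 1>.
# The category letters in the sample below are assumptions for illustration.
REF = re.compile(r"<(\w+)\s+([^>]*)>")


def index_lines(text):
    """Return (references, line_text) pairs.

    `references` is a snapshot of every reference value in force
    when the line of text was read.
    """
    current = {}
    results = []
    for raw in text.splitlines():
        # Each reference updates its own category and leaves the others alone.
        for category, value in REF.findall(raw):
            current[category] = value.strip()
        content = REF.sub("", raw).strip()
        if content:
            results.append((dict(current), content))
    return results


SAMPLE = """<T Hamlet>
<A 1> <S 1> <P 3>
Who's there?
<P 4>
Nay, answer me.
"""

for refs, line in index_lines(SAMPLE):
    print(refs, line)
```

Because each category is tracked independently, the page reference can change in the middle of a scene without disturbing the act and scene references. This is one simple way to picture the parallel referencing schemes mentioned above.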
Typographic markup is also needed to print or display a text so that it is more easily readable. Even simple word processing programs include features such as italic, bold, and so on to highlight sections of a text to draw the reader's eye to them. A parallel set of markup schemes was thus developed for printing and formatting, most notably TeX, TROFF, and later various word processors, such as WordPerfect, where the markup is exposed by the Reveal Codes function.
The result of this plethora of markup schemes has been described as chaos (Burnard, 1988). By the mid-1980s, experience had shown clearly that markup is essential for good-quality texts, but no scheme had gained wide acceptance. Each scheme was designed for a specific project or application. Most schemes were poorly documented and either had no provision for extension or were otherwise insufficiently flexible. Much time was wasted converting from one format to another. None of the existing markup schemes was suitable for adoption as a standard.
In 1986, the Standard Generalized Markup Language (SGML) became an international standard (van Herwijnen, 1990). SGML is not, in itself, an encoding scheme. It provides a syntactic framework within which descriptive information about an electronic text can be encoded. The principle of SGML is descriptive, not prescriptive--that is, it describes the structure of a text rather than specifying how the text is to be processed. It enables a word that appears in italic to be described as part of a title, or a foreign word, or an emphasized word, or whatever the encoder wishes. At a very basic level, SGML views a text as a collection of objects called elements. These may be chapters, pages, words, lines, stanzas, or whatever the user wishes. The set of elements for a particular text or group of texts, and the relationships among them, are defined in a document type definition (DTD). The DTD has a formal structure. It can be read by a computer program called an SGML parser, which validates the markup in a text, or by other SGML-based software that operates on the text. SGML provides a method of encoding that addresses many of the intellectual issues that earlier encoding schemes did not. A further advantage is that it also provides links to material that is not ASCII text--e.g., sound and images--which is likely to become increasingly important. Its one disadvantage is that it views a document as a single hierarchic structure and has no easy way of dealing with the multiple parallel referencing schemes that appear in many humanities texts.
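The relationship between elements, a DTD, and a validating parser can be sketched in a few lines. The following is not an SGML parser and does not use SGML syntax. It is a minimal illustration, under assumed names, of the idea that a formal definition states which elements may appear inside which others and that a program can check a text against it. The element names (`poem`, `stanza`, `line`, `emph`, `foreign`) and the function `validate` are hypothetical. A real DTD also expresses order and occurrence, which this sketch ignores.

```python
# Hypothetical content model: each element maps to the set of children it
# may contain. "#text" stands for character data. This checks containment
# only, not the order or number of children.
CONTENT_MODEL = {
    "poem": {"title", "stanza"},
    "title": {"#text", "foreign"},
    "stanza": {"line"},
    "line": {"#text", "emph", "foreign"},
    "emph": {"#text"},
    "foreign": {"#text"},
}


def validate(node, model=CONTENT_MODEL):
    """Return a list of error strings for a (name, children) tree.

    Character data is represented by plain str children.
    """
    name, children = node
    if name not in model:
        return [f"undeclared element <{name}>"]
    errors = []
    for child in children:
        child_name = "#text" if isinstance(child, str) else child[0]
        if child_name not in model[name]:
            errors.append(f"{child_name} not allowed inside <{name}>")
        elif not isinstance(child, str):
            errors.extend(validate(child, model))
    return errors


# The same italic word could be tagged <emph> or <foreign>: the markup
# records what the word is, not how it is printed.
good = ("poem", [
    ("title", ["Example"]),
    ("stanza", [("line", ["An ", ("emph", ["italic"]), " word"])]),
])
bad = ("poem", [("stanza", [("line", [("stanza", [])])])])

print(validate(good))
print(validate(bad))
```

The strictly nested tree used here also shows the limitation noted above: every element sits inside exactly one parent, so a second structure that cuts across the first, such as pages that break in the middle of a stanza, has no natural place in a single hierarchy.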