Library Science

Indexing Process and Principles

By: e-gyankosh

1.0 Introduction: An index is a guide to the items contained in, or the concepts derived from, a collection. An item may be any book, article, report, abstract, review, etc., or a part of one (a part of a collection, a passage in a book, an article in a journal, etc.). The word index has its origin in Latin and means ‘to point out, to guide, to direct, to locate’. An index indicates or refers to the location of an object or idea. The British Standard (BS 3700: 1964) defines an index as “a systematic guide to the text of any reading matter or to the contents of other collected documentary material, comprising a series of entries, with headings arranged in alphabetical or other chosen order and with references to show where each item indexed is located”. An index is thus a working tool designed to help users find their way through the mass of documented information in a given subject field or document store. It gives subject access to documents irrespective of their physical form: books, periodical articles, newspapers, AV documents, and computer-readable records including Web resources.

Early indexes were limited to personal names or to occurrences of words in the text indexed; they were not topical (subject concept) indexes. Topical indexes appear from the beginning of the 18th century. In the nineteenth century, subject access to books was by means of classification: books were arranged by subject and their surrogates were correspondingly arranged in a classified catalogue. Only in the late 19th century did subject indexing become widespread and more systematic. Historically, the back-of-the-book index may be regarded as the father of all indexing techniques; indexing techniques actually originated from these indexes. Such indexes were of two types: the Specific index, which shows a broad topic in the form of one-idea-one-entry, i.e. the specific context of a specific idea; and the Relative index, which shows various aspects of an idea and its relationship with other ideas, something a Specific index cannot do. Ready-made lists of subject headings such as the Sears List and LCSH fall far short of what is required for depth indexing of micro documents, because their terms are too broad for users’ areas of interest and for the thought content of present-day micro documents.

 1.1 Purpose of Indexing:

Indexing is regarded as the process of describing and identifying documents in terms of their subject contents. Here, the concepts are extracted from documents by the process of analysis, and then transcribed into the elements of the indexing systems, such as thesauri, classification schemes, etc.

In indexing decisions, concepts are recorded as data elements organised into easily accessible forms for retrieval. These records can appear in various forms, e.g. back-of-the-book indexes, indexes to catalogues and bibliographies, machine files, etc. The process of indexing has a close resemblance to the search process. Indexing procedures can be used, on one hand, for organising concepts into tools for information retrieval, and also, by analogy, for analysing and organising enquiries into concepts represented as descriptors or combinations of descriptors, classification symbols, etc. The main purposes of prescribing standard rules and procedures for subject indexing may be stated as follows:

  1. To prescribe a standard methodology to subject cataloguers and indexers for constructing subject headings.
  2. To be consistent in the choice and rendering of subject entries, using standard vocabulary and according to given rules and procedures.
  3. To be helpful to users in accessing any desired document(s) from the catalogue or index through different means of approach.
  4. To decide on the optimum number of subject entries, and thus economise on the bulk and cost of cataloguing and indexing.

1.2 Problems in Indexing:

A number of problems and issues are associated with indexing which are enumerated below:

a) Complexities in the subjects of documents, usually multi-word concepts;

b) Multidimensional user needs for information;

c) Choice of terms from several synonyms;

d) Choice of word forms (Singular / Plural form);

e) Distinguishing homographs;

f) Identifying term relationships – Syntactic and Semantic;

g) Depth of indexing (exhaustivity);

h) Levels of generality and specificity for representation of concepts (specificity);

i) Ensuring consistency in indexing between several indexers (inter-indexer consistency), and by the same indexer at different times (intra-indexer consistency);

j) Ensuring that indexing is done not merely on the basis of a document’s intrinsic subject content but also according to the type of users who may be benefited from it and the types of requests for which the document is likely to be regarded as useful;

k) The kind of vocabulary to be used, and syntactical and other rules necessary for representing complex subjects; and

l) Problem of how to use the ‘index assignment data’.

Each information system needs to define for itself an indexing policy that spells out the level of exhaustivity to be adopted; a vocabulary that will ensure the required degree of specificity; rules, procedures and controls that will ensure consistency in indexing; and methods by which users may interact with the information system, so that indexing may, as far as possible, be related to and influenced by user needs and search queries. Exhaustivity and specificity are management decisions. Since document retrieval is based on the logical matching of document index terms against the terms of a query, the operation of indexing is absolutely crucial. If documents are incompletely or inaccurately indexed, two kinds of retrieval error occur: irrelevant documents are retrieved, and relevant documents are not retrieved.
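The matching described above can be illustrated with a minimal sketch. The document ids, the index terms, the relevance judgments and the `search` function are all hypothetical, invented for illustration; the sketch assumes a simple Boolean AND match, which is only one way such matching might be done.

```python
# Hypothetical index: document id -> set of assigned index terms.
index = {
    "doc1": {"lubricants", "cold rolling", "aluminium alloys"},
    # Incompletely indexed: "cold rolling" was omitted by the indexer.
    "doc2": {"lubricants", "aluminium alloys"},
    # Inaccurately indexed: assume the document is really about hot rolling.
    "doc3": {"lubricants", "cold rolling"},
}

# Documents a user would actually judge relevant (assumed for illustration).
relevant = {"doc1", "doc2"}


def search(index, query_terms):
    """Return ids of documents whose index terms include every query term."""
    return {doc_id for doc_id, terms in index.items() if query_terms <= terms}


retrieved = search(index, {"lubricants", "cold rolling"})
false_hits = retrieved - relevant  # irrelevant documents retrieved
misses = relevant - retrieved      # relevant documents not retrieved
print(sorted(retrieved), sorted(false_hits), sorted(misses))
```

Here the inaccurately indexed document surfaces as a false hit and the incompletely indexed one is missed, which are the two kinds of retrieval error named above.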

When indexing, it is necessary to understand, at least in general terms, what the document is about (aboutness). The subject content of a document comprises a number of concepts or ideas. For example, an article on lubricants for cold rolling of aluminium alloys will contain information on lubricants, cold rolling, aluminium alloys, etc. The indexer selects those concepts that are of potential value for the purpose of retrieval, i.e. those concepts on which, in the indexer's judgment, users are likely to seek information. The choice of concepts, or the ability to recognise what a document is about, is at the very heart of the indexing procedure. However, it is also the identification of concepts that contributes to inconsistencies in indexing.

The problem of vocabulary concerns the rules for deciding which terms are admissible for membership in the vocabulary. There is also the problem of how to determine the goodness or effectiveness of any vocabulary. This implies that the system ranks each of the documents in the collection by the probability that it will satisfy a given user query. Thus, the output documents relating to a search query are ranked according to their probability of satisfying it.

1.3 Indexing Process:

Before indexing, the indexer should first look at the entire collection and make a series of decisions, such as:

a) Does the collection contain any categories of material that should not be indexed?

b) Does the material require general, popular vocabulary in the index?

c) What is the nature of the collection?

d) What are the characteristics of the user population?

e) What is the physical environment in which the system will function? and

f) What will be the display or physical appearance of the index?

Essentially, the process of indexing consists of two stages: (i) establishing the concepts expressed in a document, i.e. the subject; and (ii) translating these concepts into the components of the indexing language.

a) Establishing the concepts expressed in a document:

The process of establishing the subject of a document can itself be divided into three stages:

i) Understanding the overall content of the document, the purpose of the author, etc:

Full comprehension of the content of a document depends to a large extent on the form of the document. Two cases can be distinguished: printed documents and non-printed documents. Full understanding of a printed document depends upon an extensive reading of the text. However, this is not usually practicable, nor is it always necessary. The important parts of the text need to be considered carefully, with particular attention to: the title, abstract and introduction; the opening phrases of chapters and paragraphs; illustrations, tables, diagrams and their captions; the conclusion; and words or groups of words that are underlined or printed in an unusual typeface. The author’s intentions are usually stated in the introductory sections, while the final sections generally state how far these aims have been achieved.

The indexer should scan all these elements while studying the document. Indexing directly from the title is not recommended, and an abstract, if available, should not be regarded as a satisfactory substitute for a reading of the text. Titles may be misleading, and both titles and abstracts may be inadequate in many cases; neither is a reliable source of the kind of information required by an indexer.

A different situation is likely to arise in the case of non-printed documents, such as audio-visual, visual, sound media and electronic media.

ii) Identification of concepts:

After examining the document, the indexer needs to follow a logical approach in selecting those concepts that best express its content. The selection of concepts can be related to a schema of categories recognised as important in the field covered by the document, e.g. phenomena, processes, properties, operations, equipment, etc. For example, when indexing works on ‘Drug therapy’, the indexer should check systematically for the presence or absence of concepts relating to specific diseases, the name and type of drug, route of administration, results obtained and/or side effects, etc. Similarly, documents on the ‘Synthesis of chemical compounds’ should be searched for concepts indicating the manufacturing process, the operating conditions, the products obtained, etc.

iii) Selection of concepts:

The indexer does not necessarily need to retain, as indexing elements, all the concepts identified during the examination of the document. The choice of which concepts to select or reject depends on the purpose for which the indexing data will be used. Various kinds of purpose can be identified, ranging from the production of printed alphabetical indexes to the mechanized storage of data elements for subsequent retrieval. The kind of document being indexed may also affect the product. For example, indexing derived directly from the text of books, journal articles, etc. is likely to differ from that derived only from abstracts. In all cases, however, the selection of concepts is governed by the indexing policy on exhaustivity and specificity adopted by the given system (see Section 1.7).

b) Translating the concepts into the indexing language:

The next stage in subject indexing is to translate the selected concepts into the language of the indexing system. At this stage, indexing can be viewed at two different levels: the document level, known as Derivative indexing; and the concept level, known as Assignment indexing. Derivative indexing is indexing by extraction. Words or phrases actually occurring in a document are selected or extracted directly from it (keyword indexing, automatic indexing, etc.). Here, no attempt is made to use an indexing language; only the words or phrases that appear in the document are used. Assignment indexing (also known as ‘concept indexing’) involves the conceptual analysis of the contents of a document to select the concepts expressed in it, and the assignment of terms for those concepts from some form of controlled vocabulary, according to given rules and procedures for displaying syntactic and semantic relationships (e.g. Chain Indexing, PRECIS, POPSI, Classification Schemes, etc.). Here, an indexing language is designed and is used for both the indexing and the search process.
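The contrast between the two approaches can be sketched roughly as follows. The stop list, the controlled vocabulary and both function names are hypothetical and invented for illustration. The assignment step is reduced to a word lookup, which is a crude stand-in for the conceptual analysis described above and not how systems such as PRECIS or POPSI actually work.

```python
import re

# Hypothetical stop list and controlled vocabulary, invented for illustration.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}
CONTROLLED_VOCABULARY = {
    "lubricants": "Lubricants",
    "lubrication": "Lubricants",
    "rolling": "Rolling (Metalwork)",
    "aluminium": "Aluminium alloys",
    "aluminum": "Aluminium alloys",
}


def derivative_index(text):
    """Derivative indexing: keep the words that actually occur in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if w not in STOP_WORDS})


def assignment_index(text):
    """Assignment indexing (simplified): map words to preferred vocabulary terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({CONTROLLED_VOCABULARY[w] for w in words if w in CONTROLLED_VOCABULARY})


passage = "Lubricants for cold rolling of aluminium alloys"
print(derivative_index(passage))
# ['alloys', 'aluminium', 'cold', 'lubricants', 'rolling']
print(assignment_index(passage))
# ['Aluminium alloys', 'Lubricants', 'Rolling (Metalwork)']
```

The derived terms depend entirely on the author's wording, whereas the assigned terms stay the same whether the text says "aluminium" or "aluminum", which is the point of using a controlled vocabulary for both indexing and searching.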

1.4 Indexing Language:

An indexing language is an artificial language consisting of a set of terms, and devices for handling the relationships between them, used to provide index descriptions. It is also referred to as a retrieval language. An indexing language is ‘artificial’ in the sense that, although it may (though not always) draw on the vocabulary of natural language, its syntax, semantics, word forms, etc. differ from those of a natural language. Thus, an indexing language consists of elements that constitute its vocabulary (i.e. controlled vocabulary), rules for admissible expression (i.e. syntax) and semantics.

1.5 Theory of Indexing:

The lack of an indexing theory to explain the indexing process is a major blind spot in information retrieval. Very little seems to have been written about the role and value of theory in indexing. Those who have written about it, however, tend to agree that it serves a vital function. One important function of the theory of indexing is to establish an agenda for research. Equally important, by identifying gaps it suggests what remains to be investigated. Theories also supply a rationale for, or an argument against, current practices in subject indexing. They can put things in perspective, or provide a new and different perspective.

The contributions made by K. P. Jones and R. Fugmann [Quinn, 1994] to indexing theory are worth mentioning. According to Jones, an indexing theory should consist of five levels, which are as follows:

a) Concordance level: It consists of references to all words in the original text arranged in alphabetical order.

b) Information-theoretic level: This level calculates the likelihood of a word being chosen for indexing based on its frequency of occurrence within a text. For example, the more frequently a word appears, the less likely it is to be selected, because the indexer reasons that the document is ‘all about that’.

c) Linguistic level: This level of indexing theory attempts to explain how meaningful words are extracted from large units of text. Indexers regard opening paragraphs of chapters and/or sections, and opening and closing sentences of paragraphs, as more likely sources of indexable units, as are definitions.

d) Textual level: Beyond individual words or phrases lies the fourth level—the textual or skeletal framework. The author in his/her work presents ideas in an organized manner, which produces a skeletal structure clothed in text. The successful indexer needs to identify this skeleton by searching for clues on the surface.

e) Inferential level: An indexer is able to make inferences about the relationships between words or phrases by observing the paragraph and sentence structure, and stripping the sentence of extraneous detail. This inference level makes it possible for the indexer to identify novel subject areas.
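The first of these levels, the concordance, is mechanical enough to sketch in a few lines. The function name `build_concordance` and the sample sentence are assumptions made for illustration; word positions stand in for the page or line references a real concordance would carry.

```python
import re
from collections import defaultdict


def build_concordance(text):
    """Map each word to the 1-based positions where it occurs, sorted alphabetically."""
    positions = defaultdict(list)
    words = re.findall(r"[a-z]+", text.lower())
    for position, word in enumerate(words, start=1):
        positions[word].append(position)
    return dict(sorted(positions.items()))


concordance = build_concordance("An index is a guide. An index points to a location.")
for word, places in concordance.items():
    print(word, places)
```

The length of each position list is the word's frequency of occurrence, which is the raw figure the information-theoretic level works from.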

The indexing theory proposed by Robert Fugmann is based on five general axioms which, he claims, have obvious validity, are in need of no proof, and explain all currently known phenomena in information supply. These five axioms are:

a) Axiom of definability: Compiling information relevant to a topic can only be accomplished to the degree to which a topic can be defined.

b) Axiom of order: Any compilation of information relevant to a topic is an order creating process.

c) Axiom of the sufficient degree of order: The demands made on the degree of order increase as the size of a collection and frequency of searches increase.

d) Axiom of predictability: It says that the success of any directed search for relevant information hinges on how readily predictable or reconstructible are the modes of expression for concepts and statements in the search file. This axiom is based on the belief that the real purpose of vocabulary control devices is to enhance representational predictability.

e) Axiom of fidelity: It equates the success of any directed search for relevant information with the fidelity with which concepts and statements are expressed in the search file.

Like theories in other disciplines, these theories of indexing are developed provisionally, with the understanding that subsequent research will either support or refute them.

1.6 Indexing Criteria:

It is possible, however, to minimize inconsistencies in indexing. This can be done by requiring indexers to test the indexability of concepts systematically against a set of criteria. It is obviously not possible to suggest criteria that would produce the same results when used by the same indexer at different times, or by more than one indexer at the same time. At best, the criteria enable greater agreement between indexers about the concepts that should be indexed. Some of these criteria are given below in the form of a checklist of questions that indexers can ask themselves when faced with a document to be indexed.

1. To what extent is the document about a particular concept? Mere mention of a concept in the document does not make it indexable. If the concept was a reason for the document, or if without the concept the document would either not exist or be significantly altered, then the concept is worth indexing.

2. Is there enough information about the concept in the document? This is always a matter of judgment and indexers may disagree with one another about what constitutes ‘enough information’. However, experience in indexing, in answering queries, and subject knowledge can go a long way in arriving at good decisions concerning this question.

3. Another way of testing the indexability of a concept would be for the indexer to ask himself: would a user, searching for information on this concept, be happy if the document on hand is retrieved? Is there a likelihood of the concept figuring in search queries?

The answers to these questions indicate not only the indexability of concepts but also the level of specificity at which concepts need to be indexed. To decide on the factors mentioned above, the indexer should have good judgment, experience in answering search queries or in reference service, and a good understanding of users and their information needs.

1.7 Indexing Policy: Exhaustivity and Specificity:

Exhaustivity is a matter of indexing policy: it is the measure of the extent to which all the distinct subjects discussed in a particular document are recognized in the indexing operation and translated into the language of the system. Exhaustive indexing requires a larger number of index entries, covering the different concepts (both primary and secondary) treated in the document. The greater the number of concepts selected for indexing, the more exhaustive the indexing. If, in a given document, concepts A, B, C, D, E are selected for indexing, then the indexing of the document is more exhaustive than if only concepts A, B, C are selected. When a relatively large number of concepts are indexed for each document, the policy followed is one of depth indexing. Depth indexing, in other words, allows for the recognition of concepts embodied not only in the main theme of the document but also in sub-themes of varying importance. The policy decision on exhaustivity depends upon several factors, such as the strength of the collection, the manpower available, economy, and the requirements of users.

In selecting a concept, the main criterion should always be its potential value as an element in expressing the subject content of the document. In making a choice of concepts, the indexer should constantly bear in mind the questions (as far as these can be known), which may be put to the information system. In effect, this criterion re-states the principal function of indexing. With this in mind, the indexer should:

  1. choose the concepts that would be regarded as most appropriate by a given community of users; and
  2. if necessary, modify both indexing tools and procedures as a result of feedback from enquiries.

The limit on the number of terms or descriptors that can be assigned to a document should not be decided arbitrarily. It should be determined entirely by the amount of information contained in the document. Any arbitrary limit is likely to lead to loss of objectivity in the indexing, and to the distortion of information that would be of value for retrieval. If, for economic reasons, the number of terms has to be limited, the selection of concepts should be guided by the indexer’s judgment of the relative importance of concepts in expressing the overall subject of the document.

In many cases, the indexer needs to include, as part of the indexing data, concepts which are present only by implication, but which serve to set a given concept into an appropriate context.

Specificity is the degree of the preciseness of the subject to express the thought content of the documents. It is the measure of the extent to which the indexing system permits the indexers to be precise when specifying the subject of the document. An indexing language is considered to be of high specificity if minute concepts are represented precisely by it. It is an intrinsic quality of the index language itself.

As a rule, concepts should be identified as specifically as possible. More general concepts may be selected in some circumstances, depending upon the purpose of the information retrieval system. In particular, the level of specificity may be affected by the weight attached to a concept by the author. If the indexer considers that an idea is not fully developed, or is referred to only casually by the author, indexing at a more general level may be justified.

Both Exhaustivity and Specificity are very closely related to recall and precision. A high level of exhaustivity increases recall and a high level of specificity increases precision.

1.8 Quality Control in Indexing:

The quality of indexing is defined in terms of its retrieval effectiveness—the ability to retrieve what is wanted and to avoid what is not. The quality of indexing depends on two factors: (i) the qualification of the indexer; and (ii) the quality of the indexing tools.

An indexing failure on the part of the indexer may take place at two stages of indexing process: establishing the concepts expressed in a document, and their translation. Failure in establishing concepts expressed in a document could be of two types:

a) Failure to identify a topic that is of potential interest to the target user group; and

b) Misinterpretation of the content of the document, leading to the selection of inappropriate term(s).

Translation failures may be of three types:

a) Failure to use the most specific term(s) to represent the subject of the document;

b) Use of inappropriate term(s) for the subject of a document because of the lack of subject knowledge or due to lack of seriousness on the part of the indexer; and

c) Omission of important term(s).

For a given information system, the indexing data assigned to a given document should be consistently the same regardless of the individual indexer. Consistency is a measure that relates to the work of two or more indexers. It should remain relatively stable throughout the life of a particular indexing system. Consistency is particularly important if information is to be exchanged between agencies in a documentary network. An important factor in reaching the required level of consistency is complete impartiality on the part of the indexers. Almost inevitably, some element of subjective judgment will affect indexing performance, and this needs to be minimized as far as possible. Consistency is more difficult to achieve with a large indexing team, or with teams of indexers working in different locations (as in a decentralized system). In this situation, a centralized check stage may be helpful.

The indexer should preferably be a specialist in the field for which the document is indexed, and should understand the terminology of the documents as well as the rules and procedures of the specific indexing system.
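The text does not say how consistency between two indexers would be measured. One simple possibility, assumed here purely for illustration, is a Jaccard-style overlap ratio: the number of terms both indexers assigned, divided by the number of distinct terms either of them assigned. The function name and the sample term sets are hypothetical.

```python
def indexing_consistency(terms_a, terms_b):
    """Share of terms common to both indexers among all distinct terms assigned."""
    terms_a, terms_b = set(terms_a), set(terms_b)
    union = terms_a | terms_b
    if not union:
        # Neither indexer assigned any term: treated here as full agreement.
        return 1.0
    return len(terms_a & terms_b) / len(union)


# Terms two hypothetical indexers assigned to the same document.
indexer_1 = {"lubricants", "cold rolling", "aluminium alloys"}
indexer_2 = {"lubricants", "rolling", "aluminium alloys", "metalworking"}

score = indexing_consistency(indexer_1, indexer_2)
print(round(score, 2))  # 2 shared terms out of 5 distinct terms -> 0.4
```

A score of 1.0 means the two indexers assigned identical terms and 0.0 means they share none; the same function could compare one indexer's work at two different times.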

Quality control would be achieved more effectively if the indexers have contact with users. They could then, for example, determine whether certain descriptors may produce false combinations, and also create noise at the output stage.

Indexing quality is also dependent upon certain properties of the indexing method or procedure. It is essential that an index should be able to accommodate new terminology, and also new needs of users—that is, it must allow frequent updating.

Indexing quality can be tested by analysis of retrieval results, e.g. by calculating recall and precision ratios.
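As a minimal sketch of that calculation, the two ratios can be computed from the set of documents retrieved for a query and the set judged relevant to it. Recall is taken in its usual sense as the proportion of relevant documents that were retrieved, and precision as the proportion of retrieved documents that are relevant. The function name and the document ids are hypothetical.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one query, given two sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search result and relevance judgments for a single query.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}

recall, precision = recall_and_precision(retrieved, relevant)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.67 precision=0.50
```

In this example one relevant document (d5) was missed, which lowers recall, and two irrelevant ones (d3, d4) were retrieved, which lowers precision.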


This Article Collected From:

  • Unit-4 Indexing Systems and Techniques. (2017). Retrieved from http://egyankosh.ac.in/handle/123456789/11150

Md. Ashikuzzaman

Work at North South University Library, Bangladesh.
