Library Science

Automatic Indexing

By: e-gyankosh

Automatic Indexing: In much of the Library and Information Science literature, the term ‘automatic indexing’ is used interchangeably with ‘computerised indexing’. A fully automatic indexing system would be one in which indexing is conducted by computers, an internally generated thesaurus is prepared, and search strategies are developed automatically from a natural language statement of information need. Salton defines automatic indexing as follows: when the assignment of content identifiers is carried out with the aid of modern computing equipment, the operation becomes automatic indexing. It has been suggested that the subject of a document can be derived by a mechanical analysis of the words in the document and of their arrangement in the text. In fact, all attempts at automatic indexing depend in some way on the text of the original document or its surrogates. The computer examines the words occurring in each document and selects substantive words through statistical measurements such as word frequency within the document, total collection frequency, or frequency distribution across the documents of the collection.

The use of computers to generate indexes of documents, however, began with the KWIC indexing developed by H.P. Luhn.

The idea of analysing the subject of a document through automatic counting of term occurrences was first put forward by H.P. Luhn of IBM in 1957. He proposed that:

a) The frequency of word occurrence in a text of the document furnishes a useful measure of word significance;

b) The relative position of a word within a sentence furnishes a useful measurement for determining the significance of sentences; and

c) The significance factor of a sentence will be based on a combination of these two requirements.

Luhn’s automatic indexing was based on word extraction: keywords were extracted from the text by counting how often each word occurred in a given document. The computer scanned a machine-readable document, counted its words or phrases, and the extraction program selected those that occurred most frequently to represent the subject matter of the document. A ‘stop word’ list was first applied to eliminate common and non-substantive words. The system pioneered by Luhn was relatively effective, and the words or phrases selected by the computer were quite similar to those that a human indexer would extract.
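The extraction step described above can be sketched in a few lines of Python. This is only a minimal illustration, not Luhn's original program: the stop list, the sample text and the function name extract_keywords are assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative stop list (an assumption); real systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "of", "by", "in", "on", "and", "or",
              "to", "is", "are", "for", "with"}

def extract_keywords(text, top_n=3, stop_words=STOP_WORDS):
    """Return the top_n most frequent non-stop words with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(top_n)

# Hypothetical sample text, invented for this sketch.
sample = ("Rice pests damage rice crops. Control of rice pests "
          "protects the crops and the harvest.")
print(extract_keywords(sample))
```

In this sample, 'rice' occurs three times and 'pests' and 'crops' twice each, so these three words are the ones selected as candidate index terms.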

In the early 1960s, other attempts were made at implementing automatic indexing systems. These used the computer to scan document texts, or text excerpts such as abstracts, and assigned as content descriptors the words that occurred sufficiently frequently in a given text. A less common approach uses relative frequency in place of absolute frequency. In the relative frequency approach, a word is extracted if it occurs more frequently in the document than would be expected from its frequency in a particular corpus. Thus, in a document on ‘Aerodynamics’, the words ‘Aircraft’ and ‘Wing’ might be rejected even though they are the most frequently occurring words in the document, while the word ‘Flutter’ might be selected even though, in absolute terms, it is not a high-frequency word. Other approaches to automatic indexing use other extraction criteria in place of, or along with, the statistical criteria: word position in the document, word type, or even the emphasis placed on words in printing (e.g. boldface and italics) may all be used as the basis for selection. Subsequently, linguistics led the way by showing that a number of linguistic processes were essential for generating effective content identifiers that characterize natural language texts.
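The relative frequency idea can be illustrated with a small Python sketch. The selection rule used here (a word's share of the document divided by its share of a background corpus, with add-one smoothing and a threshold of 2.0) is an assumption chosen for the example, as are the word counts; the text above does not prescribe a particular formula.

```python
from collections import Counter

def relative_frequency_keywords(doc_words, corpus_words, ratio=2.0):
    """Select words whose share of the document is at least `ratio`
    times their share of the background corpus."""
    doc_counts = Counter(doc_words)
    corpus_counts = Counter(corpus_words)
    doc_total = len(doc_words)
    corpus_total = len(corpus_words)
    selected = []
    for word, count in doc_counts.items():
        doc_share = count / doc_total
        # Add-one smoothing so words unseen in the corpus do not divide by zero.
        corpus_share = (corpus_counts[word] + 1) / (corpus_total + 1)
        if doc_share / corpus_share >= ratio:
            selected.append(word)
    return sorted(selected)

# Hypothetical counts for an aerodynamics corpus and one document from it.
corpus = ["aircraft"] * 40 + ["wing"] * 40 + ["lift"] * 19 + ["flutter"] * 1
document = ["aircraft"] * 5 + ["wing"] * 4 + ["flutter"] * 3

print(relative_frequency_keywords(document, corpus))  # ['flutter']
```

Here 'aircraft' and 'wing' are the most frequent words in the document, but they are just as common across the corpus, so only 'flutter' is selected.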

An ideal computerised indexing system is one that can create and modify subject terms mechanically, with little or no human intellectual effort. Because a computer can process only machine code, the information must first be translated into a fixed machine-readable format. Usually, titles and abstracts are used for computerised indexing. Two assumptions underlie the approach:

a) There is a collection of documents; each contains information on one or several subjects.

b) There exists a set of index terms or categories, one or several of which can describe or represent the subject content of every document in the collection.

Manual Indexing vs. Computerized Indexing:

No. | Manual Indexing | Computerized Indexing
1) | Keywords are identified and selected from the title, abstract and full text of the document to represent its content. | Keywords and/or phrases denoting the subject matter of the document are extracted only from the title and abstract rather than from the document’s full text.
2) | Content analysis of the document is purely a mental process carried out by the human indexer. | The computer does content analysis by following human instructions in the form of a computer program.
3) | The human indexer makes inferences and judgments in selecting index terms judiciously. | The computer cannot think and draw inferences like a human indexer; it can only select or match keywords that are present in the input text.
4) | The human indexer selects and excludes index terms on the basis of semantic, syntactic and contextual considerations. | A computer can be instructed, through proper programming, to select or exclude a term by following semantic, syntactic and contextual rules, like a human indexer.
5) | Indexing is done by scanning the document, analysing critical views, understanding the concepts, and applying the indexer’s own subject knowledge and previous experience. | The computer cannot do this; the process involves less intellectual effort.
6) | Fewer index terms are selected. | More index terms are selected.
7) | It is time-consuming. | It takes less time.
8) | It is expensive. | Index entries can be produced at lower cost.
9) | It is very difficult to maintain consistency in indexing. | Consistency in indexing is maintained.

Methods of Computerised Indexing:

A. Keyword Indexing: An indexing system that does not control its vocabulary may be referred to as ‘Natural Language Indexing’ or sometimes as ‘Free Text Indexing’, and keyword indexing is known by both names. ‘Keyword’ means a catchword, significant word or subject-denoting word taken mainly from the title, and sometimes from the abstract or text of the document, for the purpose of indexing. Keyword indexing thus relies on the natural language of the documents to generate index entries, and no controlled vocabulary is required. Keyword indexing is not new. It existed in the nineteenth century, when it was referred to as ‘catchword indexing’. Computers began to be used to aid information retrieval systems in the 1950s. The Central Intelligence Agency (CIA) of the USA is said to have been the first organization to use a machine-produced keyword index from titles, from 1952. H P Luhn and his associates produced and distributed copies of machine-produced permuted title indexes at the International Conference on Scientific Information held in Washington in 1958. Luhn named this the Keyword-In-Context (KWIC) index and reported the method of its generation in a paper. The American Chemical Society established the value of KWIC by adopting it in 1961 for its publication ‘Chemical Titles’.

KWIC (Keyword-In-Context) Index:

As noted earlier, H P Luhn is credited with the development of the KWIC index. This index was based on the keywords in the title of a paper and was produced with the help of computers. Each entry in a KWIC index consists of the following three parts:

a) Keywords: Significant or subject denoting words which serve as approach terms;

b) Context: The words that specify the particular context in which the keyword is used (usually the rest of the terms of the title); and

c) Identification or Location Code: A code (usually the serial number of the entry in the main part) that gives the address where the full bibliographic description of the document is available.

The operational stages of KWIC indexing consist of the following:

a) Mark the significant words, or prepare a ‘stop list’ and store it in the computer. The ‘stop list’ is a list of words that are considered to have no value for indexing or retrieval. These may include insignificant words such as articles (a, an, the), prepositions, conjunctions, pronouns and auxiliary verbs, together with general words such as ‘aspect’, ‘different’, ‘very’, etc. Each major search system has defined its own ‘stop list’;

b) Select keywords from the title and / or abstract and / or full text of the document, excluding the stop words;

c) Rotate the title so that it is accessible from each significant term. The KWIC routine manipulates the title, or title-like phrase, so that each keyword in turn serves as the approach term and comes at the beginning (or in the middle) of the entry, followed by the rest of the title;

d) Separate the last word and the first word of the title in an entry by a symbol, say a stroke [ / ] (sometimes an asterisk “*” is used). Keywords are usually printed in bold typeface;

e) Put the identification / location code at the right end of each entry; and finally

f) Arrange the entries alphabetically by keywords.

Let us take the title ‘Control of damages of rice by insects’, with the location code 118, to demonstrate the index entries generated through the KWIC principle:

Control of damages of rice by insects 118

Damages of rice by insects / Control of 118

Insects / Control of damages of rice by 118

Rice by insects / Control of damages of 118

In a computer-generated index, the keyword can also be positioned at the centre of the line.
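The rotation steps listed above can be sketched in Python as follows. This is an illustrative sketch only: the short stop list and the function name kwic_entries are assumptions, and the title is the rice example with the word 'insects' spelled out. Keywords are placed at the beginning of each entry; bold printing and centred keywords are not modelled.

```python
# Illustrative stop list (an assumption); each system defines its own.
STOP_WORDS = {"a", "an", "the", "of", "by", "in", "on", "and", "or",
              "to", "for", "with"}

def kwic_entries(title, code, stop_words=STOP_WORDS):
    """Rotate the title so each significant word leads one entry."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        if word.lower() in stop_words:
            continue  # stop words never serve as approach terms
        if i == 0:
            rotated = " ".join(words)
        else:
            # The stroke separates the last word of the title from the first.
            rotated = " ".join(words[i:]) + " / " + " ".join(words[:i])
        rotated = rotated[0].upper() + rotated[1:]
        entries.append(f"{rotated} {code}")
    # Arrange the entries alphabetically by keyword.
    return sorted(entries, key=str.lower)

for entry in kwic_entries("Control of damages of rice by insects", 118):
    print(entry)
```

Run on the rice title, the sketch produces the same four entries, in the same alphabetical order, as the worked example.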

Variations of KWIC:
Two other important versions of the keyword index are KWOC and KWAC, which are discussed below:

KWOC (Key-Word-Out-of-Context) Index:

KWOC is a variant of the KWIC index. Here, each keyword is taken out and printed separately in the left-hand margin, with the complete title in its normal order printed to the right. For example,

Control     Control of damages of rice by insects 118

Damages     Control of damages of rice by insects 118

Insects     Control of damages of rice by insects 118

Rice        Control of damages of rice by insects 118

Sometimes the keyword is printed as a heading and the title is printed on the next line, instead of on the same line as shown above. For example,

Control
Control of damages of rice by insects 118

Damages
Control of damages of rice by insects 118

Insects
Control of damages of rice by insects 118

Rice
Control of damages of rice by insects 118
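Both KWOC layouts can be produced from the same list of entries, as the following Python sketch suggests. The stop list, the function names kwoc_entries and format_kwoc, and the column width of 12 characters are assumptions made for illustration.

```python
# Illustrative stop list (an assumption); each system defines its own.
STOP_WORDS = {"a", "an", "the", "of", "by", "in", "on", "and", "or",
              "to", "for", "with"}

def kwoc_entries(title, code, stop_words=STOP_WORDS):
    """Return (keyword, title, code) tuples, one per significant word."""
    keywords = {w.capitalize() for w in title.split()
                if w.lower() not in stop_words}
    return [(k, title, code) for k in sorted(keywords)]

def format_kwoc(entries, as_heading=False):
    """Print the keyword in the left margin, or as a heading above the title."""
    lines = []
    for keyword, title, code in entries:
        if as_heading:
            lines.append(keyword)
            lines.append(f"{title} {code}")
        else:
            lines.append(f"{keyword:<12}{title} {code}")
    return "\n".join(lines)

entries = kwoc_entries("Control of damages of rice by insects", 118)
print(format_kwoc(entries))
print(format_kwoc(entries, as_heading=True))
```

The title itself is never rotated; only the extracted keyword changes from entry to entry, which is what distinguishes KWOC from KWIC.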

KWAC (Key-Word-Augmented-in-Context) Index:

KWAC also stands for ‘key-word-and-context’. A title often cannot represent the thought content of a document co-extensively, and KWIC and KWOC could not solve the resulting problem of retrieving irrelevant documents. To reduce such false drops, KWAC enriches the keywords of the title with additional keywords taken from the abstract or from the original text of the document. These are inserted into the title or added at its end to give further index entries. KWAC is therefore also called enriched KWIC or KWOC. CBAC (Chemical-Biological Activities) of Chemical Abstracts Service uses a KWAC index in which the title is enriched by another title-like phrase formulated by the indexer.
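The enrichment step can be sketched in Python as below. This is a hedged illustration rather than the procedure of any particular service: the added terms 'stem borer' and 'pesticides' are hypothetical, as are the bracket notation for the enriched title, the stop list and the function name kwac_entries.

```python
# Illustrative stop list (an assumption); each system defines its own.
STOP_WORDS = {"a", "an", "the", "of", "by", "in", "on", "and", "or",
              "to", "for", "with"}

def kwac_entries(title, extra_terms, code, stop_words=STOP_WORDS):
    """Index a title under its own keywords and under added terms.

    The added terms are appended to the title, so every entry shows
    the enriched title as its context.
    """
    enriched = title + " [" + "; ".join(extra_terms) + "]"
    keywords = {w.capitalize() for w in title.split()
                if w.lower() not in stop_words}
    keywords.update(term.capitalize() for term in extra_terms)
    return [(k, enriched, code) for k in sorted(keywords)]

# Hypothetical terms an indexer might take from the abstract.
for keyword, context, code in kwac_entries(
        "Control of damages of rice by insects",
        ["stem borer", "pesticides"], 118):
    print(f"{keyword:<12}{context} {code}")
```

The document becomes retrievable under 'Pesticides' and 'Stem borer' as well as under the four title keywords.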

Other Versions:

A number of varieties of keyword index are reported in the literature. They differ only in their formats; the indexing techniques and principles remain more or less the same. They are:

i) KWWC (Key-Word-With-Context) Index, where only the part of the title (instead of full title) relevant to the keyword is considered as entry term.

ii) KEYTALPHA (Key-Term Alphabetical) Index. It is a permuted subject index that lists only the keywords assigned to each abstract. The Keytalpha index is used in ‘Oceanic Abstracts’.

iii) WADEX (Word and Author Index). It is an improved version of the KWIC index in which the names of authors are also treated as keywords, in addition to the significant subject terms, and which thus satisfies the author approach to documents as well. It is used in ‘Applied Mechanics Reviews’. The AKWIC (Author and Keyword in Context) index is another version of WADEX.

iv) DKWIC (Double KWIC) Index. It is another improved version of the KWIC index.

v) KLIC (Key-Letter-In-Context) Index. This system allows truncation of a word (instead of requiring the complete word), either at the beginning (i.e. left truncation) or at the end (i.e. right truncation): a fragment (i.e. the key letters) is specified and the computer picks up any term containing that fragment. The Chemical Society (London) published a KLIC index as a guide to truncation. The KLIC index indicates which terms any particular word fragment will capture.
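The way a word fragment captures terms under left or right truncation can be sketched in Python as follows. The vocabulary and the function name klic_matches are assumptions invented for the example; the sketch shows only the matching idea, not the layout of a published KLIC index.

```python
def klic_matches(fragment, vocabulary, truncation="right"):
    """List the terms that a truncated search fragment would capture.

    Right truncation cuts off the end of the word, so the fragment
    must start the term; left truncation cuts off the beginning,
    so the fragment must end the term.
    """
    fragment = fragment.lower()
    if truncation == "right":
        hits = [t for t in vocabulary if t.lower().startswith(fragment)]
    elif truncation == "left":
        hits = [t for t in vocabulary if t.lower().endswith(fragment)]
    else:
        raise ValueError("truncation must be 'left' or 'right'")
    return sorted(hits)

# Hypothetical index vocabulary.
vocabulary = ["chloride", "chlorine", "chlorophyll", "hydrochloric",
              "phosphate", "sulphate"]

print(klic_matches("chlor", vocabulary, "right"))
print(klic_matches("phate", vocabulary, "left"))
```

With this vocabulary, the right-truncated fragment 'chlor' captures 'chloride', 'chlorine' and 'chlorophyll' but not 'hydrochloric', while the left-truncated fragment 'phate' captures 'phosphate' and 'sulphate'.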

Uses of Keyword Index:

A number of indexing and abstracting services prepare their subject indexes by using keyword indexing techniques. Apart from those mentioned above, these indexes are simply further variations of keyword indexing. Some notable examples are:

      • Chemical Titles;
      • BASIC (Biological Abstracts Subject In Context);
      • Keyword Index of Chemical Abstracts;
      • CBAC (Chemical Biological Activities);
      • KWIT (Keyword-In-Title) of Lawrence Berkeley Laboratory;
      • SWIFT (Selected Words in Full Titles); and
      • SAPIR (System of Automatic Processing and Indexing of Reports).

Advantages:

1) The principal merit of keyword indexing is the speed with which it can be produced;

2) The production of a keyword index does not require trained indexing staff. What is required is an expressive title that is coextensive with the specific subject of the document;

3) Involves minimum intellectual effort;

4) Vocabulary control need not be used; and

5) Satisfies the current approaches of users.

Disadvantages:

1) Most of the terms used in science and technology are standardized, but the situation is different in the Humanities and Social Sciences. Since no controlled vocabulary is used, keyword indexing appears to be unsatisfactory for subjects in the Humanities and Social Sciences;

2) Related topics are scattered. The efficiency of keyword indexing depends on how well the title expresses the subject of the document, since most such indexes are based on titles. If the title is not representative, the system becomes ineffective, particularly in Humanities and Social Science subjects;

3) Search of a topic may have to be done under several keywords;

4) Search time is high;

5) Searches very often lead to high recall and low precision; and

6) Fails to meet the exhaustive approach for a large collection.

