Viatscheslav Iatsko (Abakan, Russia)
Linguistic Aspects of Summarization
Summarization comprises transformation operations (elimination, modification, transposition, etc.) applied by an abstractor in creating a secondary source (summary) that reflects the main content of a primary source (original text). Methodologies of summarization aimed at identifying the most essential information contained in the primary source are outlined. Symmetric extraction involves calculating the functional weight of sentences so as to select and include in the summary those with a high functional weight; marking extraction involves selecting sentences by means of a dictionary of markers and connectors. Procedures for summarizing scientific texts and works of fiction that can be applied to teaching a foreign language are described.
From the linguistic point of view the process of summarization comprises three main constituents: the abstractor, a person who does summarization; a primary source, i.e. a text which is summarized; and a secondary source, i.e. a summary or an abstract, the result of summarization.
To do summarization the abstractor must study the structure of the primary source, and apply some methods transforming the original text so as to create the secondary source reflecting the content of the original. So the abstractor is supposed to know:
The specific feature of summarization methods is that they are aimed at identifying the most essential information contained in the primary source and reproducing this information in a compressed form in the secondary source. Identification and retrieval of the most essential information contained in the primary source is one of the main problems of summarization.
PhiN 18/2001: 34
This paper aims to describe existing linguistic methodologies of summarization, to outline specific features of the linguistic structure of the primary and the secondary source, and to suggest methodologies that can be applied to teaching foreign languages.
2 Extraction methodologies of summarization
Existing linguistic methodologies of summarization involve extracting some sentences from the original so as to make up its summary. Selected according to linguistic criteria recorded in a dictionary, such sentences are considered to represent the main content of the original text.
2.1 Marking extraction
According to the traditional approach the information that abstracts must reflect is contained in different parts of the original text, such as methods, theoretical principles, results, conclusions, etc. (Maizell and Smith (1971), Trawinsky (1989)).
One of the best-developed techniques of summarization within this approach (Blumenau (1982), Leonov (1986)) is based on the assumption that the author of the primary source points out the sentences that he considers important by using special words and expressions (markers). With the help of a dictionary of such markers the abstractor selects sentences in the primary source to make up the abstract. The lexical units included in the dictionary must meet the following requirements.
They must not belong to any specific subject field, that is, they must not be specialist terms.
They must be correlated with parts of the original text (its subject headings). For example,
existing method of problem solving (EMPS),
evaluation of existing method of problem solving (EEMPS),
new method of problem solving (NMPS),
evaluation of new method of problem solving (ENMPS),
The process of summarization consists of looking through the original text and writing out sentences that contain markers. If a selected sentence contains lexical units such as next or following, which indicate its connection with the next sentence or sentences, those sentences are written out as well. If a selected sentence contains lexical units such as demonstrative pronouns, personal pronouns, or co-referent terms, the previous sentence is written out. Only close connections between sentences are taken into account; distant connections, indicated by such lexical units as above, are ignored. Lexical units denoting close connections between sentences are referred to as connectors.
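The procedure just described can be sketched in code. The following is a minimal illustration under stated assumptions, not the implementation used by the cited authors: the marker phrases, the two connector lists, the sample sentences and the function name marking_extract are all hypothetical, and co-referent terms are not detected at all.

```python
import re

# Hypothetical mini-dictionary: subject heading -> marker phrases.
MARKERS = {
    "EMPS": ["existing method", "traditional approach"],
    "NMPS": ["new method", "we propose"],
}
# Connectors for close connections only (assumed lists, far from complete).
FORWARD = {"next", "following"}
BACKWARD = {"this", "these", "it", "they"}


def marking_extract(sentences, markers=MARKERS):
    """Return sorted (index, heading) pairs for the sentences written out."""
    selected = {}
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        heading = next(
            (h for h, phrases in markers.items()
             if any(p in lowered for p in phrases)),
            None,
        )
        if heading is None:
            continue  # no marker: the sentence is not selected on its own
        selected[i] = heading
        words = set(re.findall(r"[a-z]+", lowered))
        if words & FORWARD and i + 1 < len(sentences):
            selected.setdefault(i + 1, heading)  # write out the next sentence
        if words & BACKWARD and i > 0:
            selected.setdefault(i - 1, heading)  # write out the previous one
    return sorted(selected.items())


# Invented sample text, one sentence per item.
TEXT = [
    "Summaries are widely used in libraries.",
    "The existing method relies on the following assumption.",
    "Authors flag important sentences with special words.",
    "Many texts are long.",
    "Readers have little time.",
    "We propose a new method that addresses this.",
]

for index, heading in marking_extract(TEXT):
    print(heading, TEXT[index])
```

Passing a smaller markers dictionary, for example one without the EMPS heading, yields a shorter extract.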
Table 1 (see Appendix) demonstrates the application of these procedures using Caudery's (1998) paper. The total number of sentences in the original text is 217. The number of selected sentences is 10.
It is important to notice that subject headings do not necessarily correspond to headings of sections in the primary source. Caudery's paper has the section "Conclusions", but in this section there are only two sentences, one of which (216) was selected under the subject heading ENMPS, but not R.
The advantage of this methodology of summarization is that it does not require knowledge of a particular subject field and can be applied to any scientific paper or monograph in any subject field. It also allows the size of the abstract to be changed: to make the abstract smaller, the abstractor can drop subject headings such as EMPS and EEMPS.
2.2 Symmetric extraction
Another approach, developed in the works of Luhn (1958) and Skorokhodko (1983), is based on the assumption that in the primary source the most essential information is contained in separate sentences that have a higher functional weight than the other sentences. In the process of summarization the functional weight of each sentence in the original text is calculated, and the sentences with a higher functional weight are included in the summary. The functional weight of a sentence is understood as the number of connections between that sentence and the other sentences in the text. Each connection is a repetition of a noun, verb, participle, adjective, or adverb belonging to a given subject field. Here follows our interpretation of this approach, which can be called symmetric extraction.
Consider example 1, which is a paragraph from the scientific paper (Murray (2000)) consisting of four sentences.
Italics indicate repeated words that do not belong to this subject field and should be ignored. Underlining indicates repeated words that belong to the subject field; these words should be taken into account during summarization. It is important to understand that left-hand as well as right-hand connections are taken into account. For example, the word writing in the first sentence is repeated in the second sentence. Taking into account right-hand connections, we can say that sentence (1) has one connection with sentence (2) through the repetition of the word writing; taking into account left-hand connections, we can say that sentence (2) also has one connection with sentence (1) through the repetition of the same word. Writing is also repeated twice in the fourth sentence: first as writing and then as written.
Writing and written are considered to be one word as they have the same root morpheme. Thus the 1st sentence has 2 connections with the 4th sentence through the repetition of writing, and the 4th sentence likewise has 2 connections with the 1st one through the repetition of the same word. The principle on which this methodology works is therefore that of symmetric relationship: if a sentence Z has 4 connections with a sentence Y, then sentence Y has 4 connections with sentence Z. Another principle is that not only the repetition of whole words but also the repetition of components of their morphological structure (usually root and prefix morphemes) is taken into account. Following these principles, we can count the connections of each sentence and see that sentence 4 has the highest number of connections and the highest functional weight, while sentence 3 has the lowest. Accordingly, sentence 3 is sure not to be selected during summarization.
(1): with (2) writing; with (4) alphabetic, writing, written, Greeks; 5 connections in total
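The counting of connections can be illustrated with a short sketch. It is a hedged reading of the method rather than a reference implementation: the dictionary ROOTS (mapping subject-field word forms to a shared root), the four sample sentences (invented for illustration, not the paragraph from Murray (2000)) and the function names are assumptions. A further assumption is that, for each shared root, the number of connections between two sentences equals the larger of the two occurrence counts, which reproduces the case of writing and written described above.

```python
import re
from collections import Counter

# Hypothetical subject-field dictionary: word form -> root morpheme.
ROOTS = {
    "writing": "writ", "written": "writ",
    "alphabetic": "alphabet", "alphabet": "alphabet",
    "greeks": "greek", "greek": "greek",
}

# Invented sample paragraph of four sentences.
SENTENCES = [
    "The Greeks adopted alphabetic writing.",
    "Writing changed how knowledge was stored.",
    "Trade routes also expanded at that time.",
    "Alphabetic writing let the Greeks keep written records.",
]


def root_counts(sentence, roots):
    """Count occurrences of each subject-field root in one sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return Counter(roots[w] for w in words if w in roots)


def functional_weights(sentences, roots):
    """Return the symmetric functional weight of every sentence."""
    counts = [root_counts(s, roots) for s in sentences]
    weights = [0] * len(sentences)
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            shared = counts[i].keys() & counts[j].keys()
            # Assumed rule: the larger occurrence count per shared root.
            links = sum(max(counts[i][r], counts[j][r]) for r in shared)
            weights[i] += links  # right-hand connections of sentence i
            weights[j] += links  # left-hand connections of sentence j
    return weights


print(functional_weights(SENTENCES, ROOTS))  # [5, 3, 0, 6]
```

In this toy paragraph the fourth sentence receives the highest weight and the third the lowest, so the third would not be selected.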
The methodology of symmetric extraction has the following advantages.
It is relatively easy to automate on condition that the computer has a dictionary of terms belonging to a given subject field. In this case the computer looks through the text selecting sentences with repeated words and calculating the number of connections.
The size of the summary is easy to change: the computer can be instructed to select only sentences with 6 or more connections, with 5 or more connections, etc. The lower the threshold, the bigger the summary.
It is easy to change retrieval characteristics. For example, we can take into account only the repetition of nouns, ignoring other repetitions, and thus get a different distribution of connections.
(1): with (4) writing, Greeks; 2 connections in total
We can take into account only left-hand connections, i.e. connections with the preceding text, ignoring right-hand connections. In this case the relationship between sentences becomes non-symmetric.
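The left-hand variant and the selection by threshold can be sketched in the same spirit. Again this is only an illustration: the dictionary, the invented sample sentences, the function names left_hand_weights and select, and the rule of taking the larger occurrence count per shared root are assumptions, not part of the cited methods.

```python
import re
from collections import Counter

# Hypothetical subject-field dictionary: word form -> root morpheme.
ROOTS = {
    "writing": "writ", "written": "writ",
    "alphabetic": "alphabet", "greeks": "greek",
}

# Invented sample paragraph of four sentences.
SENTENCES = [
    "The Greeks adopted alphabetic writing.",
    "Writing changed how knowledge was stored.",
    "Trade routes also expanded at that time.",
    "Alphabetic writing let the Greeks keep written records.",
]


def left_hand_weights(sentences, roots):
    """Weight each sentence by its connections with preceding sentences only."""
    counts = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        counts.append(Counter(roots[w] for w in words if w in roots))
    weights = []
    for j, current in enumerate(counts):
        total = 0
        for earlier in counts[:j]:
            shared = current.keys() & earlier.keys()
            # Assumed rule: the larger occurrence count per shared root.
            total += sum(max(current[r], earlier[r]) for r in shared)
        weights.append(total)
    return weights


def select(sentences, weights, min_connections):
    """Keep the sentences whose weight reaches the chosen threshold."""
    return [s for s, w in zip(sentences, weights) if w >= min_connections]


weights = left_hand_weights(SENTENCES, ROOTS)
print(weights)  # [0, 1, 0, 6]
print(select(SENTENCES, weights, 2))
```

The weights are no longer symmetric (the first sentence now has none), and lowering min_connections makes the resulting extract bigger.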
It is interesting to notice that in all these variants the 4th sentence has the highest functional weight, which means that this sentence is really important for the structure and the content of the text.
3 Methodologies of editing an extract
Extraction methodologies face one major problem: the incoherence and incomprehensibility of the resulting texts (extracts). Extracts are incoherent because sentences are extracted from the primary source without being transformed. Extraction can therefore be considered only one stage of summarization.
Going back to the model of summarization we can say that the extraction methodologies give some insight into the structure of the primary source, but they give few insights into the activities of the abstractor and no insights into the structure of summaries. We still have no notion of transformation procedures applied by the abstractor in order to create a secondary source. How can we get this information making use of the results of extraction?
One possible solution is a contrastive analysis of the linguistic structure of an extract and a summary of the same paper. The Appendix contains an incoherent extract of Caudery's paper (1998) and a coherent summary of the same paper (compare Table 1 and Text 1). The summary was prepared by editing the extract manually. The sentences are numbered according to their sequential order in the primary source. Comparing the two texts gives us a good opportunity to see what was changed and thus to gain insight into the abstractor's activities. While comparing the two texts we must ask two questions, What and Why: What did the abstractor change? Why did he change it?
The first thing that attracts our attention is that some phrases were eliminated from the text of the extract.
Why were these phrases eliminated? This question can be given two different answers. The first six phrases were eliminated because they indicate connections between selected sentences and sentences that were not selected during extraction. Having eliminated manifestations of such connections the abstractor changed the communicative dimension of the text.
The last two phrases were eliminated because the abstractor, the author of the summary, is not the author of the judgements expressed by the sentences. The author of these judgements is the author of the primary source, which is why summaries do not admit of the first person. Having eliminated the personal pronouns, the abstractor changed the modal dimension of the text.
It should be noted that while the elimination of the parenthetical clause in sentence (189) doesn't present any difficulties, the elimination of the principal clause in (138) involves changing the semantic type and syntactic structure of the sentence, turning the object into the subject and introducing a new predicate.
(138) I will mention one final advantage ... > (138a) Another advantage ... is that ...
Here elimination is accompanied by another procedure that can be called modification. Using modification the abstractor changed the semantic and syntactic dimension of text.
The abstractor also changed the order of sentences, placing (72a) before (40a), because this order is the logical one. So there are logical relations between the judgements expressed by sentences, and the abstractor changed the relational dimension of the text by means of a procedure that can be called transposition.
Comparing the two texts, we have identified three procedures applied by the abstractor and, besides, obtained additional information about the structure of the primary source. Along with the lexical dimension, it comprises at least four more dimensions: communicative, modal, semantic and syntactic, and relational.
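Two of these procedures, elimination and transposition, lend themselves to a rough sketch. It is only an illustration of the idea: the phrases to be eliminated and the sample sentences are invented (they are not taken from the extract of Caudery's paper), the function names are hypothetical, and modification is left out because it requires genuine syntactic restructuring.

```python
import re


def eliminate(sentence, phrases):
    """Remove each listed phrase, then tidy spacing and capitalisation."""
    for phrase in phrases:
        pattern = re.escape(phrase) + r",?\s*"
        sentence = re.sub(pattern, "", sentence, flags=re.IGNORECASE)
    sentence = re.sub(r"\s+", " ", sentence).strip()
    return sentence[:1].upper() + sentence[1:]


def transpose(extract, order):
    """Reorder (number, sentence) pairs to follow the given sentence numbers."""
    by_number = dict(extract)
    return [(number, by_number[number]) for number in order]


# Invented extract; the numbers stand for positions in a primary source.
extract = [
    (40, "However, the task is simple."),
    (72, "As I have argued, genres differ in structure."),
]
# Hypothetical connective and first-person phrases to eliminate.
phrases = ["however", "as I have argued"]

edited = [(n, eliminate(s, phrases)) for n, s in extract]
print(transpose(edited, [72, 40]))
```

The sentence numbers are kept alongside the text so that the edited version can still be compared with the extract, as in the contrastive analysis above.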
It is not always the case with manual summarization that the abstractor just takes sentences out of the primary sources and transforms their structure. Sometimes he creates entirely new sentences. The specific feature of such sentences is that they express judgements about the content of the primary source, whereas sentences extracted from the primary source express judgements about some properties of the object studied in the primary source. The name of the object makes up the topic of such sentences. Cf.:
The monograph examines gender stereotyping in a sample of advertising texts
Using traditional terminology we may call sentences of these types indicative and informative. Summaries consisting of indicative sentences can be called indicative, summaries consisting of informative sentences can be called informative, and summaries consisting of some combinations of these two types can be called indicative-informative.
Now we have some knowledge of the components of summarization. The linguistic structure of the primary source comprises five dimensions: a lexical dimension consisting of subject field vocabulary; a communicative dimension including manifestations of connections between sentences; a semantic and syntactic dimension including semantic types of sentences; a modal dimension including different types of modi; and a relational dimension including different types of logical relations between judgements expressed by sentences. The abstractor's linguistic activities include such procedures as elimination, modification, transposition, substitution and others. The linguistic structure of a secondary source can be indicative, informative, or indicative-informative. Moreover, we can distinguish three main stages of summarization: compiling a dictionary, extracting sentences that reflect the most important information contained in the primary source, and editing.
This knowledge has practical applications to teaching foreign languages.
4 Summarization in teaching a foreign language
4.1 Summarizing scientific texts
Extraction methodologies can be applied in teaching foreign languages at an advanced level since they involve such linguistic skills as memorizing and correctly using a number of lexical units, transforming syntactic structures and defining connections between sentences. While the application of symmetric extraction seems problematic, as it is difficult to calculate connections manually, marking extraction can be successfully applied to summarizing scientific texts in a classroom environment.
The specific summarization procedures follow.
Application of these procedures can be exemplified by text 1 (see Appendix).
The list of these procedures is in no sense complete and must be further specified by comparing linguistic structures of extracts and summaries. It would be interesting to compare these procedures with those used in generative grammar (movement, insertion, deletion).
Students can use this methodology to abstract larger texts. It requires knowledge and linguistic skills in the fields of communicative syntax and systemic syntax. Students need to be aware of such notions as topic and focus, of different types of connections between sentences, and of different ways of transforming the syntactic structure of a sentence. Consequently, this methodology is most appropriately applied at the college or university level.
4.2 Summarizing works of fiction
There are, of course, essential differences between scientific papers and fiction, which must be taken into account when summarizing fiction. In fiction the most essential information is the connection between the main characters and events in the plot. The following methodology can be used. It is based on memorizing lexical clichés giving information about the author, characters, and the plot of a short story or another piece of fiction. The clichés are arranged in the following plan for both written and verbal presentation.
1. The text (story) under discussion is entitled ...
The students are required to complete the sentence giving the title of the story.
2. It is written by ... a (the) famous (outstanding, prominent, well-known) English (American, Australian, Canadian, South African) novelist (short story writer, dramatist, poet) ...
The students complete the sentence giving the name of the author. They must memorize synonyms and display erudition in defining the author's nationality and literary specialization. It should be pointed out to the students that one and the same author can have several literary specializations. For example, O. Wilde was a novelist, a poet, and a dramatist. Sometimes it is possible to indicate the genre of literature; speaking about I. Asimov or H. Wells it would be natural to characterize them as science fiction writers. If the author is very famous, the students can give additional information, stating when he lived and naming his main works and the specific features of his literary style.
The teacher should draw attention to the use of the article with the personal name in apposition: if the author is very famous the definite article must be used. It is natural to use the definite article while referring to such writers as G. B. Shaw, Ch. Dickens, J. London, and the indefinite article with a reference to P. Abrahams or A. Cronin.
Special attention should be paid to correct pronunciation of the authors' personal names. Experience shows that students make mistakes in the pronunciation of the first as well as the last names under the influence of their native tongue, e.g. Isaac is pronounced [I'sa:k], in the same way as in Russian. The teacher can give the students a transcribed list of the names or ask them to transcribe the names on their own using different reference sources.
3. The action takes place in ...
The students specify time and/or the place of action using information given in the text.
4. Main characters of the text are ... Secondary characters of the text are ...
The students give the names of the characters. Usually it isn't difficult for them to distinguish between main and secondary characters, especially in a short story or a separate chapter in a book. Students at advanced levels can give additional information about the main characters, naming their occupation (doctor, teacher, businessman, etc.) and the relations between them (love, hate, friendship, etc.). For example: Main characters of the text are Andrew Manson, a doctor, and Ms. Barlow, a schoolteacher. They love each other. It is a good opportunity for students to memorize lexical units denoting different occupations, professions and human feelings.
5. The text can be divided into ... logical parts.
Students give the number of logical parts. As a rule different students will divide the same text into different numbers of parts. It's important to explain to them that there may be two main criteria for such division: according to the places of action and according to main events. Sometimes both criteria can be used. The teacher can suggest to the students dividing a text into parts using the two criteria so as to compare the two variants.
6. The first (second, third, etc.) part deals with (is concentrated upon, is focused upon, is devoted to) the description of ... (conversation between ..., smb.'s doing smth).
Students must describe contents of each part of the text. Three variants of the description can be suggested.
7. The main idea of the text is to show (to reveal, to expose, to denounce, etc.) ...
Formulating this point is usually difficult for students and requires much thinking. The teacher should explain to them that they must reveal the purpose of the author and answer the questions: Why did the author write the story? What did he want to show to the reader?
8. I like (dislike) the text because it has an entertaining (a boring) plot and it is true (not true) to life.
The students must express and substantiate their personal attitude to the text. The use of clichés according to the plan can be exemplified by one of I. Asimov's (1989) short stories. The following is the story's summary, which is defined as indicative-informative.
The story under discussion is entitled "The Fun They Had". It is written by Isaac Asimov, a well-known contemporary American science fiction writer. The action takes place in 2157. Main characters are Margie, a girl at the age of 11, and her elder brother Tommy, who is 13. Secondary characters are their mother, the County Inspector, and the mechanical teacher.
The advantage of text summarization according to such a plan is that it can be applied to any piece of fiction. When the plan is memorized the teacher can use it in oral exercises, suggesting to the students that they ask each other questions on the points of the plan, such as Who is the story under discussion written by? Into how many parts can the text be divided? Making up such questions is usually difficult for Russian-speaking students because of essential differences between interrogative sentences in Russian and in English. Another exercise, which can be done in class, is giving incorrect statements, i.e. statements contradicting the content of the story. It is a good opportunity to drill lexical units expressing disagreement: if the teacher gives the statement The action takes place in 1957, the students are supposed to answer I can't agree with you (nothing of the kind, on the contrary, etc.), the action takes place in 2157. At the next stage the students can make up incorrect statements themselves and give them to each other. Apart from this, the plan can be adjusted to fit students at different levels of studying a foreign language. It can be simplified by excluding some points (e.g. Nos. 5, 6, 8) so as to make up a purely indicative summary or, on the contrary, made more complex by adding further points, e.g. an analysis of the stylistic devices used by the author to express the main idea.
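Because the plan is a fixed sequence of clichés with slots, it can be sketched as a set of templates. The sketch below is an illustration only: the dictionary PLAN, the slot names and the function cliche_summary are hypothetical, and points 6 and 8 are omitted because they need free-form input. The facts are taken from the summary of "The Fun They Had" given above.

```python
# Hypothetical templates for some points of the plan (slot names are assumed).
PLAN = {
    1: "The story under discussion is entitled {title}.",
    2: "It is written by {author}, {description}.",
    3: "The action takes place in {setting}.",
    4: "Main characters are {main}. Secondary characters are {secondary}.",
    5: "The text can be divided into {parts} logical parts.",
    7: "The main idea of the text is to show {idea}.",
}


def cliche_summary(facts, points):
    """Fill the clichés for the chosen points of the plan, in the given order."""
    return " ".join(PLAN[p].format(**facts) for p in points if p in PLAN)


facts = {
    "title": '"The Fun They Had"',
    "author": "Isaac Asimov",
    "description": "a well-known contemporary American science fiction writer",
    "setting": "2157",
    "main": ("Margie, a girl at the age of 11, "
             "and her elder brother Tommy, who is 13"),
    "secondary": "their mother, the County Inspector, and the mechanical teacher",
}

# Only the points for which facts are available are requested here.
print(cliche_summary(facts, [1, 2, 3, 4]))
```

Choosing a shorter or longer list of points corresponds to simplifying or extending the plan for students at different levels.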
Methodologies of text summarization developed within the scope of computer and information sciences can be successfully applied not only to developing information retrieval systems but also to teaching a foreign language. Moreover they can give important insights into the linguistic structure of summarization that comprises 1) the linguistic structure of the primary source, 2) the linguistic structure of the secondary source, and 3) the transformation operations applied by the abstractor.
The linguistic structure of the primary source comprises five dimensions: a lexical dimension consisting of subject field vocabulary; a communicative dimension including manifestations of connections between sentences; a semantic and syntactic dimension including semantic types of sentences; a modal dimension including different types of modi; and a relational dimension including different types of logical relations between judgements expressed by sentences. The abstractor's linguistic activities include such procedures as elimination, modification, transposition, substitution and others. The linguistic structure of a secondary source can be indicative, informative, or indicative-informative. The summarization process consists of three main stages: 1) compiling a dictionary; 2) extracting sentences that reflect the most important information contained in the primary source; 3) editing. According to the type of dictionary, summarization methodologies can be corpus based or non-corpus based. Corpus based methodologies operate with dictionaries of subject field (speciality) terms; non-corpus based methodologies operate with dictionaries of non-subject field terms.
Three methodologies of text summarization have been outlined in this paper: symmetric extraction, marking extraction, and cliché summarization. Symmetric extraction is a corpus based methodology that can be applied to summarizing scientific and newspaper texts in information retrieval systems. Marking extraction and cliché summarization are non-corpus based methodologies that can be applied, respectively, to summarizing scientific texts and works of fiction in the process of teaching a foreign language.
Asimov, Isaac (1989): The fun they had. A practical course of English for third year students. Moscow: Vysshaya Shkola, 146-149.
Blumenau, D. I. (1982): Problemy svertyvaniya nauchnoi informatsii. Leningrad: Nauka.
Caudery, Tim (1998): "Increasing students' awareness of genre through text transformation exercises: An old classroom activity revisited", in: TESL-EJ 3/3, 1-16. [see: http://www-writing.berkeley.edu/TESL-EJ/ej11/a2.html, day of last visit: 2001-17-09]
Iatsko, Viatscheslav A. (1998): Prakticheskaya grammatika angliiskogo iazyka. Abakan: Katanov State University of Khakasia Press.
Leonov, V. P. (1986): Referirovanie i annotirovanie nauchno-tekhnicheskoi literatury. Novosibirsk: Nauka.
Luhn, H. P. (1958): "The automatic creation of literature abstracts", in: The IBM journal, 159-165.
Maizell, Robert E. et al. (1971): Abstracting Scientific and Technical Literature. An Introductory Guide and Text for Scientists, Abstractors, and Management. New York; Chichester: Wiley-Interscience.
Skorokhodko, E. F. (1983): Semanticheskie seti i avtomaticheskaya obrabotka teksta. Kiev: Naukova Dumka.
Trawinsky, B. A. (1989): "A methodology for writing problem structured abstracts", in: Information processing and management 25, 693-702.
A Concise Dictionary of Markers
Aim of research: paper (article, study, research)
Existing method of problem solving: method (device, approach, methodology, technique, analysis, theory, thesis, conception, hypothesis)
Evaluation of existing method of problem solving: method (device, approach, methodology, technique, analysis, theory, thesis, conception, hypothesis)
New method of problem solving: method (device, approach, methodology, technique, analysis, theory, thesis, conception, hypothesis)
Evaluation of new method of problem solving: method (device, approach, methodology, technique, analysis, theory, thesis, conception, hypothesis)
The main aim of this article is to draw attention to a type of learning task that combines the two types of activity, the reading and the writing of texts of different genres.