Ioanna Malagardi: INFORMATION EXTRACTION AND KNOWLEDGE ACQUISITION FROM TEXTS

Sunday, 17 January 2010

INFORMATION EXTRACTION AND KNOWLEDGE ACQUISITION FROM TEXTS USING BILINGUAL QUESTION-ANSWERING

John Kontos & Ioanna Malagardi

Department of Informatics
Athens University of Economics & Business

Journal of Intelligent and Robotic Systems. Kluwer Academic Publishers, 26(2): 103-122, 1999

Abstract
A novel approach is introduced in this paper for the implementation of a question-answering based tool for the extraction of information and knowledge from texts. This effort resulted in the computer implementation of a system that answers bilingual questions directly from a text using Natural Language Processing. The system uses domain knowledge concerning categories of actions and implicit semantic relations. The present state of the art in information extraction is based on the template approach, which relies on a predefined user model. The model guides the extraction of information and the instantiation of a template, similar to a frame or a set of attribute-value pairs, as the result of the extraction process.
Our question-answering based approach aims to create flexible information extraction tools accepting natural language questions and generating answers that contain information extracted from text either directly or after applying deductive inference. Our approach also addresses the problem of implicit semantic relations occurring either in the questions or in the texts from which information is extracted. These relations are made explicit with the use of domain knowledge. Examples of application of our methods are presented in this paper concerning four domains of quite different nature. These domains are: Oceanography, Medical Physiology, Aspirin Pharmacology and Ancient Greek Law. Questions are expressed both in Greek and English.
Another important point of our method is to process text directly avoiding any kind of formal representation when inference is required for the extraction of facts not mentioned explicitly in the text. This idea of using text as knowledge base was first presented in (J. Kontos, 1982) and further elaborated in (J. Kontos, 1985, 1992, 1996) as the ARISTA method. This is a new method for knowledge acquisition from texts that is based on using natural language itself for knowledge representation.
Keywords: Information Extraction, Knowledge Acquisition, Question Answering, Texts as Knowledge Bases
Short Title: EXTRACTION AND ACQUISITION WITH QUESTION ANSWERING

1. Introduction
The research presented in this paper is part of a project aiming at the development of a novel method for information extraction and knowledge acquisition from texts using bilingual question-answering techniques. The present state of the art in information extraction (J. Cowie and W. Lehnert, 1996) and (M.T. Pazienza, 1997) is based on the template approach, whose early implementations are reported in (N. Sager, 1981) and (J. Cowie, 1983). Independently, the implementation of an extraction system based on a syntax-directed method applied to a corpus of abstracts from oceanographic papers was reported in (J. Kontos, 1983). This work was a precursor of the method presented in the present paper. The template approach relies on a predefined user model, which guides the extraction of information and the instantiation of a template, similar to a frame or a set of attribute-value pairs, as the result of the extraction process.
An example illustrating the concept of the template that is mentioned by R. Grishman (1997) is shown below. This is a simplified example from one of the earlier MUCs (Message Understanding Conferences), involving terrorist events (MUC-3, 1991). The Message Understanding Conference (MUC) series is organised by USA authorities for the formal evaluation of information extraction systems constructed by university groups and other research groups. The proceedings of this series of conferences are published as a series of volumes, and the papers contained in them describe both the structure of the systems participating in the competition and the results of the formal evaluation of these systems. The systems are tested during the MUCs with text bases of authentic texts from newspapers or other sources. One of the first text bases used for these purposes refers to terrorist events. The terrorist event report example from (R. Grishman, 1997) is:
“A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb - allegedly detonated by urban guerrilla commandos - blew up a power tower in the northwestern part of San Salvador at 0650 (1250 GMT)”.

A template filled with information extracted from the above text fragment is:

INCIDENT TYPE               BOMBING
DATE                        March 19
LOCATION                    El Salvador: San Salvador (city)
PERPETRATOR                 Urban guerrilla commandos
PHYSICAL TARGET             Power tower
HUMAN TARGET                ------
EFFECT ON PHYSICAL TARGET   Destroyed
EFFECT ON HUMAN TARGET      no injury or death
INSTRUMENT                  Bomb

For each terrorist event, the system had to determine the type of attack (bombing, arson, etc.), the date, location, perpetrator (if stated), targets, and effects on targets. Other examples of extraction tasks are international joint ventures (where the arguments include the partners, the new venture, its product or service, etc.) and executive succession (indicating who was hired or fired by which company for which position).
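
As an illustration only, the filled template above can be viewed as a set of attribute-value pairs. The following minimal sketch uses Python rather than the Turbo Prolog of our system; the slot names and values follow the example, while the dictionary layout and the helper name unfilled_slots are assumptions made for this sketch and are not part of any MUC system.

```python
# Illustrative only: the example template as attribute-value pairs.
# Slot names and values follow the example above; None marks an empty slot.
template = {
    "INCIDENT TYPE": "BOMBING",
    "DATE": "March 19",
    "LOCATION": "El Salvador: San Salvador (city)",
    "PERPETRATOR": "Urban guerrilla commandos",
    "PHYSICAL TARGET": "Power tower",
    "HUMAN TARGET": None,
    "EFFECT ON PHYSICAL TARGET": "Destroyed",
    "EFFECT ON HUMAN TARGET": "no injury or death",
    "INSTRUMENT": "Bomb",
}


def unfilled_slots(tpl):
    """Return the names of the slots for which the text supplied no value."""
    return [slot for slot, value in tpl.items() if value is None]
```

The fixed set of slots is what makes a template a predefined user model: every report is forced into the same nine attributes, whether or not the text mentions them.
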
Our question-answering based approach aims to create flexible information extraction tools accepting natural language questions and generating answers that contain information extracted from text either directly or after applying deductive inference. Our approach also addresses the problem of implicit semantic relations occurring either in the questions or in the texts from which information is extracted. These relations are made explicit with the use of domain knowledge. The foundation of our method can be found in (J. Kontos, 1970, 1983). Examples of application of our methods are presented in this paper concerning four domains of quite different nature. These domains are: Oceanography, Medical Physiology, Aspirin Pharmacology and Ancient Greek Law. Questions are expressed both in Greek and English.
The need for a question-answering approach to Information Extraction has only recently been recognized as a desirable change in technology, as discussed by Y. Wilks (1997), who is presumably unaware of the precedent of our work on the subject over the past quarter of a century. From Y. Wilks (1997) we quote:
Suppose parsing systems that produce syntactic and logical representations were so good, as some now believe, that they could process huge corpora in an acceptably short time. One can then think of the traditional task of computer question answering in two quite different ways. The old way was to translate a question into a formalised language like SQL and use it to retrieve information from a database - as in “Tell me all the IBM executives over 40 earning under $50K a year”. But with a full parser of large corpora one could now imagine transforming the query to form an IE template and searching the WHOLE TEXT (not a data base) for all examples of such employees. Both methods should produce exactly the same result starting from different information sources: a text versus a formalised database. What we have called an IE template can now be seen as a kind of frozen query that one can reuse many times on a corpus and is therefore only important when one wants stereotypical, repetitive, information back rather than the answer to one-off questions. “Tell me the height of Everest”, as a question addressed to a formalised text corpus, is then neither an IR nor IE but a perfectly reasonable single request for an answer. “Tell me about fungi”, addressed to a text corpus with an IR system, will produce a set of relevant documents but no particular answer. “Tell me what films my favorite movie critic likes”, addressed to the right text corpus, is undoubtedly IE as we saw, and will produce an answer also. The needs and the resources available determine the techniques that are relevant, and those in turn determine what it is to answer a question as opposed to providing information in a broader sense.
Almost all of the above considerations coincide with the thoughts on which we based the launching of our Question-Answering based Information and Knowledge Extraction project in the early eighties. However, the kind of questions that we studied includes forms more complex than a simple question like “Tell me the height of Everest”, which is answered by the extraction of the value of a property of an entity. We have managed to treat questions which are answered by the extraction of the direct or even the indirect relationship between entities or changes of properties of entities. Most of our work has been based on the analysis of texts that contain knowledge about causal relations.
Another important point of our method is that text is processed directly, avoiding any kind of formal representation, even when inference is required for the extraction of facts not mentioned explicitly in the text. This idea was first proposed in (J. Kontos, 1980) and was applied to simple information extraction from texts in (J. Kontos, 1982, 1983). The method was further elaborated in (J. Kontos, 1985, 1992, 1996) as the ARISTA method. This is a new method for knowledge acquisition from texts that is based on using natural language itself for knowledge representation. The basic idea of ARISTA lies in avoiding, wherever possible, the translation of natural language into a formal language. In the case of Information Extraction and Knowledge Acquisition with Question-Answering, employing the ARISTA method avoids the translation of both the questions and the texts from which they are answered into formal representations. Almost all other methods in this field are based on the translation of such questions and texts into SQL, logic, templates or frames, e.g. (J. Kontos, 1988).

2. Method
Our method is based on question answering and aims at the creation of flexible information extraction tools which accept natural language questions in either Greek or English and generate answers that contain information extracted from text either directly or after applying deductive inference. The information extraction task can thus be performed interactively enabling the user to submit natural language questions to the system and therefore allowing for greater flexibility than template based systems.
Our method uses a question grammar combined with a text grammar for the extraction of information or knowledge. These two grammars use syntax rules and domain dependent lexicons while the semantics of the question grammar provides the means of their combination. An illustrative question grammar fragment is presented below.

2.1. Question Grammar
The semantics of a question grammar may provide the means of combining question processing with information extraction. An Ancient Greek Law text will be used in this section to illustrate the form of a question grammar that processes questions and extracts the information needed to answer them. This text exists as a stone inscription in the ruins of the ancient city of Gortys, which is situated on the Greek island of Crete.
The Inscription is dated to the end of the 6th or the beginning of the 5th century B.C. The whole text amounts to more than 600 lines and about 3000 words. The writing is “boustrophedon”, the first line of each column running from right to left and the rest of the lines alternating in direction. The Code is inscribed in the archaic Greek alphabet of eighteen letters including F (digamma). The dialect of the Code is Cretan Doric Greek, and in particular the dialect of Central Crete. Each of its regulations is formulated as a conditional sentence in the third person, with the “protasis” consisting of the assumed facts or “hypothesis” and the “apodosis” consisting of the legal consequences.
The Inscription covers the following topics: Property, Marriage and Kinship, Heiress, Rape, Adultery and Divorce, Illegitimate Children, Adoption, The Administration of Justice (R. F. Willets, 1967). The user may submit to our system a question in either Greek or English, which the system answers after performing the necessary information and knowledge extraction from the Greek Gortys text. Two illustrative questions are given below in “wrong” English that keeps the word order of the original Greek text, for easy comparison between the Greek and its English translation:

What does the child of a free woman inherit to whom comes a slave?
Ti klironomei to teknon eleytheris gynaikas pros tin opoia erhetai enas sklavos?
whose answer is “the estate of the free woman”.

Who inherits the estate of a free woman to whom comes a slave?
Poios klironomei tin periousia eleytheris gynaikas pros tin opoia erhetai enas sklavos?
whose answer is “the free child”.

The above two questions in correct English are written as follows:

• What does the child of a free woman to whom a slave comes inherit?
• Who inherits the estate of a free woman to whom a slave comes?

The part of the ancient legal text from which the information necessary for answering the above questions is extracted consists of two sentences with the following rendering in English:
• If a slave comes to a free woman, then a free woman bears free children.
• If a free woman bears a free child, then the child inherits her estate.

An illustrative fragment of a question grammar for questions like the above written in Turbo Prolog is presented below:

1). q(Q,TAP):-f(Q,Question_Word,R1),f(R1,Benefit,R2), f(R2,to,R3),
f(R3,TEKNON,R4),f(R4,EleytheraS,R5),onom(EleytheraS,Eleythera),
relative_clause(R5,Act,Agent),
relation(TEKNON,EleytheraS,Relation),
template(S,s,Agent,Act,Eleythera),
template(S,a,Eleythera,Relation,Eleytheron),
template(N,s,Eleythera,Relation,Eleytheron).

2). q(Q,TAP):-f(Q,Question_Word,R1),f(R1,Benefit,R2), f(R2,tin,R3),
f(R3,Perioysian,R4),f(R4,EleytheraS,R5),onom(EleytheraS,Eleythera),
relative_clause(R5,Act,Agent),
relation(TEKNON,EleytheraS,Relation),
template(S,s,Agent,Act,Eleythera),
template(S,a,Eleythera,Relation,Eleytheron),
template(N,s,Eleythera,Relation,Eleytheron).

3). relative_clause(Rel_cl,Verb,Sub_Obj):-pronoun(Rel_cl,R1),
f(R1,Verb,R2),morph(Verb,Verb_root),
prepositional_phrase(R2,Sub_Obj).

4). prepositional_phrase(P,SO):-f(P,pros,SO);f(P,SO,"").

5). pronoun(P,R):-f(P,pros,R1),f(R1,tin,R2),f(R2,opoian,R);f(P,i,R1),f(R1,opoia,R).

This grammar accepts questions in Greek of two forms, described by the syntactic parts of rules (1) and (2). The first form, defined by the syntactic part of rule (1), can be seen from the English translation, e.g. “What does the child of a free woman to whom a slave comes inherit?”, which, when transformed to correspond to the Greek word order, becomes “What inherits the child of a free woman to whom comes a slave?” The question phrase, which in Greek may be a single word, corresponds to the value of the variable Question_Word of the predicate “f(_,_,_)”, which is a shorthand version of the built-in Prolog predicate “fronttoken(_,_,_)”. The main verb “inherits” (“lamvanei” in Greek) of the question corresponds to the value of the string variable “Benefit”. The subject of the main verb is the complex noun phrase “the child of a free woman to whom comes a slave” (here again we keep the word order of the original text). This complex noun phrase contains the relative clause “to whom comes a slave”, the form of which is specified by rules (3), (4) and (5), which are common to both forms of questions described by rules (1) and (2). The predicate “relation(_,_,_)” illustrates the use of domain or microcosmos knowledge necessary for the analysis of the question. The predicate “template(_,_,_,_,_)”, which was given this name as a reminder of the role played by templates in other methods of information extraction, connects the question grammar with the text grammar. Using the predicate “template” more than once in question analysis rules like (1) and (2) enables us to treat complex questions that need scenarios for the extraction of the information necessary for answering them.

Examples of questions that may be processed with the above grammar are:
qp:-q("ti lamvanei to teknon eleytheras pros tin opoian erhetai doylos",P).
Which means: What does the child of a free woman to whom a slave comes inherit?
qw:-q("poios lamvanei tin perioysian eleytheras pros tin opoian erhetai doylos",P).
Which means: Who inherits the estate of a free woman to whom a slave comes?
qr:-q("ti lamvanei to teknon eleytheras i opoia erhetai pros doylon",P).
Which means: What does the child of a free woman who comes to a slave inherit?
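
The way rule (1) consumes a question token by token can be sketched as follows. This is a minimal illustration in Python, not the Turbo Prolog implementation: the function fronttoken here only splits on whitespace (an assumption; the built-in predicate has its own tokenisation rules), and parse_question, together with its slot names, is hypothetical. It covers only the syntactic part of rule (1) and none of its semantics.

```python
def fronttoken(s):
    """Split s into (first token, rest), loosely mimicking fronttoken/3.

    Assumption: tokens are whitespace-separated words.
    """
    parts = s.strip().split(None, 1)
    if not parts:
        return None
    return parts[0], (parts[1] if len(parts) > 1 else "")


def parse_question(q):
    """Recognise the first question form: question word, main verb,
    the article 'to', a noun, a noun in the genitive, then a relative clause."""
    slots = {}
    rest = q
    for name in ("question_word", "benefit", "article", "noun", "genitive"):
        step = fronttoken(rest)
        if step is None:
            return None
        slots[name], rest = step
    if slots["article"] != "to":
        return None  # not the form described by rule (1)
    slots["relative_clause"] = rest
    return slots
```

Calling parse_question on the first example question binds the question word to "ti" and the main verb to "lamvanei", and leaves "pros tin opoian erhetai doylos" as the relative clause to be analysed by rules (3) to (5). The second example question is rejected by this sketch because its article is "tin", which is the case handled by rule (2).
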

2.2. Text Grammar
The text analysis performed by the system that was implemented on the computer is based on logic grammars appropriate for each text domain. An original parsing method appropriate for languages with a relatively free word order was used for the Greek text. This method consists of the automatic translation of every sentence into a number of logical facts written in Prolog and of the recognition of syntactic constituents as logical combinations of these facts. These facts take the form of a logical predicate with three arguments. The first argument specifies the number of the sentence that contains a given word, the second argument specifies the position of the word in the sentence and the third specifies the word itself. In traditional methods of syntactic analysis by computer, which are mainly used for the analysis of English texts, one syntactic rule must be written for every particular sequence of words. This means that if we apply such a method to the syntactic analysis of Greek, a plethora of syntactic rules will be needed for the same constituent due to the word order freedom of this language. By contrast, the method followed in the present system allows the statement of a single syntactic rule for the parsing of two or more equivalent syntactic structures that differ only in the relative position of the words involved.
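
The idea of word-position facts can be sketched as follows, in Python rather than Prolog and purely for illustration. The function names, the two-entry lexicon and its tags are assumptions made for this sketch; they are not the lexicon or the rules of the system.

```python
def word_facts(sentence_no, sentence):
    """Translate a sentence into (sentence number, position, word) facts."""
    return [(sentence_no, pos, word)
            for pos, word in enumerate(sentence.split(), start=1)]


# Hypothetical mini-lexicon: word -> (part of speech, case).
MINI_LEXICON = {
    "doylos": ("noun", "nominative"),
    "erhetai": ("verb", None),
}


def subject_verb(facts, lexicon):
    """A single order-independent rule: a nominative noun and a verb that
    occur in the same sentence, whatever their relative position."""
    for s1, _, w1 in facts:
        for s2, _, w2 in facts:
            if (s1 == s2
                    and lexicon.get(w1) == ("noun", "nominative")
                    and lexicon.get(w2, (None, None))[0] == "verb"):
                return w1, w2
    return None
```

Because the rule never compares positions, the same rule recognises both "erhetai doylos" and "doylos erhetai". A position test is added only where adjacency matters, as in the condition N2=N1+1 of the rules given below.
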
The form of the sentences that can be analyzed by the rules developed for the present system in the case of the Gortys text consists of one verb and its complements since this was the form of the hypotheses found in the text. The case of the missing subject is treated using the valency of the verb for predicting the number of its complements. The complements of the verbs are recognized by syntactic rules that analyze the following basic forms of noun phrases:

• Pronouns
• Nouns
• Article + Noun
• Article + Participle
• Adjective + Noun
• Noun Phrase + or + Noun Phrase
• Noun Phrase + and + Noun Phrase
• Noun in Nominative + Noun in Genitive
• Noun + Pronoun + Article + Noun

These forms are recognized by the following text grammar example rules written in Turbo Prolog where “f(_,_,_)” is again a shorthand for “fronttoken” as above and “c(X,Y,Z)” means that Z is the concatenation of X and Y. The predicate “template(_,_,_,_,_)” used above in the question grammar in this particular case must be equated to the predicate “pr(_,_,_,_,_)”. If one needs to extract information from texts with different text grammars then “template” will be equated with the appropriate predicate of that grammar.
Text Grammar Example
pr(N,s,M,E2):-s(N,S),f(S,E2,R),f(R,M,_).
pr(N,a,V,E1):-ap(N,A),f(A,E1,R),f(R,V,_).
pr(N,a,E1,V,E2):-ap(N,A),f(A,E1,R),
f(R,V,R1),f(R1,E2,"").
pr(N,s,E1,V,E2):-s(N,A),f(A,E1,R),
f(R,V,R1),f(R1,E2,"").
______________________________________________________________________
p(S,np,NP,P):-w(S,N1,D),w(S,N2,N),N2=N1+1,
l(D,_,d,Nu,P,G),l(N,_,e,Nu,P,G),
c(D,N,NP).
p(S,np,NP,P):-w(S,N1,D),w(S,N2,N),N2=N1+1,
l(D,_,d,Nu,P,G),l(N,_,met,Nu,P,G),
c(D,N,NP).

p(S,np,NP,P):-w(S,N1,Ad),w(S,N2,N),N1<>N2,%N2=N1+1,
l(Ad,_,ad,Nu,P,G),l(N,_,e,Nu,P,G),
c(Ad,N,NP).

p(S,np,NP,P):-w(S,N1,E1),w(S,N2,i),N2=N1+1,
w(S,N3,E2),N3=N2+1,
l(E1,_,e,Nu,P,_),l(E2,_,e,Nu,P,_),
c(E1,i,NP1),c(NP1,E2,NP).

p(S,np,NP,P):-w(S,N1,E1),l(E1,_,e,_,P,_),
w(S,N2,i),N2=N1+1,
p(S,npq,NPQ,P,N3),N3=N2+1,
c(E1,i,NP1),c(NP1,NPQ,NP).

p(S,np,NP,P):-w(S,N1,E1),w(S,N2,kai),N2=N1+1,
w(S,N3,E2),N3=N2+1,
l(E1,_,e,Nu,P,_),l(E2,_,e,Nu,P,_),
c(E1,kai,NP1),c(NP1,E2,NP).

p(S,np,NP,P):-w(S,_,NP),l(NP,_,e,_,P,_).
p(S,np,NP,P):-w(S,_,NP),l(NP,_,pr,_,P,_).

p(S,npc,NPc,o):-w(S,_,No),l(No,_,e,_,o,_),
w(S,_,Na),l(Na,_,e,_,g,_),r(No,Na,R),
c(No,Na,NPc),write(S,R),nl.

p(S,npc,NPc,o):-w(S,_,No),l(No,_,ad,_,o,_),
w(S,_,Na),l(Na,_,e,_,g,_),r(No,Na,R),
c(No,Na,NPc),write(S,R),nl.

p(S,npq,NP,P,N1):-w(S,N1,E1),l(E1,_,e,_,P,_),
w(S,N2,ek),N2=N1+1,w(S,N3,D),
l(D,_,d,_,g,_),N3=N2+1,
w(S,N4,Pr),l(Pr,_,pr,s,g,_),
N4=N3+1,w(S,N5,E2),l(E2,_,e,_,g,_),
N5=N4+1,c(E1,ek,S1),c(S1,D,S2), c(S2,Pr,S3),c(S3,E2,NP).

p(S,se,SE,V,NPc,_):-a(S,_,V,NE),l(V,_,iv,_),
p(S,npc,NPc,o),c(V,NPc,S1),c(NE,S1,SE).

p(S,se,SE,V,NP,_):-a(S,_,V,Ne),l(V,_,iv,_),
p(S,np,NP,o),c(Ne,V,CV),c(CV,NP,SE).

p(S,se,SE,V,_,A):-a(S,_,V,Ne),l(V,_,v,_),w(S,_,A),
l(A,_,ad,_,a,_),c(Ne,V,CV),c(CV,A,SE).

a(S,N1,V,den):-w(S,N1,den),w(S,N2,V),N2=N1+1.
a(S,N,V,""):-w(S,N,V).

2.3. Extraction of Implicit Semantic Relations
During the processing of some texts the problem of discovering and extracting by computer implicit semantic relations between concepts occurred. The discovery of such semantic relations requires the codification and processing by computer of the appropriate domain knowledge (I. Malagardi, 1995a, 1995b, 1996).
The example of the ancient Greek text of the Law Code of Gortys is used below to illustrate the kind of knowledge used. This knowledge is divided into three main parts: a) ontology of entities, b) ontology of actions and c) specification of implicit relationships between entities. The actions for the domain of the ancient Greek text are expressed with verbs, which are classified as follows:
1. offenses: rape, take
2. existence: live, exist, die
3. general actions: leave, bear, divorce, guarantee, marry

The categories of implicit relationships between nouns are “being a relative of” and “responsibility” e.g.:
1. being a relative of: brother of father
2. responsibility: responsible of divorce

This knowledge of implicit relationships is used for the analysis of noun phrases of the following forms:
• noun in the nominative + noun in the genitive
• adjective in the nominative + noun in the genitive
These relationships are used for generating the answers to the corresponding questions.
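
A lookup of this kind can be sketched as follows. The sketch is in Python and only illustrative: the table holds the two examples given above, with English glosses standing in for the Greek words, and the names IMPLICIT_RELATIONS and implicit_relation are hypothetical.

```python
# Hypothetical domain knowledge: (head word, word in the genitive) -> relation.
# The two entries mirror the examples in the text; English glosses stand in
# for the Greek words of the inscription.
IMPLICIT_RELATIONS = {
    ("brother", "father"): "being a relative of",
    ("responsible", "divorce"): "responsibility",
}


def implicit_relation(head, genitive):
    """Return the implicit relation of a 'nominative + genitive' phrase,
    or None when the domain knowledge has no entry for the pair."""
    return IMPLICIT_RELATIONS.get((head, genitive))
```

The lookup plays the part that the predicate “r(_,_,_)” plays in the text grammar rules for complex noun phrases: the relation is not stated in the phrase itself and has to be supplied by the domain knowledge.
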

3. The lexica
The texts used as examples of application of our system obviously belong to quite different domains and therefore pose the requirement for specialized lexica. Each specialized lexicon contains all the words from the domain that the particular text processed by the system belongs to. These words are grouped in three categories depending on the number of their characteristic attributes. The first category consists of words, whose entries have a single attribute that specifies the part of speech. The words in this category are mainly function words. The second category consists of verbs and their entries have two attributes. The first attribute specifies whether the verb is transitive or intransitive and the second attribute specifies the number of the verb. The third category has entries with four attributes and contains the nouns, adjectives and participles found in every text. The four attributes specify the part of speech, the number, the case and the gender of every word of this category.
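
The three categories of entries can be sketched as follows, in Python and for illustration only. The class names, the three sample entries and their attribute values are assumptions made for this sketch and are not taken from the lexica of the system.

```python
from typing import NamedTuple


class FunctionWord(NamedTuple):
    """First category: a single attribute, the part of speech."""
    pos: str


class Verb(NamedTuple):
    """Second category: transitivity and number."""
    transitivity: str
    number: str


class Nominal(NamedTuple):
    """Third category (nouns, adjectives, participles): four attributes."""
    pos: str
    number: str
    case: str
    gender: str


# Hypothetical sample entries; the attribute values are assumed.
SAMPLE_LEXICON = {
    "pros": FunctionWord("preposition"),
    "erhetai": Verb("intransitive", "singular"),
    "doylos": Nominal("noun", "singular", "nominative", "masculine"),
}


def attribute_count(word):
    """Return the number of attributes of a word's entry, or None if absent."""
    entry = SAMPLE_LEXICON.get(word)
    return None if entry is None else len(entry)
```

The number of attributes (one, two or four) is what distinguishes the categories, which corresponds to the different arities with which the lexicon predicate “l” appears in the text grammar rules.
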

4. Knowledge Extraction from Text
The deductive computer analysis of knowledge delivered by texts is traditionally performed in two stages: in the first stage, the text is translated by computer or, more commonly, by hand into some formal representation. In the second stage, reasoning is performed using this formal representation of the content of the text. In our system the translation step is avoided and the analysis is performed directly from the natural language texts following the ARISTA method of text analysis (J. Kontos, 1992). In order to enable our system to analyse texts containing causal knowledge, we should provide access to the following types of prerequisite knowledge:

a. Knowledge related to the subject matter of the text to be processed.
b. Linguistic knowledge allowing the system to treat the linguistic structures appearing in the text.
c. Reasoning knowledge needed for processing and deductive question-answering.

In most cases our system aims at the extraction of causal knowledge. Causal knowledge can be delivered by the logical connection between phrases that express process-entity pairs through a “causal linker”.

4.1. Linguistic Expression of Causal Relations
In natural language, “causality” may be expressed by a variety of linguistic forms. The linguistic forms expressing the parts of causal relations involve, among others, clausal adverbial subordination, clausal adjectival subordination (i.e. anaphoric links), discourse connection, single sentences (passivised or not), and sentential constituents (nominalisations and adverbial modification).
A causal relation is typically a pair consisting of an “antecedent” and of a “consequent”. Antecedents and consequents may appear in two connected sentences in terms of some causal implication. In passivised sentences, the by-phrase introduces the antecedent of the causal relation. Antecedents of causal relations may also be identified in by-phrases associated with past participles, rather than passive verbs. Note that certain verbs (passivised or not) and participles may function as mere implicational/causal connectives (cause, lead to, brought on) in contrast to others (increase, introduce etc.) that may have some additional meaning component encoded.
Processes functioning as antecedents or consequents of some causal relation can also be referred to in terms of nominalisations in causal-relation delivering sentences. In dealing with sentential constituents rather than with complete sentences which may deliver one of the parts of some causal relation, we should note that, apart from nominalisations, adverbial modification (participles or prepositional phrases) may also provide material qualifying such antecedents/consequents.

4.2. Recognition of Causal Relations from Texts
Causal relations are recognized and extracted by means of a four-argument predicate called “cause”. The arguments of this predicate are:

(a) the consequent process related to some entity in the physical system,
(b) the entity in the physical system involved in the consequent,
(c) the antecedent process related to a similar (or to the same) entity, and
(d) the entity involved in the antecedent.

The automatic extraction of the appropriate terms which will fill in the process-entity slots in the “cause(_,_,_,_)” clause involves the following steps. Firstly, phrases qualifying for antecedents or consequents should be identified. Due to the variety of linguistic forms encoding causal relations, as indicated above, there may be more than one causal relation encoded in a sentence. Secondly, the terms representing each process-entity structure of the antecedent and consequent should be identified. Thirdly, after the identification of a process, the entity undergoing this process must fill in the entity-slot of the same antecedent or consequent "process-entity" structure. In cases where there are two entity nouns in a phrase qualifying some antecedents or consequents, a choice is made among these entities nouns on the basis of some additional knowledge. There is a further point to be made concerning the identification of the entity noun which is to fill in an entity-slot, namely, the recovery of a noun related to some process when the former is absent from the text.
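
The filling of the four slots can be sketched for the simplest case as follows. This Python sketch is only an illustration under strong assumptions: it handles sentences already normalised to the form “process of entity causes process of entity” or its passive “... is caused by ...”, and the function names are hypothetical. It does not attempt the choice among several entity nouns or the recovery of absent nouns described above.

```python
def split_process_entity(phrase):
    """Split a phrase of the assumed form 'process of entity'."""
    process, sep, entity = phrase.partition(" of ")
    if not sep:
        return None
    return process.strip(), entity.strip()


def extract_cause(sentence):
    """Return (consequent process, consequent entity,
               antecedent process, antecedent entity), or None."""
    s = sentence.rstrip(".")
    if " is caused by " in s:
        # Passive: the by-phrase introduces the antecedent.
        consequent, antecedent = s.split(" is caused by ", 1)
    elif " causes " in s:
        antecedent, consequent = s.split(" causes ", 1)
    else:
        return None
    c = split_process_entity(consequent)
    a = split_process_entity(antecedent)
    if c is None or a is None:
        return None
    return c + a
```

Both the active and the passive variant of the same sentence yield the same four arguments, in the order used by the “cause” predicate.
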

5. Extraction from the Primary Production Texts
The corpus of texts to which the combination of question answering with information extraction was first applied (J. Kontos, 1983) consisted of abstracts of research papers from the oceanographic domain of primary production in the sea. Primary production concerns the growth of phytoplankton and its dependence on environmental factors such as nutrients. These abstracts were taken from the “Deep-Sea Research Oceanographic Literature Review” for the years 1978, 1979 and 1980. The information extracted from these abstracts concerned facts about the causal dependence of biological processes such as growth and photosynthesis on environmental factors such as solar radiation and various chemical elements or compounds. These facts constitute the basic elements of scientific knowledge in this domain and are normally predicated with time and space information. An illustrative question-answering example for this application is:

Question: “What organism depends on what nutrient?”
Answer: “tricornutum depends on nutrient” from 1
“phytoplankton depends on N” from 2
“phytoplankton depends on nutrient” from 3
“phytoplankton depends on nitrate” from 4

Where the numbers 1-4 correspond to the text sentences from which the answers were extracted and are given below:

1. Growth of tricornutum related to nutrient content.
2. Numbers of phytoplankton correlated with N in the photic zone.
3. Nutrient enrichment in the basin stimulates phytoplankton growth.
4. Spatial distribution of nitrate correlated with phytoplankton activity.
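
The example above can be reproduced with a very small sketch. It is written in Python for illustration and is not the semantic grammar of the 1983 system: the three word lists are assumptions chosen to cover only these four sentences, and the function names are hypothetical.

```python
# Assumed domain word lists, sufficient only for the four sentences above.
ORGANISMS = {"tricornutum", "phytoplankton"}
NUTRIENTS = {"nutrient", "N", "nitrate"}
RELATION_WORDS = {"related", "correlated", "stimulates"}

SENTENCES = {
    1: "Growth of tricornutum related to nutrient content.",
    2: "Numbers of phytoplankton correlated with N in the photic zone.",
    3: "Nutrient enrichment in the basin stimulates phytoplankton growth.",
    4: "Spatial distribution of nitrate correlated with phytoplankton activity.",
}


def find_nutrient(words):
    """Return the first nutrient term, accepting a capitalised first word."""
    for w in words:
        if w in NUTRIENTS:
            return w
        if w.lower() in NUTRIENTS:
            return w.lower()
    return None


def organism_depends_on_nutrient(sentences):
    """Answer 'What organism depends on what nutrient?' sentence by sentence."""
    answers = []
    for number, text in sorted(sentences.items()):
        words = text.rstrip(".").split()
        if not RELATION_WORDS & set(words):
            continue  # no word signalling a dependence
        organism = next((w for w in words if w in ORGANISMS), None)
        nutrient = find_nutrient(words)
        if organism and nutrient:
            answers.append((f"{organism} depends on {nutrient}", number))
    return answers
```

The sketch ignores word order altogether: in sentences 3 and 4 the nutrient precedes the organism, yet the answer is still stated as a dependence of the organism on the nutrient.
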

The system described in (J. Kontos, 1983) that was capable of producing the above results was the first to accomplish a direct attack on the problem of question answering combined with information extraction from unformatted texts. The questions posed to this system were processed by use of a semantic grammar augmented with some form of domain ontology. The speed of the system was increased when its implementation was based on finite state automata parsing instead of the then traditional grammar-based parsing method. It is remarkable that recent work on information extraction has “resurrected” the finite state method of parsing in order to solve the speed problems faced when processing large corpora (M.T. Pazienza, 1997).

6. Extraction from the Lung Mechanics Text
The following example is an extract from a medical physiology book (A. C. Guyton, 1991) in the domain of lung mechanics. The processing of this text for information extraction was first presented in (J. Kontos, 1992), using a number of scenarios related to causal knowledge chaining, which results from deductive reasoning performed by the system in response to the user's question. This text from the book contains the following sentences:

1. The alveolar pressure rise forces air out of the lungs.
2. The alveolar pressure rise is caused by elastic forces.
3. Elastic forces include elastic forces caused by surface tension.
4. Elastic forces caused by surface tension increase as the alveoli become smaller.
5. As the alveoli become smaller, the concentration of surfactant increases.
6. The increase of the concentration of surfactant reduces the surface tension.
7. The reduction of the surface tension opposes the collapse of the alveoli.

If the user asks the question “What process of alveoli causes flow of lungs air?”, the answer “become smaller” is produced automatically, together with the following explanation, which is a computer-generated text:

alveoli become smaller causes increase of elastic forces because
surface tension elastic forces is part of elastic forces and
alveoli become smaller causes increase of surface tension elastic forces
alveoli become smaller causes rise of alveolar pressure because
alveoli become smaller causes increase of elastic forces and
elastic forces causes rise of alveolar pressure
alveoli become smaller causes flow of lungs air because
alveoli become smaller causes rise of alveolar pressure and
rise of alveolar pressure causes flow of lungs air
A second question that may be submitted by the user is “What process of alveoli opposes collapse of alveoli?”, which requires the definition of causal polarity for the proper treatment of the verb “opposes”. After positive and negative causal polarity have been defined as “+cause” and “-cause” respectively, the system again gives the answer “become smaller”, but now the explanatory text generated by the system is as follows:

alveoli become smaller +causes reduction of surface tension because
alveoli become smaller +causes increase of surfactant concentration and
increase of surfactant concentration +causes reduction of surface tension
alveoli become smaller -causes collapse of alveoli because
alveoli become smaller +causes reduction of surface tension and
reduction of surface tension -causes collapse of alveoli
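
The polarity bookkeeping behind this second answer can be sketched as follows. This is a minimal Python illustration, not the authors' Prolog implementation. It assumes that chaining two signed causal links multiplies their signs (a "+cause" followed by a "-cause" yields a "-cause"), which is consistent with the explanation above but is not spelled out in the text. The names `links`, `chain` and `sign_name` are hypothetical.

```python
# Signed causal links read off sentences 5-7 of the lung mechanics text.
# +1 stands for "+cause", -1 for "-cause" (the verb "opposes").
links = {
    ("alveoli become smaller", "increase of surfactant concentration"): +1,
    ("increase of surfactant concentration", "reduction of surface tension"): +1,
    ("reduction of surface tension", "collapse of alveoli"): -1,
}


def chain(links):
    """Close the signed links under chaining; signs multiply (an assumption)."""
    derived = dict(links)
    changed = True
    while changed:
        changed = False
        for (a, b), s1 in list(derived.items()):
            for (c, d), s2 in list(derived.items()):
                # Chain a -> b with b -> d unless a -> d is already known.
                if b == c and (a, d) not in derived:
                    derived[(a, d)] = s1 * s2
                    changed = True
    return derived


def sign_name(sign):
    """Render a numeric sign in the notation used by the explanation text."""
    return "+causes" if sign > 0 else "-causes"


closure = chain(links)
for (antecedent, consequent), sign in closure.items():
    if (antecedent, consequent) not in links:
        print(antecedent, sign_name(sign), consequent)
```

Under these assumptions the derived link from "alveoli become smaller" to "collapse of alveoli" is negative, which is why "become smaller" answers a question phrased with "opposes".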

7. Knowledge Extraction from the Aspirin Text
7.1. The Aspirin Text
The text chosen as another illustrative example is an article from the Scientific American (G. Weissman, 1991), entitled “Aspirin” which we shall call “the Aspirin text”. The author, professor of Medicine at New York University, and director of the division of rheumatology at the University Medical Center where he studies molecular biology of inflammation, focuses on the range of biological effects and side effects salicylates have on the body. The general plan of the “Aspirin text” is given below.
Early history and research up to 1970
Vane's Mechanism (VM)
Remaining details
VM support and elaboration
Weaknesses of VM
Are NSAIDs' effects the result of their physical properties?
Neutrophil Interference Mechanism
Prostaglandins possess anti-inflammatory properties
The effect of NSAIDs and prostaglandin on cell signal transmission
The final blow to VM from a marine sponge
Much remains to be learned

Texts like the “Aspirin text” contain the causal knowledge we are interested in extracting, and they deliver it both “directly” and “indirectly”. Sentences delivering causal knowledge indirectly are, in most cases, embedded sentences whose main verb is a “metascience operator” (e.g. discovered, observed, has shown how, found that, has been impressed by the fact that, explain how, demonstrate that, argued, has been substantiated). Some examples of sentences from the Aspirin text that deliver causal knowledge indirectly are:

• What Stone had discovered,...was that salicylates... reduced the fever and relieved the aches...
• Recent work in my laboratory, for example, has shown how aspirin-like drugs prevent the activation of cells that mediate the first stages of....
• Renal physiologists found that low doses of salicylates blocked the excretion of uric acid...
• Pharmacologists had shown that salicylates reduce pain by acting on tissues....

Causal knowledge appears directly when there is no predicator introducing it. In the present illustrative medical text we focus on causal relations between processes associated with parts of the physiological system of the human body. Such processes may be “(pain/fever) reduction”, “raising (levels of acid)”, “(prostaglandin) production”, etc. An example of a sentence delivering causal knowledge directly is: “…salicylates reduce pain by acting on tissues and associated sensory nerves...”
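
The distinction between direct and indirect delivery can be illustrated with a small Python sketch. It is only a rough approximation under stated assumptions: it treats a sentence as "indirect" when one of the metascience operators listed above occurs in it as a literal string, and returns whatever follows the operator as the embedded clause. The operator "had shown that" is added because it occurs in the example sentences; the function name `classify` is hypothetical, and the actual system relies on grammar rules rather than string search.

```python
import re

# Metascience operators quoted in the text, plus "had shown that",
# which appears in one of the example sentences.
METASCIENCE_OPERATORS = [
    "discovered", "observed", "has shown how", "had shown that",
    "found that", "has been impressed by the fact that", "explain how",
    "demonstrate that", "argued", "has been substantiated",
]


def classify(sentence):
    """Return (kind, operator, clause) for a sentence.

    kind is "indirect" if a metascience operator is found, else "direct".
    For indirect sentences, clause is the text following the operator.
    """
    lowered = sentence.lower()
    # Try longer operators first so that the most specific one is reported.
    for op in sorted(METASCIENCE_OPERATORS, key=len, reverse=True):
        match = re.search(r"\b" + re.escape(op) + r"\b", lowered)
        if match:
            return "indirect", op, sentence[match.end():].strip()
    return "direct", None, sentence


print(classify("Renal physiologists found that low doses of salicylates "
               "blocked the excretion of uric acid"))
print(classify("salicylates reduce pain by acting on tissues "
               "and associated sensory nerves"))
```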

7.2. Causal knowledge extraction from Aspirin text
In this section, we present an illustrative grammar fragment necessary for the processing of sentences from the Aspirin text. It should be noted that our grammar rules do not generate any structures. They are only used for recognising sentence structures found in texts and extracting information from them. Therefore, we need not complicate our grammars by attempting to constrain the rules so that they will not allow ungrammatical sentences.

7.2.1. A Noun Phrase Grammar
The following rule describes the structure of a noun phrase (np) consisting of another np preceded by an adjective:

np(Str,Rest,Adj,N):-fronttoken(Str,Adj,R1),adj(Adj),np(R1,Rest,N).

It enables the system to recognise np structures like “uric acid” and “ubiquitous local hormones”. Some of the other noun phrase (np) and prepositional phrase rules read as follows:

np(Str,Rest,N):-fronttoken(Str,Q,Rest1),q(Q),np(Rest1,Rest,N).
np(Str,Rest,Gen,N):-fronttoken(Str,Gen,Rest1),g(Gen),np(Rest1,Rest,N).
np(Str,Rest,Gen,N):-fronttoken(Str,Q,Rest1),q(Q),np(Rest1,Rest,Gen,N).
np(Str,Rest,N1,Prep,N2):-np(Str,Rest1,N1),pp(Rest1,Rest,Prep,N2).
np(Str,Rest,Pro,Prep1,N1,Prep2,N2):-pnp(Str,Rest1,Pro),pp(Rest1,Rest,Prep1,N1,Prep2,N2).
pp(Str,Rest,Prep,N):-fronttoken(Str,Prep,Rest1),prep(Prep),np(Rest1,Rest,N).
pp(Str,Rest,Prep,N):-fronttoken(Str,Prep,Rest1),prep(Prep),pp(Rest1,Rest,P2,N),prep(P2).
pp(Str,Rest,Prep1,N1,Prep2,N2):-fronttoken(Str,Prep1,Rest1),prep(Prep1),
    pp(Rest1,Rest,N1,Prep2,N2).

The linguistic knowledge presented above enables the system to recognise many structures encountered in the Aspirin text.
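
As a rough illustration of how such recognition rules consume a string token by token, here is a minimal Python sketch of the adjective rule. It is not a translation of the full grammar: the mini-lexicon is hypothetical (the system's actual `adj` and noun entries are not given in the text), list slicing plays the role of `fronttoken`, and, unlike the single-adjective Prolog rule, the sketch allows any number of leading adjectives.

```python
# Hypothetical mini-lexicon, chosen to cover the two phrases quoted above.
ADJECTIVES = {"uric", "ubiquitous", "local"}
NOUNS = {"acid", "hormones"}


def np(tokens):
    """Recognise a noun phrase at the front of a token list.

    Returns (modifiers, head_noun, rest), or None if no noun phrase
    starts here. Like the Prolog rules, this only recognises structure
    and extracts information; it never generates sentences.
    """
    if not tokens:
        return None
    first, rest = tokens[0], tokens[1:]  # analogous to fronttoken(Str, Tok, Rest)
    if first in ADJECTIVES:
        inner = np(rest)  # an adjective followed by another noun phrase
        if inner is None:
            return None
        modifiers, head, remaining = inner
        return [first] + modifiers, head, remaining
    if first in NOUNS:
        return [], first, rest
    return None


print(np("uric acid".split()))
print(np("ubiquitous local hormones".split()))
```

Because the recogniser never has to reject ungrammatical input, it can stay this permissive, which is the design point made in section 7.2.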

7.2.2. Sentence Structures
Below, we present grammar rules describing some of the sentence structures encountered in the Aspirin text. These rules recognise sentence structures in terms of entity/process pairs and relate the causal relations expressed to the four-argument clause, described above, that represents causal relations. Five types of rules have been distinguished, and their Prolog expression is given below.

s(Str):-fronttoken(Str,"when",R),ep(R,Ea,Pa,R2),fronttoken(R2,",",R3),pe(R3,Pc,Ec,"").
s(Str):-pe(Str,Pa,Ea,R2),pe(R2,Pc,Ec,"").
s(Str):-prs(Str,Pa,R1),fronttoken(R1,"that",R2),pe(R2,P,E,R3),pe(R3,Pc,Ec,"").

The last rule treats the relative clause as a condition under which the causal relation (expressed by the rest of the sentence) can be considered valid and used for drawing inferences. It should be noted that this structure deals with the case of a missing entity string that must be associated with the process of the antecedent. The system “consults” prerequisite knowledge in order to fill in the entity slot in the four-argument structure it recognizes. The first entity slot will likewise have to be filled in by the system “retrieving” the prerequisite knowledge available to it.

s(Str):-pe(Str,Pa,Ea,R1),pe(R1,P1,E1,R2),n(E1),fronttoken(R2,"and",R3),ep(R3,E2,P2,R4),n(E2),
    fronttoken(R4,"that",R5),pe(R5,Pc,Ec,"").

Apart from causal knowledge, inferencing also needs ontological knowledge which is knowledge about the structure of the physiological system. Some of this knowledge is delivered by the text and is extracted by grammar rules like the following:

s(Str):-ep(Str,N1,P1,Rest1),n(N1),prvn(P1),fronttoken(Rest1,By,Rest2),by(By),
    pe(Rest2,P2,N2,N4,""),prn(P2).

7.3. Prerequisite Ontological and Lexical Knowledge
Ontological knowledge is domain knowledge that includes an inventory of entities and of processes related to that domain and an inventory of relations between these entities. The relations involved can be meronomic or taxonomic as well as causal. Lexical knowledge is linguistic knowledge that refers to the characteristics and relations of individual words and may be related to ontological knowledge.
Lexical knowledge is normally contained in the lexicon. In the lexicon, there are a number of issues to be specified such as the set of grammatical categories and semantic classes. Multiple membership (words falling in more than one class) is handled by allowing a word to appear in more than one class, with a separate word definition.
The extraction of information from the Aspirin text is based on prerequisite Lexical and Ontological Knowledge. This knowledge is included in the lexicon, which consists of:
• general vocabulary items, to which no attributes of the type described above are assigned;
• medical terminology items, which may either appear in everyday speech as well (e.g. morphine, kidneys, heart attack, pharmacologists) or not (e.g. neutrophils, thromboxane, piroxicam).
Ontological Knowledge is expressed by using terminological classes that group medical terms. A list of some of the terminological classes of items appearing in the Aspirin text that were taken into account is provided below:

CLASS               LEXICAL ITEMS
MEDICAL PERSONNEL   renal physiologists, pharmacologists, investigators, John R. Vane...
ENTITIES
BODY PARTS          tissues, liver, blood vessels, brain, sensory nerves, joints, cell membranes...
BODY FLUIDS         synovial fluid, blood...
BODY CHEMICALS      uric acid, arachidonic acid...
HORMONES            prostaglandins, insulin...
ENZYMES             prostaglandin H synthase...
DRUGS               aspirin, morphine, codeine, salicylates, NSAIDs...
PROCESSES           activation, adhesion, aggregation...
SYMPTOMS            pain, redness, heat, swelling, fever...

Apart from the ontological/causal prerequisite knowledge, another type of prerequisite knowledge is needed to supply domain or subject-matter information; we call it Domain Default Knowledge. It differs from prerequisite ontological knowledge in that it associates processes with entities rather than specifying hierarchical relations between entities.
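
One way to picture a lexicon organised by terminological classes is the following Python sketch. The class names and items are copied from the list of terminological classes in this section; the data structure and the helper names `build_lexicon` and `classes_of` are assumptions made for illustration, not the system's actual representation. Each item maps to a set of classes, so multiple membership is supported by construction.

```python
# Terminological classes and sample items taken from the list in this section.
CLASSES = {
    "MEDICAL PERSONNEL": ["renal physiologists", "pharmacologists", "investigators"],
    "BODY PARTS": ["tissues", "liver", "blood vessels", "brain",
                   "sensory nerves", "joints", "cell membranes"],
    "BODY FLUIDS": ["synovial fluid", "blood"],
    "BODY CHEMICALS": ["uric acid", "arachidonic acid"],
    "HORMONES": ["prostaglandins", "insulin"],
    "ENZYMES": ["prostaglandin H synthase"],
    "DRUGS": ["aspirin", "morphine", "codeine", "salicylates", "NSAIDs"],
    "PROCESSES": ["activation", "adhesion", "aggregation"],
    "SYMPTOMS": ["pain", "redness", "heat", "swelling", "fever"],
}


def build_lexicon(classes):
    """Invert class -> items into item -> set of classes."""
    lexicon = {}
    for cls, items in classes.items():
        for item in items:
            # A set per item lets the same item belong to several classes.
            lexicon.setdefault(item, set()).add(cls)
    return lexicon


def classes_of(lexicon, item):
    """Return the classes of an item; general vocabulary yields an empty set."""
    return lexicon.get(item, set())


lexicon = build_lexicon(CLASSES)
print(classes_of(lexicon, "aspirin"))
print(classes_of(lexicon, "the"))
```

Items of general vocabulary simply have no class attributes, which mirrors the first kind of lexicon entry described above.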

7.4. Inference Engine
The reasoning knowledge needed is used mainly in the form of an inference engine. In our experiments we used an inference engine realized in Prolog. That is, the system uses Prolog rules as inference rules for processing the knowledge acquired from text fragments. In the case of causal knowledge, the following simple causal inference rules in Prolog are used:
cause(P1,N1,P3,N3):-cause(P1,N1,P2,N2),cause(P2,N2,P3,N3).
cause(P1,N1,P2,N2):-part(N3,N2),cause(P1,N1,P2,N3).

The first rule combines two pieces of causal knowledge A and B, where the consequent of A is the same as the antecedent of B, in order to synthesize a new piece of knowledge whose antecedent is the antecedent of A and whose consequent is the consequent of B. The second rule controls the synthesis of a new piece of causal knowledge by use of prerequisite meronomic knowledge.
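
The effect of these two rules can be sketched in Python as a forward-chaining closure over four-argument causal facts. This is an illustration under stated assumptions, not the authors' engine: the facts are a hand encoding of sentences 1 to 4 of the lung mechanics text of section 6, in the order (antecedent process, antecedent entity, consequent process, consequent entity), and the choice to encode "elastic forces cause the alveolar pressure rise" with the antecedent process "increase" is ours. The function name `close` is hypothetical.

```python
# Hand-encoded facts (an assumed encoding of sentences 1-4 of section 6).
causes = {
    ("become smaller", "alveoli", "increase", "surface tension elastic forces"),
    ("increase", "elastic forces", "rise", "alveolar pressure"),
    ("rise", "alveolar pressure", "flow", "lungs air"),
}
# part(N3, N2): N3 is part of N2.
parts = {("surface tension elastic forces", "elastic forces")}


def close(causes, parts):
    """Apply both inference rules repeatedly until nothing new is derived."""
    known = set(causes)
    while True:
        new = set()
        # Rule 1: cause(P1,N1,P3,N3) :- cause(P1,N1,P2,N2), cause(P2,N2,P3,N3).
        for (p1, n1, p2, n2) in known:
            for (q2, m2, p3, n3) in known:
                if (p2, n2) == (q2, m2):
                    new.add((p1, n1, p3, n3))
        # Rule 2: cause(P1,N1,P2,N2) :- part(N3,N2), cause(P1,N1,P2,N3).
        for (n3, n2) in parts:
            for (p1, n1, p2, m) in known:
                if m == n3:
                    new.add((p1, n1, p2, n2))
        if new <= known:
            return known
        known |= new


closure = close(causes, parts)
# "What process of alveoli causes flow of lungs air?"
answers = [p1 for (p1, n1, p2, n2) in closure
           if n1 == "alveoli" and (p2, n2) == ("flow", "lungs air")]
print(answers)
```

Forward chaining to a fixed point is a choice made for this sketch: it always terminates on a finite set of facts, whereas a literal depth-first reading of the first rule, which calls `cause` as its first subgoal, may need additional control to avoid looping.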

8. Conclusion
A novel approach was introduced in this paper for the implementation of a question-answering based tool for the extraction of information and knowledge from texts. This effort resulted in the computer implementation of a system answering questions directly from a text using Natural Language Processing. Domain knowledge concerning categories of actions and implicit semantic relations was found useful for performing information extraction from the texts treated.
The method developed was applied to different kinds of texts such as an ancient Greek legal text, two medical texts and a text base of abstracts of papers in oceanography. In all these cases our method produced satisfactory results and the questions submitted to the system were answered successfully.

References
1. Cowie, J., 1983, Automatic analysis of descriptive texts. In Proceedings of the Conference on Applied Natural Language Processing.
2. Cowie, J., and Lehnert, W., 1996, Information Extraction. Communications of the ACM. Vol. 39, No. 1, pp. 80-91.
3. Grishman, R., 1997, Information Extraction: Techniques and Challenges. In Pazienza, M. T. Information Extraction. LNAI Tutorial. Springer, pp. 10-27.
4. Guyton, A. C., 1991, Textbook of Medical Physiology. Eighth Edition, An HBJ International Edition. W.B. Saunders.
5. Kontos, J. and Papakonstantinou, G., 1970. A Question-Answering System Using Program Generation. Proceedings of A.C.M. International Computing Symposium, Bonn Germany.
6. Kontos, J., 1980, Syntax-Directed Processing of Texts with Action Semantics. Cybernetica, 23, 2 pp. 157-175.
7. Kontos, J., 1982, Syntax-Directed Plan Recognition with a Microcomputer. Microprocessing and Microprogramming. 9, pp. 227-279.
8. Kontos, J., 1983, Syntax-Directed Fact Retrieval from Texts with a Micro-Computer. Proc. MELECON '83, Athens.
9. Kontos, J., 1985, Natural Language Processing of Scientific/Technical Data, Knowledge and Text Bases. Proceedings of ARTINT Workshop. Luxembourg.
10. Kontos, J. and Cavouras, J. C., 1988, Knowledge Acquisition from Technical Texts Using Attribute Grammars. The Computer Journal, Vol. 31, No. 6, pp. 525-530.
11. Kontos, J., 1992, ARISTA: Knowledge Engineering with Scientific Texts. Information and Software Technology, Vol. 34, No. 9, pp. 611-616.
12. Kontos, J., 1996, Artificial Intelligence and Natural Language Processing (In Greek) E. Benou, Athens, Greece.
13. Malagardi, I., 1995a, Comparative Analysis of "na" and "ya na" Sentences of the Greek Language with the Equivalent Structures of German Language and Related Problems in their Machine Translation. Unpublished Dissertation. University of Athens.
14. Malagardi, I., 1995b, The Resolution of the Subject Ambiguity in Sentences with "ya na" using Domain Knowledge, and Related Problems in Machine Translation. Proceedings of 2nd. International Congress on Greek Linguistics. Salzburg.
15. Malagardi, I., 1996, Computer Determination of Relations between the Elements in Noun Phrases of Sublanguages. 17th annual meeting of the Department of Linguistics. Aristotle Univ. of Thessaloniki.
16. MUC-3, 1991, Proceedings of the Third Message Understanding Conference. Morgan Kaufmann.
17. Pazienza, M. T., 1997, Information Extraction. LNAI Tutorial. Springer.
18. Wilks, Y., 1997, Information Extraction as a Core Language Technology. In Pazienza, M. T. Information Extraction. LNAI Tutorial. Springer, pp. 1-9.
19. Weissman, G., 1991, Aspirin. Scientific American, pp. 58-64.
20. Willets, R. F., 1967, The Law Code of Gortyn. Kadmos: Supplement I. Berlin.
