Causality Extraction from Text Data

 

Introduction

Causality is an important part of the semantic structure of natural language and plays a central role in natural language processing.
Research on the automatic extraction of causal relations has a long history and has produced some encouraging results. However, previous work focuses only on causality between nouns. We address the automatic extraction of causality not only among nouns but also among verbs, adjectives, adverbs and other parts of speech. We extract three groups of features (statistical, structural and semantic) using tools and corpora such as WordNet and Google N-gram. We then use a neural network to analyze these features and decide whether a causal relation holds between any two words.

Currently, there are two main approaches to this task: knowledge-based and corpus-based. However, these approaches are better suited to measuring semantic similarity between two words than between more general multi-word expressions (MWEs). Scalability issues from manual tagging, as well as corpus dependency and availability, also limit their applicability.

New progress

The method is summarized briefly as follows:

1. Candidate pair extraction. We extract a large number of sentences that we believe express causality. Each sentence is split at the causal pattern and stop words are filtered out. Every word before the pattern is connected to every word that follows it, which yields a directed graph (or net). The number of edges between two words indicates how strong or weak the relationship between them is.

2. Feature extraction. We look for features from different angles so that causal pairs can be separated from the rest. We consider semantic and structural features, and some statistical features as well.

3. Classifier training and testing. We train a classifier on the extracted features and test it.
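
Step 1 can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the cue patterns, the stop-word list and the function names are all assumptions made for the example, and the directed graph is represented simply as a counter over (cause word, effect word) edges.

```python
import re
from collections import Counter

# Hypothetical causal cue patterns and stop words, chosen only for illustration.
CAUSAL_PATTERNS = [r"\bleads to\b", r"\bcauses\b", r"\bresults in\b"]
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "and", "is", "are"}


def content_words(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def extract_pairs(sentence):
    """Split the sentence at the first matching pattern and connect every
    word before the pattern to every word after it."""
    for pattern in CAUSAL_PATTERNS:
        match = re.search(pattern, sentence, flags=re.IGNORECASE)
        if match:
            left = content_words(sentence[:match.start()])
            right = content_words(sentence[match.end():])
            return [(cause, effect) for cause in left for effect in right]
    return []


def build_graph(sentences):
    """Count directed edges; a higher count suggests a stronger relationship."""
    edges = Counter()
    for sentence in sentences:
        edges.update(extract_pairs(sentence))
    return edges


sentences = [
    "Heavy smoking causes lung cancer.",
    "Smoking leads to cancer and heart disease.",
]
graph = build_graph(sentences)
```

Here the edge from "smoking" to "cancer" is seen in both sentences, so it receives a higher count than edges seen only once.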

Approach

WordNet and Google N-gram are the two main resources used to extract semantic features. A word taken from the original corpus is processed as shown in Figure 1.


Figure 1: Processing Procedure in Semantic Feature Extraction.

Experiment Results

To prepare the training data, we randomly sample 1000 pairs from our filtered data and compute their features. We then ask three people to label each pair. An annotator labels a pair as causal if he or she can come up with a sentence containing word A and word B in a form such as "If A, then B" or "Because A, B". After all three annotators have labeled the 1000 random pairs, we merge the results: a pair is labeled as causal if two or more annotators consider it causal. This yields 237 causal pairs and 763 non-causal pairs.
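
The merging rule amounts to a majority vote over the three annotators. A minimal sketch, in which the function name and the example pairs and votes are hypothetical and not taken from the labeled data:

```python
def majority_label(votes, min_agree=2):
    """Return True when at least `min_agree` annotators marked the pair causal."""
    return sum(1 for vote in votes if vote) >= min_agree


# Hypothetical annotations: pair -> votes from the three annotators.
annotations = {
    ("rain", "flood"): [True, True, False],
    ("table", "window"): [False, True, False],
}
labels = {pair: majority_label(votes) for pair, votes in annotations.items()}
```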

The main disadvantage of neural networks is that no theory tells us how to design them: how many neurons to use and how many layers the network should have. With too many neurons, the network may overfit the training data, which means poor generalization. With too few neurons, the network may be unable to classify the data. The amount of training data also affects the classification ability of the trained network: too much training data may introduce more noise, while too little may lead to poor generalization.

After many attempts, we found that a network with two hidden layers, with 3 neurons in the first hidden layer and 4 neurons in the second, gives the best training result. As for the amount of training data, we first used 800 pairs for training and 200 pairs as the test set, but the results were disappointing: both precision and recall were very low. We suspect that the class imbalance in the training data weakens the training results. We therefore used 200 positive and 200 negative examples as the training set, and the remaining 37 positive examples together with 37 negative examples as the test set. The result far exceeded our expectations: precision is 80% and recall is 97.30%.
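
The balanced split and the two metrics can be sketched as below. This is an illustration under stated assumptions: the function names, the random seed and the placeholder pair identifiers are invented for the example, and the negative examples are assumed to be drawn at random from the 763 non-causal pairs, which the text does not specify.

```python
import random


def balanced_split(positives, negatives, n_train, n_test, seed=0):
    """Draw n_train + n_test examples per class and return (train, test),
    each a list of (item, label) with label 1 for causal and 0 otherwise."""
    rng = random.Random(seed)
    pos, neg = positives[:], negatives[:]
    rng.shuffle(pos)
    rng.shuffle(neg)
    train = [(x, 1) for x in pos[:n_train]] + [(x, 0) for x in neg[:n_train]]
    test = ([(x, 1) for x in pos[n_train:n_train + n_test]]
            + [(x, 0) for x in neg[n_train:n_train + n_test]])
    return train, test


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Placeholder identifiers standing in for the 237 causal and 763 non-causal pairs.
positives = [f"pos_{i}" for i in range(237)]
negatives = [f"neg_{i}" for i in range(763)]
train, test = balanced_split(positives, negatives, n_train=200, n_test=37)
```

With 37 positive test pairs, a recall of 97.30% corresponds to 36 of them being recovered (36/37). A precision of 80% would then imply 45 pairs predicted as causal, 9 of them false positives. These counts are inferred from the reported percentages and are not stated in the results.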

Table 1 shows the experimental results on a Wikipedia dump.

Table 1: The list of highly ranked causal pairs.

ID  Cause                         Effect
1   monoxide                      incomplete_combustion
2   nuclear_holocaust             world war iii
3   epstein-barr virus infection  earthquake_zone
4   inbreeding depression         population bottleneck
5   pesticide                     air pollution
6   population                    environmental stress
7   anxiety                       destructive behavior
8   hyperbilirubinemia            red blood cell destruction
9   colic                         premature_death

Contact

Zhiyi Luo (jessieluo1991@gmail.com): Shanghai Jiao Tong University
Yuchen Sha (jessieluo1991@gmail.com): Shanghai Jiao Tong University
Bowen Li (bwbw1992@163.com): Shanghai Jiao Tong University
Yunchou Li (liyunchou94@gmail.com): Shanghai Jiao Tong University
Kenny Q. Zhu (kzhu@cs.sjtu.edu.cn): Shanghai Jiao Tong University