Medical Information Extraction

EMR Extraction (David)


This project aims to extract structured data from medical records provided by AstraZeneca. We focus mainly on the history of present illness and the past history, and extract key information organized by time, so that one case of illness yields several records.

An electronic medical record (EMR) is a digital version of a paper chart that contains all of a patient’s medical history from one practice. An EMR is mostly used by providers for diagnosis and treatment.

Data Set

Because this project is a collaboration with AstraZeneca, the data set cannot be made public. A brief description is as follows: the input data total up to 50MB and contain approximately 35000 cases. All of them contain past and present history and other useful information.


We use an external Java regular-expression jar together with the Stanford Parser tools to parse the input and extract the diseases concerned (we built a disease dictionary, which is available for download), as well as symptoms, narrative expressions and many other features.
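The idea of time-based, dictionary-driven extraction can be illustrated with a minimal sketch. It is written in Python rather than with the Java tools named above, and it does no syntactic parsing. The dictionaries, the time-expression pattern and the sample sentence are all hypothetical and are not taken from the project's data or its disease dictionary.

```python
import re

# Hypothetical vocabularies; the project's real disease dictionary is not reproduced here.
DISEASES = {"hypertension", "diabetes", "pneumonia"}
SYMPTOMS = {"cough", "fever", "chest pain"}

# Assumed time expressions: "in 2005", "3 years ago", "2 weeks ago", and similar.
TIME_PATTERN = re.compile(
    r"\b(?:in \d{4}|\d+ (?:day|week|month|year)s? ago)\b", re.IGNORECASE
)


def find_terms(text, vocabulary):
    """Return the vocabulary terms that occur in text as whole words, sorted."""
    lowered = text.lower()
    return sorted(
        term for term in vocabulary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def extract_records(history):
    """Split a narrative at time expressions and emit one record per segment.

    Text before the first time expression is ignored in this sketch.
    """
    matches = list(TIME_PATTERN.finditer(history))
    records = []
    for i, match in enumerate(matches):
        # A segment runs from one time expression to the start of the next.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(history)
        segment = history[match.start():end]
        records.append({
            "time": match.group(0),
            "diseases": find_terms(segment, DISEASES),
            "symptoms": find_terms(segment, SYMPTOMS),
        })
    return records


if __name__ == "__main__":
    sample = ("In 2005 the patient was diagnosed with hypertension. "
              "2 weeks ago the patient developed cough and fever.")
    for record in extract_records(sample):
        print(record)
```

Each time expression opens a new record, which is one simple way for a single case to yield several records.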

Experimental Results

The final statistics are as follows:

The links to the test data above are as follows:

The output for 2001_2007data can be downloaded here. We use 100 randomly selected cases as a benchmark: the manually extracted version is here and the software-extracted result is here.

Information Extraction on ECG Using OCR (Jinyi)


The goal of this project is to use OCR to extract useful information from ECG pictures.

OCR is short for Optical Character Recognition, which converts scanned or photographed images of typewritten or printed text into machine-encoded, computer-readable text. ECG pictures contain much useful information in a structured format, which makes it possible to correct errors in the OCR result.
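As a minimal sketch of how a known structure can help correct OCR errors, the snippet below maps a misrecognized key to the closest entry in a list of expected keys. It uses Python's standard `difflib` module as a stand-in and does not claim to be the project's correction method. The key list is taken from the example triples on this page; the function name and the cutoff value are assumptions.

```python
import difflib

# Keys expected on one type of ECG picture (taken from the example triples on this page).
KNOWN_KEYS = ["Vent. rate", "PR interval", "QRS duration", "P-R-T axes"]


def correct_key(ocr_key, known_keys=KNOWN_KEYS, cutoff=0.6):
    """Map a possibly misrecognized key to the closest known key, or None.

    The cutoff is an assumed similarity threshold between 0 and 1.
    """
    matches = difflib.get_close_matches(ocr_key, known_keys, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    print(correct_key("Vent. rale"))   # a key with one misread character
    print(correct_key("PR lnterval"))  # "i" misread as "l"
```

Because the set of valid keys is small and fixed for each picture type, even a simple similarity measure can recover a key with one or two misread characters.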

Data Set

The experimental data set for this project consists of 80 ECG pictures from Beijing University Hospital.

These pictures can be divided into 12 different types. The largest type contains 41 pictures, and the experiment currently focuses on these 41 pictures. The useful information in these pictures is all in the form of key-value-unit triples.


The whole process consists of four steps:

  1. Pre-processing of the pictures. Thresholding is used to remove noise from the picture.

    Original picture & Thresholded picture:

    Key: Vent. rate Value: 81 Unit: bpm


  2. Running the OCR engine. Tesseract, an open-source OCR engine, is used to produce the text output.

    OCR result:


  3. Post-processing of the OCR results. The OCR result contains some position errors, so the text is re-organized using its coordinates in the picture.

    Re-organized result:


  4. Information extraction from the processed results. Scoring methods are used to find the correct information in the noisy text.

    Triples extracted:

    Vent. rate[81]bpm
    PR interval[148]ms
    QRS duration[88]ms
    P-R-T axes[65, 67, 47]
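
The steps above, apart from the OCR call itself, can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The image is a plain list of grayscale rows. The OCR output is assumed to be a list of hypothetical `(text, x, y)` word boxes. The unspecified scoring methods are replaced by a simple string-similarity score from `difflib`. The cutoff, tolerance and minimum score are made-up values, and multi-value entries such as P-R-T axes are not handled.

```python
import difflib
import re

# Expected unit for each key (taken from the example triples on this page).
KEY_UNITS = {"Vent. rate": "bpm", "PR interval": "ms", "QRS duration": "ms"}


def threshold(gray, cutoff=128):
    """Binarize rows of 0-255 grayscale values: dark pixels -> 0, light -> 255."""
    return [[0 if pixel < cutoff else 255 for pixel in row] for row in gray]


def group_into_lines(words, y_tolerance=5):
    """Rebuild text lines from (text, x, y) boxes: cluster by y, order by x."""
    lines = []
    for text, x, y in sorted(words, key=lambda w: w[2]):
        if lines and abs(lines[-1]["y"] - y) <= y_tolerance:
            lines[-1]["words"].append((x, text))
        else:
            lines.append({"y": y, "words": [(x, text)]})
    return [" ".join(t for _, t in sorted(line["words"])) for line in lines]


def extract_triples(lines, min_score=0.7):
    """Find 'key value' lines and keep those whose key scores well against a known key."""
    triples = []
    for line in lines:
        match = re.match(r"(?P<key>.+?)\s+(?P<value>\d+)\b", line.strip())
        if not match:
            continue
        # Stand-in scoring: string similarity between the OCR key and each known key.
        best_key, best_score = None, 0.0
        for key in KEY_UNITS:
            score = difflib.SequenceMatcher(
                None, match.group("key").lower(), key.lower()
            ).ratio()
            if score > best_score:
                best_key, best_score = key, score
        if best_key is not None and best_score >= min_score:
            # The unit comes from the expected unit for the key, not from the OCR text.
            triples.append((best_key, int(match.group("value")), KEY_UNITS[best_key]))
    return triples


if __name__ == "__main__":
    # Hypothetical word boxes in scrambled order, with one misread key.
    boxes = [("81", 120, 11), ("Vent.", 10, 10), ("rale", 60, 12), ("bpm", 150, 10),
             ("PR", 10, 30), ("interval", 40, 31), ("148", 120, 30), ("ms", 150, 29)]
    print(extract_triples(group_into_lines(boxes)))
```

Grouping by vertical position before sorting by horizontal position is what restores the reading order when the OCR engine emits words out of sequence.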

Experimental Result

For all 41 pictures, which contain 5 useful key-value-unit triples, the accuracy is roughly 73% for now.
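
One way such an accuracy figure could be computed is sketched below. It assumes that accuracy means the fraction of reference triples that the software reproduces exactly, which this page does not state. The function name, the data layout and the sample values are hypothetical.

```python
def triple_accuracy(extracted, reference):
    """Fraction of reference triples reproduced exactly, summed over all pictures.

    Both arguments map a picture id to a list of (key, value, unit) triples.
    """
    total = 0
    correct = 0
    for picture_id, expected in reference.items():
        found = set(extracted.get(picture_id, []))
        total += len(expected)
        correct += sum(1 for triple in expected if triple in found)
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical reference and software output for two pictures.
    reference = {
        "a": [("Vent. rate", 81, "bpm"), ("PR interval", 148, "ms")],
        "b": [("QRS duration", 88, "ms"), ("Vent. rate", 70, "bpm")],
    }
    extracted = {
        "a": [("Vent. rate", 81, "bpm"), ("PR interval", 143, "ms")],
        "b": [("QRS duration", 88, "ms"), ("Vent. rate", 70, "bpm")],
    }
    print(triple_accuracy(extracted, reference))
```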