Information Extraction from Medication Leaflets


With the constant growth of medical electronic systems, including decision support systems and personal wellbeing applications, the need for machine-understandable information has increased.

However, much of the data currently available is in free-form text, which is a convenient way for people to express concepts and events but is especially challenging for machines to process. Information extraction can relieve some of the problems of processing free-form text by providing a semantic interpretation and abstraction of the texts.

This work presents the PharmInX information extraction system, which aims to automatically extract information from pharmacological texts, more precisely medication leaflets. The system was designed to target several different kinds of information regarding pharmacological products, particularly their posology, side effects and indications.

The primary goal is to provide high-quality, machine-understandable information, which is currently not available to medical electronic systems. With such information, these systems could offer better wellbeing and care services to patients and enhance decision support for health care professionals. The PharmInX system was designed and developed with these goals in mind.

It comprises six components, each with a different capability: 1) text pre-processing, 2) document reader, 3) general natural language processing, 4) named entity recognition, 5) relation extraction and, finally, 6) information consumers.

These components rely on rules, regular expressions, lookups in external resources and machine learning. Once all stages are complete, the extracted information can be accessed through an ontology that was carefully developed to support the pharmacological information targeted for extraction.
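To make the rule-based part of such a pipeline concrete, the following is a minimal sketch of regular-expression entity recognition followed by a proximity rule for relation extraction. It is not the PharmInX implementation: the patterns, the labels `DOSE` and `FREQUENCY`, the relation name `has_frequency`, the `max_gap` threshold and the example sentence are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str
    text: str
    start: int
    end: int


# Hypothetical patterns for Portuguese leaflet text (assumptions, not PharmInX rules).
PATTERNS = {
    "DOSE": re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:mg|g|ml)\b", re.IGNORECASE),
    "FREQUENCY": re.compile(r"\b\d+\s+(?:vez|vezes)\s+(?:ao|por)\s+dia\b", re.IGNORECASE),
}


def recognize_entities(text):
    """Return all pattern matches as entities, ordered by position in the text."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(Entity(label, match.group(0), match.start(), match.end()))
    return sorted(entities, key=lambda e: e.start)


def extract_relations(entities, max_gap=30):
    """Link each DOSE to the first FREQUENCY that starts within max_gap characters after it."""
    relations = []
    for i, entity in enumerate(entities):
        if entity.label != "DOSE":
            continue
        for other in entities[i + 1:]:
            if other.label == "FREQUENCY" and other.start - entity.end <= max_gap:
                relations.append((entity, "has_frequency", other))
                break
    return relations


if __name__ == "__main__":
    sentence = "Tomar 500 mg 2 vezes ao dia."  # illustrative sentence, not from the corpus
    found = recognize_entities(sentence)
    for dose, relation, frequency in extract_relations(found):
        print(dose.text, relation, frequency.text)
```

A real system would combine such rules with dictionary lookups and learned models, as the description above indicates; the sketch only shows how recognized entities can feed a later relation step.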

To support both development and evaluation of the system, some pharmacological documents were manually annotated and used as a gold standard. Evaluation against this gold standard indicates that pharmacological and clinical information can be extracted successfully from free-form texts in Portuguese, with an F1 score of 99.23% for entity recognition and an F1 score of 97.43% for the extraction of relations between those entities.
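As a reminder of how an F1 score is obtained from a gold standard, the sketch below computes precision, recall and F1 over sets of annotations. The abstract does not state the matching criterion used in the evaluation, so exact matching of `(label, start, end)` tuples is an assumption here, and the example annotations are invented for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1, counting only exact matches as correct."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Invented annotations as (label, start, end) tuples; not data from the thesis.
    gold = {("DRUG", 0, 5), ("DOSE", 6, 12), ("FREQUENCY", 13, 27)}
    predicted = {("DOSE", 6, 12), ("FREQUENCY", 13, 27), ("DOSE", 30, 35)}
    print(precision_recall_f1(gold, predicted))
```

The same function applies to relations if each relation is represented as a hashable tuple, for example `(head, relation_type, tail)`.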


Author: Bruno Aguiar

Type: MSc thesis

Partner: Faculdade de Engenharia da Universidade do Porto

Year: 2012