Explainable epidemiological thematic features for event based disease surveillance

Event-based disease surveillance (EBS) systems are biosurveillance systems that detect and alert on (re-)emerging infectious diseases by monitoring acute public or animal health event patterns in sources such as blogs, online news reports and curated expert accounts. These information-rich sources, however, consist largely of unstructured text and require novel text mining techniques to achieve EBS goals such as epidemiological text classification. The main objective of this research was to improve epidemiological text classification by proposing a novel technique for enriching thematic features using a weak supervision approach. In our approach, we train and test a mixed-domain language model named EpidBioELECTRA to first enrich thematic features, which are then used to improve epidemiological text classification. We train EpidBioELECTRA on a large dataset that we created, consisting of 70,700 annotated documents that include 70,400 labeled thematic features. We empirically compare EpidBioELECTRA with both general-purpose and domain-specific language models on the task of epidemiological corpus classification. Our findings show that epidemiological classification systems work best with language models pre-trained on both epidemiological and biomedical corpora using a continual pre-training strategy. EpidBioELECTRA improves epidemiological document classification by 19.2 score points compared to its vanilla implementation, BioELECTRA. We observe this when comparing BioELECTRA with EpidBioELECTRA on our most challenging dataset, PADI-Web, where our approach records a precision of 92.33, a recall of 94.62 and a score of 93.46. We also examine the impact of increasing the context length of training documents on epidemiological document classification and find that this improves the classification task by 7.79 score points, as recorded by EpidBioELECTRA's performance.
We also compute Almost Stochastic Order (ASO) scores to track EpidBioELECTRA's statistical dominance. In addition, we carry out ablation studies on our proposed thematic feature enrichment approach using explainable AI techniques. We present explanations for the most critical thematic features and how they influence the epidemiological classification task. We find that biomedical features (such as mentions of disease names and symptoms) are the most influential, while spatio-temporal features (such as the mention of the date of a given disease outbreak) are the least influential in epidemiological document classification. Our model can easily be extended to fit other domains.
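
The PADI-Web figures reported in the abstract can be sanity-checked with a short calculation. The sketch below assumes that the unnamed third score is an F1 score, that is, the harmonic mean of precision and recall; the abstract itself does not name the metric. The function name `f1_from_precision_recall` is hypothetical and used only for illustration.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score).

    Assumption: both inputs are on the same scale (here, percentages).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for EpidBioELECTRA on PADI-Web.
reported_precision = 92.33
reported_recall = 94.62

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(round(f1, 2))
```

With the reported precision and recall, the harmonic mean rounds to 93.46, which matches the third figure given for PADI-Web and is consistent with that figure being an F1 score.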

Bibliographic Details
Main Authors: Menya, Edmond, Interdonato, Roberto, Owuor, Dickson, Roche, Mathieu
Format: article biblioteca
Language: eng
Subjects: text mining, epidemiology, epidemiological surveillance, animal health, data analysis, infectious disease, data mining, public health, http://aims.fao.org/aos/agrovoc/c_dca12b72, http://aims.fao.org/aos/agrovoc/c_2615, http://aims.fao.org/aos/agrovoc/c_16411, http://aims.fao.org/aos/agrovoc/c_431, http://aims.fao.org/aos/agrovoc/c_15962, http://aims.fao.org/aos/agrovoc/c_34024, http://aims.fao.org/aos/agrovoc/c_eb9cea5d, http://aims.fao.org/aos/agrovoc/c_6349
Online Access: http://agritrop.cirad.fr/609247/
http://agritrop.cirad.fr/609247/1/Menya_et_al_ESWA2024.pdf