Universal HIV screening programs are costly, labor-intensive, and in practice unable to identify all individuals at risk of HIV infection. Automated risk assessment methods that leverage longitudinal electronic health records (EHRs) could catalyze targeted screening programs in Emergency Departments and across public health jurisdictions. While information on social and behavioral determinants of health is typically collected in unstructured fields, previous analyses have considered only structured EHR data. We sought to characterize whether clinical notes can improve predictive models of HIV diagnosis.
The study cohort included 181 individuals who received care at an academic medical center in New York City prior to a confirmatory HIV diagnosis. We selected 543 HIV-negative controls with similar utilization patterns using propensity score matching. Demographics, laboratory tests, and diagnosis codes were extracted from longitudinal records. Clinical notes were preprocessed using both topic modeling and an n-grams approach. We fit three predictive models using Random Forests: a baseline model that included only structured EHR data, the baseline model plus topic modeling, and the baseline model plus clinical keywords.
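As a minimal sketch of what an n-grams approach to clinical notes can look like, the snippet below counts unigram and bigram keywords in a patient's notes to produce features that could be appended to structured EHR features. The function names, the tokenization rule, the vocabulary, and the note text are all hypothetical and for illustration only; the abstract does not specify how the n-grams were actually extracted or selected.

```python
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from a lowercased note."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def keyword_features(notes, vocabulary):
    """Count how often each vocabulary term occurs across one patient's notes."""
    counts = Counter()
    for note in notes:
        counts.update(g for g in ngrams(note) if g in vocabulary)
    # One count per vocabulary term, in vocabulary order
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and note text (not taken from the study data)
vocabulary = ["unprotected", "methamphetamine", "unprotected sex"]
notes = ["Patient reports unprotected sex.", "Denies methamphetamine use."]
features = keyword_features(notes, vocabulary)
```

Note that simple counting of this kind ignores negation ("Denies methamphetamine use" still counts the keyword), which is one reason more refined extraction techniques may be worth exploring.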
The predictive models achieved F-measures of 0.59 for the baseline model, 0.63 for the baseline plus topic modeling model, and 0.74 for the baseline plus clinical keyword model. The baseline plus topic model displayed low precision but high recall, while the baseline plus clinical keyword model displayed high precision but low recall. Clinical keywords including ‘msm’, ‘unprotected’, ‘hiv’, and ‘methamphetamine’ were indicative of elevated risk.
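For readers less familiar with the metric, the F-measure combines precision and recall into a single score. The sketch below assumes the balanced F1 variant (the harmonic mean of precision and recall, beta = 1); the abstract does not state which variant was used, and the function name is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when both are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only, not results from the study
score = f_measure(0.9, 0.5)
```

Because the harmonic mean is pulled toward the lower of the two values, a model with high precision but low recall (or the reverse) scores below the arithmetic mean of the two.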
Clinical notes improved the performance of predictive models for automated HIV risk assessment. Future studies should explore novel techniques for extracting social and behavioral determinants from unstructured text in longitudinal EHRs.
M. Yin, None
P. Gordon, None
N. Elhadad, None