White paper

Enhancing healthcare cost replication strategies with textual data

A natural language processing framework for constructing leading indicators

By Hervé Andrès, Alexandre Boumezoued, Rodrigo Dufeu, Quincy Hsieh, Adam Schenck, and Dan Mulhern

29 June 2026

The trend in per capita medical costs in the United States is a structurally complex and non-tradeable economic variable. Actuarial benchmarks—such as those reflected in the Milliman Health Trend Guidelines—are built from claims data that show heterogeneous drivers such as utilization intensity, reimbursement levels, and drug prices. They are typically reported on a monthly basis.

A substantial portion of the forward-looking information about these cost drivers emerges in qualitative form: regulatory draft proposals, clinical trial outcomes, policy commentary, reimbursement negotiations, and industry guidance. Historically, the interpretation of such information has relied on domain experts who translate dispersed signals into implicit judgments about future cost trends. However, this expert processing is difficult to scale, systematize, or replicate within a quantitative framework. This paper proposes a methodology for formalizing that interpretive layer.

Specifically, we construct a Healthcare Sentiment Index (HSI) that leverages large language models to encode expert-informed taxonomies of healthcare cost drivers and systematically map unstructured textual data (news feeds) into a directional, time-indexed signal. The primary contribution is methodological rather than predictive. The HSI represents an automated translation of domain expertise into a structured quantitative index, enabling qualitative information to be incorporated into formal empirical analysis and potentially systematic asset allocation programs.

Key discussion points include the following.

Natural language processing methodology: A multistage pipeline that prioritizes domain expertise over generic sentiment scoring.
Sentiment index construction: A multistage aggregation framework culminating in a dynamic Bayesian smoothing approach.
Time series explainer: A specialized retrieval-augmented generation pipeline designed to interpret why the sentiment index moved.
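
To make the index construction step concrete, the sketch below shows one generic way that per-article scores could be aggregated by day and then smoothed with a simple Bayesian filter. It is an illustration only and not the method of the full paper, which this summary does not specify: the local-level (random walk plus noise) Kalman filter, the function names, the variance parameters, and the sample scores are all assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean


def daily_mean_scores(scored_articles):
    """Average per-article scores (assumed to lie in [-1, 1]) by date.

    `scored_articles` is an iterable of (iso_date, score) pairs; this input
    format is a hypothetical choice for the illustration.
    """
    by_day = defaultdict(list)
    for day, score in scored_articles:
        by_day[day].append(score)
    # ISO date strings sort chronologically, so sorted() gives time order.
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


def local_level_filter(observations, process_var=0.01, obs_var=0.25,
                       prior_mean=0.0, prior_var=1.0):
    """Kalman filter for a latent random-walk level observed with noise.

    All variances are illustrative assumptions, not calibrated values.
    Returns the filtered (posterior mean) level after each observation.
    """
    level, var = prior_mean, prior_var
    filtered = []
    for y in observations:
        var += process_var                # predict: the level may drift
        gain = var / (var + obs_var)      # weight given to the new observation
        level += gain * (y - level)       # update the posterior mean
        var *= 1.0 - gain                 # update the posterior variance
        filtered.append(level)
    return filtered


# Hypothetical scored articles: (publication date, directional score).
articles = [
    ("2026-01-01", 0.6),
    ("2026-01-01", 0.2),
    ("2026-01-02", -0.4),
    ("2026-01-03", 0.5),
]
daily = daily_mean_scores(articles)
index = local_level_filter(list(daily.values()))
```

In this sketch, a larger `obs_var` relative to `process_var` yields a smoother index that reacts more slowly to any single day of news, which is the usual trade-off in this family of filters.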

Download the full paper (PDF).


