Developing ORCHID-Notes, the first large-scale dataset of millions of de-identified clinical notes from the U.S. organ procurement system, enabling research into transplant equity and advancing LLM-based de-identification at scale.
Creating MIMIC-IID, a scalable analysis pipeline that tests independent and identically distributed (IID) assumptions in MIMIC-CXR, improving dataset transparency and fostering equitable AI development in critical care imaging.
Engineering a high-throughput deep-generative framework that synthesizes mixed-type (numeric, categorical, and text) tables with preserved joint distributions and feature dependencies, validated in collaboration with Liberty Mutual to accelerate reliable, scalable machine learning development.