The idea of an AI handing out letter grades for your heart health sounds like science fiction.
But that’s exactly what happened when a Washington Post reporter, Geoffrey Fowler, connected ChatGPT Health to ten years of Apple Watch data, and the results were wildly inconsistent and alarmingly confident.
Fowler’s experience is entertaining in a “wait, really?” kind of way, but it also exposes the limitations of AI masquerading as a health advisor.
His doctor dismissed the assessment outright, pointing out his cardiac risk was so low that insurance wouldn’t even cover extra testing to disprove the bot’s findings.
Cardiologist Eric Topol called it “baseless.” Yet the AI didn’t stop there. On repeated queries, Fowler’s score bounced from an F to a B. The bot even forgot basic facts, such as his age and gender, despite having full access to his records.
ChatGPT Health treats fuzzy metrics like VO2 max and heart-rate variability as gospel. Anyone who has used an Apple Watch knows these numbers are estimates.
Swings in readings are common when devices are upgraded or recalibrated. The AI doesn’t contextualize that. It just spits out grades with all the confidence of a doctor, which can make users anxious or give a false sense of security.
Sleep scores get the same treatment. A late night can tank your Apple Watch rating, and ChatGPT will amplify that with an authoritative stamp of approval or failure.
The privacy implications are worth noting. OpenAI promises encryption and says data isn’t used for training, but ChatGPT Health isn’t HIPAA-covered. You’re sharing highly personal information with a company whose primary mission isn’t healthcare.
Fowler’s experiment shows the risk is more than theoretical. Anthropic’s rival chatbot, Claude, didn’t fare much better: it graded Fowler’s heart health a C while ignoring the same nuances in the data.
This is the kind of hype Apple users should be skeptical of. AI-driven personal health insights make for an amazing pitch, but right now the technology is still experimental.
Tools like ChatGPT and Claude can produce neat visualizations and identify general trends, but turning that into a meaningful health grade is premature.
And with the FDA saying it will “get out of the way” to promote innovation, there’s little regulatory pushback. That makes the stakes higher for anyone who trusts a bot over a doctor.
Fowler’s test is a cautionary tale. Your Apple Watch can track activity and trends reliably over time, but letting an AI assign letter grades to your heart health is not ready for prime time.
These bots are fascinating and hint at the future of digital health, but right now, they are best treated as experimental tools.
Users should enjoy the curiosity factor, not make decisions about their health based on a fluctuating grade from a machine that can’t even remember their age.
Doctors make misdiagnoses too, sometimes fatal ones, but they at least work from context and clinical judgment that these bots plainly lack.