Posts written by Thomas Lumley (2644)

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

March 3, 2026

Question wording

It’s impressive that someone has done a poll in the US with only 1/3 of respondents wanting to abolish ICE. It was Third Way, who would like you to think of them as moderate.

When you look more closely you can see how it was done. The 34% figure is for “We need to abolish ICE and halt all immigration enforcement inside the country”.  These two policy changes are not especially closely linked — the US had immigration enforcement for a long time before creating ICE — and the first proposal, rather than the more extreme second one, is given the headline.

 

February 26, 2026

Waiting for the details

In August, the Guardian and the BBC reported a successful clinical trial of an “AI stethoscope”. I’ll note at the start that this isn’t ChatGPT, it’s the older deep neural network style of AI that works more predictably and consistently.

The BBC said

A British team conducted a study using a modern version and say they found it can spot heart failure, heart valve disease and abnormal heart rhythms almost instantly.

and the Guardian said

Those examined using the new tool were twice as likely to be diagnosed with heart failure, compared with similar patients who were not examined using the technology.

Patients were three times more likely to be diagnosed with atrial fibrillation – an abnormal heart rhythm that can increase the risk of having a stroke. They were almost twice as likely to be diagnosed with heart valve disease, which is where one or more heart valves do not work properly.

These reports were based on a presentation at a scientific conference, the European Society of Cardiology meeting. Yes, the trial was called TRICORDER.  “A good name is better than precious ointment”, as the Bible tells us.

There’s some reason to think the claims are plausible.  The most important of the abnormal heart rhythms is very obvious just by taking a pulse, and doctors listen for particular heart sounds as evidence of heart failure.  So, it could be true.  The “AI stethoscope” would still have to be better than a stethoscope together with natural intelligence, but that’s why you do the trial.  The trial randomly allocated half of the GP practices to use an AI stethoscope and the other half to business as usual.

Now we have the full research paper published in Lancet.  The abstract says

Intention-to-treat analysis found heart failure detection did not differ between groups (IRR 0·94 [95% CI 0·86–1·02]); with no difference in community-based or hospital-based diagnoses (p>0·05).

That is, there isn’t good evidence of a benefit from your doctor having an “AI stethoscope” and at least for heart failure detection there’s evidence against a meaningful benefit: the uncertainty interval tops out at a 2% increase.
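To read an interval like that, note that an incidence rate ratio converts to a percent change in the rate by subtracting 1. Here is a minimal Python sketch using the numbers quoted from the abstract; the helper name `irr_to_percent_change` is made up for this illustration.

```python
def irr_to_percent_change(irr: float) -> float:
    """Convert an incidence rate ratio to a percent change in the rate."""
    return (irr - 1.0) * 100.0

# Point estimate and 95% CI limits as quoted from the abstract
estimate, lower, upper = 0.94, 0.86, 1.02

for label, irr in [("estimate", estimate), ("lower", lower), ("upper", upper)]:
    # '+' in the format spec prints the sign, so decreases and increases are explicit
    print(f"{label}: {irr_to_percent_change(irr):+.0f}%")
```

The interval runs from a 14% decrease to a 2% increase in detection, which is where the "tops out at a 2% increase" reading comes from.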

It’s worth emphasising that the trial had already finished last August. All that differs is the analysis.   The analysis in the conference presentation was the “per protocol” comparison: comparing people who got an AI stethoscope examination with a curated set of controls who got at least one face-to-face consultation (but not necessarily a non-AI stethoscope). In fact, they couldn’t get enough information for data linkage on all the people who got an AI stethoscope examination, so the analysis only used half of them. The published paper also reports this analysis:

Use declined over time, with clinicians citing workflow barriers to sustained use. In per-protocol analyses, adjusting for patient exposure to the AI-stethoscope, detection of heart failure (IRR 2·33 [95% CI 1·28–4·26]), atrial fibrillation (IRR 3·45 [2·24–5·32]), and VHD (IRR 1·92 [1·09–3·40]) was significantly increased in the intervention group.

So, doctors didn’t use the “AI stethoscope” much, but if you compare half of the people who did get AI stethoscoped with the same number of apparently similar people in the control group of the trial, you find big differences.  This difference could be a real benefit of the new device, but we no longer really have randomised evidence on that question; we’re relying on how similar the researchers could make the treatment and control groups.

It’s still worth publishing the data, and the researchers and Lancet get credit for putting the randomised-trial analysis first in the research paper. Lancet doesn’t really get much credit for posting the results as “AI-enabled stethoscopes show promise for improving diagnosis of cardiovascular conditions, UK trial finds”.

February 24, 2026

Briefly

  • The “cancer mornings” paper now has an Editors Note “The editors are issuing this note to alert readers that concerns have been raised regarding inconsistencies between the registration record of this trial on clinicaltrials.gov and published version the study protocol, as well as with some of the findings in this study. Editorial action will be taken as appropriate once an investigation of the concerns is complete and all parties have been given an opportunity to respond in full.”  For people more familiar with politics, on the controversial/embattled/disgraced/former spectrum an editors note is somewhere near the controversial/embattled boundary.  In this example it’s a mixture of not believing the result is possible and some changes in the trial registration over time (as I mentioned).
  • Various sources are enthusiastically repeating a claim that the Tesla Cybertruck is explodier than the notorious Ford Pinto was. The Cybertruck has had 5 fire fatalities in an estimated 34,438 vehicles, versus 27 in 3,173,491 vehicles for the Pinto.  Crudely, that’s a ratio of 17 in fire deaths per vehicle.  There’s obviously a lot of uncertainty, but a standard uncertainty interval for the relative risk goes from 5.8 to 41.  There are a few more caveats: first, one of the five Cybertruck deaths was deliberate. If we don’t count that one, the uncertainty interval is now 4 to 35. Second, the denominator isn’t that reliable for the Cybertruck.  Third, we don’t have any driving information: if the Cybertrucks were driven more you might expect more fires.  Fourth, this is obviously a comparison selected after the fact, so it will be inflated and less reliable than the stats indicate.   As an illustration of the unreliability of this comparison there are claims for the Pinto that are more than an order of magnitude higher. Mother Jones, which reported the Pinto investigation, didn’t caveat the 27 deaths figure at all in its Cybertruck story.
  • Marc Daalder has an interesting story at newsroom about changes in the NZ crime rate. Or, to be more precise, the NZ reported crime rate. Retail crime is one of the sectors where the numbers are driven by reporting — most robberies from shops aren’t reported because there’s no real benefit to doing so.  This is familiar in other crime areas — intimate partner violence for one — and in medical statistics, where diagnoses of, say, prostate cancer or lung cancer or melanoma are driven by testing.
  • The Color Game: how well can you remember colours?
  • Women’s clothing sizes: scrolly/visualisation “[at age 15] This means for the first time ever, most girls in their cohort will be able to find a size in the women’s clothing section. This will also likely be the last time this ever happens in their lives.”
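For the Cybertruck and Pinto comparison in the list above, here is a rough Python sketch of the relative-risk arithmetic. The post doesn't say which interval method was used, so the log-scale Wald interval here is an assumption, and with counts this small it won't exactly reproduce the quoted 5.8 to 41 or 4 to 35. The function name `relative_risk_wald` is made up for the example.

```python
import math

def relative_risk_wald(events_a, n_a, events_b, n_b, z=1.96):
    """Risk ratio (group A vs group B) with an approximate 95% interval.

    Uses the usual log-scale Wald approximation, which is an assumption
    here: the original post may have used an exact or other method.
    """
    rr = (events_a / n_a) / (events_b / n_b)
    se_log = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper

# Counts as given in the post: Cybertruck vs Pinto fire fatalities
rr, lower, upper = relative_risk_wald(5, 34_438, 27, 3_173_491)
print(f"all five deaths: RR {rr:.1f} ({lower:.1f} to {upper:.1f})")

# Leaving out the one deliberate Cybertruck death
rr4, lower4, upper4 = relative_risk_wald(4, 34_438, 27, 3_173_491)
print(f"four deaths:     RR {rr4:.1f} ({lower4:.1f} to {upper4:.1f})")
```

The crude ratio comes out at about 17, as in the post, and the approximate interval is in the same ballpark as the quoted one. None of this addresses the other caveats (unreliable denominator, no driving data, after-the-fact selection), which matter more than the choice of interval.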

February 20, 2026

Out of warranty?

A new medical study (reported here and here) used MRI to look at the shoulders of a reasonably representative sample of 602 people over 40 in Finland.  Rotator cuff abnormalities of varying apparent severity were seen in 595 of the people: that’s 99% to two-digit accuracy.  Of the 602 people, 18% reported shoulder pain and the other 82% claimed their shoulders were ok (apart from being over 40).

There wasn’t much difference between the people who noticed their shoulders hurting and the ones who didn’t: here’s a graph comparing asymptomatic and symptomatic shoulders,  so someone with one bad shoulder and one over-40-but-otherwise-good shoulder is in both groups. The green at the bottom is “no abnormality”, then we progress through “tendinopathy”, “partial-thickness tear”, “full-thickness tear”.

You can see the abnormalities are a bit more severe in the symptomatic group, but not enough to make a useful diagnostic test.  On top of that, the researchers showed that the difference largely goes away when you adjust for things a doctor would have measured before doing the MRI, so the MRI really isn’t providing much useful information.

We’ve seen this before. New medical-imaging tech gets used first on people who look like they need it. A lot of people with back pain were given CT scans. These showed that people with back pain had weird misshapen spines, and often led to referrals for surgery.  It was much later that people not reporting significant back pain had their backs scanned — and they, too, often had weird misshapen spines. Spines are just badly designed and implemented.

Medical imaging can be immensely valuable: simple X-rays, CT scans, MRI, PET, and so on. One of Marie Curie’s many claims to fame was designing, deploying, and driving mobile X-ray units in the Battle of the Marne.  But with each shiny new technology for subtler and more precise imaging there’s an increasing need for control data. Marie Curie could see a piece of lead in a soldier’s heart and be confident that it wasn’t normal.  The problems we’re looking for in shoulders and spines are more complicated and comparisons are important.

February 14, 2026

Cancer hates mornings too?

Via the pharmaceutical chemist Derek Lowe, and also various media outlets, there is a new cancer study that randomised patients with lung cancer to get their immunotherapy infusions in the morning or the afternoon/evening.  The motivation  will have been the various not-very-convincing correlational studies where patients getting morning treatment did better on average. In those studies the differences seen were large, but the studies were small enough that only large differences could have been seen.

The new study also saw a massive difference between morning and afternoon treatment, with the estimated rate of disease progression or death being 60% lower in the morning group. That difference was 5.5 standard errors away from zero, which is almost physics levels of statistical surprise.
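To put "5.5 standard errors" on a more familiar scale, here is a small Python sketch that converts a z-statistic to a two-sided p-value. It assumes the normal approximation still holds that far into the tails, which is optimistic for any real trial. The comparison at 5.0 is there because particle physics conventionally uses a five-sigma threshold for claiming a discovery.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-statistic under a standard normal."""
    # P(|Z| > z) = erfc(z / sqrt(2)) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

for z in (1.96, 5.0, 5.5):
    print(f"z = {z}: two-sided p = {two_sided_p(z):.2g}")
```

At 5.5 standard errors the nominal p-value is a few parts in a hundred million, so ordinary chance is not a serious explanation. That is why the checks that follow are about everything except random noise.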

So, what  do we check?

First: dropout. Maybe the healthy patients in the afternoon or the sick patients in the morning dropped out? No, according to the research paper everyone who was randomised was included in the final analysis.

Second: did they report what they said they would report? Up to a point, yes.  The clinical trial registry says they started out with overall survival and response rate (tumour shrinkage) as their measurements of success. They changed to progression-free survival as their headline measurement after the trial had been running for a while, which is potentially dodgy.  On the other hand, they did report overall survival, and the results are almost as good as progression-free survival.  They also reported response rate, which had unimpressive favorable results, but which is a much less important measurement.  If things had gone the other way, with good response and bad survival data I would have believed the survival data.

We should now consider whether the results make sense.  This is immunology — as Ed Yong described it for the Atlantic, “where intuition goes to die”.  Looking at the experts (Derek Lowe and the people quoted in the news stories) it seems they don’t completely believe it, but they are also unwilling to entirely disbelieve it.  The drug hangs around in the body for weeks, making a time-of-day effect surprising, but who knows?  The result agrees with past correlational research, but that past research is not very convincing. The worst that the experts quoted by Stat (a medical news site) were willing to say is that only half the eligible patients were randomised, which might mean problems in generalising the results. Fortunately, this trial will be relatively easy to replicate, directly in lung cancer, or in the range of other conditions such as melanoma or head and neck cancer where this specific antibody is used, or in the wider world of immune checkpoint inhibitors.

The possibility that’s not mentioned by any of the news stories is fraud: either faking data or faking the tidiness of the randomisation and completeness of the data. Fraud happens; it’s a definite possibility.  On the other hand, this doesn’t look like an especially attractive place to try it. Other researchers are bound to redo the experiment, and look into the details, and Big Pharma hasn’t worked out how to manufacture more than one morning per day.

I expect these results to fail to replicate, but I wouldn’t bet large amounts of money on it.

Olympics condoms

Every two years (since 1988) there is at least one round of stories about condoms at the Olympics (here’s a couple of past StatsChat posts).

Many athletes would have the funds and general executive function to be able to acquire condoms for themselves, and it’s clear that a big part of the point is safe-sex advertising. It’s relatively difficult to get a positive story about condom use into the world’s prestige press, and the Olympics are an opportunity.

Usually the story is about oversupply  (450,000 in Rio!, 200,000 Olympics-branded ones in Paris!). For a couple of Olympics the story was about social distancing (Tokyo had 110,000, but they were officially just souvenirs).

This year the story is undersupply: Milano/Cortina apparently had a mere 10,000 condoms, which ran out on day 3.  It’s possible that this is a planning failure, like the nearly-finished cable car, but it might also be that the whole advertising mission of the Olympic condoms is losing its urgency.

February 12, 2026

Briefly

  • The US FDA will not even review the application for Moderna’s new flu vaccine. The FDA is very careful to give itself the flexibility to do this: if they say supportive things about your trial design today there is no legal guarantee that they can’t change their minds six times before breakfast. However, they are usually reluctant to make radical changes in their advice and, for example, typically don’t require placebo controls when an existing treatment works and is already widely recommended.
  • Hayden Donnell at the Spinoff did a deeply felt post on the scale of the Moa Point sewage discharges, with comparisons to everyday life. I want to quote one: “If you started now, it would take you 2,535 years, 15 days, 13 hours, and 20 minutes of non-stop shitting to produce as much waste as the Moa Point plant is expelling onto schools of unsuspecting fish every day.” From this we can deduce with a little additional calculation that the roughly 200,000 people living in Wellington City would take about four and a half days to produce one day of the Moa Point poop.  Even allowing for politicians, that’s a lot of effort. The problem, of course, is dilution: a toilet flush is about 10 litres rather than the 0.1 litres the Spinoff is allowing for, and there may well be further dilution downstream.
  • A set of six posts about colour (or perhaps ‘color’) from NASA
  • The American Statistical Association is taking nominations for its “Excellence in Statistical Reporting” award, due March 1st.
  • “And it turned out that the previous gender discrimination policy had been nothing like discriminatory enough; women were much safer drivers, and hadn’t previously been getting anything like enough credit for it.” Dan Davies’s excellent “Back of Mind” newsletter.
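The "little additional calculation" for the Moa Point item in the list above can be sketched in Python. It treats the Spinoff's figure as person-time and divides by the population. The 365.25-day year is my assumption, since the Spinoff's exact convention isn't stated, and the names are made up for the example.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length, leap years included

def duration_in_days(years, days, hours, minutes):
    """Convert a years/days/hours/minutes duration to fractional days."""
    return years * DAYS_PER_YEAR + days + hours / 24 + minutes / (24 * 60)

# One person's non-stop effort, as quoted from the Spinoff
one_person_days = duration_in_days(2535, 15, 13, 20)

# Rough population of Wellington City, as given in the post
population = 200_000

# If everyone contributes in parallel, the time needed divides by the population
city_days = one_person_days / population
print(f"{city_days:.1f} days for the whole city")
```

This gives a bit over four and a half days, matching the figure in the post. It inherits the Spinoff's 0.1 litre allowance, so it ignores the dilution problem entirely.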

February 11, 2026

Coffee brain?

Various sources are telling us that coffee and tea consumption can lower the risk of dementia (the Independent is clickbaiting it to “the common drinks linked with reducing risk of dementia”, and 9News in Australia is even more extreme with “The everyday act that could reduce your risk of dementia, according to Harvard study”).  The subtext is definitely that caffeine is responsible for the decrease.

The research (paywalled, sadly) comes from two large studies of health professionals: the Nurses’ Health Study and the Health Professionals Follow-up Study.  You will have heard of them before; the participants have now been studied for 30-40 years and thousands of research papers written.  The rate of dementia was about 20% lower in  people who drank above-average amounts of tea or caffeinated coffee, but this reduction was not seen in people who drank decaf coffee. Since about 1 in 10 of the participants ended up with dementia, a 20% lower rate would mean preventing about two cases per 100 people. That’s not huge, but it’s not trivial either.  If you’ve been following medical news, it’s about the same reduction in dementia claimed for the shingles vaccine.
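The step from a relative reduction to "two cases per 100 people" is just a multiplication, sketched here in Python. It crudely treats the overall 1-in-10 figure as the risk in the comparison group, which is an approximation, and the function name is made up for the example.

```python
def cases_prevented_per_100(baseline_risk: float, relative_reduction: float) -> float:
    """Absolute reduction in cases per 100 people, given a relative reduction."""
    return 100 * baseline_risk * relative_reduction

baseline_risk = 0.10       # about 1 in 10 participants developed dementia
relative_reduction = 0.20  # rate about 20% lower in the tea/coffee drinkers

prevented = cases_prevented_per_100(baseline_risk, relative_reduction)
print(f"about {prevented:.0f} cases prevented per 100 people")
```

The same 20% relative reduction would mean far fewer prevented cases for a rarer condition, which is why the baseline rate matters when judging whether an effect is "huge" or "trivial".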

Unlike the shingles vaccine finding, which took advantage of a change in the rules that approximates randomisation, the coffee finding is purely correlational. Should we believe it?

It helps that the study is quite large (so random noise is less likely to give big spurious differences) and that participants’ coffee and tea consumption was measured from early on in the study. This study,  unlike small studies, would probably have been published whatever its findings, especially as the lead researcher is a Harvard PhD student.  It also helps that we know coffee and tea are pretty safe — many people who are suspicious of drugs and/or fun have tried quite hard to find harmful effects, with remarkably little success.

One negative fact, at least for the caffeine explanation, is the finding for tea.  The estimated risk reduction for a group of people who drank an average of 1 cup of tea per day is about the same as for a group who drank an average of 2.5 cups of coffee per day — and 2.5 cups of coffee is a lot more caffeine than one cup of tea.

I don’t think the data are all that convincing — this is really below the limit of what can reliably be done with long-term diet data — but we are not going to get better correlational data on coffee, and a randomised trial is outside the range of plausibility. If you drink tea or caffeinated coffee, it’s nice to think that you might be protecting your brain. If you don’t, there’s probably some reason you don’t. I’m not sure these data should change your mind.

February 10, 2026

Medical chatbots: the questions or the answers

A story in the NYTimes and also an unpaywalled story at 404 Media report a study of chatbots for medical advice, saying they are Bad and Not Good.

The research study is published in Nature Medicine. It’s a randomised controlled experiment, where people pretending to be patients were given a set of symptoms and some background health and lifestyle information.  These people were randomly assigned to talk to one of three large language models or just to use whatever information they would normally use at home for a health problem.

The three chatbots were chosen because they were able to recognise the medical situation in nearly every case and typically give appropriate advice  when directly given the same information that the pretend patients had.   When used in chat by non-medical people, though, the bots did much less well. One highlighted example was a scenario of a severe, sudden-onset headache, with sensitivity to light and a stiff neck.   In this scenario, the sudden onset and the stiff neck are both signs of a very serious event — the scenario was based on subarachnoid haemorrhage, a type of stroke.  One pretend patient emphasised the suddenness of the headache and got the correct advice (Ambulance! Now!), another didn’t mention the onset and got advice for a migraine or a tension headache (“lie down in a dark room”).  The bots weren’t any worse than unaided lay people, but they weren’t any better either.

You might think it’s a bit unfair to the chatbot that it wasn’t given all the information, but an important part of the training of doctors (as with statisticians and lawyers and plumbers) is learning what questions to ask when dealing with a non-specialist member of the public.  Obviously, even if you think there’s a barrier in principle to statistical algorithms making great art, there’s no barrier in principle to statistical algorithms learning to take adequate medical histories. They aren’t there yet.

 

 

Who did the Superbowl half-time show?

Unless you have been living in a cave* you will probably be aware that the lead performer was one Benito Antonio Martínez Ocasio, a Puerto Rican rapper who performs as “Bad Bunny”.  The story is complicated a bit because of prediction markets.  The idea of prediction markets is that they can predict the future by letting experts get paid for integrating all the information about a question and betting correctly.

There are reasons to be somewhat skeptical.  The best way to make money out of a prediction market is to have inside information, but if that is too common then no-one sensible who doesn’t have inside information will bet and lose, so the incentives go away. It’s not clear how well they can work in practice.  On the other hand, two US companies, Kalshi and Polymarket, have discovered that gambling can be rebranded as a prediction market, with less regulation, lower minimum age for participants, and more favorable tax treatment.  It’s possible that sports gamblers will also help rescue prediction markets by providing uninformed money.

The other problem with prediction markets about complicated questions is deciding whether the event happened.  According to Business Insider, quite a number of people had bet on whether Cardi B would do the Superbowl half-time show. You and I and probably many of those people might have expected this binary yes/no question to be easy to resolve. In fact, Kalshi and Polymarket resolved it in opposite directions.  The complication is that Cardi B (along with various other well-known performers) was there on stage, so that precise definitions are going to matter.

It’s possible that some fiendishly clever people predicted this confusion and correctly predicted that Kalshi and Polymarket would split on the question and extracted a big win. If so, go them! Otherwise, whether any hypothetical smart money won or lost would depend on the luck of which market it chose.

 

* “on Mars, with your eyes closed and your fingers in your ears” as the Simpsons’ Sideshow Cecil put it