Posts written by Thomas Lumley (2627)

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

March 8, 2026

Briefly

  • From BBC Somerset: “Rare coincidence as three cousins born on same day”. Two sisters-in-law gave birth on the same day, one to identical twins. One of the hospitals notes that identical twins are about 1 in 250 pregnancies. It’s going to be uncommon for two closely-linked women to give birth to three kids on the same day. The chance increases as you consider non-identical twins and more relationships: primary-school BFF, college flatmate, next-door neighbour, sky-diving partner, whatever.  Given that the UK has over half a million births per year, this has got to be a thing that regularly happens.  It’s still rare enough to properly be a big deal to the families involved, and BBC Somerset aren’t overselling it too much.
  • From a BBC news item about electricity theft (and the risks involved)

    The clear increase shown in the graph is a bit undermined by “Crimestoppers estimates that a further 250,000 cases go unreported every year”. If 95% of cases are unreported, there’s no hope for estimating trends from the 5% of reported cases — we can’t possibly distinguish trends in reporting from trends in the true rate.  Long-time StatsChat readers will remember me saying this about everything from skin cancer to domestic violence.
  • Weight-loss jab could be made for $3 a month, study finds (Guardian).  This is plausibly true and I’m not going to argue that pharmaceutical prices are where they should be. However, as with The Guardian itself, the price of one additional copy of the finished product is not the main determinant of the price, nor should it be.
  • CNN: Here’s how much the war with Iran is expected to cost every day. The answer they give is nearly US$1 billion per day. That’s a lot, but the US is a big country: it’s about three times the US daily spend on coffee and a bit less than the cost of car insurance.  More importantly, it’s not the cost of the war. It’s  not even the cost of the war to the USA as ABC News and Al Jazeera frame the same number. It’s only the cost of the munitions used by the USA.  The cost of the war, under any attempt at reasonable accounting, is far higher.
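The point in the electricity-theft item above, that trends in reporting cannot be separated from trends in the true rate, can be sketched numerically. This is a minimal illustration with made-up numbers: the counts, the reporting rates, and the function name `reported_cases` are all assumptions, not figures from the BBC or Crimestoppers. Two quite different stories about the true rate give identical reported counts.

```python
def reported_cases(true_cases, reporting_rate):
    """Reported count = true count x fraction of cases that get reported."""
    return [round(t * r) for t, r in zip(true_cases, reporting_rate)]

# Scenario A (hypothetical): true theft rises, reporting stays at 5%.
true_a = [200_000, 240_000, 280_000]
rate_a = [0.05, 0.05, 0.05]

# Scenario B (hypothetical): true theft is flat, reporting rises from 5% to 7%.
true_b = [200_000, 200_000, 200_000]
rate_b = [0.05, 0.06, 0.07]

reported_a = reported_cases(true_a, rate_a)
reported_b = reported_cases(true_b, rate_b)
print(reported_a)
print(reported_b)
```

With only the reported series to look at, nothing in the data says which scenario is closer to the truth.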
March 6, 2026

Over-averaging

The Guardian reports “Gen Z males twice as likely as baby boomers to believe wives should obey husbands”, with similar phrasing in headlines from the Daily Mail, and the NY Post, and in the lede from the BBC. This has, unsurprisingly, caused a bit of concern.

Looking at the original information from King’s College London (research for International Women’s Day†) the trend seems to go over all ages

This seems strange. I would not have thought women’s equality had been getting steadily worse for the past eighty years. Do we just have a bad question, or a bad sample, or what? The page at King’s College shows broadly similar patterns for other gender attitude questions, though often less extreme. It’s not just the question but it might be partly the question.  In particular, there might be a carryover from ‘obey’ in wedding vows, which is not quite the same. However, “A husband should have the final word on important decisions made in his home” gets very similar answers to the “obey” question.

Here’s the worldwide comparison for the ‘obey’ question:

There’s a huge amount of variation between countries, so the results will be sensitive to how countries are combined.  Honest and competent researchers will give you this sort of information, and there is a full PDF report actually linked from the King’s College page, near the top!  It has a Technical Note on the last page that says in part

The data is weighted so that the composition of each country’s sample best reflects the demographic profile of the adult population according to the most recent census data. “The Global Country Average” reflects the average result for all the countries and markets in which the survey was conducted. It has not been adjusted to the population size of each country or market and is not intended to suggest a total result. [emphasis added]

It would be interesting to see separate trends for countries and regions, rather than suggesting a total result, when the responses from different countries are so different.

 

 

† Yes, there is. November 19.

March 5, 2026

March madness

Newsroom has a long piece on traffic congestion in Auckland in March. Near the beginning, Douglas Wilson, from the Transport Research Centre at the Uni of Auckland says

“So suddenly people say, ‘Wow, it’s taking me double the travel time to get to work. Why is that the case?’ It’s not that you’ve doubled the traffic volume. Actually, the volume has only gone up a little, proportionally, but the traffic flow has reached capacity.”

This is one of the Two Simple Facts from queueing theory, the branch of applied probability that deals with congestion in networks. These networks can be physical road networks or electronic data networks or something like a system of medical waiting lists, or something as simple as a literal queue, and what they all have in common is waiting for other users.

Queueing theory can lead to very complicated simulations and theoretical approximations, but parts of it are simple. My Two Simple Facts are

  1. When you have multiple servers you should still try to have a single queue
  2. A queueing system has a “capacity” and when it gets near that capacity small changes make things much worse

Most of the time, Auckland’s traffic system works reasonably well. There’s enough wiggle room for traffic to catch up around the inevitable slowdowns.  When you get a big crash on the motorway or heavy rain or extra drivers, though, the whole system suddenly gets much slower. In the other direction, removing drivers after Christmas opens up the city out of all proportion to the number who leave.

Sudden slowdowns near full capacity are a pretty general property of queueing systems. We can look at them in a nice simple example — this sort of mathematical model is very useful both for understanding the general vibes and for developing theoretical tools.
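One standard textbook illustration of the second Simple Fact is the single-server M/M/1 queue, where the mean time a customer spends in the system is W = 1 / (μ − λ) for arrival rate λ below the service rate μ. The sketch below assumes that model (it is not necessarily the example the post goes on to use, and real roads are not literally M/M/1 queues); the function name and the arbitrary time units are illustrative choices.

```python
def mean_time_in_system(arrival_rate, service_rate):
    """Mean time in an M/M/1 queue: W = 1 / (mu - lambda), valid only if lambda < mu."""
    if arrival_rate >= service_rate:
        raise ValueError("arrival rate must be below capacity for a stable queue")
    return 1.0 / (service_rate - arrival_rate)

service_rate = 1.0  # capacity: one customer per unit time (arbitrary units)
for utilisation in (0.50, 0.80, 0.90, 0.95, 0.99):
    w = mean_time_in_system(utilisation * service_rate, service_rate)
    print(f"utilisation {utilisation:.0%}: mean time in system {w:6.1f}")
```

In this model, going from 90% to 95% of capacity is a rise of under 6% in volume, yet it doubles the mean time in the system. That is the same shape as the "double the travel time" observation in the Newsroom quote.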

(more…)

March 3, 2026

Question wording

It’s impressive that someone has done a poll in the US with only 1/3 of respondents wanting to abolish ICE. It was Third Way, who would like you to think of them as moderate.

When you look more closely you can see how it was done. The 34% figure is for “We need to abolish ICE and halt all immigration enforcement inside the country”.  These two policy changes are not especially closely linked — the US had immigration enforcement for a long time before creating ICE — and the first proposal, rather than the more extreme second one, is given the headline.

 

February 26, 2026

Waiting for the details

In August, the Guardian and the BBC reported a successful clinical trial of an “AI stethoscope”. I’ll note at the start that this isn’t ChatGPT, it’s the older deep neural network style of AI that works more predictably and consistently.

The BBC said

A British team conducted a study using a modern version and say they found it can spot heart failure, heart valve disease and abnormal heart rhythms almost instantly.

and the Guardian said

Those examined using the new tool were twice as likely to be diagnosed with heart failure, compared with similar patients who were not examined using the technology.

Patients were three times more likely to be diagnosed with atrial fibrillation – an abnormal heart rhythm that can increase the risk of having a stroke. They were almost twice as likely to be diagnosed with heart valve disease, which is where one or more heart valves do not work properly.

These reports were based on a presentation at a scientific conference, the European Society of Cardiology meeting. Yes, the trial was called TRICORDER.  “A good name is better than precious ointment”, as the Bible tells us.

There’s some reason to think the claims are plausible.  The most important of the abnormal heart rhythms is very obvious just by taking a pulse, and doctors listen for particular heart sounds as evidence of heart failure.  So, it could be true.  The “AI stethoscope” would still have to be better than a stethoscope together with natural intelligence, but that’s why you do the trial.  The trial randomly allocated half of the GP practices to use an AI stethoscope and the other half to business as usual.

Now we have the full research paper published in Lancet.  The abstract says

Intention-to-treat analysis found heart failure detection did not differ between groups (IRR 0·94 [95% CI 0·86–1·02]); with no difference in community-based or hospital-based diagnoses (p>0·05).

That is, there isn’t good evidence of a benefit from your doctor having an “AI stethoscope” and at least for heart failure detection there’s evidence against a meaningful benefit: the uncertainty interval tops out at a 2% increase.

It’s worth emphasising that the trial had already finished last August. All that differs is the analysis.   The analysis in the conference presentation was the “per protocol” comparison: comparing people who got an AI stethoscope examination with a curated set of controls who got at least one face-to-face consultation (but not necessarily a non-AI stethoscope). In fact, they couldn’t get enough information for data linkage on all the people who got an AI stethoscope examination, so the analysis only used half of them. The published paper also reports this analysis

Use declined over time, with clinicians citing workflow barriers to sustained use. In per-protocol analyses, adjusting for patient exposure to the AI-stethoscope, detection of heart failure (IRR 2·33 [95% CI 1·28–4·26]), atrial fibrillation (IRR 3·45 [2·24–5·32]), and VHD (IRR 1·92 [1·09–3·40]) was significantly increased in the intervention group.

So, doctors didn’t use the “AI stethoscope” much, but if you compare half of the people who did get AI stethoscoped with the same number of apparently similar people in the control group of the trial, you find big differences.  This difference could be a real benefit of the new device, but we no longer really have randomised evidence on that question; we’re relying on how similar the researchers could make the treatment and control groups.

It’s still worth publishing the data, and the researchers and Lancet get credit for putting the randomised-trial analysis first in the research paper. Lancet doesn’t really get much credit for posting the results as “AI-enabled stethoscopes show promise for improving diagnosis of cardiovascular conditions, UK trial finds”.

February 24, 2026

Briefly

  • The “cancer mornings” paper now has an Editors Note “The editors are issuing this note to alert readers that concerns have been raised regarding inconsistencies between the registration record of this trial on clinicaltrials.gov and published version the study protocol, as well as with some of the findings in this study. Editorial action will be taken as appropriate once an investigation of the concerns is complete and all parties have been given an opportunity to respond in full.”  For people more familiar with politics, on the controversial/embattled/disgraced/former spectrum an editors note is somewhere near the controversial/embattled boundary.  In this example it’s a mixture of not believing the result is possible and some changes in the trial registration over time (as I mentioned).
  • Various sources are enthusiastically repeating a claim that the Tesla Cybertruck is explodier than the notorious Ford Pinto was. The Cybertruck has had 5 fire fatalities in an estimated 34,438 vehicles, versus 27 in 3,173,491 vehicles for the Pinto.  Crudely, that’s a ratio of 17.  There’s obviously a lot of uncertainty, but a standard uncertainty interval for the relative risk goes from 5.8 to 41.  There are a few more caveats: first, one of the five Cybertruck deaths was deliberate. If we don’t count that one, the uncertainty interval becomes 4 to 35. Second, the denominator isn’t that reliable for the Cybertruck.  Third, we don’t have any driving information: if the Cybertrucks were driven more you might expect more fires.  Fourth, this is obviously a comparison selected after the fact, so it will be inflated and less reliable than the stats indicate.   As an illustration of the unreliability of this comparison there are claims for the Pinto that are more than an order of magnitude higher. Mother Jones, which reported the Pinto investigation, didn’t caveat the 27 deaths figure at all in its Cybertruck story.
  • Marc Daalder has an interesting story at newsroom about changes in the NZ crime rate. Or, to be more precise, the NZ reported crime rate. Retail crime is one of the sectors where the numbers are driven by reporting — most robberies from shops aren’t reported because there’s no real benefit to doing so.  This is familiar in other crime areas — intimate partner violence for one — and in medical statistics, where diagnoses of, say, prostate cancer or lung cancer or melanoma are driven by testing.
  • The Color Game: how well can you remember colours?
  • Women’s clothing sizes: scrolly/visualisation “[at age 15] This means for the first time ever, most girls in their cohort will be able to find a size in the women’s clothing section. This will also likely be the last time this ever happens in their lives.”
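The Cybertruck versus Pinto ratio in the list above can be checked with a few lines of arithmetic. The sketch below uses the counts quoted in the post and a Wald interval on the log scale under a Poisson approximation. That interval method is an assumption: the post does not say which method it used, so the limits printed here need not match the quoted 5.8 to 41 exactly, though they should be of the same order. The function name `rate_ratio_wald` is illustrative.

```python
import math

def rate_ratio_wald(events_a, n_a, events_b, n_b, z=1.96):
    """Rate ratio with a Wald interval on the log scale (Poisson approximation)."""
    ratio = (events_a / n_a) / (events_b / n_b)
    se_log = math.sqrt(1 / events_a + 1 / events_b)
    lower = ratio * math.exp(-z * se_log)
    upper = ratio * math.exp(z * se_log)
    return ratio, lower, upper

# Counts quoted in the post: fire fatalities and estimated vehicles.
ratio, lower, upper = rate_ratio_wald(5, 34_438, 27, 3_173_491)
print(f"ratio {ratio:.1f}, approximate 95% interval {lower:.1f} to {upper:.1f}")

# Dropping the one deliberate Cybertruck death, as the post does.
ratio4, lower4, upper4 = rate_ratio_wald(4, 34_438, 27, 3_173_491)
print(f"ratio {ratio4:.1f}, approximate 95% interval {lower4:.1f} to {upper4:.1f}")
```

With only five events in the numerator, the width of the interval is driven almost entirely by the Cybertruck count, which is why removing a single death moves the whole interval.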
February 20, 2026

Out of warranty?

A new medical study (reported here and here) used MRI to look at the shoulders of a reasonably representative sample of 602 people over 40 in Finland.  Rotator cuff abnormalities of varying apparent severity were seen in 595 of the people: that’s 99% to two-digit accuracy.  Of the 602 people, 18% reported shoulder pain and the other 82% claimed their shoulders were ok (apart from being over 40).

There wasn’t much difference between the people who noticed their shoulders hurting and the ones who didn’t: here’s a graph comparing asymptomatic and symptomatic shoulders, so someone with one bad shoulder and one over-40-but-otherwise-good shoulder is in both groups. The green at the bottom is “no abnormality”, then we progress through “tendinopathy”, “partial-thickness tear”, “full-thickness tear”.

You can see the abnormalities are a bit more severe in the symptomatic group, but not enough to make a useful diagnostic test.  On top of that, the researchers showed that the difference largely goes away when you adjust for things a doctor would have measured before doing the MRI, so the MRI really isn’t providing much useful information.

We’ve seen this before. New medical-imaging tech gets used first on people who look like they need it. A lot of people with back pain were given CT scans. These showed that people with back pain had weird misshapen spines, and often led to referrals for surgery.  It was much later that people not reporting significant back pain had their backs scanned — and they, too, often had weird misshapen spines. Spines are just badly designed and implemented.

Medical imaging can be immensely valuable: simple X-rays, CT scans, MRI, PET, and so on. One of Marie Curie’s many claims to fame was designing, deploying, and driving mobile X-ray units in the Battle of the Marne.  But with each shiny new technology for subtler and more precise imaging there’s an increasing need for control data. Marie Curie could see a piece of lead in a soldier’s heart and be confident that it wasn’t normal.  The problems we’re looking for in shoulders and spines are more complicated and comparisons are important.

February 14, 2026

Cancer hates mornings too?

Via the pharmaceutical chemist Derek Lowe, and also various media outlets, there is a new cancer study that randomised patients with lung cancer to get their immunotherapy infusions in the morning or the afternoon/evening.  The motivation  will have been the various not-very-convincing correlational studies where patients getting morning treatment did better on average. In those studies the differences seen were large, but the studies were small enough that only large differences could have been seen.

The new study also saw a massive difference between morning and afternoon treatment, with the estimated rate of disease progression or death being 60% lower in the morning group. That difference was 5.5 standard errors away from zero, almost physics levels of statistical surprise.
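To put "5.5 standard errors" on a more familiar scale, here is a rough conversion to a two-sided p-value. It assumes the estimate is approximately Normally distributed, which is an assumption of this sketch rather than something taken from the paper. The value 5 is included for comparison because it is the conventional "five sigma" discovery threshold in particle physics.

```python
import math

def two_sided_p_value(z):
    """Two-sided tail probability of a standard Normal at |z|."""
    return math.erfc(abs(z) / math.sqrt(2))

for z in (1.96, 5.0, 5.5):
    print(f"z = {z}: two-sided p = {two_sided_p_value(z):.2e}")
```

A number this small says the difference is very unlikely to be chance alone. It says nothing about the other explanations the post goes on to check.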

So, what  do we check?

First: dropout. Maybe the healthy patients in the afternoon or the sick patients in the morning dropped out? No, according to the research paper everyone who was randomised was included in the final analysis.

Second: did they report what they said they would report? Up to a point, yes.  The clinical trial registry says they started out with overall survival and response rate (tumour shrinkage) as their measurements of success. They changed to progression-free survival as their headline measurement after the trial had been running for a while, which is potentially dodgy.  On the other hand, they did report overall survival, and the results are almost as good as progression-free survival.  They also reported response rate, which had unimpressive favorable results, but which is a much less important measurement.  If things had gone the other way, with good response and bad survival data I would have believed the survival data.

We should now consider whether the results make sense.  This is immunology — as Ed Yong described it for the Atlantic, “where intuition goes to die”.  Looking at the experts (Derek Lowe and the people quoted in the news stories) it seems they don’t completely believe it, but they are also unwilling to entirely disbelieve it.  The drug hangs around in the body for weeks, making a time-of-day effect surprising, but who knows?  The result agrees with past correlational research, but that past research is not very convincing. The worst that the experts quoted by Stat (a medical news site) were willing to say is that only half the eligible patients were randomised, which might mean problems in generalising the results. Fortunately, this trial will be relatively easy to replicate, directly in lung cancer, or in the range of other conditions such as melanoma or head and neck cancer where this specific antibody is used, or in the wider world of immune checkpoint inhibitors.

The possibility that’s not mentioned by any of the news stories is fraud: either faking data or faking the tidiness of the randomisation and completeness of the data. Fraud happens; it’s a definite possibility.  On the other hand, this doesn’t look like an especially attractive place to try it. Other researchers are bound to redo the experiment, and look into the details, and Big Pharma hasn’t worked out how to manufacture more than one morning per day.

I expect these results to fail to replicate, but I wouldn’t bet large amounts of money on it.

Olympics condoms

Every two years (since 1988) there is at least one round of stories about condoms at the Olympics (here’s a couple of past StatsChat posts).

Many athletes would have the funds and general executive function to be able to acquire condoms for themselves, and it’s clear that a big part of the point is safe-sex advertising. It’s relatively difficult to get a positive story about condom use into the world’s prestige press, and the Olympics are an opportunity.

Usually the story is about oversupply  (450,000 in Rio!, 200,000 Olympics-branded ones in Paris!). For a couple of Olympics the story was about social distancing (Tokyo had 110,000, but they were officially just souvenirs).

This year the story is undersupply: Milano/Cortina apparently had a mere 10,000 condoms, which ran out on day 3.  It’s possible that this is a planning failure, like the nearly-finished cable car, but it might also be that the whole advertising mission of the Olympic condoms is losing its urgency.

February 12, 2026

Briefly

  • The US FDA will not even review the application for Moderna’s new flu vaccine. The FDA is very careful to give itself the flexibility to do this: if they say supportive things about your trial design today there is no legal guarantee that they can’t change their minds six times before breakfast. However, they are usually reluctant to make radical changes in their advice and, for example, typically don’t require placebo controls when an existing treatment works and is already widely recommended.
  • Hayden Donnell at the Spinoff did a deeply felt post on the scale of the Moa Point sewage discharges, with comparisons to everyday life. I want to quote one: “If you started now, it would take you 2,535 years, 15 days, 13 hours, and 20 minutes of non-stop shitting to produce as much waste as the Moa Point plant is expelling onto schools of unsuspecting fish every day.” From this we can deduce with a little additional calculation that the roughly 200,000 people living in Wellington City would take about four and a half days to produce one day of the Moa Point poop.  Even allowing for politicians, that’s a lot of effort. The problem, of course, is dilution: a toilet flush is about 10 litres rather than the 0.1 litres the Spinoff is allowing for, and there may well be further dilution downstream.
  • A set of six posts about colour (or perhaps ‘color’) from NASA
  • The American Statistical Association is taking nominations for its “Excellence in Statistical Reporting” award, due March 1st.
  • “And it turned out that the previous gender discrimination policy had been nothing like discriminatory enough; women were much safer drivers, and hadn’t previously been getting anything like enough credit for it.” Dan Davies’s excellent “Back of Mind” newsletter.
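The "little additional calculation" in the Moa Point item above can be written out as follows. The person-time figure and the population of roughly 200,000 are the ones quoted in the post. The 365.25-day year and the variable names are assumptions of this sketch.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length

# Time for one person to match one day of discharge, as quoted from the Spinoff.
person_days = 2_535 * DAYS_PER_YEAR + 15 + 13 / 24 + 20 / (24 * 60)

population = 200_000  # rough Wellington City population used in the post
city_days = person_days / population
print(f"{city_days:.2f} days")
```

The result is a bit over four and a half days, in line with the figure given in the post.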