Saturday, 3 January 2015

Science by press release

Yesterday's Times has a front page story "Two thirds of cancer cases are the result of bad luck rather than poor lifestyle choices...". (paywall)

That doesn't match my preconceptions, so I looked for the story online.    The Independent and the Telegraph agree.  So does the Mail.  And the Express. And the Mirror.

Reuters' headline agrees, but its story suggests something a bit different - that two thirds of an arbitrary selection of cancer types occur mainly at random.

The BBC speaks unambiguously of "most cancer types" and so does The Guardian.

The press release which must have given rise to this story features the phrase "two thirds of adult cancer incidence across tissues can be explained primarily by 'bad luck'".  I can't make much sense of "cancer incidence across tissues", so I can't blame the journalists for stumbling over it likewise.  But the reporters who got the story right must have managed to scan down to the paragraph where the press release explains that the researchers "found that 22 cancer types could be largely explained by the “bad luck” factor of random DNA mutations during cell division. The other nine cancer types had incidences higher than predicted by "bad luck" and were presumably due to a combination of bad luck plus environmental or inherited factors."

I emphasize that "two thirds of cancer types" is not at all the same as "two thirds of cancer cases".  Two rare cancers apparently unrelated to environmental factors will count for far fewer cases than one common cancer in the other category.
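A toy calculation makes the point. The three cancer types and their case counts below are invented purely for illustration; none of the numbers come from the paper or the press release.

```python
# Hypothetical case counts per cancer type (made up for illustration).
# "random" marks the types that would land in the "bad luck" group.
cancers = [
    {"name": "rare type A", "cases": 1_000, "random": True},
    {"name": "rare type B", "cases": 2_000, "random": True},
    {"name": "common type C", "cases": 47_000, "random": False},
]

def fraction_of_types(data):
    """Share of cancer TYPES in the 'bad luck' group."""
    return sum(c["random"] for c in data) / len(data)

def fraction_of_cases(data):
    """Share of cancer CASES that belong to 'bad luck' types."""
    total = sum(c["cases"] for c in data)
    return sum(c["cases"] for c in data if c["random"]) / total

print(f"{fraction_of_types(cancers):.0%} of types, "
      f"{fraction_of_cases(cancers):.0%} of cases")
```

With these made-up counts, two thirds of the types are "bad luck" cancers but they account for only 6% of the cases.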

So what of the paper behind the press release?  Here's the abstract, with a paywalled link to the whole paper.  Or you can 'preview' the paper for free here, to the extent your conscience permits.  Supplementary data and methodology descriptions are here.

The hypothesis behind the paper is that cancer is to a large extent caused by errors arising during stem cell division, at a rate which is independent of the tissue type involved.  The researchers therefore obtain estimates of the lifetime number of stem cell divisions in various tissue types, and plot that against lifetime cancer incidence, obtaining a significant-looking scatter plot (Figure 1 in the published paper).  So far so good.

But they've used a log-log plot, necessary to cover the orders-of-magnitude variation in the data.  Now, if you think, as the researchers apparently do, that cancer risk is proportional to the number of stem cell divisions, it follows that the slope of a log-log plot should be unity (if risk = k × divisions, then log risk = log k + log divisions).  It isn't: by eye it's more like two thirds.  The researchers, busy calculating a linear correlation between the log values, seem not to have noticed this surprising result.  Instead they square the correlation to get an R² of 65%, which may (it's not clear) be the source of the "two-thirds of cancer types" claim.
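Here is a minimal sketch of the slope argument, using synthetic data rather than the paper's. The division counts, the constant `k` and the second power-law series are all hypothetical values chosen only to show what each assumption predicts.

```python
import math

def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx

# Hypothetical lifetime stem cell divisions for four tissues.
divisions = [1e6, 1e8, 1e10, 1e12]
k = 1e-14  # hypothetical risk per division

# Risk strictly proportional to divisions.
proportional = [k * n for n in divisions]
# Risk following a power law with exponent 2/3 instead.
sublinear = [1e-9 * n ** (2 / 3) for n in divisions]

print(round(loglog_slope(divisions, proportional), 3))
print(round(loglog_slope(divisions, sublinear), 3))
```

The proportional series gives a slope of 1 and the power-law series a slope of about 0.667, so a fitted slope near two thirds is evidence against simple proportionality, whatever the correlation.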

If so, that claim is based on a total failure of comprehension of what correlation means.  Imagine a hypothetical world in which cancer occurs during stem cell division with some significant probability only if a given environmental factor is present, and that environmental factor is present equally in all tissue types.  In this world cancer incidence across tissue types is perfectly correlated with the number of stem cell divisions, but nevertheless all cancer is caused by the environmental factor.
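That hypothetical world is easy to simulate. In the sketch below the exposure level, the per-division probability and the division counts are all invented, and incidence is taken as exposure × probability × divisions, which is a rare-event approximation of my own choosing.

```python
import math

EXPOSURE = 0.3          # hypothetical: same exposure in every tissue
P_PER_DIVISION = 1e-13  # hypothetical: chance per division, only when exposed

# Hypothetical lifetime stem cell divisions for five tissues.
divisions = [1e7, 1e8, 1e9, 1e10, 1e11]

def incidence(n, exposure):
    """Lifetime incidence; zero unless the environmental factor is present."""
    return exposure * P_PER_DIVISION * n

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

with_factor = [incidence(n, EXPOSURE) for n in divisions]
without_factor = [incidence(n, 0.0) for n in divisions]

r = pearson([math.log10(n) for n in divisions],
            [math.log10(i) for i in with_factor])
print(f"r = {r:.3f}, R^2 = {r * r:.3f}")
print("incidence with the factor removed:", without_factor)
```

The correlation comes out as 1 to within rounding, yet removing the environmental factor drives every incidence to zero, so the R² says nothing about how much cancer is "bad luck".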

It's simply impossible to say anything about the importance of environmental factors in a statistical analysis without including those factors as an input to the analysis.

However, the press release also features the paragraph I quoted about 22 out of 31 cancer types being largely explained by bad luck.  Perhaps that's what they mean by two thirds.  To get this number, they devised an Extra Risk Score - ERS for short.  Then they used AI methods to divide cancer types into two groups based on the ERS values.  So what's the ERS?  The Supplement describes it as "the (negative value of the) area of the rectangle formed in the upper-left quadrant of Fig. 1 by the two coordinates (in logarithmic scale) of a data point as its sides." That is, it's the product of the {base-10 logarithm of stem cell divisions} and the {base-10 logarithm of lifetime cancer risk}.  (The cancer risk logarithm is negative (or zero) since lifetime risk is less than (or equal to) one.)

Shorn of the detail, it's the product of two logarithms.  How does that make sense?  Multiplying two logarithms is bizarre; for all ordinary purposes you're supposed to add them.  For this analysis, a simple measure would seem to be the ratio of lifetime incidence to stem cell divisions, or you might prefer the log of that ratio, which would be the log of the incidence minus the log of the stem cell divisions.
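To show that the choice matters, here is a small sketch comparing the product of logs with the log of the ratio. The function names and the two tissues are hypothetical, and `ers` follows my reading of the Supplement's definition rather than any code from the paper.

```python
import math

def ers(divisions, risk):
    """Product of the two base-10 logs (my reading of the Supplement's ERS)."""
    return math.log10(divisions) * math.log10(risk)

def log_ratio(divisions, risk):
    """Log of incidence per division: log10(risk) - log10(divisions)."""
    return math.log10(risk) - math.log10(divisions)

# Two hypothetical tissues: (lifetime stem cell divisions, lifetime risk).
tissue_a = (1e6, 1e-4)   # few divisions, low risk
tissue_b = (1e12, 0.03)  # many divisions, higher risk

for name, t in (("A", tissue_a), ("B", tissue_b)):
    print(f"tissue {name}: ERS = {ers(*t):.2f}, log ratio = {log_ratio(*t):.2f}")
```

For these invented values the product of logs ranks tissue B as having more "extra risk", while the log ratio ranks tissue A higher, because A has far more risk per division.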

(On further reflection, the number I'd use would be {log(1-incidence)/divisions}.  That doesn't give a defined answer for lifetime incidence of unity, but you can get a number by using an incidence of just less than unity.  Among the other cancer types, it picks out gallbladder cancer as having the highest environmental or hereditary risk, which is consistent with that cancer's unusual geographical variation of incidence.)
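One way to motivate that number, under a simplifying assumption of mine that is not taken from the paper: suppose each stem cell division independently leads to cancer with a small probability p, the same for every division in a given tissue. With n lifetime divisions and lifetime incidence I:

```latex
I = 1 - (1 - p)^{n}
\quad\Longrightarrow\quad
\frac{\log(1 - I)}{n} = \log(1 - p) \approx -p \qquad (p \ll 1)
```

On that assumption the number is (minus) an estimate of the risk per division, which is the quantity you'd want to compare across tissues. It also shows why an incidence of exactly unity breaks the formula: log(1 - I) is then the log of zero.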

The Supplement attempts to justify multiplying the logarithms by explaining why dividing them wouldn't make sense.  Which is a bit like advocating playing football in ballet shoes because it would be foolish to wear stilettos.

Whatever ERS calculation you used, the clustering method would still divide the cancers into two groups, because that's what clustering methods do, but different calculations would put different cancer types in the high-ERS group.  If you want, as the senior author does, to draw conclusions from the composition of the high-ERS cluster, you need a sound justification for your ERS calculation.
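A bare-bones illustration, assuming a simple one-dimensional 2-means loop of my own (not necessarily the clustering method the paper used) and four invented tissues labelled T1 to T4:

```python
import math

def two_means(values, max_iter=100):
    """Split 1-D values into low/high groups; returns True for 'high'."""
    lo, hi = min(values), max(values)  # initial cluster centres
    if lo == hi:
        raise ValueError("need at least two distinct values")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        is_high = [v >= mid for v in values]
        lows = [v for v, h in zip(values, is_high) if not h]
        highs = [v for v, h in zip(values, is_high) if h]
        new_lo, new_hi = sum(lows) / len(lows), sum(highs) / len(highs)
        if (new_lo, new_hi) == (lo, hi):
            break  # centres stopped moving
        lo, hi = new_lo, new_hi
    return is_high

# Hypothetical tissues: (lifetime stem cell divisions, lifetime risk).
tissues = {
    "T1": (1e6, 1e-4),
    "T2": (1e12, 0.03),
    "T3": (1e8, 1e-3),
    "T4": (1e11, 0.05),
}

def product_of_logs(d, r):
    return math.log10(d) * math.log10(r)

def log_ratio(d, r):
    return math.log10(r) - math.log10(d)

def high_group(score):
    """Names of the tissues the clustering puts in the high-score group."""
    names = list(tissues)
    flags = two_means([score(*tissues[n]) for n in names])
    return {n for n, f in zip(names, flags) if f}

print(sorted(high_group(product_of_logs)))
print(sorted(high_group(log_ratio)))
```

Both scores produce a tidy two-way split, but with these made-up values the product of logs puts T2 and T4 in the high group while the log ratio puts T1 and T3 there. The clustering step is happy either way; only the score decides who counts as "extra risk".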

__

To its credit, The Guardian has published a piece pointing out the correlation misunderstanding.  This piece is also highly unimpressed by the paper, and this review of it has mixed feelings.

Me, I suppose the underlying idea has some truth in it.  But the methodology is the worst I've ever seen in a prominently published paper.

Update: more commentary from Understanding Uncertainty and StatsGuy 

Update: Bradley J Fikes, author of this piece in the San Diego Union-Tribune, complains in comments here that the title of this post is ill-chosen.  He points out that he didn't write his story simply from the press release, but checked it with Johns Hopkins before it was published.  He's got a point about the title: more than half of what I say here is criticism of the paper, not of the press release.



Comments:

  1. > I can't make much sense of "cancer incidence across tissues"

    I read http://www.statsguy.co.uk/are-two-thirds-of-cancers-really-due-to-bad-luck/ which I found helpful. That, too, points out flaws in the press, though not quite the one you're talking about.

    That gets you some way back towards "of course lifestyle (etc) affects incidence". Although it looks like he blows it somewhat in the comments.

  2. The bane of my life as a medical engineer was clinicians who were functionally innumerate and had little to no grasp of statistics.

  3. http://www.statschat.org.nz/2015/01/03/cancer-isnt-just-bad-luck

    http://notstatschat.tumblr.com/post/106956046171/variation-explained-and-log-transformation
