Spelling error in title intentional. I got my hands on the BOLDER II study, which found that quetiapine (Seroquel) was an effective treatment for bipolar depression. As is often the case in clinical trials research, I found something that made me uneasy. The abstract states that the therapeutic effect size (ES) on the Montgomery–Åsberg Depression Rating Scale (MADRS) was .61 for quetiapine 300 mg/d and .54 for quetiapine 600 mg/d. The authors, however, calculated effect size with a different method than is typically used, and this alternative method inflated the apparent effect of Seroquel by a decent margin. In this post, I’ll compare effect size calculated the conventional way to the effect size reported in the BOLDER II results.
If you’re not interested in how I calculated my stats, please skip the next paragraph!
I calculated the effect sizes from information available in Table 2 (p. 604), which reports that, on the MADRS, quetiapine 300 mg/d was associated with an average improvement of 16.94 points, while placebo was associated with an average improvement of 11.93 points. That nets a difference of 5.01 points between the groups, according to my math. I calculated the standard deviation (SD) for each group from the reported standard error (SE): since SE = SD divided by the square root of the number of participants in the group, SD = SE multiplied by the square root of the group's N. This yields an SD of 12.56 for the placebo group and an SD of 12.33 for the Seroquel 300 mg/d group. I then took the average of the two SDs (technically, I weighted the placebo group a bit more heavily because there were slightly more placebo participants than Seroquel 300 mg/d participants). This yielded a pooled standard deviation of 12.45. To get the effect size, I divided 5.01 by 12.45.
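For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation, using the Table 2 figures as I read them (the variable names are mine):

```python
import math

# MADRS change-score statistics as reported in Table 2 (p. 604)
n_pl, se_pl, change_pl = 161, 0.99, 11.93   # placebo
n_q, se_q, change_q = 155, 0.99, 16.94      # quetiapine 300 mg/d

# SE = SD / sqrt(N), so SD = SE * sqrt(N)
sd_pl = se_pl * math.sqrt(n_pl)             # ~12.56
sd_q = se_q * math.sqrt(n_q)                # ~12.33

# Pool the SDs, weighting each group by its degrees of freedom (N - 1),
# which weights the slightly larger placebo group a bit more heavily
pooled_sd = math.sqrt(((n_pl - 1) * sd_pl**2 + (n_q - 1) * sd_q**2)
                      / (n_pl + n_q - 2))   # ~12.45

# Effect size: difference in mean improvement over the pooled SD
es = (change_q - change_pl) / pooled_sd     # ~0.40
print(f"pooled SD = {pooled_sd:.2f}, ES = {es:.3f}")
```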
On the MADRS, the effect size for Seroquel 300 mg/d versus placebo is .402. I assure you that the methods I used are those conventionally used in the field. The authors of the study, however, used mixed model repeated measures (MMRM) analysis to calculate their effect size. This is not how ES is conventionally calculated, and an ES of .61 is about 50% larger than the conventionally calculated ES of .402. The authors did not provide a rationale for using MMRM analysis to calculate effect size, and when using an unusual method, an explanation should certainly be provided. My guess (and I could well be wrong) is that the sponsor took a look at how both methods turned out and decided that reporting the effect size using MMRM made better publicity. An ES of .61 is moderate by most standards, while an effect of .40 is often considered small to moderate. Through a newfangled analysis, the effect grows substantially and the drug looks more efficacious than it would have had the more conventional method been used.
Hey, I’m all for progress. If MMRM is really a better analytic method, so be it. However, the study authors provided absolutely no rationale for using this analysis. They should have offered one, along with ES figures calculated by both MMRM and traditional methods, so that readers could have seen both. Instead, the higher ESs are reported without justification, leaving the reader thinking that there is no controversy regarding the reported effects of Seroquel in this study.
As for the 600 mg/d dose, my calculations yield an ES of .37, again in the small to moderate range, whereas the authors reported .54. The effect size goes up 46% this time when the unconventional method is used.
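To put the two comparisons side by side, here is a quick Python sketch (numbers from this post) showing how much larger the reported MMRM effect sizes are than my conventionally calculated ones:

```python
# Conventional (LOCF-based) ES vs. the MMRM ES reported in the abstract
doses = [("quetiapine 300 mg/d", 0.402, 0.61),
         ("quetiapine 600 mg/d", 0.37, 0.54)]

for label, conventional, reported in doses:
    increase = (reported - conventional) / conventional * 100
    print(f"{label}: {conventional} -> {reported} (+{increase:.0f}%)")
# quetiapine 300 mg/d: 0.402 -> 0.61 (+52%)
# quetiapine 600 mg/d: 0.37 -> 0.54 (+46%)
```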
To summarize, the authors used an unconventional statistical method to calculate effect size, which resulted in effect sizes about 50% larger than those yielded when conventional methods are used. The results using conventional methods are not provided in the published report and no rationale is laid out for the unconventional method.
There’s more to come on my interpretation of BOLDER II. Stay tuned.
Really like your blog - I'm also an academic with a leery eye toward the pharmaceutical industry and its academic denizens. However, I'm pretty sure that your calculation of the effect size in the MMRM is incorrect. I'm not sure where you got the standard errors from, but assuming that the MADRS score standard deviations are similar to those at baseline (about 5), the effect size between groups at week 8 as calculated in the traditional way (Cohen's d) would be about 5 points mean difference divided by 5 (standard deviation), yielding an effect size of 1. MMRM is actually more conservative, and is less deceptive than LOCF in most instances - in general, people on placebo drop out earlier and their last observation is carried forward (even though their depressive symptoms will commonly lessen, as is the placebo effect). MMRM and the approach that Thase et al. took didn't seem all that inappropriate to me; if anything, it was fairly conservative. Now, one could argue about excluding people with suicidal ideation from such studies, or whether a 5-point difference is clinically meaningful relative to the side effects that would occur, but I don't think the stats are the source of concern here. Once again though, I really enjoy reading your posts.
Thanks for the comment. I'll look at the article again then make a comment after I've checked my figures.
I'm looking at the article. On page 604 (Table 2), the standard errors for the change scores are .99 for PL, .99 for 300 mg Seroquel, and 1.01 for 600 mg Seroquel.
I converted these to standard deviations by multiplying them by the square root of N for each group. N's were 161 for PL, 155 for 300 mg Seroquel, and 151 for 600 mg Seroquel.
The SD's were hence much larger for the change scores than for the baseline scores, which is often the case. A relatively homogeneous group starts the study, but their responses to treatment vary quite a bit.
So the SD's come out to be 12.56, 12.33, and 12.41. Hence the ES: 5.01 divided by the pooled SD of 12.45, or .40.
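If it helps, the SE-to-SD conversion for all three groups can be verified in a couple of lines (a sketch, using the figures above):

```python
import math

# (standard error of change score, N) for each group, from Table 2
groups = {"placebo": (0.99, 161),
          "quetiapine 300 mg/d": (0.99, 155),
          "quetiapine 600 mg/d": (1.01, 151)}

for name, (se, n) in groups.items():
    print(f"{name}: SD = {se} * sqrt({n}) = {se * math.sqrt(n):.2f}")
# placebo: SD = 0.99 * sqrt(161) = 12.56
# quetiapine 300 mg/d: SD = 0.99 * sqrt(155) = 12.33
# quetiapine 600 mg/d: SD = 1.01 * sqrt(151) = 12.41
```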
So their ES was actually more liberal. It appears that MMRM reduced the SD a fair amount and that's why the ES changed.
My main point is that the ES changed notably depending on whether LOCF or MMRM was used and that such a difference should have been discussed in the article, in my opinion.
If MMRM is going to become the standard method used in clinical trials, then research should be done on how ES calculated by LOCF differs from MMRM-calculated ES. If there's much of a difference generally, and MMRM ES's tend to be larger, then do we need to change our general criteria for interpreting effects?
In other words, if trials that would yield a .30 ES using LOCF yield an ES of .50 with MMRM, does that mean we have suddenly increased the effect of treatment by two-thirds, or is this a statistical artifact?
Sorry, I wasn't intending to ramble on quite like that! If you've read this far, thanks. Please let me know if I missed something or if you think I'm way out in left field on this one.