For the last 10 years, after the Khan et al. meta-analysis, and especially after the Kirsch et al. publication [5, 8], the efficacy of antidepressants in the treatment of major depression has been under dispute. The current multi-meta-analysis utilised the Kirsch et al. data set and suggests that the most appropriate methods to meta-analyse these data are RE meta-regression models in a Bayesian setting using the SMD scale. It is important to decide which method of meta-analysis is best for the current data set, since different methods and different effect measures have different properties and can therefore result in different estimates [35, 41, 42].

The use of SMD in a Bayesian RE meta-regression model suggests that the standardised effect size of antidepressants relative to placebo is 0.34 (0.27–0.42), and that there is no significant role for the initial severity of depression. The most probable raw HDRS change score is 2.82 (2.21–3.44), with an interval extending above 3. Our analysis showed that antidepressants are not equally effective. Bayesian NMA approaches suggest that venlafaxine is more effective than the rest, with fluoxetine being the least effective among antidepressants.

The Kirsch hypothesis concerning depression is that response lies on a continuum from no intervention at all (e.g. waiting lists) to neutral placebo, then to active and augmented placebo including psychotherapy, and finally to antidepressants, which exert a slightly higher efficacy probably because their side effects make blinding imperfect (enhanced placebo) [10, 43–48]. The full theory of Kirsch and its criticism can be found elsewhere [49, 50].

The meta-analytical methods applied so far have advantages and limitations, and much of the discussion has focused on these limitations and on the biases they introduce (Table 1). In the analysis of Kirsch et al. [5], the authors calculated the mean change in the drug arms and the mean change in the placebo arms and then took their difference. This breaks the randomisation and introduces bias, as it ignores the studies' characteristics and the sample size [51–53]. Such so-called naïve comparisons are liable to bias and overprecise estimates. Horder et al. [19] used simple meta-analysis in a frequentist approach: they applied standard meta-analytic methods (fixed and random effects meta-analysis) and a frequentist meta-regression in which the drug change is plotted against the placebo change. Meta-regression, the way they used it, also breaks the randomisation, as it does not account for the correlation between the change in placebo and the change in drug. Fountoulakis and Moller [18] used two methods: (a) sample size weighting, which is appropriate when a set of independent effect sizes (e.g. RMD, SMD) is combined, but which again breaks the randomisation and introduces bias; and (b) inverse variance weighting, which weights each arm in each study by the inverse of its variance, i.e. its precision. Weighting by the precision of the effect estimates gives the most accurate estimate of the summary effect size. As applied, however, the method calculates the standardised change separately for drug and placebo and then takes their difference, which again breaks the randomisation and introduces bias. Khan et al. [8] applied simple regression in a frequentist approach, in which the drug change is plotted against baseline and the correlation coefficient is calculated. However, the precision of each study and the heterogeneity are not taken into account as they would be in a meta-regression analysis. Then, in order to draw conclusions, the authors divided the sum of the number of early discontinued patients by the sum of the total number of patients in each arm and then calculated the chi-square. This is not an appropriate analysis, as it also breaks the randomisation.
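The distinction drawn above between pooling arms first and pooling within-trial differences can be illustrated with a minimal sketch. The three trials below are entirely hypothetical (they are not taken from the Kirsch et al. data set), a common SD of 8 is assumed for every arm, and the second function is a plain fixed-effect inverse-variance estimate rather than the Bayesian RE meta-regression used in this paper.

```python
from math import sqrt

# Hypothetical trials, for illustration only:
# (mean change on drug, n drug, mean change on placebo, n placebo, assumed SD)
trials = [
    (10.0, 150, 8.0, 50, 8.0),
    (12.0, 60, 9.0, 60, 8.0),
    (9.0, 40, 7.5, 120, 8.0),
]


def naive_arm_pooling(trials):
    """Pool each arm across trials first, then subtract.

    This is the 'naive' comparison: drug and placebo arms are no longer
    contrasted within the trial that randomised them.
    """
    drug = sum(d * nd for d, nd, _, _, _ in trials) / sum(nd for _, nd, _, _, _ in trials)
    placebo = sum(p * n_p for _, _, p, n_p, _ in trials) / sum(n_p for _, _, _, n_p, _ in trials)
    return drug - placebo


def inverse_variance_pooling(trials):
    """Take the within-trial difference first, then pool the differences
    with fixed-effect inverse-variance weights (randomisation preserved)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for d, nd, p, n_p, sd in trials:
        diff = d - p                              # within-trial contrast
        variance = sd ** 2 * (1 / nd + 1 / n_p)   # variance of that contrast
        weight = 1 / variance
        weighted_sum += weight * diff
        total_weight += weight
    return weighted_sum / total_weight, sqrt(1 / total_weight)


naive = naive_arm_pooling(trials)
pooled, pooled_se = inverse_variance_pooling(trials)
print(f"naive arm pooling:        {naive:.2f}")
print(f"inverse-variance pooling: {pooled:.2f} (SE {pooled_se:.2f})")
```

With unequal allocation across trials, the two estimates differ even though they use the same arm-level numbers, because the naive estimate mixes between-trial differences into the drug vs. placebo contrast.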

We believe that the current paper resolves the debate concerning the efficacy of antidepressants and its possible relationship to the initial severity in a definitive manner.

The argument that an SMD of 0.30–0.35 is weak, and that it suggests the treatment is not really working or does not make any clinically relevant difference, neglects the fact that such an effect size is the rule rather than the exception [54]. Traditionally, an SMD of around 0.2 is considered to be small, around 0.5 medium and around 0.8 large [55], but this is an arbitrary convention, and in the real world of therapeutics, things are quite different. For comparison, one should look at the acute mania meta-analyses, which suggest an SMD of 0.22 [56] or 0.42 [57], while clinically, acute mania is one of the easiest-to-treat acute psychiatric conditions. Also, the SMD of antipsychotics against the positive symptoms of schizophrenia is 0.48 [58].

The present study suggests that in this data set, the SMD results in more meaningful inferences than the RMD effect measure, since a greater amount of heterogeneity is produced using RMD. However, all calculations of RMD suggested a mean close to 3 and confidence intervals including the value of 3, thus suggesting that the RMD is not lower than the suggested NICE criterion. Moreover, this criterion is arbitrary and unscientific, both in terms of clinical experience and in mathematical terms (because of the mathematical coupling phenomenon, see below), but this discussion is beyond the scope of the current paper [59, 60].

Because the earlier meta-analyses suggested that initial severity is related to outcome, with more severe cases responding better to antidepressants in comparison to placebo, some authors suggested that medication might not work at all for mildly depressed patients. Thus, they argued that for these patients, medication should not be prescribed; instead, alternative treatments which presumably lack side effects should be preferred, in spite of the possibility that the difference between medication and psychotherapy is similar to that between medication and placebo [61]. The suggestion to avoid pharmacotherapy in cases of mild depression has also been adopted by the most recent NICE guidelines (CG90). An immediate consequence is that patients suffering from mild depression are deprived of antidepressants on the basis of this conclusion and the overvaluation of ‘alternative therapies’.

‘Common sense’ among physicians leads to the belief that patients with greater disease severity at baseline respond better to treatment. The relation between baseline disease severity and treatment effect has a generic name in the statistical literature: ‘the relation between change and initial value’ [62] because treatment effect is evaluated by measuring the change of variables from their initial (baseline) values. In psychology, it is also well known as the ‘law of initial value’ [63].

However, the concept of ‘mathematical coupling’, first demonstrated by Oldham in 1962, suggests that there is a strong structural correlation (approximately 0.71) between baseline values and change, even when ‘change’ is calculated on the basis of two columns of random numbers [59]. Mathematical coupling can lead to an artificially inflated association between initial value and change score when naïve methods are used [60]. The problem is that Bayesian methods, which are able to correct for this artefact to a significant degree, are not routinely applied in meta-analytic research [64–66]. However, even these methods are not completely free from this phenomenon.
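The structural correlation described above can be reproduced with a short simulation. This is a minimal sketch under stated assumptions: the two columns are independent standard normal draws with equal variance, 'change' is defined as baseline minus follow-up, and the seed and sample size are arbitrary choices. Under these assumptions the expected correlation between baseline and change is 1/√2, i.e. approximately 0.71, although baseline and follow-up are unrelated.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)   # fixed seed so the run is reproducible
n = 20000                # arbitrary number of simulated 'patients'

# Two columns of random numbers with no real relationship between them.
baseline = [rng.gauss(0, 1) for _ in range(n)]
followup = [rng.gauss(0, 1) for _ in range(n)]

# 'Change' shares the baseline term with the variable it is correlated with.
change = [b - f for b, f in zip(baseline, followup)]

r_raw = pearson(baseline, followup)     # expected to be near 0
r_coupled = pearson(baseline, change)   # expected to be near 1/sqrt(2)

print(f"baseline vs follow-up: {r_raw:.2f}")
print(f"baseline vs change:    {r_coupled:.2f}")
```

The second correlation arises purely because baseline appears on both sides of the comparison, which is why a naive plot of change against initial severity can suggest a relationship that the data do not contain.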

Taking into account that our data form a ‘star-shaped’ network, in which all agents are compared with placebo, we employed a more advanced statistical method than other authors have in the past, namely NMA, in which the probability of being the best [31] and the SUCRA values [32] are calculated for all treatments. In our case (star network pattern), the NMA method relies only on indirect comparisons via placebo to contrast the different agents. In comparison, Huedo-Medina et al. [27] employed the naïve method of pooling the results, which has been criticised in the meta-analysis literature as liable to bias [53]. In conclusion, the results of the current paper, which suggest that the Bayesian approach returns no role for initial severity, should be considered strong. This finding is in accord with the conclusions other authors reached by analysing different data sets [67, 68].
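The two ranking summaries named above can be sketched from posterior draws as follows. This is an illustration only, not the code used for the present analysis: the treatment labels, posterior means and SDs are hypothetical, independent normal draws stand in for real MCMC output, larger values are assumed to mean greater benefit vs. placebo, and only active treatments are ranked for brevity. SUCRA is computed as the average of the cumulative rank probabilities over the first k − 1 ranks.

```python
import random


def rank_summaries(draws):
    """Compute P(best) and SUCRA for each treatment.

    draws: dict mapping treatment name -> list of posterior draws of its
           effect vs placebo (all lists the same length, larger = better).
    """
    names = list(draws)
    k = len(names)
    n_draws = len(draws[names[0]])

    # rank_counts[name][r] = number of draws in which `name` had rank r (0 = best)
    rank_counts = {name: [0] * k for name in names}
    for i in range(n_draws):
        ordered = sorted(names, key=lambda nm: draws[nm][i], reverse=True)
        for rank, nm in enumerate(ordered):
            rank_counts[nm][rank] += 1

    summaries = {}
    for nm in names:
        probs = [count / n_draws for count in rank_counts[nm]]
        cumulative = 0.0
        cumulative_sum = 0.0
        for p in probs[:-1]:          # cumulative probabilities for ranks 1..k-1
            cumulative += p
            cumulative_sum += cumulative
        summaries[nm] = {"p_best": probs[0], "sucra": cumulative_sum / (k - 1)}
    return summaries


# Hypothetical posterior draws (mean, SD) for three unnamed agents.
rng = random.Random(1)
hypothetical = {"agent_A": (0.40, 0.06), "agent_B": (0.34, 0.05), "agent_C": (0.28, 0.06)}
posterior = {
    name: [rng.gauss(mean, sd) for _ in range(5000)]
    for name, (mean, sd) in hypothetical.items()
}

for name, summary in rank_summaries(posterior).items():
    print(f"{name}: P(best) = {summary['p_best']:.2f}, SUCRA = {summary['sucra']:.2f}")
```

A SUCRA of 1 means a treatment is ranked first in every draw and 0 means it is always last, so the measure uses the whole rank distribution rather than only the probability of being the best.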

An important limitation in the Kirsch et al. data set is that it includes aggregate data rather than individual patient data. It has been recently shown that inference on patient-level characteristics, such as initial severity, using meta-regression models and aggregated evidence can be problematic due to aggregation bias [69]. As clearly stated in Additional file 2 (simple meta-regression in Section 3), this method has low power to detect any relationship when the number of studies is small.

A more complex issue, which is beyond the scope of the current article, concerns the intrinsic problems in the methodology of RCTs [70]. These problems tend to reduce the effect size for a number of reasons, the most prominent being the quality of recruited patients and the problems with the quantification of psychiatric symptoms, including the psychometric properties of the scales used. Even the concept of ‘severity’ is not satisfactorily studied. For example, some items like ‘depressed mood’ manifest a ceiling effect as severity grows, while others like ‘suicidality’ manifest a floor effect as severity is reduced [71–81]. Both the HDRS and the MADRS describe a construct of depression which corresponds poorly to that defined by the DSM-IV and ICD-10, and they include items corresponding to non-specific symptoms (e.g. sleep, appetite, anxiety, which might respond to a variety of non-antidepressant agents) or even side effects (e.g. somatic symptoms) [77, 78, 82]. Also, it is obvious that the last observation carried forward method significantly contaminates efficacy with tolerability. However, no other results are usually available to analyse. Given also that in many RCTs agents like benzodiazepines are permitted in the placebo arm, the final score might not reflect the actual effect of the drug vs. placebo *per se* but rather the add-on value of antidepressants on benzodiazepines. The RCTs are necessary for the licensing of drugs as safe and effective by the FDA, the EMEA, the MHRA, etc., but their usefulness should not be overstated, and their data should not be overused. Maybe it is time for the raw data to be placed in the public domain, at least for products whose patent has expired. The way the lay press, and especially the way medical scientists writing for the lay press, portray antidepressants [83, 84] can only be considered a reflection of a new type of stigma for depressed patients.

The results of the current study also suggest there is no ‘year’ effect; however, the changing severity of patients recruited over the years might result in a change in the observed difference between placebo and active drug. This is largely in accord with the conclusions of Undurraga and Baldessarini [9].