Your Backtest Is Lying to You

This is not a strategy article. There is no indicator here, no signal, no setup. This is about the methodology behind our research series, the one that tested RSI across 26 million configurations, Turn of the Month across 385, VWAP across 5.8 million, and MACD across 14.3 million. We get questions about the statistics. People want to know how we determine whether something is real or noise.

This article explains the three methods we use in our published research, then looks at what professional quantitative researchers add on top of that.

If you have ever backtested a strategy and found that it "works," this article will explain why that probably means nothing, and what you need to do to find out whether it actually does.


Part I: Our methodology


1. The problem with raw backtests

Suppose you test a moving average crossover on SPY. You try fast periods from 5 to 50 and slow periods from 20 to 200. You try holding periods from 1 day to 60 days. You end up with a few thousand parameter combinations. You pick the one that looks best. It has a Sharpe ratio of 1.3. You are excited.

You should not be.

Figure 1 shows what happens when you test a strategy on pure random data. Data with no signal, no edge, no information whatsoever. At 10,000 tests, you get roughly 500 "significant" results at p < 0.05. At one million tests, you get 50,000. Every single one of them is a false positive. The data is noise, but the tests produce results that look real.

This is not a theoretical concern. This is what happens every time someone optimizes a strategy across hundreds of parameter combinations and picks the best one. The best result from a large search over noise will always look good. That is how probability works.
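The effect is easy to reproduce. The sketch below is our own illustration, not the code behind Figure 1: it runs 10,000 one-sample tests on zero-mean Gaussian noise and counts how many clear p < 0.05. The function name, the sample size of 100 observations per test, and the normal approximation to the t-distribution are all assumptions made to keep the example inside the Python standard library.

```python
import math
import random
from statistics import NormalDist


def count_false_positives(n_tests=10_000, n_obs=100, alpha=0.05, seed=42):
    """Run n_tests one-sample tests on pure noise and count p < alpha."""
    rng = random.Random(seed)
    normal = NormalDist()
    hits = 0
    for _ in range(n_tests):
        # Zero-mean noise: by construction there is no edge to find.
        sample = [rng.gauss(0.0, 0.01) for _ in range(n_obs)]
        mean = sum(sample) / n_obs
        var = sum((x - mean) ** 2 for x in sample) / (n_obs - 1)
        t_stat = mean / math.sqrt(var / n_obs)
        # Two-sided p-value via the normal approximation
        # (an assumption; adequate for n_obs around 100).
        p_value = 2.0 * (1.0 - normal.cdf(abs(t_stat)))
        if p_value < alpha:
            hits += 1
    return hits


false_positives = count_false_positives()
print(f"{false_positives} of 10,000 noise-only tests have p < 0.05")
```

The count should land near 5% of the tests. Every one of those hits is a false positive, because the data contains nothing to find.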

The three methods described below are what we apply to every indicator study we publish. They are not exotic. They are standard statistical practice. The fact that most retail analysis ignores them is the reason most retail analysis is worthless.


2. Baseline adjustment

Before any statistical test, the first question is what you are comparing against. Most backtests compare strategy returns against zero. A strategy that averages +0.03% per day in a market that averages +0.04% per day is not a winning strategy. It is a losing strategy disguised by market drift.

Figure 2 illustrates this. We compute edge as the mean return on signal days minus the mean return on all days for the same asset over the same period. This is the baseline-adjusted edge. It strips out the market's natural drift and asks: does the signal add anything beyond what you would get from simply being in the market?

Without this adjustment, any strategy that goes long in a bull market will appear to work. With it, you see whether the signal itself contributes information. In our MACD study, the unadjusted numbers looked modestly positive. After baseline adjustment, mean edges fell to +0.054 percentage points for longs and +0.018 for shorts, both below transaction costs.

This step is not optional. Every test result in every one of our studies uses baseline-adjusted edge. If you skip it, you are measuring market beta, not strategy alpha.
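As a minimal sketch of the definition above, baseline-adjusted edge is a one-line subtraction. The function name and the toy numbers are hypothetical and chosen only for illustration.

```python
def baseline_adjusted_edge(returns, signal):
    """Mean return on signal days minus mean return on all days."""
    if len(returns) != len(signal):
        raise ValueError("returns and signal must have the same length")
    signal_returns = [r for r, s in zip(returns, signal) if s]
    if not signal_returns:
        raise ValueError("no signal days in the sample")
    mean_signal = sum(signal_returns) / len(signal_returns)
    mean_all = sum(returns) / len(returns)
    return mean_signal - mean_all


# Toy data (hypothetical): four daily returns, signal active on two of them.
daily_returns = [0.01, -0.02, 0.03, 0.00]
signal_days = [True, False, True, False]
edge = baseline_adjusted_edge(daily_returns, signal_days)
print(f"baseline-adjusted edge: {edge:.4f}")
```

A positive value means signal days outperformed the asset's own average over the same period. A value near zero means the signal added nothing beyond drift.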


3. Welch's t-test

Once you have a baseline-adjusted edge, you need to ask whether that edge is statistically different from zero. This requires a hypothesis test. Most people who have heard of hypothesis testing think of the t-test. What most do not know is that there are different versions, and using the wrong one gives wrong results.

The standard Student's t-test assumes both groups have equal variance. In financial data, they almost never do. Signal days and non-signal days have different volatility. Sample sizes differ dramatically: you might have 200 signal days and 5,000 non-signal days. Under these conditions, particularly when the smaller group is the more volatile one, the Student's t-test can produce inflated t-statistics and false significance.

Welch's t-test drops the equal-variance assumption. It adjusts the degrees of freedom based on the actual variances and sample sizes of both groups. In the example in Figure 3, the Student's version reports |t| = 5.38 (highly significant) while Welch's reports |t| = 2.22 (barely significant). Same data, very different conclusions. The Student's result is wrong because it assumes something about the data that is not true.

We use Welch's t-test for every significance calculation across all four of our published studies. It has been the appropriate test for comparing groups with unequal variances since Welch published it in 1947. There is no good reason to use the Student's version on financial return data.
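To make the difference concrete, here is a sketch of both statistics using only the standard library. It is not the code behind Figure 3, and the two samples are invented: a small, volatile group and a large, quiet one. In practice a library routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` would also return the p-value.

```python
import math
from statistics import fmean, variance


def student_t(a, b):
    """Pooled-variance (Student) t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (fmean(a) - fmean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (fmean(a) - fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Invented data: 5 volatile "signal" returns vs 100 quiet "baseline" returns.
signal_returns = [0.05, -0.04, 0.06, -0.03, 0.04]
baseline_returns = [0.001, -0.001] * 50

t_student, df_student = student_t(signal_returns, baseline_returns)
t_welch, df_welch = welch_t(signal_returns, baseline_returns)
print(f"Student: t = {t_student:.2f}, df = {df_student}")
print(f"Welch:   t = {t_welch:.2f}, df = {df_welch:.1f}")
```

The pooled version lets the large, quiet group dominate the variance estimate, which shrinks the standard error and inflates t. Welch's version keeps the small group's own variance and cuts the degrees of freedom accordingly.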


4. Bonferroni correction

This is where most retail analysis stops and where our approach diverges from the standard.

When you test one strategy and it shows p < 0.05, that means a result at least this extreme would appear about 5% of the time if the data were pure noise. Acceptable odds. But when you test 10,000 strategies, you expect about 500 to show p < 0.05 from pure chance. The 5% threshold is no longer meaningful.

This is the multiple testing problem, and correcting for it is the single most important step that separates legitimate research from data mining.

Bonferroni correction is the simplest and strictest method. It divides the significance threshold by the number of tests. If you run roughly 14.3 million tests, the threshold becomes 0.05 / 14,310,400 ≈ 3.49 × 10^-9. A result must be so extreme that it would occur by chance only about once in 286 million random trials.
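The threshold arithmetic can be checked directly. The helper names below are our own, and the two p-values in the usage lines are made up to show one result on each side of the bar.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05, n_tests=None):
    """Flag p-values below alpha / n_tests.

    n_tests defaults to len(p_values); pass it explicitly when the list
    holds only a subset of all the tests that were actually run.
    """
    m = len(p_values) if n_tests is None else n_tests
    threshold = bonferroni_threshold(alpha, m)
    return [p < threshold for p in p_values]


threshold = bonferroni_threshold(0.05, 14_310_400)
print(f"threshold = {threshold:.3e}, about 1 in {1 / threshold:,.0f}")

# Hypothetical p-values: only the first clears the corrected bar.
flags = bonferroni_significant([1e-10, 1e-6], n_tests=14_310_400)
```

Note that the divisor is the total number of tests run, not the number of results you chose to look at afterwards.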

This is the correction we applied in all four of our published indicator studies. When our MACD study reports 3,235 Bonferroni-significant results from 14.3 million tests, those results are extremely unlikely to be noise. When our RSI study reports zero significant results from 26 million tests, that conclusion is equally solid.

The cost of Bonferroni is that it is conservative. It may reject some real effects along with the false ones. Figure 4 shows this tradeoff: on simulated data with 10,000 tests and 200 real effects, Bonferroni eliminates all false positives but only finds 95 of the 200 real effects. That is a deliberate choice. We prefer missing a real effect to reporting a false one.

Figure 5 shows what p-value distributions look like in practice. Under the null hypothesis (left panel), p-values are uniformly distributed. When real effects exist (right panel), there is a spike near zero. Looking at the shape of your p-value distribution tells you whether your test battery found anything real before you even look at individual results.
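A rough way to inspect that shape without plotting is to bin the p-values and compare the lowest bin against the flat count expected under the null. This is a sketch with hypothetical helper names and synthetic inputs, not the procedure used to draw Figure 5.

```python
def p_value_histogram(p_values, n_bins=10):
    """Count p-values in equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        # p == 1.0 would index past the end, so clamp to the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def excess_near_zero(p_values, n_bins=10):
    """Lowest-bin count minus the count expected if p-values were uniform."""
    counts = p_value_histogram(p_values, n_bins)
    expected_per_bin = len(p_values) / n_bins
    return counts[0] - expected_per_bin


# Synthetic inputs: an evenly spread set, then the same set plus a spike.
uniform_like = [(i + 0.5) / 1000 for i in range(1000)]
with_spike = uniform_like + [0.001] * 50

print(p_value_histogram(uniform_like))
print(excess_near_zero(with_spike))
```

An excess near zero in the first bin suggests there is nothing beyond noise. A large positive excess is the spike described above.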


Summary of our published methodology

That is the complete methodology behind RSI, Turn of the Month, VWAP, and MACD:

1. Baseline adjustment: compare signal returns against market average, not against zero
2. Welch's t-test: the correct test for groups with unequal variances and sample sizes
3. Bonferroni correction: adjust significance thresholds for the total number of tests

Three steps. No machine learning, no optimization, no curve fitting. The framework is deliberately simple. The power comes from scale (millions of configurations) and strictness (Bonferroni).

VWAP mean reversion survived all of it. Turn of the Month survived. RSI and MACD crossovers did not. The methodology is built to eliminate false positives, and as the Bonferroni section notes, that strictness can cost some false negatives. What survives is very unlikely to be noise.


Part II: Beyond our methods


The three methods above are sufficient for our published indicator studies, where the question is binary: does this indicator predict future returns, yes or no? But there is a deeper toolkit that professional quantitative researchers use, particularly when building portfolio strategies rather than testing individual indicators. These methods go further. We describe them here because we think retail traders should know they exist, and because understanding them changes how you evaluate any backtest result, including ours.


5. Benjamini-Hochberg (False Discovery Rate)

Bonferroni controls the probability of any false positive at all. That makes it the right choice when the cost of a false positive is high, like publishing a claim that RSI works when it does not. But in other situations, particularly when screening thousands of candidate signals to find a handful worth investigating further, Bonferroni is too strict. It throws away too many real effects.

Benjamini-Hochberg takes a different approach. Instead of controlling the probability of any false positive, it controls the expected proportion of false positives among the results you declare significant. At FDR 5%, if you call 100 results significant, roughly 5 of them are expected to be false. You accept a small, controlled error rate in exchange for finding more real effects.

Figure 4 illustrates the difference. Bonferroni finds 95 of 200 real effects with zero false positives. Benjamini-Hochberg finds 125 of 200 real effects with 11 false positives. Whether that tradeoff is worth it depends on the context. For screening, it usually is. For publishing a binary claim, it is not.
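The Benjamini-Hochberg step-up procedure is short enough to write out. The sketch below is a plain-Python version with a function name of our choosing, and the five p-values are invented. It does not reproduce the simulation behind Figure 4.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where BH declares significance.

    Sort the p-values, find the largest rank k with p_(k) <= (k / m) * fdr,
    then declare the k smallest p-values significant.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank  # keep the largest rank that passes
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Invented p-values, in no particular order.
p_values = [0.6, 0.039, 0.001, 0.026, 0.025]
bh_flags = benjamini_hochberg(p_values, fdr=0.05)
bonferroni_flags = [p < 0.05 / len(p_values) for p in p_values]
print("BH:        ", bh_flags)
print("Bonferroni:", bonferroni_flags)
```

On this toy input Bonferroni keeps one result and Benjamini-Hochberg keeps four. One of those four, 0.025, fails its own rank threshold but is carried in because a larger p-value passes. That step-up behaviour is where the extra power comes from.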

Reference: Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate. Journal of the Royal Statistical Society: Series B, 57(1), pp. 289-300.


6. Permutation testing

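Statistical tests like the t-test make assumptions about the data: independence, normality, stationarity. Financial data violates all of these to varying degrees. Permutation testing sidesteps most of these assumptions: it needs no distributional form for returns, only that the shuffled alternatives are a fair comparison for the real strategy.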

The idea is straightforward. You have a strategy with a Sharpe ratio of 0.95. The question is: could a strategy with random timing achieve the same Sharpe on the same market data?

Figure 6 shows the process. You generate thousands of random strategies: same exposure rate, same market, but random entry and exit timing. You compute the Sharpe for each. This gives you a null distribution: the distribution of Sharpe ratios achievable by luck alone in that specific market environment. Then you check where your observed Sharpe falls. If it sits above 99% of the random strategies, the p-value is 0.01. Your timing is adding something that random timing does not.

This is more honest than a t-test because it uses the actual market data rather than theoretical assumptions. If the market was trending, the null distribution shifts higher, and your strategy needs a higher Sharpe to be impressive. If the market was choppy, the bar is lower. The test automatically adjusts. Professional quant firms typically run 10,000 permutations per asset. A variant called the block bootstrap preserves the serial correlation structure of returns by resampling blocks of consecutive observations rather than individual periods.
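A minimal version of the random-timing test can be sketched as follows. It is a simplification under stated assumptions: positions are long-or-flat, the Sharpe ratio is per-period and not annualized, and the in-market mask is shuffled day by day, which preserves the exposure rate but not the length of individual holding periods. The function names and the synthetic data are ours.

```python
import math
import random


def sharpe(returns):
    """Per-period Sharpe ratio (mean / sample standard deviation)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) if var > 0 else 0.0


def permutation_p_value(returns, in_market, n_perm=1000, seed=0):
    """Share of random-timing strategies whose Sharpe matches or beats ours."""
    rng = random.Random(seed)
    observed = sharpe([r if s else 0.0 for r, s in zip(returns, in_market)])
    mask = list(in_market)
    as_good = 0
    for _ in range(n_perm):
        rng.shuffle(mask)  # same number of days in the market, random timing
        null = sharpe([r if s else 0.0 for r, s in zip(returns, mask)])
        if null >= observed:
            as_good += 1
    # The +1 terms count the observed strategy itself, so p is never zero.
    return (as_good + 1) / (n_perm + 1)


# Synthetic market and a deliberately unrealistic perfect-foresight signal.
data_rng = random.Random(1)
market = [data_rng.gauss(0.0003, 0.01) for _ in range(500)]
perfect_timing = [r > 0 for r in market]
p_perfect = permutation_p_value(market, perfect_timing, n_perm=500)
print(f"p-value with perfect foresight: {p_perfect:.4f}")
```

With n_perm permutations the smallest reportable p-value is 1 / (n_perm + 1), which is one practical reason to run thousands of them.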

Reference: White, H. (2000). A reality check for data snooping. Econometrica, 68(5), pp. 1097-1126.


7. Deflated Sharpe Ratio

The Deflated Sharpe Ratio, developed by Bailey and Lopez de Prado in 2014, directly answers the question that every backtester should ask but almost no one does: given how many strategy variations I tested, what is the probability that my best Sharpe ratio is real?

Figure 7 shows the damage. A Sharpe ratio of 1.0 looks solid. On 3 years of monthly data, after testing 100 variations, the probability it is genuine drops to around 60%. After 1,000 variations, below 30%. After 10,000, it is essentially zero. A Sharpe of 1.5 survives longer, but even that erodes past 5,000 trials. Only a Sharpe above 2.0 maintains confidence across large search spaces.

The DSR accounts for three things most backtests ignore: the number of trials, the non-normality of returns (skewness and kurtosis), and the sample length. It converts a reported Sharpe into a probability that the result is a genuine discovery rather than the expected best outcome from a random search.

This is why reporting a Sharpe ratio without disclosing how many configurations were tested is incomplete at best and misleading at worst. A Sharpe of 1.2 from a single hypothesis test is meaningful. The same Sharpe from a search over 5,000 combinations is probably noise.
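The calculation can be sketched in a few lines, following the formula in Bailey and Lopez de Prado (2014) as we read it. Several things here are assumptions: the function and parameter names are ours, Sharpe ratios are per-period rather than annualized, and `sharpe_variance` (the variance of Sharpe ratios across the trials you ran) is an input you have to estimate yourself. The numbers in the usage lines are illustrative and are not the ones behind Figure 7.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials, sharpe_variance):
    """Approximate expected best Sharpe from n_trials skill-less strategies."""
    if n_trials < 2:
        return 0.0  # a single trial involves no selection
    nd = NormalDist()
    return math.sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe_ratio(sharpe, n_obs, n_trials, sharpe_variance,
                          skew=0.0, kurtosis=3.0):
    """Probability-style score that the observed Sharpe beats the
    expected maximum from n_trials, given sample length and non-normality."""
    benchmark = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    z = (sharpe - benchmark) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)


# Illustrative inputs: monthly Sharpe 0.30 on 36 months of data.
dsr_few = deflated_sharpe_ratio(0.30, n_obs=36, n_trials=10,
                                sharpe_variance=0.05)
dsr_many = deflated_sharpe_ratio(0.30, n_obs=36, n_trials=1000,
                                 sharpe_variance=0.05)
print(f"10 trials: {dsr_few:.3f}   1,000 trials: {dsr_many:.3f}")
```

The same observed Sharpe scores lower as the number of trials rises, because the benchmark it has to beat is the best result you would expect from that many tries on noise.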

Reference: Bailey, D.H. and Lopez de Prado, M. (2014). The Deflated Sharpe Ratio. Journal of Portfolio Management, 40(5), pp. 94-107.


8. Combinatorial Purged Cross-Validation

The most rigorous backtest validation method in the current academic literature is Combinatorial Purged Cross-Validation (CPCV), also from Lopez de Prado. Standard backtesting splits data into in-sample and out-of-sample periods: train on the first 70%, test on the last 30%. The problem is that you only get one out-of-sample result. If it looks good, you do not know whether it would look good on a different split.

CPCV solves this by creating all possible combinations of in-sample and out-of-sample periods. With 10 data segments and 5 held out for testing, there are C(10,5) = 252 unique train/test combinations. Each one trains on half the data and tests on the other half, with an embargo period between train and test segments to prevent information leakage. The result is not one out-of-sample Sharpe but a distribution of 252 out-of-sample Sharpe ratios. Because the splits share data, those values are correlated rather than fully independent.

If 90% of those paths show positive Sharpe, the strategy is robust to the specific sequence of historical events. If only 55% do, the strategy is fragile and depends on which particular years fall in the training period. The purging step removes observations that are too close in time to the test set, preventing look-ahead contamination through autocorrelation.
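The split generation itself is straightforward to sketch. The version below is a simplification: it removes a fixed number of observations on both sides of every test segment, whereas full purging depends on how far each label looks ahead. The function name and parameters are hypothetical.

```python
from itertools import combinations
from math import comb


def combinatorial_splits(n_obs, n_groups=10, n_test_groups=5, embargo=0):
    """Yield (train_idx, test_idx) for every choice of test segments.

    Observations within `embargo` steps of a test segment are dropped
    from training (a simplified stand-in for purging plus embargo).
    """
    bounds = [(g * n_obs // n_groups, (g + 1) * n_obs // n_groups)
              for g in range(n_groups)]
    for test_groups in combinations(range(n_groups), n_test_groups):
        test_idx = [i for g in test_groups for i in range(*bounds[g])]
        blocked = set(test_idx)
        for g in test_groups:
            lo, hi = bounds[g]
            blocked.update(range(max(0, lo - embargo),
                                 min(n_obs, hi + embargo)))
        train_idx = [i for i in range(n_obs) if i not in blocked]
        yield train_idx, test_idx


splits = list(combinatorial_splits(n_obs=100, embargo=2))
print(f"{len(splits)} splits, expected {comb(10, 5)}")
```

Each split would then be used to fit on `train_idx` and score on `test_idx`, and the collection of out-of-sample scores forms the distribution discussed above.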

This method is computationally expensive and requires portfolio-level returns rather than individual signal tests, which is why it applies to strategy development rather than indicator studies. But it is the closest thing that exists to a definitive answer on whether a backtest is overfitted.

Reference: Bailey, D.H., Borwein, J.M., Lopez de Prado, M. and Zhu, Q.J. (2017). The probability of backtest overfitting. Journal of Computational Finance, 20(4).


The validation pyramid

Figure 8 summarizes the full landscape from raw backtests to professional validation. Most retail analysis lives at Level 0: no statistical testing at all. Our published research operates at Levels 1 through 2: Welch t-tests with Bonferroni correction. The methods in Part II of this article, permutation testing, the Deflated Sharpe Ratio, and CPCV, are Levels 3 through 5 and represent the domain of dedicated quantitative research teams.

The reason this matters is simple. Without these layers, you cannot distinguish a real edge from the expected output of a large random search. And in a world where anyone can run millions of backtests on a laptop in an afternoon, distinguishing signal from noise is the only thing that matters.


What this means for your trading

If you test one strategy on one asset with one parameter set and it shows significance on a Welch t-test with baseline adjustment, you have something worth investigating. If you test a thousand variations and pick the best one without correction, you have nothing.

The framework is not about being pessimistic. VWAP mean reversion survived. Turn of the Month survived. They survived because the effects are real, driven by identifiable market mechanisms: institutional execution for VWAP, payment cycle flows for Turn of the Month. RSI and MACD crossovers did not survive because the effects are not there.

The tools described here are available in standard scientific computing libraries. The concepts are published in peer-reviewed journals. What they require is discipline: the willingness to subject your best idea to a test that might kill it.


References

Welch, B.L. (1947). The generalization of Student's problem when several different population variances are involved. Biometrika, 34(1-2), pp. 28-35.

Bonferroni, C.E. (1936). Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8, pp. 3-62.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), pp. 289-300.

White, H. (2000). A reality check for data snooping. Econometrica, 68(5), pp. 1097-1126.

Hansen, P.R. (2005). A test for superior predictive ability. Journal of Business and Economic Statistics, 23(4), pp. 365-380.

Bailey, D.H. and Lopez de Prado, M. (2014). The Deflated Sharpe Ratio: correcting for selection bias, backtest overfitting, and non-normality. Journal of Portfolio Management, 40(5), pp. 94-107.

Bailey, D.H., Borwein, J.M., Lopez de Prado, M. and Zhu, Q.J. (2017). The probability of backtest overfitting. Journal of Computational Finance, 20(4).

Harvey, C.R., Liu, Y. and Zhu, H. (2016). ... and the cross-section of expected returns. Review of Financial Studies, 29(1), pp. 5-68.

Politis, D.N. and Romano, J.P. (1994). The stationary bootstrap. Journal of the American Statistical Association, 89(428), pp. 1303-1313.
