I agree, but I would restate it slightly: the takeaway is that modifications with likely small effects are difficult to test.
Larger, more effective mods have a much better chance of showing up above the normal run-to-run variability in a set of runs.
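
To put a rough picture on that, here is a small simulation sketch. Everything in it is an assumption made for illustration, not something measured: the function name `detection_rate`, Gaussian run-to-run noise, five runs per set, and a "difference larger than two standard errors" rule as the bar for saying a mod showed up. It estimates how often a mod with a given true effect would clear that bar.

```python
import random
import statistics


def detection_rate(effect, noise_sd, runs_per_set=5, trials=2000, seed=0):
    """Estimate how often a mod with true benefit `effect` stands out from noise.

    Hypothetical model: each run is the true value plus Gaussian noise with
    standard deviation `noise_sd`. A trial counts as a detection when the
    modified set's mean exceeds the baseline mean by more than two standard
    errors of the difference (a rough rule of thumb, not a formal test).
    """
    rng = random.Random(seed)  # fixed seed so repeated calls give the same result
    hits = 0
    for _ in range(trials):
        base = [rng.gauss(0.0, noise_sd) for _ in range(runs_per_set)]
        mod = [rng.gauss(effect, noise_sd) for _ in range(runs_per_set)]
        diff = statistics.mean(mod) - statistics.mean(base)
        # Standard error of the difference between the two set means.
        se = (statistics.variance(base) / runs_per_set
              + statistics.variance(mod) / runs_per_set) ** 0.5
        if diff > 2 * se:
            hits += 1
    return hits / trials


# Effects are expressed in the same units as the noise.
small = detection_rate(effect=0.2, noise_sd=1.0)  # effect well below the noise
large = detection_rate(effect=2.0, noise_sd=1.0)  # effect twice the noise
print(f"small mod detected in {small:.0%} of trials")
print(f"large mod detected in {large:.0%} of trials")
```

Under these assumptions the small mod should clear the bar far less often than the large one, and raising `runs_per_set` is the knob that helps the small mod most.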