It would be nice to see multiple runs for a few of the mods, to establish the resolution of your testing methodology. We need to know how much noise is included in the data.
Baseline 1: -0.87%
Baseline 2: +0.89%
Baseline 3: -0.02%
If all of these baselines represent the same configuration, they span about 1.76 percentage points (-0.87% to +0.89%). Unfortunately, most of your other trends fit within that "noise", so no conclusion can be drawn about whether those mods work.
See "Statistical significance" on Wikipedia.
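To make the noise argument concrete, here is a rough Python sketch that estimates the noise band from the three baseline figures quoted above. The function names and the 2-standard-deviation cutoff are my own assumptions for illustration, not part of your methodology, and with only three runs the estimate is crude at best.

```python
import statistics

# Baseline results quoted in this post, in percent.
baselines = [-0.87, 0.89, -0.02]


def noise_band(samples):
    """Return (mean, sample standard deviation, spread) of repeated runs."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)        # sample (n-1) standard deviation
    spread = max(samples) - min(samples)     # highest run minus lowest run
    return mean, stdev, spread


def exceeds_noise(result, samples, k=2.0):
    """Assumed rule of thumb: a result counts only if it lies more than
    k sample standard deviations away from the baseline mean."""
    mean, stdev, _ = noise_band(samples)
    return abs(result - mean) > k * stdev


mean, stdev, spread = noise_band(baselines)
print(f"mean={mean:+.2f}%  stdev={stdev:.2f}%  spread={spread:.2f}%")
```

Under that assumed cutoff, a hypothetical mod showing a 0.5% change would not stand out from the baselines (`exceeds_noise(0.5, baselines)` is `False`). More repeat runs would tighten the estimate and make smaller effects detectable.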
I know that additional test runs cut into your budget of testing time, which means you can't test as many mods. But it's better to be sure of a few mods than to have a hunch (or worse: bad data!) about a large number of mods.