I’ve learned a few things. As some of you may know, I’ve been measuring my balance by standing on a board that rests on a tiny platform (a pipe plug) — pictures here. Now and then the board would slip off the platform. I supposed this was a failure of balance, but I wasn’t sure, especially when it happened as soon as I stood on it. So I got another board, into which my brother-in-law kindly drilled the perfect-size hole so that the plug will never slip:
To see if this made a difference I did an experiment with a design I had never used before but really like: ABABABAB… (one day per condition). In other words, Monday I tested my balance with the old board, Tuesday with the new board, Wednesday with the old board, Thursday with the new board, and so on. Simple, efficient, well-balanced. Here are the results:
The red line is fit to the red points, the blue line to the blue points. The two lines are constrained to have the same slope.
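To make that model concrete, here is a minimal sketch of that kind of fit (in Python; the data and the column names day, board, and score are made up for illustration, not my actual analysis): one intercept per board, one shared slope.

```python
# Minimal sketch: two lines with separate intercepts but a common slope.
# The data frame and its column names (day, board, score) are invented here.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
days = np.arange(20)                          # ABAB...: boards alternate by day
df = pd.DataFrame({
    "day": days,
    "board": np.where(days % 2 == 0, "old", "new"),
    "score": 10 + 0.05 * days + rng.normal(0, 0.3, days.size),
})

# C(board) gives each board its own intercept; the single "day" term is the shared slope.
fit = smf.ols("score ~ day + C(board)", data=df).fit()
print(fit.params)   # Intercept, C(board)[T.old], day
```

The C(board) coefficient is then the estimated gap between the two lines, which is the old-board versus new-board difference the plot shows.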
Well, that’s clear. I expected my balance to be better with the new board, actually.
Speaking of the unexpected, I made another measurement improvement that truly surprises me — the surprise is that I never did it before. When I looked at my early balance data (the first 10 or so days), I saw that my balance improved for the first 5 trials and was roughly constant after that. Each session was 20 trials, so I dropped (excluded) the first 5 trials from my analyses — considering them “warm-up” trials. I took the mean of the last 15 trials. That seemed very reasonable and I thought nothing of it.
Recently I asked again how performance changes over a session. The answer was a bit different: I found that performance improved for the first 10 trials. Now there are 30 trials in a session, so dropping the first 10 of them seemed okay. And that’s what I did.
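For concreteness, the exclusion approach is nothing more than this (a sketch with made-up column names, not my actual code):

```python
# Sketch of the drop-the-warm-up approach: discard the first k trials of each
# session and average the rest. Column names (session, trial, score) are invented.
import pandas as pd

def session_means_dropping_warmup(df, k=10):
    kept = df[df["trial"] > k]                       # trials numbered 1..30 within a session
    return kept.groupby("session")["score"].mean()   # one mean per session
```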
But then I looked at how variability changed over a session. I expected the earliest trials to be more variable than the rest, but the data didn’t show that. Variability was pretty constant from the first trials to the last. Hmm. Maybe I am losing valuable information by not including those early trials in my averages. It occurred to me: why not allow for the warm-up effect by modelling it, rather than by excluding it? (Modelling it means estimating it and then subtracting it.) I did that, and then I looked at the size of the standard errors of the means (standard errors based on the residuals from the fit) for the most recent 40 days — essentially, the error in measurement. Here is what I found. Median standard errors:
First 10 trials (out of 30) excluded: 0.073
First 5 trials excluded: 0.064
First trial excluded: 0.061
No trials excluded: 0.059
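Here is a minimal sketch of the modelling alternative (again with made-up column names, not my actual code): estimate the average warm-up profile across sessions, subtract it from every trial, and compute each session’s mean and its standard error from what remains.

```python
# Sketch of modelling the warm-up effect instead of excluding trials.
# Assumed columns: session, trial (1..30 within a session), score.
import numpy as np
import pandas as pd

def session_means_with_warmup_removed(df):
    # Warm-up profile: for each trial position, the average deviation of that
    # trial from its session mean, pooled over all sessions.
    dev = df["score"] - df.groupby("session")["score"].transform("mean")
    warmup = dev.groupby(df["trial"]).transform("mean")

    adjusted = df["score"] - warmup                   # warm-up effect subtracted out
    out = adjusted.groupby(df["session"]).agg(["mean", "std", "count"])
    out["se"] = out["std"] / np.sqrt(out["count"])    # standard error of each session mean
    return out[["mean", "se"]]
```

Because every trial now contributes, each session mean is based on 30 values instead of 20, which is presumably where the smaller standard errors come from.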
My eyes opened wide when I saw these numbers. Oh my god! I was throwing away so much! A reduction in error from 0.073 to 0.059 — that’s about 19% better.
Fascinating, both the consistency of the results with/without slippage, and the standard error reduction.
Thanks, Tim. It was your arithmetic results that led me to the standard error reduction. They led me to do a very similar task, as you know. I didn’t want to place constraints on the 100 simple arithmetic problems I did each session (it was just too complicated — too many things that might be important), so instead I did my best to equate different days by modelling — by estimating and removing the effects of this and that. That gave me the idea of doing it here, too.