
[Q] Validity of a Smoothing Technique

Hey Reddit! I’ve been working with some NFL / NBA data for betting purposes, but I kept encountering the same problem in a few different contexts: getting as granular as I wanted massively reduced the sample size at each data point, introducing a lot of noise / variance. I came up with a smoothing technique that seems logical and seems to tell a much more coherent story from the data, but I’m always wary of creating my own methodology, so I wanted to see what smarter statistical minds had to say. Here’s one example:

The question:

If the market projection for an NBA player is 7.5 rebounds, but there’s a straggling 8.5 somewhere, how much juice is it worth paying to play under 8.5? Put differently, if a player has a 50% chance to go under 7.5 rebounds, what are the chances they go under 8.5 rebounds, i.e. what are the chances they land on exactly 8 rebounds?
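For concreteness, here’s the arithmetic I’m describing; the P(exactly 8) value below is just a placeholder, since estimating it is the whole question:

```python
# Rebounds are whole numbers, so under 8.5 = under 7.5 plus landing on exactly 8.
p_under_7_5 = 0.50   # implied by the 7.5 market projection
p_exactly_8 = 0.12   # placeholder -- this is the number I'm trying to estimate

p_under_8_5 = p_under_7_5 + p_exactly_8

def fair_american_odds(p):
    """Convert a win probability to a no-vig American price."""
    return -100 * p / (1 - p) if p > 0.5 else 100 * (1 - p) / p

print(f"P(under 8.5) = {p_under_8_5:.2f}")
print(f"Fair price for under 8.5: {fair_american_odds(p_under_8_5):+.0f}")
```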

I gathered the 2022 data for the 31st – 40th rebounders. Here’s some of the data for Al Horford, as an example:

Mean = 7.7, Median = 7, SD = 2.9, Total Games = 69

4 rebounds – 9 games

5 rebounds – 7 games

6 rebounds – 7 games

7 rebounds – 11 games

8 rebounds – 7 games

9 rebounds – 5 games

10 rebounds – 9 games

11 rebounds – 7 games

While we could reasonably expect the data to be left- or right-skewed, there’s no reason to expect that Horford – or any other player who plays consistent minutes – should land on a rebound total further from his median more often than on one closer to it. It seems patently ridiculous to estimate that Horford should land on exactly 8 rebounds and exactly 11 rebounds 10.1% of the time each, given that 8 is already above both his mean and his median. My assumption is that the sample size at each individual total simply isn’t large enough to converge on the true probability; indeed, every player I’ve looked at has some obvious outliers.
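For reference, here are the raw percentages from the counts above (the listed games sum to 62, so the remaining 7 of the 69 games fall on totals I haven’t listed):

```python
# Raw empirical percentages from the listed counts.
counts = {4: 9, 5: 7, 6: 7, 7: 11, 8: 7, 9: 5, 10: 9, 11: 7}
total_games = 69

for rebounds, games in counts.items():
    print(f"{rebounds:>2} rebounds: {games / total_games:.1%}")
# 8 and 11 both come out to 7/69, i.e. about 10.1% each.
```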

In an attempt to solve this problem, I came up with the following smoothing technique, which I’d like your feedback on:

To estimate the probability of landing on 6 rebounds, I added the number of games Horford landed on 4, 5, 6, 7, or 8 rebounds and divided by 5. To estimate the probability of landing on 7 rebounds, I added the number of games he landed on 5, 6, 7, 8, or 9 rebounds and divided by 5. And so on. Are there any flaws with this approach? I tried the same process on a random data set in a different context and confirmed that, at least in that one case, it didn’t artificially increase the correlation. **If all you had was Al Horford’s 2022 data and you wanted to make your best guess at his true likelihood of getting exactly 8 rebounds, would you prefer to use the actual %, the smoothed %, or something else?** I also used Excel’s NORM.S.DIST function, so a normal approximation could be another option.
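In code, the smoothing (and one way to use a normal curve with the mean/SD above – applying NORM.S.DIST to standardized cutoffs gives the same CDF values) looks roughly like this:

```python
from statistics import NormalDist

counts = {4: 9, 5: 7, 6: 7, 7: 11, 8: 7, 9: 5, 10: 9, 11: 7}
total_games = 69

def smoothed_pct(k, window=2):
    """Average the game counts at k-2 .. k+2 (totals not listed are treated
    as 0 games), then divide by total games."""
    neighborhood = [counts.get(k + d, 0) for d in range(-window, window + 1)]
    return (sum(neighborhood) / len(neighborhood)) / total_games

def normal_pct(k, mean=7.7, sd=2.9):
    """P(k - 0.5 < X < k + 0.5) under a normal fit -- a continuity-corrected
    version of the NORM.S.DIST idea."""
    dist = NormalDist(mean, sd)
    return dist.cdf(k + 0.5) - dist.cdf(k - 0.5)

for k in range(6, 10):  # stay inside the listed totals so the window is complete
    actual = counts[k] / total_games
    print(f"{k} rebounds: actual {actual:.1%}, "
          f"smoothed {smoothed_pct(k):.1%}, normal {normal_pct(k):.1%}")
```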

Personally, I think the smoothed % provides a better projection than the actual %. However, I averaged the data together for the 31st through 40th rebounders, and at that point the sample size at each total is much larger, and the overall curve of the actual % of those 10 guys landing on a certain number is pretty smooth. I’m inclined to think that with 10x the sample size, the actual % provides a better general projection than the smoothed %. That said, for rebounders 1 through 10, there’s still one big outlier right around the median, which is the number I’m most concerned with because that’s the number those players will likely be projected around. So, same question as before: with the data from 10 players combined, do you think the actual %, the smoothed %, or something else would provide the most accurate projection?

I realize this might be hard to answer without more context / the full data set. I’m happy to answer any questions or provide the Excel doc, if there’s a way to do that through Reddit. Mostly, I want to make sure there isn’t some glaring problem with the smoothing technique, because it’s proved illuminating in a number of different contexts. Thanks so much in advance for any insight!
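To show what I mean by combining the 10 players and then smoothing, here’s a rough sketch that averages each player’s actual % at every rebound total and applies the same 5-point smoothing to the combined curve (only Horford’s counts are real; the rest are placeholders):

```python
from collections import Counter

# One Counter of {rebound total: games} per player; only Horford's is real here.
players = {
    "Horford": Counter({4: 9, 5: 7, 6: 7, 7: 11, 8: 7, 9: 5, 10: 9, 11: 7}),
    # ... the other nine rebounders' counts would go here
}

def pct_curve(counts):
    """Each total's share of that player's games (with the full data this
    denominator would be the player's total games)."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

curves = [pct_curve(c) for c in players.values()]
all_totals = sorted({k for curve in curves for k in curve})

# Average the actual % across players at each total (0% where a player never landed there).
combined = {k: sum(curve.get(k, 0.0) for curve in curves) / len(curves)
            for k in all_totals}

def smoothed(k, window=2):
    """Same 5-point smoothing as before, applied to the combined curve."""
    vals = [combined.get(k + d, 0.0) for d in range(-window, window + 1)]
    return sum(vals) / len(vals)

for k in all_totals:
    print(f"{k:>2} rebounds: actual {combined[k]:.1%}, smoothed {smoothed(k):.1%}")
```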

Daniel