[Q] Should I take additional classes in mathematical analysis or in numerical analysis?

For my final year of undergrad, I’m debating between taking a second sequence in mathematical analysis and a graduate sequence in numerical analysis. Any suggestions on which I should choose?

I have already completed a two-course sequence on real analysis. Hence, I have a strong understanding of real analysis, but not at the level of Rudin, which I hear is extremely helpful for graduate coursework in statistics.

On the other hand, I have also taken an undergrad course in numerical analysis. The first course in the graduate numerical analysis sequence seems to have a lot of overlap with the undergrad course I took, although it goes into a bit more detail. The second course in the sequence explores completely new topics and is much more advanced. This material would be very helpful in applied statistics, machine learning, and data science.

While it is possible for me to take both sequences in my final year, that would be *a lot* of work: the mathematical analysis class often demands 20+ hours per week, and the numerical analysis class is at the graduate level, so it moves really fast and has lots of assignments. Hence, I want to avoid taking them both together if possible. One possibility is that I take the first course from the mathematical analysis sequence in the fall and then skip to the second course of the numerical analysis sequence in the spring, since the first graduate numerical analysis course overlaps heavily with the one I already took in undergrad. That would inevitably leave some gaps in my knowledge, though, and I would also miss the second half of Rudin’s analysis, which covers very valuable topics.

Any thoughts on this? I would appreciate any input.

P.S. I’m a math/stat and CS major aiming for a PhD in statistics focusing on applications.


[Q] Validity of a Smoothing Technique

Hey Reddit! I’ve been working with some NFL/NBA data for betting purposes, but I kept encountering the same problem in a few different contexts: getting as granular as I wanted massively reduced the sample size at each point, introducing a lot of noise/variance. I came up with a smoothing technique that seems logical and seems to tell a much more coherent story from the data, but I’m always wary of creating my own methodology, so I wanted to see what smarter statistical minds had to say. Here’s one example:

The question:

If the market projection for an NBA player is 7.5 rebounds, but there’s a straggling 8.5 somewhere, how much juice is it worth paying to play under 8.5? Put differently, if a player has a 50% chance to go under 7.5 rebounds, what are the chances they go under 8.5 rebounds, i.e. what are the chances they land on exactly 8 rebounds?

I gathered the 2022 data for the 31st – 40th rebounders. Here’s some of the data for Al Horford, as an example:

Mean = 7.7, Median = 7, SD = 2.9, Total Games = 69

4 rebounds – 9 games

5 rebounds – 7 games

6 rebounds – 7 games

7 rebounds – 11 games

8 rebounds – 7 games

9 rebounds – 5 games

10 rebounds – 9 games

11 rebounds – 7 games

While we could reasonably expect the data to be left- or right-skewed, there’s no reason to expect that Horford – or any other player who plays consistent minutes – should land on a rebound total further from his median more often than on one closer to it. It seems patently ridiculous to estimate that Horford should land on 8 rebounds and on 11 rebounds 10.1% of the time each, given that 8 is already above both his mean and median. My assumption is that the sample size simply isn’t large enough at each data point to coalesce on the true probability; indeed, every player I’ve looked at has some obvious outliers.

In an attempt to solve this problem, I came up with the following smoothing technique, which I’d like your feedback on:

To estimate the probability of landing on 6 rebounds, I added the number of games Horford landed on 4, 5, 6, 7, or 8 rebounds, and divided by 5. To estimate the probability of landing on 7 rebounds, I added the number of games Horford landed on 5, 6, 7, 8, or 9 rebounds, and divided by 5. And so on and so forth. Are there any flaws with this approach? I tried this process on a random set of data in a different context and confirmed that, at least in that one case, it didn’t artificially increase the correlation. **If all you had was Al Horford’s 2022 data and you wanted to make your best guess as to his true likelihood of getting exactly 8 rebounds, would you prefer to use the actual %, the smoothed %, or something else?** I also used Excel’s NORM.S.DIST function, so that could be another option.
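In code, the smoothing is just a centered five-point moving average over the count histogram. Here’s a minimal Python sketch using the counts above (games outside the 4–11 range are treated as zero in the window, which is an assumption on my part):

```python
# Al Horford's 2022 rebound counts from the post (some of his 69 games
# fall outside this range and are omitted here).
games = {4: 9, 5: 7, 6: 7, 7: 11, 8: 7, 9: 5, 10: 9, 11: 7}
total_games = 69

def smoothed_probability(counts, k, total, window=2):
    """Average the counts at k-window .. k+window, then divide by total games.
    Missing bins count as zero games (an edge-effect assumption)."""
    neighbours = [counts.get(k + d, 0) for d in range(-window, window + 1)]
    return (sum(neighbours) / len(neighbours)) / total

raw_p8 = games[8] / total_games                      # actual % for exactly 8
smooth_p8 = smoothed_probability(games, 8, total_games)
print(f"raw: {raw_p8:.3f}, smoothed: {smooth_p8:.3f}")
```

This makes the edge behaviour explicit: near the tails of a player’s distribution, the window runs into empty bins, so the smoothed estimate is pulled toward zero there.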

Personally, I think the smoothed % provides a better projection than the actual %. However, I averaged the data together for the 31st through 40th rebounders, and at this point the sample size is much larger at each point, and the overall curve for the actual % of those 10 guys landing on a certain number is pretty smooth. I’m inclined to think that with 10x the sample size, the actual % provides a better general projection than the smoothed %. That said, for rebounders 1 through 10, there’s still one big outlier right around the median, which is the number I’m most concerned with, because that’s the number those players will likely be projected around. So, same question as before – with the data from 10 players combined, do you think the actual %, the smoothed %, or something else would provide the most accurate projection? I realize this might be hard to answer without more context / the full data set. I’m happy to answer any questions or provide the Excel doc, if there’s a way to do that through Reddit. Mostly, I want to make sure there’s not some glaring problem with that smoothing technique, because it’s proved illuminating in a number of different contexts. Thanks so much in advance for any insight!



[Q] Logistic Regression and Collinearity in League of Legends


I’ve been doing some research on using statistics for win/loss predictive analysis in League of Legends, mainly for an individual player whom I collected quite a bit of data on – 22 variables’ worth. Since there’s such a large amount of information, many of these variables have relatively minimal predictive power, and some are collinear with each other.

When I first created logistic models for this purpose, I used AIC and VIF to eliminate non-significant and collinear variables. This left me with a 6-variable logistic model, which performed pretty well on my test data but left out some highly important variables – in particular, two variables that have very strong predictive power but are collinear with each other. Even though they are collinear, they both independently impact the game, since each provides more economy to the team that gets more of it.

However, when I include these variables in my model, almost every other predictor becomes non-significant by p-value. This is a problem because all of the other variables do impact the game to some degree. They are things like a player’s number of deaths, kills, and assists, and how many objectives they took; all of these variables have decent predictive power when run in univariate logistic regression for win/loss prediction. None of them is individually collinear with either of the powerful collinear variables mentioned earlier.

(I also fit linear models between the two variables, which showed a strong linear relationship, so I don’t know why this isn’t showing up in the VIF.)

**Tl;dr** I am trying to create logistic models for win/loss prediction, but I don’t know how to address collinearity and the loss of significance of previously significant predictors when the collinear variables are included.



[Q] Survey on comedy in the classroom – support on sample sizes needed

Years ago, I worked in survey design and implementation. After many years away and doing other types of research, I now have another opportunity to pursue a new project. However, I’m getting increasingly nervous as I look into the nuts and bolts of making it happen.

I’ve always had a statistician help run my numbers and analysis in the past, which is no longer an option. Now, as I start my planning, I’m starting to worry. I hope that I can describe where I’m starting from and, based on your experience, get some advice about what things I should look out for, the potential types of analysis I should run, or any similarly modelled projects I can look to for inspiration.

Here’s the breakdown:

1) My research investigates whether comedic examples aid in students’ ability to understand and reflect on experiences out in the community.

–I have a list of students who have agreed to further contact for research projects.

–These students have taken a particular type of experiential course at the university, so they are not representative of the whole student body.

–Of the approximately 1,500 students who took this opportunity, approximately 550 have agreed to receive follow-up surveys.

–My question here is: how do I conceptualize, calculate, and articulate a representative sample?

2) I plan to compare the means of two different groups, and I am struggling to understand an appropriate population standard deviation.

–The survey will be approximately 20 questions long. Respondents will be randomly split into two groups. Each group will see a different video clip but will be asked identical questions after the clip. I want to compare the results of each question to see if there are significant differences.

–What sample size does each group need to make these comparisons viable? I’m okay with a 90% confidence level.

–Does it then make the most sense to run an independent-groups t-test for those differences?
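Here is a back-of-the-envelope calculation I’ve been playing with for the per-group sample size, using the standard normal-approximation formula for a two-sided two-sample comparison of means. The standardized effect size is a placeholder – in practice it would come from a pilot study or prior literature, which gets around needing the population SD directly:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.10, power=0.80):
    """Approximate n per group for a two-sided two-sample comparison of means,
    via the normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d = (expected mean difference) / (assumed common SD)."""
    z = NormalDist()
    return ceil(2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
                / effect_size ** 2)

# e.g. to detect a "medium" standardized difference of 0.5 at the 90% level
# with 80% power:
print(n_per_group(0.5))  # → 50 per group
```

The normal approximation slightly understates the t-test requirement for small n, but it gives a useful planning number.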

While there is much more to consider, these two things are my initial stumbling blocks. Thanks in advance for any help or insight. It’s much appreciated.


[Q] Deciding the best regression analysis for a dataset

Hi everyone. I have a dataset and wanted some feedback on the proper analysis. It’s a repeated-measures design with 5 observations per subject on two continuous variables – for example, a score on a test and a self-report Likert rating of anxiety, both collected at each observation. I want to see if a continuous trait variable predicts the strength of the correlation between the test score and reported anxiety.

The following link is close to what I want to do, but instead of two groups, it’s a continuous variable. I’m basically trying to run a regression between a trait score and a bunch of correlation coefficients. Can anyone offer any guidance here?


[Q] Comparing Variance of Random Slopes Across Categorical Fixed Effect

When fitting a random slopes and intercepts model, what is an effective way to test whether the variance of the slopes is different at different levels of the fixed effect?

For example, imagine a dataset with a continuous response variable ‘y’, a three-level fixed-effect variable ‘x’, and a single random effect ‘subject’. How would you test whether the subjects’ x coefficients had higher variance at some ‘x’ levels than at others? Some sort of bootstrapping procedure?
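To make the resampling idea concrete, here’s a rough permutation-test sketch on made-up per-subject slope estimates at two levels (this treats the estimated slopes as data and ignores their estimation uncertainty, so it’s only illustrative – a parametric bootstrap that refits the mixed model each replicate would be more defensible):

```python
import random
import statistics

random.seed(2)

# Hypothetical per-subject slope estimates (e.g. BLUPs or per-subject fits)
# at two levels of the fixed effect; level b has more spread by construction.
slopes_a = [random.gauss(1.0, 0.5) for _ in range(30)]
slopes_b = [random.gauss(1.0, 1.5) for _ in range(30)]

observed = statistics.variance(slopes_b) / statistics.variance(slopes_a)

# Permutation test: centre each group at its mean, then shuffle level labels
# under H0 that the centred slopes have equal spread at both levels.
mean_a, mean_b = statistics.mean(slopes_a), statistics.mean(slopes_b)
pooled = [s - mean_a for s in slopes_a] + [s - mean_b for s in slopes_b]
n = len(slopes_a)

reps, extreme = 2000, 0
for _ in range(reps):
    random.shuffle(pooled)
    ratio = statistics.variance(pooled[n:]) / statistics.variance(pooled[:n])
    if max(ratio, 1 / ratio) >= max(observed, 1 / observed):
        extreme += 1
p_value = extreme / reps
print(observed, p_value)
```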

A follow-up question: imagine a scenario where the subject-level slopes segregate into two or more clusters at one level of the fixed effect. What would be the best way to model this?

See this picture for one such scenario.


[Q] Is it appropriate to use Cronbach’s alpha when I have more than one variable in a survey?

Hello! I’m planning to use Cronbach’s alpha on a survey’s results, but want to know if it’s possible to apply it in this case.

The survey has 25 questions. There are 5 questions for each variable being measured.

Would I have to apply Cronbach’s alpha to the 25 questions as a whole, or should I apply it separately for each variable?

From what I understand, Cronbach’s alpha measures the correlation between the results, so if two questions have similar results, Cronbach’s alpha will be higher. However, one cannot necessarily expect the results of different variables to be similar. In other words, questions from different variables will be measuring different things, resulting in an unacceptably low Cronbach’s alpha.

Extra info if it matters: I’m using a Likert scale. There are 5 variables; one of them is the dependent variable in a multiple linear regression model, and the other 4 are the independent variables.
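For what it’s worth, here’s the textbook alpha computation on toy data for one variable’s items (made-up Likert responses, and three items rather than five just to keep it short):

```python
import statistics

def cronbach_alpha(items):
    """items: one list of scores per question, same respondents in each.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = len(items)
    item_vars = sum(statistics.variance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent totals
    return k / (k - 1) * (1 - item_vars / statistics.variance(totals))

# Toy subscale: 3 items answered by 5 respondents on a 1-5 Likert scale.
subscale = [
    [4, 2, 5, 3, 1],
    [5, 3, 4, 3, 2],
    [4, 2, 5, 2, 1],
]
print(round(cronbach_alpha(subscale), 2))  # → 0.95
```

Since alpha is driven by how strongly the items co-vary, running it per variable (per construct) answers "do these 5 questions measure the same thing", while running it on all 25 would mix constructs together – which matches the intuition in the question.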



[Q] Is a multinomial model appropriate for these data?

I’m trying to understand if multinomial models are appropriate for my analysis. To give a simplified example, let’s say we’re interested in the choice of someone randomly picking coloured balls out of a hat. If there are 10 green, 10 red and 10 orange balls, then the baseline expected probability of picking any colour is equal.

But if there are unequal numbers of coloured balls, e.g. 25 green, 3 red and 2 orange, this is going to affect the probability of picking each colour. You’re more likely to pick a green, simply because there are far more of them.
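Using the counts from that example (25 green, 3 red, 2 orange), the availability-only baseline and a quick simulation of purely random picking would look like this sketch:

```python
import random
from collections import Counter

random.seed(3)

# One bag: availability alone sets the baseline choice probabilities.
bag = {"green": 25, "red": 3, "orange": 2}
total = sum(bag.values())
baseline = {colour: n / total for colour, n in bag.items()}

# Simulate 10,000 purely random picks (with replacement) from this bag.
colours = list(bag)
weights = [bag[c] for c in colours]
picks = Counter(random.choices(colours, weights=weights, k=10_000))

for colour in colours:
    print(colour, baseline[colour], picks[colour] / 10_000)
```

In a multinomial logit, this availability baseline is typically handled as an offset (log of the count), so the model estimates preference over and above what availability alone predicts.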

To add complexity, we repeat this choice exercise for multiple different bags, all with varying numbers of coloured balls. So there would presumably need to be some sort of random term.

Can multinomial models account for:

* unequal numbers of coloured balls in a bag
* multiple bags, all with different number of coloured balls in them

I’ve been looking to analyse my data in R with the nnet and mlogit packages, but I’m not sure if multinomial models are actually appropriate or not.

Thanks for the help!


[Q] Suggested methodology to generate simulated data from an existing dataset

I’m working on a project where I have a real dataset with a complex underlying structure. As part of my work, I am running a simulation study to test the validity of different methods that exist in the literature for working with this dataset. The goal is essentially to see how each method performs if the assumptions underlying a different method are the actual true assumptions.

To do this, I would like to generate simulated datasets that have the same underlying structure as the original dataset. I have thought of 2 possible approaches:

1. Add a small amount of random noise to the covariates for each simulation run, to generate datasets with the same general underlying structure that are not identical
2. Do some kind of bootstrap sampling to generate simulated datasets with the same underlying structure

Does anyone have experience using either of these approaches, or can you point me to papers outlining valid approaches for this kind of challenge?


[E] Video series on fluid simulation relying heavily on statistics – all self-coded

Hey there,

The next part of my video series on fluid simulation is available.

Topics covered: rarefied gas dynamics, continuum gas dynamics, fluid motion descriptions & coordinates (space-fixed, material-fixed, arbitrary), reducibility aspects, motivation for modeling unresolved flow structures, ensemble averages of microscopically and macroscopically varying data, usefulness of the modeling hierarchy, simplifying and decoupling the evolution equations, the Navier–Stokes equations, compressible flow and the incompressible-flow assumption, and buoyancy-driven flow.

I hope you like it!

Have fun!