[Q] Logistic Regression and Collinearity in League of Legends


I’ve been doing some research on using statistics for win/loss prediction in League of Legends, mainly for an individual player on whom I collected quite a bit of data: 22 variables’ worth. With that much information, many of these variables have relatively little predictive power, and some are collinear with each other.

When I first created logistic models for this purpose, I used AIC and VIF to eliminate non-significant and collinear variables. This left me with a 6-variable logistic model which performed pretty well on my test data but left out some highly important variables, in particular two variables which have very strong predictive power but are collinear with each other. Even though they are collinear, each of them impacts the game on its own, since both provide more economy to the team that gets more of them.
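For anyone wanting to check the collinearity between two such variables by hand: with exactly two predictors, the VIF reduces to 1 / (1 - r^2), where r is their Pearson correlation. Below is a minimal sketch of that special case in plain Python. The variable names and numbers (`econ_a`, `econ_b`) are made up for illustration and are not from the actual dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x, y):
    """VIF when the model contains only these two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)

# Hypothetical per-game values for two economy-related variables.
econ_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
econ_b = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

vif = vif_two_predictors(econ_a, econ_b)
print(f"r = {pearson_r(econ_a, econ_b):.3f}, VIF = {vif:.1f}")
```

With more than two predictors, each VIF comes from regressing that predictor on all the others, so a variable can have a high VIF without being strongly correlated with any single other variable.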

However, when I include these variables in my model, almost every other predictor becomes non-significant according to its p-value. This is a problem because all of the other variables do impact the game to some degree. They are things like the number of deaths a player has, number of kills, number of assists, and how many objectives they took, and all of these variables have decent predictive power when run in a single-predictor logistic regression for win/loss prediction. None of them is individually collinear with either of the powerful collinear variables mentioned earlier.

(... and linear models, which showed a strong linear relationship, so I don’t know why this isn’t showing up in the VIF.)

**Tl;dr** I am trying to create logistic models for win/loss prediction but don’t know how to address collinearity, or the fact that previously significant predictors become non-significant when the collinear variables are included.



[Q] Survey on comedy in the Classroom – support on statistical sizes needed

Years ago, I worked in survey design and implementation. After many years away and doing other types of research, I now have another opportunity to pursue a new project. However, I’m getting increasingly nervous as I look into the nuts and bolts of making it happen.

I’ve always had a statistician help run my numbers and analysis in the past, which is no longer an option. Now, as I start my planning, I’m starting to worry. I hope that I can describe where I’m starting from and, based on your experience, get some advice about what things I should look out for, the potential types of analysis I should run, or any similarly modelled projects I can look at for inspiration.

Here’s the breakdown:

1) My research investigates whether comedic examples aid in students’ ability to understand and reflect on experiences out in the community.

–I have a list of students who have agreed to further contact for research projects.

–These students have taken a particular type of experiential course at the university, so they are not representative of the whole student body.

–Of the approximately 1,500 students who took this opportunity, approximately 550 have agreed to receive follow-up surveys.

–My question here is: how do I conceptualize, calculate, and articulate a representative sample?

2) I plan to compare the means of two different groups, and I am struggling to work out an appropriate population standard deviation.

–The survey will be approximately 20 questions long. Respondents will be randomly split into two groups. Each group will see a different video clip but will be asked identical questions after the clip. I want to compare the results of each question to see if there are significant differences.

–What is the sample size of each group needed to make these viable? I’m okay with a 90% confidence level.

–Does it then make the most sense to run an independent-samples t-test for those differences?
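As a rough illustration of the sample-size question, here is a minimal sketch of the usual normal-approximation formula for comparing two independent means, n per group = 2 * (z_(1-alpha/2) + z_(power))^2 / d^2. It works in standard-deviation units (a standardized effect size d), which sidesteps needing the population standard deviation itself. The effect size of 0.5 and the power of 0.80 are assumptions chosen for the example, not values from the project; alpha = 0.10 corresponds to the 90% confidence level mentioned above.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.10, power=0.80):
    """Approximate respondents needed in EACH group for a two-sided,
    two-sample comparison of means (normal approximation).

    effect_size_d: smallest difference worth detecting, in SD units (assumed).
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)

# Assumed inputs: medium effect (d = 0.5), 90% confidence, 80% power.
needed = n_per_group(0.5, alpha=0.10, power=0.80)
print(f"about {needed} respondents per group")
```

A t-test needs slightly more than the normal approximation suggests, and a smaller assumed effect or a stricter alpha (for example, a Bonferroni-style alpha / 20 if all 20 questions are tested separately) pushes the number up quickly.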

While there is much more to consider, these two things are my initial stumbling blocks. Thanks in advance for any help or insight. It’s much appreciated.


[Q] Deciding the best regression analysis for a dataset

Hi everyone. I have a dataset and wanted some feedback on the proper analysis. It’s a repeated-measures design with 5 continuous variable observations per subject. For example, a score on a test and a self-report Likert scale of anxiety, both collected for each observation. I want to see if a continuous trait variable predicts the strength of the correlation between the test score and reported anxiety.

The following link is close to what I want to do, but instead of two groups, it’s a continuous variable. I’m basically trying to run a regression between a trait score and a bunch of correlation coefficients. Can anyone offer any guidance here?
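To make the two-stage idea concrete, here is a minimal sketch: compute each subject’s test-score/anxiety correlation, apply the Fisher z transform (atanh) so the values are on an unbounded scale, then regress those on the trait score. All data below are invented for illustration. With only 5 observations per subject, the per-subject correlations are very noisy, so a multilevel model with a trait-by-anxiety interaction is a commonly suggested alternative to this approach.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_fit(x, y):
    """Simple least-squares line y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx

# Hypothetical data: one trait score and 5 paired observations per subject.
subjects = [
    {"trait": 2.0, "score": [10, 12, 11, 14, 13], "anxiety": [3, 2, 3, 1, 2]},
    {"trait": 3.5, "score": [9, 11, 13, 12, 15], "anxiety": [2, 3, 2, 3, 2]},
    {"trait": 5.0, "score": [10, 13, 12, 15, 14], "anxiety": [1, 2, 2, 3, 2]},
    {"trait": 6.5, "score": [8, 10, 12, 11, 14], "anxiety": [1, 2, 3, 2, 4]},
]

# Stage 1: per-subject correlation, Fisher z-transformed.
# Note: atanh fails if a subject's correlation is exactly +1 or -1.
r_values = [pearson_r(s["score"], s["anxiety"]) for s in subjects]
z_values = [math.atanh(r) for r in r_values]

# Stage 2: regress the transformed correlations on the trait score.
slope, intercept = ols_fit([s["trait"] for s in subjects], z_values)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```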


[Q] Comparing Variance of Random Slopes Across Categorical Fixed Effect

When fitting a random slopes and intercepts model, what is an effective way to test whether the variance of the slopes is different at different levels of the fixed effect?

For example, imagine a dataset with a continuous response variable ‘y’, a three-level fixed effect variable ‘x’ and a single random effect ‘subject’. How would you test whether the subjects’ x coefficients had higher variance at different ‘x’ levels? Some sort of bootstrapping procedure?

A follow up question: Imagine a scenario where the subject level slopes are segregated into two or more clusters at one level of the fixed effect. What would be the best way to model this?

See this picture for one such scenario.


[Q] Is it appropriate to use Cronbach’s alpha when I have more than one variable in a survey?

Hello! I’m planning to use Cronbach’s alpha on a survey’s results, but want to know if it’s possible to apply it in this case.

The survey has 25 questions. There are 5 questions for each variable being measured.

Would I have to apply Cronbach’s alpha to the 25 questions as a whole, or should I apply it separately for each variable?

From what I can understand, Cronbach’s alpha measures the correlation between the results, so if 2 questions have similar results, Cronbach’s alpha will be higher. However, one cannot necessarily expect the results of different variables to be similar. In other words, questions from different variables will be measuring different things, resulting in an unacceptable Cronbach’s alpha.
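For reference, here is a minimal sketch of the standard formula, alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score), applied to the items of a single variable (subscale). The responses and the subscale name are invented, and the example uses 3 items rather than 5 just to keep it short.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for one set of items.

    items: list of k lists, one per question, each holding one score
    per respondent (same respondent order in every list).
    """
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(question) for question in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical Likert responses (5 respondents) for the items of ONE variable.
motivation_items = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 2, 4],
]

alpha = cronbach_alpha(motivation_items)
print(f"alpha for this subscale = {alpha:.3f}")
```

Calling the function once per variable, with only that variable’s questions, gives one alpha per subscale.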

Extra info if it matters: I’m using a Likert scale. There are 5 variables: one of them is the dependent variable in a multiple linear regression model, and the other 4 are the independent variables.



[Q] Is a multinomial model appropriate for these data?

I’m trying to understand if multinomial models are appropriate for my analysis. To give a simplified example, let’s say we’re interested in the choice of someone randomly picking coloured balls out of a hat. If there are 10 green, 10 red and 10 orange balls, then the baseline expected probability of picking any colour is equal.

But if there is an unequal number of coloured balls, e.g. 25 green, 3 red and 2 orange, this is going to affect the probability of picking a ball. You’re more likely to pick a green, simply because there are far more of them.

To add complexity, we repeat this choice exercise for multiple different bags, all with varying numbers of coloured balls. So there would presumably need to be some sort of random term.

Can multinomial models account for:

* unequal numbers of coloured balls in a bag
* multiple bags, all with different numbers of coloured balls in them
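One way to think about the availability issue is to let the number of balls of each colour act as a baseline weight, so that P(colour) is proportional to count * exp(preference). This is equivalent to using log(count) as an offset in a multinomial logit. The sketch below only illustrates that idea in plain Python; it is not how nnet or mlogit are called, the function names are hypothetical, and it leaves out any random term for bags.

```python
import math

def choice_probs(counts, prefs):
    """Probability of each colour given the bag contents.

    counts: {colour: number of balls in this bag}
    prefs:  {colour: preference coefficient}; all zeros means purely random picking.
    """
    weights = {c: n * math.exp(prefs.get(c, 0.0)) for c, n in counts.items() if n > 0}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def log_likelihood(observations, prefs):
    """observations: list of (counts_in_bag, colour_picked), one entry per pick."""
    return sum(math.log(choice_probs(counts, prefs)[picked]) for counts, picked in observations)

bag = {"green": 25, "red": 3, "orange": 2}

# No preference: probabilities simply follow the counts in the bag.
baseline = choice_probs(bag, {})

# A hypothetical preference for red shifts probability beyond availability.
with_pref = choice_probs(bag, {"red": 1.0})

print(baseline)
print(with_pref)
```

Because the counts enter each pick separately, picks from several bags with different contents can go into the same likelihood.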

I’ve been looking to analyse my data in R with the nnet and mlogit packages, but I’m not sure whether multinomial models are actually appropriate.

Thanks for the help!


[Q] Suggested methodology to generate simulated data from an existing dataset

I’m working on a project where I have a real dataset with a complex underlying structure. As part of my work, I am running a simulation study to test the validity of different methods which exist in the literature for working with this dataset. The goal is essentially to see how each method performs if the assumptions underlying a different method are the actual true assumptions.

To do this, I would like to generate simulated datasets which have the same underlying structure as the original dataset. I have thought of 2 possible approaches:

1. Add a small amount of random noise to the covariates for each simulation run to generate datasets with the same general underlying structure which are not identical
2. Do some kind of bootstrap sampling to generate simulated datasets with the same underlying structure
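The two approaches above can be sketched in a few lines. This is only a minimal illustration with invented rows and hypothetical column names (`group`, `x`), and it assumes the rows are independent. If the real structure involves clusters or repeated measures, resampling whole clusters rather than single rows would be needed to preserve it.

```python
import random

def jitter_covariates(rows, columns, noise_sd, rng):
    """Approach 1: copy the dataset and add Gaussian noise to the chosen covariates."""
    return [
        {**row, **{c: row[c] + rng.gauss(0.0, noise_sd) for c in columns}}
        for row in rows
    ]

def bootstrap_rows(rows, rng):
    """Approach 2: resample rows with replacement, keeping the original sample size."""
    return [rng.choice(rows) for _ in rows]

# Hypothetical stand-in for the real dataset.
data = [
    {"group": "a", "x": 1.0},
    {"group": "a", "x": 2.5},
    {"group": "b", "x": 4.0},
    {"group": "b", "x": 5.5},
]

rng = random.Random(42)  # fixed seed so simulation runs are reproducible
simulated_jitter = jitter_covariates(data, ["x"], noise_sd=0.1, rng=rng)
simulated_boot = bootstrap_rows(data, rng)
```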

Does anyone have experience using either of these approaches, or can point me to papers outlining valid approaches for this kind of challenge?


[E] Video series on fluid simulation relying heavily on statistics -> all self-coded

Hey there,

The next part of my video series on fluid simulation is available.

Topics covered: rarefied gas dynamics, continuum gas dynamics, fluid motion descriptions & coordinates (material-fixed, arbitrary), reducibility aspects, motivation on modeling unresolved flow structures, ensemble averages of microscopically and macroscopically varying data, usefulness of the modeling hierarchy, simplifying and decoupling the evolution equations, Navier-Stokes equations, compressible flow and the incompressible flow assumptions, and buoyancy-driven flow.

I hope you like it!

Have fun!


[Question] Untangling Logged and Differenced Variables

I have a bit of a conundrum:

I wanted to run a within-between model on panel data to parse out within-unit and between-unit variation. Generally that involves a model that looks like:

Y_ij = B1(x_ij - xbar_j) + B2(xbar_j) + e_ij

where x_ij is the IV for unit j at time i and xbar_j is that unit’s mean across time. The IVs get demeaned for within-unit variation (B1), and unit means across time are used to estimate between-unit variation (B2).

The problem is that one of my IVs is log transformed. As such, averaging together the differences in de-meaned log values for B1 seems to conflate magnitudes of change with respect to Y, because a one-unit difference in one case over time is not equivalent to a one-unit difference in another.
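To see what the de-meaned log values actually are, here is a minimal sketch with made-up numbers. Since log(x) minus the unit’s mean of log(x) equals log(x / geometric mean of x for that unit), the within term measures proportional deviation from each unit’s own typical level: two units at very different scales get the same within value if they change by the same ratio.

```python
import math
from collections import defaultdict

def within_between_log(panel):
    """Split log(x) into between (unit mean of log x) and within (deviation) parts.

    panel: list of (unit, x) pairs with x > 0.
    Returns a list of (unit, between, within) in the same order.
    """
    logs = defaultdict(list)
    for unit, x in panel:
        logs[unit].append(math.log(x))
    unit_means = {u: sum(v) / len(v) for u, v in logs.items()}
    return [(u, unit_means[u], math.log(x) - unit_means[u]) for u, x in panel]

# Hypothetical panel: both units double over time, at very different levels.
panel = [("small", 10.0), ("small", 20.0), ("large", 1000.0), ("large", 2000.0)]

for unit, between, within in within_between_log(panel):
    print(f"{unit:>5}: between = {between:.3f}, within = {within:+.3f}")
```

Whether that proportional reading is the one wanted for B1 depends on the substantive question; the sketch only shows what the demeaned log variable encodes.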

Any suggestions on how to proceed? I suppose I can just run the model without logging the variable, but I have some justifications for logging it. I feel like there must be an alternative option floating around.


[Q] (Updated): Is there a validated procedure for allocating realistic ranges for BP, weight, HR, etc?

I’ve been looking online to see if there might be a commonly used range, from NHANES for example, but I haven’t found anything yet.

I am more specifically looking for a protocol to utilize so that in my code, I can flag when a number is unrealistically high or low. I don’t know if there is a common procedure for which ranges I should allow and which I should pinpoint as being impossible. If there is nothing available, I will have to come up with my own ranges.

I am essentially looking to see if there are cutoffs already utilized for the purpose of flagging in my code, not necessarily cutoffs for separating stages of disease.
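For the coding side only, here is a minimal sketch of how such flagging could be structured once ranges have been chosen. The limits in `PLAUSIBLE_RANGES` are placeholders invented for this example, not validated or clinically sourced cutoffs, and the field names are hypothetical; they would need to be replaced with whatever ranges end up being adopted.

```python
# PLACEHOLDER limits for illustration only; NOT validated cutoffs.
PLAUSIBLE_RANGES = {
    "systolic_bp_mmhg": (50, 300),
    "heart_rate_bpm": (20, 250),
    "weight_kg": (20, 350),
}

def flag_value(field, value, ranges=PLAUSIBLE_RANGES):
    """Return 'missing', 'too_low', 'too_high', or 'ok' for one measurement."""
    if value is None:
        return "missing"
    low, high = ranges[field]
    if value < low:
        return "too_low"
    if value > high:
        return "too_high"
    return "ok"

def flag_record(record, ranges=PLAUSIBLE_RANGES):
    """Flag every field of a record that has a configured range."""
    return {field: flag_value(field, record.get(field), ranges) for field in ranges}

# Hypothetical record with a likely data-entry error in heart rate.
record = {"systolic_bp_mmhg": 128, "heart_rate_bpm": 720, "weight_kg": None}
flags = flag_record(record)
print(flags)
```

Keeping the ranges in one dictionary makes it easy to swap in published limits later without touching the flagging logic.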