[Q] Resources to re-learn or refresh the basics of intro to statistics

Hey everyone!

I’m a college student and next semester I will be taking my second statistics class. My issue is that I took my first statistics class a long time ago in high school, almost 3 years ago now, and I barely remember the most important concepts. So I’m looking for videos, books, or any other resources you might recommend to review this knowledge during the summer. I’m also gonna work during the summer, so nothing that is 900 pages long please lol. All I want is to regain a clear understanding of the most important things.

I have seen a couple of courses on Coursera and elsewhere, if anyone has any opinions on those. I kind of like the Stanford one because it’s completely open to anyone.

Any comments or advice are greatly appreciated 🙂


[Q] Is it possible to reorder this matrix to its canonical (markovian absorbing) form?

Is it possible to rearrange this matrix into its canonical form? I have searched numerous websites and videos and have only found answers for small matrices where you automatically get the identity matrix just by rearranging the rows. How can I do that here if the absorbing states are all the even states? Is it possible?
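For reference, a minimal sketch of the reordering. The matrix from the post is not reproduced here, so the 5-state chain below is hypothetical, with the even states absorbing. The sketch assumes the convention that puts transient states first, giving the block layout [[Q, R], [0, I]]; some textbooks put the absorbing states first instead. The point it illustrates is that the same permutation is applied to rows and columns together, rather than to rows alone.

```python
def canonical_form(P):
    """Reorder states so transient states come first and absorbing states last.

    Returns (order, P_canon, n_transient), where order[k] is the original
    0-based index of the state placed at position k.
    """
    n = len(P)
    absorbing = [i for i in range(n) if P[i][i] == 1]
    transient = [i for i in range(n) if P[i][i] != 1]
    order = transient + absorbing
    # Apply the SAME permutation to the rows and to the columns.
    P_canon = [[P[i][j] for j in order] for i in order]
    return order, P_canon, len(transient)


# HYPOTHETICAL chain with states 1..5 (stored 0-indexed); states 2 and 4 absorb.
P = [
    [0.00, 0.50, 0.50, 0.00, 0.00],  # state 1
    [0.00, 1.00, 0.00, 0.00, 0.00],  # state 2 (absorbing)
    [0.25, 0.25, 0.00, 0.25, 0.25],  # state 3
    [0.00, 0.00, 0.00, 1.00, 0.00],  # state 4 (absorbing)
    [0.00, 0.00, 0.50, 0.50, 0.00],  # state 5
]

order, P_canon, t = canonical_form(P)
Q = [row[:t] for row in P_canon[:t]]  # transient -> transient block
R = [row[t:] for row in P_canon[:t]]  # transient -> absorbing block

print("new state order:", [i + 1 for i in order])
for row in P_canon:
    print(row)
```

The bottom-right block of `P_canon` is the identity because every absorbing state keeps its self-loop on the diagonal after a simultaneous row and column permutation.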












Chi-Squared and Bin Size [Q]

I’m trying to add a tad of statistics to a manuscript where the data is obviously different.

I have two groups of students: Ones that went to tutoring, and ones that did not.

I wanted to run a chi-squared test on each class to see if there was a significant difference in grade outcomes between the two groups.

The class sizes are large with 100+ students and the distributions between groups are very different.

I have whole letter grade data. The problem was that for some groups in some classes there would be a zero count for either D’s or F’s.

To resolve this I expanded the bin sizes. I have the arrangements:

1. Good grades, Moderate grades, and Bad grades

2. Successful in Course and Unsuccessful in Course

3. Received an A and Did not receive an A

All tests came out saying the data is dependent at the 99% confidence level.
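For reference, the three arrangements can all be derived from one table of letter-grade counts. This is a minimal sketch: the counts are made up, and the cut-offs (Good = A/B, Moderate = C, Bad = D/F, Successful = A/B/C) are assumptions, since the exact groupings are not given here. It computes the Pearson chi-squared statistic by hand and reports the smallest expected count, which is the quantity the usual "expected count of at least 5" rule of thumb looks at.

```python
def chi_square(table):
    """Pearson chi-squared statistic, degrees of freedom and expected counts
    for a two-way table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof, expected


def collapse(table, groups):
    """Merge columns; groups is a list of lists of column indices."""
    return [[sum(row[j] for j in g) for g in groups] for row in table]


# HYPOTHETICAL counts for one class; columns are A, B, C, D, F.
grades = [
    [40, 35, 20, 5, 0],   # went to tutoring
    [20, 30, 30, 12, 8],  # did not go to tutoring
]

# ASSUMED cut-offs for the three arrangements.
arrangements = {
    "good/moderate/bad": [[0, 1], [2], [3, 4]],
    "successful/unsuccessful": [[0, 1, 2], [3, 4]],
    "A/not A": [[0], [1, 2, 3, 4]],
}

results = {}
for name, groups in arrangements.items():
    stat, dof, expected = chi_square(collapse(grades, groups))
    smallest = min(min(row) for row in expected)
    results[name] = (stat, dof, smallest)
    print(f"{name}: chi2 = {stat:.2f}, df = {dof}, smallest expected = {smallest:.1f}")
```

All three tables are built from the same students, so the three tests are not independent pieces of evidence.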

My question is: would it be okay to use all three, or is there some reason some of these would not be as ideal as others?

The point of the paper is to suggest tutoring is good for student performance, among other things.

I have no formal education on stats so I’d really appreciate it. 🙂


[Q] How to specify Spatial weight matrix for interaction model with spillover effects?

Hi everyone,


I want to estimate a gravity-type model for mobility flows, which means my dependent variable is a 1 x n^2 vector of flows between the n^2 region pairs. I already have an n x n weight matrix which is based on the geographic distance between the n regions.


I am currently trying to build a spatial model as mentioned above. I know that for a classic spatial lag model the weight matrix would be specified depending on the type of spatial effects.

For example, for destination-based effects one would form the Kronecker product of an n x n identity matrix and the n x n weight matrix to obtain the weight matrix used for the estimation.
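For reference, a minimal sketch of that Kronecker construction. The 3-region weight matrix is hypothetical, and `kron` and `identity` are small helpers written for this example. Which of I ⊗ W and W ⊗ I plays the destination role depends on how the n^2 flows are stacked into a vector; the sketch simply follows the description above and treats I ⊗ W as the destination-based matrix.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a * b for a in row_a for b in row_b]
        for row_a in A
        for row_b in B
    ]


def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matvec(M, x):
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


# HYPOTHETICAL row-standardised weights for n = 3 regions.
W = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
n = len(W)

W_d = kron(identity(n), W)  # I (x) W: block-diagonal, one copy of W per block
W_o = kron(W, identity(n))  # W (x) I: the counterpart for the other end of the flow

# A spatially lagged regressor is the product of such a matrix with a
# length-n^2 covariate vector (HYPOTHETICAL values).
x = [float(k) for k in range(n * n)]
x_lag_d = matvec(W_d, x)

print(len(W_d), "x", len(W_d[0]))
print(x_lag_d)
```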

I only want to use an SLX model, meaning I only include spatial effects for the independent variables. My question now is whether I can simply use the same weight matrices as I would use for the spatial lag model.

I tried to find good explanations of how the weight matrix is constructed for interaction models, but couldn’t find any.

Thank you for your help!


Sybil attack March Madness survivor pool?

A question for the statisticians.

I’ve run a March Madness Survivor Pool for years with hundreds of people and it’s all been manual. This year I want to automate all of this and I want to use a blockchain like Polygon to facilitate the bets and try to make it as decentralized as possible.

My concern is that if it is anonymous, can someone effectively game the system by creating, say, 1,000 wallets and guaranteeing victory? If so, how many accounts would it take, and would it be financially reasonable to do so?


[Q] Am I using the right tests?

Hi all,

I am trying to complete statistical analyses for a medical project regarding 2 different exposures.

Some of the outcomes are numeric, such as the number of migraines patients experience, and others are categorical, such as presence or absence of heart attack.

My plan is the following:

1. Do a **t-test with unequal variance** for the numeric values, e.g., a t-test to compare the number of migraines or minutes spent in surgery for the 2 patient cohorts.
2. Do a **chi-square test for the categoricals**, e.g., a chi-square test to compare the presence or absence of heart attack in patients receiving med A vs. those receiving med B.
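For reference, a minimal sketch of what the unequal-variance (Welch) t-test computes. The migraine counts are made up, and `welch_t` is a helper written for this example. It returns the t statistic and the Welch-Satterthwaite degrees of freedom; turning those into a p-value needs a t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's unequal-variance t statistic and Welch-Satterthwaite df."""
    nx, ny = len(x), len(y)
    # Squared standard errors of the two sample means (sample variance / n).
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# HYPOTHETICAL monthly migraine counts for the two cohorts.
med_a = [2, 4, 3, 5, 1, 4, 3, 2]
med_b = [5, 7, 6, 9, 4, 8]

t_stat, dof = welch_t(med_a, med_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

In Excel, the same test should correspond to `T.TEST(array1, array2, 2, 3)`, where the last argument selects the two-sample unequal-variance version.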

I will mostly be using Excel for this project.

I am happy to provide more info if needed.

Thank you for any feedback or recommendations!


[Question] Help with Mplus code to do longitudinal mediation

Hello everyone,

I am looking for some help using Mplus to do a longitudinal mediation. My study involved taking pre-post measurements of several variables as part of an intervention that had two groups: a control group and a treatment group.

I need the mediator and outcome variables to be change scores, and I am uncertain how to tell Mplus to interpret the mediator and outcome variables as change scores in the Define category. My model has the correct syntax to create the change scores; I just don’t know how to tell Mplus to use the change score variable as the mediator and outcome variable.

I have missing data for the post measurements as several people dropped off during the study. However I am doing an intent-to-treat analysis. Do I need to add anything on the syntax to account for this? I am already specifying that values with -999 are missing and I am also using the ML estimator.

Do I need to rename it somehow?

Any help is much appreciated.

DATA: FILE IS Organized Full Sheet.dat;

VARIABLE:
NAMES ARE
ActiveC !ActiveC is a binary categorical independent variable
Met MetA
NGSE NGSEA !New General Self Efficacy Scale; complete in NGSA and
DIET DIETA !about 50% on DIETA.
NEval NEvalA
GloH GloHA
Mas MasA
Enj EnjA
Sys SysA
Dia DiaA
Wght WghtA; !Weight variable; complete in Wght and about 47% in WghtA.

usevariables are NGSE NGSEA Wght WghtA X M Y;

missing = all (-999);

DEFINE:
X = ActiveC;
Y = WghtA - Wght;

ANALYSIS:
BOOTSTRAP = 10000;

MODEL:
delta by NGSEA@1;
delta with NGSE;

Wghta ON Wght@1;
delta by Wghta@1;
delta with Wght;

Y ON M (b1);
Y ON X (cdash);
M ON X (a1);

MODEL CONSTRAINT:
NEW(a1b1 TOTAL);
a1b1 = a1*b1;
TOTAL = a1*b1 + cdash;


[Question] Analyzing effects of two independent variables on three response variables…using JMP

I am extracting plant materials using three levels of plant matter solution concentration and three levels of sonication, for a total of 9 different treatments. The response variables are the extract’s purity, target phytochemical extraction yield, and crude extract yield. I am using JMP, and each treatment is done in triplicate.

My plan is to use an F-test first to find out whether the two independent variables make any difference in the three responses, and if they do, use either Fisher LSD or Tukey HSD to control the Type I error rate. Does this sound like a good plan? Or should I run another test?
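For reference, a minimal sketch of the F ratios behind a balanced two-factor design like the 3 x 3 layout with triplicates described above. The purity values are made up, and `two_way_anova` is a helper written for this example that handles a single response; it would be run once per response variable. It only produces the sums of squares, degrees of freedom and F ratios, not p-values or the LSD/HSD comparisons.

```python
from statistics import mean


def two_way_anova(data):
    """Sums of squares, df and F ratios for a BALANCED two-factor design.

    data[i][j] is the list of replicate responses at level i of factor A
    (e.g. concentration) and level j of factor B (e.g. sonication).
    """
    a, b, r = len(data), len(data[0]), len(data[0][0])
    cell = [[mean(data[i][j]) for j in range(b)] for i in range(a)]
    row = [mean(cell[i]) for i in range(a)]
    col = [mean([cell[i][j] for i in range(a)]) for j in range(b)]
    grand = mean(row)

    ss = {
        "A": b * r * sum((m - grand) ** 2 for m in row),
        "B": a * r * sum((m - grand) ** 2 for m in col),
        "AxB": r * sum(
            (cell[i][j] - row[i] - col[j] + grand) ** 2
            for i in range(a)
            for j in range(b)
        ),
        "error": sum(
            (y - cell[i][j]) ** 2
            for i in range(a)
            for j in range(b)
            for y in data[i][j]
        ),
    }
    df = {"A": a - 1, "B": b - 1, "AxB": (a - 1) * (b - 1), "error": a * b * (r - 1)}
    mse = ss["error"] / df["error"]
    f_ratio = {k: (ss[k] / df[k]) / mse for k in ("A", "B", "AxB")}
    return ss, df, f_ratio


# HYPOTHETICAL purity (%) values: 3 concentrations x 3 sonication levels x 3 replicates.
purity = [
    [[71.2, 70.8, 72.1], [74.5, 75.0, 73.9], [76.3, 77.1, 76.8]],
    [[69.9, 70.4, 70.1], [73.2, 72.8, 73.5], [75.9, 75.1, 76.0]],
    [[68.0, 67.5, 68.8], [71.1, 70.6, 71.9], [73.4, 74.0, 73.1]],
]

ss, df, f_ratio = two_way_anova(purity)
for term in ("A", "B", "AxB"):
    print(f"{term}: SS = {ss[term]:.3f}, df = {df[term]}, F = {f_ratio[term]:.2f}")
```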


[Q] Regression model comparison for ordinal data

Hey everyone! I’m currently working on my thesis and want to analyse whether certain health markers have additional value in predicting symptoms compared to just using an already existing marker. Basically: does variable A explain variance in variable C in addition to what variable B explains?

I analyse two different markers in two different analyses. For one marker, all variables A, B, and C are 5-point Likert, for another marker all variables are on a 0-10 scale.

My supervisor told me I could use multiple linear regression analysis with B as a covariate if my variables are approximately normally distributed so I could simply see if A has additional value in the output. Unfortunately they are not, and even transforming didn’t really help, so now I’m considering ordinal regression analyses.

Now I’m struggling with a few problems:
1. Should I just go for linear regression anyway?
2. If not, how can I find out whether or not A explains additional variance? I haven’t been able to figure out how to check for it in my R output, but I’ve stumbled upon the likelihood ratio test, could that be an option?
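For reference, a minimal sketch of the likelihood ratio test mentioned in point 2, for the case where the full model (A and B) adds exactly one parameter to the reduced model (B only). The log-likelihood values are made up; in R they could presumably be read off the fitted models with `logLik()`. If A enters as a factor with several levels, more than one parameter is added and the 1-degree-of-freedom shortcut below no longer applies.

```python
from math import erfc, sqrt


def likelihood_ratio_test(loglik_reduced, loglik_full):
    """LR statistic and p-value for nested models differing by ONE parameter."""
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value


# HYPOTHETICAL log-likelihoods of two fitted ordinal regression models.
loglik_b_only = -212.4   # C ~ B
loglik_a_and_b = -208.1  # C ~ A + B

lr_stat, p = likelihood_ratio_test(loglik_b_only, loglik_a_and_b)
print(f"LR = {lr_stat:.2f}, p = {p:.4f}")
```

A small p-value here would suggest that adding A improves the fit beyond B alone.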

Sorry this is a long read and probably basic stuff, but I’m not super competent in statistics and especially R. I’d really appreciate any sort of feedback!


[Q] How can I test the significance of some economics data I’ve gathered?

Hello r/statistics!

This is my first post here, so I apologize if I’m not following proper etiquette specific to this subreddit.

To the matter at hand: I’m currently working on a paper which I’m writing purely out of curiosity at the moment. Using World Bank data, I’ve gathered figures for four primary countries (Thailand, Vietnam, India, and the Philippines), as well as 31 other countries, on the proportion of total imports and total exports which is taken up by imports and exports of capital goods respectively. I’ve gathered this data for two periods, 2008 and 2018, and have found the percent change in these proportions over that time period for the countries mentioned. The method I have used is as follows:

Thailand, 2008:
Proportion of total exports expressed in capital goods: 35.88%
Proportion of total imports expressed in capital goods: 31.76%

Thailand, 2018:
Proportion of total exports expressed in capital goods: 38.3%
Proportion of total imports expressed in capital goods: 33.96%

Percent change
Exports: ((38.3 - 35.88)/35.88) x 100 = 6.744% change
Imports: ((33.96 - 31.76)/31.76) x 100 = 6.92% change

I’ve repeated this for the other three countries. My question is, can I prove the significance of these data, and if so, how? Using the same source, I have the world’s proportions of exports and imports expressed in the flow of capital goods for the time periods mentioned, and have calculated their percent change, as provided below:

World, 2008
Proportion of total exports expressed in capital goods: 30.89%
Proportion of total imports expressed in capital goods: 29.11%

World, 2018
Proportion of total exports expressed in capital goods: 32.58%
Proportion of total imports expressed in capital goods: 32.75%

Percent change:
Exports: ((32.58 - 30.89)/30.89) x 100 = 5.47% change
Imports: ((32.75 - 29.11)/29.11) x 100 = 12.50% change
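For reference, a minimal sketch of the percent-change calculation applied to the Thailand and World shares quoted above. `percent_change` and the `shares` layout are just names chosen for this example. The results agree with the figures in the post up to rounding in the last digit.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Capital-goods shares (%) quoted in the post, as (2008, 2018) pairs.
shares = {
    "Thailand": {"exports": (35.88, 38.3), "imports": (31.76, 33.96)},
    "World": {"exports": (30.89, 32.58), "imports": (29.11, 32.75)},
}

changes = {
    place: {flow: percent_change(*pair) for flow, pair in flows.items()}
    for place, flows in shares.items()
}

for place, flows in changes.items():
    for flow, change in flows.items():
        print(f"{place} {flow}: {change:.3f}% change")
```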

I wanted to use this as a parameter for a one-sample z-test; however, these four countries are not randomly selected, nor is the sample large enough, and including them in the sample of the other 31 countries would test the significance of the whole sample, not the results of these four countries. Do you know of any way around this dilemma, in which I could actually test the significance of the data from the four primary countries specified?

P.S. I’ve also been confused on another matter. The percent change of exports and imports expressed in the flow of capital goods is important. However, the changes in exports and imports may not be independent, and in fact may act on one another. I figured a test for independence would be in order; however, I’ve already proceeded with another course of action, should they not be independent, in a separate document. What I did was express the proportions of exports and imports expressed in the flow of capital goods as a ratio. An example would be:

Thailand, 2008
35.88/31.76 = 1.129

Thailand, 2018
38.3/33.96 = 1.127

Percent change
((1.127 - 1.129)/1.129) x 100 = -0.17% change

I figured this would observe a change in the general direction of a country’s economy regarding whether they are importing or exporting more capital. For instance, given the data available, Vietnam over the last decade has seen a 180% change in favor of exporting capital goods, following this math. Would any of this be worthwhile to pursue, or have I fudged the math on this?

I know this is a lot to read. To those of you who have taken the time to read my question, thank you sincerely. I apologize in advance if my replies are short or sparse. That being said, I earnestly intend to reply to the answers provided.

EDIT: user u/Fabulous-Nobody- kindly commented asking for clarification on the meaning of “can I prove this significance of this data”. I realized that I forgot to state my hypotheses and properly define my terms. My sincere apologies.

For the purposes of this problem: a significance test is a test which serves to demonstrate whether or not data falls along the normal distribution provided by a stated parameter. In this case, our data suggests the percent change in the proportion of world exports and imports expressed in the flow of capital goods is 5.47% and 12.5% respectively. I am seeking to find out whether the normal distribution of these figures accurately predicts the percent change in the proportion of exports and imports expressed in the flow of capital goods for Thailand, Vietnam, India, and The Philippines.

For the purposes of this test, let us set our hypotheses as the following:

Null: the normal distribution of the stated parameters accurately predicts the percent change in the proportion of exports and imports expressed by the flow of capital goods in the four countries stated.

Alternative: the normal distribution of the stated parameters does NOT accurately predict the percent change in the proportion of exports and imports expressed by the flow of capital goods in the four countries stated.