[Question] Untangling Logged and Differenced Variables

I have a bit of a conundrum:

I wanted to run a within-between model on panel data to parse out within unit and between unit variation. Generally that involves a model that looks like:

Y_ij = B0 + B1*(X_ij − X̄_i) + B2*X̄_i + e_ij

The IVs are demeaned for within-unit variation (B1), and unit means across time are used to estimate between-unit variation (B2).

The problem is that one of my IVs is log-transformed. As such, averaging the demeaned log values for B1 seems to conflate magnitudes of change with respect to Y, because a one-unit difference in one case over time is not equivalent to a one-unit difference in another.

Any suggestions on how to proceed? I suppose I could just run the model without logging the variable, but I have good justifications for logging it. I feel like there must be an alternative option floating around.
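One standard workaround, assuming the hybrid (within-between) setup above, is to log-transform first and then decompose the *logged* variable, so the within coefficient is interpreted in proportional rather than absolute terms. A minimal sketch with statsmodels on fabricated panel data (all column names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Toy panel: 30 units observed over 8 periods (names are hypothetical).
df = pd.DataFrame({
    "unit": np.repeat(np.arange(30), 8),
    "x": rng.lognormal(mean=2.0, sigma=0.5, size=240),
})
df["y"] = 1.0 + 0.5 * np.log(df["x"]) + rng.normal(size=len(df))

# Log first, then decompose: the within term is the deviation of log(x)
# from the unit's mean log(x), so its coefficient measures the effect of
# proportional (not absolute) changes within a unit.
df["log_x"] = np.log(df["x"])
df["between"] = df.groupby("unit")["log_x"].transform("mean")
df["within"] = df["log_x"] - df["between"]

fit = smf.ols("y ~ within + between", data=df).fit()
print(fit.params.round(2))
```

Because the decomposition is done on log(x), a one-unit within difference now means the same multiplicative change for every unit, which sidesteps the comparability problem.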


[Q] (Updated): Is there a validated procedure for allocating realistic ranges for BP, weight, HR, etc?

I’ve been looking online to see whether there is a commonly used range, from NHANES for example, but I haven’t found anything yet.

More specifically, I am looking for a protocol to use so that, in my code, I can flag when a number is unrealistically high or low. I don’t know whether there is a common procedure for which ranges I should allow and which I should flag as impossible. If there is nothing available, I will have to come up with my own ranges.

I am essentially looking to see if there are cutoffs already utilized for the purpose of flagging in my code, not necessarily cutoffs for separating stages of disease.
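I'm not aware of one single validated standard, but the usual pattern is a per-variable plausibility table applied in code. A sketch of that pattern, where every cutoff below is a placeholder to be replaced with study-appropriate values (these are NOT validated clinical limits):

```python
import pandas as pd

# Hypothetical plausibility limits: placeholders only, to be replaced
# with cutoffs justified for the specific study population.
LIMITS = {
    "sbp_mmhg":  (50, 300),   # systolic blood pressure
    "weight_kg": (20, 350),
    "hr_bpm":    (20, 250),   # heart rate
}

def flag_implausible(df: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean frame: True where a value falls outside its limits."""
    flags = pd.DataFrame(False, index=df.index, columns=list(LIMITS))
    for col, (lo, hi) in LIMITS.items():
        # NaN is left unflagged here (missing, not implausible).
        flags[col] = ~df[col].between(lo, hi) & df[col].notna()
    return flags

df = pd.DataFrame({"sbp_mmhg": [120, 400], "weight_kg": [70, 72], "hr_bpm": [60, 10]})
print(flag_implausible(df))
```

Keeping the limits in one dictionary makes it easy to document where each cutoff came from and to revise them without touching the flagging logic.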


[Q] Properly implementing stepwise regression for a GLM in Python. How to?

Hello Stats,

I am trying to model the frequency of an event, and I am looking for the best way to do this. My response variable is a frequency, and I am fitting it with a Poisson family & log link.

I have a set of variables, say 10 of them, and I want to iteratively add them to the model so that I am putting the most significant variables in first.

So I am using statsmodels’ GLM.

Suppose I start with a base model, the mean model: response ~ 1

I want to test it against models:

response ~ 1 + variable1

response ~ 1 + variable2

response ~ 1 + variable3
… and so on.

I want to then pick the “best” one to go with. What is the standard way to quantify and choose the best? Is it the likelihood ratio test? Is there a way to implement this in Python?


[Q] Additive effects vs. interactions

Say you believe that both political party and age are predictors of a population’s opinion on abortion. Fitting an interaction term between political party and age in your multiple regression model, you could evaluate whether the effect of increasing age varies between, e.g., Democrats and Republicans in your sample. As far as I’ve understood, the interaction term would be conditioned on the additive effects of age and political party.

What does additivity mean in this regard? Does it solely relate to model goodness-of-fit, i.e. that the addition of one variable improves the regression line’s fit to the data? In that case, would a significant interaction term simply mean that you are improving the model’s fit to the data?
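Additivity means the model assumes the age slope is the same in every party, so the two effects just add up; the interaction relaxes exactly that. Concretely, the interaction coefficient is the difference in age slopes between the two parties, not merely a fit improvement. A small sketch with synthetic data where the true slope difference is 1.0 (all numbers fabricated):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Synthetic data: age slope is 0.5 for party A and 1.5 for party B,
# so the true interaction coefficient is 1.0.
n = 400
df = pd.DataFrame({
    "age": rng.uniform(20, 80, size=n),
    "party": rng.choice(["A", "B"], size=n),
})
slope = np.where(df["party"] == "A", 0.5, 1.5)
df["opinion"] = 10 + slope * df["age"] + rng.normal(scale=5, size=n)

fit = smf.ols("opinion ~ age * party", data=df).fit()
# 'age' is the slope for the reference party (A);
# 'age:party[T.B]' is how much steeper the slope is for party B.
print(fit.params.round(2))
```

So a significant interaction does improve fit, but its substantive reading is that the age effect genuinely differs across parties.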


[C] Sucking at numerical recruitment tests

When I apply for jobs and the first step is a numerical test, I automatically know it’s a straight up rejection because I’ll fail it. No matter how easy or hard these are, I’ll fail them.

I’ve never encountered one where the questions are so hard that I’m lost; for the most part I’ll know exactly how to answer them. Yet once that timer is ticking, it seems impossible. I’m not sure whether it’s the anxiety or simply not enough time to understand the question and work it out.

And these are simple questions: working out percentages, finding averages, reading graphs. Does anyone else here struggle with these tests?

I’ve got a STEM BSc, not maths but had a lot of applied maths content and a MSc in Artificial Intelligence which obviously had a lot of maths, more on the modelling/statistical side.

Failing these tests has left me with a lot of doubt and has solidified my imposter syndrome. Anyone else?


[Q] Control chart question

Hi! I’m trying to improve a process, dealing with customer complaints.

Currently the process is as such:
– Customer complaints are counted for each month.
– The mean and 3*SD control limits are established using the last 12 data points.
– Then they look for spikes, trends, and shifts in the mean.

Let’s say it’s December and they’re doing this process. They would take all customer complaints from Jan–Dec, establish the mean and control limits, and look for spikes, shifts, and trends. However, because December’s data is included in setting the mean and control limits, having more complaints in December raises the control limits themselves, which makes it harder to detect a December spike.

I’m trying to read up on control charts, but every example I find seems to do this as well, taking all data into account to establish the mean + control limits. Am I missing something? Wouldn’t it be much better to NOT take the current month of data into account? Could anyone point me to some website or book that would help me understand this process better, and how it should be properly established?
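Your instinct matches the usual Phase I / Phase II distinction in SPC: limits are established from a baseline period (Phase I) and new points are then judged against those frozen limits (Phase II), rather than letting the current point move its own limits. A minimal sketch with made-up counts, where the point being judged is excluded from the baseline:

```python
import pandas as pd

# Toy monthly complaint counts; the last value is a deliberate spike.
counts = pd.Series([12, 9, 11, 10, 13, 8, 12, 11, 9, 10, 12, 11, 25])

# Phase I: estimate the mean and limits from the PRIOR 12 months only,
# so the point being judged does not inflate its own control limits.
baseline = counts.iloc[-13:-1]          # the 12 months before the current one
mean, sd = baseline.mean(), baseline.std(ddof=1)
ucl, lcl = mean + 3 * sd, max(mean - 3 * sd, 0)

current = counts.iloc[-1]
print(f"UCL={ucl:.1f}, LCL={lcl:.1f}, current={current}, "
      f"signal={current > ucl or current < lcl}")
```

For counts specifically, a c-chart (limits at mean ± 3*sqrt(mean)) is often preferred over mean ± 3*SD, but the baseline-window logic is the same either way. Montgomery’s *Introduction to Statistical Quality Control* is the standard reference and covers the Phase I vs. Phase II distinction in detail.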

Thanks for your time!


[Q] Feature selection with multiple measurements

In an ongoing omics project, we have measured 500 features from 50 controls and 50 diseased subjects. We sampled each study subject once, but collected 2–4 tissue samples simultaneously from each. From the subjects with disease, the 2–4 samples were taken from separate *locations* within the diseased tissue, whereas the healthy controls came from a screening program and were disease-free at the time of sampling.

We want to use the Lasso for variable selection, to identify a set of features that best predicts disease versus health regardless of the anatomical location where the sample was taken. At the same time, we are interested in the biology. To my knowledge, modeling the correlated samples per subject without a random effect would artificially inflate the sample size. So the question is how to pursue this: should *subject_id* be modeled as a random-effects term in a Lasso mixed model? Notwithstanding that this approach may not be ideal, would it technically be correct in that it corrects for the lack of independence?
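Mixed-model Lasso implementations do exist (e.g. the glmmLasso package in R). A pragmatic scikit-learn-only alternative is to keep the penalized model fixed-effects-only but make the cross-validation group-aware, so the penalty strength is tuned against held-out *subjects* rather than held-out samples. A toy sketch, with all sizes and the signal structure fabricated:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(3)

# Toy stand-in: 100 subjects x 3 samples each, 50 features.
n_sub, reps, p = 100, 3, 50
subject = np.repeat(np.arange(n_sub), reps)
y = np.repeat(rng.integers(0, 2, size=n_sub), reps)   # one label per subject
X = rng.normal(size=(n_sub * reps, p))
X[:, 0] += 1.5 * y                                    # feature 0 carries signal

# L1-penalized logistic regression; GroupKFold keeps all samples from a
# subject in the same fold, so the penalty is tuned against held-out subjects.
cv = GroupKFold(n_splits=5)
model = LogisticRegressionCV(
    penalty="l1", solver="liblinear", Cs=10,
    cv=cv.split(X, y, groups=subject),
)
model.fit(X, y)
selected = np.flatnonzero(model.coef_[0])
print(selected)
```

This does not add a random intercept, so it does not correct standard errors the way a mixed model would; it only prevents a subject’s duplicated samples from leaking across CV folds and making the selection look better than it is.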


[Career] Any good MOOC on applied statistics?


I’m a young data scientist who has been working for about 2 years now. I’m part of a team with some senior data engineers and SWEs, but all the data scientists are fresh grads with barely any stats credentials. I’m looking to expand my skills, and I feel like I’m lacking in terms of statistics.

I’ve done some in college, but it was barely more than the fundamentals: all theory, no practice. I’m looking for some online courses that would expand my knowledge.

I’ve already completed one course, which I found quite interesting, although not advanced enough for what I’m looking for.

I’m planning to start another, but I fear it’s too theoretical. Is there anything more practical, with applications to real-life problems? Things like hypothesis testing, A/B testing, and such?


[R] Analysis of Russian vaccine trial outcomes suggests they are lazily faked. Distribution of efficacies across age groups is quite improbable

From the abstract: In the 1000-trial simulation for the AstraZeneca vaccine, in 23.8% of simulated trials, the observed efficacies of all age subgroups fell within the efficacy bounds for age subgroups in the published article. The J + J simulation showed 44.7%, Moderna 51.1%, Pfizer 30.5%, and **0.0% of the Sputnik simulated trials had all age subgroups fall within the limits of the efficacy estimates described by the published article. In 50,000 simulated trials of the Sputnik vaccine, 0.026% had all age subgroups fall within the limits of the efficacy estimates described by the published article, whereas 99.974% did not.**
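The simulation approach behind numbers like these can be sketched generically: draw case counts per age subgroup from binomials at an assumed true efficacy, compute each subgroup's observed efficacy, and count how often all subgroups land inside the reported bounds. All numbers below are hypothetical placeholders, not the published trial figures:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical trial layout (NOT the published numbers): per age subgroup,
# vaccine/placebo arm sizes, an assumed true efficacy, and a placebo attack rate.
subgroups = [
    # (n_vax, n_plc, true_efficacy, attack_rate_placebo)
    (3000, 1000, 0.90, 0.02),
    (4000, 1300, 0.90, 0.02),
    (3500, 1200, 0.90, 0.02),
]
bounds = (0.85, 0.95)   # hypothetical reported per-subgroup efficacy bounds
n_trials = 10_000

def simulate_once():
    effs = []
    for n_v, n_p, ve, ar in subgroups:
        cases_p = rng.binomial(n_p, ar)
        cases_v = rng.binomial(n_v, ar * (1 - ve))
        if cases_p == 0:
            return None  # efficacy undefined with no placebo cases
        # Observed efficacy = 1 - (vaccine attack rate / placebo attack rate)
        effs.append(1 - (cases_v / n_v) / (cases_p / n_p))
    return effs

hits = 0
for _ in range(n_trials):
    effs = simulate_once()
    if effs is not None and all(bounds[0] <= e <= bounds[1] for e in effs):
        hits += 1
print(f"{hits / n_trials:.1%} of simulated trials had all subgroups in bounds")
```

The paper's argument is essentially that, under reasonable sampling variability, seeing *every* subgroup land inside narrow bounds simultaneously is rare, and for Sputnik it was vanishingly rare.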


[R] Analysis design to study different equipment in golf

Hi! I thought this would be a perfect place to discuss a project I’m thinking of doing. I have the chance to explore Trackman data, and I want to find out which manufacturer’s club performs best.

Here are some things to consider: the dataset contains many players with different swings, and better players might pick a certain type of club, which leads to selection bias. Carry distance and dispersion would ultimately tell me how good a shot was, but I’m not clear on how I could normalize the data and analyze it by club. HLM might be the way to go, but I wanted to see what people think. I would appreciate any input or ideas, thank you!
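HLM does seem like a reasonable fit: a random intercept per player absorbs player-to-player skill differences, so the club comparison is effectively made within players rather than confounded by who chooses which club. A minimal sketch with statsmodels’ MixedLM on fabricated data (every number and name below is hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)

# Toy Trackman-style data: players differ in baseline carry (sd ~10 yds),
# and club B adds ~5 yards on average.
n_players, shots = 40, 25
player = np.repeat(np.arange(n_players), shots)
player_skill = rng.normal(0, 10, size=n_players)[player]
club = rng.choice(["A", "B"], size=n_players * shots)
carry = (230 + player_skill + np.where(club == "B", 5, 0)
         + rng.normal(0, 8, size=len(club)))
df = pd.DataFrame({"player": player, "club": club, "carry": carry})

# Random intercept per player separates player skill from the club effect,
# addressing the bias where better players self-select certain clubs.
fit = smf.mixedlm("carry ~ club", df, groups=df["player"]).fit()
print(fit.params.round(2))
```

Extensions would be more clubs/manufacturers as fixed effects, shot dispersion as a second outcome, and possibly random slopes if you suspect the club effect itself varies by player.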