This post is mainly based on: https://rpubs.com/amputz/BLUP
This is the equation that we need to understand: Y = Xβ + Zu + ε. An equivalent notation is y = Xb + Za + e.
Now let’s expand it into matrix form.
Y: Response vector (n×1).
X: Fixed-effects design matrix (n×p): p is the number of fixed-effect coefficients, e.g. the number of levels for a categorical fixed effect (it would be 2 if X codes sex, female and male). For a continuous covariate, the column simply holds the covariate values, so it contributes one coefficient rather than one per level.
β: Fixed-effects coefficients (p×1). var(β) = 0 because they are fixed: one constant per level (or per covariate).
Z: Random-effects design matrix (n×q): q is the dimension of the A/G matrix (which is q×q). The design matrix holds 0/1 indicators linking each record to its particular level/individual.
u: Random effects (q×1), with u ∼ N(0, G). Note that q can be greater than n, and this is why we can predict for more individuals than we have records of. G is the vcv matrix of the random effects.
ε: Residual errors (n×1), with ε∼N(0,R). R is the vcv for the residuals (recall GLS models).
In sum, var(y) = ZGZ’ + R (the fixed part Xβ contributes no variance; Z simply spreads the random-effect variances in G onto the records that share a level).
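To make this concrete, here is a minimal numpy sketch (all numbers made up) showing how Z maps the random-effect variances in G onto the records:

```python
import numpy as np

# Minimal sketch with made-up numbers: 3 records from 2 individuals.
Z = np.array([[1, 0],
              [1, 0],
              [0, 1]], dtype=float)  # records 1-2 -> individual 1, record 3 -> individual 2
G = 0.4 * np.eye(2)                  # vcv of the random effects (q x q)
R = 0.6 * np.eye(3)                  # residual vcv (n x n)

V = Z @ G @ Z.T + R                  # var(y) = ZGZ' + R
print(V)  # records sharing an individual get covariance 0.4; diagonals are 0.4 + 0.6
```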
Now we need to solve for β and u.
Recall how we used ordinary least squares (OLS) for fixed effects models. In OLS, we solve (X’X)β = X’y to estimate β. For mixed models, the equations are extended to include the random effects and their variances. How do we estimate β and u at the same time?
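As a quick refresher, a small numpy sketch of the OLS normal equations (simulated data, variable names are mine):

```python
import numpy as np

# OLS: solve (X'X) beta = X'y for a toy intercept + slope model.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = X @ np.array([2.0, 0.5]) + rng.normal(scale=0.3, size=20)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to [2.0, 0.5]
```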
Henderson’s Mixed Model Equations
Henderson’s equations combine both fixed and random effects into a single set of equations. Readers familiar with ordinary least squares (OLS) for fixed-effects models will find the comparison helpful.
The joint distribution of Y and u would then be multivariate normal. To find the estimates of β and u, we can maximize the joint likelihood, which leads to solving the equations derived from setting the derivatives of the log-likelihood with respect to β and u to zero.
Alternatively, since maximizing the likelihood directly can be complicated due to the random effects, Henderson proposed using Best Linear Unbiased Prediction (BLUP) for the random effects. The BLUP approach combines the information from the fixed and random parts.
So, the Mixed Model Equations (MME) are typically written in matrix form as:
[ X’R⁻¹X    X’R⁻¹Z        ] [ β ]   [ X’R⁻¹Y ]
[ Z’R⁻¹X    Z’R⁻¹Z + G⁻¹  ] [ u ] = [ Z’R⁻¹Y ]
This looks like a block matrix system. The left-hand side matrix has blocks X’R⁻¹X, X’R⁻¹Z, Z’R⁻¹X, and Z’R⁻¹Z + G⁻¹. The right-hand side vectors are X’R⁻¹Y and Z’R⁻¹Y.
But why does the lower right block include G⁻¹? Because the random effects u have their own variance-covariance structure G, so including G⁻¹ adds a penalty term to the random effects, shrinking them towards zero depending on the variance components. This is similar to ridge regression (L2 regularization, lasso is L1), where adding a penalty term stabilizes the estimates.
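A quick numerical check of the ridge analogy (a sketch with simulated data, assuming no fixed effects, G = g·I and R = σ²·I; names are mine): the u-equation of the MME collapses to a ridge system with λ = σ²/g.

```python
import numpy as np

# With no fixed effects, G = g*I and R = sigma^2*I, the u-equation of the MME,
# (Z'Z/sigma^2 + I/g) u = Z'y/sigma^2, is ridge regression with lambda = sigma^2/g.
rng = np.random.default_rng(1)
Z = rng.normal(size=(8, 3))
y = rng.normal(size=8)
sigma2, g = 1.0, 0.25

u_mme   = np.linalg.solve(Z.T @ Z / sigma2 + np.eye(3) / g, Z.T @ y / sigma2)
u_ridge = np.linalg.solve(Z.T @ Z + (sigma2 / g) * np.eye(3), Z.T @ y)
print(np.allclose(u_mme, u_ridge))  # True: same shrinkage estimator
```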
Wait, if G is the variance of u, then G⁻¹ would be the precision matrix. So, in the MME, we’re essentially combining the information from the data (through X and Z) and the prior information (through G⁻¹) in a Bayesian-like manner, although Henderson’s derivation is frequentist.
Let me check the dimensions. Suppose X is n x p (n observations, p fixed effects), Z is n x q (q random effects), then β is p x 1 and u is q x 1. The matrix on the left would be (p+q) x (p+q), and the right-hand side vectors are (p+q) x 1. That makes sense.
Solving this system gives the estimates for β and u. The solution can be written using block matrix inversion, but it’s often computationally intensive, especially for large datasets. Modern software uses iterative methods or restricted maximum likelihood (REML) to estimate the variance components (G and R) first and then solve the MME.
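To see how the pieces fit together computationally, here is a minimal sketch (the function name and arguments are my own) that assembles and solves the MME for given X, Z, G, R, treating the variance components as known:

```python
import numpy as np

def solve_mme(X, Z, G, R, y):
    """Assemble Henderson's MME and solve for (beta_hat, u_hat); G and R assumed known."""
    Ri = np.linalg.inv(R)
    Gi = np.linalg.inv(G)
    lhs = np.block([[X.T @ Ri @ X, X.T @ Ri @ Z],
                    [Z.T @ Ri @ X, Z.T @ Ri @ Z + Gi]])   # (p+q) x (p+q)
    rhs = np.concatenate([X.T @ Ri @ y, Z.T @ Ri @ y])    # (p+q) x 1
    sol = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]                                # beta_hat, u_hat
```

Real software avoids forming these inverses explicitly and exploits sparsity, but the block structure is the same.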
But how do we actually use these equations? Let’s consider a simple example. Suppose we have a model with one fixed effect (intercept) and one random effect (e.g., subjects). Then X would be a column of ones, Z would be a matrix indicating the subject for each observation, G is the variance of the random intercepts, and R is the residual variance times the identity matrix.
In this case, the MME would involve inverting matrices that combine the fixed and random effects. The solution would give the fixed intercept and the best linear unbiased predictions for each subject’s random intercept.
But I’m still a bit fuzzy on the derivation. Let me try to sketch it out. Starting from the model Y = Xβ + Zu + ε, with u ~ N(0, G) and ε ~ N(0, R). The goal is to find β and u that maximize the joint density of Y and u. The joint density is proportional to exp(-0.5[(Y - Xβ - Zu)’ R⁻¹ (Y - Xβ - Zu) + u’ G⁻¹ u]).
Taking derivatives with respect to β and u and setting them to zero gives the estimating equations. The derivative with respect to β would be X’ R⁻¹ (Y - Xβ - Zu) = 0, and the derivative with respect to u would be Z’ R⁻¹ (Y - Xβ - Zu) - G⁻¹ u = 0. Rearranging these gives the MME:
X’ R⁻¹ X β + X’ R⁻¹ Z u = X’ R⁻¹ Y
Z’ R⁻¹ X β + (Z’ R⁻¹ Z + G⁻¹) u = Z’ R⁻¹ Y
Which is exactly the block matrix system mentioned earlier. So, the MME are derived from setting the gradients of the joint log-likelihood (or equivalently, the penalized least squares) to zero.
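A quick numerical sanity check (simulated data, made-up variances) that the MME solution really does zero both gradients of the penalized criterion:

```python
import numpy as np

# Penalized criterion: (Y - Xb - Zu)' R^{-1} (Y - Xb - Zu) + u' G^{-1} u.
rng = np.random.default_rng(1)
X = np.ones((6, 1))
Z = np.kron(np.eye(3), np.ones((2, 1)))           # 6 records, 3 subjects, 2 records each
G, R = 0.5 * np.eye(3), 1.0 * np.eye(6)
y = rng.normal(size=6)

Ri, Gi = np.linalg.inv(R), np.linalg.inv(G)
lhs = np.block([[X.T @ Ri @ X, X.T @ Ri @ Z],
                [Z.T @ Ri @ X, Z.T @ Ri @ Z + Gi]])
rhs = np.concatenate([X.T @ Ri @ y, Z.T @ Ri @ y])
sol = np.linalg.solve(lhs, rhs)
b, u = sol[:1], sol[1:]

resid = y - X @ b - Z @ u
print(np.allclose(X.T @ Ri @ resid, 0))           # gradient w.r.t. beta ~ 0
print(np.allclose(Z.T @ Ri @ resid - Gi @ u, 0))  # gradient w.r.t. u ~ 0
```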
Understanding this, the matrix form is a way to compactly represent these two equations. The inclusion of G⁻¹ in the lower right block accounts for the penalty term from the random effects’ prior distribution. This ensures that the estimates of u are shrunk towards zero depending on their variance, preventing overfitting.
Another key point is that if G is diagonal (i.e., random effects are independent), then G⁻¹ is also diagonal, simplifying computations. But in cases where there’s a more complex covariance structure (e.g., spatial or temporal correlations), G would have off-diagonal elements, making the inversion more challenging.
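For instance, a sketch of a structured (non-diagonal) G, here an AR(1)-type correlation among four random effects (values made up):

```python
import numpy as np

# AR(1)-structured G: cov(u_i, u_j) = tau2 * rho**|i-j|, so off-diagonals are non-zero.
rho, tau2 = 0.6, 0.5
idx = np.arange(4)
G = tau2 * rho ** np.abs(idx[:, None] - idx[None, :])
print(np.linalg.inv(G).round(2))  # G^{-1} is no longer diagonal (tridiagonal for AR(1))
```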
In practice, when the variance components (G and R) are unknown, they need to be estimated first, often using methods like REML or ML, and then plugged into the MME to solve for β and u. This iterative process is why mixed models can be computationally intensive.
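In practice this step is usually handed to a library. As one hedged illustration, here is a sketch using statsmodels’ MixedLM to get REML estimates of the variance components for a random-intercept model (simulated data; in R one would typically use lme4::lmer):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated random-intercept data: 10 subjects, 5 records each.
rng = np.random.default_rng(2)
subjects = np.repeat(np.arange(10), 5)
u_true = rng.normal(scale=0.7, size=10)
y = 2.0 + u_true[subjects] + rng.normal(scale=0.5, size=50)
df = pd.DataFrame({"y": y, "subject": subjects})

# REML fit; the estimated variance components can then be plugged into the MME.
fit = smf.mixedlm("y ~ 1", df, groups=df["subject"]).fit(reml=True)
print(fit.cov_re)  # estimated random-intercept variance (the g in G = g*I)
print(fit.scale)   # estimated residual variance (sigma^2)
```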
So, summarizing my understanding: The matrix form of the Mixed Model Equations is a system that combines the fixed and random effects into a single equation. The structure of the matrix reflects the contributions of both the fixed effects design matrix (X) and the random effects design matrix (Z), along with the variance components (G and R). Solving this system provides estimates for β and predictions for u, taking into account the variability in both the fixed and random parts of the model.
I think I need to solidify this by going through a small example or maybe looking at how the matrices are structured in a simple case. Let’s say we have three observations from two subjects: two observations from subject 1 and one from subject 2. Suppose the fixed effect is a single intercept, and the random effect is the subject-specific intercept. Then:
- X would be a 3x1 matrix of ones.
- Z would be a 3x2 matrix where each row has a 1 in the column corresponding to the subject.
- G is a 2x2 matrix with the variance of the subject intercepts on the diagonal.
- R is σ²I₃, the residual variance times the identity matrix.
Plugging these into the MME:
[ X’R⁻¹X    X’R⁻¹Z        ] [ β ]   [ X’R⁻¹Y ]
[ Z’R⁻¹X    Z’R⁻¹Z + G⁻¹  ] [ u ] = [ Z’R⁻¹Y ]
Calculating each block:
X’R⁻¹X would be (1/σ²) * 3 (since X is a column of ones, and X’X is 3).
X’R⁻¹Z would be (1/σ²) * [2 1], since the first subject has two observations and the second has one.
Z’R⁻¹X would be the transpose of X’R⁻¹Z, so (1/σ²) * [2; 1].
Z’R⁻¹Z would be (1/σ²) * diag([2, 1]) since each subject’s column in Z has 2 and 1 ones respectively.
Adding G⁻¹ to Z’R⁻¹Z gives diag([2/σ² + 1/g, 1/σ² + 1/g]), assuming G is diag(g, g).
Before writing the full matrix out, check the block dimensions: X is 3x1 and Z is 3x2, so X’R⁻¹X is 1x1, X’R⁻¹Z is 1x2, Z’R⁻¹X is 2x1, and Z’R⁻¹Z + G⁻¹ is 2x2, giving a 3x3 system overall. With G⁻¹ = (1/g)·I₂ (since G = diag(g, g)), the full left-hand side matrix is:
[ 3/σ²      2/σ²          1/σ²         ]
[ 2/σ²      2/σ² + 1/g    0            ]
[ 1/σ²      0             1/σ² + 1/g   ]
And the right-hand side vector is:
[ X’R⁻¹Y ] = (1/σ²) * sum(Y)
[ Z’R⁻¹Y ] = (1/σ²) * [sum(Y for subject 1), sum(Y for subject 2)]
Solving this system would give the estimate for β (the fixed intercept) and the predictions for u₁ and u₂ (the random intercepts for each subject).
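Here is that toy system in numpy, with made-up values for Y and the variance components (σ² = 1, g = 0.5), so the numbers can be checked against the matrix written out above:

```python
import numpy as np

# Three observations: two from subject 1, one from subject 2 (Y values made up).
y = np.array([4.0, 6.0, 8.0])
X = np.ones((3, 1))
Z = np.array([[1., 0.],
              [1., 0.],
              [0., 1.]])
sigma2, g = 1.0, 0.5
Ri = np.eye(3) / sigma2          # R^{-1}
Gi = np.eye(2) / g               # G^{-1}

lhs = np.block([[X.T @ Ri @ X, X.T @ Ri @ Z],
                [Z.T @ Ri @ X, Z.T @ Ri @ Z + Gi]])
rhs = np.concatenate([X.T @ Ri @ y, Z.T @ Ri @ y])
print(lhs)                       # matches the 3x3 matrix above: [[3,2,1],[2,4,0],[1,0,3]]

sol = np.linalg.solve(lhs, rhs)
beta_hat, u_hat = sol[0], sol[1:]
print(beta_hat, u_hat)           # beta ~ 6.2; u ~ [-0.6, 0.6], shrunk toward zero
```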
This example helps to see how the matrix blocks are formed. The key takeaway is that the MME combines information from both the fixed and random effects, weighted by their respective variances (σ² and g). The inclusion of G⁻¹ in the lower right block penalizes the random effects, preventing them from being too large, which is especially important when the number of random effects is large relative to the number of observations.
Another thing to note is that if the random effects have large variance (g is large), then G⁻¹ is small, and the penalty term is less influential, allowing the random effects to vary more. Conversely, if g is small, the penalty is stronger, shrinking the random effects towards zero.
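Continuing the toy example (same made-up Y and σ² = 1), we can watch the shrinkage change with g:

```python
import numpy as np

y = np.array([4.0, 6.0, 8.0])
X = np.ones((3, 1))
Z = np.array([[1., 0.], [1., 0.], [0., 1.]])
for g in (0.01, 0.5, 100.0):
    Ri, Gi = np.eye(3), np.eye(2) / g   # sigma^2 = 1
    lhs = np.block([[X.T @ Ri @ X, X.T @ Ri @ Z],
                    [Z.T @ Ri @ X, Z.T @ Ri @ Z + Gi]])
    rhs = np.concatenate([X.T @ Ri @ y, Z.T @ Ri @ y])
    u_hat = np.linalg.solve(lhs, rhs)[1:]
    print(g, u_hat.round(3))  # tiny g -> u pulled to 0; large g -> u approaches the raw subject deviations
```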
Understanding the matrix form also helps in seeing how computational challenges arise. Inverting the large matrix becomes difficult as the number of fixed and random effects grows. This is why methods like Henderson’s iterative approaches and sparse matrix techniques are employed in practice.
In summary, the matrix form of the Mixed Model Equations is a systematic way to estimate both fixed and random effects by setting up a block matrix that incorporates the design matrices (X and Z) and the variance components (R and G). Solving this system provides the estimates while accounting for the correlation structure in the data.