Illustrating the sample variance bias with R and Mathematica

In my previous blog, I recall that we can demonstrate in a few steps that the sample variance is an unbiased estimator of the population variance when we divide by n-1:

\begin{aligned} E(S^2)&=E\Big[\frac{\Sigma_{i=1}^n (X_i-\bar X)^2}{n-1}\Big]\\ &=\sigma^2 \end{aligned}

Another way to think about this problem is to illustrate the bias when the sample variance is divided by just n. In this blog, I will use the new Wolfram function AIAssistant, that I discussed in another more recent blog, to illustrate the sample variance bias in R and in Mathematica.

So let us start with asking to write an example in R, I asked:

show me a graph of the sample variance bias in R

The answer:

set.seed(123) # for reproducibility

n <- 10 # sample size
N <- 1000 # number of simulations

true_variance <- 1 # true population variance

# create a function to simulate the bias in the sample variance estimator
sample_variance_bias <- function() {
  x <- rnorm(n, mean = 0, sd = sqrt(true_variance)) # generate a sample from a normal distribution
  biased_estimator <- sum((x - mean(x))^2) / n # compute the biased estimator (dividing by n instead of n-1)
  unbiased_estimator <- sum((x - mean(x))^2) / (n - 1) # compute the unbiased estimator (dividing by n-1)
  return(biased_estimator - true_variance) # return the bias of the biased estimator
}

# simulate the bias across N simulations
bias <- replicate(N, sample_variance_bias())

# plot the histogram of the bias
hist(bias, main = "Histogram of Sample Variance Bias", xlab = "Bias", ylab = "Frequency")

The description of that code is also provided:

This code generates a sample of size n from a normal distribution with mean 0 and standard deviation equal to the square root of the true population variance. It then computes both the biased estimator (dividing by n instead of n-1) and the unbiased estimator (dividing by n-1). This process is repeated across N simulations, and the bias of the biased estimator is computed for each simulation. Finally, a histogram of the bias values is plotted.

The resulting histogram should show a peak at a negative value, indicating that the biased estimator tends to underestimate the true variance of the population.

The result is pretty neat. We clearly see the downward bias for the sample variance bias estimator. Besides, the explanation is very clear.

By curiosity, I wanted to know whether this R code could be translated into the Wolfram Language.

write me the equivalent for Mathematica

The answer:

SeedRandom[123]; (* for reproducibility *)

n = 10; (* sample size *)
N = 1000; (* number of simulations *)

trueVariance = 1; (* true population variance *)

(* define a function to simulate the bias in the sample variance \
estimator *)
sampleVarianceBias[] := Module[{x, biasedEstimator, unbiasedEstimator},
    x = RandomVariate[NormalDistribution[0, Sqrt[trueVariance]], n]; (* 
  generate a sample from a normal distribution *)
    biasedEstimator = Total[(x - Mean[x])^2]/n; (* 
  compute the biased estimator (dividing by n instead of n-1) *)
    unbiasedEstimator = Total[(x - Mean[x])^2]/(n - 1); (* 
  compute the unbiased estimator (dividing by n-1) *)
    biasedEstimator - trueVariance (* 
  return the bias of the biased estimator *)
  ]

(* simulate the bias across N simulations *)
bias = Table[sampleVarianceBias[], N];

(* plot the histogram of the bias *)
Histogram[bias, Automatic, "PDF",
   PlotLabel -> "Histogram of Sample Variance Bias",
   AxesLabel -> {"Bias", "Frequency"}]

This code does not work as the N symbol is protected in Mathematica, so I entered the first error message:

"Symbol N is Protected."

The answer:

The error message you are seeing is "Symbol N is Protected". This means that you cannot assign a value to the symbol N because it is already defined as a protected symbol in Mathematica. A protected symbol is one that has special meaning or functionality in the language and cannot be modified by the user.

To fix this error, you can simply use a different symbol name for the number of simulations, such as numSims or simNum. For example:

SeedRandom[123]; (* for reproducibility *)

n = 10; (* sample size *)
numSims = 1000; (* number of simulations *)

trueVariance = 1; (* true population variance *)

(* define a function to simulate the bias in the sample variance \
estimator *)
sampleVarianceBias[] := Module[{x, biasedEstimator, unbiasedEstimator},
    x = RandomVariate[NormalDistribution[0, Sqrt[trueVariance]], n]; (* 
  generate a sample from a normal distribution *)
    biasedEstimator = Total[(x - Mean[x])^2]/n; (* 
  compute the biased estimator (dividing by n instead of n-1) *)
    unbiasedEstimator = Total[(x - Mean[x])^2]/(n - 1); (* 
  compute the unbiased estimator (dividing by n-1) *)
    biasedEstimator - trueVariance (* 
  return the bias of the biased estimator *)
  ]

(* simulate the bias across numSims simulations *)
bias = Table[sampleVarianceBias[], numSims];

(* plot the histogram of the bias *)
Histogram[bias, Automatic, "Count",
   PlotLabel -> "Histogram of Sample Variance Bias",
   AxesLabel -> {"Bias", "Frequency"}, 
 PlotRange -> {{-1, 2}, {0, 150}}, ChartStyle -> "Pastel",
 ImageSize -> {800, 400}]

This code should run without errors and produce the expected histogram.

The result, after changing a bit of the Histogram command:

We clearly see that using n in the sample variance formula produces a bias, as the distribution does not peak at 0.

Please note that the files for replicating this blog are available here: https://github.com/JamelSaadaoui/EconMacroBlog

MMAvarianceBIAS – 3 May 2023Download

1 Comment

The expansion of AI will likely shrink earnings inequality

Visualization of time-varying coefficients

US consumer sentiment and US geopolitical risks

LED Research Seminar

Do Supply Chain Disruptions Matter for Global Economic Conditions?

1 Comment

Leave a Reply Cancel reply

Related Posts

The expansion of AI will likely shrink earnings inequality

Visualization of time-varying coefficients

US consumer sentiment and US geopolitical risks

LED Research Seminar

Do Supply Chain Disruptions Matter for Global Economic Conditions?