In [ ]:
import numpy as np
import pandas as pd
In [ ]:
n = 10 # Original sample size
x = np.random.normal(size=n) # Normal(0,1) distribution, n samples
print(x)
[ 0.88816193 -0.36582982  0.54292555  1.87163441 -1.73881963  0.83138921
  2.16109592  1.16487939 -0.26369009 -1.32868647]

Let's investigate the sampling error (standard deviation) of $\frac{1}{n}\sum^n_{i=1}x_i$

  • Its theoretical value is $true\_error \equiv \frac{\sigma}{\sqrt{n}}$; here $\sigma = 1$, so $true\_error = \frac{1}{\sqrt{n}}$
  • Its estimate from the current sample is $est\_error \equiv \frac{\hat{\sigma}}{\sqrt{n}}$, where $\hat{\sigma} = \sqrt{\frac{1}{n}\sum^n_{i=1}\left(x_i - \frac{1}{n}\sum^n_{j=1}x_j\right)^2}$
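Note that `np.std` defaults to `ddof=0`, i.e. the $\frac{1}{n}$ divisor used in the definition of $\hat{\sigma}$ above, so `np.std(x) / np.sqrt(n)` computes `est_error` directly. A minimal check:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0])
# sigma hat with the 1/n divisor, exactly as in the formula above
sigma_hat = np.sqrt(np.mean((x - np.mean(x))**2))
assert np.isclose(sigma_hat, np.std(x))  # np.std uses ddof=0 by default
print(sigma_hat)
```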
In [ ]:
x_mean = np.mean(x)
x_std = np.std(x) # sigma hat
x_mean_std = x_std / np.sqrt(n)
print("true_error: ", 1/np.sqrt(n))
print("est_error: ", x_mean_std)
true_error:  0.31622776601683794
est_error:  0.3857629549832874

Bootstrap error: resample x with replacement B times, take each resample's mean, and estimate the error as the standard deviation of those means:

In [ ]:
B = 10000
boot = list()
for i in range(B):
    boot.append(np.random.choice(x, n)) # resample n points with replacement (the default)
In [ ]:
boot[0:3]
Out[ ]:
[array([ 1.16487939,  0.54292555, -0.36582982, -1.73881963,  1.87163441,
         1.16487939,  0.54292555,  1.87163441,  2.16109592,  0.54292555]),
 array([ 0.83138921, -0.26369009, -0.26369009, -0.26369009, -0.36582982,
         0.54292555,  0.83138921,  0.54292555,  0.88816193,  1.87163441]),
 array([ 0.88816193, -1.32868647, -1.73881963, -0.26369009,  0.83138921,
         0.83138921, -1.32868647,  0.88816193,  0.54292555, -0.36582982])]
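The repeated values visible in these resamples are expected: `np.random.choice` samples with replacement by default (`replace=True`). A quick demonstration, using more draws than distinct values so that duplicates are guaranteed:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.arange(5)
resample = rng.choice(values, size=20)  # replace=True is the default
# 20 draws from only 5 distinct values must repeat some of them
print(resample)
print("distinct values drawn:", np.unique(resample).size)
```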
In [ ]:
boot_mean = np.full(shape=B,fill_value=np.nan)
for i in range(len(boot)):
    boot_mean[i] = np.mean(boot[i])
In [ ]:
boot_mean[0:3]
Out[ ]:
array([ 0.77582507,  0.43515258, -0.10436846])
In [ ]:
x_mean_std_boot = np.sqrt(np.sum((boot_mean - np.mean(boot_mean))**2)/B) # bootstrapped error estimation

👆 This should be close to x_mean_std, the plug-in estimate computed above

In [ ]:
print("est_error: ", x_mean_std)
print("bootstrapped_error: ", x_mean_std_boot)
est_error:  0.3857629549832874
bootstrapped_error:  0.3875187622541539

When n is small, x is a noisy sample from N(0,1), so the estimated errors can deviate noticeably from the true one:

In [ ]:
print("true_error: ", 1/np.sqrt(n))
print("est_error: ", x_mean_std)
print("bootstrapped_error: ", x_mean_std_boot)
true_error:  0.31622776601683794
est_error:  0.3857629549832874
bootstrapped_error:  0.3875187622541539

The gap between the first and the second (true_error vs. est_error) is governed by n; the gap between the second and the third (est_error vs. bootstrapped_error) is governed by B, since it is pure Monte Carlo resampling noise.
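The second gap can be checked empirically: holding the sample x fixed and increasing B, the bootstrap estimate converges to est_error. A small sketch (the seed and the B values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10
x = rng.normal(size=n)
est_error = np.std(x) / np.sqrt(n)

# With x fixed, the only randomness left is the resampling itself, so the
# gap to est_error is Monte Carlo noise that shrinks (on average) with B.
for B in (100, 10_000, 100_000):
    boot_mean = np.array([rng.choice(x, n).mean() for _ in range(B)])
    print(B, abs(np.std(boot_mean) - est_error))
```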

When n is large and the sampling procedure is good (i.i.d. in our case), all three values will be close.

In [ ]:
n = 10000 # Original sample size
x = np.random.normal(size=n) # Normal(0,1) distribution, n samples
x_mean = np.mean(x)
x_std = np.std(x) # sigma hat
x_mean_std = x_std / np.sqrt(n)

B = 10000
boot = list()
for i in range(B):
    boot.append(np.random.choice(x, n))

boot_mean = np.full(shape=B,fill_value=np.nan)
for i in range(len(boot)):
    boot_mean[i] = np.mean(boot[i])

x_mean_std_boot = np.sqrt(np.sum((boot_mean - np.mean(boot_mean))**2)/B) # bootstrapped error estimation

print("true_error: ", 1/np.sqrt(n))
print("est_error: ", x_mean_std)
print("bootstrapped_error: ", x_mean_std_boot)
true_error:  0.01
est_error:  0.010094049225094291
bootstrapped_error:  0.010135792258784986
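As a side note, the Python loop over B resamples can be vectorized: NumPy's `Generator.choice` accepts a 2-D `size`, so a single call draws all resamples at once. A sketch using `np.random.default_rng` (sizes kept modest here to limit memory; a `(B, n)` array holds `B * n` floats):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1_000, 10_000
x = rng.normal(size=n)

# One (B, n) draw with replacement: row i is the i-th bootstrap resample.
boot = rng.choice(x, size=(B, n))
boot_mean = boot.mean(axis=1)          # all B resample means in one shot
x_mean_std_boot = np.std(boot_mean)

print("true_error:        ", 1 / np.sqrt(n))
print("est_error:         ", np.std(x) / np.sqrt(n))
print("bootstrapped_error:", x_mean_std_boot)
```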