In [ ]:
import numpy as np
import pandas as pd
In [ ]:
n = 10 # Original sample size
x = np.random.normal(size=n) # Normal(0,1) distribution, n samples
print(x)
[ 0.88816193 -0.36582982 0.54292555 1.87163441 -1.73881963 0.83138921 2.16109592 1.16487939 -0.26369009 -1.32868647]
Let's investigate the sampling error (standard deviation) of the sample mean $\frac{1}{n}\sum^n_{i=1}x_i$
- Its theoretical value is $true\_error \equiv \frac{\sigma}{\sqrt{n}}$; here $\sigma = 1$, i.e. $\frac{1}{\sqrt{n}}$
- The estimate from the current sample is $est\_error \equiv \frac{\hat{\sigma}}{\sqrt{n}}$, where $\hat{\sigma} = \sqrt{\frac{1}{n}\sum^n_{i=1}\left(x_i - \frac{1}{n}\sum^n_{j=1}x_j\right)^2}$
In [ ]:
x_mean = np.mean(x)
x_std = np.std(x) # sigma hat
x_mean_std = x_std / np.sqrt(n)
print("true_error: ", 1/np.sqrt(n))
print("est_error: ", x_mean_std)
true_error: 0.31622776601683794 est_error: 0.3857629549832874
Bootstrap error:
In [ ]:
B = 10000
boot = list()
for i in range(B):
    boot.append(np.random.choice(x, n))  # resample n points with replacement (np.random.choice's default)
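As a side note, the resampling loop above can also be done in a single vectorized call; a sketch (with a hypothetical seed, using the newer `Generator` API rather than the legacy `np.random` functions):

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, only for reproducibility
x_demo = rng.normal(size=10)    # stand-in for the original sample x

B_demo = 10000
# One call draws all B bootstrap samples at once: shape (B, n), with replacement.
boot_matrix = rng.choice(x_demo, size=(B_demo, x_demo.size), replace=True)
boot_means = boot_matrix.mean(axis=1)  # one bootstrap mean per row
```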
In [ ]:
boot[0:3]
Out[ ]:
[array([ 1.16487939, 0.54292555, -0.36582982, -1.73881963, 1.87163441, 1.16487939, 0.54292555, 1.87163441, 2.16109592, 0.54292555]), array([ 0.83138921, -0.26369009, -0.26369009, -0.26369009, -0.36582982, 0.54292555, 0.83138921, 0.54292555, 0.88816193, 1.87163441]), array([ 0.88816193, -1.32868647, -1.73881963, -0.26369009, 0.83138921, 0.83138921, -1.32868647, 0.88816193, 0.54292555, -0.36582982])]
In [ ]:
boot_mean = np.full(shape=B, fill_value=np.nan)
for i in range(len(boot)):
    boot_mean[i] = np.mean(boot[i])
In [ ]:
boot_mean[0:3]
Out[ ]:
array([ 0.77582507, 0.43515258, -0.10436846])
In [ ]:
x_mean_std_boot = np.sqrt(np.sum((boot_mean - np.mean(boot_mean))**2)/B) # bootstrapped error estimation
👆 It should be similar to x_mean_std
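Incidentally, since the formula divides by B rather than B-1, it is exactly the population standard deviation of the bootstrap means, so `np.std` (whose default is `ddof=0`) gives the same number; a quick check on a stand-in array:

```python
import numpy as np

rng = np.random.default_rng(1)        # hypothetical seed
means = rng.normal(size=1000)         # stand-in for boot_mean

manual = np.sqrt(np.sum((means - np.mean(means))**2) / means.size)
# np.std defaults to ddof=0, i.e. divide by the array length, matching the formula
assert np.isclose(manual, np.std(means))
```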
In [ ]:
print("est_error: ", x_mean_std)
print("bootstrapped_error: ", x_mean_std_boot)
est_error: 0.3857629549832874 bootstrapped_error: 0.3875187622541539
When n is small, x is not a good sample from N(0,1):
In [ ]:
print("true_error: ", 1/np.sqrt(n))
print("est_error: ", x_mean_std)
print("bootstrapped_error: ", x_mean_std_boot)
true_error: 0.31622776601683794 est_error: 0.3857629549832874 bootstrapped_error: 0.3875187622541539
The difference between the first and the second is governed by n (with only n points, $\hat{\sigma}$ is a noisy estimate of $\sigma$); the difference between the second and the third is governed by B (bootstrap resampling noise).
When n is large and the sampling procedure is good (i.i.d. in our case), all three of the above will be close.
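To see the role of B concretely, a sketch (with a hypothetical seed) that keeps the small sample fixed and grows only B — the bootstrapped error fluctuates less and settles toward est_error:

```python
import numpy as np

rng = np.random.default_rng(42)  # hypothetical seed, for reproducibility
n = 10
x = rng.normal(size=n)
est_error = np.std(x) / np.sqrt(n)  # plug-in estimate, as above

for B in [100, 1000, 10000]:
    # draw all B resamples at once, with replacement
    boot_means = rng.choice(x, size=(B, n), replace=True).mean(axis=1)
    print(B, np.std(boot_means))  # approaches est_error as B grows
```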
In [ ]:
n = 10000 # Original sample size
x = np.random.normal(size=n) # Normal(0,1) distribution, n samples
x_mean = np.mean(x)
x_std = np.std(x) # sigma hat
x_mean_std = x_std / np.sqrt(n)
B = 10000
boot = list()
for i in range(B):
    boot.append(np.random.choice(x, n))
boot_mean = np.full(shape=B, fill_value=np.nan)
for i in range(len(boot)):
    boot_mean[i] = np.mean(boot[i])
x_mean_std_boot = np.sqrt(np.sum((boot_mean - np.mean(boot_mean))**2)/B) # bootstrapped error estimation
print("true_error: ", 1/np.sqrt(n))
print("est_error: ", x_mean_std)
print("bootstrapped_error: ", x_mean_std_boot)
true_error: 0.01 est_error: 0.010094049225094291 bootstrapped_error: 0.010135792258784986