Monte Carlo Simulation: Class #2

Monte Carlo Simulations: Class 2

Independent Study / Spring 2023

Francis L. Huang, PhD

huangf@missouri.edu

@flhuang

2023.01.25 / Updated 2023-04-22

1 / 37

Simulation stuff
Francis Huang / huangf@missouri.edu    
Last weekWe talked about what simulations are
Walked through an example of a simulation
Talked about designing a simulation:Design: what conditions?
Generate data: what's the data generating process?
Analyze: ...
Measure performance: ...
Visualize: ...

2 / 37

Simulation stuff
Francis Huang / huangf@missouri.edu    
Agenda:Revisit last week's discussion
Using simulations to understand statistics better
Running some t-tests from sampled dataUnderstanding the standard error

Transforming variables
Basic univariate, nonnormal data generation
Using the sample function
Rolling two dice
Simulating the Monty Hall problem (start)
3 / 37

Simulation stuff

Talked about generating data

If you obtained a correlation between two randomly generated, independent variables, how often would we expect to find statistical significance?

Went through this one by one...

set.seed(123)
x <- rnorm(100, 0, 1)
y <- rnorm(100, 0, 1)
cor.test(x, y)


    Pearson's product-moment correlation
data:  x and y
t = -0.49095, df = 98, p-value = 0.6246
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.2435805  0.1483291
sample estimates:
        cor 
-0.04953215

4 / 37

Simulation stuff

Talked about how to obtain the statistic of interest, in this case, the p value

tmp <- cor.test(x, y)
str(tmp)

List of 9
 $ statistic  : Named num -0.491
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named int 98
  ..- attr(*, "names")= chr "df"
  $ p.value    : num 0.625
 $ estimate   : Named num -0.0495
  ..- attr(*, "names")= chr "cor"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "correlation"
 $ alternative: chr "two.sided"
 $ method     : chr "Pearson's product-moment correlation"
 $ data.name  : chr "x and y"
 $ conf.int   : num [1:2] -0.244 0.148
  ..- attr(*, "conf.level")= num 0.95
 - attr(*, "class")= chr "htest"

tmp$p.value #is what we want

[1] 0.6245623
NA
NA
NA
NA

5 / 37

Simulation stuff

Create a loop...Save the value

This is a basic simulation, run the same test over a certain number of replications (in this case, 1,000: can be more)

reps <- 1000 #how many times to run this
pvals <- numeric(reps) #create a container
set.seed(123) #for replicability
for (i in 1:reps){
  x <- rnorm(100, 0, 1)
  y <- rnorm(100, 0, 1)
  tmp <- cor.test(x, y)
  pvals[i] <- tmp$p.value # saving the output in a vector
}

How many times should we expect this to be p < .05?

sum(pvals < .05) / length(pvals)

[1] 0.053

mean(pvals < .05) #or quite simply

[1] 0.053

6 / 37

Simulation stuff
Francis Huang / huangf@missouri.edu    
NOTE: this should be based on our understanding of the test--Why .05?
7 / 37

Simulation stuff

NOTE: this should be based on our understanding of the test--

Why .05?
That's our type I error rate-- just by chance, we can expect to find something statistically significant-- even when there is no real difference!
We can use simulations as a way to sharpen our understanding of our basic stat knowledge

7 / 37

Simulation stuff

Take another example: a one sample t-test

Create a population of interest, let's say 30,000 students at a university
We think that these students are higher ability (M = 105, SD = 15) than the average student (let's say M = 100, SD = 15)
We only know what the 'average' student ability is
If we were to sample $n$ number of students from the population, how many would need to find statistical significant differences?

set.seed(246)
popiq <- rnorm(n = 30000, mean = 105, sd = 15)

This is another way of doing things... sampling from a population... which is what we generally do anyway.

8 / 37

Simulation stuff

This is our population... we are in control of this-- we know ("theta") $\theta$ = true value

mean(popiq)

[1] 105.0702

sd(popiq)

[1] 14.99542

hist(popiq)
abline(v = mean(popiq), col = 'red')

9 / 37

Simulation stuff

Now an important part of what we do is make inferences from a sample $\rightarrow$ population

We want to generalize from our sample to the population
- H0: Ability $=$ 100
- H1: Ability $\ne$ 100 (can also be one-tailed)
How big a sample do we need to show this? (one sample t-test)

Source: Wikipedia

10 / 37

Simulation stuff

Take a sample: what if we use n = 20?

set.seed(100) #only doing this for my slides
s1 <- sample(popiq, 20) # randomly sample 20 (without replacement)
mean(s1)

[1] 101.3393

s2 <- sample(popiq, 20)
mean(s2) ### all different of course

[1] 102.4099

s3 <- sample(popiq, 20)
mean(s3) ### due to sampling variation

[1] 111.9371

s4 <- sample(popiq, 20)
mean(s4)

[1] 104.4906

11 / 37

Simulation stuff
Francis Huang / huangf@missouri.edu    
What if we run one-sample t-tests?t.test(x = s1, mu = 100)

    One Sample t-test
data:  s1
t = 0.37384, df = 19, p-value = 0.7127
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
  93.84122 108.83731
sample estimates:
mean of x 
 101.3393
t.test(x = s2, mu = 100)

    One Sample t-test
data:  s2
t = 0.51647, df = 19, p-value = 0.6115
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
  92.64343 112.17646
sample estimates:
mean of x 
 102.4099
12 / 37

Simulation stuff
Francis Huang / huangf@missouri.edu    
What if we take a 1000 samples of 20-- what would the distribution look like?13 / 37

Simulation stuff
Francis Huang / huangf@missouri.edu    
What if we take 1000 samples of 20-- what would the distribution look like?reps <- 1000
n20 <- numeric(reps)
set.seed(123)
for (i in 1:reps){
  n20[i] <- mean(sample(popiq, 
      size = 20))
}
mean(n20)

[1] 105.0358
sd(n20) ## what does this represent?

[1] 3.32792
14 / 37

Simulation stuff

What if we take 1000 samples of 20-- what would the distribution look like?

reps <- 1000
n20 <- numeric(reps)
set.seed(123)
for (i in 1:reps){
  n20[i] <- mean(sample(popiq, 
      size = 20))
}
mean(n20)

[1] 105.0358

sd(n20) ## what does this represent?

[1] 3.32792

Although many times, our sample mean is higher than 100, there are also many occasions, due to sampling variation, where the mean is LESS than 100

hist(n20)

14 / 37

Simulation stuff

Remember, our t-test is $\frac{\bar{x} - \mu}{SE_{\bar{x}}}$

What is the standard error of the mean? Measure of imprecision-- since we only have a sample

$${SE}_{\bar{x}} = \frac{\sigma_{\bar{x}}}{\sqrt{n}}$$

15/(sqrt(20))

[1] 3.354102

## The SE is the standard deviation of the sampling distribution of means
sd(n20) ### close

[1] 3.32792

15 / 37

Simulation stuff

What if we ran a thousand t-tests?

reps <- 1000
pv20 <- numeric(reps)
set.seed(123)
for (i in 1:reps){
  pv20[i] <- t.test(sample(popiq, 
      size = 20), mu = 100)$p.value
}
mean(pv20 < .05) #what does this represent?

[1] 0.29

16 / 37

Simulation stuff

What if we ran a thousand t-tests?

reps <- 1000
pv20 <- numeric(reps)
set.seed(123)
for (i in 1:reps){
  pv20[i] <- t.test(sample(popiq, 
      size = 20), mu = 100)$p.value
}
mean(pv20 < .05) #what does this represent?

[1] 0.29

This represents power to detect an effect (~29%)

16 / 37

Simulation stuff

What if we take 1000 samples of 50-- what would the distribution look like?

reps <- 1000
n50 <- numeric(reps)
set.seed(333)
for (i in 1:reps){
  n50[i] <- mean(sample(popiq, 
      size = 50))
}
mean(n50)

[1] 105.2158

sd(n50) ## what does this represent?

[1] 1.987627

NOTE: over repeated sampling, the mean of the means $\hat{\theta}$ should converge on the true value $\theta$ -- this means that the estimator is unbiased.

What is the theoretical SE?

15 / sqrt(50)

[1] 2.12132

This is more precise compared to n = 20

17 / 37

Simulation stuff

Compare distribution of means

hist(n20, xlim = c(95, 115))

hist(n50, xlim = c(95, 115))

NOTE: the xlim option is specified so that the x-axes are the same in both (will not just use the default min and max which differ)

18 / 37

Simulation stuff

Compare distribution of means

hist(n20, xlim = c(95, 115))

hist(n50, xlim = c(95, 115))

NOTE: the xlim option is specified so that the x-axes are the same in both (will not just use the default min and max which differ)

Many times, people think how accurate your estimate is depends on what percent of the population you sample (e.g., 10%, 50%)
NOTE: the validity of the point estimate does not depend on the % sampled from the population-- we only have 50 out of 30,000 individuals w/c is < 1%.

18 / 37

Simulation stuff

What if we ran a thousand t-tests with n = 50, 70?

reps <- 1000
pv50 <- numeric(reps)
set.seed(123)
for (i in 1:reps){
  pv50[i] <- t.test(sample(popiq, 
      size = 50), mu = 100)$p.value
}
mean(pv50 < .05) #what does this represent?

[1] 0.675

pv70 <- numeric(reps)
set.seed(333)
for (i in 1:reps){
  pv70[i] <- t.test(sample(popiq, 
      size = 70), mu = 100)$p.value
}
mean(pv70 < .05)

[1] 0.809

19 / 37

Simulation stuff

What if we ran a thousand t-tests with n = 50, 70?

reps <- 1000
pv50 <- numeric(reps)
set.seed(123)
for (i in 1:reps){
  pv50[i] <- t.test(sample(popiq, 
      size = 50), mu = 100)$p.value
}
mean(pv50 < .05) #what does this represent?

[1] 0.675

pv70 <- numeric(reps)
set.seed(333)
for (i in 1:reps){
  pv70[i] <- t.test(sample(popiq, 
      size = 70), mu = 100)$p.value
}
mean(pv70 < .05)

[1] 0.809

Not bad, with only n = 70, we can get a good estimate of the population mean!

19 / 37

Simulation stuff

Transforming variables

You can transform any random normal variable $\mathcal{N}(0, 1)$ ('normally distributed with a mean of 0 and a variance of 1') by

adding a constant (this will be the new mean)
multiplying by a factor (this will be the new SD): e.g., $\mathcal{N}(50, 10^2)$

set.seed(312)
test <- rnorm(10000) #m = 0, sd = 1
mean(test)

[1] -0.004945618

sd(test)

[1] 0.9886146

test1 <- test * 10 + 50 
# now called a t-statistic
mean(test1)

[1] 49.95054

sd(test1)

[1] 9.886146

20 / 37

Simulation stuff

What if our population did not have a normal distribution? What would the sampling distribution of means look like? $\chi^2$ distribution

set.seed(111)
skewpop <- rchisq(30000, df = 1)
##  mean = df; var = 2df
mean(skewpop)

[1] 1.002644

var(skewpop)

[1] 2.021265

sd(skewpop)

[1] 1.421712

hist(skewpop)

21 / 37

Simulation stuff

Uniform distribution (all values are equally likely) is also common...

unipop <- runif(10000, 65, 135)
# var = (b - a)^2 / 12; 
# mean = (a + b) /2
mean(unipop)

[1] 100.3927

(65 + 135) / 2

[1] 100

var(unipop)

[1] 403.724

(135 - 65)^2 / 12

[1] 408.3333

sd(unipop)

[1] 20.09288

hist(unipop)

NOTE: runif(1000) will generated values from 0 to 1 by default.

22 / 37

Simulation stuff

NOTE: all random numbers distributions are based on the random uniform distribution

All random number generator functions are based on RANUNI, because all distributions can be obtained from a uniform distribution according to a theorem of probability theory. The transformations from uniform to other distributions are carried out by the inverse transform method, the Box-Müller transformation, and the acceptance-rejection procedure applied to the uniform variates generated by RANUNI. Internally, SAS generates only uniform random numbers with RANUNI and transforms them to the desired distribution.

Fan et al. (2002). SAS for Monte Carlo studies: A guide for quantitative researchers. p. 42

23 / 37

Simulation stuff

Using the `faux` package... transform uni to $\mathcal{N}$

xnorm <- faux::unif2norm(unipop)
hist(xnorm)

The faux packages has many functions to transform distributions. You can view the code as well how this is done.

24 / 37

Simulation stuff

Rolling two dice: what's the most common roll?

Uniform distribution, 1 to 6

Test this out first:

d1 <- sample(1:6, 1)
d2 <- sample(1:6, 1)
tot <- d1 + d2
tot

[1] 9

25 / 37

Simulation stuff

Catan board game example...

Rolling for resources:

Source: https://www.boardgameanalysis.com/what-is-a-balanced-catan-board/

26 / 37

Simulation stuff

Do this 1,000 times

rolls <- numeric(1000)
set.seed(123)
for (i in 1:1000){
  d1 <- sample(1:6, 1)
  d2 <- sample(1:6, 1)
  rolls[i] <- d1 + d2
}

27 / 37

Simulation stuff

Examine frequencies and plot this out

sjmisc::frq(rolls)

x <numeric> 
# total N=1000 valid N=1000 mean=6.96 sd=2.46
Value |   N | Raw % | Valid % | Cum. %
--------------------------------------
    2 |  35 |  3.50 |    3.50 |   3.50
    3 |  62 |  6.20 |    6.20 |   9.70
    4 |  87 |  8.70 |    8.70 |  18.40
    5 | 101 | 10.10 |   10.10 |  28.50
    6 | 138 | 13.80 |   13.80 |  42.30
    7 | 154 | 15.40 |   15.40 |  57.70
    8 | 131 | 13.10 |   13.10 |  70.80
    9 | 123 | 12.30 |   12.30 |  83.10
   10 |  96 |  9.60 |    9.60 |  92.70
   11 |  49 |  4.90 |    4.90 |  97.60
   12 |  24 |  2.40 |    2.40 | 100.00
 <NA> |   0 |  0.00 |    <NA> |   <NA>

barplot(table(rolls))

Many ways to roll a 7:

6-1 / 1-6
3-4 / 4-3
5-2 / 2-5

28 / 37

Simulation stuff

This is probably a good place to talk about functions in R...

Things that do something in R are functions: functionname()
This is just an example of a function with no options available

roller <- function(){
  d1 <- sample(1:6, 1)
  d2 <- sample(1:6, 1)
  d1 + d2
}
#run the function
roller() #it works!

[1] 3

roller()

[1] 8

When you run it, need the (). If not, you will just see the code.

Can use the function 1,000 times

out <- replicate(1000, roller())
str(out)

 int [1:1000] 9 4 9 8 8 11 5 5 7 8 ...

barplot(table(out))

Get the same results!

29 / 37

Simulation stuff

The `sample` function is pretty useful

Let's say you have a population that is 50% White, 20% Black, 15% Hispanic, and 15% Other

race <- sample(c('w', 'b', 'h', 'o'), 100, replace = TRUE, 
            prob = c(.5, .2, .15, .15))
sjmisc::frq(race)

x <character> 
# total N=100 valid N=100 mean=3.02 sd=1.19
Value |  N | Raw % | Valid % | Cum. %
-------------------------------------
b     | 18 | 18.00 |   18.00 |     18
h     | 15 | 15.00 |   15.00 |     33
o     | 14 | 14.00 |   14.00 |     47
w     | 53 | 53.00 |   53.00 |    100
<NA>  |  0 |  0.00 |    <NA> |   <NA>

Can also use it for bootstrapping
Or you can create a population (like what we did) and sample from it

30 / 37

Simulation stuff

Simulation basics: Monty Hall Revisited

As mentioned, recently reminded about this by the feature article in Chance magazine

doors

Marilyn vos Savant, in her weekly “Ask Marilyn” column in Parade magazine, answered a reader’s question about a game show very similar to “Let’s Make a Deal”:

“Dear Marilyn: Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car, the others, goats. You pick a door, say number 1, and the host, who knows what’s behind the doors, opens another door, say number 3, which has a goat. He says to you, ‘Do you want to pick door number two?’ Is it to your advantage to switch your choice of doors?” –Craig F. Whitaker.

Source: https://chance.amstat.org/2022/11/monty-hall/. On another note, she was born in St. Louis, MO in 1946.

31 / 37

Simulation stuff

A lot has been written about this (even book/s on the topic):

At first glance, the solution seems pretty straightforward. With two doors left, one of which has the car behind it and the other a goat, chances must be 50-50. In other words, it should make no difference at all whether you switch doors or stick to the door of your initial choice. But vos Savant replied: Dear Craig: "Yes, you should switch. The first door has a 1/3 chance of winning, but the second door has a 2/3 chance."

Source: Wikipedia; Chance magazine

32 / 37

Simulation stuff

Many people could not believe her answer (from Chance Magazine)

She received more than 10,000 letters from readers of her column, the vast majority of whom were absolutely convinced that she was wrong

One reader ... wrote: “…I am sure you will receive many letters on this topic from high school and college students. Perhaps you should keep a few addresses for help with future columns.” In other words, vos Savant, although once listed in the Guinness Book of Records for her exceptionally high IQ, supposedly did not have the slightest understanding of probability.

Even the Hungarian mathematical genius Paul Erdős initially thought vos Savant’s solution was impossible

To convince people: She called on math classes across the United States to take a die, three paper cups, and a penny; turn the cups over; and hide a penny under one of the cups. Then the students—one being the ignorant contestant, another the knowing host—should simulate the game, repeating each of the two strategies, switching and sticking, 200 times [LET'S DO THAT!]

33 / 37

Simulation stuff
Francis Huang / huangf@missouri.edu    
Before we get into the details of writing a simulation, let's understand first WHY her answer is correctLet's write out all the possible combinations of the first choice (picking and sticking with the door)
#initial choice
IC <- rep(1:3, each = 3)
#winning door
WD <- rep(1:3, 3) 
#Win (1) if the same else lose (0)
res <- ifelse(IC == WD, 1, 0) 
xx <- data.frame(IC, WD, res)

xx

  IC WD res
1  1  1   1
2  1  2   0
3  1  3   0
4  2  1   0
5  2  2   1
6  2  3   0
7  3  1   0
8  3  2   0
9  3  3   1
mean(xx$res)

[1] 0.3333333
34 / 37

Simulation stuff

(cont.) what if we switch doors? This is my way of understanding this!

NOTE: the host will only open a door that has a goat-- the host will not open a door with a car! [OD = open door; SD = switched door]

sw

  IC WD  OD  SD res
1  1  1 2/3 3/2   0
2  1  2   3   2   1
3  1  3   2   3   1
4  2  1   3   1   1
5  2  2 1/3 3/1   0
6  2  3   1   3   1
7  3  1   2   1   1
8  3  2   1   2   1
9  3  3 1/2 2/1   0

mean(sw$res)

[1] 0.6666667

You will only lose if you chose correct the first time! If you chose wrong (and switch), you will win.
Your chance of being wrong is 2/3rds as well.

35 / 37

Simulation stuff

So how do we go about programming this?

xkcd

Source: https://xkcd.com/1282/

36 / 37

Simulation stuff

The no-switch condition

noswitch <- function(){
  ic <- sample(1:3, 1)
  wd <- sample(1:3, 1)
  ifelse(ic == wd, 1, 0) #the last value is returned by the function
}
res <- replicate(10000, noswitch())
mean(res)

[1] 0.3297

Once you start the switchdoor condition, you may want to simulation opening a door and then checking if the switched door is a winning door

This may help check which doors to remove: https://www.statology.org/remove-element-from-vector-r/

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Monte Carlo Simulations: Class 2

Independent Study / Spring 2023

Last week

Agenda:

Talked about generating data

Talked about how to obtain the statistic of interest, in this case, the p value

Create a loop...Save the value

How many times should we expect this to be p < .05?

NOTE: this should be based on our understanding of the test--

NOTE: this should be based on our understanding of the test--

Take another example: a one sample t-test

This is our population... we are in control of this-- we know ("theta") \(\theta\) = true value

Now an important part of what we do is make inferences from a sample \(\rightarrow\) population

Take a sample: what if we use n = 20?

What if we run one-sample t-tests?

What if we take a 1000 samples of 20-- what would the distribution look like?

What if we take 1000 samples of 20-- what would the distribution look like?

What if we take 1000 samples of 20-- what would the distribution look like?

Remember, our t-test is \(\frac{\bar{x} - \mu}{SE_{\bar{x}}}\)

What if we ran a thousand t-tests?

What if we ran a thousand t-tests?

What if we take 1000 samples of 50-- what would the distribution look like?

Compare distribution of means

Compare distribution of means

What if we ran a thousand t-tests with n = 50, 70?

What if we ran a thousand t-tests with n = 50, 70?

Transforming variables

What if our population did not have a normal distribution? What would the sampling distribution of means look like? \(\chi^2\) distribution

Uniform distribution (all values are equally likely) is also common...

NOTE: all random numbers distributions are based on the random uniform distribution

Using the faux package... transform uni to \(\mathcal{N}\)

Rolling two dice: what's the most common roll?

Catan board game example...

Do this 1,000 times

Examine frequencies and plot this out

This is probably a good place to talk about functions in R...

The sample function is pretty useful

Simulation basics: Monty Hall Revisited

A lot has been written about this (even book/s on the topic):

Many people could not believe her answer (from Chance Magazine)

Before we get into the details of writing a simulation, let's understand first WHY her answer is correct

(cont.) what if we switch doors? This is my way of understanding this!

So how do we go about programming this?

The no-switch condition

Last week

Help

Using the `faux` package... transform uni to \(\mathcal{N}\)

The `sample` function is pretty useful