Probability Distributions Exercises
In the video you just watched, Rafa looked at distributions of
heights, and asked what was the probability of someone being shorter
than a given height. In this assessment, we are going to ask the same
question, but instead of people and heights, we are going to look at
whole countries and the average life expectancy in those countries.
We will use the data set called "Gapminder" which is available as an R-package on Github. This data set contains the life expectancy, GDP per capita, and population by country, every five years, from 1952 to 2007. It is an excerpt of a larger and more comprehensive set of data available on Gapminder.org, and the R package of this dataset was created by the statistics professor Jennifer Bryan.
First, install the gapminder data using:
install.packages("gapminder")
Next, load the gapminder data set. To find out more information about the data set, use ?gapminder which will bring up a help file. To return the first few lines of the data set, use the function head().
Create a vector 'x' of the life expectancies of each country for the year 1952. Plot a histogram of these life expectancies to see the spread of the different countries.
We will use the data set called "Gapminder" which is available as an R-package on Github. This data set contains the life expectancy, GDP per capita, and population by country, every five years, from 1952 to 2007. It is an excerpt of a larger and more comprehensive set of data available on Gapminder.org, and the R package of this dataset was created by the statistics professor Jennifer Bryan.
First, install the gapminder data using:
install.packages("gapminder")
Next, load the gapminder data set. To find out more information about the data set, use ?gapminder which will bring up a help file. To return the first few lines of the data set, use the function head().
library(gapminder)
data(gapminder)
head(gapminder)
Create a vector 'x' of the life expectancies of each country for the year 1952. Plot a histogram of these life expectancies to see the spread of the different countries.
Probability Distributions Exercises #1
1 point possible (graded)
In statistics, the empirical cumulative distribution function (or
empirical cdf or empirical distribution function) is the function F(a)
for any a, which tells you the proportion of the values which are less
than or equal to a.We can compute F in two ways: the simplest way is to type mean(x <= a). This calculates the number of values in x which are less than or equal a, divided by the total number of values in x, in other words the proportion of values less than or equal to a.
The second way, which is a bit more complex for beginners, is to use the ecdf() function. This is a bit complicated because this is a function that doesn't return a value, but a function.
Let's continue, using the simpler, mean() function.
Explanation
dat1952 = gapminder[ gapminder$year == 1952, ]
x = dat1952$lifeExp
mean(x <= 40)
dat1952 = gapminder[ gapminder$year == 1952, ]
x = dat1952$lifeExp
mean(x <= 40)
Probability Distributions Exercises #2
1 point possible (graded)
dat1952 = gapminder[ gapminder$year == 1952, ]
x = dat1952$lifeExp
mean(x <= 60) - mean(x <= 40)
0.4647887
sapply() on a custom function
Suppose we want to plot the proportions of countries with life expectancy 'q' for a range of different years. R has a built in function for this, plot(ecdf(x)), but suppose we didn't know this. The function is quite easy to build, by turning the code from question 1.1 into a custom function, and then using sapply(). Our custom function will take an input variable 'q', and return the proportion of countries in 'x' less than or equal to q. The curly brackets { and }, allow us to write an R function which spans multiple lines:prop = function(q) {
mean(x <= q)
}
Try this out for a value of 'q': prop(40)
Now let's build a range of q's that we can apply the function to:
qs = seq(from=min(x), to=max(x), length=20)
Print 'qs' to the R console to see what the seq() function gave us. Now we can use sapply() to apply the 'prop' function to each element of 'qs':
props = sapply(qs, prop)
Take a look at 'props', either by printing to the console, or by plotting it over qs:
plot(qs, props)
Note that we could also have written this in one line, by defining the 'prop' function but without naming it:
props = sapply(qs, function(q) mean(x <= q))
This last style is called using an "inline" function or an "anonymous" function. Let's compare our homemade plot with the pre-built one in R:
plot(ecdf(x))
Normal Distribution Exercises
For these exercises, we will be using the following dataset:
Using the same process as before (in Null Distribution Exercises), set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. After that, set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages.
library(downloader) url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv" filename <- basename(url) download(url, destfile=filename) x <- unlist( read.csv(filename) )Here
x
represents the weights for the entire population.
Using the same process as before (in Null Distribution Exercises), set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. After that, set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages.
Normal Distribution Exercises #1
1 point possible (graded)
Use a histogram to "look" at the distribution of averages we get
with a sample size of 5 and a sample size of 50. How would you say they
differ?library(rafalib) ###mypa(1,2)r is optional. it is used to put both plots on one page mypar(1,2) hist(averages5, xlim=c(18,30)) hist(averages50, xlim=c(18,30))
See
this image for more information.
Answer: They both look roughly normal, but with a sample size of 50 the spread is smaller.
Normal Distribution Exercises #2
1 point possible (graded)
For the last set of averages, the ones obtained from a sample size of 50, what proportion are between 23 and 25?mean( averages50 < 25 & averages50 > 23)
Answer: 0.976
Normal Distribution Exercises #3
1 point possible (graded)
Now ask the same question of a normal distribution with average 23.9 and standard deviation 0.43.
pnorm( (25-23.9) / 0.43) - pnorm( (23-23.9) / 0.43)
Answer: 0.9765648
Population, Samples, and Estimates Exercises
For these exercises, we will be using the following dataset:
library(downloader) url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv" filename <- basename(url) download(url, destfile=filename) dat <- read.csv(filename)We will remove the lines that contain missing values:
dat <- na.omit( dat )
Population, Samples, and Estimates Exercises #1
1 point possible (graded)
Use dplyr
to create a vector x
with the body weight of all males on the control (chow
) diet. What is this population's average?x <- filter(dat, Sex=="M" & Diet=="chow") %>% select(Bodyweight) %>% unlist
mean(x)
Answer: 30.96381
Population, Samples, and Estimates Exercises #2
1 point possible (graded)
Now use the
Answer: 4.420501
of size 25 from
rafalib
package and use the popsd
function to compute the population standard deviation.library(rafalib) popsd(x)
Answer: 4.420501
Population, Samples, and Estimates Exercises #3
1 point possible (graded)
Set the seed at 1. Take a random sample
x
. What is the sample average? set.seed(1) X <- sample(x,25) mean(X)
Answer: 32.0956
Population, Samples, and Estimates Exercises #4
1 point possible (graded)
Use dplyr
to create a vector y
with the body weight of all males on the high fat hf
) diet. What is this population's average?y <- filter(dat, Sex=="M" & Diet=="hf") %>% select(Bodyweight) %>% unlist mean(y)
Answer: 34.84793
Population, Samples, and Estimates Exercises #5
1 point possible (graded)
Now use the
Answer: 5.574609
of size 25 from
rafalib
package and use the popsd
function to compute the population standard deviation.library(rafalib) popsd(y)
Answer: 5.574609
Population, Samples, and Estimates Exercises #6
1 point possible (graded)
Set the seed at 1. Take a random sample
y
. What is the sample average? set.seed(1) Y <- sample(y,25) mean(Y)
Answer: 34.768
Population, Samples, and Estimates Exercises #7
1 point possible (graded)
What is the difference in absolute value between
abs( ( mean(y) - mean(x) ) - ( mean(Y) - mean(X) ) )
Answer: 1.211716
Population, Samples, and Estimates Exercises #8
1 point possible (graded)
Repeat the above for females. Make sure to set the seed to 1 before each
sample
call. What is the difference in absolute value between x <- filter(dat, Sex=="F" & Diet=="chow") %>% select(Bodyweight) %>% unlist set.seed(1) X <- sample(x,25) y <- filter(dat, Sex=="F" & Diet=="hf") %>% select(Bodyweight) %>% unlist set.seed(1) Y <- sample(y,25) abs( ( mean(y) - mean(x) ) - ( mean(Y) - mean(X) ) )
Answer: 0.73648
Population, Samples, and Estimates Exercises #9
1 point possible (graded)
For the females, our sample estimates were closer to the population
difference than with males. What is a possible explanation for this?
Answer: The population variance of the females is smaller than that of the males; thus, the sample variable has less variability.
Hi, first of all, thank you so much for the answer. I have learnt a lot from your blog and found it is really helpful with I stuck with my small problems.
ตอบลบFor Normal Distribution Exercises #3, I would like to use
pnorm(25, 23.9, 0.43) - pnorm(23, 23.9, 0.43)
the result would be the same and correct, too.
Hope we could discuss more in the near future.
Best
Jackie
Thanks
ตอบลบMy code does not seem for calculate for M but works perfectly fine for F. What can I do
ตอบลบ