My daily Blogger: Week2_Probability Distributions

Probability Distributions Exercises

In the video you just watched, Rafa looked at distributions of heights, and asked what was the probability of someone being shorter than a given height. In this assessment, we are going to ask the same question, but instead of people and heights, we are going to look at whole countries and the average life expectancy in those countries.
We will use the data set called "Gapminder" which is available as an R-package on Github. This data set contains the life expectancy, GDP per capita, and population by country, every five years, from 1952 to 2007. It is an excerpt of a larger and more comprehensive set of data available on Gapminder.org, and the R package of this dataset was created by the statistics professor Jennifer Bryan.
First, install the gapminder data using:
install.packages("gapminder")
Next, load the gapminder data set. To find out more information about the data set, use ?gapminder which will bring up a help file. To return the first few lines of the data set, use the function head().
library(gapminder)
data(gapminder)
head(gapminder)
Create a vector 'x' of the life expectancies of each country for the year 1952. Plot a histogram of these life expectancies to see the spread of the different countries.

Probability Distributions Exercises #1

1 point possible (graded)

In statistics, the empirical cumulative distribution function (or empirical cdf or empirical distribution function) is the function F(a) for any a, which tells you the proportion of the values which are less than or equal to a.

We can compute F in two ways: the simplest way is to type mean(x <= a). This calculates the number of values in x which are less than or equal a, divided by the total number of values in x, in other words the proportion of values less than or equal to a.

The second way, which is a bit more complex for beginners, is to use the ecdf() function. This is a bit complicated because this is a function that doesn't return a value, but a function.

Let's continue, using the simpler, mean() function.

What is the proportion of countries in 1952 that have a life expectancy less than or equal to 40?

Explanation
dat1952 = gapminder[ gapminder$year == 1952, ]
x = dat1952$lifeExp
mean(x <= 40)

Answer: 0.2887324

Probability Distributions Exercises #2

1 point possible (graded)

What is the proportion of countries in 1952 that have a life expectancy between 40 and 60 years? Hint: this is the proportion that have a life expectancy less than or equal to 60 years, minus the proportion that have a life expectancy less than or equal to 40 years.

dat1952 = gapminder[ gapminder$year == 1952, ]
x = dat1952$lifeExp
mean(x <= 60) - mean(x <= 40)

Answer: 0.4647887

sapply() on a custom function

Suppose we want to plot the proportions of countries with life expectancy 'q' for a range of different years. R has a built in function for this, plot(ecdf(x)), but suppose we didn't know this. The function is quite easy to build, by turning the code from question 1.1 into a custom function, and then using sapply(). Our custom function will take an input variable 'q', and return the proportion of countries in 'x' less than or equal to q. The curly brackets { and }, allow us to write an R function which spans multiple lines:
prop = function(q) {
mean(x <= q)
}
Try this out for a value of 'q': prop(40)
Now let's build a range of q's that we can apply the function to:
qs = seq(from=min(x), to=max(x), length=20)
Print 'qs' to the R console to see what the seq() function gave us. Now we can use sapply() to apply the 'prop' function to each element of 'qs':
props = sapply(qs, prop)
Take a look at 'props', either by printing to the console, or by plotting it over qs:
plot(qs, props)
Note that we could also have written this in one line, by defining the 'prop' function but without naming it:
props = sapply(qs, function(q) mean(x <= q))
This last style is called using an "inline" function or an "anonymous" function. Let's compare our homemade plot with the pre-built one in R:
plot(ecdf(x))

Normal Distribution Exercises

For these exercises, we will be using the following dataset:

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile=filename)
x <- unlist( read.csv(filename) )

Here x represents the weights for the entire population.
Using the same process as before (in Null Distribution Exercises), set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. After that, set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages.

Normal Distribution Exercises #1

1 point possible (graded)

Use a histogram to "look" at the distribution of averages we get with a sample size of 5 and a sample size of 50. How would you say they differ?

library(rafalib) 
###mypa(1,2)r is optional. it is used to put both plots on one page
mypar(1,2)
hist(averages5, xlim=c(18,30))
hist(averages50, xlim=c(18,30))

See 
  this image for more information.

Answer: They both look roughly normal, but with a sample size of 50 the spread is smaller.

Normal Distribution Exercises #2

1 point possible (graded)

For the last set of averages, the ones obtained from a sample size of 50, what proportion are between 23 and 25?

mean( averages50 < 25 & averages50 > 23)

Answer: 0.976

Normal Distribution Exercises #3

1 point possible (graded)

Now ask the same question of a normal distribution with average 23.9 and standard deviation 0.43.

pnorm( (25-23.9) / 0.43)  - pnorm( (23-23.9) / 0.43)

Answer: 0.9765648

Population, Samples, and Estimates Exercises

For these exercises, we will be using the following dataset:

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"
filename <- basename(url)
download(url, destfile=filename)
dat <- read.csv(filename)

We will remove the lines that contain missing values:

dat <- na.omit( dat )

Population, Samples, and Estimates Exercises #1

1 point possible (graded)

Use dplyr to create a vector x with the body weight of all males on the control (chow) diet. What is this population's average?

x <- filter(dat, Sex=="M" & Diet=="chow") %>% select(Bodyweight) %>% unlist
mean(x)

Answer: 30.96381

Population, Samples, and Estimates Exercises #2

1 point possible (graded)

Now use the rafalib package and use the popsd function to compute the population standard deviation.

library(rafalib)
popsd(x)

Answer: 4.420501

Population, Samples, and Estimates Exercises #3

1 point possible (graded)

Set the seed at 1. Take a random sample

X

of size 25 from x. What is the sample average?

set.seed(1)
X <- sample(x,25)
mean(X)

Answer: 32.0956

Population, Samples, and Estimates Exercises #4

1 point possible (graded)

Use dplyr to create a vector y with the body weight of all males on the high fat hf) diet. What is this population's average?

y <- filter(dat, Sex=="M" & Diet=="hf") %>% select(Bodyweight) %>% unlist
mean(y)

Answer: 34.84793

Population, Samples, and Estimates Exercises #5

1 point possible (graded)

Now use the rafalib package and use the popsd function to compute the population standard deviation.

library(rafalib)
popsd(y)

Answer: 5.574609

Population, Samples, and Estimates Exercises #6

1 point possible (graded)

Set the seed at 1. Take a random sample

Y

of size 25 from y. What is the sample average?

set.seed(1)
Y <- sample(y,25)
mean(Y)

Answer: 34.768

Population, Samples, and Estimates Exercises #7

1 point possible (graded)

What is the difference in absolute value between

\bar{y} - \bar{x}

and

\bar{Y} - \bar{X}

abs( ( mean(y) - mean(x) ) - ( mean(Y) - mean(X) ) )

Answer: 1.211716

Population, Samples, and Estimates Exercises #8

1 point possible (graded)

Repeat the above for females. Make sure to set the seed to 1 before each sample call. What is the difference in absolute value between

\bar{y} - \bar{x}

and

\bar{X} - \bar{Y}

0.73648

x <- filter(dat, Sex=="F" & Diet=="chow") %>% select(Bodyweight) %>% unlist
set.seed(1)
X <- sample(x,25)
y <- filter(dat, Sex=="F" & Diet=="hf") %>% select(Bodyweight) %>% unlist
set.seed(1)
Y <- sample(y,25)
abs( ( mean(y) - mean(x) ) - ( mean(Y) - mean(X) ) )

Answer: 0.73648

Population, Samples, and Estimates Exercises #9

1 point possible (graded)

For the females, our sample estimates were closer to the population difference than with males. What is a possible explanation for this?

Answer: The population variance of the females is smaller than that of the males; thus, the sample variable has less variability.

3 ความคิดเห็น:

Unknown21 พฤษภาคม 2561 เวลา 14:46
Hi, first of all, thank you so much for the answer. I have learnt a lot from your blog and found it is really helpful with I stuck with my small problems.

For Normal Distribution Exercises #3, I would like to use
pnorm(25, 23.9, 0.43) - pnorm(23, 23.9, 0.43)

the result would be the same and correct, too.

Hope we could discuss more in the near future.

Best
Jackie
ตอบลบ
คำตอบ
Poonam Ray20 สิงหาคม 2564 เวลา 06:31
Thanks
ตอบลบ
คำตอบ
Farabee12 กันยายน 2564 เวลา 08:10
My code does not seem for calculate for M but works perfectly fine for F. What can I do
ตอบลบ
คำตอบ

เพิ่มความคิดเห็น

วันอังคารที่ 17 มกราคม พ.ศ. 2560

Week2_Probability Distributions

Probability Distributions Exercises

Probability Distributions Exercises #1

Probability Distributions Exercises #2

sapply() on a custom function

Normal Distribution Exercises

Normal Distribution Exercises #1

Normal Distribution Exercises #2

Normal Distribution Exercises #3

Population, Samples, and Estimates Exercises

Population, Samples, and Estimates Exercises #1

Population, Samples, and Estimates Exercises #2

Population, Samples, and Estimates Exercises #3

Population, Samples, and Estimates Exercises #4

Population, Samples, and Estimates Exercises #5

Population, Samples, and Estimates Exercises #6

Population, Samples, and Estimates Exercises #7

Population, Samples, and Estimates Exercises #8

Population, Samples, and Estimates Exercises #9

3 ความคิดเห็น: