My daily Blogger: มกราคม 2017

วันอังคารที่ 17 มกราคม พ.ศ. 2560

Week2_Probability Distributions

Probability Distributions Exercises

In the video you just watched, Rafa looked at distributions of heights, and asked what was the probability of someone being shorter than a given height. In this assessment, we are going to ask the same question, but instead of people and heights, we are going to look at whole countries and the average life expectancy in those countries.
We will use the data set called "Gapminder" which is available as an R-package on Github. This data set contains the life expectancy, GDP per capita, and population by country, every five years, from 1952 to 2007. It is an excerpt of a larger and more comprehensive set of data available on Gapminder.org, and the R package of this dataset was created by the statistics professor Jennifer Bryan.
First, install the gapminder data using:
install.packages("gapminder")
Next, load the gapminder data set. To find out more information about the data set, use ?gapminder which will bring up a help file. To return the first few lines of the data set, use the function head().
library(gapminder)
data(gapminder)
head(gapminder)
Create a vector 'x' of the life expectancies of each country for the year 1952. Plot a histogram of these life expectancies to see the spread of the different countries.

Probability Distributions Exercises #1

1 point possible (graded)

In statistics, the empirical cumulative distribution function (or empirical cdf or empirical distribution function) is the function F(a) for any a, which tells you the proportion of the values which are less than or equal to a.

We can compute F in two ways: the simplest way is to type mean(x <= a). This calculates the number of values in x which are less than or equal a, divided by the total number of values in x, in other words the proportion of values less than or equal to a.

The second way, which is a bit more complex for beginners, is to use the ecdf() function. This is a bit complicated because this is a function that doesn't return a value, but a function.

Let's continue, using the simpler, mean() function.

What is the proportion of countries in 1952 that have a life expectancy less than or equal to 40?

Explanation
dat1952 = gapminder[ gapminder$year == 1952, ]
x = dat1952$lifeExp
mean(x <= 40)

Answer: 0.2887324

Probability Distributions Exercises #2

1 point possible (graded)

What is the proportion of countries in 1952 that have a life expectancy between 40 and 60 years? Hint: this is the proportion that have a life expectancy less than or equal to 60 years, minus the proportion that have a life expectancy less than or equal to 40 years.

dat1952 = gapminder[ gapminder$year == 1952, ]
x = dat1952$lifeExp
mean(x <= 60) - mean(x <= 40)

Answer: 0.4647887

sapply() on a custom function

Suppose we want to plot the proportions of countries with life expectancy 'q' for a range of different years. R has a built in function for this, plot(ecdf(x)), but suppose we didn't know this. The function is quite easy to build, by turning the code from question 1.1 into a custom function, and then using sapply(). Our custom function will take an input variable 'q', and return the proportion of countries in 'x' less than or equal to q. The curly brackets { and }, allow us to write an R function which spans multiple lines:
prop = function(q) {
mean(x <= q)
}
Try this out for a value of 'q': prop(40)
Now let's build a range of q's that we can apply the function to:
qs = seq(from=min(x), to=max(x), length=20)
Print 'qs' to the R console to see what the seq() function gave us. Now we can use sapply() to apply the 'prop' function to each element of 'qs':
props = sapply(qs, prop)
Take a look at 'props', either by printing to the console, or by plotting it over qs:
plot(qs, props)
Note that we could also have written this in one line, by defining the 'prop' function but without naming it:
props = sapply(qs, function(q) mean(x <= q))
This last style is called using an "inline" function or an "anonymous" function. Let's compare our homemade plot with the pre-built one in R:
plot(ecdf(x))

Normal Distribution Exercises

For these exercises, we will be using the following dataset:

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile=filename)
x <- unlist( read.csv(filename) )

Here x represents the weights for the entire population.
Using the same process as before (in Null Distribution Exercises), set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. After that, set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages.

Normal Distribution Exercises #1

1 point possible (graded)

Use a histogram to "look" at the distribution of averages we get with a sample size of 5 and a sample size of 50. How would you say they differ?

library(rafalib) 
###mypa(1,2)r is optional. it is used to put both plots on one page
mypar(1,2)
hist(averages5, xlim=c(18,30))
hist(averages50, xlim=c(18,30))

See 
  this image for more information.

Answer: They both look roughly normal, but with a sample size of 50 the spread is smaller.

Normal Distribution Exercises #2

1 point possible (graded)

For the last set of averages, the ones obtained from a sample size of 50, what proportion are between 23 and 25?

mean( averages50 < 25 & averages50 > 23)

Answer: 0.976

Normal Distribution Exercises #3

1 point possible (graded)

Now ask the same question of a normal distribution with average 23.9 and standard deviation 0.43.

pnorm( (25-23.9) / 0.43)  - pnorm( (23-23.9) / 0.43)

Answer: 0.9765648

Population, Samples, and Estimates Exercises

For these exercises, we will be using the following dataset:

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv"
filename <- basename(url)
download(url, destfile=filename)
dat <- read.csv(filename)

We will remove the lines that contain missing values:

dat <- na.omit( dat )

Population, Samples, and Estimates Exercises #1

1 point possible (graded)

Use dplyr to create a vector x with the body weight of all males on the control (chow) diet. What is this population's average?

x <- filter(dat, Sex=="M" & Diet=="chow") %>% select(Bodyweight) %>% unlist
mean(x)

Answer: 30.96381

Population, Samples, and Estimates Exercises #2

1 point possible (graded)

Now use the rafalib package and use the popsd function to compute the population standard deviation.

library(rafalib)
popsd(x)

Answer: 4.420501

Population, Samples, and Estimates Exercises #3

1 point possible (graded)

Set the seed at 1. Take a random sample

X

of size 25 from x. What is the sample average?

set.seed(1)
X <- sample(x,25)
mean(X)

Answer: 32.0956

Population, Samples, and Estimates Exercises #4

1 point possible (graded)

Use dplyr to create a vector y with the body weight of all males on the high fat hf) diet. What is this population's average?

y <- filter(dat, Sex=="M" & Diet=="hf") %>% select(Bodyweight) %>% unlist
mean(y)

Answer: 34.84793

Population, Samples, and Estimates Exercises #5

1 point possible (graded)

Now use the rafalib package and use the popsd function to compute the population standard deviation.

library(rafalib)
popsd(y)

Answer: 5.574609

Population, Samples, and Estimates Exercises #6

1 point possible (graded)

Set the seed at 1. Take a random sample

Y

of size 25 from y. What is the sample average?

set.seed(1)
Y <- sample(y,25)
mean(Y)

Answer: 34.768

Population, Samples, and Estimates Exercises #7

1 point possible (graded)

What is the difference in absolute value between

\bar{y} - \bar{x}

and

\bar{Y} - \bar{X}

abs( ( mean(y) - mean(x) ) - ( mean(Y) - mean(X) ) )

Answer: 1.211716

Population, Samples, and Estimates Exercises #8

1 point possible (graded)

Repeat the above for females. Make sure to set the seed to 1 before each sample call. What is the difference in absolute value between

\bar{y} - \bar{x}

and

\bar{X} - \bar{Y}

0.73648

x <- filter(dat, Sex=="F" & Diet=="chow") %>% select(Bodyweight) %>% unlist
set.seed(1)
X <- sample(x,25)
y <- filter(dat, Sex=="F" & Diet=="hf") %>% select(Bodyweight) %>% unlist
set.seed(1)
Y <- sample(y,25)
abs( ( mean(y) - mean(x) ) - ( mean(Y) - mean(X) ) )

Answer: 0.73648

Population, Samples, and Estimates Exercises #9

1 point possible (graded)

For the females, our sample estimates were closer to the population difference than with males. What is a possible explanation for this?

Answer: The population variance of the females is smaller than that of the males; thus, the sample variable has less variability.

Week_2:PH525.1x

For these exercises, we will be using the following dataset:

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile=filename)
x <- unlist( read.csv(filename) )

Here x represents the weights for the entire population.

Random Variables Exercises #1

1 point possible (graded)

What is the average of these weights?

mean(x)

Answer: 23.89338

Random Variables Exercises #2

1 point possible (graded)

After setting the seed at 1, set.seed(1) take a random sample of size 5. What is the absolute value (use abs) of the difference between the average of the sample and the average of all the values?

set.seed(1)
X <- sample(x,5)
abs( mean(X) - mean(x) )

Answer: 0.2706222

Random Variables Exercises #3

1 point possible (graded)

After setting the seed at 5, set.seed(5) take a random sample of size 5. What is the absolute value of the difference between the average of the sample and the average of all the values?

set.seed(5)
X <- sample(x,5)
abs( mean(X) - mean(x) )

Answer: 1.433378

Random Variables Exercises #4

1/1 point (graded)

Why are the answers from 2 and 3 different?

Answer: Because the average of the samples is a random variable.

For these exercises, we will be using the following dataset:

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile=filename)
x <- unlist( read.csv(filename) )

Here x represents the weights for the entire population.

Null Distributions Exercises #1

1 point possible (graded)

Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages. What proportion of these 1,000 averages are more than 1 gram away from the average of x ?

set.seed(1)
n <- 1000
averages5 <- vector("numeric",n)
for(i in 1:n){
  X <- sample(x,5)
  averages5[i] <- mean(X)
}
hist(averages5) ##take a look
mean( abs( averages5 - mean(x) ) > 1)

Answer: 0.498

Null Distributions Exercises #2

1 point possible (graded)

We are now going to increase the number of times we redo the sample from 1,000 to 10,000. Set the seed at 1, then using a for-loop take a random sample of 5 mice 10,000 times. Save these averages. What proportion of these 10,000 averages are more than 1 gram away from the average of x ?

set.seed(1)
n <- 10000
averages5 <- vector("numeric",n)
for(i in 1:n){
  X <- sample(x,5)
  averages5[i] <- mean(X)
}
hist(averages5) ##take a look
mean( abs( averages5 - mean(x) ) > 1)

Answer: 0.4976

Note that the answers to 1 and 2 barely changed. This is expected. The way we think about the random value distributions is as the distribution of the list of values obtained if we repeated the experiment an infinite number of times. On a computer, we can't perform an infinite number of iterations so instead, for our examples, we consider 1,000 to be large enough, thus 10,000 is as well. Now if instead we change the sample size, then we change the random variable and thus its distribution.
Set the seed at 1, then using a for-loop take a random sample of 50 mice 1,000 times. Save these averages. What proportion of these 1,000 averages are more than 1 gram away from the average of x ?

set.seed(1)
n <- 1000
averages50 <- vector("numeric",n)
for(i in 1:n){
  X <- sample(x,50)
  averages50[i] <- mean(X)
}
hist(averages50) ##take a look
mean( abs( averages50 - mean(x) ) > 1)

Answer: 0.019

Week#1:PH525.1x

Run swirl Exercises #1

0 points possible (ungraded)

What version of R are you using (hint: make sure you download the latest version and then type version)? Please note that this question does not count toward your grade, but it is important to make sure that you are using the latest version of R.

> version
R will return

Since, I'm using the very current version of R. The correct answer is 3.3.0.

Answer: 3.3.0

Run swirl Exercises #2

1/1 point (graded)

Create a numeric vector containing the numbers 2.23, 3.45, 1.87, 2.11, 7.33, 18.34, 19.23. What is the average of these numbers?

z <- c(2.23, 3.45, 1.87, 2.11, 7.33, 18.34, 19.23)
y <-mean(z)
y
Answer: 7.942

Run swirl Exercises #3

1 point possible (graded)

Use a for loop to determine the value of

\sum_{i = 1}^{25} i^{2}

sum <- 0

for (i in 1:25) {

sum <- sum + i^2

sum

}

Answer: 5525

Run swirl Exercises #4

1 point possible (graded)

The cars dataset is available in base R. You can type cars to see it. Use the class function to determine what type of object is cars.

class(cars)
Answer: data.frame

Run swirl Exercises #5

1/1 point (graded)

How many rows does the cars object have?

nrow(cars)
Answer: 50

Run swirl Exercises #6

1 point possible (graded)

What is the name of the second column of cars?

names(cars)[2]
Answer: dist

Run swirl Exercises #7

1/1 point (graded)

The simplest way to extract the columns of a matrix or data.frame is using [. For example you can access the second column with cars[,2]. What is the average distance traveled in this dataset?

mean(cars[,2])
Answer: 42.98

Run swirl Exercises #8

1/1 point (graded)

Familiarize yourself with the which function. What row of cars has a a distance of 85?

which(cars[,2]==85)
Answer: 50

To run the written code in file we'll call it an R markdown code file.
We can run multiple lines of command in another window as a R markdown.

Here we will test some of the basics of R data manipulation which you should know or should have learned by following the tutorials above. You will need to have the file femaleMiceWeights.csv in your working directory. As we showed above, one way to do this is by using the downloader package:

library(downloader) 
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleMiceWeights.csv"
filename <- "femaleMiceWeights.csv" 
download(url, destfile=filename)

 Here we will test some of the basics of R data manipulation which you 
should know or should have learned by following the tutorials above. You
 will need to have the file femaleMiceWeights.csv in your working directory. As we showed above, one way to do this is by using the downloader package:

Getting Started Exercises #1

1 point possible (graded)

Read in the file femaleMiceWeights.csv and report the exact name of the column containing the weights.

dat <- read.csv("femaleMiceWeights.csv")

dat <- read.csv("femaleMiceWeights.csv")
head(dat) 
##   Diet Bodyweight
## 1 chow      21.51
## 2 chow      28.14
## 3 chow      24.04
## 4 chow      23.45
## 5 chow      23.68
## 6 chow      19.79
names(dat)[2]

Answer:Bodyweight

Getting Started Exercises #2

1/1 point (graded)

The [ and ] symbols can be used to extract specific rows and specific columns of the table. What is the entry in the 12th row and second column?

dat[12,2]

Answer: 26.25

Getting Started Exercises #3

1 point possible (graded)

You should have learned how to use the $ character to extract a column from a table and return it as a vector. Use $ to extract the weight column and report the weight of the mouse in the 11th row.

weights <- dat$Bodyweight
weights[11]

Answer: 26.91

Getting Started Exercises #4

1 point possible (graded)

The length function returns the number of elements in a vector. How many mice are included in our dataset?

weights <- dat$Bodyweight
length(weights)

Answer: 24

Getting Started Exercises #5

1 point possible (graded)

To create a vector with the numbers 3 to 7, we can use seq(3,7) or, because they are consecutive, 3:7. View the data and determine what rows are associated with the high fat or hf diet. Then use the mean function to compute the average weight of these mice.

View(dat) 
weights <- dat$Bodyweight
mean( weights[ 13:24 ])

Answer: 26.83417

Getting Started Exercises #6

1/1 point (graded)

One of the functions we will be using often is sample. Read the help file for sample using ?sample. Now take a random sample of size 1 from the numbers 13 to 24 and report back the weight of the mouse represented by that row. Make sure to type set.seed(1) to ensure that everybody gets the same answer.

set.seed(1)
i <- sample( 13:24, 1)
dat$Bodyweight[i]

Answer: 25.34

dplyr Exercises

For these exercises, we will use a new dataset related to mammalian sleep. This link describes the data. Download the CSV file from this location:

library(downloader)
url="https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
filename <- basename(url)
download(url,filename)

We are going to read in this data, then test your knowledge of they key dplyr functions select and filter. We are also going to review two different _classes_: data frames and vectors.
Rmd file here

dplyr Exercises #1

1/1 point (graded)

Read in the msleep_ggplot2.csv file with the function read.csv and use the function class to determine what type of object is returned.

dat <- read.csv("msleep_ggplot2.csv")
class(dat)

Answer: data.frame

dplyr Exercises #2

1 point possible (graded)

Now use the filter function to select only the primates. How many animals in the table are primates? Hint: the nrow function gives you the number of rows of a data frame or matrix.

library(dplyr)
head(dat)
##                         name      genus  vore        order conservation
## 1                    Cheetah   Acinonyx carni    Carnivora           lc
## 2                 Owl monkey      Aotus  omni     Primates         <NA>
## 3            Mountain beaver Aplodontia herbi     Rodentia           nt
## 4 Greater short-tailed shrew    Blarina  omni Soricomorpha           lc
## 5                        Cow        Bos herbi Artiodactyla domesticated
## 6           Three-toed sloth   Bradypus herbi       Pilosa         <NA>
##   sleep_total sleep_rem sleep_cycle awake brainwt  bodywt
## 1        12.1        NA          NA  11.9      NA  50.000
## 2        17.0       1.8          NA   7.0 0.01550   0.480
## 3        14.4       2.4          NA   9.6      NA   1.350
## 4        14.9       2.3   0.1333333   9.1 0.00029   0.019
## 5         4.0       0.7   0.6666667  20.0 0.42300 600.000
## 6        14.4       2.2   0.7666667   9.6      NA   3.850
dat2 <- filter(dat, order=="Primates")
nrow(dat2)

Answer: 12

dplyr Exercises #3

1/1 point (graded)

What is the class of the object you obtain after subsetting the table to only include primates?

dat2 <- filter(dat, order=="Primates")
class(dat2)

Answer: data.frame

dplyr Exercises #4

1 point possible (graded)

Now use the select function to extract the sleep (total) for the primates. What class is this object? Hint: use %>% to pipe the results of the filter function to select.

y <- filter(dat, order=="Primates") %>% select(sleep_total)
class(y)

Answer: data.frame

dplyr Exercises #5

1/1 point (graded)

Now we want to calculate the average amount of sleep for primates (the average of the numbers computed above). One challenge is that the mean function requires a vector so, if we simply apply it to the output above, we get an error. Look at the help file for unlist and use it to compute the desired average.

y <- filter(dat, order=="Primates") %>% select(sleep_total) %>% unlist
mean( y )

Answer: 10.5

dplyr Exercises #6

1 point possible (graded)

For the last exercise, we could also use the dplyr summarize function. We have not introduced this function, but you can read the help file and repeat exercise 5, this time using just filter and summarize to get the answer.