For all of the questions below incorporate the necessary R code directly into your answers.
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual survey provided by the Center for Disease Control (CDC) to assess behavioral and chronic diseases. The center surveys six individual-level behavioral health risk factors associated with the leading causes of premature mortality and morbidity among adults: 1) cigarette smoking, 2) alcohol use, 3) physical activity, 4) diet, 5) hypertension, and 6) safety belt use.
A subset of the data concentrating on Iowa with records for 2012 is given at https://raw.githubusercontent.com/Stat579-at-ISU/stat579-at-isu.github.io/master/exams/data/iowa-brfss-2012.csv
A codebook describing the survey and a listing of all variables is available at http://www.cdc.gov/brfss/annual_data/2012/pdf/CODEBOOK12_LLCP.pdf. You should be able to answer all of the following questions without the help of the codebook.
iowa <- read.csv("https://raw.githubusercontent.com/Stat579-at-ISU/stat579-at-isu.github.io/master/exams/data/iowa-brfss-2012.csv")
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dim(iowa)
## [1] 7166 359
Make the variable a factor variable and change the level names accordingly.
iowa$MARITAL <- factor(iowa$MARITAL)
levels(iowa$MARITAL) <- c("Married", "Divorced", "Widowed", "Separated", "Never Married", "Couple", NA)
ggplot( data=iowa,
aes(reorder(MARITAL, AGE, na.rm=T), AGE)) + geom_boxplot()
## Warning: Removed 65 rows containing non-finite values (`stat_boxplot()`).
HEIGHT3
and WEIGHT2
are reported height
and weight of survey participants. What would you (roughly) expect for a
relationship between the two variables? Draw a scatterplot. Does the
result surprise you? Comment.iowa %>% ggplot(aes(x = HEIGHT3, y = WEIGHT2)) +
geom_point()
# there should be a somewhat positive (linear) relationship between height and weight - the picture looks very different
# the scales on x and y axis do not correspond in any way to values that would make sense for human heights and weights
Introduce a new variable ‘height’ into the dataset that corresponds to reported height in centimeters [cm] (i.e. for metric measurements you need to subtract 9000, but for ft/inches you need to do a bit more). For your convenience: 1 ft equals 30.48 cm, 1 in equals 2.54 cm. After converting, round to the nearest centimeter. Introduce NAs as appropriate.
Using the ggplot2 package, draw a histogram of the resulting variable facetted by gender. Make sure that the histograms are displayed on top of each other. Get rid of all warning messages. Comment on the result.
iowa$height <- NA
iowa$height[iowa$HEIGHT3 >= 9000] <- iowa$HEIGHT3[iowa$HEIGHT3 >= 9000] - 9000
iowa$height[iowa$HEIGHT3 == 9999] <- NA
idx <- which(iowa$HEIGHT3 <= 711)
iowa$height[idx] <- round(30.48*(iowa$HEIGHT3[idx] %/% 100) + 2.54*(iowa$HEIGHT3[idx] %% 100))
iowa %>% ggplot(aes( x = height)) +
geom_histogram(binwidth=5) +
facet_grid(SEX~.)
## Warning: Removed 73 rows containing non-finite values (`stat_bin()`).
Which category did respondents pick most often?
Introduce a new variable into the data set that is (in case of a valid answer to SEATBELT) TRUE, if a respondent always wears a seatbelt, and FALSE if not. Deal with missing values appropriately. What percentage of women (SEX = 2) always wear a seatbelt compared to men (SEX = 1)? Using ggplot2, draw a plot that corresponds to these percentages.
summary(factor(iowa$SEATBELT)) # 1 with 5960
## 1 2 3 4 5 7 8 9 NA's
## 5960 682 161 66 61 7 8 3 218
iowa$alwaysseatbelt <- NA
iowa$alwaysseatbelt[iowa$SEATBELT == 1] <- TRUE
iowa$alwaysseatbelt[iowa$SEATBELT %in% 2:5] <- FALSE
sum(iowa$alwaysseatbelt[iowa$SEX==2] == TRUE, na.rm=TRUE)/sum(iowa$SEX==2)*100
## [1] 88.09749
sum(iowa$alwaysseatbelt[iowa$SEX==1] == TRUE, na.rm=TRUE)/sum(iowa$SEX==1)*100
## [1] 76.08844
iowa %>%
ggplot(aes(x = factor(SEX), fill=alwaysseatbelt)) +
geom_bar(position="fill")
Show two plots: using ggplot2, show the relationship between age and percentage of drink driving, in a separate plot show the relationship between exercise and the number of poor health days. In both plots, incorporate information on the number of respondents and gender.
Comment on both plots.
library(dplyr)
dframe <- iowa %>% group_by(AGE, SEX) %>% summarise(
n = length(AGE),
dui = mean(DRNKDRI2 > 0 & (DRNKDRI2 < 77), na.rm=TRUE)*100,
noexer = mean(EXERANY2 == 2, na.rm=TRUE)*100,
poor = mean(POORHLTH, na.rm=TRUE)
)
## `summarise()` has grouped output by 'AGE'. You can override using the `.groups`
## argument.
dframe %>%
ggplot(aes(x = AGE, y =dui, colour=factor(SEX), size = n)) +
geom_point() +
facet_grid(~SEX)
## Warning: Removed 7 rows containing missing values (`geom_point()`).
# it seems that with increasing age the percentage of dui events goes down;
# women seem to have lower dui percentages than men
# the visual (and the data) are not ideal, because there is not enough data
# to create reliable rates for dui for each year of age. It would be more informative
# to create age groups and calculate dui percentages
dframe %>%
ggplot(aes(x = noexer, y = poor,
colour=factor(SEX))) +
geom_point(aes(size = n)) +
geom_smooth(method = "lm") +
facet_grid(~SEX)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 3 rows containing missing values (`geom_point()`).
# What we see are moderately strong positive relationships between no exercise and poor health
# the visual (and the data) are not ideal, because there is not enough data
# to create reliable rates for each year of age and sex. It would be more informative
# to create age groups and re-calculate the percentages
summary(clean(iowa$MENTHLTH, c(88, 77, 99), c(0, NA, NA))) iowa iowa %>% mutate(
HLTHPLN1 = clean(HLTHPLN1, c(1,2,7,9), c("Yes", "No", NA, NA))
) %>% count(HLTHPLN1)
clean <- function(x, old, new) {
stopifnot(length(old) == length(new))
for (i in 1:length(old)) {
x <- replace(x, x == old[i], new[i])
}
x
}
summary(clean(iowa$MENTHLTH, c(88, 77, 99), c(0, NA, NA)))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 2.685 1.000 30.000 94
iowa %>% mutate(
HLTHPLN1 = clean(HLTHPLN1, c(1,2,7,9), c("Yes", "No", NA, NA))
) %>% count(HLTHPLN1)
## HLTHPLN1 n
## 1 No 532
## 2 Yes 6601
## 3 <NA> 33
m1 <- glm(DRNKDRI2 > 0 ~ AGE, data=iowa, family=binomial(logit))
This creates an object m1 in your R session. Investigate the object,
find out if the object contains an element named ‘aic’. Report its
value, if it does.# use function str() on m1:
m1 <- glm(DRNKDRI2 > 0 ~ AGE, data=iowa, family=binomial(logit))
# str(m1) # too much output to show sensibly
names(m1)
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "na.action" "call" "formula"
## [25] "terms" "data" "offset"
## [28] "control" "method" "contrasts"
## [31] "xlevels"
m1$aic
## [1] 1368.494