Behavioral Risk Factor Surveillance System

For all of the questions below incorporate the necessary R code directly into your answers.

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual survey provided by the Center for Disease Control (CDC) to assess behavioral and chronic diseases. The center surveys six individual-level behavioral health risk factors associated with the leading causes of premature mortality and morbidity among adults: 1) cigarette smoking, 2) alcohol use, 3) physical activity, 4) diet, 5) hypertension, and 6) safety belt use.

A subset of the data concentrating on Iowa with records for 2012 is given at https://raw.githubusercontent.com/Stat579-at-ISU/stat579-at-isu.github.io/master/exams/data/iowa-brfss-2012.csv

A codebook describing the survey and a listing of all variables is available at http://www.cdc.gov/brfss/annual_data/2012/pdf/CODEBOOK12_LLCP.pdf. You should be able to answer all of the following questions without the help of the codebook.

  1. Read the data into your R session. How many records are in the data set, how many variables?
iowa <- read.csv("https://raw.githubusercontent.com/Stat579-at-ISU/stat579-at-isu.github.io/master/exams/data/iowa-brfss-2012.csv")
# your code goes here
  1. The variable ‘MARITAL’ is coded as 1 Married 2 Divorced
    3 Widowed 4 Separated
    5 Never married
    6 A member of an unmarried couple 9 Refused BLANK Not asked or Missing

Make the variable a factor variable and change the level names accordingly.

# your code goes here
  1. Plot boxplots of age (AGE) by marital status (MARITAL). Order categories of MARITAL by increasing median age.
# your code goes here
  1. HEIGHT3 and WEIGHT2 are reported height and weight of survey participants. What would you (roughly) expect for a relationship between the two variables? Draw a scatterplot. Does the result surprise you? Comment.
# your code goes here
  1. It turns out, that the following coding scheme is used for HEIGHT3: 200 - 711 Height (ft/inches)
    7777 Don’t know/Not sure 9000 - 9998 Height (meters/centimeters), where the first 9 indicates that the measurement was metric 9999 Refused BLANK Not asked or Missing

Introduce a new variable ‘height’ into the dataset that corresponds to reported height in centimeters [cm] (i.e. for metric measurements you need to subtract 9000, but for ft/inches you need to do a bit more). For your convenience: 1 ft equals 30.48 cm, 1 in equals 2.54 cm. After converting, round to the nearest centimeter. Introduce NAs as appropriate.

Using the ggplot2 package, draw a histogram of the resulting variable facetted by gender. Make sure that the histograms are displayed on top of each other. Get rid of all warning messages. Comment on the result.

# your code goes here
  1. Seatbelt use (SEATBELT) is captured from 1 - Always, 2- Nearly Always, 3-Sometimes, 4- Seldom, 5-Never, 7-Don’t know, 8-Never drive or ride in a car, 9-Refused, to BLANK-Missing.

Which category did respondents pick most often?

Introduce a new variable into the data set that is (in case of a valid answer to SEATBELT) TRUE, if a respondent always wears a seatbelt, and FALSE if not. Deal with missing values appropriately. What percentage of women (SEX = 2) always wear a seatbelt compared to men (SEX = 1)? Using ggplot2, draw a plot that corresponds to these percentages.

# your code goes here
  1. Using ddply, determine the following summary statistics by age and sex:

Show two plots: using ggplot2, show the relationship between age and percentage of drink driving, in a separate plot show the relationship between exercise and the number of poor health days. In both plots, incorporate information on the number of respondents and gender.

Comment on both plots.

# your code goes here
  1. Write a function ‘clean’ that takes as arguments a vector ‘x’ and two vectors, ‘old’ and ‘new’. The function is supposed to replace all occurrences of ‘old’ by their corresponding values in ‘new’. Make sure to check that ‘old’ and ‘new’ have the same length. Call the following lines to test your function:
summary(clean(iowa$MENTHLTH, c(88, 77, 99), c(0, NA, NA))) iowa iowa %>% mutate(
HLTHPLN1 = clean(HLTHPLN1, c(1,2,7,9), c("Yes", "No", NA, NA))
) %>% count(HLTHPLN1)
# your code goes here
  1. The function glm in R calculates a generalized linear model. m1 <- glm(DRNKDRI2 > 0 ~ AGE, data=iowa, family=binomial(logit)) This creates an object m1 in your R session. Investigate the object, find out if the object contains an element named ‘aic’. Report its value, if it does.
# your code goes here