The Behavioral Risk Factor Surveillance System (BRFSS) is an annual survey conducted by the Centers for Disease Control and Prevention (CDC) to assess behavioral risk factors and chronic diseases. The survey covers six individual-level behavioral health risk factors associated with the leading causes of premature mortality and morbidity among adults: 1) cigarette smoking, 2) alcohol use, 3) physical activity, 4) diet, 5) hypertension, and 6) safety belt use.
A subset of the data concentrating on Iowa with records for 2022 is given at
url <- "https://raw.githubusercontent.com/Stat579-at-ISU/stat579-at-isu.github.io/master/homework/data/brfss-iowa-2022.csv"
The following code reads the data into your R session:
iowa <- read.csv(url)
A codebook describing the survey and listing all variables is available at https://www.cdc.gov/brfss/annual_data/2022/zip/codebook22_llcp-v2-508.zip. Download and unzip it, then open the file in a browser.
For each of the questions, show the code necessary to get to the answer. Make sure to also write the answer to the question in a sentence.
Download the RMarkdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML with your name.
Load the dataset into your session and store it in the object iowa.
How many rows and how many columns does the data set have? Which types of variables does the data set contain?
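One way to inspect the size and column types of a data frame is sketched below. The data frame `toy` and all of its values are made up for illustration and only stand in for `iowa`; the column `NOTE_TXT` is hypothetical and not part of the survey.

```r
# Hypothetical stand-in for `iowa`; the values are made up for illustration.
toy <- data.frame(
  HEIGHT3  = c(510L, 602L, NA, 9165L),
  WEIGHT2  = c(180L, 210L, 150L, 9070L),
  SEXVAR   = c(1L, 2L, 2L, 1L),
  NOTE_TXT = c("a", "b", "c", "d"),   # hypothetical text column
  stringsAsFactors = FALSE
)

dim(toy)                 # number of rows, then number of columns
n_rows <- nrow(toy)
n_cols <- ncol(toy)

# Class of every column, then a tally of how many columns share each class
col_types <- sapply(toy, class)
table(col_types)

str(toy)                 # compact overview: type and first values per column
```

Replacing `toy` with `iowa` should give the corresponding numbers for the real data.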
Use ggplot2 to draw a scatterplot of height (HEIGHT3) and weight (WEIGHT2), faceted by gender (SEXVAR). State your expectation regarding the relationship between the variables, then comment on the plot you see.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
iowa %>% ggplot(aes(x = HEIGHT3, y = WEIGHT2)) + geom_point()
## Warning: Removed 221 rows containing missing values (`geom_point()`).
5. Temporarily restrict weight and height to below 2500, then plot the values again. Describe the plot you see.
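A minimal sketch of such a temporary restriction, assuming dplyr's `filter()` with two logical conditions. The data frame `toy` and its values are made up for illustration; the filtered rows go into a new object (`small`, a hypothetical name) so the original data stay untouched.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for `iowa`; the values are made up for illustration.
toy <- data.frame(
  HEIGHT3 = c(510, 602, 7777, 9165, 507, NA),
  WEIGHT2 = c(180, 210, 150, 9070, 7777, 165)
)

# Keep rows where both values are below 2500.
# Rows with NA in either column are dropped as well, because filter() only
# keeps rows for which the condition evaluates to TRUE.
small <- toy %>% filter(HEIGHT3 < 2500, WEIGHT2 < 2500)

# Same scatterplot as before, now on the restricted data
p <- ggplot(small, aes(x = HEIGHT3, y = WEIGHT2)) + geom_point()
p
```

Because the result is stored under a different name, later steps can still work with the complete data.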
It turns out that the following coding scheme is used for HEIGHT3:
HEIGHT3 value | Interpretation |
---|---|
200 - 711 | Height (ft/inches), e.g. 410 is 4 feet, 10 inches |
7777 | Don’t know/Not sure |
9000 - 9998 | Height (meters/centimeters), where the leading 9 indicates a metric measurement, e.g. 9165 is 1 meter 65 cm |
9999 | Refused |
BLANK | Not asked or Missing |
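The coding scheme in the table can be turned into a single height in centimeters. The helper below is a sketch under stated assumptions, not part of the assignment: the name `height_cm` is hypothetical, the special codes 7777 and 9999 (and anything outside the listed ranges) are mapped to `NA`, and one inch is taken as 2.54 cm.

```r
# Hypothetical helper: convert HEIGHT3 codes to centimeters following the
# coding table above. Codes outside the two listed ranges become NA.
height_cm <- function(code) {
  feet   <- code %/% 100          # e.g. 410 -> 4
  inches <- code %% 100           # e.g. 410 -> 10
  metric <- code - 9000           # e.g. 9165 -> 165 (centimeters)

  is_imperial <- !is.na(code) & code >= 200 & code <= 711
  is_metric   <- !is.na(code) & code >= 9000 & code <= 9998

  out <- rep(NA_real_, length(code))
  out[is_imperial] <- (feet[is_imperial] * 12 + inches[is_imperial]) * 2.54
  out[is_metric]   <- metric[is_metric]
  out
}

# 410 is 4 ft 10 in = 58 in; 9165 is 165 cm; the remaining codes carry no height
res <- height_cm(c(410, 9165, 7777, 9999, NA))
res
```

Applied to the real data, something like `height_cm(iowa$HEIGHT3)` would give one comparable scale for both imperial and metric responses.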
Use filter and logical expressions to answer the following questions:
How many missing values (is.na) does the variable HEIGHT3 have?
[...] (NUMADULT) in a household in Iowa in 2022?
EDUCA is the variable containing the highest grade or year of school completed. Is the percentage of college graduates in Iowa higher or lower than the nation’s average (based on the BRFSS sample)?
[...] (FLSHTMY3) in July 2022 or after?

For the submission: submit your solution in an R Markdown file and (just for insurance) submit the corresponding html/word file with it.