Check the metadata, data structure and data content of a potential submission
to the EPHT. check_submission() is meant to be a tool that provides quick
feedback before users spend a long time waiting on the submission portal
results. Most checks are simple and non-comprehensive, and are meant to tip the
user off to potential problems. Where the CDC provides explicit allowable
values, those have been included and the data content
will be checked against them.
Users can expect to find the results of check_submission() printed in the
console as a cli list. The result of each check will be one of three options:
Success, Warning, or Danger. None of the checks will prevent users
from moving forward with their submission. They are merely reasonable
guidelines and should be treated as such.
Usage
check_submission(
  data,
  content_group_id,
  mcn,
  jurisdiction_code,
  state_fips_code,
  submitter_email,
  submitter_name,
  submitter_title
)
Arguments
- data
Pre-wrangled Dataframe.
- content_group_id
Code that identifies the content found in EPHT documentation.
- mcn
Metadata Control Number provided by EPHT.
- jurisdiction_code
Two-letter state abbreviation for the submitter state.
- state_fips_code
FIPS code of the submitter state.
- submitter_email
Email of person submitting data to EPHT.
- submitter_name
First and last name of person submitting data to EPHT.
- submitter_title
Title of person submitting data to EPHT.
Success
This check passes. However, requirements can change over time and this package cannot be expected to catch everything up to the last minute. This is a good sign, but users should expect the test submission portal to have the final say.
Warning
This check found something out of the ordinary. Either the value wasn't in the expected format, the severity of the issue doesn't rise to the level of Danger, or, short of dramatically increasing code complexity, this simple check could not be sure that the provided value conforms to requirements. The test submission portal will have the final say.
Danger
Provided the package is up to date with EPHT requirements, this check
has found something that is wrong, such as a value outside of anything in
the EPHT data dictionaries or a value whose format is not going to play
nicely with distiller or the data submission
portal. The test submission portal will have the final say.
Data
Users are expected to wrangle, aggregate and otherwise implement the logic
expected by EPHT themselves. distiller will not handle that step, though it
does provide some helpers: collapse_race(), collapse_ethnicity(), and
make_months_worse().
In order for distiller to work properly there are some expectations about
the data that must be met:
- The data must be a dataframe or tibble.
- For all content_group_id values, the data must have the following columns (in any order):
  - month: character - acceptable values: "01", "02", "03" ... "12"
  - agegroup: numeric - acceptable values: 1-19
  - county: character - string length of 5, unless unknown, then county = "U"
  - ethnicity: character - acceptable values: "H", "NH", "U"
  - race: character - acceptable values: "W", "B", "O", "U"
  - health_outcome_id: numeric - acceptable values: 1-5
  - sex: character - acceptable values: "M", "F", "U"
  - year: numeric - acceptable values: 2001-9999
  - monthly_count: numeric - acceptable values: >0 and no missing values
- For content_group_id "CO-ED" and "CO-HOSP", the data must have the additional columns:
  - fire_count: numeric - acceptable values: >0 and no missing values
  - nonfire_count: numeric - acceptable values: >0 and no missing values
  - unknown_count: numeric - acceptable values: >0 and no missing values
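As a rough illustration of the column expectations above, the following base R sketch builds a tiny data frame and runs a few quick sanity checks on it before it would be handed to check_submission(). The example values, the county codes, and the names example_data, required_cols, missing_cols, months_ok and county_ok are made up for illustration and are not part of distiller.

```r
# Hypothetical example data; all values are illustrative only.
example_data <- data.frame(
  month = sprintf("%02d", c(1L, 2L, 12L)),  # zero-padded character months
  agegroup = c(1, 5, 19),
  county = c("49035", "49049", "U"),        # 5-character code, or "U" if unknown
  ethnicity = c("H", "NH", "U"),
  race = c("W", "B", "O"),
  health_outcome_id = c(1, 2, 3),
  sex = c("M", "F", "U"),
  year = c(2019, 2020, 2021),
  monthly_count = c(4, 10, 7),
  stringsAsFactors = FALSE
)

# Column names copied from the list above; this vector is an assumption,
# not an object exported by distiller.
required_cols <- c(
  "month", "agegroup", "county", "ethnicity", "race",
  "health_outcome_id", "sex", "year", "monthly_count"
)

# Which required columns are absent (empty when all are present)?
missing_cols <- setdiff(required_cols, names(example_data))

# Simple format checks mirroring the documented expectations.
months_ok <- all(example_data$month %in% sprintf("%02d", 1:12))
county_ok <- all(nchar(example_data$county) == 5 | example_data$county == "U")
```

Checks like these are only a convenience while wrangling; the check_* functions and the test submission portal remain the reference for what is acceptable.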
Content Group Identifier
The Content Group Identifier is the ID expected to be used by EPHT. It is a combination of the disease and facility type. Details on which ID to use can be found in the how-to guides provided by EPHT.
content_group_id must be one of the following:
"AS-ED", "AS-HOSP"
"MI-HOSP"
"CO-ED", "CO-HOSP"
"HEAT-ED", "HEAT-HOSP"
"COPD-ED", "COPD-HOSP"
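A minimal sketch of how a value could be compared against this list in base R before calling any distiller function. The helper is_allowed_content_group_id() and the vector allowed_content_group_ids are hypothetical names introduced here for illustration; they are not part of distiller, and the package's own check may behave differently.

```r
# Allowable identifiers copied from the list above.
allowed_content_group_ids <- c(
  "AS-ED", "AS-HOSP", "MI-HOSP", "CO-ED", "CO-HOSP",
  "HEAT-ED", "HEAT-HOSP", "COPD-ED", "COPD-HOSP"
)

# Hypothetical helper (not part of distiller): TRUE only for a single
# character value that exactly matches one of the allowable identifiers.
is_allowed_content_group_id <- function(id) {
  is.character(id) && length(id) == 1 && id %in% allowed_content_group_ids
}

is_allowed_content_group_id("AS-HOSP")  # exact match
is_allowed_content_group_id("as-hosp")  # matching is case-sensitive here
```

The comparison in this sketch is deliberately strict (exact, case-sensitive match), which is an assumption about how the identifiers should be written.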
Metadata Control Number
The Metadata Control Number (mcn) is provided by the EPHT and is used to identify the dataset and its content. Users who are ready to submit data will already have a set of these.
Submission Check
If users set check_first = TRUE in make_xml_document(), or run
check_submission() or any of the other check_* functions, then a suite
of checks is run against the metadata, data structure and data content.
Please note that users do not need to run the whole suite of checks; they can
run each function piecemeal on their data as it is being prepared.
check_submission() is a wrapper around the individual check_* functions
listed under See also.
See also
Other checks: check_content_group_id(), check_data(), check_data_content(), check_jurisdiction_code(), check_mcn(), check_state_fips_code(), check_submitter_email(), check_submitter_name(), check_submitter_title()
Examples
data <-
  mtcars |>
  dplyr::rename(
    month = mpg,
    agegroup = cyl,
    county = disp,
    ethnicity = hp,
    health_outcome_id = drat,
    monthly_count = wt,
    race = qsec,
    sex = vs,
    year = am
  ) |>
  dplyr::select(-c(gear, carb))
content_group_id <- "AS-HOSP"
mcn <- "1234-1234-1234-1234-1234"
jurisdiction_code <- "two_letter_code"
state_fips_code <- "1234"
submitter_email <- "submitter@email.com"
submitter_name <- "Submitter Name"
submitter_title <- "Submitter Title"
check_submission(
  data,
  content_group_id,
  mcn,
  jurisdiction_code,
  state_fips_code,
  submitter_email,
  submitter_name,
  submitter_title
)
#> ℹ Checking submission metadata
#> ✔ Success: content_group_id
#> ! Warning: mcn may not have correct format
#> Troublemakers: length, format
#> ! Warning: jurisdiction_code may not have correct format
#> Troublemakers: length, format
#> ! Warning: state_fips_code may not have correct format
#> Troublemakers: length, format
#> ✔ Success: submitter_email
#> ✔ Success: submitter_name
#> ✔ Success: submitter_title
#> ℹ Checking data structure and content
#> ✔ Success: dataframe_structure
#> ✖ Danger: month does not have allowable value/s
#> Troublemakers: allowed_values
#> ✔ Success: agegroup
#> ✖ Danger: county does not have allowable value/s
#> Troublemakers: length
#> ✖ Danger: ethnicity does not have allowable value/s
#> Troublemakers: allowed_values
#> ✖ Danger: health_outcome_id does not have allowable value/s
#> Troublemakers: allowed_values
#> ✖ Danger: sex does not have allowable value/s
#> Troublemakers: allowed_values
#> ✖ Danger: year does not have allowable value/s
#> Troublemakers: allowed_values
#> ✖ Danger: race does not have allowable value/s
#> Troublemakers: allowed_values
#> ✔ Success: monthly_count