Check the metadata, data structure and data content of a potential submission to the EPHT. check_submission is meant to give quick feedback before users spend a ton of time waiting on the submission portal results. Most checks are simple and non-comprehensive, and are meant to tip the user off to potential problems. Where the CDC provides explicit allowable values, those have been included and the data content is checked against them.

Users can expect to find the results of check_submission printed in the console as a cli list. The result of each check will be one of three options: Success, Warning, or Danger. None of the checks will prevent users from moving forward with their submission. They are merely reasonable guidelines and should be treated as such.

Usage

check_submission(
  data,
  content_group_id,
  mcn,
  jurisdiction_code,
  state_fips_code,
  submitter_email,
  submitter_name,
  submitter_title
)

Arguments

data

Pre-wrangled data frame or tibble.

content_group_id

Code that identifies the content. Valid codes are found in EPHT documentation.

mcn

Metadata Control Number provided by EPHT.

jurisdiction_code

Two-letter state abbreviation for the submitter state.

state_fips_code

FIPS code of the submitter state.

submitter_email

Email of person submitting data to EPHT.

submitter_name

First and last name of person submitting data to EPHT.

submitter_title

Title of person submitting data to EPHT.
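
A quick local look at jurisdiction_code and state_fips_code can catch placeholder values before the full check runs. The sketch below assumes the usual forms of these identifiers: two uppercase letters for the state abbreviation and a two-digit character string for the state FIPS code. Both helper names are hypothetical and are not part of distiller; the package's own checks may differ.

```r
# Hypothetical pre-checks, not distiller functions.
# Assumption: a jurisdiction code is exactly two uppercase letters.
looks_like_jurisdiction_code <- function(x) {
  is.character(x) && length(x) == 1 && grepl("^[A-Z]{2}$", x)
}

# Assumption: a state FIPS code is a two-digit character string, e.g. "49".
looks_like_state_fips_code <- function(x) {
  is.character(x) && length(x) == 1 && grepl("^[0-9]{2}$", x)
}

looks_like_jurisdiction_code("UT")
looks_like_jurisdiction_code("two_letter_code")
looks_like_state_fips_code("49")
looks_like_state_fips_code("1234")
```

Passing the FIPS code as a character string rather than a number keeps any leading zero intact.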

Value

A cli list of submission check results, printed to the console.

Success

This check passes. However, requirements can change over time and this package cannot be expected to catch everything up to the last minute. This is a good sign, but users should expect the test submission portal to have the final say.

Warning

This check found something out of the ordinary. Either the value wasn't in the expected format, the severity of the issue doesn't rise to the level of Danger, or this simple check couldn't tell whether the provided value conformed to requirements without dramatically increasing code complexity. The test submission portal will have the final say.

Danger

Provided the package is up to date with EPHT requirements, this check has found something that is wrong, such as a value outside of anything in the EPHT data dictionaries, or a value in a format that is not going to play nicely with distiller or the data submission portal. The test submission portal will have the final say.

Data

Users are expected to wrangle, aggregate and otherwise implement the logic expected by EPHT themselves. distiller will not handle that step, though it does provide some helpers: collapse_race(), collapse_ethnicity(), and make_months_worse().

In order for distiller to work properly there are some expectations about the data that must be met:

  • The data must be a dataframe or tibble

  • For every content_group_id, the data must have the following columns (in any order):

    • month: character - acceptable values: "01", "02", "03" ... "12"

    • agegroup: numeric - acceptable values: 1-19

    • county: character - string length of 5, unless unknown, then county = "U"

    • ethnicity: character - acceptable values: "H", "NH", "U"

    • race: character - acceptable values: "W", "B", "O", "U"

    • health_outcome_id: numeric - acceptable values: 1-5

    • sex: character - acceptable values: "M", "F", "U"

    • year: numeric - acceptable values: 2001-9999

    • monthly_count: numeric - acceptable values: >0 and no missing values

  • For content_group_id "CO-ED" and "CO-HOSP" the data must have the additional columns:

    • fire_count: numeric - acceptable values: >0 and no missing values

    • nonfire_count: numeric - acceptable values: >0 and no missing values

    • unknown_count: numeric - acceptable values: >0 and no missing values
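
The column requirements above can be sketched as a small data frame plus a few spot checks. This is only an illustration in base R: every value is made up, the object names are hypothetical, and the checks cover just a handful of the listed constraints, not what distiller itself runs.

```r
# Minimal sketch of a data frame shaped like the requirements above.
# All values are made up for illustration only.
example_data <- data.frame(
  month = c("01", "02", "12"),
  agegroup = c(1, 5, 19),
  county = c("49035", "49049", "U"),
  ethnicity = c("H", "NH", "U"),
  race = c("W", "B", "U"),
  health_outcome_id = c(1, 2, 5),
  sex = c("M", "F", "U"),
  year = c(2020, 2020, 2021),
  monthly_count = c(3, 7, 1),
  stringsAsFactors = FALSE
)

# Columns that every content_group_id requires (order does not matter).
required_columns <- c(
  "month", "agegroup", "county", "ethnicity", "race",
  "health_outcome_id", "sex", "year", "monthly_count"
)

# Required columns that are absent; character(0) means none are missing.
missing_columns <- setdiff(required_columns, names(example_data))

# Spot checks mirroring a few of the listed constraints.
months_ok <- all(example_data$month %in% sprintf("%02d", 1:12))
county_ok <- all(nchar(example_data$county) == 5 | example_data$county == "U")
counts_ok <- !anyNA(example_data$monthly_count) &&
  all(example_data$monthly_count > 0)
```

For "CO-ED" and "CO-HOSP", the same pattern extends to fire_count, nonfire_count and unknown_count by adding them to both the data frame and required_columns.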

Content Group Identifier

The Content Group Identifier is the ID expected to be used by EPHT. It is a combination of the disease and facility type. Details on which ID to use can be found in the how-to guides provided by EPHT.

content_group_id must belong to one of the following:

  • "AS-ED", "AS-HOSP"

  • "MI-HOSP"

  • "CO-ED", "CO-HOSP"

  • "HEAT-ED", "HEAT-HOSP"

  • "COPD-ED", "COPD-HOSP"
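
A membership test against the list above is enough to catch a mistyped identifier early. The sketch below copies the allowed values from this page; both helper functions are hypothetical and are not exported by distiller.

```r
# Allowed values copied from the list above.
allowed_content_group_ids <- c(
  "AS-ED", "AS-HOSP", "MI-HOSP", "CO-ED", "CO-HOSP",
  "HEAT-ED", "HEAT-HOSP", "COPD-ED", "COPD-HOSP"
)

# Hypothetical helper: TRUE only for a single, exactly matching string.
is_allowed_content_group_id <- function(id) {
  is.character(id) && length(id) == 1 && id %in% allowed_content_group_ids
}

# Hypothetical helper: CO content groups need the extra
# fire_count, nonfire_count and unknown_count columns.
needs_co_columns <- function(id) {
  id %in% c("CO-ED", "CO-HOSP")
}

is_allowed_content_group_id("AS-HOSP")
is_allowed_content_group_id("as-hosp")
needs_co_columns("CO-ED")
```

The comparison is case-sensitive, so "as-hosp" is not treated as a match for "AS-HOSP".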

Metadata Control Number

The Metadata Control Number (mcn) is provided by EPHT and is used to identify the dataset and its content. Users who are ready to submit data will already have a set of these.

Submission Check

If users set check_first = TRUE in make_xml_document(), or run check_submission() or any of the other check_* functions, then a suite of checks is run against the metadata, data structure and data content. Users do not need to run the whole suite of checks at once; they can run each function piecemeal on their data as it is being prepared.

check_submission() is a wrapper around the individual check_* functions.

Examples

data <-
  mtcars |>
  dplyr::rename(
    month = mpg,
    agegroup = cyl,
    county = disp,
    ethnicity = hp,
    health_outcome_id = drat,
    monthly_count = wt,
    race = qsec,
    sex = vs,
    year = am
  ) |>
  dplyr::select(-c(gear, carb))

content_group_id <- "AS-HOSP"
mcn <- "1234-1234-1234-1234-1234"
jurisdiction_code <- "two_letter_code"
state_fips_code <- "1234"
submitter_email <- "submitter@email.com"
submitter_name <- "Submitter Name"
submitter_title <- "Submitter Title"

check_submission(
  data,
  content_group_id,
  mcn,
  jurisdiction_code,
  state_fips_code,
  submitter_email,
  submitter_name,
  submitter_title
)
#>  Checking submission metadata
#>  Success: content_group_id
#> ! Warning: mcn may not have correct format
#> Troublemakers: length, format
#> ! Warning: jurisdiction_code may not have correct format
#> Troublemakers: length, format
#> ! Warning: state_fips_code may not have correct format
#> Troublemakers: length, format
#>  Success: submitter_email
#>  Success: submitter_name
#>  Success: submitter_title
#>  Checking data structure and content
#>  Success: dataframe_structure
#>  Danger: month does not have allowable value/s
#> Troublemakers: allowed_values
#>  Success: agegroup
#>  Danger: county does not have allowable value/s
#> Troublemakers: length
#>  Danger: ethnicity does not have allowable value/s
#> Troublemakers: allowed_values
#>  Danger: health_outcome_id does not have allowable value/s
#> Troublemakers: allowed_values
#>  Danger: sex does not have allowable value/s
#> Troublemakers: allowed_values
#>  Danger: year does not have allowable value/s
#> Troublemakers: allowed_values
#>  Danger: race does not have allowable value/s
#> Troublemakers: allowed_values
#>  Success: monthly_count