Create an XML document with user-provided data and submission metadata

This is the core function of distiller. It takes in a user-provided dataframe with the expected columns and a set of metadata about the submission. It then jumps through the XML hoops for the user and distills it all down to an XML document that can be submitted to the CDC's EPHT submission portal. Users should always check the outputs with the EPHT test submission portal first.

Users are expected to start with a standard dataframe because the XML requirements change based on the facility type (Hospital vs Emergency Department), even though the variables and disease are the same. Users trade some freedom in naming conventions, and in return distiller abstracts away the XML naming gymnastics. With a single set of column names, users can keep all their data in one place, do all the required wrangling on it at once, and then use whatever package they wish to split out the master data and iterate over it with distiller.
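A minimal base R sketch of that split-and-iterate workflow follows. The `master` data frame, its extra `content_group_id` column, the `mcns` lookup and all metadata values are assumptions made for illustration; they are not distiller requirements. The call to make_xml_document() is skipped when distiller is not installed.

```r
# Hypothetical master data: one row per aggregated cell, plus an extra
# `content_group_id` column used only for splitting (an assumption)
master <- data.frame(
  content_group_id  = c("AS-ED", "AS-ED", "AS-HOSP"),
  month             = c("01", "02", "01"),
  agegroup          = c(1, 2, 3),
  county            = c("06001", "06001", "U"),
  ethnicity         = c("H", "NH", "U"),
  race              = c("W", "B", "O"),
  health_outcome_id = c(1, 1, 1),
  sex               = c("F", "M", "U"),
  year              = c(2020, 2020, 2020),
  monthly_count     = c(4, 7, 2),
  stringsAsFactors  = FALSE
)

# One data frame per content group, named by its ID; the splitting column
# is dropped so only the documented columns remain
keep   <- names(master) != "content_group_id"
pieces <- split(master[, keep], master$content_group_id)

# Placeholder MCNs, one per content group (real ones are provided by EPHT)
mcns <- c("AS-ED" = "1234-1234-1234-1234-1234",
          "AS-HOSP" = "1234-1234-1234-1234-1235")

# Build one XML document per piece, only if distiller is available
if (requireNamespace("distiller", quietly = TRUE)) {
  docs <- lapply(names(pieces), function(id) {
    distiller::make_xml_document(
      data              = pieces[[id]],
      content_group_id  = id,
      mcn               = mcns[[id]],
      jurisdiction_code = "CA",                  # placeholder
      state_fips_code   = "06",                  # placeholder
      submitter_email   = "submitter@email.com", # placeholder
      submitter_name    = "Submitter Name",      # placeholder
      submitter_title   = "Submitter Title"      # placeholder
    )
  })
  names(docs) <- names(pieces)
}
```

Any other grouping tool would work equally well; split() and lapply() are used here only to keep the sketch free of extra dependencies.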

Users have the option to check their submission at the time of XML creation and get instant feedback on the possible validity of their submission. Users can further refine the returned XML document with their package of choice, such as xml2 or XML. When satisfied with the data product, users will need to write the XML document to a file and then submit it to the CDC as they usually would.
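As a rough sketch of the refine-and-write step, the following assumes the returned object is an xml2 document (the printed `{xml_document}` header in the Examples suggests so, but treat it as an assumption). A small stand-in document is built with read_xml() so the snippet runs without distiller; the file name is arbitrary.

```r
library(xml2)

# Stand-in for the object returned by make_xml_document() (assumption)
doc <- read_xml(
  "<HospitalizationData><Header><MCN>1234-1234-1234-1234-1234</MCN></Header></HospitalizationData>"
)

# Inspect a node before writing. Note: the real document declares a default
# namespace, so XPath queries on it would need xml_ns_strip() or a prefix.
mcn_node <- xml_find_first(doc, "//MCN")
xml_text(mcn_node)

# Write the document to disk, ready for upload to the test portal
out_file <- file.path(tempdir(), "AS-HOSP.xml")
write_xml(doc, out_file)
```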

Usage

make_xml_document(
  data,
  content_group_id,
  mcn,
  jurisdiction_code,
  state_fips_code,
  submitter_email,
  submitter_name,
  submitter_title,
  check_first = FALSE
)

Arguments

data: Pre-wrangled dataframe or tibble (see the Data section).
content_group_id: Code that identifies the content found in EPHT documentation.
mcn: Metadata Control Number provided by EPHT.
jurisdiction_code: Two-letter state abbreviation for the submitter state.
state_fips_code: FIPS code of the submitter state.
submitter_email: Email of person submitting data to EPHT.
submitter_name: First and last name of person submitting data to EPHT.
submitter_title: Title of person submitting data to EPHT.
check_first: Check the validity of your EPHT submission. Default is FALSE. If set to TRUE, distiller will run through its suite of metadata and data checks.

Value

XML document object

Data

Users are expected to wrangle, aggregate and otherwise implement the logic expected by EPHT themselves. distiller will not handle that step, though it does provide some helpers: collapse_race(), collapse_ethnicity(), and make_months_worse().
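To illustrate the kind of aggregation users are expected to do themselves, here is a minimal base R sketch that counts record-level visits into a monthly_count column. The `visits` data frame and its values are hypothetical, and this is only one way to aggregate; it does not use or stand in for the distiller helpers.

```r
# Hypothetical record-level data: one row per visit
visits <- data.frame(
  month             = c("01", "01", "01", "02"),
  agegroup          = c(3, 3, 5, 3),
  county            = c("06001", "06001", "06001", "U"),
  ethnicity         = c("NH", "NH", "H", "U"),
  race              = c("W", "W", "B", "U"),
  health_outcome_id = c(1, 1, 1, 1),
  sex               = c("F", "F", "M", "U"),
  year              = c(2020, 2020, 2020, 2020),
  stringsAsFactors  = FALSE
)

# Each visit contributes 1 to its cell
visits$monthly_count <- 1

# Sum visits within each combination of the grouping columns; only
# combinations that actually occur are returned, so every count is > 0
monthly <- aggregate(
  monthly_count ~ month + agegroup + county + ethnicity + race +
    health_outcome_id + sex + year,
  data = visits,
  FUN  = sum
)
```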

In order for distiller to work properly there are some expectations about the data that must be met:

The data must be a dataframe or tibble.
For all content_group_id values, the data must have the following columns (in any order):
- month: character - acceptable values: "01", "02", "03" ... "12"
- agegroup: numeric - acceptable values: 1-19
- county: character - string length of 5, unless unknown, then county = "U"
- ethnicity: character - acceptable values: "H", "NH", "U"
- race: character - acceptable values: "W", "B", "O", "U"
- health_outcome_id: numeric - acceptable values: 1-5
- sex: character - acceptable values: "M", "F", "U"
- year: numeric - acceptable values: 2001-9999
- monthly_count: numeric - acceptable values: >0 and no missing values
For content_group_id "CO-ED" and "CO-HOSP" the data must have the additional columns:
- fire_count: numeric - acceptable values: >0 and no missing values
- nonfire_count: numeric - acceptable values: >0 and no missing values
- unknown_count: numeric - acceptable values: >0 and no missing values
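The expectations above can be sketched as a small data frame plus a few hand-rolled checks. `missing_columns()` is a hypothetical helper written for this illustration, not a distiller function, and the data values are made up.

```r
required_cols <- c("month", "agegroup", "county", "ethnicity", "race",
                   "health_outcome_id", "sex", "year", "monthly_count")
co_cols <- c("fire_count", "nonfire_count", "unknown_count")

# Hypothetical helper: which expected columns are absent from `data`?
missing_columns <- function(data, content_group_id) {
  needed <- required_cols
  if (content_group_id %in% c("CO-ED", "CO-HOSP")) {
    needed <- c(needed, co_cols)
  }
  setdiff(needed, names(data))
}

# Made-up data that follows the documented types and values
example <- data.frame(
  month             = sprintf("%02d", 1:3),  # zero-padded character months
  agegroup          = c(1, 10, 19),
  county            = c("06001", "06075", "U"),
  ethnicity         = c("H", "NH", "U"),
  race              = c("W", "B", "O"),
  health_outcome_id = c(1, 2, 3),
  sex               = c("M", "F", "U"),
  year              = c(2019, 2020, 2021),
  monthly_count     = c(5, 12, 1),
  stringsAsFactors  = FALSE
)

missing_columns(example, "AS-ED")  # nothing missing
missing_columns(example, "CO-ED")  # the three CO-specific columns

# A few of the value rules, checked by hand
months_ok   <- all(example$month %in% sprintf("%02d", 1:12))
counties_ok <- all(nchar(example$county) == 5 | example$county == "U")
counts_ok   <- !anyNA(example$monthly_count) && all(example$monthly_count > 0)
```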

Content Group Identifier

The Content Group Identifier is the ID expected to be used by EPHT. It is a combination of the disease and facility type. Details on which ID to use can be found in the how-to guides provided by EPHT.

content_group_id must belong to one of the following:

"AS-ED", "AS-HOSP"
"MI-HOSP"
"CO-ED", "CO-HOSP"
"HEAT-ED", "HEAT-HOSP"
"COPD-ED", "COPD-HOSP"

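A quick way to guard against typos in the identifier before building a document is sketched below. `is_valid_id()` is a hypothetical helper, not part of distiller; the vector simply restates the identifiers listed in this section.

```r
valid_ids <- c("AS-ED", "AS-HOSP", "MI-HOSP", "CO-ED", "CO-HOSP",
               "HEAT-ED", "HEAT-HOSP", "COPD-ED", "COPD-HOSP")

# Hypothetical helper: TRUE only for a single, recognised identifier
is_valid_id <- function(id) {
  is.character(id) && length(id) == 1 && id %in% valid_ids
}

is_valid_id("AS-ED")  # a listed ID
is_valid_id("MI-ED")  # not in the list above

# Each ID combines a disease code and a facility type
parts <- strsplit("COPD-HOSP", "-", fixed = TRUE)[[1]]
```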
Metadata Control Number

The Metadata Control Number (mcn) is provided by EPHT and is used to identify the dataset and its content. Users who are in a position to submit data will already have a set of these.

Submission Check

If users set check_first = TRUE in make_xml_document(), or run check_submission() or any of the other check_* functions, then a suite of checks is run against the metadata, data structure and data content. Users do not need to run the whole suite of checks; they can run each function piecemeal on their data as it is being prepared.

With check_first = TRUE, make_xml_document() calls check_submission(), which is a wrapper around the individual check_* functions.

Examples

data <-
  mtcars |>
  dplyr::rename(
    month = mpg,
    agegroup = cyl,
    county = disp,
    ethnicity = hp,
    health_outcome_id = drat,
    monthly_count = wt,
    race = qsec,
    sex = vs,
    year = am
  ) |>
  dplyr::select(-c(gear, carb))

content_group_id <- "AS-HOSP"
mcn <- "1234-1234-1234-1234-1234"
jurisdiction_code <- "two_letter_code"
state_fips_code <- "1234"
submitter_email <- "submitter@email.com"
submitter_name <- "Submitter Name"
submitter_title <- "Submitter Title"

doc <- make_xml_document(
  data,
  content_group_id,
  mcn,
  jurisdiction_code,
  state_fips_code,
  submitter_email,
  submitter_name,
  submitter_title
)

doc
#> {xml_document}
#> <HospitalizationData schemaLocation="http://www.ephtn.org/NCDM/PH/HospitalizationData ephtn-ph-HospitalizationData.xsd" xmlns="http://www.ephtn.org/NCDM/PH/HospitalizationData" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
#> [1] <Header>\n  <MCN>1234-1234-1234-1234-1234</MCN>\n  <JurisdictionCode>two_ ...
#> [2] <Dataset>\n  <Row>\n    <RowIdentifier>1</RowIdentifier>\n    <AdmissionM ...