Create an XML document with user-provided data and submission metadata
Source: R/make_xml_document.R
This is the core function of distiller. It takes in a user-provided
dataframe with the expected columns and a set of metadata about the
submission. It then jumps through the XML hoops for the user and distills it
all down to an XML document that can be submitted to the CDC's EPHT
submission portal. Users should always check the outputs with the EPHT test
submission portal first.
Users are expected to start with a standard dataframe because the XML
requirements change based on the facility type (Hospital vs Emergency
Department) even though the variables and disease are the same. Users trade
some freedom in naming conventions and in return distiller abstracts away
the XML naming gymnastics. By having a single set of column names, users can
keep all their data in one place and do all the required wrangling on a
single set of data, all at once, and then use whatever package they wish to
split out the master data and iterate over it with distiller.
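The split-and-iterate workflow described above can be sketched in base R. This is only an illustration: the extra content_group_id column in the master data and the build_doc() stand-in are assumptions made so the example runs without distiller; in practice build_doc() would be replaced by a call to make_xml_document() with the real metadata.

```r
# Hypothetical master data with an extra `content_group_id` column
# (an assumption for this sketch, not a distiller requirement).
master <- data.frame(
  content_group_id = c("AS-ED", "AS-ED", "AS-HOSP"),
  month = c("01", "02", "01"),
  monthly_count = c(12, 7, 3),
  stringsAsFactors = FALSE
)

# Split into one data frame per content group.
by_group <- split(master, master$content_group_id)

# Stand-in for the real call; swap in make_xml_document() with your metadata.
build_doc <- function(data, content_group_id) {
  sprintf("%s: %d rows", content_group_id, nrow(data))
}

# Iterate over the pieces, dropping the grouping column before each call.
docs <- Map(
  function(d, id) build_doc(d[, setdiff(names(d), "content_group_id")], id),
  by_group,
  names(by_group)
)
```

The result is a named list with one entry per content group, so each document can be written out or inspected separately.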
Users can check their submission when the XML is created and get immediate feedback on whether it is likely to be valid. They can further refine the returned XML document with a package of their choice, such as xml2 or XML. When satisfied with the data product, users need to write the XML document to a file and then submit it to the CDC as they usually would.
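As a minimal sketch of that last step, assuming the returned object is an xml2 document (the printed `{xml_document}` output in the Examples suggests this), it can be inspected and written to disk with xml2. The small document below is a stand-in built by hand, not real distiller output, and the temporary file path is only for illustration.

```r
library(xml2)

# Stand-in document; in practice this would be the object returned by
# make_xml_document().
doc <- read_xml(
  "<HospitalizationData><Header><MCN>1234-1234-1234-1234-1234</MCN></Header></HospitalizationData>"
)

# Optional refinement or inspection with xml2 before writing.
mcn_text <- xml_text(xml_find_first(doc, "//MCN"))

# Write the document to a file (a temporary path here) ready for upload.
out_file <- tempfile(fileext = ".xml")
write_xml(doc, out_file)
```

Real distiller output declares a default namespace, so XPath queries against it may need a namespace prefix or `xml_ns_strip()` first.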
Usage
make_xml_document(
data,
content_group_id,
mcn,
jurisdiction_code,
state_fips_code,
submitter_email,
submitter_name,
submitter_title,
check_first = FALSE
)
Arguments
- data
Pre-wrangled Dataframe.
- content_group_id
Code that identifies the content found in EPHT documentation.
- mcn
Metadata Control Number provided by EPHT.
- jurisdiction_code
Two-letter state abbreviation for the submitter state.
- state_fips_code
FIPS code of the submitter state.
- submitter_email
Email of person submitting data to EPHT.
- submitter_name
First and last name of person submitting data to EPHT.
- submitter_title
Title of person submitting data to EPHT.
- check_first
Check the validity of your EPHT submission. Default is FALSE; if set to TRUE,
then distiller will run through its suite of metadata and data checks.
Data
Users are expected to wrangle, aggregate and otherwise implement the logic
expected by EPHT themselves. distiller will not handle that step, though it
does provide some helpers: collapse_race(), collapse_ethnicity(), and
make_months_worse().
In order for distiller to work properly there are some expectations about
the data that must be met:
- The data must be a dataframe or tibble.
- For all content_group_id values, the data must have the following columns (in any order):
  - month: character - acceptable values: "01", "02", "03" ... "12"
  - agegroup: numeric - acceptable values: 1-19
  - county: character - string length of 5, unless unknown, then county = "U"
  - ethnicity: character - acceptable values: "H", "NH", "U"
  - race: character - acceptable values: "W", "B", "O", "U"
  - health_outcome_id: numeric - acceptable values: 1-5
  - sex: character - acceptable values: "M", "F", "U"
  - year: numeric - acceptable values: 2001-9999
  - monthly_count: numeric - acceptable values: >0 and no missing values
- For content_group_id "CO-ED" and "CO-HOSP", the data must have the additional columns:
  - fire_count: numeric - acceptable values: >0 and no missing values
  - nonfire_count: numeric - acceptable values: >0 and no missing values
  - unknown_count: numeric - acceptable values: >0 and no missing values
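The expectations above can be illustrated with a small conforming data frame and a few base R pre-flight checks. This is a sketch only: the checks are hypothetical and are not distiller's own check_* functions, and the example values (such as the county code) are made up.

```r
# A tiny data frame that follows the column expectations listed above.
data <- data.frame(
  month = c("01", "02"),
  agegroup = c(1, 19),
  county = c("06001", "U"),
  ethnicity = c("H", "NH"),
  race = c("W", "U"),
  health_outcome_id = c(1, 1),
  sex = c("F", "M"),
  year = c(2020, 2020),
  monthly_count = c(4, 11),
  stringsAsFactors = FALSE
)

required <- c("month", "agegroup", "county", "ethnicity", "race",
              "health_outcome_id", "sex", "year", "monthly_count")

# Hypothetical checks: each failing rule contributes one message.
problems <- c(
  if (!all(required %in% names(data))) "missing columns",
  if (!all(data$month %in% sprintf("%02d", 1:12))) "bad month",
  if (!all(data$agegroup %in% 1:19)) "bad agegroup",
  if (!all(nchar(data$county) == 5 | data$county == "U")) "bad county",
  if (!all(data$ethnicity %in% c("H", "NH", "U"))) "bad ethnicity",
  if (!all(data$race %in% c("W", "B", "O", "U"))) "bad race",
  if (!all(data$health_outcome_id %in% 1:5)) "bad health_outcome_id",
  if (!all(data$sex %in% c("M", "F", "U"))) "bad sex",
  if (!all(data$year >= 2001 & data$year <= 9999)) "bad year",
  if (anyNA(data$monthly_count) || any(data$monthly_count <= 0)) "bad monthly_count"
)
```

An empty `problems` vector means the data passed these illustrative rules; it does not replace running distiller's checks or the EPHT test portal.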
Content Group Identifier
The Content Group Identifier is the ID expected to be used by EPHT. It is a combination of the disease and facility type. Details on which ID to use can be found in the how-to guides provided by EPHT.
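A mistyped ID is easy to catch before building the document. The sketch below uses the identifiers listed in this section; the two helper functions are hypothetical and are not part of distiller.

```r
# Identifiers accepted for content_group_id, as listed in this section.
valid_ids <- c("AS-ED", "AS-HOSP", "MI-HOSP", "CO-ED", "CO-HOSP",
               "HEAT-ED", "HEAT-HOSP", "COPD-ED", "COPD-HOSP")

# Hypothetical helper: TRUE only for a single, exactly matching ID.
is_valid_content_group <- function(id) {
  is.character(id) && length(id) == 1 && id %in% valid_ids
}

# Hypothetical helper: the CO groups need the extra
# fire_count / nonfire_count / unknown_count columns.
needs_co_columns <- function(id) id %in% c("CO-ED", "CO-HOSP")
```

Matching is case-sensitive here, so "as-hosp" would be rejected.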
content_group_id must belong to one of the following:
- "AS-ED", "AS-HOSP"
- "MI-HOSP"
- "CO-ED", "CO-HOSP"
- "HEAT-ED", "HEAT-HOSP"
- "COPD-ED", "COPD-HOSP"
Metadata Control Number
The Metadata Control Number (mcn) is provided by EPHT and is used to identify the dataset and its content. Users who submit data will already have a set of these.
Submission Check
If users set check_first = TRUE in make_xml_document(), or run
check_submission() or any of the other check_* functions, then a suite
of checks is run against the metadata, data structure and data content.
Please note that users do not need to run the whole suite of checks; they can
run each function piecemeal on their data as it is being prepared.
With check_first = TRUE, check_submission() is called, which is a wrapper
around the individual check_* functions.
See also
Other xml: make_dataset_node(), make_header_node(), make_root_element()
Examples
data <-
  mtcars |>
  dplyr::rename(
    month = mpg,
    agegroup = cyl,
    county = disp,
    ethnicity = hp,
    health_outcome_id = drat,
    monthly_count = wt,
    race = qsec,
    sex = vs,
    year = am
  ) |>
  dplyr::select(-c(gear, carb))
content_group_id <- "AS-HOSP"
mcn <- "1234-1234-1234-1234-1234"
jurisdiction_code <- "two_letter_code"
state_fips_code <- "1234"
submitter_email <- "submitter@email.com"
submitter_name <- "Submitter Name"
submitter_title <- "Submitter Title"
doc <- make_xml_document(
  data,
  content_group_id,
  mcn,
  jurisdiction_code,
  state_fips_code,
  submitter_email,
  submitter_name,
  submitter_title
)
doc
#> {xml_document}
#> <HospitalizationData schemaLocation="http://www.ephtn.org/NCDM/PH/HospitalizationData ephtn-ph-HospitalizationData.xsd" xmlns="http://www.ephtn.org/NCDM/PH/HospitalizationData" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
#> [1] <Header>\n <MCN>1234-1234-1234-1234-1234</MCN>\n <JurisdictionCode>two_ ...
#> [2] <Dataset>\n <Row>\n <RowIdentifier>1</RowIdentifier>\n <AdmissionM ...