Write a Data Parser
AutoParser assumes the use of Global.Health’s adtl package to transform your source data into a standardised format. To do this, adtl requires a TOML specification file which describes how raw data should be converted into the new format, on a field-by-field basis. Every unique data file format (i.e. unique sets of fields and data types) should have a corresponding parser file.
AutoParser exists to semi-automate the process of writing new parser files. This requires a data dictionary (which can be created if it does not already exist, see ‘Create Data dictionary’), and the JSON schema of the target format.
Parser generation is a two-step process.
Generate intermediate mappings (CSV)
First, an intermediate mapping file is created which can look like this:
| target_field | source_description | source_field | common_values | target_values | value_mapping |
|---|---|---|---|---|---|
| identity | Identity | Identité | | | |
| name | Full Name | Nom complet | | | |
| loc_admin_1 | Province | Province | Equateur, Orientale, Katanga, Kinshasa | | |
| country_iso3 | | | | | |
| notification_date | Notification Date | DateNotification | | | |
| classification | Classification | Classicfication | FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU | mammal, bird, reptile, amphibian, fish, invertebrate, None | mammifère=mammal, rept=reptile, fish=fish, oiseau=bird, amphibie=amphibian, poisson=fish |
| case_status | Case Status | StatusCas | Vivant, Décédé | alive, dead, unknown, None | décédé=dead, vivant=alive |
Columns named target_x describe the desired output format, while columns named source_x describe the raw data.
In this example, the final row shows that the case_status field in the desired output
format should be filled using data from the StatusCas field in the raw data. The value_mapping
column indicates that all instances of décédé in the raw data should be mapped to dead
in the converted file, and vivant should map to alive.
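To make the reading of the value_mapping column concrete, here is a minimal Python sketch that splits one cell into a lookup table and applies it to some raw values. It is only an illustration of the column's meaning, not how adtl or AutoParser performs the conversion. The helper names `parse_value_mapping` and `convert` are hypothetical, and the case-insensitive matching is an assumption based on the table, where the raw values are capitalised (Vivant) but the mapping keys are lower case (vivant).

```python
from typing import Optional


def parse_value_mapping(cell: str) -> dict:
    """Split a cell such as 'décédé=dead, vivant=alive' into a dict."""
    mapping = {}
    for pair in cell.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty cells and trailing commas
        source, _, target = pair.partition("=")
        # Assumption: matching is case-insensitive, so keys are lower-cased.
        mapping[source.strip().lower()] = target.strip()
    return mapping


def convert(raw_value: str, mapping: dict) -> Optional[str]:
    """Return the mapped value, or None when the raw value has no mapping."""
    return mapping.get(raw_value.strip().lower())


# The value_mapping cell from the case_status row of the table above.
case_status = parse_value_mapping("décédé=dead, vivant=alive")

# 'Inconnu' is a made-up raw value used to show the unmapped case.
converted = [convert(v, case_status) for v in ["Vivant", "Décédé", "Inconnu"]]
print(converted)  # ['alive', 'dead', None]
```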
These intermediate mappings must be reviewed and corrected by hand. They are generated by an LLM, which is prone to errors and hallucinations and may produce incorrect matches for either a field or the values within that field.
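Part of the manual review can be supported by simple consistency checks. The sketch below is not part of AutoParser; `check_row` and `split_cell` are hypothetical helpers, and the comma-separated cell format is assumed from the example table. For each row that lists target_values, it flags mapping targets that are not allowed values and common values that have no mapping.

```python
def split_cell(cell: str) -> list:
    """Split a comma-separated cell into stripped, non-empty items."""
    return [item.strip() for item in cell.split(",") if item.strip()]


def check_row(row: dict) -> list:
    """Return a list of human-readable problems found in one mapping row."""
    problems = []
    allowed = set(split_cell(row["target_values"]))
    if not allowed:
        return problems  # free-text field: nothing to validate

    mapped_sources = set()
    for pair in split_cell(row["value_mapping"]):
        source, _, target = pair.partition("=")
        mapped_sources.add(source.strip().lower())
        if target.strip() not in allowed:
            problems.append(
                f"{row['target_field']}: '{target.strip()}' is not an allowed target value"
            )

    # Assumption: raw values are matched case-insensitively.
    for value in split_cell(row["common_values"]):
        if value.lower() not in mapped_sources:
            problems.append(f"{row['target_field']}: '{value}' has no mapping")
    return problems


# The classification row from the example table.
good_row = {
    "target_field": "classification",
    "common_values": "FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU",
    "target_values": "mammal, bird, reptile, amphibian, fish, invertebrate, None",
    "value_mapping": "mammifère=mammal, rept=reptile, fish=fish, oiseau=bird, "
                     "amphibie=amphibian, poisson=fish",
}

# A deliberately broken variant of the case_status row, for illustration only.
bad_row = {
    "target_field": "case_status",
    "common_values": "Vivant, Décédé",
    "target_values": "alive, dead, unknown, None",
    "value_mapping": "vivant=living",
}

good_problems = check_row(good_row)
bad_problems = check_row(bad_row)
for problem in bad_problems:
    print(problem)
```

Checks like these only catch structural slips. Whether a source field has been matched to the right target field still needs a human reader.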
Generate TOML
This step is automated and should produce a TOML file that conforms to the adtl parser schema, ready to use for transforming data.
API
- autoparser.create_mapping(schema: Path, data_dictionary: str | DataFrame, language: str, api_key: str, llm: str | None = 'openai', config: Path | None = None, save: bool = True, file_name: str = 'mapping_file') → DataFrame
Creates a CSV containing the mapping between a data dictionary and a schema.
Takes a data dictionary and matches both the source fields and any common values to the schema. Uses an LLM first to match the source fields to appropriate schema targets, and then to match the common values to appropriate enum or boolean options.
- Parameters:
schema – Path to a JSON schema file.
data_dictionary – Path to a CSV or XLSX file, or a DataFrame, containing the data dictionary.
language – Language of the source data (e.g. french, english, spanish).
api_key – API key for the API defined in llm.
llm – Which LLM to use. Currently only ‘openai’ is supported.
config – Path to the file containing the configuration for autoparser.
- Returns:
DataFrame containing the mapping between the data dictionary and the schema.
- Return type:
pd.DataFrame
- autoparser.create_parser(mappings: DataFrame | str, schema_path: Path, parser_name: str, description: str | None = None, config='config/autoparser.toml')
Takes the CSV mapping file created by create_mapping and writes out a TOML parser.
Generates a TOML parser for use with adtl from the intermediate CSV file generated by create_mapping. The resulting TOML file can be used to parse raw data into the format expected by the schema.
- Parameters:
mappings – Path to the CSV file containing the mappings, or a DataFrame holding them.
schema_path – Path to the schema file.
parser_name – Name of the parser to create.
description – Description of the parser. Defaults to the parser name.
config – Path to the configuration file to use. Default is config/autoparser.toml.
- Return type:
None
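The two API calls can be combined into a workflow along the following lines. This is a minimal sketch based only on the signatures documented above. The wrapper functions `draft_mapping` and `write_parser`, the file paths in the comments, and the default language are hypothetical. The save and file_name arguments are left at their defaults because their exact behaviour is not described here.

```python
from pathlib import Path


def draft_mapping(schema: Path, data_dictionary: str, api_key: str,
                  language: str = "french"):
    """Step 1: ask the LLM for a draft mapping (returned as a DataFrame)."""
    import autoparser  # imported lazily so this module loads without it

    return autoparser.create_mapping(
        schema=schema,
        data_dictionary=data_dictionary,
        language=language,
        api_key=api_key,
    )


def write_parser(curated_mapping, schema: Path, parser_name: str) -> None:
    """Step 2: turn the manually curated mapping into a TOML parser."""
    import autoparser

    autoparser.create_parser(
        mappings=curated_mapping,  # path to the curated CSV, or a DataFrame
        schema_path=schema,
        parser_name=parser_name,
    )


# Example usage (hypothetical paths; requires autoparser and an API key):
#
#   draft = draft_mapping(Path("schema.json"), "data_dictionary.csv", api_key)
#   ... review and correct the mapping by hand ...
#   write_parser("curated_mapping.csv", Path("schema.json"), "my_parser")
```

Keeping the two steps in separate functions leaves room for the manual curation between them, which the automated calls cannot replace.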