Write a Data Parser

AutoParser assumes the use of Global.Health’s adtl package to transform your source data into a standardised format. To do this, adtl requires a TOML specification file which describes how raw data should be converted into the new format, on a field-by-field basis. Every unique data file format (i.e. unique sets of fields and data types) should have a corresponding parser file.

AutoParser exists to semi-automate the process of writing new parser files. This requires a data dictionary (which can be created if it does not already exist, see ‘Create Data dictionary’), and the JSON schema of the target format.

Parser generation is a two-step process.

Generate intermediate mappings (CSV)

First, an intermediate mapping file is created which can look like this:

| target_field | source_description | source_field | common_values | target_values | value_mapping |
|---|---|---|---|---|---|
| identity | Identity | Identité | | | |
| name | Full Name | Nom complet | | | |
| loc_admin_1 | Province | Province | Equateur, Orientale, Katanga, Kinshasa | | |
| country_iso3 | | | | | |
| notification_date | Notification Date | DateNotification | | | |
| classification | Classification | Classicfication | FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU | mammal, bird, reptile, amphibian, fish, invertebrate, None | mammifère=mammal, rept=reptile, fish=fish, oiseau=bird, amphibie=amphibian, poisson=fish |
| case_status | Case Status | StatusCas | Vivant, Décédé | alive, dead, unknown, None | décédé=dead, vivant=alive |

target_x refers to the desired output format, while source_x refers to the raw data. In this example, the final row shows that the case_status field in the desired output format should be filled using data from the StatusCas field in the raw data. The value_mapping column indicates that all instances of décédé in the raw data should be mapped to dead in the converted file, and that vivant should be mapped to alive.
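To make the value_mapping column concrete, here is a minimal Python sketch of how one such cell could be read and applied to raw values. It is an illustration only, not how adtl or AutoParser implement the conversion: the helper names `parse_value_mapping` and `map_value` are hypothetical, and the case-insensitive lookup is an assumption inferred from the example row, where the raw value is `Vivant` but the mapping key is `vivant`.

```python
from typing import Optional


def parse_value_mapping(cell: str) -> dict[str, str]:
    """Turn a cell such as 'décédé=dead, vivant=alive' into a dict."""
    mapping = {}
    for pair in cell.split(","):
        pair = pair.strip()
        if not pair:
            continue
        source, _, target = pair.partition("=")
        # Assumption: keys are compared case-insensitively.
        mapping[source.strip().lower()] = target.strip()
    return mapping


def map_value(raw: str, mapping: dict[str, str]) -> Optional[str]:
    """Look up a raw value, ignoring case and surrounding whitespace."""
    return mapping.get(raw.strip().lower())


case_status = parse_value_mapping("décédé=dead, vivant=alive")
# "Inconnu" is a made-up raw value with no mapping, so it yields None.
converted = [map_value(v, case_status) for v in ["Vivant", "Décédé", "Inconnu"]]
print(converted)  # ['alive', 'dead', None]
```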

These intermediate mappings should be manually curated. They are generated using an LLM, which may be prone to errors and hallucinations and can produce incorrect matches for either a field or the values within that field.
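Part of this curation can be supported by simple consistency checks. The sketch below is one possible approach and not part of AutoParser: the hypothetical helper `check_row` flags mapped values that are not among the row's target_values, and common values that have no mapping at all. It assumes comma-separated cells as in the example table and case-insensitive matching of source values. It cannot detect a mapping that is valid but semantically wrong, so a manual review is still needed.

```python
def split_cell(cell: str) -> list[str]:
    """Split a comma-separated cell into stripped, non-empty items."""
    return [item.strip() for item in cell.split(",") if item.strip()]


def check_row(common_values: str, target_values: str, value_mapping: str) -> list[str]:
    """Return a list of problems found in one row of the mapping file."""
    allowed = set(split_cell(target_values))
    mapping = {}
    for pair in split_cell(value_mapping):
        source, _, target = pair.partition("=")
        mapping[source.strip().lower()] = target.strip()

    problems = []
    for source, target in mapping.items():
        if target not in allowed:
            problems.append(f"'{source}' maps to '{target}', which is not a target value")
    for value in split_cell(common_values):
        if value.lower() not in mapping:
            problems.append(f"common value '{value}' has no mapping")
    return problems


# The classification row from the example table passes both checks.
ok = check_row(
    "FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU",
    "mammal, bird, reptile, amphibian, fish, invertebrate, None",
    "mammifère=mammal, rept=reptile, fish=fish, oiseau=bird, "
    "amphibie=amphibian, poisson=fish",
)
print(ok)  # []

# A deliberately broken row (invented for illustration): 'deceased' is not allowed.
bad = check_row("Vivant, Décédé", "alive, dead, unknown, None",
                "décédé=deceased, vivant=alive")
for problem in bad:
    print(problem)
```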

Generate TOML

This step is automated and should produce a TOML file that conforms to the adtl parser schema, ready for use in transforming data.

API

autoparser.create_mapping(schema: Path, data_dictionary: str | DataFrame, language: str, api_key: str, llm: str | None = 'openai', config: Path | None = None, save: bool = True, file_name: str = 'mapping_file') → DataFrame

Creates a CSV containing the mapping between a data dictionary and a schema.

Takes a data dictionary and matches both the source fields and any common values to the schema. Uses an LLM first to match the source fields to appropriate schema targets, and then to match the common values to appropriate enum or boolean options.

Parameters:
  • schema – Path to a JSON schema file.

  • data_dictionary – Path to a CSV or XLSX file, or a DataFrame, containing the data dictionary.

  • language – Language of the source data (e.g. french, english, spanish).

  • api_key – API key for the API defined in llm.

  • llm – Which LLM to use; currently only ‘openai’ is supported.

  • config – Path to a JSON file containing the configuration for autoparser.

Returns:

DataFrame containing the mapping between the data dictionary and the schema.

Return type:

pd.DataFrame

autoparser.create_parser(mappings: DataFrame | str, schema_path: Path, parser_name: str, description: str | None = None, config='config/autoparser.toml')

Takes the CSV mapping file created by create_mapping and writes out a TOML parser.

Generates a TOML parser for use with adtl from the intermediate CSV file generated by create_mapping. The resulting TOML file can be used to parse raw data into the format expected by the schema.

Parameters:
  • mappings – Path to the CSV file containing the mappings, or a DataFrame holding them.

  • schema_path – Path to the schema file.

  • parser_name – Name of the parser to create.

  • description – Description of the parser. Defaults to the parser name.

  • config – Path to the configuration file to use. Default is config/autoparser.toml.

Return type:

None
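The two functions above can be chained as in the following sketch, which uses only the signatures documented here. The wrapper `build_parser`, the parser name and the choice of French as the source language are placeholders chosen for illustration. The sketch also assumes that the default configuration file, config/autoparser.toml, is available to create_parser. `autoparser` is imported inside the function so that the snippet can be loaded without the package installed.

```python
from pathlib import Path


def build_parser(schema: Path, data_dictionary: str, api_key: str) -> None:
    """Run both parser-generation steps (hypothetical convenience wrapper)."""
    import autoparser  # lazy import: only needed when the function is called

    # Step 1: ask the LLM for a draft mapping, returned as a DataFrame.
    mapping = autoparser.create_mapping(
        schema,
        data_dictionary,
        language="french",  # placeholder: language of the source data
        api_key=api_key,
    )

    # Review and correct `mapping` by hand here before continuing,
    # since the LLM output may contain wrong field or value matches.

    # Step 2: generate the TOML parser from the curated mapping.
    autoparser.create_parser(mapping, schema, parser_name="example_parser")
```

Passing the DataFrame straight to create_parser relies on the `mappings: DataFrame | str` signature. If the mapping is curated in a spreadsheet instead, the path to the edited CSV file can be passed in its place.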