Creating a Data Dictionary

Motivation

A data dictionary is a structured guide which contains the details of a data file. It should contain, at minimum, a list of field/column names, and some kind of description of what data each field holds. This often takes the form of a textual description, plus a note of the data type (text, decimals, date, boolean…) and/or a set of expected values.

AutoParser requires a data dictionary for parser generation. This avoids sending potentially sensitive or confidential data to an external body (in this case an externally hosted LLM). Instead, a description of what the data looks like, taken from the dictionary, is sent to the LLM, which allows mapping to occur without risking the unintentional release of data.

Many data capture services such as REDCap will generate a data dictionary automatically when surveys are set up. However, where data is being captured either rapidly, or by individuals or small teams, a formal data dictionary may not have been created for the corresponding dataset. For this scenario, AutoParser provides functionality to generate a simple dictionary based on your data. This dictionary can then be used in other AutoParser modules.

Create a basic data dictionary

AutoParser will take your raw data file and create a basic data dictionary. For an example dataset of animals, a generated data dictionary looks like this:

source_field	source_type	common_values
Identité	string
Province	choice	Equateur, Orientale, Katanga, Kinshasa
DateNotification	string
Classicfication	choice	FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU
Nom complet	string
Date de naissance	string
AgeAns	number
AgeMois	number
Sexe	choice	F, M, f, m, f, m , inconnu

source_field contains each column header from the source data, and source_type shows the data type of each column. ‘choice’ indicates that a small set of repeated strings was detected, so AutoParser assumes the column uses a fixed set of terms and lists them in common_values.
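The type detection described above can be sketched with the standard library alone. This is an illustrative approximation, not AutoParser's implementation: the function name sketch_data_dictionary, the min_count threshold, the rule used to tell ‘number’ from ‘string’, and the sample rows are all assumptions made for the example.

```python
import csv
import io
from collections import Counter


def is_number(value: str) -> bool:
    """Return True if the text parses as a float."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def sketch_data_dictionary(rows, min_count=2):
    """Build one dictionary entry per column from a list of row dicts.

    min_count is a hypothetical threshold: a value that appears at least
    this many times in a column is treated as a common value.
    """
    if not rows:
        return []
    entries = []
    for field in rows[0].keys():
        # Ignore empty cells and surrounding whitespace when profiling.
        values = [(row.get(field) or "").strip() for row in rows]
        values = [v for v in values if v]
        counts = Counter(values)
        common = [v for v, n in counts.items() if n >= min_count]

        if values and all(is_number(v) for v in values):
            # Numeric columns get no common-values list in this sketch.
            source_type, common = "number", []
        elif common:
            source_type = "choice"
        else:
            source_type = "string"

        entries.append({
            "source_field": field,
            "source_description": "",  # left empty, to be filled in later
            "source_type": source_type,
            "common_values": ", ".join(common),
        })
    return entries


# Made-up sample rows that reuse three column headers from the example above.
sample = io.StringIO(
    "Identité,Province,AgeAns\n"
    "A1,Equateur,3\n"
    "A2,Kinshasa,7\n"
    "A3,Equateur,2\n"
    "A4,Kinshasa,5\n"
)
dictionary = sketch_data_dictionary(list(csv.DictReader(sample)))
for entry in dictionary:
    print(entry)
```

With these sample rows, Province is classed as ‘choice’ because both of its values repeat, while Identité stays ‘string’ because every value is unique. A real dataset would need a larger threshold than the one used here.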

Notice that the source_description column is empty. This is the default, so that the user can add a short text description of each field. Descriptions should be written in English, because the LLM reads this column in later steps and assumes the text is in English. For example, the description for the AgeMois column might be ‘Age in Months’.
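As a rough sketch of adding descriptions by hand, the snippet below fills source_description from a lookup table and then lists the fields that still have none. The dictionary is represented as a plain list of dicts to keep the example dependency-free (AutoParser itself returns a DataFrame), and the description text for AgeAns is an illustrative assumption.

```python
# Stand-in for a generated dictionary: one dict per row.
data_dict = [
    {"source_field": "AgeAns", "source_description": "", "source_type": "number"},
    {"source_field": "AgeMois", "source_description": "", "source_type": "number"},
    {"source_field": "Sexe", "source_description": "", "source_type": "choice"},
]

# Hand-written English descriptions, keyed by column header.
descriptions = {
    "AgeAns": "Age in Years",
    "AgeMois": "Age in Months",
}

# Copy each description into its row; unknown fields stay empty.
for row in data_dict:
    row["source_description"] = descriptions.get(row["source_field"], "")

# List the fields that still need a description before moving on.
missing = [row["source_field"] for row in data_dict if not row["source_description"]]
print(missing)
```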

If you would rather auto-generate these descriptions, AutoParser can use an LLM to automate this step. We strongly encourage all users to check the auto-generated descriptions for accuracy before using the data dictionary to generate a data parser.

API

autoparser.create_dict(data: DataFrame | str, config: Path | None = None) → DataFrame

Create a basic data dictionary from a dataset.

Creates a data dictionary from a dataset, including the field name, field type, and common values (defined as occurring more than 25 times in a column). Also creates an empty column for field descriptions, which can either be filled in by hand later, or auto-generated with an LLM using generate_descriptions().

Parameters:

data – Path to a CSV or XLSX file, or a DataFrame, containing the raw data.
config – Path to the configuration file to use, if not using the default configuration.

Returns:

Data dictionary containing field names, field types, and common values.

Return type:

pd.DataFrame

autoparser.generate_descriptions(data_dict: DataFrame | str, language: str, key: str | None = None, llm: str | None = 'openai', config: Path | None = None) → DataFrame

Generate descriptions for the columns in the dataset.

Uses an LLM to auto-generate descriptions for a data dictionary based on the column headers.
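
Putting the two functions together, a workflow might look roughly like the sketch below. It relies only on the signatures documented above. The wrapper name build_described_dictionary, its out_path argument, and the default output file name are hypothetical, and the expected form of the language argument is not specified in this section, so it is passed through unchanged.

```python
from pathlib import Path


def build_described_dictionary(data_path, language, api_key,
                               out_path="data_dictionary.csv"):
    """Create a dictionary, add LLM descriptions, and save it for review."""
    # Imported here so this module loads even where AutoParser is not installed.
    import autoparser

    # Step 1: field names, types, and common values; descriptions start empty.
    data_dict = autoparser.create_dict(data_path)

    # Step 2: ask the LLM for descriptions; 'llm' is left at its default, 'openai'.
    described = autoparser.generate_descriptions(data_dict, language, key=api_key)

    # Step 3: the result is a pandas DataFrame, so it can be written to CSV
    # and checked by hand before it is used to generate a parser.
    described.to_csv(Path(out_path), index=False)
    return described
```

Writing the result to a file gives a natural checkpoint for reviewing the auto-generated descriptions before any parser is built from them.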