Creating a Data Dictionary

Motivation

A data dictionary is a structured guide which contains the details of a data file. It should contain, at minimum, a list of field/column names, and some kind of description of what data each field holds. This often takes the form of a textual description, plus a note of the data type (text, decimals, date, boolean…) and/or a set of expected values.

AutoParser requires a data dictionary for parser generation. This avoids sending potentially sensitive or confidential data to an external body (in this case an externally hosted LLM); instead, a description of what the data looks like, taken from the dictionary, can be sent to the LLM, which allows mapping to occur without risking the unintentional release of data.

Many data capture services, such as REDCap, will generate a data dictionary automatically when surveys are set up. However, where data is being captured either rapidly, or by individuals or small teams, a formal data dictionary may not have been created for the corresponding dataset. For this scenario, AutoParser provides functionality to generate a simple dictionary based on your data. This dictionary can then be used in other AutoParser modules.

Create a basic data dictionary

AutoParser will take your raw data file and create a basic data dictionary. For an example dataset of animals, a generated data dictionary looks like this:

| source_field | source_description | source_type | common_values |
|---|---|---|---|
| Identité | | string | |
| Province | | choice | Equateur, Orientale, Katanga, Kinshasa |
| DateNotification | | string | |
| Classicfication | | choice | FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU |
| Nom complet | | string | |
| Date de naissance | | string | |
| AgeAns | | number | |
| AgeMois | | number | |
| Sexe | | choice | F, M, f, m, f, m , inconnu |

source_field contains each column header from the source data, and source_type shows the data type in each column. ‘choice’ denotes columns where a small set of strings has been detected; AutoParser assumes that specified terms are being used and lists them in common_values.
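To make the idea of type and choice detection concrete, here is a minimal standard-library sketch. It is not AutoParser's implementation: the helper names `is_number` and `infer_field`, and the `max_choices` rule for deciding that a column is a ‘choice’, are assumptions made for illustration. Only the `min_count` default follows the "more than 25 times" rule stated in the create_dict documentation.

```python
from collections import Counter


def is_number(text):
    """Return True if the string can be read as a number."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def infer_field(values, max_choices=10, min_count=25):
    """Guess a coarse type for one column of strings.

    Hypothetical helper: returns (type, common_values), where type is
    'number', 'choice' or 'string'. The max_choices rule is illustrative.
    """
    # Ignore empty cells when deciding on a type.
    present = [v for v in values if v.strip()]
    if present and all(is_number(v) for v in present):
        return "number", []
    counts = Counter(present)
    # A small set of repeated strings suggests a controlled vocabulary.
    if counts and len(counts) <= max_choices and len(counts) < len(present):
        common = [value for value, n in counts.most_common() if n > min_count]
        return "choice", common
    return "string", []


columns = {
    "Province": ["Equateur", "Katanga", "Equateur", "Kinshasa", "Katanga"],
    "AgeAns": ["3", "12", "7"],
    "Nom complet": ["Rex", "Luna", "Milo"],
}
for name, values in columns.items():
    # min_count=0 because this toy sample is far smaller than a real dataset.
    print(name, *infer_field(values, min_count=0))
```

Values are compared exactly as they appear, so variants such as "F" and "f" count as separate terms, which matches the mixed-case entries in the example table.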

Notice that the source_description column is empty. This is the default, so that the user can add a short text description in English (the LLM reads this column in later steps and assumes the text is written in English). For example, the description for the AgeMois column might be ‘Age in Months’.
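If the dictionary has been saved as a CSV file, descriptions can also be filled in with a short script rather than in a spreadsheet. The sketch below uses only the Python standard library and works on CSV text; `add_descriptions` is a hypothetical helper, not part of AutoParser, and the two-row dictionary is a cut-down version of the example above.

```python
import csv
import io


def add_descriptions(dictionary_csv, descriptions):
    """Return the dictionary CSV text with source_description filled in.

    `descriptions` maps source_field names to English descriptions;
    fields that are not in the mapping are left untouched.
    """
    reader = csv.DictReader(io.StringIO(dictionary_csv))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        field = row["source_field"]
        if field in descriptions:
            row["source_description"] = descriptions[field]
        writer.writerow(row)
    return out.getvalue()


dictionary_csv = (
    "source_field,source_description,source_type,common_values\n"
    "AgeAns,,number,\n"
    "AgeMois,,number,\n"
)
updated = add_descriptions(dictionary_csv, {"AgeMois": "Age in Months"})
print(updated)
```

Only AgeMois receives a description here; AgeAns stays empty until a description is supplied for it.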

If instead you would like to auto-generate these descriptions, AutoParser can use an LLM to automate this step. Note, we strongly encourage all users to check the results of the auto-generated descriptions for accuracy before proceeding to use the described data dictionary to generate a data parser.

API

autoparser.create_dict(data: DataFrame | str, config: Path | None = None) → DataFrame

Create a basic data dictionary from a dataset.

Creates a data dictionary from a dataset, including the field name, field type, and common values (defined as occurring more than 25 times in the columns). Also creates an empty column for field descriptions, which can either be filled in by hand later or auto-generated with an LLM using generate_descriptions().

Parameters:
  • data – Path to a CSV or XLSX file, or a DataFrame, containing the raw data.

  • config – Path to the configuration file to use if not using the default configuration

Returns:

Data dictionary containing field names, field types, and common values.

Return type:

pd.DataFrame
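A minimal usage sketch, assuming AutoParser is installed and the signature above is current. The wrapper `build_dictionary` and any file names passed to it are hypothetical; only `autoparser.create_dict` comes from the documentation.

```python
def build_dictionary(data_path, output_path):
    """Create a basic data dictionary and save it for manual review."""
    # Imported inside the function so this sketch can be loaded
    # even where AutoParser is not installed.
    import autoparser

    # `data` may be a path to a CSV or XLSX file, or a DataFrame.
    data_dict = autoparser.create_dict(str(data_path))
    # The result is a pandas DataFrame, so it can be written out as CSV
    # and the empty source_description column filled in afterwards.
    data_dict.to_csv(output_path, index=False)
    return data_dict
```

For example, `build_dictionary("animals.csv", "animals_dictionary.csv")` would write the dictionary next to the raw data, where both file names are placeholders.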

autoparser.generate_descriptions(data_dict: DataFrame | str, language: str, key: str | None = None, llm: str | None = 'openai', config: Path | None = None) → DataFrame

Generate descriptions for the columns in the dataset.

Uses an LLM to auto-generate descriptions for a data dictionary based on the column headers.

Parameters:
  • data_dict – Data dictionary containing the column headers, either as a dataframe or a path to the dictionary as a csv/xlsx file.

  • language – Language the column headers are in (e.g. french, spanish).

  • key – OpenAI API key.

  • llm – LLM API to call (currently only OpenAI is supported)

  • config – Path to the configuration file to use if not using the default configuration

Returns:

Data dictionary with descriptions added

Return type:

pd.DataFrame
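A minimal usage sketch, again assuming AutoParser is installed. The wrapper `describe_dictionary` is hypothetical, and reading the key from an `OPENAI_API_KEY` environment variable is an assumption of this example rather than something AutoParser requires.

```python
import os


def describe_dictionary(dictionary_path, language="french"):
    """Add LLM-generated descriptions to an existing data dictionary."""
    # Imported inside the function so this sketch can be loaded
    # even where AutoParser is not installed.
    import autoparser

    # Assumption: the OpenAI API key is kept in an environment variable
    # so that it does not appear in the script itself.
    api_key = os.environ.get("OPENAI_API_KEY")
    described = autoparser.generate_descriptions(
        dictionary_path,  # path to the dictionary, or a DataFrame
        language,         # language of the column headers
        key=api_key,
        llm="openai",     # currently the only supported option
    )
    return described
```

The returned DataFrame should be checked by hand before it is used to generate a parser, since the descriptions are produced from the column headers alone.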