Mapping Functions#

The following functions can be used to create the intermediate mapping CSV required to generate a parser.

autoparser.create_mapping(schema: Path, data_dictionary: str | DataFrame, language: str, api_key: str, llm: str | None = 'openai', config: Path | None = None, save: bool = True, file_name: str = 'mapping_file') DataFrame#

Creates a CSV containing the mapping between a data dictionary and a schema.

Takes a data dictionary and matches both the source fields and any common values to the schema. An LLM first matches the source fields to appropriate schema targets, then matches the common values to the appropriate enum or boolean options.

Parameters:
  • schema – Path to a JSON schema file.

  • data_dictionary – Path to a CSV or XLSX file, or a DataFrame, containing the data dictionary.

  • language – Language of the source data (e.g. french, english, spanish).

  • api_key – API key for the API defined in llm.

  • llm – Which LLM to use. Currently only ‘openai’ is supported.

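  • config – Path to a JSON file containing the configuration for autoparser.

  • save – Whether to save the mapping to a CSV file.

  • file_name – The name to use for the CSV file.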

Returns:

DataFrame containing the mapping between the data dictionary and the schema.

Return type:

pd.DataFrame
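
A minimal sketch of calling this function, assuming the package is importable as `autoparser` (as the signature above suggests) and that the schema and data dictionary already exist on disk. The wrapper name `build_mapping` and the file paths in the usage comment are hypothetical; only the keyword arguments are taken from the documented signature.

```python
from pathlib import Path


def build_mapping(schema_path, data_dictionary_path, api_key, language="french"):
    """Hypothetical wrapper around autoparser.create_mapping."""
    # Imported lazily so this module loads even where the package is absent.
    import autoparser

    return autoparser.create_mapping(
        schema=Path(schema_path),                   # path to the JSON schema file
        data_dictionary=str(data_dictionary_path),  # CSV/XLSX path; a DataFrame is also accepted
        language=language,                          # language of the source data
        api_key=api_key,                            # key for the API named in `llm`
        llm="openai",
        save=True,                                  # also save the mapping to a CSV file
        file_name="mapping_file",
    )


# Example call (placeholder paths; the key is read from an environment variable):
# import os
# mapping = build_mapping("schema.json", "data_dictionary.csv", os.environ["OPENAI_API_KEY"])
# print(mapping.head())
```

Reading the key from the environment rather than hard-coding it keeps credentials out of source control.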

Class definitions#

You can also interact with the base class Mapper.

class autoparser.Mapper(schema: Path, data_dictionary: str | DataFrame, language: str, api_key: str | None = None, llm: Literal['openai', 'gemini'] | None = 'openai', config: Path | None = None)#

Class for creating an intermediate mapping file linking the data dictionary to schema fields and values.

Use create_mapping() to write out the mapping file; it is the function equivalent of the command-line create-mapping script.

Parameters:
  • schema – The path to the schema file to map to.

  • data_dictionary – The data dictionary to use.

  • language – The language of the raw data (e.g. ‘fr’, ‘en’, ‘es’).

  • api_key – The API key to use for the LLM.

  • llm – The LLM to use. Currently only ‘openai’ and ‘gemini’ are supported.

  • config – The path to the configuration file to use if not using the default configuration.

property common_values: Series#

Returns the commonly repeated values in the source data. Usually this indicates that the source field is an enum or boolean.

create_mapping(save=True, file_name='mapping_file') DataFrame#

Creates an intermediate mapping DataFrame linking the data dictionary to schema fields. The index contains the target (schema) field names, and the columns are:

  • source_description

  • source_field

  • common_values OR choices (depending on the data dictionary)

  • target_values

  • value_mapping

Raises a warning if any fields are present in the schema where a corresponding source field in the data dictionary has not been found.

Parameters:
  • save – Whether to save the mapping to a CSV file. If True, lists in target_values and dicts in value_mapping are converted to strings before saving.

  • file_name – The name to use for the CSV file.
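
Because unmatched schema fields are reported as a warning and not as an exception, it can help to capture those warnings explicitly. The sketch below uses only the standard `warnings` module; the helper name `run_and_collect_warnings` is hypothetical, and the exact warning category and message text emitted by autoparser are not specified here.

```python
import warnings


def run_and_collect_warnings(func, *args, **kwargs):
    """Call func and return (result, list of warning messages as strings)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, including repeats
        result = func(*args, **kwargs)
    return result, [str(w.message) for w in caught]


# Example (assumes `mapper` is an already constructed autoparser.Mapper):
# mapping, messages = run_and_collect_warnings(mapper.create_mapping, save=False)
# for message in messages:
#     print("mapping warning:", message)
```

Collecting the messages this way lets a pipeline log them or fail early when required fields are left unmapped.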

match_fields_to_schema() DataFrame#

Use the LLM to match the target (schema) fields to the descriptions of the source data fields from the data dictionary.

match_values_to_schema() DataFrame#

Use the LLM to match the common values from the data dictionary to the target values in the schema - i.e. enum or boolean options.

property target_fields: list[str]#

Returns a list of fields in the target schema.

property target_types: dict[str, list[str]]#

Returns the field types of the target schema.

property target_values: Series#

Returns the enum values or boolean options for the target schema.
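
The methods and properties above can be combined into a step-by-step workflow. This is a sketch only: the function name `map_step_by_step` and its return structure are hypothetical, and it assumes the methods can be called in this order (fields first, then values), which mirrors the order described for create_mapping. Whether intermediate LLM results are reused by later calls is not stated here.

```python
from pathlib import Path


def map_step_by_step(schema_path, data_dictionary_path, api_key):
    """Hypothetical walk through the Mapper API, one stage at a time."""
    # Imported lazily so this module loads even where the package is absent.
    import autoparser

    mapper = autoparser.Mapper(
        schema=Path(schema_path),
        data_dictionary=str(data_dictionary_path),
        language="fr",
        api_key=api_key,
        llm="openai",
    )

    return {
        # Schema-side information, available without calling the LLM methods.
        "target_fields": mapper.target_fields,
        # Stage 1: match source field descriptions to schema fields.
        "field_matches": mapper.match_fields_to_schema(),
        # Stage 2: match common values to enum or boolean options.
        "value_matches": mapper.match_values_to_schema(),
        # Final mapping DataFrame; save=False skips writing the CSV file.
        "mapping": mapper.create_mapping(save=False),
    }
```

Running the stages separately makes it easier to inspect the field matches before the value matching is attempted.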