Parser construction example

This example walks through constructing a parser file, using animal_data.csv as the source dataset.

Before you start: autoparser needs an LLM API key, for either OpenAI or Gemini. Add yours to your environment, as described here. This example uses the OpenAI API; edit the API_KEY line below to match the name of the environment variable that holds your key.

If you would prefer to use Gemini, pass the llm argument to the functions that take the API key, e.g.

writer.generate_descriptions("fr", data_dict, key=API_KEY, llm='gemini')
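Because os.environ.get returns None when a variable is unset, a missing key only surfaces later, when the LLM client is created. A minimal sketch of failing early with a clearer message is shown here; load_api_key is a hypothetical helper for illustration, not part of autoparser.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `var_name`, or raise a clear error.

    Hypothetical helper: autoparser itself does not provide this function.
    """
    key = os.environ.get(var_name)
    if not key:
        # Covers both an unset variable and one set to an empty string.
        raise RuntimeError(
            f"Environment variable {var_name!r} is not set; "
            "add your LLM API key to the environment before continuing."
        )
    return key
```

With this in place, API_KEY = load_api_key() could stand in for the os.environ.get call below; pass a different variable name if you stored your key under another one.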

import autoparser
import pandas as pd
import os
API_KEY = os.environ.get("OPENAI_API_KEY")

# The path to the configuration file to use
config_path = "../../tests/test_config.toml"

data = pd.read_csv("../../tests/sources/animal_data.csv")
data.head()

	Identité	Province	DateNotification	Classicfication	Nom complet	Date de naissance	AgeAns	AgeMois	Sexe	StatusCas	DateDec	ContSoins	ContHumain Autre	AutreContHumain	ContactAnimal	Micropucé	AnimalDeCompagnie
0	A001	Equateur	2024-01-01	Mammifère	Luna	15/03/2022	2	10	f	Vivant	NaN	Oui	Non	Non	Oui	Oui	Oui
1	B002	Equateur	2024-15-02	FISH	Max	21/07/2021	3	4	m	Décédé	2024-06-01	Non	Oui	Voyage	Non	NON	Oui
2	C003	Equateur	2024-03-10	oiseau	Coco	10/02/2023	1	11	F	Vivant	NaN	Oui	Non	Non	Oui	Oui	Non
3	D004	NaN	2024-04-22	amphibie	Bella	05/11/2020	4	5	m	Vivant	NaN	Oui	NaN	Autres	Non	NON	Non
4	E005	NaN	2024-05-30	poisson	Charlie	18/05/2019	5	3	F	Décédé	2024-07-01	NaN	NaN	Voyage	Oui	Oui	Oui

Let’s generate a basic data dictionary from this dataset, using the configuration file set up for it in the tests directory.

writer = autoparser.DictWriter(config_path)
data_dict = writer.create_dict(data)
data_dict.head()

	Field Name	Description	Field Type	Common Values
0	Identité	NaN	string	NaN
1	Province	NaN	choice	Equateur, Orientale, Katanga, Kinshasa
2	DateNotification	NaN	string	NaN
3	Classicfication	NaN	choice	FISH, amphibie, oiseau, Mammifère, poisson, RE...
4	Nom complet	NaN	string	NaN

The ‘Common Values’ column is filled in for fields with a limited number of unique values, which suggests that the field has already been mapped to a controlled terminology, or that such a mapping might be required in the parser. For these fields, the column lists every unique value found in the data.
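To make the idea concrete, the sketch below lists the unique values of a column only when there are few of them. It is an illustration of the concept, not autoparser's implementation: the common_values function and the max_unique threshold are assumptions made for this example.

```python
def common_values(values, max_unique=3):
    """Return the unique non-missing values joined by ', ',
    or None when the column has more than `max_unique` distinct values.

    `max_unique` is a hypothetical threshold chosen for this sketch.
    """
    unique = []
    for value in values:
        if value is None or value == "":
            continue  # skip missing entries
        if value not in unique:
            unique.append(value)  # keep first-seen order
    if len(unique) > max_unique:
        return None  # too many distinct values: treat as free text
    return ", ".join(unique)


# Values taken from the first five rows shown above.
province = ["Equateur", "Equateur", "Equateur", None, None]
identite = ["A001", "B002", "C003", "D004", "E005"]

print(common_values(province))
print(common_values(identite))
```

On these five rows, Province collapses to a single repeated value while Identité is unique in every row, which is the distinction the ‘choice’ and ‘string’ field types reflect.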

Notice that the Description column is empty. The next step of the parser generation process creates the mapping file linking source fields to schema fields, and it requires this column to be filled. You can either fill it in by hand (the descriptions MUST be in English), or use autoparser’s LLM functionality to do it for you, as demonstrated below.

dd_described = writer.generate_descriptions("fr", data_dict, key=API_KEY)
dd_described.head()

Now that we have a data dictionary with descriptions added, we can proceed to creating an intermediate mapping file:

mapper = autoparser.Mapper("../../tests/schemas/animals.schema.json", dd_described, "fr", api_key=API_KEY, config=config_path)
mapping_dict = mapper.create_mapping(file_name='example_mapping.csv')

mapping_dict.head()

/Users/pipliggins/Documents/repos/autoparser/src/autoparser/create_mapping.py:258: UserWarning: The following schema fields have not been mapped: ['country_iso3', 'owner']
  warnings.warn(

	source_description	source_field	common_values	target_values	value_mapping
target_field
identity	Identity	Identité	NaN	NaN	NaN
name	Full Name	Nom complet	NaN	NaN	NaN
loc_admin_1	Province	Province	Equateur, Orientale, Katanga, Kinshasa	NaN	equateur=None, kinshasa=None, katanga=None, or...
country_iso3	None	NaN	NaN	NaN	NaN
notification_date	Notification Date	DateNotification	NaN	NaN	NaN

At this point, you should inspect the mapping file, which has been written out to example_mapping.csv, look for fields or values that have been mapped incorrectly, and edit them where necessary. A good example is the ‘loc_admin_1’ field: because the schema defines it as a free-text field, the LLM often maps the common values provided to ‘None’. Delete these mapped values, and the parsed data will contain the original free text instead. Also note the warning above: the LLM found no source fields to map to the ‘country_iso3’ or ‘owner’ schema fields. If the original data does contain an appropriate field for either of these, you should edit the mapping file accordingly.
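The check for value mappings that all point to ‘None’ can also be scripted. The sketch below assumes that each value_mapping cell is a comma-separated list of source=target pairs, as the truncated output above suggests; both helper functions are hypothetical and not part of autoparser.

```python
def parse_value_mapping(cell: str) -> dict:
    """Split a cell such as 'a=x, b=y' into {'a': 'x', 'b': 'y'}.

    Assumes the 'source=target, source=target' layout seen in the
    mapping output; adjust if your file is formatted differently.
    """
    pairs = {}
    for item in cell.split(","):
        item = item.strip()
        if not item:
            continue
        source, _, target = item.partition("=")
        pairs[source.strip()] = target.strip()
    return pairs


def clear_if_all_none(cell: str) -> str:
    """Return an empty string when every source value maps to 'None'."""
    pairs = parse_value_mapping(cell)
    if pairs and all(target == "None" for target in pairs.values()):
        return ""
    return cell


print(repr(clear_if_all_none("equateur=None, kinshasa=None, katanga=None")))
# The mapping below is a made-up example of one that should be kept.
print(repr(clear_if_all_none("f=female, m=male")))
```

Applying clear_if_all_none to the value_mapping column of example_mapping.csv is one way to flag candidates for review, but it is still worth reading each cleared row before saving, since a ‘None’ mapping can occasionally be intentional.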

Once you have edited the mapping file to your satisfaction, you can create the TOML parser file, example_parser.toml:

writer = autoparser.ParserGenerator("example_mapping.csv", "../../tests/schemas", "example", config=config_path)
writer.create_parser("example_parser.toml")

WARNING:root:Missing required field country_iso3 in animals schema. Adding empty field...

You can view or edit the created parser at example_parser.toml, and try it out using ADTL.