Getting started#
Installation#
AutoParser is a Python package that can either be used as a library in your own code or run through a command-line interface (CLI). You can install AutoParser using pip:
python3 -m pip install git+https://github.com/globaldothealth/autoparser
Installing into a virtual environment is usually recommended, and we suggest using uv to manage it. To create and activate a virtual environment for AutoParser using uv, run the following commands:
uv sync
. .venv/bin/activate
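As an optional sanity check (not part of AutoParser itself), the sketch below uses only the Python standard library to report whether the current interpreter is running inside a virtual environment. The helper name `in_virtualenv` is hypothetical and chosen purely for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

If this reports that no virtual environment is active, the activation command above was probably run in a different shell session.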
To see the options available in the CLI, type autoparser into the command line.
Other requirements#
AutoParser relies on LLMs to automatically map raw data fields to a target schema.
In order to use this tool, you will need an API key for either OpenAI
or Google’s Gemini.
AutoParser will use either OpenAI’s gpt-4o-mini or Google’s gemini-1.5-flash.
The LLM should never see your raw data. It only sees the data dictionary, which contains the column headers and text descriptions of what each field should contain.
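To make the distinction between raw data and a data dictionary concrete, here is a minimal sketch using only the standard library. It is not AutoParser's own implementation: the column names, values, descriptions, the helper `build_data_dictionary`, and the keys `field_name` and `description` are all invented for illustration. The point is that only the header row and the human-written descriptions end up in the result, never the data rows.

```python
import csv
import io

# Hypothetical raw data; the column names and values are invented.
RAW_CSV = """subject_id,age_years,outcome
001,34,recovered
002,58,died
"""

# Hypothetical free-text descriptions, written by someone who knows the data.
DESCRIPTIONS = {
    "subject_id": "Unique identifier for each participant",
    "age_years": "Age at enrolment, in whole years",
    "outcome": "Final outcome recorded at discharge",
}


def build_data_dictionary(raw_csv: str, descriptions: dict[str, str]) -> list[dict[str, str]]:
    """Pair each column header with its description, ignoring all data rows."""
    reader = csv.reader(io.StringIO(raw_csv))
    headers = next(reader)  # only the header row is read
    return [
        {"field_name": name, "description": descriptions.get(name, "")}
        for name in headers
    ]


if __name__ == "__main__":
    for entry in build_data_dictionary(RAW_CSV, DESCRIPTIONS):
        print(entry)
```

Because the function stops after the header row, values such as ages or outcomes never appear in the structure that would be described to the LLM.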
Supported file formats#
AutoParser supports CSV and XLSX formats for raw data and data dictionary files, and either JSON or TOML for the target schema.
Quickstart#
See the example notebook here for a basic walkthrough of AutoParser's functionality.