Configuration files using Pydantic and YAML
Often when working on custom Python tooling or command-line applications it can be helpful to have some sort of configuration file rather than passing a long list of flags and arguments in the shell. There are many configuration file formats in common use, such as INI, TOML, JSON and YAML. Generally I prefer YAML because of its easy-to-read structure and the simplicity with which it maps to Python objects, particularly dictionaries. YAML also supports comments, which can be a big plus when you want to explain what parameters are or why you chose a particular value for your deployment.
This post describes one implementation for managing YAML configurations using Pydantic with some improvements for usability and documentation.
In this example I'm going to demonstrate a configuration model for a program that reads CSV files. Many of the configuration parameters are based upon what pandas.read_csv would need, but this is just for demonstrative purposes.
Pydantic models
Pydantic is a great data/type validation library for Python. It uses Python type hints to perform operations such as schema validation, serialisation and deserialisation (serde) and is highly extensible and customisable. It is perfect for configuration as it will dump most simple types to JSON, which can then be exported to YAML via the pydantic-yaml plugin library.
The FileModel
Let's start by creating a very simple Pydantic model for a configuration file. The new class FileModel inherits from pydantic's BaseModel to benefit from Pydantic's verification, validation and serde functionality. It is possible to define a configuration parameter (Field) with just a name and type, for example separator: str. However, here we are more explicit and define a default value in case the user does not provide one in the configuration input.
from typing import List

from pydantic import BaseModel, Field

class FileModel(BaseModel):
    separator: str = Field(default=',')
    header_row: int = Field(default=0)
    skip_initial_space: bool = Field(default=True)
    skip_rows: List[int] | int | None = Field(default=None)
    skip_footer: int = Field(default=0)
    nrows: int | None = Field(default=None)
    comment: str | None = Field(default=None)
Whilst basic type checking is included with Pydantic (e.g. header_row will be coerced to int), additional contextual logic can be added to a Pydantic model using validators. For example, it does not make sense for header_row to be less than 0, as a file line index is never negative.
from pydantic import field_validator

class FileModel(BaseModel):
    ...

    @field_validator('header_row')
    @classmethod
    def _chk_header_rows(cls, v: int) -> int:
        if v < 0:
            raise ValueError("header_row must be greater than or equal to 0")
        return v
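To see the validator in action, constructing a FileModel with a negative header_row raises a ValidationError (output abbreviated):
>>> FileModel(header_row=-1)
Traceback (most recent call last):
  ...
pydantic_core._pydantic_core.ValidationError: 1 validation error for FileModel
header_row
  Value error, header_row must be greater than or equal to 0 [type=value_error, input_value=-1, input_type=int]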
The FeatureModel
The CSV files which the program will process contain a number of columns (features) that need to be processed. The program needs some additional information about each feature, like its name, data type and so forth. These parameters are repeated for each feature, so their configuration can be captured in a separate Pydantic model. This time no default is given for name or dtype, which forces the user to supply them for each feature specified in the configuration.
class FeatureModel(BaseModel):
    name: str
    dtype: str
    skip: bool = Field(default=False)
    na_values: str | None = Field(default=None)
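Because name and dtype have no defaults, omitting either raises a validation error (output abbreviated):
>>> FeatureModel(dtype='int')
Traceback (most recent call last):
  ...
pydantic_core._pydantic_core.ValidationError: 1 validation error for FeatureModel
name
  Field required [type=missing, input_value={'dtype': 'int'}, input_type=dict]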
Combining Models
The two models can now be combined, either by nesting the two models (which would create a sub-section for each model) or through inheritance. This implementation uses the latter, keeping the FileModel parameters at the top level of the YAML document and creating a new parameter called fields as a list of FeatureModel definitions. To do this, ConfigModel inherits FileModel and fields is added as a new parameter.
from typing import List

class ConfigModel(FileModel):
    fields: List[FeatureModel] = Field(default=[])
Pydantic is quite smart about this process and will even preserve the order of inheritance (which will be useful later). We can explore our combined ConfigModel with its default values in the interpreter.
>>> c = ConfigModel()
>>> c
ConfigModel(separator=',', header_row=0, skip_initial_space=True, skip_rows=None, skip_footer=0, nrows=None, comment=None, fields=[])
fields is empty in this case but can be populated when the ConfigModel is initialised.
>>> from config import FeatureModel
>>> f1 = FeatureModel(name='f1', dtype='int')
>>> f2 = FeatureModel(name='f2', dtype='str')
>>> c = ConfigModel(fields=[f1, f2])
>>> c
ConfigModel(separator=',', header_row=0, skip_initial_space=True, skip_rows=None,
skip_footer=0, nrows=None, comment=None, fields=[
FeatureModel(name='f1', dtype='int', skip=False, na_values=None),
FeatureModel(name='f2', dtype='str', skip=False, na_values=None)
])
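Because Pydantic validates nested models, plain dictionaries (which is exactly what a YAML parser produces) are coerced into FeatureModel instances too, so the following is equivalent:
>>> c = ConfigModel(fields=[{'name': 'f1', 'dtype': 'int'}, {'name': 'f2', 'dtype': 'str'}])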
Reading and writing the config to file
With the configuration models completed, a configuration file can be created with the default values. Additionally, the YAML configuration files will need to be loaded. Let's handle configuration using a new Config class with load and save methods.
Model to file
The save method accepts an instance of ConfigModel and a config_file path to write the file to. The to_yaml_str function from pydantic-yaml provides the serialisation functionality for Pydantic models.
import os

from pydantic_yaml import to_yaml_str

class Config:
    def __init__(self, config_file: os.PathLike):
        self._file = config_file
        self._config = None

    @classmethod
    def save(cls, config: ConfigModel, config_file: os.PathLike):
        with open(config_file, "w") as f:
            f.write(to_yaml_str(config))
The default model configuration file can then be created.
>>> Config.save(ConfigModel(), 'default.yaml')
Something isn't right though. The order of the fields has changed from the Pydantic model. More on this soon.
comment: null
fields: []
header_row: 0
nrows: null
separator: ','
skip_footer: 0
skip_initial_space: true
skip_rows: null
Model from file
Loading the model is simply the inverse operation and relies upon the pydantic-yaml parser function, which accepts the target Pydantic model class and the raw YAML as a string or IO stream.
from pydantic_yaml import parse_yaml_raw_as

class Config:
    ...

    def load(self) -> None:
        with open(self._file, "r") as f:
            self._config = parse_yaml_raw_as(ConfigModel, f.read())
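A quick round trip looks like this (peeking at the private _config attribute purely for demonstration):
>>> Config.save(ConfigModel(), 'default.yaml')
>>> cfg = Config('default.yaml')
>>> cfg.load()
>>> cfg._config.separator
','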
A custom YAML writer
Fixing the field order
Discovering the cause of the field reordering in the output YAML required some digging. The pydantic-yaml library uses the built-in serialisation and deserialisation functionality of Pydantic to achieve a YAML-compatible representation of the models, which includes a round trip through JSON. The JSON output from Pydantic preserves the field order correctly.
>>> ConfigModel().model_dump_json()
'{"separator":",","header_row":0,"skip_initial_space":true,"skip_rows":null,"skip_footer":0,"nrows":null,"comment":null,"fields":[]}'
Therefore the issue is in the conversion of the Pydantic model objects to YAML. As it turns out, pydantic-yaml uses the ruamel library under the hood with the keyword arguments typ="safe" and pure=True set. The typ="safe" option specifically disables some of the nicer preservation of objects that ruamel supports. Changing this to typ="rt", the round-trip version of the writer, preserves our field ordering.
>>> from ruamel.yaml import YAML
>>> from pydantic_yaml import to_yaml_str
>>> to_yaml_str(ConfigModel())
"comment: null\nfields: []\nheader_row: 0\nnrows: null\nseparator: ','\nskip_footer: 0\nskip_initial_space: true\nskip_rows: null\n"
>>> to_yaml_str(ConfigModel(), custom_yaml_writer=YAML(typ='rt'))
"separator: ','\nheader_row: 0\nskip_initial_space: true\nskip_rows:\nskip_footer: 0\nnrows:\ncomment:\nfields: []\n"
Updating the save method of the Config class to use a custom writer is pretty straightforward. We'll also include an extra option for the YAML writer that puts the YAML document separator at the start of the file, by setting explicit_start = True on the writer.
from ruamel.yaml import YAML

class Config:
    ...

    @classmethod
    def save(cls, config: ConfigModel, config_file: os.PathLike):
        writer = YAML(typ='rt', pure=True)
        writer.explicit_start = True
        with open(config_file, "w") as f:
            f.write(to_yaml_str(config, custom_yaml_writer=writer))
Now the default YAML file looks like this.
>>> Config.save(ConfigModel(), 'default.yaml')
---
separator: ','
header_row: 0
skip_initial_space: true
skip_rows:
skip_footer: 0
nrows:
comment:
fields: []
YAML comments
Pydantic has some great features for its fields, including field descriptions. Primarily this is for API documentation, but it can also be used for commenting our YAML document, so that when a fresh or default configuration file is produced, comments can be included alongside each field to help users understand what they do.
Going back to the FileModel defined earlier, fields can be given descriptions as an argument to the Field class. For example:
class FileModel(BaseModel):
    separator: str = Field(default=',', description="Character to treat as the delimiter.")
This approach is straightforward, but there is another way which reduces the amount of text we need to pass to Field(description=) and provides more intuitive support for multiline or long descriptions. It is an extension of docstrings, which can be defined below each field. To turn it on, the model configuration needs to set use_attribute_docstrings=True.
from pydantic import ConfigDict

class FileModel(BaseModel):
    model_config = ConfigDict(use_attribute_docstrings=True)

    separator: str = Field(default=',')
    "Character to treat as the delimiter."
The field metadata is stored on the model under the model_fields attribute.
>>> ConfigModel().model_fields
{'separator': FieldInfo(annotation=str, required=False, default=',', description='Character to treat as the delimiter.'), ...}
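This metadata can be used directly. For example, a simple help listing can be generated by iterating over model_fields (at this stage only separator has a description, so the other fields print None):
>>> for name, meta in ConfigModel.model_fields.items():
...     print(f"{name}: {meta.description}")
separator: Character to treat as the delimiter.
header_row: None
...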
Getting these descriptions into our default YAML as comments requires some further investigation of the ruamel library. There are a few classes in ruamel that support representations of YAML that can handle comments, specifically the CommentedMap class. pydantic-yaml uses the dump method of the YAML writer to output the YAML string, so perhaps we can wrap this method to insert our field descriptions as comments before the YAML string is created.
It turns out this is pretty straightforward to implement by subclassing the ruamel.yaml.YAML writer. The only addition is that the writer must now be model specific, so it has context about the fields it is writing out. Adding the comment to the CommentedMap instance is possible using the yaml_set_comment_before_after_key method, which takes a field key and a comment string that can be placed before or after the YAML item.
from typing import Any

from ruamel.yaml import YAML
from ruamel.yaml.comments import CommentedMap

class CommentedYaml(YAML):
    def __init__(self, model: BaseModel = None):
        self._model = model
        super().__init__(typ='rt', pure=True)

    def dump(self, data: Any, stream: Any) -> None:
        data = CommentedMap(data)
        for field, meta in self._model.model_fields.items():
            if meta.description:
                # Add the description for a field before the field
                data.yaml_set_comment_before_after_key(
                    field,
                    before="\n" + meta.description
                )
        return super().dump(data, stream)
Testing our new writer, the comment from the field description has been inserted before the field on a fresh line.
>>> to_yaml_str(ConfigModel(), custom_yaml_writer=CommentedYaml(ConfigModel()))
"# Character to treat as the delimiter.\nseparator: ','\n..."
The Config class save method must also be updated to use the new CommentedYaml class.
class Config:
    ...

    @classmethod
    def save(cls, config: ConfigModel, config_file: os.PathLike):
        with open(config_file, "w") as f:
            writer = CommentedYaml(config)
            writer.explicit_start = True
            f.write(to_yaml_str(config, custom_yaml_writer=writer))
Comments for repetitive fields
Adding comments for the fields in FileModel is relatively straightforward: each field is at the top level of the YAML document and has a unique name. However, the FeatureModel is added to the combined ConfigModel as fields: List[FeatureModel]. Ideally, something that documents the structure of FeatureModel before fields is defined in the YAML document would be useful. For example:
# The features/fields of your CSV to load.
# ----------------
# name: None
# dtype: None
# skip: None
# na_values: None
fields:
- name: f1
  dtype: str
Let's start by adding descriptions to FeatureModel and activating the use_attribute_docstrings configuration feature.
class FeatureModel(BaseModel):
    """A list of field specifications that describe the validation parameters for each field.
    ---------------------------------
    """
    model_config = ConfigDict(use_attribute_docstrings=True)

    name: str
    """The name of the field."""
    dtype: str
    """The expected data type of the field."""
    skip: bool = Field(default=False)
    """Ignore this field (drops from output)."""
    na_values: str | None = Field(default=None)
    """NA/Null value representation for this field. Null values will be ignored."""
Then we can extend FeatureModel with a method that creates the composite docstring we want from all the available docstrings in the model. It starts with the class docstring cls.__doc__ and adds the description for each field, using the same model_fields metadata as CommentedYaml.
class FeatureModel(BaseModel):
    ...

    @classmethod
    def get_docs(cls):
        docs = cls.__doc__
        fields = []
        for field, meta in cls.model_fields.items():
            fields.append(
                f" {field}: {meta.description}"
            )
        return docs + "\n".join(fields)
The get_docs method returns the composite document for the full model, just like we wanted.
>>> from config import FeatureModel
>>> FeatureModel.get_docs()
'A list of field specifications that describe the validation parameters for each field.\n---------------------------------\n name: The name of the field.\n dtype: The expected data type of the field.\n skip: Ignore this field (drops from output).\n na_values: NA/Null value representation for this field. Null values will be ignored.'
The final step is to set the description for fields in ConfigModel.
class ConfigModel(FileModel):
    fields: List[FeatureModel] = Field(default=[], description=FeatureModel.get_docs())
And the default YAML now looks like this:
>>> Config.save(ConfigModel(), 'default.yaml')
---
# Character to treat as the delimiter.
separator: ','
...
# A list of field specifications that describe the validation parameters for each field.
# ---------------------------------
# name: None
# dtype: None
# skip: None
# na_values: None
fields: []
Final Solution
Putting it all together, we now have a simple tool for reading and writing YAML configuration files that includes the handy creation of self-documenting default YAML files. The final solution also includes a preface comment at the top of the document which describes the file's purpose.
from typing import Any, List
import os

from pydantic import BaseModel, ConfigDict, Field, field_validator
from pydantic_yaml import parse_yaml_raw_as, to_yaml_str
from ruamel.yaml import YAML
from ruamel.yaml.comments import CommentedMap


class FileModel(BaseModel):
    model_config = ConfigDict(use_attribute_docstrings=True)

    separator: str = Field(default=',')
    "Character to treat as the delimiter."
    header_row: int = Field(default=0)
    "Row number containing column labels and marking the start of the data (zero-indexed)."
    skip_initial_space: bool = Field(default=True)
    "Skip spaces after delimiter."
    skip_rows: List[int] | int | None = Field(default=None)
    "Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file."
    skip_footer: int = Field(default=0)
    "Number of lines at bottom of file to skip."
    nrows: int | None = Field(default=None)
    "Number of rows of file to read. Useful for reading pieces of large files."
    comment: str | None = Field(default=None)
    """
    Character indicating that the remainder of line should not be parsed. If found at the beginning
    of a line, the line will be ignored altogether. This parameter must be a single character.
    """

    @field_validator('header_row')
    @classmethod
    def _chk_header_rows(cls, v: int) -> int:
        if v < 0:
            raise ValueError("header_row must be greater than or equal to 0")
        return v


class FeatureModel(BaseModel):
    """A list of field specifications that describe the validation parameters for each field.
    ---------------------------------
    """
    model_config = ConfigDict(use_attribute_docstrings=True)

    name: str
    """The name of the field."""
    dtype: str
    """The expected data type of the field."""
    skip: bool = Field(default=False)
    """Ignore this field (drops from output)."""
    na_values: str | None = Field(default=None)
    """NA/Null value representation for this field. Null values will be ignored."""

    @classmethod
    def get_docs(cls):
        docs = cls.__doc__
        fields = []
        for field, meta in cls.model_fields.items():
            fields.append(
                f" {field}: {meta.description}"
            )
        return docs + "\n".join(fields)


class ConfigModel(FileModel):
    """Configuration file for verifying data files that are messy and difficult to work with.
    """
    fields: List[FeatureModel] = Field(default=[], description=FeatureModel.get_docs())

    @field_validator('fields')
    @classmethod
    def _chk_field_names_unique(cls, v: List[FeatureModel]) -> List[FeatureModel]:
        names = [field.name for field in v]
        if len(names) != len(set(names)):
            raise ValueError("Field names must be unique")
        return v


class CommentedYaml(YAML):
    def __init__(self, model: BaseModel = None):
        self._model = model
        super().__init__(typ='rt', pure=True)

    def dump(self, data: Any, stream: Any) -> None:
        data = CommentedMap(data)
        # Add the class docstring as a preface comment at the top of the document
        if model_doc := self._model.__doc__:
            data.yaml_set_start_comment(model_doc)
        # Add each field description as a comment before the field
        for field, meta in self._model.model_fields.items():
            if meta.description:
                data.yaml_set_comment_before_after_key(
                    field,
                    before="\n" + meta.description
                )
        return super().dump(data, stream)


class Config:
    def __init__(self, config_file: os.PathLike):
        self._file = config_file
        self._config = None

    def load(self):
        with open(self._file, "r") as f:
            self._config = parse_yaml_raw_as(ConfigModel, f.read())

    @classmethod
    def save(cls, config: ConfigModel, config_file: os.PathLike):
        with open(config_file, "w") as f:
            writer = CommentedYaml(config)
            writer.explicit_start = True
            f.write(to_yaml_str(config, custom_yaml_writer=writer))
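To finish, a minimal usage sketch: write the self-documenting defaults to disk, then load them back (the filename default.yaml is just an example).
if __name__ == "__main__":
    # Write a default, self-documenting configuration file
    Config.save(ConfigModel(), "default.yaml")

    # Load it back into a validated ConfigModel
    cfg = Config("default.yaml")
    cfg.load()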