Dataclasses are great!
Data is everywhere these days. From statistics to healthcare, media, marketing and even sports, humans have been collecting and analyzing data ever since technology has provided the storage space for it.
In this article we will talk about dataclasses, a built-in Python module. We will also discuss what problems dataclasses solve and what you can and cannot do with them.
In this article
- What are dataclasses: The basics
- More specific parameters: field() and post init()
- Dataclasses as immutable objects: The frozen instance
- Sub-dataclasses: Inheritance
- Ordering dataclasses: Enable comparison
- Dataclasses as dictionaries: Compatibility with .json
- Summary and alternative libraries
What are dataclasses: The basics
Dataclasses were introduced in Python 3.7 and, as the name implies, a dataclass is a class whose main job is to hold data.
Let's look at an example of how to write a dataclass. First of all, import the dataclass function from the dataclasses module.
from dataclasses import dataclass
And now we can declare a dataclass like so.
@dataclass
class Player:
    '''Class to represent a football player.'''
    name: str
    age: int
    team: str = 'No team yet'

    def assign_team(self, new_team) -> None:
        self.team = new_team
The first thing you might notice is the @dataclass at the top, which is called a decorator. This decorator is actually the function dataclass(), which takes our Player class and adds some functionality to it, as we will see later.
Also notice how the attributes of the class are written following PEP 526 "Syntax for Variable Annotations", each having its own type annotation.
The syntax is as follows:
variable: type_annotation = default_value
Dataclasses also support default values, as in the team attribute: when no team is passed, its value is 'No team yet', as we can see when printing a Player object.
player_1 = Player('Victor O. Sullivan', 23)
print(player_1)
>>> Player(name='Victor O. Sullivan', age=23, team='No team yet')
player_1.assign_team('Black Eagles FC')
print(player_1)
>>> Player(name='Victor O. Sullivan', age=23, team='Black Eagles FC')
But keep in mind: type annotation doesn't mean type validation. Python is still a dynamically typed language, so the annotations are not enforced at runtime. They make the code easier to understand, but nothing stops you from passing a value of another type. *Note: A static type checker such as mypy can use the annotations to catch type errors before the code runs, but it does not validate anything at runtime.
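To make this concrete, here is a minimal sketch showing that a value of the wrong type is accepted without complaint. The Player class is declared again (without its method) only so the snippet runs on its own, and the string passed as age is just an illustration.

```python
from dataclasses import dataclass


@dataclass
class Player:
    '''Class to represent a football player.'''
    name: str
    age: int
    team: str = 'No team yet'


# age is annotated as int, but a str is accepted at runtime
player_2 = Player('Victor O. Sullivan', 'twenty-three')
print(player_2)
print(type(player_2.age))
```

No exception is raised here; the annotation is only a hint.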
Now let's check all the functions @dataclass created for us under the hood. For this we can use the inspect module, like so.
import inspect
Player_functions = inspect.getmembers(Player, inspect.isfunction)
for name, value in Player_functions:
    print(name, value)
>>>
__eq__ <function Player.__eq__ at 0x000001B512576200>
__init__ <function Player.__init__ at 0x000001B512576050>
__repr__ <function Player.__repr__ at 0x000001B512575FC0>
As you can see, some dunder methods were added to our Player class, most noticeably the __init__() and __repr__() methods, all thanks to the dataclass decorator. (Our own assign_team() method is listed too when you run this; it is left out of the output above.) The __repr__() method is responsible for the nice representation of the Player object each time we print() it, which is excellent for debugging.
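To get a feeling for the boilerplate being saved, here is a rough hand-written approximation of those three methods. ManualPlayer is a hypothetical name used only for this sketch, and the real generated code differs in its details.

```python
class ManualPlayer:
    '''Hand-written approximation of what @dataclass generates for Player.'''

    def __init__(self, name: str, age: int, team: str = 'No team yet') -> None:
        self.name = name
        self.age = age
        self.team = team

    def __repr__(self) -> str:
        return (f'{type(self).__name__}(name={self.name!r}, '
                f'age={self.age!r}, team={self.team!r})')

    def __eq__(self, other: object) -> bool:
        # Only instances of exactly the same class are compared field by field
        if other.__class__ is not self.__class__:
            return NotImplemented
        return (self.name, self.age, self.team) == (other.name, other.age, other.team)


manual_player = ManualPlayer('Victor O. Sullivan', 23)
print(manual_player)
print(manual_player == ManualPlayer('Victor O. Sullivan', 23))
```

Every new attribute would have to be added in three places here, while the dataclass version only needs one extra line.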
More specific parameters: field() and post init()
Let's try a different example, this time with users' data on a blog where they can comment.
For this new class, we would like a way to keep track of the comments made by every user, and that is where the field() function comes into play.
First, import field from the dataclasses module as usual.
from dataclasses import dataclass, field
Then the user class.
@dataclass
class User:
    '''Class to represent users on a blog where they can comment.'''
    username: str
    comments: list[str] = field(default_factory=list, repr=False)
Using the field() function allows us to better specify our constructor parameters. For example, if the default value is a mutable object like a list, we must set default_factory=list, because writing an empty list as the default will cause an error, as you can see here.
@dataclass
class User:
    '''Class to represent users on a blog where they can comment.'''
    username: str
    comments: list[str] = []
>>> ValueError: mutable default <class 'list'> for field comments is not allowed: use default_factory
Also, if you don't want to show a parameter in the __repr__() call, use repr=False.
user_1 = User('turner_rox', ['Thanks for sharing!', 'Why did you use that there?'])
print(user_1)
>>> User(username='turner_rox')
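The reason dataclasses refuse a plain [] default is that one list would be shared by every instance. The sketch below contrasts default_factory with a naive class-level list. NaiveUser and the usernames are made up for the illustration, and the plain list annotation is used only to keep the snippet short.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    username: str
    comments: list = field(default_factory=list, repr=False)


user_a = User('alice')
user_b = User('bob')
user_a.comments.append('First!')
print(user_a.comments)  # ['First!']
print(user_b.comments)  # [] because default_factory built a separate list


class NaiveUser:
    comments = []  # a single list object shared by every instance

    def __init__(self, username):
        self.username = username


naive_a = NaiveUser('alice')
naive_b = NaiveUser('bob')
naive_a.comments.append('First!')
print(naive_b.comments)  # ['First!'] because both instances share the list
```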
*Note: The list[str] annotation requires Python 3.9 or newer. On Python 3.7 and 3.8 you can get the same type hint from the typing module, which is also built-in. From the last example, the code can be changed like so.
...
from typing import List

@dataclass
class User:
    ...
    comments: List[str] = field(default_factory=list, repr=False) # Notice the change
Another common practice is defining a parameter from other parameters by using the field() function in conjunction with the __post_init__() method (which is exclusive to dataclasses).
Let's say you would like to implement a comments_counter variable that counts the comments for each user. You can do it like so.
from dataclasses import dataclass, field

@dataclass
class User:
    '''Class to represent users on a blog where they can comment.'''
    username: str
    comments: list[str] = field(default_factory=list, repr=False)
    comments_counter: int = field(default=0, init=False)

    def __post_init__(self):
        self.comments_counter = len(self.comments)
The __post_init__() method is called at the end of the generated __init__(), so the fields assigned in __init__() are already available to us. And now let's check what we did.
user_2 = User('davidjam87', ['Hey, I got a question.', 'Cool post', 'I did something really similar in ''here'''])
print(user_2)
>>> User(username='davidjam87', comments_counter=3)
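Two consequences of this setup are worth checking, sketched here under the same class definition (with a plain list annotation for brevity): a field declared with init=False cannot be passed to the constructor, and the counter is computed only once, so it does not follow later changes to the list.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    username: str
    comments: list = field(default_factory=list, repr=False)
    comments_counter: int = field(default=0, init=False)

    def __post_init__(self):
        self.comments_counter = len(self.comments)


rejected = False
try:
    # comments_counter is not an __init__ parameter, so a third argument fails
    User('davidjam87', ['Cool post'], 5)
except TypeError as error:
    rejected = True
    print('TypeError:', error)

user_3 = User('davidjam87', ['Cool post'])
user_3.comments.append('Another comment')
print(user_3.comments_counter)  # still 1: __post_init__ ran only at creation
```

If the counter must always be up to date, a property that returns len(self.comments) may be a better fit than a stored field.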
Dataclasses as immutable objects: The frozen instance
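If a class that represents an immutable object is what you need, you can fake it with the frozen=True parameter. I say "fake it" because Python is dynamic and nothing can change that... ironically. The frozen class only blocks normal attribute assignment, and a determined caller can still get around it with object.__setattr__().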
@dataclass(frozen=True)
class Color:
    '''Class to represent a Color in RGB format'''
    r: int # representing the Red channel
    g: int # representing the Green channel
    b: int # representing the Blue channel
With the class Color "frozen", whenever we try to change one of its parameters after creation, an exception is raised.
red = Color(255, 0, 0)
red.g = 100
>>> dataclasses.FrozenInstanceError: cannot assign to field 'g'
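When you need a modified version of a frozen object, the dataclasses module offers replace(), which builds a new instance instead of changing the old one. The sketch below shows that, followed by the object.__setattr__() escape hatch that makes the immutability only a convention. The orange value is just an illustration.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class Color:
    r: int
    g: int
    b: int


red = Color(255, 0, 0)

try:
    red.g = 100
except FrozenInstanceError as error:
    print('FrozenInstanceError:', error)

# replace() returns a new instance and leaves the original untouched
orange = replace(red, g=165)
print(red)     # Color(r=255, g=0, b=0)
print(orange)  # Color(r=255, g=165, b=0)

# The guard can still be bypassed on purpose
object.__setattr__(red, 'g', 100)
print(red)     # Color(r=255, g=100, b=0)
```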
Sub-dataclasses: Inheritance
If it's not obvious at this point, dataclasses are still classes, so inheritance is allowed just as with regular classes. Let's create a sub-class TransparentColor that inherits from our Color class.
We will do this by adding a new parameter a to our new TransparentColor class. It stands for "alpha" and represents how transparent the color is, 0 being fully transparent (invisible) and 255 being fully opaque.
@dataclass(frozen=True)
class TransparentColor(Color): # sub-class of Color
    '''Class to represent a Color in RGBA format'''
    a: int # representing the Alpha channel (transparency)
blue = TransparentColor(0, 0, 255, 100)
print(blue)
>>> TransparentColor(r=0, g=0, b=255, a=100)
You must keep in mind 2 things when doing inheritance:
- If the super-class has at least 1 default_value for 1 of its parameters, all of the new parameters added by the sub-class must have default_values too, because a parameter without a default cannot follow one that has a default.
- If the super-class is frozen, the sub-class must also be frozen (and the other way around).
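Both rules are enforced when the sub-class is created, not when it is used. The following sketch triggers each error on purpose. The default on b and the MutableColor class are assumptions added only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Color:
    r: int
    g: int
    b: int = 0  # default added only for this illustration


errors = []

try:
    @dataclass(frozen=True)
    class TransparentColor(Color):
        a: int  # no default, but it would come after b, which has one
except TypeError as error:
    errors.append(str(error))

try:
    @dataclass  # not frozen, while the super-class is
    class MutableColor(Color):
        a: int = 255
except TypeError as error:
    errors.append(str(error))

for message in errors:
    print('TypeError:', message)
```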
Ordering dataclasses: Enable comparison
When it comes to data, we often need to compare and sort it.
In dataclasses, the eq parameter is True by default, so an __eq__() method is written for you. It checks whether 2 objects of the same class hold the same data, and if so they are considered equal, which is really useful and saves you from writing the method by hand.
redish_green = Color(100, 100, 0)
greenish_red = Color(100, 100, 0)
print(redish_green == greenish_red)
>>> True
The other comparison methods __lt__, __le__, __gt__, and __ge__ are also generated for you if you pass the order=True parameter to the @dataclass decorator. In that case, objects are compared as if they were tuples of their fields, in the order the fields are declared, so each field uses its type's own ordering: magnitude for int and lexicographic (roughly alphabetical) order for str. If you want full control over how the comparisons work, you should write these methods yourself.
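As a small sketch of how order=True behaves, the simplified Player below (no team, made-up names) is sorted first by name and then by age, because that is the order in which the fields are declared. When another criterion is needed, sorted() with a key function is often simpler than writing the comparison methods by hand.

```python
from dataclasses import dataclass


@dataclass(order=True)
class Player:
    name: str
    age: int


players = [Player('Zed', 19), Player('Adam', 30), Player('Adam', 25)]

# Compared like the tuples (name, age): name first, age breaks ties
ranked = sorted(players)
print(ranked)

# A different criterion without touching the class
by_age = sorted(players, key=lambda player: player.age)
print(by_age)
```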
Dataclasses as dictionaries: Compatibility with .json
One last thing to know about dataclasses is that they can be transformed into a dictionary with the function asdict(object), which returns a dictionary with the data of the object passed.
from dataclasses import asdict, astuple
yellow = Color(255, 255, 0)
dict_color_yellow = asdict(yellow)
print(dict_color_yellow)
>>> {'r': 255, 'g': 255, 'b': 0}
This is very useful because the objects stored in a .json file map directly to Python dictionaries, so we can store the data in a .json file and then read from the same file to recreate the same objects. Here is a way to do it.
import json

data_out = [asdict(yellow), asdict(red)]

# Write to the json file
with open('data.json', mode='w') as json_out:
    json.dump(data_out, json_out)

# Read the json file we created
with open('data.json', mode='r') as json_in:
    data_in = json.load(json_in)

colors_db = [Color(**item) for item in data_in]
print(colors_db)
>>> [Color(r=255, g=255, b=0), Color(r=255, g=0, b=0)]
Another function at your disposal is astuple(object), which returns a tuple, if that is what you need.
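Here is a short sketch of astuple(), together with one detail that matters for the JSON round trip: both functions recurse into nested dataclasses. The Pixel class is hypothetical and exists only to show the nesting.

```python
from dataclasses import dataclass, asdict, astuple


@dataclass(frozen=True)
class Color:
    r: int
    g: int
    b: int


@dataclass
class Pixel:  # hypothetical class, only for this illustration
    x: int
    y: int
    color: Color


yellow = Color(255, 255, 0)
print(astuple(yellow))  # (255, 255, 0)

pixel = Pixel(10, 20, yellow)
pixel_dict = asdict(pixel)
pixel_tuple = astuple(pixel)
print(pixel_dict)   # {'x': 10, 'y': 20, 'color': {'r': 255, 'g': 255, 'b': 0}}
print(pixel_tuple)  # (10, 20, (255, 255, 0))
```

Note that Pixel(**pixel_dict) would leave color as a plain dictionary, so nested objects have to be rebuilt explicitly, for example with Color(**pixel_dict['color']).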
Summary and alternative libraries
In this article we have covered mainly 2 things.
- The basics of dataclasses and the functionality they can provide to our regular classes.
- Dataclasses save us a lot of time writing boilerplate code, and improve the comprehension of our code tremendously.
Alternatives to dataclasses
Even though they are very useful, dataclasses are not the definitive answer to every programming challenge involving data. Here are other similar tools that are also used for storing data in Python classes, so it's up to you to decide which one suits you best.
Feature | dataclass | namedtuple | pydantic |
---|---|---|---|
Built-in | Yes ✅ | Yes ✅ | No ❌ |
Default values | Yes ✅ | Yes ✅ | Yes ✅ |
Type validation | No ❌ | No ❌ | Yes ✅ |
So, if you really need type validation at runtime and you don't want to implement it yourself, pydantic might be what you need. A static checker such as mypy, as mentioned earlier, only checks the code before it runs.
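For simple cases, "implementing it yourself" can be as small as a check in __post_init__(). The sketch below is a deliberately limited assumption of how that could look, not a replacement for pydantic: CheckedPlayer is a hypothetical name, and the check only works when every annotation is a plain class such as str or int (not list[str], Optional, or string annotations).

```python
from dataclasses import dataclass, fields


@dataclass
class CheckedPlayer:  # hypothetical name, only for this illustration
    name: str
    age: int
    team: str = 'No team yet'

    def __post_init__(self):
        # Assumes each annotation is a plain class usable with isinstance()
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f'{f.name} must be {f.type.__name__}, '
                    f'got {type(value).__name__}'
                )


valid_player = CheckedPlayer('Victor O. Sullivan', 23)

message = None
try:
    CheckedPlayer('Victor O. Sullivan', 'twenty-three')
except TypeError as error:
    message = str(error)
    print('TypeError:', message)
```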
Dataclasses are still a great built-in option when it comes to storing data in Python.
Thanks for reading. Hope you find this helpful!