Up to this point, all the data we’ve worked with in Python have been stored in objects that are instances of the built-in types that come with Python, like ints and lists. Python’s built-in data types are powerful, but are not always the most intuitive way to store data. For example, we saw in 5.1 Tabular Data that we could use a list of lists to represent tabular data. One of the downsides of this approach is that when working with this data, the onus is on us to remember which list element corresponds to which component of the data.
>>> import datetime
>>> row = [1657, 'ET', 80, datetime.date(2011, 1, 1)]
>>> row[0] # The id
1657
>>> row[1] # The name of the civic centre
'ET'
>>> row[2] # The number of marriage licenses issued
80
>>> row[3] # The time period
datetime.date(2011, 1, 1)
You can imagine how error-prone this might be. A simple “off by one” error for an index might retrieve a completely different data type. It also makes our code difficult to read; the reader must know what each index of the list represents. And, as more experienced programmers will tell you, readable code is crucial.
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” –Martin Fowler
So a row in our marriage license data set is made up of four data elements. It would be nice if, instead of indices, we could use names that were reflective of each piece of data. Certainly, we could use a dictionary (instead of a list), with string keys naming each piece of data. But there is a more robust option we’ll learn about in this section: creating our own data types.
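As a quick sketch of the dictionary option mentioned above, here is the same row stored with string keys. The key names are assumptions chosen for illustration; they do not come from the data set itself.

```python
import datetime

# The same row as before, but keyed by descriptive strings instead of positions.
# The key names below are hypothetical, chosen only for this sketch.
row = {
    'id': 1657,
    'civic_centre': 'ET',
    'num_licenses': 80,
    'month': datetime.date(2011, 1, 1),
}

# Look up values by name rather than by index.
print(row['civic_centre'])
print(row['num_licenses'])
```

This reads better than row[1] and row[2], but a misspelled key is only detected when that line runs, and nothing guarantees that every row dictionary has the same keys.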
You might remember from 2.1 Python’s Built-In Functions that in Python, another term for “data type” is class. This is why type(3) evaluates to <class 'int'> in Python. The built-in data types we’ve studied so far illustrate how rich and complex data types can be in Python. So in learning to create our own data types, we will first learn about the simplest kind of data type: a data class, which is a kind of class whose purpose is to bundle individual pieces of data into a single Python object.
For example, suppose we want to represent a “person” consisting of a given name, family name, age, and home address. We already know how to represent each individual piece of data: the given name, family name, and address could be strings, and the age could be a natural number. To bundle these values together, we could use a list or other built-in collection data type, but that approach would run into the issues we discussed above.
So instead, we define our own data class to create a new data type consisting of these four values. Here is the way to create a data class in Python:
from dataclasses import dataclass


@dataclass
class Person:
    """A custom data type that represents data for a person.
    """
    given_name: str
    family_name: str
    age: int
    address: str
Let’s unpack this definition.
from dataclasses import dataclass is a Python import-from statement that lets us use dataclass below.
@dataclass is a Python decorator. We’ve already seen decorators for function definitions when using the hypothesis library for property-based testing. A decorator for a class definition works in the same way, acting as a modifier for our definition. In this case, @dataclass tells Python that the data type we’re defining is a data class.
class Person: is the syntax for the start of a class definition. This is similar to function definitions, except we use the class keyword instead of def. The name of the data class is Person.
The rest of the code for the class is indented to put it inside of the class body.
The next line, """A custom data type..., is a docstring that describes the purpose of the data class.
Each remaining line (starting with given_name: str) defines a piece of data associated with the data class; each piece of data is called an instance attribute (often shortened to just attribute) of the data class.
For each instance attribute, we write a name and a type annotation. This is similar to defining function parameter names and types, though of course the purposes are different here.
In general, a data class definition in Python has the following syntax:
@dataclass
class <ClassName>:
    """Description of the data class.
    """
    <attribute1>: <type1>
    <attribute2>: <type2>
    ...
Now that we’ve seen how to define a data class, we are ready to actually put it to use. For built-in Python data types, we know how to create values of those types: type in literals like 3 or ['hi', 'bye']. But with our Person data class, what is the corresponding literal we can write?
The answer is that there isn’t one: the possible literals of the Python programming language are fixed, and can’t be changed even after defining a new data type. But all is not lost! By defining a Person data class, we have gained the ability to call the data class like a function to create values whose type is Person. We’ve actually seen this before: we create range values by calling range like a function, e.g. range(1, 10), and just in the last section we created datetime.date objects like datetime.date(2011, 1, 1). Here is an example of creating a Person value, passing in as arguments the values for each instance attribute:
>>> david = Person('David', 'Liu', 100, '40 St. George Street')
Pretty cool! That line of code creates a new Person value whose given name is 'David', family name is 'Liu', age is 100, and address is '40 St. George Street', and stores the value in the variable david. The type of this new value is, as we’d expect, Person:
>>> type(david)
<class '__main__.Person'>
One new piece of terminology: we say that the value david refers to is an instance of the Person data class. In other words, the phrases “david has type Person” and “david is an instance of Person” mean the same thing. This explains why we refer to the bundled data in a data class as instance attributes, since they are pieces of data that are associated with a particular value of that type.
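The equivalence of “has type” and “is an instance of” can be checked directly. The following is a minimal sketch; isinstance is a built-in Python function that this section does not otherwise use, and the Person definition is repeated only so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """A custom data type that represents data for a person."""
    given_name: str
    family_name: str
    age: int
    address: str


david = Person('David', 'Liu', 100, '40 St. George Street')

# Both checks express the same fact: david is an instance of Person.
print(type(david) is Person)
print(isinstance(david, Person))
```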
If we evaluate the Person value, we see the different pieces of data that have been bundled together:
>>> david
Person(given_name='David', family_name='Liu', age=100, address='40 St. George Street')
But from a Person value, how do we extract the individual pieces of data we bundled together? If we were using lists, we’d simply do list indexing: david[0], david[1], etc. The syntax for Python classes improves this because we can use the names of the instance attributes together with dot notation to access these values:
>>> david.given_name
'David'
>>> david.family_name
'Liu'
>>> david.age
100
>>> david.address
'40 St. George Street'
This is much more readable than list indexing, and this is one of the major advantages of using data classes over lists to represent custom data in Python.
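To connect this back to the marriage license row from the start of the section, here is a sketch of how that row might look as a data class. The class name MarriageData and its attribute names are assumptions made for this example, not definitions given earlier.

```python
import datetime
from dataclasses import dataclass


@dataclass
class MarriageData:
    """A record of the marriage licenses issued by one civic centre in one time period.

    The class name and attribute names are hypothetical, chosen for this sketch.
    """
    id: int
    civic_centre: str
    num_licenses: int
    month: datetime.date


row = MarriageData(1657, 'ET', 80, datetime.date(2011, 1, 1))

# Each piece of data is now accessed by name instead of by index.
print(row.civic_centre)
print(row.num_licenses)
```

Compared with row[1] and row[2], the attribute names state what each value means, so a reader no longer needs to remember the column order.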
One challenge when creating instances of our data classes is keeping track of which arguments correspond to which instance attributes. In the expression Person('David', 'Liu', 100, '40 St. George Street'), the order of the arguments must match the order in which the instance attributes are listed in the definition of the data class, and it’s our responsibility to remember this order. Think about how easy it would be for us to write Person('Liu', 'David', 100, '40 St. George Street'), only to discover much later in our program that we accidentally switched this poor fellow’s given and family names! This is exactly the same problem we had when using a list to represent marriage license records.
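The following sketch illustrates why this mistake is easy to miss: Python accepts the swapped arguments without complaint, because both are strings. The variable names intended and swapped are introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """A custom data type that represents data for a person."""
    given_name: str
    family_name: str
    age: int
    address: str


intended = Person('David', 'Liu', 100, '40 St. George Street')
swapped = Person('Liu', 'David', 100, '40 St. George Street')

# No error is raised for the swapped call; the names are simply stored
# in the wrong attributes.
print(swapped.given_name)
print(swapped.family_name)

# The two values are different Person instances as far as Python is concerned.
print(intended == swapped)
```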
To solve this issue, Python lets us create data class instances using keyword arguments to explicitly name which argument corresponds to which instance attribute, using the exact same format as the Person representation we saw above:
>>> david = Person(given_name='David', family_name='Liu', age=100, address='40 St. George Street')
Not only is this more explicit, but using keyword arguments allows us to pass the argument values in any order we want:
>>> david = Person(family_name='Liu', given_name='David', address='40 St. George Street', age=100)
This is a great improvement for the readability of our code when we use data classes, especially as they grow larger. One potential downside that comes with this (and in general when writing more explicit code) is that this requires more typing, and makes our code a little longer. You can get around the first issue by using auto-completion features (e.g., in PyCharm), and for the second issue you can put the different arguments on separate lines:
>>> david = Person(
...     family_name='Liu',
...     given_name='David',
...     address='40 St. George Street',
...     age=100
... )
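As a small check of the claims about keyword arguments, the sketch below builds the same person positionally and by keyword and compares the results. It also shows, as a further observation not made in the text above, that a misspelled attribute name is rejected with a TypeError. The names by_position, by_keyword, and misspelling_caught are introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """A custom data type that represents data for a person."""
    given_name: str
    family_name: str
    age: int
    address: str


by_position = Person('David', 'Liu', 100, '40 St. George Street')
by_keyword = Person(
    family_name='Liu',
    given_name='David',
    address='40 St. George Street',
    age=100,
)

# Data classes compare instances attribute by attribute, so these are equal
# even though the keyword arguments were given in a different order.
print(by_position == by_keyword)

# A misspelled attribute name is rejected right away with a TypeError.
try:
    Person(given_nam='David', family_name='Liu', age=100,
           address='40 St. George Street')
    misspelling_caught = False
except TypeError:
    misspelling_caught = True
print(misspelling_caught)
```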
Now that we have the ability to define our own data types, we need to decide how these data types will fit into our memory model. We’ll do this by using the representation that Python displays, formatted to show each instance attribute on a new line. For example, we would represent the david variable in a memory model as follows:
Variable | Value |
---|---|
david | Person(given_name='David', family_name='Liu', age=100, address='40 St. George Street') |