Problems arise when computers interact with the real world. You step out of a controlled environment into one with user inputs, paper jams, invalid filenames, and dates. Dates are on the front-line of man vs. nature. Are there 365 days in a year? Not quite. 365.25? 365.2424? Closer, but no^{1}. When is noon? Daylight savings time? Which time zone?

Data Scientists work with dates and, if you’re like me, they often cause irritation and frustration. Here are some tips and tools I use to make working with dates easier. They are not a comprehensive treatment; just enough information to understand the problem and get the job done. And, at this point, let’s stipulate that any inevitable jokes about date’s double meaning are satisfied!

What do we mean by the date January 1, 2022? There are two interpretations. It can mean

- A specific day-long time or
- A single point-in-time.

Events that **occur on January 1** use the first. The second is meant when we say that a contract is **effective on January 1**: it comes into effect at a specific point during the day, usually at midnight (the start of the day) or an instant after midnight to avoid quibbles about which day contains midnight. Keeping these two interpretations clear is critical to dealing with dates.

Next, we come to a critical question:

What is the expiration date of a one-year contract effective January 1, 2022?

Is it December 31, 2022, or January 1, 2023? The expiration date means the point-in-time at which the contract ceases to apply—the second interpretation. A one-year contract must cover December 31. It expires at the end of December 31, i.e., at the very first instant (midnight) of January 1, 2023. Stick to this convention, and you win 50% of your battles with dates.

In the rest of this post, date means **point-in-time**. If no time is given, it is understood to be the start of the day. Sometimes we are given a date and time, such as 11:30 am on January 1. Again, we interpret this as a point in time. It means 11:30:00.000 even if the seconds are not specified. If we are merely given 11 o’clock, we mean 11:00 exactly. We’ll call both a date and a date-and-time just a date.

**Side-bar** As a mathematician, I have to point out the similarity between dates and the old saw “Are 0.9999… and 1 the same number?” For us: is the end of December 31 the same moment as the start of January 1? We deem that point-in-time to occur on January 1, but it’s the same point in time as the last moment on December 31. [And the math answer: also yes, because \(0.9999\dots=0.9 \times (1 + 0.1 + 0.001 +\cdots) = 0.9 \times (1 - 0.1)^{-1} =1\) by the formula for the sum of a geometric series.]

Depending on where you live, 03/02/2022 can mean February 3 or March 2. Neither interpretation is ideal because neither sorts in the correct order. The ISO Date Standard^{2} solves this problem by writing dates YYYY-MM-DD. Everyone should use it!

Now we know what we mean by date, we can ask what a computer means. The happy answer is: we don’t care! Provided the computer handles certain things for us (which it always does), the details of its internal book-keeping are irrelevant. It’s the same with numbers: you don’t care how the computer stores a float. We require that the computer:

- Allows us to create dates in some reasonable way by specifying the year, month, day, hour, minute, second, fractions of a second,
- Displays dates for us in human-readable form,
- Computes the difference between two dates, creating a
**time delta**, and - Performs basic arithmetic with dates and time deltas.

Let’s assume the computer creates a date using `=date(Y, M, D, h, m, s, fs)`

, where all entries except the last (fractions of a second) are integers. Only `YMD`

are required; if `hms.fs`

are missing, they are 0 (midnight, again).

We can create arbitrary representations of dates as real numbers with this functionality. For example, if \(d\) is a date, then \[ r_1 = \frac{d - \mathsf{date}(1900, 1, 1)}{\mathsf{date}(1900, 1, 2) - \mathsf{date}(1900, 1, 1)} \] represents \(d\) as the number of days since January 1, 1900. By changing the denominator, we readily obtain a representation in seconds (remember missing arguments are zero) \[ r_2 = \frac{d - \mathsf{date}(1900, 1, 1)}{\mathsf{date}(1900, 1, 1, 0, 0, 1) - \mathsf{date}(1900, 1, 1)}. \] That’ll be a big number. We could pick a more recent reference time, like January 1, 1970 \[ r_3 = \frac{d - \mathsf{date}(1970, 1, 1)}{\mathsf{date}(1970, 1, 1, 0, 0, 1) - \mathsf{date}(1970, 1, 1)}. \]

As I write this, on February 4, 2022, the three representations are

- \(r_1 = 44594.593127378765\) says since 1900-01-01
- \(r_2 = 3852972862.741849 \approx 3,852,972,862.7\) seconds since 1900-01-01, and
- \(r_1 = 1643984129.762957 \approx 1,643,984,129.8\) seconds since 1970-01-01.

UNIX/Linux systems use the 1970 base and work in seconds so that you may see “dates” as numbers around 1.65 billion (seconds since 1970). Other systems, catering to the financial market’s needs, work in nanoseconds, giving numbers like 1,643,984,129,762,957,056. Excel claims to use days since 1900-01-01. Right now, typing `=NOW()*1`

into a spreadsheet gives 44,59**6**.59684, which is two days more than \(r_1\)^{3}. Two things cause the difference:

- Excel, following Lotus 1-2-3
^{4}, assumes that 1900 was a leap year. It wasn’t. One ahead for Excel. - Excel thinks of 1900-01-01 as time \(t=1\) (try it:
`=DATE(1900,1,1)*1`

gives 1), not \(t=0\) required by our point-in-time interpretation. (This is analogous to saying a newborn baby is age 1, not age 0. It is in its first year of life and achieves age 1 at the end of the year.) Two ahead for Excel.

The Julian day^{5} system takes November 24, 4714 BC as the base and counts in days.

In all these systems, the different **units** are easily interchangeable. It is the interpretation of the **numbers** that is critical. Remember, point-in-time. Working in units of one day, the picture looks like this:

Using the same diagram but in units of years, year 1 of a contract spans from \(t=0\) to \(t=1\), year 2 from \(t=1\) to \(t=2\), and so forth.

Epoch converter provides a tool to convert different date representations.

Dates are not a built-in Python data type. Date functionality is provided by the `datetime`

standard library and as well as by some `pandas`

and `NumPy`

functions. Dates can be created and manipulated much as you’d expect.

```
from datetime import datetime
datetime.now()# datetime.datetime(2022, 2, 4, 14, 48, 53, 133941)
- datetime(1970, 1, 1)) / (datetime(1970,1,1, 0, 0, 1) - datetime(1970, 1, 1))
(datetime.now() # 1643986180.155806
```

The odd looking import reflects that the function `now`

is a method of the class `datetime`

defined in the `datetime`

library.

Printing dates directly is arcane, requiring the function `strftime`

derived from C (string from time). A set of hard-to-remember coding strings, described in the documentation, controls the output

```
"%a, %d %b %Y %H:%M:%S")
datetime.now().strftime(# 'Fri, 04 Feb 2022 14:53:26'
```

The good news is that `pandas`

mostly insulates you from these codes.

Pandas integrates well with the `datetime`

library, and provides human-formatted output by default

```
import pandas as pd
'x': datetime.now()}, index=[1]) pd.DataFrame({
```

renders like so in Jupyter Lab

x | |
---|---|

1 | 2022-02-04 14:56:51.417745 |

It is essential to be able to read dates into a DataFrame and convert them into Python date types. Good news again: most of the time, it is easy. The `read_csv`

function allows you to specify columns (or collections of columns, if the date data is separated) to parse as dates. Here is an example. (`StringIO`

is used to create a file-stream-like object to give to `read_csv`

.)

```
data = '''idx,eff_date,poss_date
A,10-01-2020,1.6435e+18
B,12-21-2020,1.6437e+18
C,04-15-2021,1.6438e+18
D,02-03-2022,1.6439e+18'''
from io import StringIO
sio = StringIO(data)
df = pd.read_csv(sio, parse_dates=['eff_date'])
df
```

idx | eff_date | poss_date | |
---|---|---|---|

0 | A | 2020-10-01 | 1.6435e+18 |

1 | B | 2020-12-21 | 1.6437e+18 |

2 | C | 2021-04-15 | 1.6438e+18 |

3 | D | 2022-02-03 | 1.6439e+18 |

Calling `df.dtypes`

reveals that `eff_date`

is stored as `datetime64[ns]`

: nanoseconds since 1970—so it is date-aware. Without specifying `parse_dates=['eff_date']`

it is imported as an `object`

. The last column we suspect is a date. We can use `pd.to_datetime`

to convert it:

`'new_date'] = pd.to_datetime(df.poss_date) df[`

The first row now appears

idx | eff_date | poss_date | new_date | |
---|---|---|---|---|

0 | A | 2020-10-01 | 1.6435e+18 | 2022-01-29 23:46:40 |

If we know that the last column represents a date time stamp in UTC^{6} we can convert into a local (US/Eastern for me) time using

`'new_date2'] = pd.to_datetime(df.poss_date, utc=True).dt.tz_convert('US/Eastern') df[`

(the middle `.dt`

exposes date functions) to yield

idx | eff_date | poss_date | new_date | new_date2 | |
---|---|---|---|---|---|

0 | A | 2020-10-01 | 1.6435e+18 | 2022-01-29 23:46:40 | 2022-01-29 18:46:40-05:00 |

The help on `read_csv`

and `to_datetime`

explains many other options. But for me, `to_datetime`

and the timezone conversion usually gets the job done.

To comment on this post please go over the Reddit thread.

The Julian calendar (Julius Caesar) assumed 365.25 days per year: 3 regular years followed by a leap year. In 1582, Pope Gregory XIII introduced the Gregorian calendar, with 365.2425 days per year (divisible by 100 is

*not*a leap year—1800, 1900—except divisible by 400 is—2000). The actual solar year has 365.24219 days. Because the Julian year was too “long”, by about 1 day per 100 years, fixed solar events such as the winter solstice move earlier in the year. The Gregorian calendar included a ten day catch-up, 4 October 1582 was followed by 15 October.↩︎Multiplying 44,596.59684 by 3600 to convert to seconds gives 3,85

**3,14**6,026, which is 173,163 seconds more than \(r_2=\) 3,85**2,97**2,863, almost exactly two days more.↩︎See the Microsoft article: Excel incorrectly assumes that the year 1900 is a leap year.↩︎

One other date representation you might meet is the Julian day: the number of whole days since noon Monday, January 1, 4713 BC in the proleptic (OED,

*adj.*projected backwards) Julian calendar or, equivalently, November 24, 4714 BC, in the Gregorian calendar. This date marked the start of three multi-year cycles, and it precedes any dates in recorded history. SQLite uses Julian days.Python

`datetime`

provides the function`datetime.toordinal`

to counts days since Gregorian January 1 of the first year, i.e., 0001-01-01. Currently,`datetime.now().toordinal()=738190`

. Adding 1,721,424.5 gives Julian day, 2,459,614.5.↩︎

posted 2022-02-20 | tags: Effective Python, Python, dates

- 2022-04-05 | Effective Python Projects
- 2022-02-20 | Effective Python Dates
- 2022-02-10 | Effective Python Files
- 2022-02-09 | Effective Python R Style Regression
- 2022-01-20 | Effective Python Outline
- 2022-02-15 | Effective Python Zip Enumerate and Iter
- 2022-02-15 | Effective Python Strings and Unicode
- 2022-02-15 | Effective Python Introspection With Pandas
- 2022-02-02 | Cartograms
- 2022-02-02 | Code Snippets
- 2022-01-28 | Python Environments
- 2022-01-28 | Regular Expressions
- 2022-01-28 | The Aggregate Package
- 2022-01-28 | Code Snippets