Effective Python: Files

If you write a program of any length, then you have to manipulate files and filenames. Common tasks include:

Create (write) a new file,
Read the contents of an existing file,
Read file attributes, such as size or modified date,
Determine the drive, directory, or file type from a file path,
Rename a file or change its extension,
Determine if a file exists, and
List or search the contents of a directory.

Files have a physical existence that makes them fussy, especially allowing for differences across operating systems. Windows-style c:\dir_name\filename.suf name vs. Linux //mnt/c/dir_name/filename.suf is a case-in-point. Python has a great utility class called Path that simplifies many of these operations. Path belongs to pathlib, a standard Python library.

This post shows how to accomplish several everyday file-related tasks using Path. File tasks can act on

Strings that can be interpreted as filenames,
Files, or
Directories.

Let’s examine each in turn.

1. Working with Filenames

To get started, let’s load and create a Path

from pathlib import Path
p = Path('temp\\file.aux')

We have created a Path object representing the name of a file. On Windows p is WindowsPath('temp/file.aux') and on Linux PosixPath('temp/file.aux'), but these two can both be treated like Path objects in most situations. You can use \\ or / as a separator.

At this stage, it does not matter whether the file temp/file.aux exists or not. Regardless, several operations on p are possible:

The filename: p.name returns the string file.aux.
The suffix (extension, type): p.suffix returns the string .aux. Note that the period is included in the suffix.
The parent (folder, directory): p.parent returns another Path object to the directory temp.
All the parts: p.parts returns the tuple ('temp', 'file.aux').

Exercise: run these four commands on q = Path('c:\\temp\\subfolder\\file.aux').

Path allows filenames to be manipulated intuitively.

q = p.with_name('myfile.aux')      # creates Path('temp/myfile.aux')
r = p.with_suffix('.bak')          # creates Path('temp/file.bak')
s = Path('c:\\users\\steve') / p   # creates Path('c:/users/steve/temp/file.aux')

The last method is particularly useful. We can also combine Path objects and strings

t = Path('c:\\users\\steve') / 'folder1/folder2/new_file.xlsx'
t
# WindowsPath('c:/users/steve/folder1/folder2/new_file.xlsx')

Files are often relative to your home directory. Path includes the Path.home() function as a shortcut

t = Path.home() / 'folder1/folder2/new_file.xlsx'
t
# WindowsPath('c:/users/steve/folder1/folder2/new_file.xlsx')

Be careful not to make the mistake of including an extra /

t = Path.home() / '/folder1/folder2/new_file.xlsx'
t
# WindowsPath('C:/folder1/folder2/new_file.xlsx')

Python won’t complain, but you don’t get what you want.

So far Path is just manipulating names. The file does not have to exist.

2. Working with Files

The function p.exists() returns True if there is a file temp\\file.aux, and p.resolve() gives its full name. The handy function touch creates an empty file (Windows right-click, New File). We can use it to illustrate.

p = Path('some_new_file.wxyz')
p.exists()
# False
p.touch(exist_ok=False)
p.exists()
# True
p.resolve()
# WindowsPath('c:/users/steve/.../some_new_file.wxyz')

The function stat accesses file attributes:

p.stat()
# os.stat_result(st_mode=33206, st_ino=21955048184252520, st_dev=2829893387, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1643904374, st_mtime=1643904374, st_ctime=1643904374)

I don’t want to get into operating system details and dealing with dates (see ref to dates). Instead, here is a function to humanize the results¹, notably converting the dates.

import stat
import pandas as pd

def human_stat(p):
    """
    Return human-readable stat information on Path p.
    """
    if p.exists() is False:
        print(f'{p} does not exist.')
        return
    p = p.resolve()
    s = p.stat()
    return pd.DataFrame(dict(zip(
        ('name', 'stem', 'suffix', 'drive', 'parent',
         'create_date', 'modify_date', 'access_date',
         'size', 'mode', 'links'),
        [p.name, p.stem, p.suffix,  p.drive, str(p.parent),
         pd.to_datetime(s.st_ctime, unit='s'),
         pd.to_datetime(s.st_mtime, unit='s'),
         pd.to_datetime(s.st_atime, unit='s'),
         s.st_size, stat.filemode(s.st_mode), s.st_nlink])),
        index=pd.Index([s.st_ino], name='ino'))

The stat library interprets the read, write, and execute permissions encoded in st_mode². Running human_stat on p yields

Result of running `human_stat` on the newly created `p`. The index, 219… is a unique file identifier assigned by the operating system.
	21955048184252520
name	some_new_file.wxyz
stem	some_new_file
suffix	.wxyz
drive	C:
parent	C:\…
create_date	2022-02-03 16:06:14.263677184
modify_date	2022-02-03 16:06:14.263677184
access_date	2022-02-03 16:06:14.263677184
size	0
mode	-rw-rw-rw-
links	1

The function with_name creates a new Path object but does not rename any files. If you want to do that use rename:

p = Path('newfile.txt')
p.exists()
# False
p.touch()
p.exists()
# True
Path('another.txt').exists()
# False
q = p.rename('another.txt')
q.exists()
# True
p.exists()
# False

If you want to copy a file, you need to use the shutil (shell utilities) library. Copying a file involves some tricky choices (so you want to copy attributes?) that I don’t want to discuss. However, if you want to duplicate a file, you can use link_to, which creates another name for an existing file.

r = Path('duplicate_another.txt')
r.exists()
# False
q.link_to(r)
r.exist()
# True

Running human_stat on q and r reveals they have the same identifier (reference the same bits on the hard drive), and each returns links==2. Run r.unlink() to delete a link.

Linking allows you to solve an perennial file organization problem (for consultants): I want presentations sorted by client, but I would also like all my presentations together. Creating a link allows you to do this without consuming extra disk space. And, because links share the same data on disk, a change to either file is automatically reflected in the other. I wish I had known this twenty years ago.

Path supports reading and writing to files in two ways. It can open a file, like the open function, for subsequent read-write:

with q.open('w', encoding='utf-8') as f:
    f.write('text written to file another.txt')

with q.open('r', encoding='utf-8') as f:
    print(f.read())
# 'text written to file another.txt'
print(q.stat().st_size)
# 32

Or, it can read or write directly:

Path('newfile.txt').write_text('text written to a new file')
print(Path('newfile.txt').read_text())
# 'text written to a new file'

There are read_bytes and write_bytes functions for binary data.

3. Working with Directories

Finally, there are some special functions for directories. p.is_dir() returns True if p refers to a directory. As we’ve already mentioned, Path.home() returns your home directory. Path.cwd() returns the current working directory. mkdir creates a path and is particularly useful if you are trying to write to a file but don’t know if its parent directory exists. It even creates intermediate directories.

p = Path.cwd() / 'b/c/d/newfile.txt'
p.open('w')
# FileNotFoundError
p.parent.mkdir(parents=True, exist_ok=True)

The call to mkdir creates ~/b, ~/b/c, and ~/b/c/d if they do not exist, but it doesn’t complain if they do. Now, the call p.open('w') succeeds.

There are two handy functions to access files in a directory: iterdir iterates over all files, and glob finds files matching a pattern. To print the name and size of each file in the home directory:

for f in Path.home().iterdir():
    if f.is_file():
        print(f'{f.stat().st_size:9d}\t{f.name}')

To print the name and last accessed time of all Markdown files:

for f in Path.cwd().glob('*.md'):
    if f.is_file():
        print(f"{pd.to_datetime(f.stat().st_atime, unit='s'):%Y-%m-%d %H:%M:%S}\t{f.name}")

See my post on dates for more about formatting dates.

The glob function can recursively search all subdirectories using p.glob('**/*.md'). You can also use wildcards to filter the results. The call p.glob('*.xls?') matches all Excel suffices, xlsx, xlsm, xlsa and so forth. Patterns and wildcards can be combined: p.glob('Client_*.xls?') matches all Excel filenames beginning Client_.

Conclusion

I find Path does everything I want with files using a simple, consistent interface. If you work with files, it is worth investing the time to understand its capabilities.

Notes and References

human_stat uses pandas to provide nicely formatted output. It returns a row rather than a column so that each variable has the correct data type, which is useful when applied to several files. The table above is the transpose of the returned row. See my post Introspection with Pandas for more.↩︎
-rw-rw-rw- expands into three groups rw- of read, write, and execute permissions for the owner, the owner’s group, and all others. The first character is a d for a directory.↩︎

posted 2022-02-10 | tags: Effective Python, Python, path