ZFS File System (Introduction)

This chapter provides an overview of the ZFS file system and its features and benefits. This chapter also covers some basic terminology used throughout the rest of this book.

The following sections are provided in this chapter:

What Is ZFS?
ZFS Terminology
ZFS Component Naming Requirements
Compatibility
More detailed information

What Is ZFS?

The ZFS file system was first publicly introduced in the Sun OpenSolaris™ operating system in 2005. It was revolutionary at the time, and to this day it offers features that few other file systems provide. It was designed to be robust, scalable, and simple to administer.

There are currently two implementations of the ZFS file system. The first is Oracle™ ZFS, which is proprietary and, as of now, available only for the Oracle Solaris™ operating system. The second implementation is maintained under the umbrella of the OpenZFS Project; it is a fully open source ZFS implementation, with the same codebase shared among the illumos, *BSD, and Linux operating systems.

This guide covers the OpenZFS file system. Whenever ZFS is mentioned, OpenZFS is meant.

ZFS Pooled Storage

ZFS uses the concept of storage pools to manage physical storage. Historically, file systems were constructed on top of a single physical device. To address multiple devices and provide for data redundancy, the concept of a volume manager was introduced to provide the image of a single device so that file systems would not have to be modified to take advantage of multiple devices. This design added another layer of complexity and ultimately prevented certain file system advances, because the file system had no control over the physical placement of data on the virtualized volumes.

ZFS eliminates volume management altogether. Instead of forcing you to create virtualized volumes, ZFS aggregates devices into a storage pool. The storage pool describes the physical characteristics of the storage (device layout, data redundancy, and so on) and acts as an arbitrary data store from which file systems can be created. File systems are no longer constrained to individual devices, allowing them to share space with all file systems in the pool. You no longer need to predetermine the size of a file system, as file systems grow automatically within the space allocated to the storage pool. When new storage is added, all file systems within the pool can immediately use the additional space without additional work. In many ways, the storage pool acts as a virtual memory system: when a memory DIMM is added to a system, the operating system doesn't force you to run commands to configure the memory and assign it to individual processes. All processes on the system automatically use the additional memory.
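
For example, a single command builds a pool from physical devices, and file systems created within it share the pool's space. A minimal sketch (pool, file system, and Linux-style device names are purely illustrative):

# zpool create tank mirror /dev/sda /dev/sdb
# zfs create tank/home
# zfs create tank/home/alice

Neither file system is given a fixed size; both draw from, and can use all of, the free space of the pool tank.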

Transactional Semantics

ZFS is a transactional file system, which means that the file system state is always consistent on disk. Traditional file systems overwrite data in place, which means that if the machine loses power, for example, between the time a data block is allocated and when it is linked into a directory, the file system will be left in an inconsistent state. Historically, this problem was solved through the use of the fsck command. This command was responsible for going through and verifying the file system state, attempting to repair any inconsistencies in the process. This problem caused great pain to administrators, and the command was never guaranteed to fix all possible problems. More recently, file systems have introduced the concept of journaling. The journaling process records actions in a separate journal, which can then be replayed safely if a system crash occurs. This process introduces unnecessary overhead, because the data needs to be written twice, and often results in a new set of problems, such as when the journal can't be replayed properly.

With a transactional file system, data is managed using copy-on-write semantics. Data is never overwritten, and any sequence of operations is either entirely committed or entirely ignored. This mechanism means that the file system can never be corrupted through accidental loss of power or a system crash, so no need for an fsck equivalent exists. While the most recently written pieces of data might be lost, the file system itself will always be consistent. In addition, synchronous data (written using the O_DSYNC flag) is always guaranteed to be written before returning, so it is never lost.

Checksums and Self-Healing Data

With ZFS, all data and metadata is checksummed using a user-selectable algorithm. Traditional file systems that do provide checksumming have performed it on a per-block basis, out of necessity due to the volume management layer and traditional file system design. The traditional design means that certain failure modes, such as writing a complete block to an incorrect location, can result in properly checksummed data that is actually incorrect. ZFS checksums are stored in a way such that these failure modes are detected and can be recovered from gracefully. All checksumming and data recovery is done at the file system layer, and is transparent to applications.
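
For instance, the checksum algorithm can be selected per file system with a single property (pool and file system names are illustrative):

# zfs set checksum=sha256 tank/home
# zfs get checksum tank/home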

In addition, ZFS provides for self-healing data. ZFS supports storage pools with varying levels of data redundancy, including mirroring and a variation on RAID-5. When a bad data block is detected, ZFS fetches the correct data from another redundant copy, and repairs the bad data, replacing it with the good copy.
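
With redundancy configured, a scrub walks the entire pool, verifies every checksum, and repairs bad blocks from the good copies. For example (pool name illustrative):

# zpool scrub tank
# zpool status tank

The status output reports scrub progress and any errors that were detected and repaired.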

Unparalleled Scalability

ZFS has been designed from the ground up to be the most scalable file system ever. The file system itself is 128-bit, allowing for 256 quadrillion zettabytes of storage. All metadata is allocated dynamically, so no need exists to preallocate inodes or otherwise limit the scalability of the file system when it is first created. All the algorithms have been written with scalability in mind. Directories can have up to 2^48 (256 trillion) entries, and no limit exists on the number of file systems or the number of files that can be contained within a file system.

ZFS Snapshots

A snapshot is a read-only copy of a file system or volume. Snapshots can be created quickly and easily. Initially, snapshots consume no additional space within the pool.

As data within the active dataset changes, the snapshot consumes space by continuing to reference the old data. As a result, the snapshot prevents the data from being freed back to the pool.
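
As a brief sketch (names illustrative), a snapshot is created and listed like this; the space it is reported as using grows only as the active file system diverges from it:

# zfs snapshot tank/home@monday
# zfs list -t snapshot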

Simplified Administration

Most importantly, ZFS provides a greatly simplified administration model. Through the use of a hierarchical file system layout, property inheritance, and automatic management of mount points and NFS share semantics, ZFS makes it easy to create and manage file systems without needing multiple commands or editing configuration files. You can easily set quotas or reservations, turn compression on or off, or manage mount points for numerous file systems with a single command. Devices can be examined or repaired without having to understand a separate set of volume manager commands. You can take an unlimited number of instantaneous snapshots of file systems. You can back up and restore individual file systems.
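
For example, each of the following administrative tasks is a single command (pool and file system names are illustrative):

# zfs set quota=10G tank/home/alice
# zfs set reservation=5G tank/home/alice
# zfs set mountpoint=/export/home tank/home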

ZFS manages file systems through a hierarchy that allows for this simplified management of properties such as quotas, reservations, compression, and mount points. In this model, file systems become the central point of control. File systems themselves are very cheap (equivalent to a new directory), so you are encouraged to create a file system for each user, project, workspace, and so on. This design allows you to define fine-grained management points.
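
As a sketch of this model (names illustrative), a property set on a parent file system is inherited by file systems created beneath it:

# zfs set compression=on tank/home
# zfs create tank/home/bob
# zfs get compression tank/home/bob

The SOURCE column of the zfs get output reports the value as inherited from tank/home.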

ZFS Terminology

This section describes the basic terminology used throughout this book:

checksum

A 256-bit hash of the data in a file system block. The checksum capability can range from the simple and fast fletcher4 (the default) to cryptographically strong hashes such as SHA256.

clone

A file system whose initial contents are identical to the contents of a snapshot.

For information about clones, see Overview of ZFS Clones.

dataset

A generic name for the following ZFS entities: clones, file systems, snapshots, or volumes.

Each dataset is identified by a unique name in the ZFS namespace. Datasets are identified using the following format:

pool/path[@snapshot]

pool

Identifies the name of the storage pool that contains the dataset

path

Is a slash-delimited path name for the dataset object

snapshot

Is an optional component that identifies a snapshot of a dataset
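
For example (names illustrative), tank/home/alice identifies a file system in the pool tank, while tank/home/alice@monday identifies a snapshot of that file system.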

For more information about datasets, see Chapter 5, Managing ZFS File Systems.

file system

A dataset that contains a standard POSIX file system.

For more information about file systems, see Chapter 5, Managing ZFS File Systems.

mirror

A virtual device that stores identical copies of data on two or more disks. If any disk in a mirror fails, any other disk in that mirror can provide the same data.

pool

A logical group of devices describing the layout and physical characteristics of the available storage. Space for datasets is allocated from a pool.

For more information about storage pools, see Chapter 4, Managing ZFS Storage Pools.

RAID-Z

A virtual device that stores data and parity on multiple disks, similar to RAID-5. For more information about RAID-Z, see RAID-Z Storage Pool Configuration.
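
For example, a single-parity RAID-Z virtual device can be built from three disks (device names are Linux-style and illustrative):

# zpool create tank raidz /dev/sda /dev/sdb /dev/sdc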

resilvering

The process of transferring data from one device to another device is known as resilvering. For example, if a mirror component is replaced or taken offline, the data from the up-to-date mirror component is copied to the newly restored mirror component. This process is referred to as mirror resynchronization in traditional volume management products.

For more information about ZFS resilvering, see Viewing Resilvering Status.
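
For example (pool and device names illustrative), replacing a device triggers a resilver, whose progress can be observed with zpool status:

# zpool replace tank /dev/sdb /dev/sdc
# zpool status tank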

snapshot

A read-only image of a file system or volume at a given point in time.

For more information about snapshots, see Overview of ZFS Snapshots.

virtual device

A logical device in a pool, which can be a physical device, a file, or a collection of devices.

For more information about virtual devices, see Identifying Virtual Devices in a Storage Pool.

volume

A dataset used to emulate a physical device. For example, you can create a ZFS volume as a swap device.

For more information about ZFS volumes, see ZFS Volumes.
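
A minimal sketch of the swap example on Linux (the volume name and size are illustrative, and zvol device paths differ between platforms):

# zfs create -V 4G tank/swap
# mkswap /dev/zvol/tank/swap
# swapon /dev/zvol/tank/swap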

feature flags

A feature added to a ZFS pool or file system in a plugin-like fashion. Feature flags are a replacement for ZFS versioning, allowing the many ZFS implementations to cooperate.

ZFS Component Naming Requirements

Each ZFS component must be named according to the following rules:

Empty components are not allowed.

Each component can only contain alphanumeric characters in addition to the following four special characters: underscore (_), hyphen (-), colon (:), and period (.).

Pool names must begin with a letter. The names mirror, raidz, spare, and log are reserved, and pool names must not contain a percent sign (%).

Dataset names must begin with an alphanumeric character and must not contain a percent sign (%).

Compatibility

Pool versions and feature flags

When ZFS was part of the Sun Microsystems™ OpenSolaris project, it was developed solely within the company; there was no other ZFS codebase in the open. Thus, ZFS versions were a good way of introducing new features. When the OpenSolaris code was closed by Oracle™ in 2010, Garrett D'Amore started a truly open and free fork, the illumos project, which also created a basis for the future OpenZFS Project. Other operating systems, FreeBSD first among them, had started their own ports of ZFS, and the question of compatibility arose.

To allow various vendors and contributors to implement their own enhancements, the decision was made to freeze the ZFS version numbering: the pool version was set to 5000, and future enhancements are introduced as feature flags, plugin-like additions that provide new functionality. An example of a feature flag is the implementation of the lz4 compression algorithm.
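
For example, the feature flags supported by the installed OpenZFS can be listed, and the state of a particular flag on a pool can be queried (pool name illustrative):

# zpool upgrade -v
# zpool get feature@lz4_compress tank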

Any OpenZFS ports, whether illumos, FreeBSD, Linux, or otherwise, are compatible with one another, provided they implement the same set of feature flags.

Current versions of Oracle™ ZFS and OpenZFS are incompatible. To be able to move pools between Oracle™ Solaris™ systems and systems supporting OpenZFS, you need to use pool version 28 or earlier.
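
For instance, such a portable pool could be created at the legacy version explicitly (names illustrative; note that no feature flags can be used on such a pool):

# zpool create -o version=28 tank mirror /dev/sda /dev/sdb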

More detailed information

More detailed information can be found on the OpenZFS Project page: https://openzfs.org/