Have fun with ZFS: Setting up a storage pool

Put theory into practice!

Now that we've explored the benefits and limitations of ZFS, let's create a pool and try it out!

Info:

This guide will not cover using ZFS as the root filesystem, as that is very distro/OS-dependent. Refer to your distro/OS's own Wiki or Handbook for guides on that topic.

Concepts

Unlike filesystems that operate on a single disk, partition or logical volume (e.g. ext4 and NTFS), ZFS can span multiple storage volumes. In ZFS, the storage pool that actually holds your data is called a zpool. Inside a zpool there are one or more VDEVs (Virtual DEVices). Data written to the zpool is spread across all of the VDEVs, similar to RAID 0 [1].

If you are familiar with RAID, this may sound like a pretty bad idea, since the failure of a single VDEV would bring down the whole pool. However, ZFS allows redundancy inside a VDEV to make sure it doesn't fail easily. There are quite a few types of VDEV [2]:

  • single: just a single disk, no parity
  • mirror: multiple volumes with the same data, the VDEV won't fail as long as at least one volume is still good

    • similar to RAID 1
  • RAIDz1, RAIDz2, RAIDz3: multiple volumes with parity data, allows 1/2/3 bad volumes before VDEV fails

    • requires 3/4/5 disks at minimum
    • uses parity data, similar to RAID 5/6

Building redundancy at the VDEV level makes a zpool very versatile. You can expand a storage pool by simply adding a new VDEV with any redundancy setup you wish, rather than reconstructing the whole array of volumes. Doing parity in software also means the write hole issue that plagues other RAID 5 systems can be mitigated.

Space Efficiency and Fault Tolerance

For mirror, it is rather simple: you need at least two volumes. With 2 volumes you get 50% space efficiency; with 3 volumes you get 33.3%. As long as any one of the drives inside a mirror VDEV is still alive, the VDEV is good.

For RAIDz, you will need at least \( 2+p \) volumes, where \( p \) is the number of parity volumes (1 for RAIDz1, 2 for RAIDz2, 3 for RAIDz3). For example, assuming all volumes have the same capacity:

  • 2 storage volumes + 1 parity volume (RAIDz1)

    • \( \frac{2}{2+1} \approx 66.7\% \) space efficiency, allow 1 in 3 volumes to fail before data loss
  • 4 storage volumes + 1 parity volume (RAIDz1)

    • \( \frac{4}{4+1} = 80\% \) space efficiency, allow 1 in 5 volumes to fail before data loss
  • 4 storage volumes + 2 parity volumes (RAIDz2)

    • \( \frac{4}{4+2} \approx 66.7\% \) space efficiency, allow 2 in 6 volumes to fail before data loss
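
In general, with \( n \) data volumes and \( p \) parity volumes, a RAIDz VDEV gives \( \frac{n}{n+p} \) space efficiency and tolerates up to \( p \) failed volumes before data loss.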

Keep in mind that if you mix volumes of different capacities in a VDEV, ZFS treats every volume as if it were the size of the smallest one. For example, mirroring a 2 TB volume with a 4 TB volume yields only 2 TB of usable space.

Planning

Warning:

Once a volume is added to a ZFS pool, there's no way to shrink it (unlike ext3/4 or NTFS). Also, once you've assembled a VDEV with redundancy (mirror or RAIDz), there's no safe way to convert it to another VDEV type. So take your time and plan ahead!

Planning is important when designing a storage system. An adequately designed system will make sure you get the most out of your hardware. Moreover, it will make life easier when bad things happen. There are three things to consider:

  1. Performance
  2. Space Efficiency
  3. Fault Tolerance

We've already talked about space efficiency and fault tolerance, so we won't reiterate them here. Performance only comes into consideration if there will be heavy load on the storage pool (e.g. you will be running databases on it). If you do need good performance, here are some rules of thumb:

  • If budget allows, use mirror.
  • If RAIDz (at any level) is used, it is recommended to use \( 2^n + p \) drives to balance performance and space efficiency.
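
For example, with RAIDz2 (\( p = 2 \)), this guideline suggests widths such as \( 2^2 + 2 = 6 \) or \( 2^3 + 2 = 10 \) drives.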

It is also very beneficial to tune ZFS to suit your workload, but this topic is beyond the scope of this article.

Create the pool

You should have a general plan on what the pool should look like. Now, let's create the pool.

Info:

Steps shown here assume you are using ZFS on Linux. Although the general steps should be the same, FreeBSD may require different flags, so check its manual and documentation before proceeding.

For storage pool creation and maintenance, we use the zpool command. This command accepts subcommands, similar to git. To create a ZFS pool, we use the create subcommand:

zpool create -f -o ashift=12 -m <mountpoint> <pool_name> [raidz|raidz2|raidz3|mirror] <volumes>
  • -f forces creation, which avoids the Does not contain an EFI label error.

    • This may be ZFS on Linux specific. Check your manual.
  • -o ashift=12 uses the AF (Advanced Format) 4K sector size to improve performance (ashift is the base-2 logarithm of the sector size, so 12 means \( 2^{12} = 4096 \) bytes). If your pool is made up of SSDs, use -o ashift=13, since many SSDs use an 8K page size.

  • -m <mountpoint> specifies where the pool will be mounted.
  • <pool_name> the name of the new pool.
  • [raidz|raidz2|raidz3|mirror] <volumes> specifies the type of VDEV to create and the volumes to use.

Note that you should generally use volume IDs rather than non-persistent volume locations like sdX, since volume IDs are invariant across reboots and hardware changes. You can find volume IDs by checking the symlinks in /dev/disk/by-id/.
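
For example, you can list the persistent IDs and see which physical device each one points to (the exact names will differ per system):

$ ls -l /dev/disk/by-id/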

Example: Create a simple pool with only one volume

Creating a pool with only one volume is rather straightforward:

# zpool create -f -o ashift=12 -m /mnt/data data ata-VOLUME-ID

Here, we create a pool with /dev/disk/by-id/ata-VOLUME-ID and mount it to /mnt/data.

Example: Create a pool with mirror VDEVs

# zpool create -f -o ashift=12 -m /mnt/data data \
      mirror ata-VOLUME-1 ata-VOLUME-2

This creates a pool with a single mirror VDEV consisting of ata-VOLUME-1 and ata-VOLUME-2.

You can also create multiple VDEVs:

# zpool create -f -o ashift=12 -m /mnt/data data \
      mirror ata-VOLUME-1 ata-VOLUME-2 \
      mirror ata-VOLUME-3 ata-VOLUME-4

Example: Create a pool with RAIDz1 VDEVs

# zpool create -f -o ashift=12 -m /mnt/data data \
      raidz ata-VOLUME-1 ata-VOLUME-2 ata-VOLUME-3 [...even more volumes]

Checking pool status

Now, our pool has been created and we can check its status:

# zpool status data
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        data                      ONLINE       0     0     0
          mirror-0                ONLINE       0     0     0
            ata-VOLUME-1          ONLINE       0     0     0
            ata-VOLUME-2          ONLINE       0     0     0

errors: No known data errors

Although there's not much going on here right now, this will be a very useful command later: we will use it to check pool scrub/resilver progress, check volume states and, if we are unlucky, see which files are affected by data loss.
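
For example, when errors do show up, the -v flag asks zpool status to also list the files affected by unrecoverable errors (assuming the pool from the examples above, named data):

# zpool status -v data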

Adding and Removing volumes

In ZFS, we can modify the pool layout after the pool has been created.

Adding a new VDEV

We can expand a storage pool by adding a new VDEV:

# zpool add [pool_name] [raidz|raidz2|raidz3|mirror] <volumes>

Note that if your new VDEV has a different replication level than the existing ones (for example, a different number of volumes inside a mirror or RAIDz VDEV), ZFS will warn you about this. However, it shouldn't make a huge difference unless performance is critical.
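
For example, to grow the data pool from the earlier mirror example by adding another mirror VDEV (ata-VOLUME-5 and ata-VOLUME-6 are placeholder IDs):

# zpool add data mirror ata-VOLUME-5 ata-VOLUME-6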

See zpool-add.8 for more detail on this operation.

Transforming a single-volume VDEV into a mirror

Note that you can also add more volumes to an existing mirror VDEV.

# zpool attach [pool_name] <existing_volume> <new_volume>
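
For example, assuming the single-volume pool created earlier, attaching a second volume to ata-VOLUME-ID turns that VDEV into a two-way mirror (ata-VOLUME-2 is a placeholder ID):

# zpool attach data ata-VOLUME-ID ata-VOLUME-2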

See zpool-attach.8 for more on this operation.

Remove VDEV from pool

Currently, OpenZFS only supports removing top-level single-volume or mirror VDEVs, and the pool cannot contain a top-level RAIDz VDEV. This operation copies all data to the remaining VDEVs in the pool and shrinks the pool.

# zpool remove [pool_name] [devices]
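
For example, assuming the pool with two mirror VDEVs created earlier (shown as mirror-0 and mirror-1 in zpool status), the second VDEV could be removed with:

# zpool remove data mirror-1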

See zpool-remove.8 for more detail on this operation.

Importing and Exporting zpool

If you want to use the pool on another machine, you will need to export it first:

# zpool export <pool_name>

Be cautious when importing the pool. By default, ZFS uses non-persistent volume naming, which will cause issues when the disk arrangement changes. Instead, you should use:

# zpool import -d /dev/disk/by-id <pool_name>
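
For example, to move the data pool from the earlier examples to another machine, export it on the current host, then import it by ID on the new one:

# zpool export data
# zpool import -d /dev/disk/by-id data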

[1] Although it is intuitive to consider a zpool to be a glorified RAID 0 (that is, striping data across VDEVs), ZFS doesn't simply segment the data and record it across the VDEVs. Rather, ZFS has a fairly complex allocator that determines when and where to write that data.

[2] There are actually some other types of VDEVs, but they are either used for caching or used internally by ZFS, so these are all we need to know to store actual data.

Published on Mar 18, 2022