Ceph Storage Cluster

Ceph Object Storage

As an alternative to a parallel file system (e.g. BeeGFS, GPFS/Spectrum Scale, Lustre) which stores data as a collection of files, ceph is a system that stores data as a series of objects. One well-known implementation of object storage is Amazon S3, which is accessed through the S3 API. On top of its object store, ceph also provides additional layers of abstraction that offer alternative approaches to using the storage system:

Raw Object Storage (think Amazon S3)
Block Storage (think disk images)
Parallel POSIX Filesystem (such as the parallel filesystems mentioned above)

Note that Ceph is not a RAID controller: it does not add redundancy at the level of an individual disk. Instead, it implements data reliability through replication of objects across devices (OSDs) and/or hosts. This replication can be controlled independently for different storage pools that provide the different types of storage listed above.
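Replication has a direct cost in capacity: every object is stored once per replica, so usable space is roughly the raw space divided by the replication size. A minimal sketch of that arithmetic, where the function name usable_capacity_gb is made up for illustration and the result ignores ceph's own overhead and the free-space margin it wants to keep:

```shell
#!/bin/sh
# Rough usable capacity of a replicated pool.
#   $1 = total raw capacity in GB (all OSDs added together)
#   $2 = replication size (number of copies of each object)
# Integer division; overhead and safety margins are NOT accounted for.
usable_capacity_gb() {
    echo $(( $1 / $2 ))
}

# Example: 12000 GB of raw disk at replication size 3
usable_capacity_gb 12000 3
```

Because the replication size is set per pool, different pools on the same disks can trade capacity for safety differently.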

Ceph can be used at many different scales, from a single node with attached storage to large clusters with hundreds of storage devices and dozens of nodes. Its architecture is scalable because it avoids central bottlenecks in the command, metadata, and data retrieval paths: clients compute where an object lives using the CRUSH algorithm instead of asking a central lookup service.

From an overall system perspective, ceph can be used as a standalone storage platform or integrated into a kubernetes cluster -- or a hybrid approach where the kubernetes cluster uses the externally managed storage cluster for its dynamic storage provisioning needs.

In the WilliamsNet environment, ceph is used in different configurations:

A standalone cluster is hosted on storage1, which hosts the development filesystem 'workspace'
The development kubernetes cluster uses Rook to provision storage using the ceph cluster on storage1 as its platform
The production kubernetes cluster hosts its own ceph cluster internal to the kubernetes cluster, providing the 'shared' filesystem and internal kubernetes persistent storage needs.

Installing a Ceph Cluster

These instructions apply to the Octopus release of ceph, particularly version 15.2.0+ and the use of the cephadm orchestrator.

The hardware/system requirements for installing and using ceph include:

at least one, and preferably an odd number of servers to host the management daemons for ceph
direct attached disks (or iSCSI mounted volumes) on the ceph cluster hosts
storage devices that are to be used must be completely empty -- no partitions or LVM signatures -- as the entire storage device is used by ceph

The ceph master node is the one that takes the initial install ... it hosts all of the different services to begin with -- but as hosts are added to the cluster, it will deploy copies of relevant daemons to the new hosts to provide redundancy and resiliency.

These directions are extracted/distilled from the main ceph documentation ...

Prerequisites

These must be in place to install a ceph cluster:

Systemd
Podman or Docker for running containers
Time synchronization (such as chrony or NTP)
LVM2 for provisioning storage devices
Python 3 to run the tools/daemons

LVM2 and Python3 are not part of a minimal CentOS OS install ... but they are easily added if something else hasn't already brought them in:

sudo yum -y install lvm2 python3
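A quick way to confirm the prerequisites before going further is to check that the relevant commands are on the PATH. This is only a sketch: the command names (systemctl, lvm, python3, podman, docker) are my assumption of what represents each prerequisite, the function names are made up, and time synchronization is not checked because the daemon varies from system to system.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Check the commands assumed to stand for the prerequisites listed above.
# Run as root so that /usr/sbin (where lvm usually lives) is on PATH.
check_ceph_prereqs() {
    missing=$(missing_commands systemctl lvm python3)
    # Either container runtime satisfies the requirement.
    if [ -n "$(missing_commands podman)" ] && [ -n "$(missing_commands docker)" ]; then
        missing="$missing podman-or-docker"
    fi
    if [ -n "$missing" ]; then
        echo "missing:" $missing
        return 1
    fi
    echo "all prerequisite commands found"
}

# Usage:
#   check_ceph_prereqs
```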

Cephadm

This tool is the secret sauce that makes ceph easy to install and use. It was introduced in v15.2.0 -- Octopus. I had occasion to install and try to configure the Nautilus release ... and it was very challenging. Cephadm is first downloaded as a python script directly ... then later copied as part of the downloaded tool set. It provides the orchestration and coordination needed to deploy a fully capable cluster using docker containers with just a few commands.

Use curl to fetch the most recent version of the standalone script:

curl --silent --remote-name --location https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
chmod +x cephadm

This script can be run directly from the current directory with:

./cephadm <arguments...>

Although the standalone script is sufficient to get a cluster started, it is convenient to have the cephadm command installed on the host. To install the packages that provide it (along with the common ceph tools) for the current Octopus release:

./cephadm add-repo --release octopus
./cephadm install cephadm ceph-common

Unfortunately, cephadm does not properly set up the repository for Debian Buster, so you need to do it manually for those systems:

wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
echo deb https://download.ceph.com/debian-octopus/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update

Bootstrapping the Cluster

It seems simple ... but these commands will literally create a working ceph cluster ready to start adding hosts and storage devices:

mkdir -p /etc/ceph
cephadm bootstrap --mon-ip <mon-ip>

One very important thing that needs to be done before bootstrapping the cluster is to set the hostname to just the base name (not a FQDN). There is an option that will allow you to use FQDNs for hostnames, but it is not very reliable and isn't respected all through the system components.
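Since the hostname requirement is easy to forget, here is a small sketch for deriving the base name from a FQDN. The function name short_hostname and the example domain are made up for illustration, and hostnamectl is assumed to be available (it normally is on systemd-based hosts).

```shell
#!/bin/sh
# Strip everything from the first dot onward:
#   storage1.example.com -> storage1
short_hostname() {
    printf '%s\n' "${1%%.*}"
}

# Example with a hypothetical FQDN
short_hostname storage1.example.com

# To actually apply it (changes system state, run as root):
#   hostnamectl set-hostname "$(short_hostname "$(hostname)")"
```

A name that is already short passes through unchanged, so the same line is safe to run again before adding drives later.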

At the end of the output from the bootstrap command will be a URL and password for the dashboard. SAVE THE PASSWORD -- it won't show up again and you'll have to reset it somehow. As soon as you use the password, it will force you to change it to something you can remember.

The dashboard is extremely useful, but it doesn't completely replace the command line interface ...

Ceph CLI

The ceph-common package installed above contains most of the tools that will be needed to manage the cluster, but if something more complicated occurs and you need the full toolset, you can have cephadm spin up a docker container with all the tools available:

cephadm shell

Most of the time, however, the normal command line tools are sufficient:

ceph <subcommand> <parameters>
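As a starting point, these are a few read-only subcommands that give a quick picture of the cluster. The wrapper function ceph_overview and the run/DRY_RUN helper are my own illustration (not part of ceph); by default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

ceph_overview() {
    run ceph status          # overall health, monitor quorum, OSD and pool summary
    run ceph health detail   # expanded explanation of any health warnings
    run ceph osd tree        # OSDs arranged by host in the CRUSH hierarchy
    run ceph df              # raw and per-pool capacity usage
}

# Usage on a host with admin access to the cluster:
#   DRY_RUN=0 ceph_overview
```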

Adding Hosts

While running a ceph cluster with a single host is possible, it is not recommended. Adding hosts to the cluster is very straightforward; cephadm (through the 'ceph orch' command set) takes care of installing the software each new host needs for whatever daemons the orchestrator decides to place there:

ssh-copy-id -f -i /etc/ceph/ceph.pub root@<new-host>
ceph orch host add <new-host>
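When several hosts need to be added, the two commands above can be wrapped in a loop. This is a sketch only: add_hosts and the run/DRY_RUN helper are names I made up, and storage2/storage3 are hypothetical hostnames. It prints the commands by default rather than running them.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Install the cluster's SSH key on each host, then register it with the orchestrator.
add_hosts() {
    for host in "$@"; do
        run ssh-copy-id -f -i /etc/ceph/ceph.pub "root@$host"
        run ceph orch host add "$host"
    done
}

# Usage on the ceph master node:
#   DRY_RUN=0 add_hosts storage2 storage3
#   ceph orch host ls        # confirm the hosts are now known to the cluster
```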

Adding Disks

Drives are added individually as OSDs in ceph. As mentioned above, ceph is rather particular about what it will accept for storage devices.

An inventory of storage devices on all cluster hosts can be displayed with:

ceph orch device ls

A storage device is considered available if all of the following conditions are met:

The device must have no partitions.
The device must not have any LVM state.
The device must not be mounted.
The device must not contain a file system.
The device must not contain a Ceph BlueStore OSD.
The device must be larger than 5 GB.

Ceph refuses to provision an OSD on a device that is not available.

If a disk has been used previously, you will need to clean off its partition table so that ceph will accept it. A quick and dirty way to do this is:

dd if=/dev/zero count=10 of=/dev/<drive>
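With dd's default 512-byte block size, the command above only zeroes the first 5 KiB of the drive, so a backup GPT at the end of the disk can survive. A more thorough sketch, assuming the wipefs (util-linux) and sgdisk (gdisk package) tools are installed; wipe_device and the run/DRY_RUN helper are illustrative names, and the sketch only prints the commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# DESTRUCTIVE when DRY_RUN=0: removes all signatures and partition tables.
wipe_device() {
    dev=$1
    run wipefs --all "$dev"       # erase filesystem, RAID and LVM signatures
    run sgdisk --zap-all "$dev"   # destroy MBR and GPT, including the backup GPT
}

# Usage on the host the drive is attached to (as root):
#   DRY_RUN=0 wipe_device /dev/<drive>
```

The orchestrator should also be able to do this from the master node with 'ceph orch device zap <host> <device> --force', though I have only described the manual route here.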

Once you have a clean drive, add it by issuing this command on the ceph master (not the system the drive is attached to):

ceph orch daemon add osd <host>:<device>

The host must have already been added in the previous section; cephadm takes care of installing the OSD software to manage the drive. Also, if the host has been rebooted, you need to set the hostname to the base name (NOT the FQDN) before adding the drive/OSD.

Single Host Operation

Out of the box, ceph requires replication to be across hosts, not just devices. For a single node cluster, this can be problematic. The following steps will add a new rule that will allow replication across OSDs instead of hosts:

#
# commands to set ceph to handle replication on one node
#
# create new crush rule allowing OSD-level replication
# ceph osd crush rule create-replicated <rulename> <root> <level>
ceph osd crush rule create-replicated osd_replication default osd

# verify that rule exists and is correct
ceph osd crush rule ls
ceph osd crush rule dump

# set replication level on existing pools
ceph osd pool set device_health_metrics size 3 

# apply new rule to existing pools
ceph osd pool set device_health_metrics crush_rule osd_replication
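Pools created later will need the same treatment, so it can help to wrap the last two steps in a loop. A sketch under assumptions: apply_osd_rule and the run/DRY_RUN helper are names I made up, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# apply_osd_rule <rulename> <size> <pool>...
# Sets the replication size and the CRUSH rule on each named pool.
apply_osd_rule() {
    rule=$1
    size=$2
    shift 2
    for pool in "$@"; do
        run ceph osd pool set "$pool" size "$size"
        run ceph osd pool set "$pool" crush_rule "$rule"
    done
}

# Usage, applying the rule to every pool the cluster currently has:
#   DRY_RUN=0 apply_osd_rule osd_replication 3 $(ceph osd pool ls)
```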

Cluster Concepts and Operations

CRUSH Rules and Replication

Pools

RBD Images

CephFS

Creating a CephFS

Mounting a Ceph FS

Mounting a ceph filesystem on a system outside the storage cluster requires three things:

The 'mount.ceph' mount helper, available in the 'ceph-common' package
The client keyring created on the ceph master node for client authentication
An entry in the /etc/fstab file

Mount Helper

All you need is the 'mount.ceph' executable, but there is no way to just get that file. So, you have to load the ceph common application bundle, which results in a bunch of dependencies that nobody will ever need:

sudo yum install -y ceph-common

Obviously, you will need to set up the repository as described above depending on your OS ...

Client keyring

While the admin keyring/credentials could be used, for obvious reasons a separate user should be created for mounting the Ceph FS. While it is possible to create a separate user for each client system, there is no need to go to that level of paranoia. The keyring must be created on a system with admin access to the cluster (generally the master node) and then copied to the client system. Go to the master node to create the authentication token, selecting something appropriate for the username:

sudo ssh <master node>
ceph fs authorize <filesystem> client.<username> / rw > /etc/ceph/ceph.client.<username>.keyring
scp /etc/ceph/ceph.client.<username>.keyring <client>:/etc/ceph

This same keyring file can then be copied over to each client system without recreating it.

/etc/fstab

This line will mount the Ceph FS on boot:

<master IP>:/                      /<mountpoint>      ceph    name=<username>,_netdev 0 0

At this point, simply mount the filesystem as normal to start using it immediately:

sudo mount /<mountpoint>
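When setting up more than one client, the fstab entry can be generated rather than typed by hand. This is a minimal sketch: fstab_line is a made-up helper, and the address, mount point and user name in the example are placeholders. Note that the mount point directory has to exist before mounting.

```shell
#!/bin/sh
# fstab_line <monitor address> <mount point> <username>
# <username> is the part after "client." in the keyring name.
fstab_line() {
    printf '%s:/\t%s\tceph\tname=%s,_netdev\t0 0\n' "$1" "$2" "$3"
}

# Example with placeholder values
fstab_line 192.0.2.10 /workspace wsuser

# Usage on a client (as root):
#   mkdir -p /workspace
#   fstab_line 192.0.2.10 /workspace wsuser >> /etc/fstab
#   mount /workspace
```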