Ceph Storage Cluster
Ceph Object Storage
As an alternative to a parallel file system (e.g. BeeGFS, GPFS/Spectrum Scale, Lustre, etc.), which stores data as a collection of files, ceph is a system that stores data as a collection of immutable objects. One well-known implementation of object storage is the Amazon S3 API. On top of its object store, ceph provides additional layers of abstraction that offer alternative approaches to using the storage system:
- Raw Object Storage (think Amazon S3)
- Block Storage (think disk images)
- Parallel POSIX Filesystem (such as the parallel filesystems mentioned above)
Note that Ceph is not a RAID controller and does not add redundancy or reliability when storing individual objects on disk; instead, it achieves data reliability by replicating objects across devices (OSDs) and/or hosts. This replication can be controlled independently for each of the storage pools that back the different types of storage listed above.
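For illustration, the per-pool replication factor is exposed as the pool's 'size' setting (the pool name here is just a placeholder):
# show how many copies of each object the pool keeps
ceph osd pool get <pool-name> size
# keep 3 copies of each object in this pool
ceph osd pool set <pool-name> size 3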
Ceph can be used at many different scales, from a single node with attached storage to large clusters with hundreds of storage devices and dozens of nodes. Its architecture is scalable because it doesn't create bottlenecks in the command, metadata, or data retrieval paths.
From an overall system perspective, ceph can be used as a standalone storage platform or integrated into a kubernetes cluster -- or in a hybrid approach where the kubernetes cluster uses an externally managed storage cluster for its dynamic storage provisioning needs.
In the WilliamsNet environment, ceph is used in different configurations:
- A standalone cluster is hosted on storage1, which hosts the development filesystem 'workspace'
- The development kubernetes cluster uses Rook to provision storage using the ceph cluster on storage1 as its platform
- The production kubernetes cluster hosts its own ceph cluster internal to the kubernetes cluster, providing the 'shared' filesystem and internal kubernetes persistent storage needs.
Installing a Ceph Cluster
These instructions apply to the Octopus release of ceph, particularly version 15.2.0+ and the use of the cephadm orchestrator.
The hardware/system requirements for installing and using ceph include:
- at least one, and preferably an odd number of servers to host the management daemons for ceph
- direct attached disks (or iSCSI mounted volumes) on the ceph cluster hosts
- storage devices that are to be used must be completely empty -- no partitions or LVM signatures -- as the entire storage device is used by ceph
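To confirm a device satisfies the 'completely empty' requirement, something like the following can be used (these are generic Linux tools, not part of ceph):
# list block devices and any partitions they contain
lsblk
# list (without erasing) any filesystem, RAID, or LVM signatures on the device
sudo wipefs /dev/<drive>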
The ceph master node is the one where the initial install is performed ... it hosts all of the different services to begin with -- but as hosts are added to the cluster, ceph will deploy copies of the relevant daemons to the new hosts to provide redundancy and resiliency.
These directions are extracted/distilled from the main ceph documentation ...
Prerequisites
These must be in place to install a ceph cluster:
- Systemd
- Podman or Docker for running containers
- Time synchronization (such as chrony or NTP)
- LVM2 for provisioning storage devices
LVM2 is the only one I've found to not be universal from a standard OS install ... since ceph uses LVM to manage the disks, it has to be there -- even on nodes that do not have OSDs.
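Installing it ahead of time is trivial; the package is normally just 'lvm2' (package manager commands assumed for the usual distributions):
# Debian/Ubuntu hosts
sudo apt-get install -y lvm2
# RHEL/CentOS hosts
sudo yum install -y lvm2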
Cephadm
This tool is the secret sauce that makes ceph easy to install and use. It was introduced in v15.2.0 -- Octopus. I had occasion to install and try to configure the Nautilus release ... and it was very challenging. Cephadm is first downloaded directly as a standalone python script ... and later installed as part of the downloaded tool set. It provides the orchestration and coordination needed to deploy a fully capable cluster using docker containers with just a few commands.
Use curl to fetch the most recent version of the standalone script:
curl --silent --remote-name --location https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
chmod +x cephadm
This script can be run directly from the current directory with:
./cephadm <arguments...>
Although the standalone script is sufficient to get a cluster started, it is convenient to have the cephadm command installed on the host. To install these packages for the current Octopus release:
./cephadm add-repo --release octopus
./cephadm install cephadm ceph-common
Unfortunately, cephadm does not properly set up the repository for Debian Buster, so you need to do it manually for those systems:
wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
echo deb https://download.ceph.com/debian-octopus/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update
Bootstrapping the Cluster
It seems simple ... but these commands will literally create a working ceph cluster ready to start adding hosts and storage devices:
mkdir -p /etc/ceph
cephadm bootstrap --mon-ip <mon-ip>
One very important thing that needs to be done before bootstrapping the cluster is to set the hostname to just the base name (not a FQDN). There is an option that allows FQDN hostnames, but it is not very reliable and isn't respected consistently throughout the system components.
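For example, setting and checking the short hostname with systemd's hostnamectl (the hostname shown is just an example):
# set the short (non-FQDN) hostname
sudo hostnamectl set-hostname storage1
# 'hostname' should now return the short name; 'hostname -f' still shows the FQDN
hostname
hostname -f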
Ceph CLI
The ceph-common package installed above contains most of the tools that will be needed to manage the cluster, but if something more complicated occurs and you need the full toolset, you can have cephadm spin up a docker container with all the tools available:
cephadm shell
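From inside that shell (or from any host with ceph-common and the admin keyring), the usual status commands are available, e.g.:
# overall cluster health, monitors, OSDs, and pools
ceph status
# list the daemons cephadm has deployed across the cluster
ceph orch ps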
Adding Hosts
While running a ceph cluster with a single host is possible, it is not recommended. Adding hosts to the cluster is very straightforward; cephadm (through the 'ceph orch' command set) takes care of installing the software needed on the new host for whatever daemons it decides to place there:
ssh-copy-id -f -i /etc/ceph/ceph.pub root@<new-host>
ceph orch host add <new-host>
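To verify that the host is now part of the cluster:
ceph orch host ls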
Adding Disks
Drives are added individually as OSDs in ceph.
To zap a disk (delete its partition table) in preparation for use with Ceph, execute the following:
dd if=/dev/zero count=10 of=/dev/<drive>
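Once a device is clean, the orchestrator can show which devices it considers usable and turn one into an OSD; roughly (host and device names are placeholders):
# list storage devices cephadm sees on each host and whether they are available
ceph orch device ls
# create an OSD from a specific device on a specific host
ceph orch daemon add osd <host>:/dev/<drive>
# alternatively, let ceph consume every empty, available device it finds
ceph orch apply osd --all-available-devices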
Single Host Operation
OOTB, ceph requires replication to be across hosts, not just devices. For a single node cluster, this can be problematic. The following steps will add a new CRUSH rule that allows replication across OSDs instead of hosts:
#
# commands to set ceph to handle replication on one node
#

# create new crush rule allowing OSD-level replication
# ceph osd crush rule create-replicated <rulename> <root> <level>
ceph osd crush rule create-replicated osd_replication default osd

# verify that rule exists and is correct
ceph osd crush rule ls
ceph osd crush rule dump

# set replication level on existing pools
ceph osd pool set device_health_metrics size 3

# apply new rule to existing pools
ceph osd pool set device_health_metrics crush_rule osd_replication
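A quick check that a pool has actually picked up the new rule:
ceph osd pool get device_health_metrics crush_rule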
RBD Images
CephFS
Creating a CephFS
Mounting a Ceph FS
Mounting a ceph filesystem on a system outside the storage cluster requires four things:
- The master ceph config file (ceph.conf) from the /etc/ceph directory on any cluster node
- The client keyring created on the ceph master node for client authentication
- The 'mount.ceph' mount helper, available in the 'ceph-common' package
- An entry in the /etc/fstab file
Ceph Config File
This file should simply be copied over to the client system from a node in the storage cluster:
sudo mkdir /etc/ceph
sudo scp <node>:/etc/ceph/ceph.conf /etc/ceph
Permissions should be 644 as this needs to be readable by non-root.
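For example:
sudo chmod 644 /etc/ceph/ceph.conf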
Client keyring
While the admin keyring/credentials could be used, for obvious reasons a separate user should be created for mounting the Ceph FS. While it is possible to create a separate user for each client system, there is no need to go to that level of paranoia. The keyring must be created on a system with admin access to the cluster (generally a cluster node) and then copied to the client system:
sudo ssh <cluster node> ceph fs authorize <filesystem> client.<username> / rw > /etc/ceph/ceph.client.<username>.keyring
scp /etc/ceph/ceph.client.<username>.keyring <client>:/etc/ceph
This same keyring file can then be copied over to each client system without recreating it.
Mount Helper
All you need is the 'mount.ceph' executable, but there is no way to get just that file. So you have to install the ceph-common package, which pulls in a bunch of dependencies that nobody will ever need:
sudo yum install -y ceph-common
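On Debian-based clients the equivalent (using the ceph repository set up earlier, or the distribution's own packages) should be:
sudo apt-get install -y ceph-common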
/etc/fstab
This line will mount the Ceph FS on boot:
:/ /<mountpoint> ceph name=<client id> 0 0
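As a concrete (purely illustrative) example, assuming a client id of 'workspace' and a mount point of /workspace, with _netdev added so the mount waits for the network to be up:
:/ /workspace ceph name=workspace,_netdev 0 0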
At this point, simply mount the filesystem as normal:
sudo mount /<mountpoint>