Ceph Storage Cluster
Ceph Object Storage
As an alternative to a parallel file system (e.g. BeeGFS, GPFS/Spectrum Scale, Lustre), which stores data as a collection of files, ceph is a system that stores data as a series of objects. The Amazon S3 API is one well-known implementation of object storage; ceph likewise builds on a raw object store, but adds further layers of abstraction that provide alternative approaches to using the storage system:
- Raw Object Storage (think Amazon S3)
- Block Storage (think disk images)
- Parallel POSIX Filesystem (such as the parallel filesystems mentioned above)
Note that Ceph is not a RAID controller and does not implement redundancy or reliability by default in storing objects on disk; it implements data reliability through replication of objects across devices (OSDs) and/or hosts. This replication can be controlled independently for different storage pools that provide the different types of storage listed above.
Ceph can be used at many different scales, from a single node with attached storage to large clusters with hundreds of storage devices and dozens of nodes. Its architecture is scalable because it doesn't create bottlenecks in the command, metadata, or data retrieval paths.
From an overall system perspective, ceph can be used as a standalone storage platform or integrated into a kubernetes cluster -- or a hybrid approach where the kubernetes cluster uses the externally managed storage cluster for its dynamic storage provisioning needs.
In the WilliamsNet environment, ceph is used in different configurations:
- A standalone cluster is hosted on storage1, which hosts the development filesystem 'workspace'
- The development kubernetes cluster uses Rook to provision storage using the ceph cluster on storage1 as its platform
- The production kubernetes cluster hosts its own ceph cluster internal to the kubernetes cluster, providing the 'shared' filesystem and internal kubernetes persistent storage needs.
Installing a Ceph Cluster
These instructions apply to the Octopus release of ceph, particularly version 15.2.0+ and the use of the cephadm orchestrator.
The hardware/system requirements for installing and using ceph include:
- at least one, and preferably an odd number of servers to host the management daemons for ceph
- direct attached disks (or iSCSI mounted volumes) on the ceph cluster hosts
- storage devices that are to be used must be completely empty -- no partitions or LVM signatures -- as the entire storage device is used by ceph
The ceph master node is the one on which the initial install is performed ... it hosts all of the different services to begin with -- but as hosts are added to the cluster, it will deploy copies of the relevant daemons to the new hosts to provide redundancy and resiliency.
These directions are extracted/distilled from the main ceph documentation ...
Prerequisites
These must be in place to install a ceph cluster:
- Systemd
- Podman or Docker for running containers
- Time synchronization (such as chrony or NTP)
- LVM2 for provisioning storage devices
- Python 3 to run the tools/daemons
LVM2 and Python3 are not part of a minimal CentOS OS install ... but they are easily added if something else hasn't already brought them in:
sudo yum -y install lvm2 python3
These packages must be in place for all nodes that form the ceph cluster, whether just running daemons or hosting storage (or both).
Cephadm
This tool is the secret sauce that makes ceph easy to install and use. It was introduced in v15.2.0 -- Octopus. I had occasion to install and try to configure the Nautilus release ... and it was very challenging. Cephadm is first downloaded directly as a standalone python script ... later it is installed as part of the downloaded tool set. It provides the orchestration and coordination needed to deploy a fully capable cluster using docker containers with just a few commands.
Use curl to fetch the most recent version of the standalone script:
curl --silent --remote-name --location https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm
chmod +x cephadm
This script can be run directly from the current directory with:
./cephadm <arguments...>
Although the standalone script is sufficient to get a cluster started, it is convenient to have the cephadm command installed on the host. To install these packages for the current Octopus release:
./cephadm add-repo --release octopus
./cephadm install cephadm ceph-common
Unfortunately, cephadm does not properly set up the repository for Debian Buster, so you need to do it manually for those systems:
wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
echo deb https://download.ceph.com/debian-octopus/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update
Note that for Fedora 33 (as of this update on 11/29/20) the latest version of the ceph utilities (15.2.6) from the ceph repository does not work. The Fedora distribution actually keeps a reasonably current version of the ceph packages in its standard repository, so it may be advisable to use the Fedora repository for all package installations. The Fedora 'updates-testing' repository will have the absolute latest ceph version as it is validated for the platform, so it can be used if needed to get the newest release before it makes it into the main repository. DO NOT simply enable the 'updates-testing' repository, as it has unvalidated packages for other software; instead, use the '--enablerepo' option on yum/dnf as needed:
sudo dnf install -y ceph-common --enablerepo=updates-testing
Bootstrapping the Cluster
It seems simple ... but these commands will literally create a working ceph cluster ready to start adding hosts and storage devices:
mkdir -p /etc/ceph
cephadm bootstrap --mon-ip <mon-ip>
One very important thing that needs to be done before bootstrapping the cluster is to set the hostname to just the base name (not a FQDN). There is an option that will allow you to use FQDNs for hostnames, but it is not very reliable and isn't respected all through the system components.
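For example (the hostname here is only illustrative), the base name can be set with hostnamectl before bootstrapping:
hostnamectl set-hostname storage1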
At the end of the output from the bootstrap command will be a URL and password for the dashboard. SAVE THE PASSWORD -- it won't show up again and you'll have to reset it somehow. As soon as you use the password, it will force you to change it to something you can remember.
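If the password does get lost, it can usually be reset from the CLI; as a hedged sketch (the 'admin' account name and the temporary file are assumptions, and older Octopus builds may expect the password inline rather than via -i):
echo -n "<new-password>" > /tmp/dashboard-pass
ceph dashboard ac-user-set-password admin -i /tmp/dashboard-pass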
The dashboard is extremely useful, but it doesn't completely replace the command line interface ...
Ceph CLI
The ceph-common package installed above contains most of the tools that will be needed to manage the cluster, but if something more complicated occurs and you need the full toolset, you can have cephadm spin up a docker container with all the tools available:
cephadm shell
Most of the time, however, the normal command line tools are sufficient:
ceph <subcommand> <parameters>
Adding Hosts
While running a ceph cluster with a single host is possible, it is not recommended. Adding hosts to the cluster is very straightforward; cephadm (through the 'ceph orch' command set) takes care of installing the needed software on the host to perform the functions it decides to place there:
ssh-copy-id -f -i /etc/ceph/ceph.pub root@<new-host>
ceph orch host add <new-host> [<new-host-ip>]
The host IP is optional, but needed when the host name doesn't resolve to the interface to be used for ceph communications.
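As a hypothetical example (the host name and address are made up), adding a second node reachable at 10.0.0.12 would look like:
ssh-copy-id -f -i /etc/ceph/ceph.pub root@storage2
ceph orch host add storage2 10.0.0.12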
Adding Disks
Drives are added individually as OSDs in ceph. As mentioned above, ceph is rather particular about what it will accept for storage devices.
An inventory of storage devices on all cluster hosts can be displayed with:
ceph orch device ls
A storage device is considered available if all of the following conditions are met:
- The device must have no partitions.
- The device must not have any LVM state.
- The device must not be mounted.
- The device must not contain a file system.
- The device must not contain a Ceph BlueStore OSD.
- The device must be larger than 5 GB.
Ceph refuses to provision an OSD on a device that is not available.
If a disk has been used previously, you will need to zap it so that ceph will accept it.
ceph orch device zap --force <host> <device>
or
dd if=/dev/zero count=10 of=/dev/<drive>
or
wipefs -a /dev/<drive>
If a drive has been used in a previous incarnation of ceph on that machine (i.e. you tore down a cluster and are trying to build it up again), you will need to do a bit more. This is especially the case when using the Rook operator to build an internal cluster: it does more than just check for the magic numbers on the disks, it also asks the OS what it thinks of the drive. Ceph uses LVM to manage its disks, and LVM keeps its own record of the drives it owns. You can wipe a disk, but LVM will still think it owns it ... and therefore the Rook operator won't touch it. The process to make LVM forget about the drive is involved, but it works. See Removing a disk from LVM for the process.
Once you have a clean drive, add it by issuing this command on the ceph master (not the system the drive is mounted on):
ceph orch daemon add osd <host>:<device>
The host must have already been added in the previous section, and cephadm takes care of installing the OSD software to manage it. Also, if the host has been rebooted, you need to set the hostname to the base name (NOT the FQDN) before adding the drive/OSD.
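As a hypothetical example (host and device names are made up), adding a disk on the second host and then confirming that the new OSD has joined the cluster might look like:
ceph orch daemon add osd storage2:/dev/sdb
ceph osd tree
ceph -s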
Single Host Operation
OOTB, ceph requires replication to be across hosts, not just devices. For a single node cluster (or small clusters with unbalanced disks), this can be problematic. The following steps will add a new rule that allows replication across OSDs instead of hosts:
# commands to set ceph to handle replication on one node
# create new crush rule allowing OSD-level replication
# ceph osd crush rule create-replicated <rulename> <root> <level>
ceph osd crush rule create-replicated osd_replication default osd
# verify that rule exists and is correct
ceph osd crush rule ls
# get the id of the new rule (probably will be '1', but check anyway)
ceph osd crush rule dump
# set replication level on existing pools
ceph osd pool set device_health_metrics size 3
# apply new rule to existing pools
ceph osd pool set device_health_metrics crush_rule osd_replication
# make the new rule the default
# ceph config set global osd_pool_default_crush_rule <rule_id>
ceph config set global osd_pool_default_crush_rule 1
Upgrading Ceph
Patch releases can be upgraded in-place with the ceph orch command -- see https://docs.ceph.com/en/latest/cephadm/upgrade/ for details
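For instance (the version number here is only illustrative), an in-place upgrade to a newer Octopus point release can be started and then monitored with:
ceph orch upgrade start --ceph-version 15.2.8
ceph orch upgrade status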
Uninstalling Ceph
The cluster itself can be shut down by simply issuing this command on all hosts in the cluster:
systemctl disable --now ceph.target
Unfortunately, this does not clean up all the state information that ceph maintains -- which can cause issues when re-installing ceph later on. There are several places that must be deleted or cleaned out to totally remove ceph from a system:
/var/lib/ceph
/etc/ceph
/etc/systemd/system
/var/log/ceph
/etc/logrotate.d
In the /etc/systemd/system directory, make sure you get all the links out of the 'wants' directories as well as getting rid of the service files.
If you are removing ceph completely without intending to reinstall, you can remove the 'ceph' directories themselves, otherwise just clean out the contents.
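As a hedged sketch of that cleanup on a single host (the glob patterns are assumptions -- the exact unit and file names depend on the cluster fsid, so review what matches before deleting):
systemctl disable --now ceph.target
rm -rf /var/lib/ceph/* /var/log/ceph/* /etc/ceph/*
rm -rf /etc/logrotate.d/ceph*
rm -rf /etc/systemd/system/ceph* /etc/systemd/system/*.wants/ceph*
systemctl daemon-reload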
Cluster Concepts and Operations
CRUSH Rules and Replication
Pools
RBD Images
CephFS
Creating a CephFS
Mounting a Ceph FS
Mounting a ceph filesystem on a system outside the storage cluster requires three things:
- The 'mount.ceph' mount helper, available in the 'ceph-common' package
- The client keyring created on the ceph master node for client authentication
- An entry in the /etc/fstab file
Mount Helper
All you need is the 'mount.ceph' executable, but there is no way to just get that file. So, you have to load the ceph common application bundle, which results in a bunch of dependencies that nobody will ever need:
sudo yum install -y ceph-common
Obviously, you will need to set up the repository as described above depending on your OS ... and specifically the instructions for using the 'updates-testing' repository on Fedora if needed.
Client keyring
While the admin keyring/credentials could be used, for obvious reasons a separate user should be created for mounting the Ceph FS. While it is possible to create a separate user for each client system, there is no need to go to that level of paranoia. The keyring must be created on a system with admin access to the cluster (generally the master node) and then copied to the client system. Go to the master node to create the authentication token, selecting something appropriate for the username:
ceph fs authorize <filesystem> client.<username> / rw > /etc/ceph/ceph.client.<username>.keyring
This same keyring file can then be copied over to each client system without recreating it.
ssh <client> mkdir /etc/ceph
scp /etc/ceph/ceph.conf <client>:/etc/ceph
scp /etc/ceph/ceph.client.<username>.keyring <client>:/etc/ceph
/etc/fstab
This line will mount the Ceph FS on boot:
<master IP>:/ /<mountpoint> ceph name=<username>,_netdev 0 0
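For example (the monitor address, mountpoint, and username are all hypothetical), an entry for a user named 'workspace' might look like:
10.0.0.10:/    /mnt/cephfs    ceph    name=workspace,_netdev    0 0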
At this point, simply mount the filesystem as normal to start using it immediately:
sudo mount /<mountpoint>
Object Gateway
Ceph implements an S3-compatible object store called the Ceph Object Gateway (also known as the RADOS Gateway, or RGW). The documentation on how to set this up is very fragmented and still includes old data from previous versions, so here is a cheat sheet distilled from the official documentation.
Obviously, a working ceph cluster is required; in fact, the cluster must be 'healthy' ... or it won't even start the RADOS gateway.
The ceph object gateway uses a new set of services to implement the API -- the number of service instances required is dependent on the system load and the throughput required. The documentation recommends three with a load balancer in front (external to the cluster), but one is sufficient for low-level access. Multiple gateway nodes can also be used with each node servicing a specific 'customer'.
First thing to do is designate which node(s) will support an object gateway:
ceph orch host label add <host> rgw
The label can be anything, but 'rgw' does make sense ... repeat this for all hosts that you want to run gateways. Now you can actually start the rgw service:
ceph orch apply rgw <servicename> --port=<port> --placement=label:rgw
For the test cluster, this looked like:
ceph orch apply rgw test-rgw --port=8088 --placement=label:rgw
If the rgw service(s) do not start, you may need to force the issue ...
ceph orch daemon add rgw test-rgw port:8088 <host>
Note the slightly different format of the parameters for port and placement.
In order to access the S3 service, create a user:
radosgw-admin user create --uid=<user> --display-name="Full Name" --email=user@example.com --system
This will spit out a big blob of JSON, but the two bits of info you should capture from that are the 'access_key' and 'secret_key'; you'll need them to set up the dashboard and to access the S3 service. If you lose track of them, you can always retrieve them later:
radosgw-admin user info --uid=<user>
The dashboard has a section to display info about the object gateway, but you must provide it the access keys for the user you just created (since it is a different user in ceph than the user running the dashboard):
echo -n "<access-key>" > <file-containing-access-key>
echo -n "<secret-key>" > <file-containing-secret-key>
ceph dashboard set-rgw-api-access-key -i <file-containing-access-key>
ceph dashboard set-rgw-api-secret-key -i <file-containing-secret-key>
(see https://docs.ceph.com/en/latest/mgr/dashboard/#enabling-the-object-gateway-management-frontend for more detail)
Yes, this is a very manual process, but it only needs to be done once for the cluster.
You will also need to load the access and secret key into any other application that is to use the S3 API (such as the minio client).
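As a hypothetical illustration (the endpoint host, port, and bucket name are made up, and this assumes the AWS CLI is installed on the client), basic access to the gateway might look like:
aws configure set aws_access_key_id <access-key>
aws configure set aws_secret_access_key <secret-key>
aws --endpoint-url http://<host>:8088 s3 mb s3://test-bucket
aws --endpoint-url http://<host>:8088 s3 ls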
Troubleshooting
If you get the error:
Error ENOENT: Module not found
... or some variation of it, especially when trying to run a 'ceph orch' command (or if 'ceph orch' commands just stop responding), you have a config race condition (or cephadm is just stuck). It appears the only way to resolve this is to totally disable 'cephadm' and then re-enable it:
ceph orch set backend ""
ceph mgr module disable cephadm
ceph mgr module enable cephadm
ceph orch set backend cephadm
Not the most elegant, and it doesn't address why cephadm got so messed up, but it works.