WilliamsNet Wiki - User contributions [en]

GitLab V18 Upgrade

2025-06-13T22:52:09Z

DrEdWilliams:

Once again, the GitLab team has introduced a breaking change in the upgrade from v17 to v18. The PostgreSQL database version in v18+ is 16 (instead of 14 in previous versions). The upgrade of the database must be done in multiple steps:
* running their database_upgrade script before (to capture the state of the deployment)
* deleting the database PVCs
* performing the upgrade (without the migrations step)
* running their database_upgrade script again to restore the database and restart the services.

All this is documented on their documentation site:

https://docs.gitlab.com/charts/installation/database_upgrade/#steps-for-upgrading-the-bundled-postgresql
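
The four steps above can be sketched as a dry-run helper that only prints the commands. This is a rough outline, not the official procedure: the script path, the <code>pre</code>/<code>post</code> arguments, the PVC name and the <code>gitlab.migrations.enabled</code> setting are all assumptions recalled from the GitLab page linked above and should be checked against it; <code>upgrade_plan</code> and the <code><chart-tag></code> placeholder are made up for illustration.

```shell
# Dry-run sketch: echoes each step instead of executing it.
# ASSUMPTIONS (verify against the GitLab docs): script path, pre/post
# arguments, PVC name, and the migrations flag.
upgrade_plan() {
    release="$1"      # Helm release name
    chart_tag="$2"    # chart tag of the version being installed
    script_url="https://gitlab.com/gitlab-org/charts/gitlab/-/raw/${chart_tag}/scripts/database-upgrade"

    # 1. capture the state of the deployment
    echo "curl -s ${script_url} | bash -s pre"
    # 2. delete the database PVC
    echo "kubectl delete pvc data-${release}-postgresql-0"
    # 3. upgrade without the migrations step
    echo "helm upgrade ${release} gitlab/gitlab -f your_values.yaml --set gitlab.migrations.enabled=false"
    # 4. restore the database and restart the services
    echo "curl -s ${script_url} | bash -s post"
}

upgrade_plan gitlab '<chart-tag>'
```

Printing the plan first makes it easy to review the release name and chart tag before anything destructive (the PVC deletion) is run by hand.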

This is all well and good -- it's been required before ... but the PostgreSQL user in the new configuration doesn't have all the permissions needed to do the database restore. Putting the error message into Google Gemini produced the following response (which solved the problem).

----

The "permission denied for function pg_ls_waldir" error during a GitLab upgrade (especially with Helm charts) to a newer version of GitLab or PostgreSQL usually indicates that the database user GitLab is using '''lacks the necessary privileges''' to execute this specific PostgreSQL function.

The function `pg_ls_waldir()` lists the contents of the PostgreSQL Write-Ahead Log (WAL) directory. It was added in PostgreSQL 10; before that, the generic `pg_ls_dir()` had to be pointed at the WAL directory. It is often used for monitoring, replication, or certain administrative tasks.

Here's why this happens and how to resolve it in a Helm-based GitLab deployment:

==Why it happens:==

* '''Security hardening:''' By default, PostgreSQL restricts sensitive functions like `pg_ls_waldir()` to superusers and members of specific roles (like `pg_monitor`), and newer PostgreSQL versions or stricter chart configurations may leave the GitLab user without that access.
* '''Non-superuser GitLab user:''' GitLab typically runs with a non-superuser database user for security reasons. This user might not have the permissions by default.
* '''Monitoring tools:''' Sometimes, this error isn't from GitLab's core migrations but from a monitoring tool (like `postgres_exporter` for Prometheus) trying to gather WAL-related metrics. If the GitLab upgrade involves an upgrade of the bundled PostgreSQL or a change in how permissions are managed, this can manifest.

==How to resolve it (with Helm):==

The solution involves granting the necessary permissions to the database user that GitLab uses. Since you're using Helm, you'll generally achieve this by either:

1. '''Running a `psql` command manually (quick fix):''' This is useful for immediate unblocking. The grant is stored in the database, so it survives pod restarts, but it is lost if the database is re-initialized (for example, after the database PVC is deleted).
2. '''Modifying your Helm `values.yaml` to apply grants during initialization (recommended for persistence).'''

==Method 1: Manual Grant (via `kubectl exec` and `psql`)==

This method is good for immediate troubleshooting.

1. '''Find your PostgreSQL pod:'''
If GitLab is managing its own PostgreSQL (the default for the Helm chart), find the PostgreSQL pod. Pod labels differ between chart versions, so filtering on the pod name is the most portable approach (the bundled pod is typically named `<your-gitlab-release-name>-postgresql-0`):

 kubectl get pods -n <your-gitlab-namespace> | grep postgresql

If you're using an external PostgreSQL, you'll need to connect to it directly using a PostgreSQL client.

2. '''Exec into the PostgreSQL pod (or connect externally):'''
 kubectl exec -it <postgresql-primary-pod-name> -n <your-gitlab-namespace> -- bash

(Once inside, you might need to become the `postgres` user: `su - postgres`)

3. '''Connect to the GitLab database with `psql`:'''
 psql -d gitlabhq_production -U postgres # Or your PostgreSQL superuser

(Replace `gitlabhq_production` if your database name is different, and `postgres` with your superuser if it's not the default.)

4. '''Grant the permission:'''
Identify the GitLab database user (it's often `gitlab` or `gitlab-rails` if you're using the default Helm chart). You can check it in your `values.yaml` or in the `webservice` pod's environment variables.
 GRANT EXECUTE ON FUNCTION pg_catalog.pg_ls_waldir() TO <gitlab_database_user>;

For example, if your GitLab user is `gitlab`:
 GRANT EXECUTE ON FUNCTION pg_catalog.pg_ls_waldir() TO gitlab;

You might also consider granting the `pg_monitor` role to the GitLab user if that's appropriate for your security model and the version of PostgreSQL:
 GRANT pg_monitor TO <gitlab_database_user>;

'''Note:''' The `pg_monitor` role provides access to several monitoring-related functions, including `pg_ls_waldir()`, so this can be a more comprehensive solution.

5. '''Exit `psql` and the pod:'''
 \q
 exit

6. '''Retry the GitLab upgrade:'''
 helm upgrade <your-gitlab-release-name> gitlab/gitlab -f your_values.yaml --version <new-gitlab-version>
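
Before retrying the upgrade it can be worth confirming that the grant actually took effect. A minimal sketch, assuming the database user is `gitlab`: the helper below (the name `grant_check_sql` is made up for illustration) only builds a query around PostgreSQL's `has_function_privilege()`, so it can be inspected before being sent to the database.

```shell
# Build a query that asks PostgreSQL whether a role may execute pg_ls_waldir().
# The role name is an input; 'gitlab' below is only the common default.
grant_check_sql() {
    db_user="$1"
    printf "SELECT has_function_privilege('%s', 'pg_catalog.pg_ls_waldir()', 'EXECUTE');\n" "$db_user"
}

# Print the query for review.
grant_check_sql gitlab
```

From inside the PostgreSQL pod, the output can be piped to `psql`, for example `grant_check_sql gitlab | psql -d gitlabhq_production -U postgres -At`; the query returns `t` when the role has the privilege and `f` when it does not.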

==Method 2: Persistent Grant (via Helm `values.yaml`)==

This is the preferred method for Helm deployments as it ensures the grant is applied consistently.

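You can inject an initialization script into the PostgreSQL container that grants the necessary permissions. This typically involves the init-script setting of the bundled PostgreSQL subchart in your `values.yaml`. The example below uses `postgresql.initdb.scripts`, but the exact key depends on the subchart version (Bitnami's chart has used both `initdbScripts` and `primary.initdb.scripts`), so check the values reference for the chart version you are installing.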

===Example `values.yaml` addition:===

 # Add this to your values.yaml file
 postgresql:
   # This section assumes you are using the bundled PostgreSQL with the GitLab chart.
   # If using an external PostgreSQL, you would need to run this command on your external DB.
   initdb:
     ## @param initdb.scripts Allows mounting scripts into /docker-entrypoint-initdb.d/ to be executed at first boot of the Postgres container
     ## This can be used for custom grants or other database initialization.
     scripts:
       grant-wal-permissions.sql: |
         DO
         $do$
         BEGIN
           IF EXISTS (SELECT 1 FROM pg_catalog.pg_proc p JOIN pg_catalog.pg_namespace n ON p.pronamespace = n.oid WHERE p.proname = 'pg_ls_waldir' AND n.nspname = 'pg_catalog') THEN
             GRANT EXECUTE ON FUNCTION pg_catalog.pg_ls_waldir() TO gitlab; -- Replace 'gitlab' with your actual GitLab database user if different
           END IF;
           -- Also consider granting pg_monitor role if appropriate for your security model
           IF EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'pg_monitor') THEN
             GRANT pg_monitor TO gitlab; -- Replace 'gitlab' with your actual GitLab database user if different
           END IF;
         END
         $do$;

===Explanation of the `initdb.scripts` addition:===

* `initdb.scripts`: This section allows you to define custom SQL scripts that will be run when the PostgreSQL container is first initialized.
* `grant-wal-permissions.sql`: This is a descriptive name for your script.
* `DO $do$ BEGIN ... END $do$;`: This is an anonymous PL/pgSQL block, which allows for conditional logic (like checking if the function exists before trying to grant).
* `IF EXISTS (...) THEN ... END IF;`: This checks whether the `pg_ls_waldir` function exists (useful for PostgreSQL version compatibility) and whether the `pg_monitor` role exists before attempting the `GRANT`.
* `TO gitlab;`: '''Crucially, replace `gitlab` with the actual database user GitLab uses.''' This is usually configured in your GitLab Helm chart's `global.psql.username` or `global.postgresql.username` setting, or if you're using the bundled PostgreSQL, it's often `gitlab`.

===After modifying `values.yaml`:===

Run the `helm upgrade` command again:

helm upgrade <your-gitlab-release-name> gitlab/gitlab -f your_values.yaml --version <new-gitlab-version>

==Important Notes:==

* '''Database Restart:''' For the `initdb.scripts` method, the script runs only when the PostgreSQL data directory is first initialized, that is, when the container starts with an empty data volume. Restarting or recreating the pod while the existing PVC is still attached will not re-run it. In this upgrade the database PVC is deleted and recreated, so the script should run on the next start; for a database that is already initialized, use Method 1 instead.
* '''GitLab Database User:''' Double-check the exact database user name GitLab is using. This is critical for granting permissions to the correct user.
* '''PostgreSQL Version:''' The `pg_ls_waldir()` function and the `pg_monitor` role are present in PostgreSQL 10 and newer. Older versions had neither, but GitLab 18 requires a modern PostgreSQL, so this should not matter here.
* '''Backups:''' Always ensure you have a complete and tested backup of your GitLab instance and its database before performing any major upgrade or making database permission changes.

By applying the correct `GRANT` statement to the appropriate GitLab database user, you should resolve the "permission denied for function pg_ls_waldir" error and allow your GitLab upgrade to proceed.

GitLab V18 Upgrade

2025-06-13T22:47:56Z

DrEdWilliams: Created page with "Once again, the GitLab team has introduced a breaking change in the upgrade from v17 to v18. The PostgreSQL database version in v18+ is 16 (instead of 14 in previous versions). The upgrade of the database must be done in multiple steps: * running their database_upgrade script before (to capture the state of the deployment) * deleting the database PVCs * performing the upgrade (without the migrations step) * running their database_upgrade script again to restore the da..."


Repair GRUB Installation

2025-05-17T18:57:04Z

DrEdWilliams: /* Reinstall GRUB */

Unlike most other problems where you have a (mostly) functioning Linux system to work with, if your GRUB install gets trashed you are almost literally toast.

== How it happens ==
It doesn't really take much to mess with your GRUB configuration. Fortunately, it lives in an infrequently accessed part of your filesystem (/boot) and survives by obscurity. The biggest risk of corruption is actually when applying an update to the GRUB software -- if that gets interrupted and you don't have a chance to re-run the update (e.g. the system freezes) you are in trouble.

== Resources ==
As mentioned above, if you can't boot, you don't have the usual suite of tools to help you fix the problem -- but most major distributions offer "live" CD/DVD ISOs that allow you to 'try before you buy'. This can provide enough of a workable system to allow you to bootstrap your way to a solution. Simply download a recent ISO image of a live system and put it somewhere you can boot from. Note that Ubuntu-related distribution ISOs (including Kubuntu) operate in this mode by default -- you boot from the ISO and it gives you an option to install from the live system. Other distributions have separate 'live' ISOs.

For Ubuntu-related distributions, there is also a dedicated rescue disk:

https://sourceforge.net/projects/boot-repair-cd/files/boot-repair-disk-64bit.iso/download

It is the same toolset described below, but in a form that is automatically run at boot time.

== Fixing It ==
The process summarized below is from this web page:

https://phoenixnap.com/kb/grub-rescue

That page describes the process for restoring a GRUB configuration from the GRUB command line if you don't have a live ISO image available; it does require a significant level of knowledge about the system being restored ... which may not be available.

Also on that page is the rather simple process of using the boot-repair tool (free) to rescue an Ubuntu system. There is a lot of context on that page, but the gist is here. Once you get the live image booted, open a terminal and issue these commands:

 sudo add-apt-repository ppa:yannubuntu/boot-repair
 sudo apt update
 sudo apt install boot-repair -y
 sudo boot-repair

The application will show a status dialog with a bunch of activities, then give you a big dialog with an option to implement the recommended changes ... if you have anything other than a single-disk simple configuration, click on the 'Advanced Options' link to look at what it wants to do. Make sure the location of your deployment is correct; it assumes the first disk with a reasonably valid configuration is what you want (not always the case). Make any changes and then hit the 'Apply' button. Cross your fingers ... and you will hopefully get the success dialog. Then you reboot back into your normal configuration.

== Reinstall GRUB ==
If the repair tool can't salvage your configuration, you will need to reinstall GRUB. This does require a live ISO image of the same distribution as is on the main system drive.

The trick here is that you have to fool the GRUB installer into thinking it is working on the OS image on the system drive. With the multitude of pseudo-filesystems in a current Linux system, you need to make sure enough of them are available under the mounted system disk:

 sudo mount /dev/sda1 /mnt
 sudo mount --bind /dev /mnt/dev &&
 sudo mount --bind /dev/pts /mnt/dev/pts &&
 sudo mount --bind /proc /mnt/proc &&
 sudo mount --bind /sys /mnt/sys

Now you can do the installation of the new GRUB configuration:

 sudo grub-install --root-directory=/mnt/ /dev/sda

If that succeeds, then you need to clean things up:

 sudo umount /mnt/sys &&
 sudo umount /mnt/proc &&
 sudo umount /mnt/dev/pts &&
 sudo umount /mnt/dev &&
 sudo umount /mnt

Of course, substitute the correct device name for your main system drive ...
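
Since the device names are the part most likely to differ, here is a small dry-run sketch of the mount, install and unmount sequence that only prints the commands for a given root partition and disk. The function name <code>grub_reinstall_plan</code> and the example NVMe device names are made up for illustration; nothing is executed.

```shell
# Dry-run sketch: prints the commands for review instead of running them.
# Arguments: root partition, target disk, optional mount point (default /mnt).
grub_reinstall_plan() {
    root_part="$1"
    disk="$2"
    mnt="${3:-/mnt}"

    echo "mount ${root_part} ${mnt}"
    # Bind the pseudo-filesystems into the mounted system image.
    for fs in /dev /dev/pts /proc /sys; do
        echo "mount --bind ${fs} ${mnt}${fs}"
    done
    echo "grub-install --root-directory=${mnt} ${disk}"
    # Unmount in reverse order so the nested /dev/pts goes before /dev.
    for fs in /sys /proc /dev/pts /dev; do
        echo "umount ${mnt}${fs}"
    done
    echo "umount ${mnt}"
}

grub_reinstall_plan /dev/nvme0n1p2 /dev/nvme0n1
```

Each printed line can then be run with sudo once the device names have been double-checked (<code>lsblk</code> is handy for that).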

Repair GRUB Installation

2025-05-17T18:53:11Z

DrEdWilliams: Created page with "Unlike most other problems where you have a (mostly) functioning Linux system to work with, if your GRUB install gets trashed you are almost literally toast. == How it happens == It doesn't really take much to mess with your GRUB configuration. Fortunately, it lives in an infrequently accessed part of your filesystem (/boot) and survives by obscurity. The biggest risk of corruption is actually when applying an update to the GRUB software -- if that gets interrupted an..."


System Administration Processes

2025-05-17T18:20:39Z

DrEdWilliams: /* Sustainment Practices */

== Non-standard Repositories ==
* [[Centos Repos]]

== Basic OS installation ==
* [[Basic Ubuntu Installation]]
* [[Basic CentOS 7 Installation]]
* [[Basic Debian Installation]]
* [[Basic Fedora 31+ Installation]]
* [[Fedora 33 - Workstation Installation]]
* [[Basic Rocky Linux 8 Installation]]
* [[Common Post-Install]]

== System Configuration ==
* [[HPE ProLiant System Configuration/Support]]
* [[Ethernet Device Configuration]]
* [[CUDA + NVIDIA drivers]]
* [[NVIDIA Docker Runtime|Installing NVIDIA Docker Runtime]]
* [[VLAN Interfaces]]
*[[Mounting Google Drive and Dropbox]]
*[[Mounting iSCSI LUNs]]

== Application Installation ==
* [[Plex Media Server - Debian]]
* [[Docker Installation]]
* [[Ceph Storage Cluster]]
* [[Kubernetes Cluster Installation]]
* [http://www.webmin.com/download.html Webmin Installation]
* [[Mail Server Installation]]

== Sustainment Practices ==
* [[SSH Host Key Management]]
* [[Log Accumulator]]
* [[Benchmark commands]]
* [[Memory Testing]]
* [[Repair GRUB Installation]]

== Deprecated ==
*

K3s - Kubernetes Simplified

2025-03-29T19:32:26Z

DrEdWilliams: /* External LoadBalancer/Reverse Proxy */

Lightweight Kubernetes. Production ready, easy to install, half the memory, all in a binary less than 100 MB.

Great for:
* Edge
* IoT
* CI
* Development
* ARM
* Embedding k8s
* Situations where a PhD in k8s clusterology is infeasible

Extracted from https://github.com/k3s-io/k3s ... YMMV

What makes '''k3s''' special is that they have stripped out all the bloat-ware from upstream kubernetes that is not needed for small and bare-metal deployments. None of the so-called cloud features are present, and most of the CSI storage drivers are also gone.

As such, it is a simpler replacement for the '''kubeadm''' command, and it does a bit more for you:
* installs '''traefik''' as the default ingress controller -- it works ... especially for a test/dev cluster, but for production I'll use Contour/Envoy
* installs '''flannel''' as the CNI networking layer ... I had problems with flannel before, let's see if this works any better
* installs '''metrics-server''' to collect node/pod performance data
* provides a 'local path provisioner' ... not sure what this does ... yet ...
__NOTOC__
== Basic Cluster Installation ==
They weren't kidding ... this is all it takes to create a simple cluster:
 curl -sfL https://get.k3s.io | sh -
A kubeconfig file is written to /etc/rancher/k3s/k3s.yaml and the service is automatically started or restarted. The install script installs K3s along with additional utilities such as kubectl, crictl, k3s-killall.sh, and k3s-uninstall.sh. For example:
 sudo kubectl get nodes
The node token (K3S_TOKEN) is created at /var/lib/rancher/k3s/server/node-token on the server. To install on worker nodes, pass K3S_URL along with the K3S_TOKEN or K3S_CLUSTER_SECRET environment variable, for example:
 curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=XXX sh -
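
To avoid pasting the token by hand, the worker install line can be assembled from the server name and token. A minimal sketch; the helper name <code>k3s_agent_join_cmd</code> is made up for illustration, and it only prints the command so it can be checked first.

```shell
# Print the agent install command for a given server host and node token.
k3s_agent_join_cmd() {
    server="$1"
    token="$2"
    printf 'curl -sfL https://get.k3s.io | K3S_URL=https://%s:6443 K3S_TOKEN=%s sh -\n' "$server" "$token"
}

k3s_agent_join_cmd myserver XXX
```

On the server, the real token can be read with <code>sudo cat /var/lib/rancher/k3s/server/node-token</code> and passed as the second argument.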

== HA Cluster Installation ==
One thing that attracted me to this distribution (as it is a distribution, not a fork of kubernetes) is that it provides a very simple way to set up a HA cluster control plane. There are command options in the deployment that provide for the configuration of a common etcd server to facilitate a HA control plane. Even with that ... it can get a bit cumbersome. Enter [https://github.com/alexellis/k3sup k3sup].

=== k3sup ===
k3sup is a wrapper for the k3s installer that enables:
* installation of a complete cluster from a single login session using SSH
* combining multiple k3s parameters into single options to k3sup for simplicity (i.e. '''--cluster''')

=== External LoadBalancer/Reverse Proxy ===
The one remaining item needed for a fully HA control plane is an external load balancer or reverse proxy to provide a single IP address for accessing the control API from inside or outside the cluster. Lacking something like an F5 BIG-IP load balancer or an equivalent, the poor-man's approach is to use a separate reverse proxy. This can be done with HAProxy or NGINX. HAProxy can support a HA reverse proxy if you need that level of redundancy, or a simple NGINX configuration like this will work:

<pre>stream {
    upstream k3s-servers {
        server 10.0.0.10:6443;
        server 10.0.0.74:6443;
        server 10.0.0.64:6443;
    }

    server {
        listen 6443;
        proxy_pass k3s-servers;
    }
}
</pre>

This needs to be at the top level in the NGINX configuration -- either in the /etc/nginx/nginx.conf itself, or included at the top level. This process is more completely described in the K3S documentation:

https://docs.k3s.io/datastore/cluster-loadbalancer
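
If the list of server nodes changes often, the stream block can be generated rather than edited by hand. A rough sketch, assuming the upstream name <code>k3s-servers</code> from the example above; the function name <code>nginx_stream_block</code> is made up for illustration.

```shell
# Emit an NGINX stream block that forwards one TCP port to a list of servers.
# Arguments: port, then one or more server IP addresses.
nginx_stream_block() {
    port="$1"
    shift
    echo "stream {"
    echo "    upstream k3s-servers {"
    for ip in "$@"; do
        echo "        server ${ip}:${port};"
    done
    echo "    }"
    echo ""
    echo "    server {"
    echo "        listen ${port};"
    echo "        proxy_pass k3s-servers;"
    echo "    }"
    echo "}"
}

nginx_stream_block 6443 10.0.0.10 10.0.0.74 10.0.0.64
```

Redirect the output to a file that is included at the top level of the NGINX configuration, then run <code>nginx -t</code> to validate it before reloading.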

If setting up a reverse proxy is not an option, you can set up a round robin in the DNS. Problem is, the pfSense firewall doesn't do round robins by default, so a little arm twisting is needed. Put this in the 'custom options' in the DNS Resolver settings page:
 server:
   rrset-roundrobin: yes
   local-zone: "kube-test.williams.localnet" transparent
   local-data: "kube-test.williams.localnet A 10.0.0.10"
   local-data: "kube-test.williams.localnet A 10.0.0.64"
   local-data: "kube-test.williams.localnet A 10.0.0.74"
Change the name and IP addresses as needed, and then load the changes.
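
The custom options follow a simple pattern (one <code>local-data</code> line per address), so they can be produced from a name and a list of IPs. A small sketch; the function name <code>unbound_round_robin</code> is made up for illustration.

```shell
# Emit resolver custom options for a DNS round robin.
# Arguments: DNS name, then one or more IPv4 addresses.
unbound_round_robin() {
    name="$1"
    shift
    echo "server:"
    echo "rrset-roundrobin: yes"
    printf 'local-zone: "%s" transparent\n' "$name"
    for ip in "$@"; do
        printf 'local-data: "%s A %s"\n' "$name" "$ip"
    done
}

unbound_round_robin kube-test.williams.localnet 10.0.0.10 10.0.0.64 10.0.0.74
```

After loading the changes, repeated lookups of the name (for example with <code>dig +short</code>) should return the addresses in a rotating order.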

Using either approach somewhat complicates the deployment commands below, so pay attention ...

=== Get k3sup ===
Not surprisingly, there are multiple ways to do this. The author wants you to download it from a marketplace app (arkade):
 curl -sLS https://dl.get-arkade.dev | sh
 # Install the binary using the command given as output, such as:
 sudo cp arkade-darwin /usr/local/bin/arkade
 # Or on Linux:
 sudo cp arkade /usr/local/bin/arkade
 arkade get k3sup

... or you can just download it directly ...
 curl -sLS https://get.k3sup.dev | sh
 sudo install k3sup /usr/local/bin/

You just need to put it somewhere in your path (or use a full path reference).

=== Install the master nodes ===
Install the first master node with these commands. This will put it on the node that the command is issued from ... if that is not the desire, remove the '--local' parameter
mkdir -p $HOME/.kube
export EXTRA_ARGS="--disable servicelb --disable traefik --disable local-storage --write-kubeconfig-mode=644 --flannel-backend host-gw"
k3sup install --cluster --local --ip <server-ip-1> --k3s-channel stable --k3s-extra-args "$EXTRA_ARGS" \
--local-path $HOME/.kube/config --tls-san <rr-dns-name>

Explanations:
* --cluster sets this up for a HA control plane
* --local says to do it on this system, instead of using SSH to the node -- '''but you still need to provide the IP address''', otherwise it will put 127.0.0.1 as the API server address
* get the last stable version available from k3s (use 'latest' to get bleeding edge updates)
* we want to disable the simplified k3s loadbalancer ... it really is a crowbar solution and doesn't provide IP request capability
* disable the default ingress controller, as we really want something better
* disable the local storage driver (it's a pain if you're installing another storage driver)
* make the k3s.yaml file readable by the pods ... really should be by default
* (temporary workaround) force the flannel backend to use 'host-gw' mode since 'vxlan' mode seems to be broken with Debian 11
* the --local-path tells k3sup where to put the kubeconfig files ... this is the standard place, so we will put it there. The directory has to exist -- hence the 'mkdir' command ...
* by providing the name of the DNS round robin, it should be able to pass muster on TLS connections ...
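
To check the quoting before touching a node, the install command can be assembled and printed first. This is a sketch under assumptions: <code>first_server_cmd</code> is a hypothetical helper, and it only uses the flags already shown above. The variable has to be expanded inside double quotes; inside single quotes the shell would pass the literal text <code>$EXTRA_ARGS</code> to k3sup.

```shell
#!/bin/sh
# Build the k3sup command for the first server and print it instead of running it.
# first_server_cmd is a hypothetical helper; the flags are the ones used on this page.
EXTRA_ARGS="--disable servicelb --disable traefik --disable local-storage --write-kubeconfig-mode=644 --flannel-backend host-gw"

first_server_cmd() {
    server_ip="$1"   # address of the first master
    san_name="$2"    # common name served by the proxy or DNS round robin
    printf 'k3sup install --cluster --local --ip %s --k3s-channel stable --k3s-extra-args "%s" --local-path %s --tls-san %s\n' \
        "$server_ip" "$EXTRA_ARGS" "$HOME/.kube/config" "$san_name"
}

# Example values taken from the DNS round robin section.
first_server_cmd 10.0.0.10 kube-test.williams.localnet
```

Once the printed line looks right, it can be pasted into the shell and run as is.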

Subsequent master nodes can be installed next:
k3sup join --server --host <new master server> --k3s-channel stable --server-host <original master server> --user ewilliam --k3s-extra-args "$EXTRA_ARGS"

Explanations:
* join as another server, not an agent/worker
* provide the hostname of the new master node to be joined
* specify the hostname of the first master created
* use the same EXTRA_ARGS provided to the first server

Repeat this for all the masters (minimum 3) providing the same server IP address of the first master. Note that if you need to add/remove more parameters to the k3s service, you need to do it on all server nodes -- just edit the k3s.service file and restart one server at a time.

Now that you have all the masters online, reset the kubeconfig file (~/.kube/config) to use the common name through the reverse proxy or DNS Round Robin instead of the IP of the first master:
sed -i -e s/<server-ip-1>/<common-name>/g $HOME/.kube/config
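
The dots in an IP address are wildcards to sed, so a slightly more careful variant escapes them first. The sketch below assumes a hypothetical helper named <code>point_kubeconfig_at</code> and demonstrates it on a throwaway file rather than the real kubeconfig.

```shell
#!/bin/sh
# Rewrite the API server address in a kubeconfig file.
# point_kubeconfig_at is a hypothetical helper name.
point_kubeconfig_at() {
    config="$1"; old_ip="$2"; new_name="$3"
    # Escape the dots so that sed matches them literally.
    pattern=$(printf '%s' "$old_ip" | sed 's/\./\\./g')
    tmp=$(mktemp) || return 1
    # Write back through cat so the original file keeps its permissions.
    sed "s/${pattern}/${new_name}/g" "$config" > "$tmp" && cat "$tmp" > "$config"
    rm -f "$tmp"
}

# Demonstration on a temporary file instead of ~/.kube/config.
demo=$(mktemp)
echo "    server: https://10.0.0.10:6443" > "$demo"
point_kubeconfig_at "$demo" 10.0.0.10 kube-test.williams.localnet
cat "$demo"
rm -f "$demo"
```

For the real file, the call would be <code>point_kubeconfig_at $HOME/.kube/config <server-ip-1> <common-name></code>.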

=== Install Worker Nodes ===
At this point, you should have an operational cluster. k3s enables masters to be used for normal pods by default, so this is the minimum operational cluster. Next, install the agent/worker nodes:
k3sup join --ip <agent-node-x> --k3s-channel stable --server-host <common-name>

Explanations:
* Joining as an agent, so no '--server' this time
* provide each agent/worker IP in turn
* instead of the '--server-ip' parameter, we are using the '--server-host' parameter so we can use the DNS RR to ensure that the nodes always can access a master
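
With more than a couple of workers, a loop saves retyping. The sketch below only prints the join commands so they can be reviewed first; <code>agent_join_cmds</code> is a hypothetical helper and the agent addresses are placeholders, not nodes from this cluster.

```shell
#!/bin/sh
# Print one k3sup join command per agent node; nothing is executed.
# agent_join_cmds is a hypothetical helper name.
agent_join_cmds() {
    common_name="$1"; shift
    for agent_ip in "$@"; do
        echo "k3sup join --ip ${agent_ip} --k3s-channel stable --server-host ${common_name}"
    done
}

# The agent addresses here are placeholders.
agent_join_cmds kube-test.williams.localnet 10.0.0.101 10.0.0.102
```

Piping the output to <code>sh</code> would run the joins one after another.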

== Automated Upgrades: System Upgrade Controller ==

Rancher has taken the simplicity of k3s to another level by providing a kubernetes controller that can update the cluster ''from within the cluster'' using the [https://rancher.com/docs/k3s/latest/en/upgrades/automated/ System Upgrade Controller].

Installing the controller is straightforward using a single manifest (also located in the git repo):
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/download/v0.7.6/system-upgrade-controller.yaml

The controller looks for upgrade plans as custom resources (of type 'Plan', for some odd reason) to define its behavior. Two 'Plan's are required ... one for server/master nodes and one for agent/worker nodes. These plans can be set up to either upgrade to a specific k3s version or to monitor the release channel and automatically upgrade when a new release is available. The manifest looks like this:

<pre># Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
name: server-plan
namespace: system-upgrade
spec:
concurrency: 1
cordon: true
nodeSelector:
matchExpressions:
- key: node-role.kubernetes.io/master
operator: In
values:
- "true"
serviceAccountName: system-upgrade
upgrade:
image: rancher/k3s-upgrade
#version: v1.20.4+k3s1
channel: https://update.k3s.io/v1-release/channels/stable
---
# Agent plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
name: agent-plan
namespace: system-upgrade
spec:
concurrency: 1
cordon: true
nodeSelector:
matchExpressions:
- key: node-role.kubernetes.io/master
operator: DoesNotExist
prepare:
args:
- prepare
- server-plan
image: rancher/k3s-upgrade #:v1.17.4-k3s1
serviceAccountName: system-upgrade
upgrade:
image: rancher/k3s-upgrade
#version: v1.20.4+k3s1
channel: https://update.k3s.io/v1-release/channels/stable</pre>

Simply apply this manifest and the controller will identify nodes to upgrade and will take care of them as instructed. It's as simple as that. If you don't want to automatically upgrade, comment out the 'channel' parameter in each Plan, and uncomment the 'version' parameter and set it to the version to upgrade to and apply. The controller will see the new plan and act accordingly. If there is nothing to do, the controller just sits there waiting for a new plan with some work.
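
Switching between the two modes is a two-line edit per Plan, which can be scripted. The following is a sketch, assuming the manifest is laid out like the one above (a commented <code>#version:</code> line and an active <code>channel:</code> line); <code>pin_plan_version</code> is a hypothetical helper that reads the YAML on standard input.

```shell
#!/bin/sh
# Switch a Plan manifest from channel tracking to a pinned version.
# pin_plan_version is a hypothetical helper; it reads YAML on stdin.
pin_plan_version() {
    version="$1"
    sed -e 's|^\([[:space:]]*\)channel:|\1#channel:|' \
        -e "s|^\([[:space:]]*\)#version:.*|\1version: ${version}|"
}

# Demonstration on the two relevant lines of a Plan.
printf '  #version: v1.20.4+k3s1\n  channel: https://update.k3s.io/v1-release/channels/stable\n' \
    | pin_plan_version v1.20.4+k3s1
```

Assuming the manifest is saved as <code>plans.yaml</code> (a hypothetical file name), the result can be applied with <code>pin_plan_version v1.20.4+k3s1 < plans.yaml | kubectl apply -f -</code>.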

== Cluster Access==
The kubeconfig file stored at /etc/rancher/k3s/k3s.yaml is used to configure access to the Kubernetes cluster. If you have installed upstream Kubernetes command line tools such as kubectl or helm you will need to configure them with the correct kubeconfig path. This can be done by either exporting the KUBECONFIG environment variable or by invoking the --kubeconfig command line flag. Refer to the examples below for details.

Leverage the KUBECONFIG environment variable:
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get pods --all-namespaces
helm ls --all-namespaces

Or specify the location of the kubeconfig file in the command:

kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get pods --all-namespaces
helm --kubeconfig /etc/rancher/k3s/k3s.yaml ls --all-namespaces

To access the kubernetes cluster from outside the cluster, copy <code>/etc/rancher/k3s/k3s.yaml</code> to your machine located outside the cluster as <code>~/.kube/config</code>. Then replace the value of the server field with the common name of the K3s server cluster as described above. kubectl can now manage your K3s cluster.

==Cluster Maintenance==
As packages (particularly helm charts) are updated, old container images are left behind. Leaving container images in place is not generally a bad thing, since it lets pods migrate between nodes in the cluster without waiting for the image to be downloaded. Upgrading packages, however, leaves a lot of cruft behind that can eventually cause issues with disk space. Supposedly Kubernetes manages this garbage collection, but other related packages (such as ceph) complain about low disk space long before Kubernetes decides to do anything about it. Fortunately, cleaning out the unused images is simple:
/usr/local/bin/k3s crictl rmi -q
While this can be run at any time, running it too often will remove images that are still valid, just not in use at that time on that node -- resulting in longer migration times when nodes are drained or fail.

==Finishing The Job==
These instructions will create a simple or HA kubernetes cluster. The HA cluster can survive a single master failure, but two will take down the cluster.

At this point you jump back into the rest of the instructions for installing a [[Kubernetes Cluster Installation|Kubernetes Cluster]], remembering that the network layer and metrics server have already been installed.

K3s - Kubernetes Simplified

2025-03-29T19:30:16Z

DrEdWilliams:

Lightweight Kubernetes. Production ready, easy to install, half the memory, all in a binary less than 100 MB.

Great for:
* Edge
* IoT
* CI
* Development
* ARM
* Embedding k8s
* Situations where a PhD in k8s clusterology is infeasible

Extracted from https://github.com/k3s-io/k3s ... YMMV

What makes '''k3s''' special is that they have stripped out all the bloat-ware from upstream kubernetes that is not needed for small and bare-metal deployments. None of the so-called cloud features are present, and most of the CSI storage drivers are also gone.

As such, it is a simpler replacement for the '''kubeadm''' command, and it does a bit more for you:
* installs '''traefik''' as the default ingress controller -- it works ... especially for a test/dev cluster, but for production I'll use Contour/Envoy
* installs '''flannel''' as the CNI networking layer ... I had problems with flannel before, let's see if this works any better
* installs '''metrics-server''' to collect node/pod performance data
* provides a 'local path provisioner' ... not sure what this does ... yet ...
__NOTOC__
== Basic Cluster Installation ==
They weren't kidding ... this is all it takes to create a simple cluster:
curl -sfL https://get.k3s.io | sh -
A kubeconfig file is written to /etc/rancher/k3s/k3s.yaml and the service is automatically started or restarted. The install script will install K3s and additional utilities, such as kubectl, crictl, k3s-killall.sh, and k3s-uninstall.sh, for example:
sudo kubectl get nodes
K3S_TOKEN is created at /var/lib/rancher/k3s/server/node-token on your server. To install on worker nodes we should pass K3S_URL along with K3S_TOKEN or K3S_CLUSTER_SECRET environment variables, for example:
curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=XXX sh -

== HA Cluster Installation ==
One thing that attracted me to this distribution (as it is a distribution, not a fork of kubernetes) is that it provides a very simple way to set up a HA cluster control plane. There are command options in the deployment that provide for the configuration of a common etcd server to facilitate a HA control plane. Even with that ... it can get a bit cumbersome. Enter [https://github.com/alexellis/k3sup k3sup].

=== k3sup ===
k3sup is a wrapper for the k3s installer that enables:
* installation of a complete cluster from a single login session using SSH
* combining multiple k3s parameters into single options to k3sup for simplicity (e.g. '''--cluster''')

=== External LoadBalancer/Reverse Proxy ===
The one remaining item needed for a fully HA control plane is an external load balancer or reverse proxy to provide a single IP address for accessing the control API from inside or outside the cluster. Lacking an F5 BIG-IP load balancer or something equivalent, the poor-man's approach is to use a separate reverse proxy. This can be done with HAProxy or NGINX. HAProxy can itself be deployed in an HA configuration if you need that level of redundancy; otherwise a simple NGINX configuration like this will work:

<pre>stream {
upstream k3s-servers {
server 10.0.0.10:6443;
server 10.0.0.74:6443;
server 10.0.0.64:6443;
}

server {
listen 6443;
proxy_pass k3s-servers;
}
}
</pre>

This needs to be at the top level in the NGINX configuration -- either in the /etc/nginx/nginx.conf itself, or included at the top level.

If setting up a reverse proxy is not an option, you can set up a round robin in the DNS. Problem is, the pfSense firewall doesn't do round robins by default, so a little arm twisting is needed. Put this in the 'custom options' in the DNS Resolver settings page:
server:
rrset-roundrobin: yes
local-zone: "kube-test.williams.localnet" transparent
local-data: "kube-test.williams.localnet A 10.0.0.10"
local-data: "kube-test.williams.localnet A 10.0.0.64"
local-data: "kube-test.williams.localnet A 10.0.0.74"
Change the name and IP addresses as needed, and then load the changes.

Using either approach somewhat complicates the deployment commands below, so pay attention ...

=== Get k3sup ===
Not surprisingly, there are multiple ways to do this. The author wants you to download it from a marketplace app (arkade):
curl -sLS https://dl.get-arkade.dev | sh
# Install the binary using the command given as output, such as:
sudo cp arkade-darwin /usr/local/bin/arkade
# Or on Linux:
sudo cp arkade /usr/local/bin/arkade
arkade get k3sup

... or you can just download it directly ...
curl -sLS https://get.k3sup.dev | sh
sudo install k3sup /usr/local/bin/

You just need to put it somewhere in your path (or use a full path reference).

=== Install the master nodes ===
Install the first master node with these commands. This will put it on the node that the command is issued from ... if that is not the desire, remove the '--local' parameter
mkdir -p $HOME/.kube
export EXTRA_ARGS="--disable servicelb --disable traefik --disable local-storage --write-kubeconfig-mode=644 --flannel-backend host-gw"
k3sup install --cluster --local --ip <server-ip-1> --k3s-channel stable --k3s-extra-args "$EXTRA_ARGS" \
--local-path $HOME/.kube/config --tls-san <rr-dns-name>

Explanations:
* --cluster sets this up for a HA control plane
* --local says to do it on this system, instead of using SSH to the node -- '''but you still need to provide the IP address''', otherwise it will put 127.0.0.1 as the API server address
* get the last stable version available from k3s (use 'latest' to get bleeding edge updates)
* we want to disable the simplified k3s loadbalancer ... it really is a crowbar solution and doesn't provide IP request capability
* disable the default ingress controller, as we really want something better
* disable the local storage driver (it's a pain if you're installing another storage driver)
* make the k3s.yaml file readable by the pods ... really should be by default
* (temporary workaround) force the flannel backend to use 'host-gw' mode since 'vxlan' mode seems to be broken with Debian 11
* the --local-path tells k3sup where to put the kubeconfig files ... this is the standard place, so we will put it there. The directory has to exist -- hence the 'mkdir' command ...
* by providing the name of the DNS round robin, it should be able to pass muster on TLS connections ...

Subsequent master nodes can be installed next:
k3sup join --server --host <new master server> --k3s-channel stable --server-host <original master server> --user ewilliam --k3s-extra-args "$EXTRA_ARGS"

Explanations:
* join as another server, not an agent/worker
* provide the hostname of the new master node to be joined
* specify the hostname of the first master created
* use the same EXTRA_ARGS provided to the first server

Repeat this for all the masters (minimum 3) providing the same server IP address of the first master. Note that if you need to add/remove more parameters to the k3s service, you need to do it on all server nodes -- just edit the k3s.service file and restart one server at a time.

Now that you have all the masters online, reset the kubeconfig file (~/.kube/config) to use the common name through the reverse proxy or DNS Round Robin instead of the IP of the first master:
sed -i -e s/<server-ip-1>/<common-name>/g $HOME/.kube/config

=== Install Worker Nodes ===
At this point, you should have an operational cluster. k3s enables masters to be used for normal pods by default, so this is the minimum operational cluster. Next, install the agent/worker nodes:
k3sup join --ip <agent-node-x> --k3s-channel stable --server-host <common-name>

Explanations:
* Joining as an agent, so no '--server' this time
* provide each agent/worker IP in turn
* instead of the '--server-ip' parameter, we are using the '--server-host' parameter so we can use the DNS RR to ensure that the nodes always can access a master

== Automated Upgrades: System Upgrade Controller ==

Rancher has taken the simplicity of k3s to another level by providing a kubernetes controller that can update the cluster ''from within the cluster'' using the [https://rancher.com/docs/k3s/latest/en/upgrades/automated/ System Upgrade Controller].

Installing the controller is straightforward using a single manifest (also located in the git repo):
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/download/v0.7.6/system-upgrade-controller.yaml

The controller looks for upgrade plans as custom resources (of type 'Plan', for some odd reason) to define its behavior. Two 'Plan's are required ... one for server/master nodes and one for agent/worker nodes. These plans can be set up to either upgrade to a specific k3s version or to monitor the release channel and automatically upgrade when a new release is available. The manifest looks like this:

<pre># Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
name: server-plan
namespace: system-upgrade
spec:
concurrency: 1
cordon: true
nodeSelector:
matchExpressions:
- key: node-role.kubernetes.io/master
operator: In
values:
- "true"
serviceAccountName: system-upgrade
upgrade:
image: rancher/k3s-upgrade
#version: v1.20.4+k3s1
channel: https://update.k3s.io/v1-release/channels/stable
---
# Agent plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
name: agent-plan
namespace: system-upgrade
spec:
concurrency: 1
cordon: true
nodeSelector:
matchExpressions:
- key: node-role.kubernetes.io/master
operator: DoesNotExist
prepare:
args:
- prepare
- server-plan
image: rancher/k3s-upgrade #:v1.17.4-k3s1
serviceAccountName: system-upgrade
upgrade:
image: rancher/k3s-upgrade
#version: v1.20.4+k3s1
channel: https://update.k3s.io/v1-release/channels/stable</pre>

Simply apply this manifest and the controller will identify nodes to upgrade and will take care of them as instructed. It's as simple as that. If you don't want to automatically upgrade, comment out the 'channel' parameter in each Plan, and uncomment the 'version' parameter and set it to the version to upgrade to and apply. The controller will see the new plan and act accordingly. If there is nothing to do, the controller just sits there waiting for a new plan with some work.

== Cluster Access==
The kubeconfig file stored at /etc/rancher/k3s/k3s.yaml is used to configure access to the Kubernetes cluster. If you have installed upstream Kubernetes command line tools such as kubectl or helm you will need to configure them with the correct kubeconfig path. This can be done by either exporting the KUBECONFIG environment variable or by invoking the --kubeconfig command line flag. Refer to the examples below for details.

Leverage the KUBECONFIG environment variable:
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get pods --all-namespaces
helm ls --all-namespaces

Or specify the location of the kubeconfig file in the command:

kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get pods --all-namespaces
helm --kubeconfig /etc/rancher/k3s/k3s.yaml ls --all-namespaces

To access the kubernetes cluster from outside the cluster, copy <code>/etc/rancher/k3s/k3s.yaml</code> to your machine located outside the cluster as <code>~/.kube/config</code>. Then replace the value of the server field with the common name of the K3s server cluster as described above. kubectl can now manage your K3s cluster.

==Cluster Maintenance==
As packages (particularly helm charts) are updated, old container images are left behind. Leaving container images in place is not generally a bad thing, since it lets pods migrate between nodes in the cluster without waiting for the image to be downloaded. Upgrading packages, however, leaves a lot of cruft behind that can eventually cause issues with disk space. Supposedly Kubernetes manages this garbage collection, but other related packages (such as ceph) complain about low disk space long before Kubernetes decides to do anything about it. Fortunately, cleaning out the unused images is simple:
/usr/local/bin/k3s crictl rmi -q
While this can be run at any time, running it too often will remove images that are still valid, just not in use at that time on that node -- resulting in longer migration times when nodes are drained or fail.

==Finishing The Job==
These instructions will create a simple or HA kubernetes cluster. The HA cluster can survive a single master failure, but two will take down the cluster.

At this point you jump back into the rest of the instructions for installing a [[Kubernetes Cluster Installation|Kubernetes Cluster]], remembering that the network layer and metrics server have already been installed.

K3s - Kubernetes Simplified

2025-03-29T19:25:53Z

DrEdWilliams: /* External LoadBalancer/Reverse Proxy */

Lightweight Kubernetes. Production ready, easy to install, half the memory, all in a binary less than 100 MB.

Great for:
* Edge
* IoT
* CI
* Development
* ARM
* Embedding k8s
* Situations where a PhD in k8s clusterology is infeasible

Extracted from https://github.com/k3s-io/k3s ... YMMV

What makes '''k3s''' special is that they have stripped out all the bloat-ware from upstream kubernetes that is not needed for small and bare-metal deployments. None of the so-called cloud features are present, and most of the CSI storage drivers are also gone.

As such, it is a simpler replacement for the '''kubeadm''' command, and it does a bit more for you:
* installs '''traefik''' as the default ingress controller -- it works ... especially for a test/dev cluster, but for production I'll use Contour/Envoy
* installs '''flannel''' as the CNI networking layer ... I had problems with flannel before, let's see if this works any better
* installs '''metrics-server''' to collect node/pod performance data
* provides a 'local path provisioner' ... not sure what this does ... yet ...
__NOTOC__
== Basic Cluster Installation ==
They weren't kidding ... this is all it takes to create a simple cluster:
curl -sfL https://get.k3s.io | sh -
A kubeconfig file is written to /etc/rancher/k3s/k3s.yaml and the service is automatically started or restarted. The install script will install K3s and additional utilities, such as kubectl, crictl, k3s-killall.sh, and k3s-uninstall.sh, for example:
sudo kubectl get nodes
K3S_TOKEN is created at /var/lib/rancher/k3s/server/node-token on your server. To install on worker nodes we should pass K3S_URL along with K3S_TOKEN or K3S_CLUSTER_SECRET environment variables, for example:
curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=XXX sh -

== HA Cluster Installation ==
One thing that attracted me to this distribution (as it is a distribution, not a fork of kubernetes) is that it provides a very simple way to set up a HA cluster control plane. There are command options in the deployment that provide for the configuration of a common etcd server to facilitate a HA control plane. Even with that ... it can get a bit cumbersome. Enter [https://github.com/alexellis/k3sup k3sup].

=== k3sup ===
k3sup is a wrapper for the k3s installer that enables:
* installation of a complete cluster from a single login session using SSH
* combining multiple k3s parameters into single options to k3sup for simplicity (e.g. '''--cluster''')

=== External LoadBalancer/Reverse Proxy ===
The one remaining item needed for a fully HA control plane is an external load balancer or reverse proxy to provide a single IP address for accessing the control API from inside or outside the cluster. Lacking an F5 BIG-IP load balancer or something equivalent, the poor-man's approach is to use a separate reverse proxy. This can be done with HAProxy or NGINX. HAProxy can itself be deployed in an HA configuration if you need that level of redundancy; otherwise a simple NGINX configuration like this will work:

<pre>stream {
upstream k3s-servers {
server 10.0.0.10:6443;
server 10.0.0.74:6443;
server 10.0.0.64:6443;
}

server {
listen 6443;
proxy_pass k3s-servers;
}
}
</pre>

This needs to be at the top level in the NGINX configuration -- either in the /etc/nginx/nginx.conf itself, or included at the top level.

If setting up a reverse proxy is not an option, you can set up a round robin in the DNS. Problem is, the pfSense firewall doesn't do round robins by default, so a little arm twisting is needed. Put this in the 'custom options' in the DNS Resolver settings page:
server:
rrset-roundrobin: yes
local-zone: "kube-test.williams.localnet" transparent
local-data: "kube-test.williams.localnet A 10.0.0.10"
local-data: "kube-test.williams.localnet A 10.0.0.64"
local-data: "kube-test.williams.localnet A 10.0.0.74"
Change the name and IP addresses as needed, and then load the changes.

Using either approach somewhat complicates the deployment commands below, so pay attention ...

=== Get k3sup ===
Not surprisingly, there are multiple ways to do this. The author wants you to download it from a marketplace app (arkade):
curl -sLS https://dl.get-arkade.dev | sh
# Install the binary using the command given as output, such as:
sudo cp arkade-darwin /usr/local/bin/arkade
# Or on Linux:
sudo cp arkade /usr/local/bin/arkade
arkade get k3sup

... or you can just download it directly ...
curl -sLS https://get.k3sup.dev | sh
sudo install k3sup /usr/local/bin/

You just need to put it somewhere in your path (or use a full path reference).

=== Install the master nodes ===
Install the first master node with these commands. This will put it on the node that the command is issued from ... if that is not the desire, remove the '--local' parameter
mkdir -p $HOME/.kube
export EXTRA_ARGS="--disable servicelb --disable traefik --disable local-storage --write-kubeconfig-mode=644 --flannel-backend host-gw"
k3sup install --cluster --local --ip <server-ip-1> --k3s-channel stable --k3s-extra-args "$EXTRA_ARGS" \
--local-path $HOME/.kube/config --tls-san <rr-dns-name>

Explanations:
* --cluster sets this up for a HA control plane
* --local says to do it on this system, instead of using SSH to the node -- '''but you still need to provide the IP address''', otherwise it will put 127.0.0.1 as the API server address
* get the last stable version available from k3s (use 'latest' to get bleeding edge updates)
* we want to disable the simplified k3s loadbalancer ... it really is a crowbar solution and doesn't provide IP request capability
* disable the default ingress controller, as we really want something better
* disable the local storage driver (it's a pain if you're installing another storage driver)
* make the k3s.yaml file readable by the pods ... really should be by default
* (temporary workaround) force the flannel backend to use 'host-gw' mode since 'vxlan' mode seems to be broken with Debian 11
* the --local-path tells k3sup where to put the kubeconfig files ... this is the standard place, so we will put it there. The directory has to exist -- hence the 'mkdir' command ...
* by providing the name of the DNS round robin, it should be able to pass muster on TLS connections ...

Subsequent master nodes can be installed next:
k3sup join --server --host <new master server> --k3s-channel stable --server-host <original master server> --user ewilliam --k3s-extra-args "$EXTRA_ARGS"

Explanations:
* join as another server, not an agent/worker
* provide the hostname of the new master node to be joined
* specify the hostname of the first master created
* use the same EXTRA_ARGS provided to the first server

Repeat this for all the masters (minimum 3) providing the same server IP address of the first master. Note that if you need to add/remove more parameters to the k3s service, you need to do it on all server nodes -- just edit the k3s.service file and restart one server at a time.

Now that you have all the masters online, reset the kubeconfig file (~/.kube/config) to use the DNS RR name instead of the IP of the first master:
sed -i -e s/<server-ip-1>/<rr-dns-name>/g $HOME/.kube/config

=== Install Worker Nodes ===
At this point, you should have an operational cluster. k3s enables masters to be used for normal pods by default, so this is the minimum operational cluster. Next, install the agent/worker nodes:
k3sup join --ip <agent-node-x> --k3s-channel stable --server-host <rr-dns-name>

Explanations:
* Joining as an agent, so no '--server' this time
* provide each agent/worker IP in turn
* instead of the '--server-ip' parameter, we are using the '--server-host' parameter so we can use the DNS RR to ensure that the nodes always can access a master

== Automated Upgrades: System Upgrade Controller ==

Rancher has taken the simplicity of k3s to another level by providing a kubernetes controller that can update the cluster ''from within the cluster'' using the [https://rancher.com/docs/k3s/latest/en/upgrades/automated/ System Upgrade Controller].

Installing the controller is straightforward using a single manifest (also located in the git repo):
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/download/v0.7.6/system-upgrade-controller.yaml

The controller looks for upgrade plans as custom resources (of type 'Plan', for some odd reason) to define its behavior. Two 'Plan's are required ... one for server/master nodes and one for agent/worker nodes. These plans can be set up to either upgrade to a specific k3s version or to monitor the release channel and automatically upgrade when a new release is available. The manifest looks like this:

<pre># Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
name: server-plan
namespace: system-upgrade
spec:
concurrency: 1
cordon: true
nodeSelector:
matchExpressions:
- key: node-role.kubernetes.io/master
operator: In
values:
- "true"
serviceAccountName: system-upgrade
upgrade:
image: rancher/k3s-upgrade
#version: v1.20.4+k3s1
channel: https://update.k3s.io/v1-release/channels/stable
---
# Agent plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
name: agent-plan
namespace: system-upgrade
spec:
concurrency: 1
cordon: true
nodeSelector:
matchExpressions:
- key: node-role.kubernetes.io/master
operator: DoesNotExist
prepare:
args:
- prepare
- server-plan
image: rancher/k3s-upgrade #:v1.17.4-k3s1
serviceAccountName: system-upgrade
upgrade:
image: rancher/k3s-upgrade
#version: v1.20.4+k3s1
channel: https://update.k3s.io/v1-release/channels/stable</pre>

Simply apply this manifest and the controller will identify nodes to upgrade and will take care of them as instructed. It's as simple as that. If you don't want to automatically upgrade, comment out the 'channel' parameter in each Plan, and uncomment the 'version' parameter and set it to the version to upgrade to and apply. The controller will see the new plan and act accordingly. If there is nothing to do, the controller just sits there waiting for a new plan with some work.

== Cluster Access==
The kubeconfig file stored at /etc/rancher/k3s/k3s.yaml is used to configure access to the Kubernetes cluster. If you have installed upstream Kubernetes command line tools such as kubectl or helm you will need to configure them with the correct kubeconfig path. This can be done by either exporting the KUBECONFIG environment variable or by invoking the --kubeconfig command line flag. Refer to the examples below for details.

Leverage the KUBECONFIG environment variable:
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get pods --all-namespaces
helm ls --all-namespaces

Or specify the location of the kubeconfig file in the command:

kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get pods --all-namespaces
helm --kubeconfig /etc/rancher/k3s/k3s.yaml ls --all-namespaces

To access the kubernetes cluster from outside the cluster, copy <code>/etc/rancher/k3s/k3s.yaml</code> to your machine located outside the cluster as <code>~/.kube/config</code>. Then replace the value of the server field with the modified name of the K3s server cluster as described above. kubectl can now manage your K3s cluster.

==Cluster Maintenance==
As packages (particularly helm charts) are updated, old container images are left behind. Leaving container images in place is not generally a bad thing, since it lets pods migrate between nodes in the cluster without waiting for the image to be downloaded. Upgrading packages, however, leaves a lot of cruft behind that can eventually cause issues with disk space. Supposedly Kubernetes manages this garbage collection, but other related packages (such as ceph) complain about low disk space long before Kubernetes decides to do anything about it. Fortunately, cleaning out the unused images is simple:
/usr/local/bin/k3s crictl rmi -q
While this can be run at any time, running it too often will remove images that are still valid, just not in use at that time on that node -- resulting in longer migration times when nodes are drained or fail.

==Finishing The Job==
These instructions will create a simple or HA kubernetes cluster. The HA cluster can survive a single master failure, but two will take down the cluster.

At this point you jump back into the rest of the instructions for installing a [[Kubernetes Cluster Installation|Kubernetes Cluster]], remembering that the network layer and metrics server have already been installed.

K3s - Kubernetes Simplified

2025-03-29T19:24:34Z

DrEdWilliams: update to include nginx reverse proxy

Lightweight Kubernetes. Production ready, easy to install, half the memory, all in a binary less than 100 MB.

Great for:
* Edge
* IoT
* CI
* Development
* ARM
* Embedding k8s
* Situations where a PhD in k8s clusterology is infeasible

Extracted from https://github.com/k3s-io/k3s ... YMMV

What makes '''k3s''' special is that they have stripped out all the bloat-ware from upstream kubernetes that is not needed for small and bare-metal deployments. None of the so-called cloud features are present, and most of the CSI storage drivers are also gone.

As such, it is a simpler replacement for the '''kubeadm''' command, and it does a bit more for you:
* installs '''traefik''' as the default ingress controller -- it works ... especially for a test/dev cluster, but for production I'll use Contour/Envoy
* installs '''flannel''' as the CNI networking layer ... I had problems with flannel before, let's see if this works any better
* installs '''metrics-server''' to collect node/pod performance data
* provides a 'local path provisioner' ... not sure what this does ... yet ...
__NOTOC__
== Basic Cluster Installation ==
They weren't kidding ... this is all it takes to create a simple cluster:
curl -sfL https://get.k3s.io | sh -
A kubeconfig file is written to /etc/rancher/k3s/k3s.yaml and the service is automatically started or restarted. The install script will install K3s and additional utilities, such as kubectl, crictl, k3s-killall.sh, and k3s-uninstall.sh, for example:
sudo kubectl get nodes
K3S_TOKEN is created at /var/lib/rancher/k3s/server/node-token on your server. To install on worker nodes we should pass K3S_URL along with K3S_TOKEN or K3S_CLUSTER_SECRET environment variables, for example:
curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=XXX sh -

== HA Cluster Installation ==
One thing that attracted me to this distribution (as it is a distribution, not a fork of kubernetes) is that it provides a very simple way to set up a HA cluster control plane. There are command options in the deployment that provide for the configuration of a common etcd server to facilitate a HA control plane. Even with that ... it can get a bit cumbersome. Enter [https://github.com/alexellis/k3sup k3sup].

=== k3sup ===
k3sup is a wrapper for the k3s installer that enables:
* installation of a complete cluster from a single login session using SSH
* combining multiple k3s parameters into single options to k3sup for simplicity (e.g. '''--cluster''')

=== External LoadBalancer/Reverse Proxy ===
The one remaining item needed for a fully HA control plane is an external load balancer or reverse proxy to provide a single IP address for accessing the control API from inside or outside the cluster. Lacking something like an F5 BIG-IP load balancer or equivalent, the poor-man's approach is to use a separate reverse proxy. This can be done with HAProxy or NGINX. HAProxy can support an HA reverse proxy if you need that level of redundancy, or a simple NGINX configuration like this will work:

<pre>stream {
    upstream k3s-servers {
        server 10.0.0.10:6443;
        server 10.0.0.74:6443;
        server 10.0.0.64:6443;
    }

    server {
        listen 6443;
        proxy_pass k3s-servers;
    }
}
</pre>

This needs to be at the top level in the NGINX configuration -- either in the /etc/nginx/nginx.conf itself, or included at the top level.
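
As a sketch of the "included at the top level" option, the stream block could live in its own file and be pulled in from nginx.conf. The directory name below is an assumption (NGINX does not create it for you), and on Debian the stream module may have to be installed and loaded separately.

<pre>
# /etc/nginx/nginx.conf -- at the top level, NOT inside the http { } block
include /etc/nginx/stream.d/*.conf;    # hypothetical directory holding the stream { } block above
</pre>

Running 'nginx -t' before reloading will flag a stream block that ended up in the wrong place.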

If setting up a reverse proxy is not an option, you can set up a round robin in the DNS. Problem is, the pfSense firewall doesn't do round robins by default, so a little arm twisting is needed. Put this in the 'custom options' in the DNS Resolver settings page:
server:
rrset-roundrobin: yes
local-zone: "kube-test.williams.localnet" transparent
local-data: "kube-test.williams.localnet A 10.0.0.10"
local-data: "kube-test.williams.localnet A 10.0.0.64"
local-data: "kube-test.williams.localnet A 10.0.0.74"
Change the name and IP addresses as needed, and then load the changes. This approach somewhat complicates the deployment commands below, so pay attention ...

=== Get k3sup ===
Not surprisingly, there are multiple ways to do this. The author wants you to download it from a marketplace app (arkade):
curl -sLS https://dl.get-arkade.dev | sh
# Install the binary using the command given as output, such as:
sudo cp arkade-darwin /usr/local/bin/arkade
# Or on Linux:
sudo cp arkade /usr/local/bin/arkade
arkade get k3sup

... or you can just download it directly ...
curl -sLS https://get.k3sup.dev | sh
sudo install k3sup /usr/local/bin/

You just need to put it somewhere in your path (or use a full path reference).

=== Install the master nodes ===
Install the first master node with these commands. This will put it on the node that the command is issued from ... if that is not the desire, remove the '--local' parameter
mkdir -p $HOME/.kube
export EXTRA_ARGS="--disable servicelb --disable traefik --disable local-storage --write-kubeconfig-mode=644 --flannel-backend host-gw"
k3sup install --cluster --local --ip <server-ip-1> --k3s-channel stable --k3s-extra-args "$EXTRA_ARGS" \
--local-path $HOME/.kube/config --tls-san <rr-dns-name>

Explanations:
* --cluster sets this up for a HA control plane
* --local says to do it on this system, instead of using SSH to the node -- '''but you still need to provide the IP address''', otherwise it will put 127.0.0.1 as the API server address
* get the last stable version available from k3s (use 'latest' to get bleeding edge updates)
* we want to disable the simplified k3s loadbalancer ... it really is a crowbar solution and doesn't provide IP request capability
* disable the default ingress controller, as we really want something better
* disable the local storage driver (it's a pain if you're installing another storage driver)
* make the k3s.yaml file readable by the pods ... really should be by default
* (temporary workaround) force the flannel backend to use 'host-gw' mode since 'vxlan' mode seems to be broken with Debian 11
* the --local-path tells k3sup where to put the kubeconfig files ... this is the standard place, so we will put it there. The directory has to exist -- hence the 'mkdir' command ...
* by providing the name of the DNS round robin, it should be able to pass muster on TLS connections ...

Subsequent master nodes can be installed next:
k3sup join --server --host <new master server> --k3s-channel stable --server-host <original master server> --user ewilliam --k3s-extra-args "$EXTRA_ARGS"

Explanations:
* join as another server, not an agent/worker
* provide the hostname of the new master node to be joined
* specify the hostname of the first master created
* use the same EXTRA_ARGS provided to the first server

Repeat this for all the masters (minimum 3) providing the same server IP address of the first master. Note that if you need to add/remove more parameters to the k3s service, you need to do it on all server nodes -- just edit the k3s.service file and restart one server at a time.

Now that you have all the masters online, reset the kubeconfig file (~/.kube/config) to use the DNS RR name instead of the IP of the first master:
sed -i -e 's/<server-ip-1>/<rr-dns-name>/g' $HOME/.kube/config
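
One wrinkle with the plain sed substitution is that the dots in an IP address are regex wildcards. The function below is a minimal sketch of a slightly more careful version; the function name 'set_api_server' is made up for illustration, and it assumes GNU sed (as shipped with Debian) for the '-i' option.

<pre>
#!/bin/sh
# usage: set_api_server <kubeconfig> <old-ip> <new-name>
# Replaces every occurrence of the IP address in the kubeconfig with the DNS name.
set_api_server() {
    # escape the dots so that "10.0.0.10" only matches literal dots
    old_escaped=$(printf '%s' "$2" | sed 's/\./\\./g')
    sed -i -e "s/${old_escaped}/$3/g" "$1"
}
</pre>

It would be called as: set_api_server "$HOME/.kube/config" "<server-ip-1>" "<rr-dns-name>", with the real IP address and DNS round robin name filled in.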

=== Install Worker Nodes ===
At this point, you should have an operational cluster. k3s enables masters to be used for normal pods by default, so this is the minimum operational cluster. Next, install the agent/worker nodes:
k3sup join --ip <agent-node-x> --k3s-channel stable --server-host <rr-dns-name>

Explanations:
* Joining as an agent, so no '--server' this time
* provide each agent/worker IP in turn
* instead of the '--server-ip' parameter, we are using the '--server-host' parameter so we can use the DNS RR to ensure that the nodes always can access a master

== Automated Upgrades: System Upgrade Controller ==

Rancher has taken the simplicity of k3s to another level by providing a kubernetes controller that can update the cluster ''from within the cluster'' using the [https://rancher.com/docs/k3s/latest/en/upgrades/automated/ System Upgrade Controller].

Installing the controller is straightforward using a single manifest (also located in the git repo):
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/download/v0.7.6/system-upgrade-controller.yaml

The controller looks for upgrade plans as custom resources (of type 'Plan', for some odd reason) to define its behavior. Two 'Plan's are required ... one for server/master nodes and one for agent/worker nodes. These plans can be set up to either upgrade to a specific k3s version or to monitor the release channel and automatically upgrade when a new release is available. The manifest looks like this:

<pre># Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  #version: v1.20.4+k3s1
  channel: https://update.k3s.io/v1-release/channels/stable
---
# Agent plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: DoesNotExist
  prepare:
    args:
    - prepare
    - server-plan
    image: rancher/k3s-upgrade #:v1.17.4-k3s1
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  #version: v1.20.4+k3s1
  channel: https://update.k3s.io/v1-release/channels/stable</pre>

Simply apply this manifest and the controller will identify nodes to upgrade and will take care of them as instructed. It's as simple as that. If you don't want to automatically upgrade, comment out the 'channel' parameter in each Plan, and uncomment the 'version' parameter and set it to the version to upgrade to and apply. The controller will see the new plan and act accordingly. If there is nothing to do, the controller just sits there waiting for a new plan with some work.
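
For reference, this is roughly what the tail of each Plan looks like when pinned to a specific release instead of following the channel. The version string is just the one from the commented-out line in the manifest, used as a placeholder; substitute the release you actually want.

<pre>
spec:
  # ... rest of the Plan unchanged ...
  upgrade:
    image: rancher/k3s-upgrade
  version: v1.20.4+k3s1     # placeholder -- set to the k3s release to upgrade to
  #channel: https://update.k3s.io/v1-release/channels/stable
</pre>

The same change has to be made in both the server and the agent Plan so the two stay on the same version.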

== Cluster Access==
The kubeconfig file stored at /etc/rancher/k3s/k3s.yaml is used to configure access to the Kubernetes cluster. If you have installed upstream Kubernetes command line tools such as kubectl or helm you will need to configure them with the correct kubeconfig path. This can be done by either exporting the KUBECONFIG environment variable or by invoking the --kubeconfig command line flag. Refer to the examples below for details.

Leverage the KUBECONFIG environment variable:
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get pods --all-namespaces
helm ls --all-namespaces

Or specify the location of the kubeconfig file in the command:

kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get pods --all-namespaces
helm --kubeconfig /etc/rancher/k3s/k3s.yaml ls --all-namespaces

To access the kubernetes cluster from outside the cluster, copy <pre>/etc/rancher/k3s/k3s.yaml</pre> to your machine located outside the cluster as <pre>~/.kube/config</pre> Then replace the value of the server field with the modified name of the K3s server cluster as described above. kubectl can now manage your K3s cluster.

==Cluster Maintenance==
As packages (particularly helm charts) are updated, old container images are left behind. Keeping unused images around is not generally a bad thing, since it lets pods migrate between nodes in the cluster without waiting for the image to be downloaded. Upgrading packages, however, leaves a lot of cruft behind that can eventually cause disk space problems. Kubernetes is supposed to garbage-collect unused images, but other related packages (such as ceph) complain about low disk space long before Kubernetes decides to do anything about it. Fortunately, cleaning out the unused images is simple:
/usr/local/bin/k3s crictl rmi -q
While this can be run at any time, running it too often will remove images that are still valid but not in use on that node at that moment -- resulting in longer migration times when nodes are drained or fail.
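
Before pruning, it can be worth a look at what is actually taking up the space. A rough sketch, to be run on each node; the data directory path is the k3s default and is an assumption if the install was customized:

<pre>
# show the images currently stored on this node
sudo /usr/local/bin/k3s crictl images
# check how much space the k3s data directory is using
sudo du -sh /var/lib/rancher/k3s
# --prune is the long form of -q: remove images not used by any container
sudo /usr/local/bin/k3s crictl rmi --prune
</pre>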

==Finishing The Job==
These instructions will create a simple or HA kubernetes cluster. The HA cluster can survive a single master failure, but two will take down the cluster.

At this point you jump back into the rest of the instructions for installing a [[Kubernetes Cluster Installation|Kubernetes Cluster]], remembering that the network layer and metrics server have already been installed.

K3s - Kubernetes Simplified

2025-03-29T12:44:35Z

DrEdWilliams: /* External LoadBalancer (or DNS Round Robin) */

Lightweight Kubernetes. Production ready, easy to install, half the memory, all in a binary less than 100 MB.

Great for:
* Edge
* IoT
* CI
* Development
* ARM
* Embedding k8s
* Situations where a PhD in k8s clusterology is infeasible

Extracted from https://github.com/k3s-io/k3s ... YMMV

What makes '''k3s''' special is that they have stripped out all the bloat-ware from upstream kubernetes that is not needed for small and bare-metal deployments. None of the so-called cloud features are present, and most of the CSI storage drivers are also gone.

As such, it is a simpler replacement for the '''kubeadm''' command, and it does a bit more for you:
* installs '''traefik''' as the default ingress controller -- it works ... especially for a test/dev cluster, but for production I'll use Contour/Envoy
* installs '''flannel''' as the CNI networking layer ... I had problems with flannel before, let's see if this works any better
* installs '''metrics-server''' to collect node/pod performance data
* provides a 'local path provisioner' ... not sure what this does ... yet ...
__NOTOC__
== Basic Cluster Installation ==
They weren't kidding ... this is all it takes to create a simple cluster:
curl -sfL https://get.k3s.io | sh -
A kubeconfig file is written to /etc/rancher/k3s/k3s.yaml and the service is automatically started or restarted. The install script will install K3s and additional utilities, such as kubectl, crictl, k3s-killall.sh, and k3s-uninstall.sh, for example:
sudo kubectl get nodes
K3S_TOKEN is created at /var/lib/rancher/k3s/server/node-token on your server. To install on worker nodes we should pass K3S_URL along with K3S_TOKEN or K3S_CLUSTER_SECRET environment variables, for example:
curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=XXX sh -

== HA Cluster Installation ==
One thing that attracted me to this distribution (as it is a distribution, not a fork of kubernetes) is that it provides a very simple way to set up a HA cluster control plane. There are command options in the deployment that provide for the configuration of a common etcd server to facilitate a HA control plane. Even with that ... it can get a bit cumbersome. Enter [https://github.com/alexellis/k3sup k3sup].

=== k3sup ===
k3sup is a wrapper for the k3s installer that enables:
* installation of a complete cluster from a single login session using SSH
* combining multiple k3s parameters into single options to k3sup for simplicity (e.g. '''--cluster''')

=== External LoadBalancer (or DNS Round Robin) ===
The one remaining item needed for a fully HA control plane is an external load balancer to provide a single IP address for accessing the control API from inside or outside the cluster. Lacking something like an F5 BIG-IP load balancer or equivalent, the poor-man's approach is to use a round robin in the DNS. Problem is, the pfSense firewall doesn't do round robins by default, so a little arm twisting is needed. Put this in the 'custom options' in the DNS Resolver settings page:
server:
rrset-roundrobin: yes
local-zone: "kube-test.williams.localnet" transparent
local-data: "kube-test.williams.localnet A 10.0.0.10"
local-data: "kube-test.williams.localnet A 10.0.0.64"
local-data: "kube-test.williams.localnet A 10.0.0.74"
Change the name and IP addresses as needed, and then load the changes. This approach somewhat complicates the deployment commands below, so pay attention ...

=== Get k3sup ===
Not surprisingly, there are multiple ways to do this. The author wants you to download it from a marketplace app (arkade):
curl -sLS https://dl.get-arkade.dev | sh
# Install the binary using the command given as output, such as:
sudo cp arkade-darwin /usr/local/bin/arkade
# Or on Linux:
sudo cp arkade /usr/local/bin/arkade
arkade get k3sup

... or you can just download it directly ...
curl -sLS https://get.k3sup.dev | sh
sudo install k3sup /usr/local/bin/

You just need to put it somewhere in your path (or use a full path reference).

=== Install the master nodes ===
Install the first master node with these commands. This will put it on the node that the command is issued from ... if that is not the desire, remove the '--local' parameter
mkdir -p $HOME/.kube
export EXTRA_ARGS="--disable servicelb --disable traefik --disable local-storage --write-kubeconfig-mode=644 --flannel-backend host-gw"
k3sup install --cluster --local --ip <server-ip-1> --k3s-channel stable --k3s-extra-args "$EXTRA_ARGS" \
--local-path $HOME/.kube/config --tls-san <rr-dns-name>

Explanations:
* --cluster sets this up for a HA control plane
* --local says to do it on this system, instead of using SSH to the node -- '''but you still need to provide the IP address''', otherwise it will put 127.0.0.1 as the API server address
* get the last stable version available from k3s (use 'latest' to get bleeding edge updates)
* we want to disable the simplified k3s loadbalancer ... it really is a crowbar solution and doesn't provide IP request capability
* disable the default ingress controller, as we really want something better
* disable the local storage driver (it's a pain if you're installing another storage driver)
* make the k3s.yaml file readable by the pods ... really should be by default
* (temporary workaround) force the flannel backend to use 'host-gw' mode since 'vxlan' mode seems to be broken with Debian 11
* the --local-path tells k3sup where to put the kubeconfig files ... this is the standard place, so we will put it there. The directory has to exist -- hence the 'mkdir' command ...
* by providing the name of the DNS round robin, it should be able to pass muster on TLS connections ...

Subsequent master nodes can be installed next:
k3sup join --server --host <new master server> --k3s-channel stable --server-host <original master server> --user ewilliam --k3s-extra-args "$EXTRA_ARGS"

Explanations:
* join as another server, not an agent/worker
* provide the hostname of the new master node to be joined
* specify the hostname of the first master created
* use the same EXTRA_ARGS provided to the first server

Repeat this for all the masters (minimum 3) providing the same server IP address of the first master. Note that if you need to add/remove more parameters to the k3s service, you need to do it on all server nodes -- just edit the k3s.service file and restart one server at a time.

Now that you have all the masters online, reset the kubeconfig file (~/.kube/config) to use the DNS RR name instead of the IP of the first master:
sed -i -e 's/<server-ip-1>/<rr-dns-name>/g' $HOME/.kube/config

=== Install Worker Nodes ===
At this point, you should have an operational cluster. k3s enables masters to be used for normal pods by default, so this is the minimum operational cluster. Next, install the agent/worker nodes:
k3sup join --ip <agent-node-x> --k3s-channel stable --server-host <rr-dns-name>

Explanations:
* Joining as an agent, so no '--server' this time
* provide each agent/worker IP in turn
* instead of the '--server-ip' parameter, we are using the '--server-host' parameter so we can use the DNS RR to ensure that the nodes always can access a master

== Automated Upgrades: System Upgrade Controller ==

Rancher has taken the simplicity of k3s to another level by providing a kubernetes controller that can update the cluster ''from within the cluster'' using the [https://rancher.com/docs/k3s/latest/en/upgrades/automated/ System Upgrade Controller].

Installing the controller is straightforward using a single manifest (also located in the git repo):
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/download/v0.7.6/system-upgrade-controller.yaml

The controller looks for upgrade plans as custom resources (of type 'Plan', for some odd reason) to define its behavior. Two 'Plan's are required ... one for server/master nodes and one for agent/worker nodes. These plans can be set up to either upgrade to a specific k3s version or to monitor the release channel and automatically upgrade when a new release is available. The manifest looks like this:

<pre># Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  #version: v1.20.4+k3s1
  channel: https://update.k3s.io/v1-release/channels/stable
---
# Agent plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: DoesNotExist
  prepare:
    args:
    - prepare
    - server-plan
    image: rancher/k3s-upgrade #:v1.17.4-k3s1
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  #version: v1.20.4+k3s1
  channel: https://update.k3s.io/v1-release/channels/stable</pre>

Simply apply this manifest and the controller will identify nodes to upgrade and will take care of them as instructed. It's as simple as that. If you don't want to automatically upgrade, comment out the 'channel' parameter in each Plan, and uncomment the 'version' parameter and set it to the version to upgrade to and apply. The controller will see the new plan and act accordingly. If there is nothing to do, the controller just sits there waiting for a new plan with some work.

== Cluster Access==
The kubeconfig file stored at /etc/rancher/k3s/k3s.yaml is used to configure access to the Kubernetes cluster. If you have installed upstream Kubernetes command line tools such as kubectl or helm you will need to configure them with the correct kubeconfig path. This can be done by either exporting the KUBECONFIG environment variable or by invoking the --kubeconfig command line flag. Refer to the examples below for details.

Leverage the KUBECONFIG environment variable:
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get pods --all-namespaces
helm ls --all-namespaces

Or specify the location of the kubeconfig file in the command:

kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get pods --all-namespaces
helm --kubeconfig /etc/rancher/k3s/k3s.yaml ls --all-namespaces

To access the kubernetes cluster from outside the cluster, copy <pre>/etc/rancher/k3s/k3s.yaml</pre> to your machine located outside the cluster as <pre>~/.kube/config</pre> Then replace the value of the server field with the modified name of the K3s server cluster as described above. kubectl can now manage your K3s cluster.

==Cluster Maintenance==
As packages (particularly helm charts) are updated, old container images are left behind. Keeping unused images around is not generally a bad thing, since it lets pods migrate between nodes in the cluster without waiting for the image to be downloaded. Upgrading packages, however, leaves a lot of cruft behind that can eventually cause disk space problems. Kubernetes is supposed to garbage-collect unused images, but other related packages (such as ceph) complain about low disk space long before Kubernetes decides to do anything about it. Fortunately, cleaning out the unused images is simple:
/usr/local/bin/k3s crictl rmi -q
While this can be run at any time, running it too often will remove images that are still valid but not in use on that node at that moment -- resulting in longer migration times when nodes are drained or fail.

==Finishing The Job==
These instructions will create a simple or HA kubernetes cluster. The HA cluster can survive a single master failure, but two will take down the cluster.

At this point you jump back into the rest of the instructions for installing a [[Kubernetes Cluster Installation|Kubernetes Cluster]], remembering that the network layer and metrics server have already been installed.

External Name Resolution

2025-03-27T23:07:14Z

DrEdWilliams: added customization instructions

While the coredns configuration is supposed to refer all non-cluster names out to the system defined DNS resolver, it seems that it doesn't always happen. It may be related to the observation I've made that the nameserver IP in the /etc/resolv.conf file is changed to 127.0.0.53 - which doesn't always work. In any case, the coredns configuration allows you to add custom domain override definitions by creating a ConfigMap named 'coredns-custom' using a YAML file like this:

<pre>apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  # Log all incoming DNS queries.
  log.override: |
    log
  # Private DNS resolution. Handles FQDN resolutions for *.williams.localnet
  dev.server: |
    williams.localnet.:53 {
      forward . 10.0.0.254
    }
</pre>

The general process is described in the online kubernetes documentation:

https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/

Note that the '*.override' tag is different from the online documentation. The correct label was identified by looking at the standard 'coredns' ConfigMap to see where and how these custom sections are added.

You can include multiple *.override sections and multiple *.server sections if needed to include all the overrides desired.
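
Applying and checking the override could look something like this. The manifest file name, the test pod name, and the busybox image are arbitrary choices for illustration; the namespace and deployment name assume the stock k3s coredns install.

<pre>
# coredns-custom.yaml is a hypothetical file name for the manifest above
kubectl apply -f coredns-custom.yaml
# restart coredns so the new sections are loaded right away
kubectl -n kube-system rollout restart deployment coredns
# try a lookup from inside the cluster
kubectl run dnstest -it --rm --restart=Never --image=busybox:1.36 -- nslookup wiki.williams.localnet
</pre>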

External Name Resolution

2025-03-27T23:01:50Z

DrEdWilliams: instructions on how to provide DNS overrides

While the coredns configuration is supposed to refer all non-cluster names out to the system defined DNS resolver, it seems that it doesn't always happen. It may be related to the observation I've made that the nameserver IP in the /etc/resolv.conf file is changed to 127.0.0.53 - which doesn't always work. In any case, the coredns configuration allows you to add custom domain override definitions by creating a ConfigMap named 'coredns-custom' using a YAML file like this:

<pre>apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  # Log all incoming DNS queries.
  log.override: |
    log
  # Private DNS resolution example. Handles FQDN resolutions for *.williams.localnet
  # Replace 10.0.0.254 with the IP address of your private DNS server.
  dev.server: |
    williams.localnet.:53 {
      forward . 10.0.0.254
    }
</pre>

The general process is described in the online kubernetes documentation:

https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/

Talk:Rook Storage for Kubernetes

2025-03-26T01:31:25Z

DrEdWilliams:

clearing/reusing drives in rook/ceph:

lsblk /dev/xxxx

if you see a ceph ID, you need to remove it:

dmsetup remove ceph--(whatever you see in lsblk)

you also need to clean out the partition table:

wipefs -a /dev/xxxx

if that device had been in ceph before, you need to remove the old OSD (that should be marked as down/out). open the rook-ceph-tools container:

ceph osd purge <number> --yes-i-really-mean-it

You need to make sure that the osd is truly dead ...

Removing error messages that make no sense:

- warning about low host disk space

ceph config set mon mon_data_avail_warn 10

- warning about pool objects per pg

ceph config set mgr mon_pg_warn_max_object_skew 0.0
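
for reference, a sketch of getting into the tools container and checking an OSD before purging it. this assumes the usual rook-ceph namespace and the standard rook-ceph-tools deployment; adjust if the install differs:

<pre>
# open a shell in the toolbox
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

# inside the toolbox:
ceph osd tree                          # the old OSD should show as down
ceph osd safe-to-destroy osd.<number>  # ceph's own check that no data still depends on it
ceph osd purge <number> --yes-i-really-mean-it
</pre>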

Production Cluster Configuration

2024-09-14T23:09:36Z

DrEdWilliams:

These packages form the basic functionality of the production cluster.

Scripts & config files are checked into gitlab under the Kubernetes group project listed.

{| class="wikitable"
|-
! activity !! gitlab !! script/procedures/config !! IP !! hostname(s)
|-
| [[Ceph Storage Cluster]] || k8s-admin || || ||
|-
| [[Rook Storage for Kubernetes|Rook Storage]] || k8s-admin || || || (StorageClass) rook-ceph-hdd rook-ceph-nvme
|-
| gitlab registry secrets || || gitlab-registry-kube-system.yaml gitlab-registry-secret.yaml || ||
|-
| wordpress (dredwilliams.com) || k8s/dredwilliams || || || dredwilliams.williams-net.org
|-
| mediawiki || mediawiki || || 10.0.0.116 || wiki.williams.localnet wiki.williams-net.org
|-
| [[MariaDB]] || mariadb || || 10.0.0.117 || database.williams.localnet
|}

=== Storage ===
The production cluster depends on the '''/shared''' filesystem for its persistent storage as provided by the production Ceph cluster. The Ceph cluster is configured as shown here:
{| class="wikitable"
|-
! system !! function !! storage !! size
|-
| caspian || master || NVMe HDD || 1TB 1TB
|-
| uvilas || node || NVMe HDD HDD || 1TB 1TB 1TB
|-
| belisar || node || NVMe HDD || 1TB 250GB
|}

The work filesystem can be mounted via NFS:

10.0.0.75:/work /work nfs4 soft 0 0

=== Backups ===

In addition to the normal backups configured in the basic OS installation steps, the databases in the production cluster must be backed up daily using the 'mysqldump' command:

TBD

These commands should be inserted into the /etc/cron.daily/backup file on one of the cluster nodes (telmar is a good choice). The first does a complete database dump of the MediaWiki database server, the second dumps just the mediawiki database itself, and the third dumps the general purpose database server. Additional dump commands should be inserted for additional significant databases, as parsing individual databases out of a system dump can be tedious.

Mail Server Installation

2024-09-14T23:07:56Z

DrEdWilliams: initial version

The previous deployment of a mail server as a home grown kubernetes deployment finally broke when the underlying centos7 container was discontinued. This solution surfaced recently, and after some challenges in adapting to this environment seems to be working properly. Like the previous solution, it uses docker containers; but the author only provided a docker compose deployment method, so it requires a VM to keep it contained.

The solution is called "MailCow". The project is hosted on GitHub, and its installation documentation is here:

https://docs.mailcow.email/getstarted/install/

The package contains multiple docker containers that implement all the basic services required to support the mail server -- including a webmail application (that I couldn't get to work).

== Pre-requisites ==
Per the documentation, MailCow is best installed in its own system, so a small VM is the best hosting solution. They recommended ubuntu, but my [[Basic Debian Installation]] worked just fine. Preparing for this deployment is straightforward:

# Create a VM with 2 cores, 4 GB RAM, and 100GB in a single storage volume
# Install Debian in the VM as described in [[Basic Debian Installation]]
# Create a static DHCP IP assignment (or use a static network configuration) for the mail server to provide a consistent IP address
# Create a DNS entry for the server as 'mail.williams.localnet' with CNAMEs for 'imap' and 'smtp'

== Installation ==
The instructions for MailCow are very comprehensive and easy to follow:

# Follow the instructions in the documentation to install docker
# Clone the MailCow GitHub repo
# Generate a config file -- edit to specify the hostname and to disable the "Let's Encrypt" certificate generation (it will generate self-signed certs as needed)
# Edit config files as needed to accept mail from the local systems (10.0.0.0/24) as specified in https://docs.mailcow.email/manual-guides/Postfix/u_e-postfix-unauthenticated-relaying/?h=relay
# Start the server as specified in the docs (NOTE: it takes several minutes to get things initialized, so be patient).
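
The config edits in step 3 can be scripted. The sketch below assumes the generated file is <code>mailcow.conf</code> and that the relevant keys are <code>MAILCOW_HOSTNAME</code> and <code>SKIP_LETS_ENCRYPT</code>; check the generated file before relying on those names. The helper <code>set_conf_value</code> is a hypothetical name for illustration and is not part of MailCow.

```shell
#!/bin/sh
# set_conf_value FILE KEY VALUE
# Replace an existing "KEY=..." line in FILE, or append one if KEY is absent.
# Hypothetical helper (GNU sed); VALUE must not contain '|' or '&'.
set_conf_value() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example (run from the cloned repo after generating the config):
#   set_conf_value mailcow.conf MAILCOW_HOSTNAME mail.williams.localnet
#   set_conf_value mailcow.conf SKIP_LETS_ENCRYPT y
```

Editing the file by hand works just as well; the function is only useful if the VM is ever rebuilt from a script.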

== Follow-up ==
Reconnecting Thunderbird to the new server is a manual process, but not massively eventful. This more robust server did require a modification to the standard SSMTP mail client configuration (the source domain needed to be specified), and it has become obvious that I need to actually switch back to a more robust postfix mail deployment on each server in the environment. It is definitely overkill, but the SSMTP client is not very robust or fault tolerant.

Basic Debian Installation

2024-02-10T16:49:18Z

DrEdWilliams: added HP watchdog timer fix

__NOTOC__

'''NOTE''': The instructions on this page have been tested and appear to work properly on Debian 10 ('''Buster'''), Debian 11 ('''Bullseye'''), and Debian 12 ('''Bookworm''').

Debian standard install from distributed ISO images:
* Debian 10/11 - ''firmware-amd64-DVD'' (which includes the proprietary firmware that is supposedly not included in the other install disks)
* Debian 12 - select the full or network standard media - they have appeared to include the extra firmware in the primary install images

Basic instructions for all versions:
* Do a normal 'Install' option from the boot menu
** if that hangs on the network device detection, do the 'Expert Install' option from the 'Advanced' menu and select the default options
* select basic options, one partition formatting, and only the minimal software selection, but make sure the 'SSHD server' option is selected
* For a graphic workstation, select the desired desktop from the list
* select 'yes' for network mirror, using the defaults (USA/deb.debian.org)
* let it finish and reboot

== Basic system prep ==
Most of the initial configuration is now contained in a script that can be executed directly from the config server as root on the target system:

curl -s http://config/config/debian-basic-config.sh | bash -x | tee config.out
wget -qO - http://config/config/debian-basic-config.sh | bash -x | tee config.out
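
Note that <code>bash -x</code> writes its trace to stderr, so the pipelines above send only the script's stdout to config.out. If the trace is wanted in the log as well, stderr can be merged into the pipe first. A minimal sketch, using a one-line stand-in script instead of the real one:

```shell
# Stand-in for the downloaded script: a single echo command.
# 2>&1 merges the 'bash -x' trace (stderr) into the pipe so tee records it too.
log=$(mktemp)
printf 'echo hello\n' | bash -x 2>&1 | tee "$log"

# With the real script the same idea would read:
#   curl -s http://config/config/debian-basic-config.sh | bash -x 2>&1 | tee config.out
```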

The contents of this script are included here for reference, though updates to the script may occur without updates to this page:

<pre>#!/bin/sh
#
# script to do the basic install of a debian headless server
#
# Assumptions:
# - this is run as root immediately after the install has completed
# - the hostname has been set as desired before this script is run
# - an administrator account 'ewilliam' was created during installation
#
# invoke using one of these commands:
#
# wget -qO - http://config/config/debian-basic-config.sh | bash
# curl -s http://config/config/debian-basic-config.sh | bash
#

CONFIG=http://config/config

# first -- install all the basic necessities (some may already be there)
apt-get update
apt-get install -y net-tools zsh sudo rsync mlocate wget nfs-common psmisc
apt-get install -y aptitude sshfs git curl smartmontools vim

# the ssmtp package is not officially supported under buster
# download the package from stretch and install
#SSMTP_DEB=ssmtp_2.64-9_amd64.deb
SSMTP_DEB=ssmtp_2.64-8+b2_amd64.deb
wget http://http.us.debian.org/debian/pool/main/s/ssmtp/${SSMTP_DEB}
apt-get install -y ./${SSMTP_DEB}

# now get all the updates
apt-get -y upgrade

# make sudo passwordless for group sudo
echo "%sudo ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers.d/91-wheel-NOPASSWD
chmod 440 /etc/sudoers.d/91-wheel-NOPASSWD
usermod -aG sudo ewilliam

# now that zsh is installed, make it the default
chsh -s /bin/zsh
chsh -s /bin/zsh ewilliam

# copy the standard zsh config files
curl -s $CONFIG/common/.zshrc -o /root/.zshrc
curl -s $CONFIG/common/.zshrc -o /home/ewilliam/.zshrc
chown ewilliam:ewilliam /home/ewilliam/.zshrc

# load ssh known_hosts from the config server
curl -s http://config/config/ssh/known_hosts -o /etc/ssh/ssh_known_hosts

# get backup scripts
curl -s $CONFIG/common/backup -o /etc/cron.daily/backup
curl -s $CONFIG/common/rsync_backup.sh -o /usr/local/bin/rsync_backup.sh
chmod +x /usr/local/bin/rsync_backup.sh /etc/cron.daily/backup

# configure log server
curl -s $CONFIG/common/99-remotelog.conf -o /etc/rsyslog.d/99-remotelog.conf
systemctl restart rsyslog

# enable mail to the central email server
# unlike the CentOS version of the ssmtp package, this doesn't recognize the /etc/aliases file
# to make it work, the hostname needs to be set to 'williams.localnet' and the mail server must be at 'mail'
sed -i 's/^hostname=.*$/hostname=williams.localnet/' /etc/ssmtp/ssmtp.conf

# get rid of the obnoxious political statement in the /etc/motd
echo > /etc/motd

# get the standard /etc/hosts file
curl -s $CONFIG/hosts -o /etc/hosts

# install more supporting stuff
apt-get install -y gpg apt-transport-https

# install webmin
wget -qO - https://download.webmin.com/jcameron-key.asc | sudo apt-key add -
echo "deb https://download.webmin.com/download/repository sarge contrib" | sudo tee -a /etc/apt/sources.list.d/webmin.list
apt-get update
apt-get install -y webmin

# install the host/known_hosts synchronization
curl -s $CONFIG/host_check.sh > /etc/cron.hourly/host_check
chmod +x /etc/cron.hourly/host_check

# provide a daily list of packages that need updating
curl -s $CONFIG/debian/0-apt-upgradeable -o /etc/cron.daily/0-apt-upgradable
chmod +x /etc/cron.daily/0-apt-upgradable

# install other status checks
curl -s $CONFIG/common/ssd-endurance > /etc/cron.weekly/ssd-endurance
chmod +x /etc/cron.weekly/ssd-endurance

# now reboot
reboot</pre>
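
The script appends to the sudoers drop-in with <code>>></code>, so running it a second time would duplicate the line. One way to make such steps repeatable is to append only when the exact line is missing. This is a sketch; the helper name <code>append_once</code> is hypothetical and is not part of the script on the config server.

```shell
# append_once FILE LINE
# Append LINE to FILE only if no identical line is already present.
append_once() {
    file=$1
    line=$2
    # -x: whole-line match, -F: fixed string, -q: quiet, -s: silent if FILE is missing
    grep -qxFs -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example:
#   append_once /etc/sudoers.d/91-wheel-NOPASSWD "%sudo ALL=(ALL) NOPASSWD:ALL"
```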

== Proliant System Utilities ==
For Proliant servers, we need to install the [[HPE ProLiant System Configuration/Support|Management Component Pack (MCP)]] to manage the storage controller. A script to do this is available on the config server:
curl -s http://config/config/debian/install-mcp.sh | bash -x

[[Proliant SSA Command Examples | Here]] are some useful command examples.

== Proliant iLO Watchdog Timer ==
There appears to be an issue during boot in which the watchdog timer ends up in a deadlock waiting for a response from iLO. It has now happened on most of the HP servers running Debian. Thanks to a rather in-depth analysis and testing [https://www.claudiokuenzler.com/blog/1125/debian-11-bullseye-boot-freeze-kernel-panic-hp-proliant-dl380 here], the problem seems to be the hpwdt module. According to the article, it has already been blacklisted in most other distros. The script below creates the blacklist entry and rebuilds the initramfs images:

#!/bin/bash
#
# blacklist the HP iLO watchdog timer module (hpwdt) to avoid boot freezes under debian
#
echo "blacklist hpwdt" >> /etc/modprobe.d/blacklist-hp.conf
update-initramfs -k all -u
update-grub

This script is available on the config server:
curl -s http://config/config/debian/fix-hpwdt.sh | bash -x

== Non-Free Drivers ==
Debian's militant policy towards non-free drivers can be rather annoying. The Radeon built-in graphics controller on the motherboard for lamppost (ASUS M4A785-M) apparently needs some 'non-free' firmware, which is not detected during the install process, even though I'm using the non-free firmware installation disk image. Because the firmware is not loaded during the install, it isn't included in the initramfs, and therefore the device is loaded in an inoperative state. There is a firmware package available to install:
apt install firmware-amd-graphics
... but even after rebooting it doesn't load, because by the time the root filesystem is mounted the devices are already initialized. Until I find a better way, the device needs to be unloaded and reloaded after boot to enable the graphics:
rmmod radeon
modprobe radeon

Similarly, some RealTek drivers need to be added:
apt install firmware-realtek

Some systems also require tigon/tg3 drivers. This one installs more than a few other packages, too ... including the amd-graphics package mentioned above.
apt install -y firmware-linux-nonfree

== Next Steps ==
The rest of the installation process is not OS-dependent and is described [[Common Post-Install|here]].

Basic Debian Installation

2024-02-10T16:41:01Z

DrEdWilliams: remove webmin, fix known_hosts load, updated for Debian 12

__NOTOC__

'''NOTE''': The instructions on this page have been tested and appear to work properly on Debian 10 ('''Buster'''), Debian 11 ('''Bullseye'''), and Debian 12 ('''Bookworm''').

Debian standard install from distributed ISO images:
* Debian 10/11 - ''firmware-amd64-DVD'' (which includes the proprietary firmware that is supposedly not included in the other install disks)
* Debian 12 - select the full or network standard media - they have appeared to include the extra firmware in the primary install images

Basic instructions for all versions:
* Do a normal 'Install' option from the boot menu
** if that hangs on the network device detection, do the 'Expert Install' option from the 'Advanced' menu and select the default options
* select basic options, one partition formatting, and only the minimal software selection, but make sure the 'SSHD server' option is selected
* For a graphic workstation, select the desired desktop from the list
* select 'yes' for network mirror, using the defaults (USA/deb.debian.org)
* let it finish and reboot

== Basic system prep ==
Most of the initial configuration is now contained in a script that can be executed directly from the config server as root on the target system:

curl -s http://config/config/debian-basic-config.sh | bash -x | tee config.out
wget -qO - http://config/config/debian-basic-config.sh | bash -x | tee config.out

The contents of this script are included here for reference, though updates to the script may occur without updates to this page:

<pre>#!/bin/sh
#
# script to do the basic install of a debian headless server
#
# Assumptions:
# - this is run as root immediately after the install has completed
# - the hostname has been set as desired before this script is run
# - an administrator account 'ewilliam' was created during installation
#
# invoke using one of these commands:
#
# wget -qO - http://config/config/debian-basic-config.sh | bash
# curl -s http://config/config/debian-basic-config.sh | bash
#

CONFIG=http://config/config

# first -- install all the basic necessities (some may already be there)
apt-get update
apt-get install -y net-tools zsh sudo rsync mlocate wget nfs-common psmisc
apt-get install -y aptitude sshfs git curl smartmontools vim

# the ssmtp package is not officially supported under buster
# download the package from stretch and install
#SSMTP_DEB=ssmtp_2.64-9_amd64.deb
SSMTP_DEB=ssmtp_2.64-8+b2_amd64.deb
wget http://http.us.debian.org/debian/pool/main/s/ssmtp/${SSMTP_DEB}
apt-get install -y ./${SSMTP_DEB}

# now get all the updates
apt-get -y upgrade

# make sudo passwordless for group sudo
echo "%sudo ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers.d/91-wheel-NOPASSWD
chmod 440 /etc/sudoers.d/91-wheel-NOPASSWD
usermod -aG sudo ewilliam

# now that zsh is installed, make it the default
chsh -s /bin/zsh
chsh -s /bin/zsh ewilliam

# copy the standard zsh config files
curl -s $CONFIG/common/.zshrc -o /root/.zshrc
curl -s $CONFIG/common/.zshrc -o /home/ewilliam/.zshrc
chown ewilliam:ewilliam /home/ewilliam/.zshrc

# load ssh known_hosts from the config server
curl -s http://config/config/ssh/known_hosts -o /etc/ssh/ssh_known_hosts

# get backup scripts
curl -s $CONFIG/common/backup -o /etc/cron.daily/backup
curl -s $CONFIG/common/rsync_backup.sh -o /usr/local/bin/rsync_backup.sh
chmod +x /usr/local/bin/rsync_backup.sh /etc/cron.daily/backup

# configure log server
curl -s $CONFIG/common/99-remotelog.conf -o /etc/rsyslog.d/99-remotelog.conf
systemctl restart rsyslog

# enable mail to the central email server
# unlike the CentOS version of the ssmtp package, this doesn't recognize the /etc/aliases file
# to make it work, the hostname needs to be set to 'williams.localnet' and the mail server must be at 'mail'
sed -i 's/^hostname=.*$/hostname=williams.localnet/' /etc/ssmtp/ssmtp.conf

# get rid of the obnoxious political statement in the /etc/motd
echo > /etc/motd

# get the standard /etc/hosts file
curl -s $CONFIG/hosts -o /etc/hosts

# install more supporting stuff
apt-get install -y gpg apt-transport-https

# install webmin
wget -qO - https://download.webmin.com/jcameron-key.asc | sudo apt-key add -
echo "deb https://download.webmin.com/download/repository sarge contrib" | sudo tee -a /etc/apt/sources.list.d/webmin.list
apt-get update
apt-get install -y webmin

# install the host/known_hosts synchronization
curl -s $CONFIG/host_check.sh > /etc/cron.hourly/host_check
chmod +x /etc/cron.hourly/host_check

# provide a daily list of packages that need updating
curl -s $CONFIG/debian/0-apt-upgradeable -o /etc/cron.daily/0-apt-upgradable
chmod +x /etc/cron.daily/0-apt-upgradable

# install other status checks
curl -s $CONFIG/common/ssd-endurance > /etc/cron.weekly/ssd-endurance
chmod +x /etc/cron.weekly/ssd-endurance

# now reboot
reboot</pre>

== Proliant System Utilities ==
For Proliant servers, we need to install the [[HPE ProLiant System Configuration/Support|Management Component Pack (MCP)]] to manage the storage controller. A script to do this is available on the config server:
curl -s http://config/config/debian/install-mcp.sh | bash -x

[[Proliant SSA Command Examples | Here]] are some useful command examples.

== Non-Free Drivers ==
Debian's militant policy towards non-free drivers can be rather annoying. The Radeon built-in graphics controller on the motherboard for lamppost (ASUS M4A785-M) apparently needs some 'non-free' firmware, which is not detected during the install process, even though I'm using the non-free firmware installation disk image. Because the firmware is not loaded during the install, it isn't included in the initramfs, and therefore the device is loaded in an inoperative state. There is a firmware package available to install:
apt install firmware-amd-graphics
... but even after rebooting it doesn't load, because by the time the root filesystem is mounted the devices are already initialized. Until I find a better way, the device needs to be unloaded and reloaded after boot to enable the graphics:
rmmod radeon
modprobe radeon

Similarly, some RealTek drivers need to be added:
apt install firmware-realtek

Some systems also require tigon/tg3 drivers. This one installs more than a few other packages, too ... including the amd-graphics package mentioned above.
apt install -y firmware-linux-nonfree

== Next Steps ==
The rest of the installation process is not OS-dependent and is described [[Common Post-Install|here]].

IP Controller

2024-02-10T01:03:46Z

DrEdWilliams: updates for newer version of metallb

In order for kubernetes services to be accessible outside the cluster, they need to have externally routable IP addresses. While it would seem that this is a basic need that should be provided by kubernetes itself, its implementation is actually very dependent on the underlying platform -- and therefore left for the provider to implement.

We are using the bare metal load balancer called [https://metallb.universe.tf/ MetalLB]. It is a full featured loadbalancer and IP controller that has multiple modes of operation. As deployed here, it is using simple Layer 2 operation, advertising IP addresses using the existing network interfaces.

The basic load balancer is deployed with a simple manifest:

<nowiki>kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.3/config/manifests/metallb-native.yaml
</nowiki>

At this point, the controller itself is active, but it has nothing to do -- it doesn't have any IP addresses to hand out. The controller uses a Custom Resource Definition to define address pools, allocation methods, and routing protocols. There are multiple definitions in the [https://gitlab.williams-net.org/k8s/k8s-admin k8s-admin repository on GitLab]; use the one appropriate to the system:

{| class="wikitable"
|-
! filename !! IP range
|-
| ipaddresspool-prod.yaml || 10.0.0.110 - 10.0.0.127
|}

Apply the correct pool definition to get things working:
kubectl apply -f <filename>

This loadbalancer implementation supports the requesting of IP addresses through the '''loadBalancerIP''' parameter in the service -- if the address is available, it will allocate it. The '''externalTrafficPolicy''' parameter is also respected.
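
The pool files themselves live in the repository and are not reproduced here. For orientation, a Layer 2 pool for the production range in the table could look like the following sketch. The resource names <code>prod-pool</code> and <code>prod-l2</code> and the scratch output path are assumptions for illustration; <code>metallb-system</code> is the namespace created by the upstream manifest.

```shell
# Write an illustrative pool definition to a scratch file (not the repository's file).
pool_file=$(mktemp)
cat > "$pool_file" <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: prod-pool            # hypothetical name
  namespace: metallb-system
spec:
  addresses:
  - 10.0.0.110-10.0.0.127    # production range from the table
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: prod-l2              # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
  - prod-pool
EOF
echo "wrote $pool_file"

# To load it into the cluster:
#   kubectl apply -f "$pool_file"
```

The L2Advertisement is what tells MetalLB to announce addresses from the pool on the local network; without one, addresses are assigned but not advertised.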

See the home page for the MetalLB package for complete installation and configuration options.

----

<references/>

K3s - Kubernetes Simplified

2023-07-09T18:36:37Z

DrEdWilliams: added description on how to access the cluster from outside

Lightweight Kubernetes. Production ready, easy to install, half the memory, all in a binary less than 100 MB.

Great for:
* Edge
* IoT
* CI
* Development
* ARM
* Embedding k8s
* Situations where a PhD in k8s clusterology is infeasible

Extracted from https://github.com/k3s-io/k3s ... YMMV

What makes '''k3s''' special is that they have stripped out all the bloat-ware from upstream kubernetes that is not needed for small and bare-metal deployments. None of the so-called cloud features are present, and most of the CSI storage drivers are also gone.

As such, it is a simpler replacement for the '''kubeadm''' command, and it does a bit more for you:
* installs '''traefik''' as the default ingress controller -- it works ... especially for a test/dev cluster, but for production I'll use Contour/Envoy
* installs '''flannel''' as the CNI networking layer ... I had problems with flannel before, let's see if this works any better
* installs '''metrics-server''' to collect node/pod performance data
* provides a 'local path provisioner' ... not sure what this does ... yet ...
__NOTOC__
== Basic Cluster Installation ==
They weren't kidding ... this is all it takes to create a simple cluster:
curl -sfL https://get.k3s.io | sh -
A kubeconfig file is written to /etc/rancher/k3s/k3s.yaml and the service is automatically started or restarted. The install script will install K3s and additional utilities, such as kubectl, crictl, k3s-killall.sh, and k3s-uninstall.sh, for example:
sudo kubectl get nodes
K3S_TOKEN is created at /var/lib/rancher/k3s/server/node-token on your server. To install on worker nodes we should pass K3S_URL along with K3S_TOKEN or K3S_CLUSTER_SECRET environment variables, for example:
curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=XXX sh -

== HA Cluster Installation ==
One thing that attracted me to this distribution (as it is a distribution, not a fork of kubernetes) is that it provides a very simple way to set up an HA cluster control plane. There are command options in the deployment that provide for the configuration of a common etcd server to facilitate an HA control plane. Even with that ... it can get a bit cumbersome. Enter [https://github.com/alexellis/k3sup k3sup].

=== k3sup ===
k3sup is a wrapper for the k3s installer that enables:
* installation of a complete cluster from a single login session using SSH
* combining multiple k3s parameters into single options to k3sup for simplicity (e.g. '''--cluster''')

=== External LoadBalancer (or DNS Round Robin) ===
The one remaining item needed for a fully HA control plane is an external load balancer to provide a single IP address for accessing the control API from inside or outside the cluster. Lacking something like a BigIron F5 load balancer or something equivalent, the poor-man's approach is to use a round robin in the DNS. Problem is, the pfSense firewall doesn't do round robins by default, so a little arm twisting is needed. Put this in the 'custom options' in the DNS Resolver settings page:
server:
rrset-roundrobin: yes
local-zone: "kube-test.williams.localnet" transparent
local-data: "kube-test.williams.localnet A 10.0.0.70"
local-data: "kube-test.williams.localnet A 10.0.0.71"
local-data: "kube-test.williams.localnet A 10.0.0.72"
Change the name and IP addresses as needed, and then load the changes. This approach somewhat complicates the deployment commands below, so pay attention ...
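
Typing the records by hand gets tedious as masters are added or renumbered. The same block of options can be generated from a name and a list of addresses. The function name <code>rr_records</code> is hypothetical, and the output is meant to be pasted into the same 'custom options' box.

```shell
# rr_records NAME IP...
# Print unbound custom options for a DNS round robin of NAME over the given IPs.
rr_records() {
    name=$1
    shift
    printf 'server:\n'
    printf 'rrset-roundrobin: yes\n'
    printf 'local-zone: "%s" transparent\n' "$name"
    for ip in "$@"; do
        printf 'local-data: "%s A %s"\n' "$name" "$ip"
    done
}

rr_records kube-test.williams.localnet 10.0.0.70 10.0.0.71 10.0.0.72
```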

=== Get k3sup ===
Not surprisingly, there are multiple ways to do this. The author wants you to download it from a marketplace app (arkade):
curl -sLS https://dl.get-arkade.dev | sh
# Install the binary using the command given as output, such as:
sudo cp arkade-darwin /usr/local/bin/arkade
# Or on Linux:
sudo cp arkade /usr/local/bin/arkade
arkade get k3sup

... or you can just download it directly ...
curl -sLS https://get.k3sup.dev | sh
sudo install k3sup /usr/local/bin/

You just need to put it somewhere in your path (or use a full path reference).

=== Install the master nodes ===
Install the first master node with these commands. This will put it on the node that the command is issued from; if that is not what you want, remove the '--local' parameter:
mkdir $HOME/.kube
EXTRA_ARGS="--disable servicelb --disable traefik --disable local-storage --write-kubeconfig-mode=644 --flannel-backend host-gw"
k3sup install --cluster --local --ip <server-ip-1> --k3s-channel stable --k3s-extra-args "$EXTRA_ARGS" \
--local-path $HOME/.kube/config --tls-san <rr-dns-name>

Explanations:
* --cluster sets this up for an HA control plane
* --local says to do it on this system, instead of using SSH to the node -- '''but you still need to provide the IP address''', otherwise it will put 127.0.0.1 as the API server address
* get the last stable version available from k3s (use 'latest' to get bleeding edge updates)
* we want to disable the simplified k3s loadbalancer ... it really is a crowbar solution and doesn't provide IP request capability
* disable the default ingress controller, as we really want something better
* disable the local storage driver (it's a pain if you're installing another storage driver)
* make the k3s.yaml file readable by the pods ... really should be by default
* (temporary workaround) force the flannel backend to use 'host-gw' mode since 'vxlan' mode seems to be broken with Debian 11
* the --local-path tells k3sup where to put the kubeconfig files ... this is the standard place, so we will put it there. The directory has to exist -- hence the 'mkdir' command ...
* by providing the name of the DNS round robin, it should be able to pass muster on TLS connections ...

Subsequent master nodes can be installed next:
k3sup join --server --ip <server-ip-2> --k3s-channel stable --server-ip <server-ip-1> --k3s-extra-args "$EXTRA_ARGS"

Explanations:
* join as another server, not an agent/worker
* provide the IP address of the node to be joined
* specify the IP of the first master created
* use the same EXTRA_ARGS provided to the first server

Repeat this for all the masters (minimum 3) providing the same server IP address of the first master. Note that if you need to add/remove more parameters to the k3s service, you need to do it on all server nodes -- just edit the k3s.service file and restart one server at a time.

Now that you have all the masters online, reset the kubeconfig file (~/.kube/config) to use the DNS RR name instead of the IP of the first master:
sed -i -e 's/<server-ip-1>/<rr-dns-name>/g' $HOME/.kube/config

=== Install Worker Nodes ===
At this point, you should have an operational cluster. k3s enables masters to be used for normal pods by default, so this is the minimum operational cluster. Next, install the agent/worker nodes:
k3sup join --ip <agent-node-x> --k3s-channel stable --server-host <rr-dns-name>

Explanations:
* Joining as an agent, so no '--server' this time
* provide each agent/worker IP in turn
* instead of the '--server-ip' parameter, we are using the '--server-host' parameter so we can use the DNS RR to ensure that the nodes always can access a master

== Automated Upgrades: System Upgrade Controller ==

Rancher has taken the simplicity of k3s to another level by providing a kubernetes controller that can update the cluster ''from within the cluster'' using the [https://rancher.com/docs/k3s/latest/en/upgrades/automated/ System Upgrade Controller].

Installing the controller is straightforward using a single manifest (also located in the git repo):
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/download/v0.7.6/system-upgrade-controller.yaml

The controller looks for upgrade plans as custom resources (of type 'Plan', for some odd reason) to define its behavior. Two 'Plan's are required ... one for server/master nodes and one for agent/worker nodes. These plans can be set up to either upgrade to a specific k3s version or to monitor the release channel and automatically upgrade when a new release is available. The manifest looks like this:

<pre># Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  #version: v1.20.4+k3s1
  channel: https://update.k3s.io/v1-release/channels/stable
---
# Agent plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: DoesNotExist
  prepare:
    args:
    - prepare
    - server-plan
    image: rancher/k3s-upgrade #:v1.17.4-k3s1
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  #version: v1.20.4+k3s1
  channel: https://update.k3s.io/v1-release/channels/stable</pre>

Simply apply this manifest and the controller will identify nodes to upgrade and will take care of them as instructed. It's as simple as that. If you don't want to automatically upgrade, comment out the 'channel' parameter in each Plan, and uncomment the 'version' parameter and set it to the version to upgrade to and apply. The controller will see the new plan and act accordingly. If there is nothing to do, the controller just sits there waiting for a new plan with some work.
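
Switching a saved manifest from channel tracking to a pinned release is a two-line edit per Plan and can be scripted. The sketch below assumes the manifest has been saved locally with the same <code>#version:</code> and <code>channel:</code> lines shown above. The function name <code>pin_plan_version</code> and the file name in the example are hypothetical.

```shell
# pin_plan_version FILE VERSION
# Comment out every 'channel:' line and replace every '#version:' line
# with an active 'version: VERSION' line, preserving indentation (GNU sed).
pin_plan_version() {
    file=$1
    version=$2
    sed -i \
        -e 's|^\([[:space:]]*\)channel:|\1#channel:|' \
        -e "s|^\([[:space:]]*\)#version:.*|\1version: ${version}|" \
        "$file"
}

# Example:
#   pin_plan_version k3s-upgrade-plans.yaml v1.20.4+k3s1
#   kubectl apply -f k3s-upgrade-plans.yaml
```

Going back to automatic upgrades means reversing the two edits, so it may be simpler to keep two copies of the manifest in the repository.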

== Cluster Access==
The kubeconfig file stored at /etc/rancher/k3s/k3s.yaml is used to configure access to the Kubernetes cluster. If you have installed upstream Kubernetes command line tools such as kubectl or helm you will need to configure them with the correct kubeconfig path. This can be done by either exporting the KUBECONFIG environment variable or by invoking the --kubeconfig command line flag. Refer to the examples below for details.

Leverage the KUBECONFIG environment variable:
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl get pods --all-namespaces
helm ls --all-namespaces

Or specify the location of the kubeconfig file in the command:

kubectl --kubeconfig /etc/rancher/k3s/k3s.yaml get pods --all-namespaces
helm --kubeconfig /etc/rancher/k3s/k3s.yaml ls --all-namespaces

To access the kubernetes cluster from a machine outside the cluster, copy <code>/etc/rancher/k3s/k3s.yaml</code> to that machine as <code>~/.kube/config</code>. Then replace the value of the server field with the name of the K3s server (for an HA cluster, the DNS round-robin name described above). kubectl can now manage your K3s cluster.
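
A copied k3s.yaml normally points at the loopback address of the server it was generated on, so the server field has to be rewritten before the file is useful elsewhere. This is a sketch of that edit, assuming the API port 6443 used elsewhere on this page; <code>set_kubeconfig_server</code> is a hypothetical helper name, and the root login in the example is an assumption.

```shell
# set_kubeconfig_server FILE HOST
# Rewrite every 'server:' entry in a kubeconfig to https://HOST:6443 (GNU sed).
set_kubeconfig_server() {
    file=$1
    host=$2
    sed -i "s|^\([[:space:]]*server:\).*|\1 https://${host}:6443|" "$file"
}

# Example, run on the outside machine:
#   scp root@<server-ip-1>:/etc/rancher/k3s/k3s.yaml ~/.kube/config
#   set_kubeconfig_server ~/.kube/config kube-test.williams.localnet
```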

==Cluster Maintenance==
As packages (particularly helm charts) are updated, old container images are left behind. Keeping images around is not generally a bad thing, since it lets pods migrate between nodes in the cluster without waiting for the image to be downloaded. Upgrading packages, however, leaves a lot of cruft behind that can eventually cause issues with disk space. Supposedly Kubernetes manages this garbage collection, but other related packages (such as ceph) complain about low disk space long before Kubernetes decides to do anything about it. Fortunately, cleaning out the unused images is simple:
/usr/local/bin/k3s crictl rmi -q
While this can be run at any time, running it too often will remove images that are still valid but not in use on that node at that time, resulting in longer migration times when nodes are drained or fail.
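
One way to balance the two concerns is to prune only when the disk is actually filling up. The sketch below is an illustration, not part of the cluster tooling: the 80% threshold, the function names, and the choice of /var/lib/rancher as the path to measure are all assumptions.

```shell
# disk_usage_pct PATH
# Print the used-space percentage (integer, no % sign) of the filesystem holding PATH.
disk_usage_pct() {
    # -P forces the POSIX one-line-per-filesystem output format
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# prune_if_full PATH THRESHOLD
# Remove unused images only when usage of PATH is at or above THRESHOLD percent.
prune_if_full() {
    if [ "$(disk_usage_pct "$1")" -ge "$2" ]; then
        /usr/local/bin/k3s crictl rmi -q
    fi
}

# Example (e.g. from a weekly cron job on each node):
#   prune_if_full /var/lib/rancher 80
```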

==Finishing The Job==
These instructions will create a simple or HA kubernetes cluster. The HA cluster can survive a single master failure, but two will take down the cluster.

At this point you jump back into the rest of the instructions for installing a [[Kubernetes Cluster Installation|Kubernetes Cluster]], remembering that the network layer and metrics server have already been installed.

Aslan

2023-06-04T01:14:43Z

DrEdWilliams: "removed vmware horizon & fixed spacing"

This system is the primary linux workstation in the office and needs to have the basic apps for a graphic workstation:

* KDE Plasma Desktop
* Mate terminal (Ubuntu app)
* VideoLAN media player (https://videolan.org/vlc/download-redhat.html)
* Visual Studio Code (vscode - https://code.visualstudio.com/download)
* Calibre library manager (download)
* HPCMP Kerberos Kit (https://centers.hpc.mil)
** CAC software (CA CoolKey?)
* Kontena Lens (download latest)
* Thunderbird
* Slack
* Synergy (v1)
* NVIDIA drivers (default with kubuntu/nonfree)
* Zoom client
* Firefox
* rclone (for google & dropbox mounts)
* pdsh
* Oracle Java (the real thing) for the Dell Storage manager
* minio client (mc) for gitlab backup -- there is a copy in /work/bin if that will work for the platform.
* Dell EqualLogic Group Manager (java webstart - requires iSCSInet interface below and Oracle Java)

Configurations:
* mount /work and /files filesystems
* configure extra network devices
** StorageNet (10.1.0.8) - use the second ethernet interface
** iSCSInet (10.2.0.200) - done via [[VLAN Interfaces|VLAN]] 15 on the StorageNet interface
* set up [[Mounting Google Drive and Dropbox|dropbox/google drive mounts]] (gdrives script in /work/bin)

Tasks/Applications (in addition to normal housekeeping):

* cron daily:
** check-all (moved to pro6)
** Backup GitLab (moved to caspian)
*** mount storage1:/backup/systems to /backup_systems
*** install the cert for the gitlab instance (since it is self-signed)

Aslan

2023-01-15T03:16:35Z

DrEdWilliams: "gitlab backup moved to caspian"

This system is the primary linux workstation in the office and needs to have the basic apps for a graphic workstation:

* KDE Plasma Desktop
* Mate terminal (Ubuntu app)
* VideoLAN media player (https://videolan.org/vlc/download-redhat.html)
* VMware Horizon Client (for SAIC VDI access)
* Visual Studio Code (vscode - https://code.visualstudio.com/download)
* Calibre library manager (download)
* HPCMP Kerberos Kit (https://centers.hpc.mil)
** CAC software (CA CoolKey?)
* Kontena Lens (download latest)
* Thunderbird
* Slack
* Synergy (v1)
* NVIDIA drivers (default with kubuntu/nonfree)
* Zoom client
* Firefox
* rclone (for google & dropbox mounts)
*pdsh
*Oracle Java (the real thing) for the Dell Storage manager
*minio client (mc) for gitlab backup -- there is a copy in /work/bin if that will work for the platform.
*Dell EqualLogic Group Manager (java webstart - requires iSCSInet interface below and Oracle Java)

Configurations:
* mount /work and /files filesystems
*configure extra network devices
**StorageNet (10.1.0.8) - use the second ethernet interface
**iSCSInet (10.2.0.200) - done via [[VLAN Interfaces|VLAN]] 15 on the StorageNet interface
* set up [[Mounting Google Drive and Dropbox|dropbox/google drive mounts]] (gdrives script in /work/bin)
Tasks/Applications (in addition to normal housekeeping):

* cron daily:
** check-all (moved to pro6)
** Backup GitLab (moved to caspian)
*** mount storage1:/backup/systems to /backup_systems
*** install the cert for the gitlab instance (since it is self-signed)

Aslan

2023-01-15T03:16:35Z

DrEdWilliams: "test commit"

This system is the primary linux workstation in the office and needs to have the basic apps for a graphic workstation:

* KDE Plasma Desktop
* Mate terminal (Ubuntu app)
* VideoLAN media player (https://videolan.org/vlc/download-redhat.html)
* VMware Horizon Client (for SAIC VDI access)
* Visual Studio Code (vscode - https://code.visualstudio.com/download)
* Calibre library manager (download)
* HPCMP Kerberos Kit (https://centers.hpc.mil)
** CAC software (CA CoolKey?)
* Kontena Lens (download latest)
* Thunderbird
* Slack
* Synergy (v1)
* NVIDIA drivers (default with kubuntu/nonfree)
* Zoom client
* Firefox
* rclone (for google & dropbox mounts)
* pdsh
* Oracle Java (the real thing) for the Dell Storage manager
* minio client (mc) for gitlab backup -- there is a copy in /work/bin if that will work for the platform.
* Dell EqualLogic Group Manager (java webstart - requires iSCSInet interface below and Oracle Java)

Configurations:
* mount /work and /files filesystems
*configure extra network devices
**StorageNet (10.1.0.8) - use the second ethernet interface
**iSCSInet (10.2.0.200) - done via [[VLAN Interfaces|VLAN]] 15 on the StorageNet interface
* set up [[Mounting Google Drive and Dropbox|dropbox/google drive mounts]] (gdrives script in /work/bin)
Tasks/Applications (in addition to normal housekeeping):

* cron daily:
** check-all (moved to pro6)
** Backup GitLab
***mount storage1:/backup/systems to /backup_systems
***install the cert for the gitlab instance (since it is self-signed)

Production Cluster Configuration

2022-05-18T00:21:36Z

DrEdWilliams:

These packages form the basic functionality of the production cluster.

Scripts & config files are checked into gitlab under the Kubernetes group project listed.

{| class="wikitable"
|-
! activity !! gitlab !! script/procedures/config !! IP !! hostname(s)
|-
| [[Ceph Storage Cluster]] || k8s-admin || || ||
|-
| [[Rook Storage for Kubernetes|Rook Storage]] || k8s-admin || || || (StorageClass) rook-ceph-hdd rook-ceph-nvme
|-
| gitlab registry secrets || || gitlab-registry-kube-system.yaml gitlab-registry-secret.yaml || ||
|-
| mail || k8s/mail || || 10.0.0.114 || mail.williams.localnet
|-
| wordpress (dredwilliams.com) || k8s/dredwilliams || || || dredwilliams.williams-net.org
|-
| mediawiki || mediawiki || || 10.0.0.116 || wiki.williams.localnet wiki.williams-net.org
|-
| [[MariaDB]] || mariadb || || 10.0.0.117 || database.williams.localnet
|}

=== Storage ===
The production cluster depends on the '''/shared''' filesystem for its persistent storage as provided by the production Ceph cluster. The Ceph is configured as shown here:
{| class="wikitable"
|-
! system !! function !! storage !! size
|-
| caspian || master || NVMe HDD || 1TB 1TB
|-
| uvilas || node || NVMe HDD HDD || 1TB 1TB 1TB
|-
| belisar || node || NVMe HDD || 1TB 250GB
|}

The work filesystem can be mounted via NFS:

10.0.0.75:/work /work nfs4 soft 0 0

=== Backups ===

In addition to the normal backups configured in the basic OS installation steps, the databases in the production cluster must be backed up daily using the 'mysqldump' command:

TBD

These commands should be inserted into the /etc/cron.daily/backup file on one of the cluster nodes (telmar is a good choice). The first does a complete database dump of the MediaWiki database server, the second dumps just the mediawiki database itself, and the third dumps the general purpose database server. Additional dump commands should be inserted for additional significant databases, as parsing individual databases out of a system dump can be tedious.
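
As a hedged placeholder for the commands still marked TBD, the three dumps described above could be sketched as follows. The host name <code>wikidb.williams.localnet</code>, the credentials file, the output directory, and the database name <code>mediawiki</code> are assumptions for illustration only; <code>database.williams.localnet</code> is the MariaDB entry from the table at the top of this page. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of /etc/cron.daily/backup. The paths and the wiki DB host are assumptions.
BACKUP_DIR="${BACKUP_DIR:-/backup_systems/databases}"  # hypothetical target directory
CREDS="${CREDS:-/root/.my.cnf}"                        # hypothetical client credentials file
DRY_RUN="${DRY_RUN:-1}"                                # 1 = print commands, 0 = run them

# dump HOST TARGET LABEL
#   TARGET is either --all-databases or the name of one database.
dump() {
    out="$BACKUP_DIR/$3.sql"
    if [ "$DRY_RUN" = "1" ]; then
        echo "mysqldump --defaults-extra-file=$CREDS -h $1 --single-transaction $2 > $out"
    else
        mysqldump --defaults-extra-file="$CREDS" -h "$1" --single-transaction "$2" > "$out"
    fi
}

dump wikidb.williams.localnet   --all-databases wiki-server     # whole MediaWiki database server
dump wikidb.williams.localnet   mediawiki       wiki-mediawiki  # just the mediawiki database
dump database.williams.localnet --all-databases general-server  # general purpose database server
```

Setting <code>DRY_RUN=0</code> runs the dumps for real; keeping one file per significant database avoids having to parse it out of a full server dump later.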

K3s - Kubernetes Simplified

2022-02-25T12:34:46Z

DrEdWilliams: added container cleanup instructions

Lightweight Kubernetes. Production ready, easy to install, half the memory, all in a binary less than 100 MB.

Great for:
* Edge
* IoT
* CI
* Development
* ARM
* Embedding k8s
* Situations where a PhD in k8s clusterology is infeasible

Extracted from https://github.com/k3s-io/k3s ... YMMV

What makes '''k3s''' special is that they have stripped out all the bloat-ware from upstream kubernetes that is not needed for small and bare-metal deployments. None of the so-called cloud features are present, and most of the CSI storage drivers are also gone.

As such, it is a simpler replacement for the '''kubeadm''' command, and it does a bit more for you:
* installs '''traefik''' as the default ingress controller -- it works ... especially for a test/dev cluster, but for production I'll use Contour/Envoy
* installs '''flannel''' as the CNI networking layer ... I had problems with flannel before, let's see if this works any better
* installs '''metrics-server''' to collect node/pod performance data
* provides a 'local path provisioner' ... not sure what this does ... yet ...
__NOTOC__
== Basic Cluster Installation ==
They weren't kidding ... this is all it takes to create a simple cluster:
curl -sfL https://get.k3s.io | sh -
A kubeconfig file is written to /etc/rancher/k3s/k3s.yaml and the service is automatically started or restarted. The install script will install K3s and additional utilities, such as kubectl, crictl, k3s-killall.sh, and k3s-uninstall.sh, for example:
sudo kubectl get nodes
K3S_TOKEN is created at /var/lib/rancher/k3s/server/node-token on your server. To install on worker nodes we should pass K3S_URL along with K3S_TOKEN or K3S_CLUSTER_SECRET environment variables, for example:
curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=XXX sh -

== HA Cluster Installation ==
One thing that attracted me to this distribution (as it is a distribution, not a fork of kubernetes) is that it provides a very simple way to set up a HA cluster control plane. There are command options in the deployment that provide for the configuration of a common etcd server to facilitate a HA control plane. Even with that ... it can get a bit cumbersome. Enter [https://github.com/alexellis/k3sup k3sup].

=== k3sup ===
k3sup is a wrapper for the k3s installer that enables:
* installation of a complete cluster from a single login session using SSH
* combining multiple k3s parameters into single options to k3sup for simplicity (i.e. '''--cluster''')

=== External LoadBalancer (or DNS Round Robin) ===
The one remaining item needed for a fully HA control plane is an external load balancer to provide a single IP address for accessing the control API from inside or outside the cluster. Lacking something like an F5 BIG-IP load balancer or something equivalent, the poor-man's approach is to use a round robin in the DNS. Problem is, the pfSense firewall doesn't do round robins by default, so a little arm twisting is needed. Put this in the 'custom options' in the DNS Resolver settings page:
server:
rrset-roundrobin: yes
local-zone: "kube-test.williams.localnet" transparent
local-data: "kube-test.williams.localnet A 10.0.0.70"
local-data: "kube-test.williams.localnet A 10.0.0.71"
local-data: "kube-test.williams.localnet A 10.0.0.72"
Change the name and IP addresses as needed, and then load the changes. This approach somewhat complicates the deployment commands below, so pay attention ...
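
Since the name and addresses change per cluster, the custom options can be generated rather than hand-edited. The following is a minimal sketch; the function name <code>rr_options</code> is made up for illustration, and the output simply mirrors the block shown above.

```shell
#!/bin/sh
# rr_options NAME IP...
# Print the DNS Resolver custom options for a round-robin name.
rr_options() {
    name="$1"; shift
    echo "server:"
    echo "rrset-roundrobin: yes"
    printf 'local-zone: "%s" transparent\n' "$name"
    for ip in "$@"; do
        printf 'local-data: "%s A %s"\n' "$name" "$ip"
    done
}

rr_options kube-test.williams.localnet 10.0.0.70 10.0.0.71 10.0.0.72
```

Once the options are loaded, repeated <code>dig +short kube-test.williams.localnet</code> queries should return all of the addresses, with the order rotating between queries.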

=== Get k3sup ===
Not surprisingly, there are multiple ways to do this. The author wants you to download it from a marketplace app (arkade):
curl -sLS https://dl.get-arkade.dev | sh
# Install the binary using the command given as output, such as:
sudo cp arkade-darwin /usr/local/bin/arkade
# Or on Linux:
sudo cp arkade /usr/local/bin/arkade
arkade get k3sup

... or you can just download it directly ...
curl -sLS https://get.k3sup.dev | sh
sudo install k3sup /usr/local/bin/

You just need to put it somewhere in your path (or use a full path reference).

=== Install the master nodes ===
Install the first master node with these commands. This will put it on the node that the command is issued from ... if that is not the desire, remove the '--local' parameter.
mkdir -p $HOME/.kube
EXTRA_ARGS="--disable servicelb --disable traefik --disable local-storage --write-kubeconfig-mode=644 --flannel-backend host-gw"
k3sup install --cluster --local --ip <server-ip-1> --k3s-channel stable --k3s-extra-args "$EXTRA_ARGS" \
--local-path $HOME/.kube/config --tls-san <rr-dns-name>

Explanations:
* --cluster sets this up for a HA control plane
* --local says to do it on this system, instead of using SSH to the node -- '''but you still need to provide the IP address''', otherwise it will put 127.0.0.1 as the API server address
* get the last stable version available from k3s (use 'latest' to get bleeding edge updates)
* we want to disable the simplified k3s loadbalancer ... it really is a crowbar solution and doesn't provide IP request capability
* disable the default ingress controller, as we really want something better
* disable the local storage driver (it's a pain if you're installing another storage driver)
* make the k3s.yaml file readable by the pods ... really should be by default
* (temporary workaround) force the flannel backend to use 'host-gw' mode since 'vxlan' mode seems to be broken with Debian 11
* the --local-path tells k3sup where to put the kubeconfig files ... this is the standard place, so we will put it there. The directory has to exist -- hence the 'mkdir' command ...
* by providing the name of the DNS round robin, it should be able to pass muster on TLS connections ...

Subsequent master nodes can be installed next:
k3sup join --server --ip <server-ip-2> --k3s-channel stable --server-ip <server-ip-1> --k3s-extra-args "$EXTRA_ARGS"

Explanations:
* join as another server, not an agent/worker
* provide the IP address of the node to be joined
* specify the IP of the first master created
* use the same EXTRA_ARGS provided to the first server

Repeat this for all the masters (minimum 3) providing the same server IP address of the first master. Note that if you need to add/remove more parameters to the k3s service, you need to do it on all server nodes -- just edit the k3s.service file and restart one server at a time.
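
Repeating the join for each remaining master can be sketched as a loop. The IP addresses below are assumed values borrowed from the round-robin example, not necessarily the real masters, and the function only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Assumed values for illustration only.
FIRST_MASTER="10.0.0.70"
OTHER_MASTERS="10.0.0.71 10.0.0.72"
EXTRA_ARGS="--disable servicelb --disable traefik --disable local-storage --write-kubeconfig-mode=644 --flannel-backend host-gw"

# join_master_cmd NODE_IP
# Print the k3sup command that joins NODE_IP as an additional server.
join_master_cmd() {
    printf 'k3sup join --server --ip %s --k3s-channel stable --server-ip %s --k3s-extra-args "%s"\n' \
        "$1" "$FIRST_MASTER" "$EXTRA_ARGS"
}

for ip in $OTHER_MASTERS; do
    join_master_cmd "$ip"    # review the output, then pipe it to sh to execute
done
```

Printing first is a deliberate choice here: every server must receive exactly the same extra arguments, and a dry run makes a mismatch easy to spot.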

Now that you have all the masters online, reset the kubeconfig file (~/.kube/config) to use the DNS RR name instead of the IP of the first master:
sed -i -e 's/<server-ip-1>/<rr-dns-name>/g' $HOME/.kube/config
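
A slightly more defensive version of the same substitution is sketched below. The helper name <code>retarget_kubeconfig</code> is hypothetical; it escapes the dots in the IP address so they match literally, and the demonstration works on a throwaway file rather than the real kubeconfig. GNU sed is assumed for the <code>-i</code> option.

```shell
#!/bin/sh
# retarget_kubeconfig FILE OLD_IP NEW_HOST
# Rewrite the API server address in FILE in place (GNU sed -i assumed).
retarget_kubeconfig() {
    old_re="$(printf '%s' "$2" | sed 's/\./\\./g')"   # escape dots so they match literally
    sed -i -e "s/$old_re/$3/g" "$1"
}

# Demonstration on a temporary file instead of ~/.kube/config
demo="$(mktemp)"
echo "    server: https://10.0.0.70:6443" > "$demo"
retarget_kubeconfig "$demo" 10.0.0.70 kube-test.williams.localnet
cat "$demo"
rm -f "$demo"
```

Against the real file the call would be <code>retarget_kubeconfig $HOME/.kube/config <server-ip-1> <rr-dns-name></code>, with the placeholders filled in.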

=== Install Worker Nodes ===
At this point, you should have an operational cluster. k3s enables masters to be used for normal pods by default, so this is the minimum operational cluster. Next, install the agent/worker nodes:
k3sup join --ip <agent-node-x> --k3s-channel stable --server-host <rr-dns-name>

Explanations:
* Joining as an agent, so no '--server' this time
* provide each agent/worker IP in turn
* instead of the '--server-ip' parameter, we are using the '--server-host' parameter so we can use the DNS RR to ensure that the nodes always can access a master

== Automated Upgrades: System Upgrade Controller ==

Rancher has taken the simplicity of k3s to another level by providing a kubernetes controller that can update the cluster ''from within the cluster'' using the [https://rancher.com/docs/k3s/latest/en/upgrades/automated/ System Upgrade Controller].

Installing the controller is straightforward using a single manifest (also located in the git repo):
kubectl apply -f https://github.com/rancher/system-upgrade-controller/releases/download/v0.7.6/system-upgrade-controller.yaml

The controller looks for upgrade plans as custom resources (of type 'Plan', for some odd reason) to define its behavior. Two 'Plan's are required ... one for server/master nodes and one for agent/worker nodes. These plans can be set up to either upgrade to a specific k3s version or to monitor the release channel and automatically upgrade when a new release is available. The manifest looks like this:

<pre># Server plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: server-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: In
      values:
      - "true"
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  #version: v1.20.4+k3s1
  channel: https://update.k3s.io/v1-release/channels/stable
---
# Agent plan
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: agent-plan
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
    - key: node-role.kubernetes.io/master
      operator: DoesNotExist
  prepare:
    args:
    - prepare
    - server-plan
    image: rancher/k3s-upgrade #:v1.17.4-k3s1
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/k3s-upgrade
  #version: v1.20.4+k3s1
  channel: https://update.k3s.io/v1-release/channels/stable</pre>

Simply apply this manifest and the controller will identify nodes to upgrade and will take care of them as instructed. It's as simple as that. If you don't want to automatically upgrade, comment out the 'channel' parameter in each Plan, and uncomment the 'version' parameter and set it to the version to upgrade to and apply. The controller will see the new plan and act accordingly. If there is nothing to do, the controller just sits there waiting for a new plan with some work.
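
Switching both Plans from channel tracking to a pinned version can be done mechanically. The following is a minimal sketch, assuming the manifest keeps the <code>#version:</code> and <code>channel:</code> lines in the form shown above; the function name <code>pin_k3s_version</code> and the version string are illustrative only.

```shell
#!/bin/sh
# pin_k3s_version VERSION < plans.yaml > pinned.yaml
# Comment out each 'channel:' line and activate each '#version:' line,
# preserving whatever indentation the line already has.
pin_k3s_version() {
    sed -e 's/^\([[:space:]]*\)channel:/\1#channel:/' \
        -e "s/^\([[:space:]]*\)#version:.*/\1version: $1/"
}

# Demonstration on a two-line fragment of a Plan
printf '  #version: v1.20.4+k3s1\n  channel: https://update.k3s.io/v1-release/channels/stable\n' \
    | pin_k3s_version v1.21.5+k3s2
```

The result can be piped straight into <code>kubectl apply -f -</code>, and <code>kubectl -n system-upgrade get plans,jobs</code> shows whether the controller has picked the plan up.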

== Cluster Maintenance ==
As packages (particularly helm charts) are updated, old container images are left behind. Keeping container images around is generally not a bad thing, since it lets pods migrate between nodes in the cluster without waiting for the image to be downloaded. Upgrading packages, however, leaves a lot of cruft behind that can eventually cause issues with disk space. Supposedly Kubernetes manages this garbage collection, but other related packages (such as ceph) complain about low disk space long before Kubernetes decides to do anything about it. Cleaning out the old images is simple, fortunately:
/usr/local/bin/k3s crictl rmi -q
While this can be run at any time, running it too often will remove containers that are still valid, just not used at that time on that node -- resulting in longer migration times when nodes are drained or fail.
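
One way to avoid pruning too often is to do it only when the disk is actually filling up. The sketch below assumes the images live under <code>/var/lib/rancher</code> and uses an arbitrary 80% threshold; both values and the function names are assumptions, not settings taken from the cluster.

```shell
#!/bin/sh
THRESHOLD="${THRESHOLD:-80}"               # percent full; assumed value
IMAGE_FS="${IMAGE_FS:-/var/lib/rancher}"   # assumed location of the k3s image store

# usage_pct PATH
# Print how full the filesystem holding PATH is, as an integer percent.
usage_pct() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# needs_prune USED LIMIT
# Succeed when USED is at or above LIMIT.
needs_prune() {
    [ "$1" -ge "$2" ]
}

# prune_if_needed
# Remove unused images only when the disk is filling up.
prune_if_needed() {
    used="$(usage_pct "$IMAGE_FS")"
    if needs_prune "$used" "$THRESHOLD"; then
        /usr/local/bin/k3s crictl rmi -q
    fi
}
```

Calling <code>prune_if_needed</code> from a cron.daily script on each node keeps cached images around for fast pod migration until space really becomes a concern.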

== Finishing The Job ==
These instructions will create a simple or HA kubernetes cluster. The HA cluster can survive a single master failure, but two will take down the cluster.

At this point you jump back into the rest of the instructions for installing a [[Kubernetes Cluster Installation|Kubernetes Cluster]], remembering that the network layer and metrics server have already been installed.

Memory Testing

2021-12-28T02:24:22Z

DrEdWilliams: Created page with "Install the memtester utility from the software repository, if it is not already installed. Ubuntu/Debian and Linux Mint: sudo apt-get install memtester Fedora/RHEL/RPM: y..."

Install the memtester utility from the software repository, if it is not already installed.

Ubuntu/Debian and Linux Mint:
sudo apt-get install memtester
Fedora/RHEL/RPM:
yum install memtester
As root, run the test:
memtester <memory size> <repetitions>
... where:

* <memory size> is in megabytes
* <repetitions> is the number of times the full suite will be run

To get the exit status:

echo $?

Possible values:

* x00: success
* x01: Error allocating or locking memory, or invocation error
* x02: Error during stuck address test
* x04: Error during one of the other tests.
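
Because a failing run can set more than one of these values at once (the memtester man page describes the exit code as their logical OR), a small decoder is handy in scripts. This is a minimal sketch; the function name is made up for illustration.

```shell
#!/bin/sh
# decode_memtester_status STATUS
# Explain a memtester exit status, one line per bit that is set.
decode_memtester_status() {
    status="$1"
    if [ "$status" -eq 0 ]; then
        echo "success"
        return 0
    fi
    [ $(( status & 1 )) -ne 0 ] && echo "error allocating or locking memory, or invocation error"
    [ $(( status & 2 )) -ne 0 ] && echo "error during stuck address test"
    [ $(( status & 4 )) -ne 0 ] && echo "error during one of the other tests"
    return 1
}
```

Typical use is <code>memtester 1024 3; decode_memtester_status $?</code>, where the size and repetition count are just example values.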

(excerpted from https://www.techwalla.com/articles/how-to-test-the-ram-on-linux)

System Administration Processes

2021-12-28T02:17:51Z

DrEdWilliams: /* Sustainment Practices */

== Non-standard Repositories ==
* [[Centos Repos]]

== Basic OS installation ==
* [[Basic Ubuntu Installation]]
*[[Basic CentOS 7 Installation]]
* [[Basic Debian Installation]]
* [[Basic Fedora 31+ Installation]]
* [[Fedora 33 - Workstation Installation]]
*[[Common Post-Install]]

== System Configuration ==
* [[HPE ProLiant System Configuration/Support]]
* [[Ethernet Device Configuration]]
* [[CUDA + NVIDIA drivers]]
* [[NVIDIA Docker Runtime|Installing NVIDIA Docker Runtime]]
* [[VLAN Interfaces]]
*[[Mounting Google Drive and Dropbox]]
*[[Mounting iSCSI LUNs]]

== Application Installation ==
* [[Plex Media Server - Debian]]
* [[Docker Installation]]
* [[Ceph Storage Cluster]]
* [[Kubernetes Cluster Installation]]
* [http://www.webmin.com/download.html Webmin Installation]

== Sustainment Practices ==
* [[SSH Host Key Management]]
* [[Log Accumulator]]
* [[Benchmark commands]]
*[[Memory Testing]]

== Deprecated ==
*

HPE ProLiant System Configuration/Support

2021-12-24T18:07:06Z

DrEdWilliams: /* HPE Enterprise Support */

Debian has a page in their Wiki covering anecdotal reports of compatibility/support for Debian on the HP ProLiant Server family: https://wiki.debian.org/HP/ProLiant

== HPE Software Repositories ==
HPE maintains RPM and Debian repositories for their system support software -- see http://downloads.linux.hpe.com/SDR/project/mcp/

The software available through the repository includes:

{| class="wikitable"
|-
! package !! description
|-
| hp-health || HPE System Health Application and Command line Utilities (Gen9 and earlier)
|-
| hponcfg || HPE RILOE II/iLO online configuration utility
|-
| amsd || HPE Agentless Management Service (Gen10 only)
|-
| hp-ams || HPE Agentless Management Service (Gen9 and earlier)
|-
| hp-snmp-agents || Insight Management SNMP Agents for HPE ProLiant Systems (Gen9 and earlier)
|-
| hpsmh || HPE System Management Homepage (Gen9 and earlier)
|-
| hp-smh-templates || HPE System Management Homepage Templates (Gen9 and earlier)
|-
| ssacli || HPE Command Line Smart Storage Administration Utility
|-
| ssaducli || HPE Command Line Smart Storage Administration Diagnostics
|-
| ssa || HPE Array Smart Storage Administration Service
|}

=== RPM Repository ===
Cut-n-paste the following section (substituting distribution, architecture and project version) into /etc/yum.repos.d/mcp.repo on your system:

[mcp]
name=Management Component Pack
baseurl=http://downloads.linux.hpe.com/repo/mcp/dist/dist_ver/arch/project_ver
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/GPG-KEY-mcp

Where:
<pre> dist centos, fedora, opensuse, oracle, asianux
dist_ver [http://downloads.linux.hpe.com/SDR/repo/mcp/ Browse repo to identify supported distribution versions]
arch i386, x86_64
project_ver current, 11.30, 11.21, 11.05, 10.62, 10.50, 10.40, 10.20, 10.00, 9.30, 9.25</pre>
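
The substitution can also be scripted. The following sketch prints the stanza with the four placeholders filled in; the function name <code>mcp_repo</code> and the example values (centos, 7, x86_64, current) are assumptions, so check the repo browser for the combinations HPE actually publishes.

```shell
#!/bin/sh
# mcp_repo DIST DIST_VER ARCH PROJECT_VER
# Print an mcp.repo stanza with the placeholders substituted.
mcp_repo() {
    cat <<EOF
[mcp]
name=Management Component Pack
baseurl=http://downloads.linux.hpe.com/repo/mcp/$1/$2/$3/$4
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/GPG-KEY-mcp
EOF
}

# Example values only
mcp_repo centos 7 x86_64 current
```

As root, redirect the output to /etc/yum.repos.d/mcp.repo; the repository name to pass to <code>--enablerepo</code> is then <code>mcp</code>.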

List the packages in the repository
yum --disablerepo="*" --enablerepo="short_repo_name" list available

Install a specific package
yum install <packagename>

=== Debian Repositories ===
Cut-n-paste the following section (substituting distribution, architecture and project version) into /etc/apt/sources.list.d/mcp.list on your system:

# HPE Management Component Pack
deb http://downloads.linux.hpe.com/SDR/repo/mcp dist/project_ver non-free

Where:
<pre> dist bionic, xenial, trusty, precise, stretch, jessie, squeeze, wheezy
project_ver current, 11.30, 11.21, 11.05, 10.80, 10.60, 10.50, 10.40, 9.50, 9.40</pre>

Install the HPE public gpg key -- instructions at http://downloads.linux.hpe.com/SDR/keys

Update the local apt indexes
apt-get update

Search for a specific package
apt-cache search <packagename> # browse debs

Install a specific package
apt-get install <packagename>

== SmartArray P410i ==
=== Smart Storage Administration ===
The ssa utility provides a nice GUI for managing the drive arrays in the DL380G7 ... but it comes with a heavy cost in terms of software that must be installed. Beyond the '''ssa''' package itself, it requires that firefox is installed and displayable from the local machine. Given that the systems do not run in graphics mode, this means that you need to make sure that the X11 DISPLAY variable is set properly to tunnel the result back to a graphical workstation.

After installing the repository above, install the packages:
yum install -y ssa firefox xauth
... or ...
apt install -y ssa firefox-esr xauth
Be warned -- the firefox install alone grabs over 100 dependent packages ...

After installing everything, run the SSA package in local mode:
ssa -local

This will start a background process and then launch firefox pointing at that process, which listens on a random high port number. This is a very poor situation, but the tool itself is very nice and allows you (as far as I can tell) to do everything needed to configure the controller and attached disks and arrays.

For a quick status that can be put in a regular status check (cron.daily or crontab), use the '''ssacli''' command:
yum install ssacli
... or ...
apt install ssacli

ssacli ctrl slot=0 pd all show
This will show all the drives in the first controller slot (there is only one in the chassis, I think).
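
For a cron-driven check, it is usually enough to report only the drives that are not healthy. The sketch below assumes each drive appears on a <code>physicaldrive</code> line ending in a status such as <code>OK)</code>; that output format is an assumption to verify against a real controller, and the sample lines are invented for the demonstration.

```shell
#!/bin/sh
# failed_drives
# Read 'ssacli ... pd all show' output on stdin and print every
# physicaldrive line whose status field is not OK.
failed_drives() {
    grep 'physicaldrive' | grep -v ', OK)'
}

# Demonstration with invented sample output
sample='   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 146 GB, OK)
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 146 GB, Failed)'
printf '%s\n' "$sample" | failed_drives
```

In cron.daily this becomes <code>ssacli ctrl slot=0 pd all show | failed_drives</code>; cron only sends mail when there is output, so a healthy array stays quiet.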

=== HPE Enterprise Support ===
The HP overview is located here: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c01677092

HPE full parts lists: https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=c04396629

The full enterprise support site is located at https://support.hpe.com/hpesc/public/km/product/3883933/Product#t=Documents&sort=relevancy&layout=table&f:@kmdoclanguagecode=[cv1871440]&hpe=1

== ProLiant DL380G7 ==
HPE overview: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c02215285

HPE ProLiant DL380 G7 Server - Specifications: https://support.hpe.com/hpsc/doc/public/display?docLocale=en_US&docId=emr_na-c02215298

Search | HPE Support Center - Drivers for HPE ProLiant DL380 G7 Server Models : https://support.hpe.com/hpesc/public/km/product/4091567/Product#t=DriversandSoftware&sort=relevancy&layout=table

== ProLiant DL380p Gen8 ==

HPE Overview: https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c03235277

Intel CPU (Xeon E5-2640 2.5 GHz 6 core):

https://ark.intel.com/content/www/us/en/ark/products/64591/intel-xeon-processor-e52640-15m-cache-2-50-ghz-7-20-gts-intel-qpi.html

HPE ProLiant System Configuration/Support

2021-12-21T00:12:04Z

DrEdWilliams: /* ProLiant DL380p Gen8 */

Debian has a page in their Wiki covering anecdotal reports of compatibility/support for Debian on the HP ProLiant Server family: https://wiki.debian.org/HP/ProLiant

== HPE Software Repositories ==
HPE maintains RPM and Debian repositories for their system support software -- see http://downloads.linux.hpe.com/SDR/project/mcp/

The software available through the repository includes:

{| class="wikitable"
|-
! package !! description
|-
| hp-health || HPE System Health Application and Command line Utilities (Gen9 and earlier)
|-
| hponcfg || HPE RILOE II/iLO online configuration utility
|-
| amsd || HPE Agentless Management Service (Gen10 only)
|-
| hp-ams || HPE Agentless Management Service (Gen9 and earlier)
|-
| hp-snmp-agents || Insight Management SNMP Agents for HPE ProLiant Systems (Gen9 and earlier)
|-
| hpsmh || HPE System Management Homepage (Gen9 and earlier)
|-
| hp-smh-templates || HPE System Management Homepage Templates (Gen9 and earlier)
|-
| ssacli || HPE Command Line Smart Storage Administration Utility
|-
| ssaducli || HPE Command Line Smart Storage Administration Diagnostics
|-
| ssa || HPE Array Smart Storage Administration Service
|}

=== RPM Repository ===
Cut-n-paste the following section (substituting distribution, architecture and project version) into /etc/yum.repos.d/mcp.repo on your system:

[mcp]
name=Management Component Pack
baseurl=http://downloads.linux.hpe.com/repo/mcp/dist/dist_ver/arch/project_ver
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/GPG-KEY-mcp

Where:
<pre> dist centos, fedora, opensuse, oracle, asianux
dist_ver [http://downloads.linux.hpe.com/SDR/repo/mcp/ Browse repo to identify supported distribution versions]
arch i386, x86_64
project_ver current, 11.30, 11.21, 11.05, 10.62, 10.50, 10.40, 10.20, 10.00, 9.30, 9.25</pre>

List the packages in the repository
yum --disablerepo="*" --enablerepo="short_repo_name" list available

Install a specific package
yum install <packagename>

=== Debian Repositories ===
Cut-n-paste the following section (substituting distribution, architecture and project version) into /etc/apt/sources.list.d/mcp.list on your system:

# HPE Management Component Pack
deb http://downloads.linux.hpe.com/SDR/repo/mcp dist/project_ver non-free

Where:
<pre> dist bionic, xenial, trusty, precise, stretch, jessie, squeeze, wheezy
project_ver current, 11.30, 11.21, 11.05, 10.80, 10.60, 10.50, 10.40, 9.50, 9.40</pre>

Install the HPE public gpg key -- instructions at http://downloads.linux.hpe.com/SDR/keys

Update the local apt indexes
apt-get update

Search for a specific package
apt-cache search <packagename> # browse debs

Install a specific package
apt-get install <packagename>

== SmartArray P410i ==
=== Smart Storage Administration ===
The ssa utility provides a nice GUI for managing the drive arrays in the DL380G7 ... but it comes with a heavy cost in terms of software that must be installed. Beyond the '''ssa''' package itself, it requires that firefox is installed and displayable from the local machine. Given that the systems do not run in graphics mode, this means that you need to make sure that the X11 DISPLAY variable is set properly to tunnel the result back to a graphical workstation.

After installing the repository above, install the packages:
yum install -y ssa firefox xauth
... or ...
apt install -y ssa firefox-esr xauth
Be warned -- the firefox install alone grabs over 100 dependent packages ...

After installing everything, run the SSA package in local mode:
ssa -local

This will start a background process and then launch firefox pointing at that process, which listens on a random high port number. This is a very poor situation, but the tool itself is very nice and allows you (as far as I can tell) to do everything needed to configure the controller and attached disks and arrays.

For a quick status that can be put in a regular status check (cron.daily or crontab), use the '''ssacli''' command:
yum install ssacli
... or ...
apt install ssacli

ssacli ctrl slot=0 pd all show
This will show all the drives in the first controller slot (there is only one in the chassis, I think).

=== HPE Enterprise Support ===
The HP overview is located here: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c01677092

The full enterprise support site is located at https://support.hpe.com/hpesc/public/km/product/3883933/Product#t=Documents&sort=relevancy&layout=table&f:@kmdoclanguagecode=[cv1871440]&hpe=1

== ProLiant DL380G7 ==
HPE overview: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c02215285

HPE ProLiant DL380 G7 Server - Specifications: https://support.hpe.com/hpsc/doc/public/display?docLocale=en_US&docId=emr_na-c02215298

Search | HPE Support Center - Drivers for HPE ProLiant DL380 G7 Server Models : https://support.hpe.com/hpesc/public/km/product/4091567/Product#t=DriversandSoftware&sort=relevancy&layout=table

== ProLiant DL380p Gen8 ==

HPE Overview: https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c03235277

Intel CPU (Xeon E5-2640 2.5 GHz 6 core):

https://ark.intel.com/content/www/us/en/ark/products/64591/intel-xeon-processor-e52640-15m-cache-2-50-ghz-7-20-gts-intel-qpi.html

Hardware

2021-12-21T00:10:34Z

DrEdWilliams: /* HP ProLiant Server */

== HP NC364T 4-port PCIe GigaBit NIC ==
Overview: http://www.harddrivesdirect.com/product_info.php?manufacturers_id=&products_id=454534

== HP ProLiant Server ==
Here is some info on the HP servers ... reference: https://en.wikipedia.org/wiki/ProLiant -- useful, but not as complete as the other page.

ProLiant servers are also split into several series which denote processor configuration.
* The 100, 200, 300 and 400 series comprise single and dual socket capable systems,
* the 500 and 600 series comprise quad socket capable systems,
* the 700 and 900 series comprise eight socket capable systems.
* The 900 series also includes eight sockets, supporting up to 80 CPU cores and up to 4 TB of RAM.
The second digit represents the capability of the system -- sockets, DIMMs, cores, drives
* x20 -- single socket, very limited
* x60 -- good performance
* x80 -- high capacity
* x90 -- outrageous

The 'Gen' indicator is how modern the CPU is ... higher is newer ... Gen10 is the current product line.

== Related Pages ==
[[HPE ProLiant System Configuration/Support]]

HPE ProLiant System Configuration/Support

2021-12-20T23:58:59Z

DrEdWilliams: /* ProLiant DL380G7 */

Debian has a page in their Wiki covering anecdotal reports of compatibility/support for Debian on the HP ProLiant Server family: https://wiki.debian.org/HP/ProLiant

== HPE Software Repositories ==
HPE maintains RPM and Debian repositories for their system support software -- see http://downloads.linux.hpe.com/SDR/project/mcp/

The software available through the repository includes:

{| class="wikitable"
|-
! package !! description
|-
| hp-health || HPE System Health Application and Command line Utilities (Gen9 and earlier)
|-
| hponcfg || HPE RILOE II/iLO online configuration utility
|-
| amsd || HPE Agentless Management Service (Gen10 only)
|-
| hp-ams || HPE Agentless Management Service (Gen9 and earlier)
|-
| hp-snmp-agents || Insight Management SNMP Agents for HPE ProLiant Systems (Gen9 and earlier)
|-
| hpsmh || HPE System Management Homepage (Gen9 and earlier)
|-
| hp-smh-templates || HPE System Management Homepage Templates (Gen9 and earlier)
|-
| ssacli || HPE Command Line Smart Storage Administration Utility
|-
| ssaducli || HPE Command Line Smart Storage Administration Diagnostics
|-
| ssa || HPE Array Smart Storage Administration Service
|}

=== RPM Repository ===
Cut-n-paste the following section (substituting distribution, architecture and project version) into /etc/yum.repos.d/mcp.repo on your system:

[mcp]
name=Management Component Pack
baseurl=http://downloads.linux.hpe.com/repo/mcp/dist/dist_ver/arch/project_ver
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/GPG-KEY-mcp

Where:
<pre> dist centos, fedora, opensuse, oracle, asianux
dist_ver [http://downloads.linux.hpe.com/SDR/repo/mcp/ Browse repo to identify supported distribution versions]
arch i386, x86_64
project_ver current, 11.30, 11.21, 11.05, 10.62, 10.50, 10.40, 10.20, 10.00, 9.30, 9.25</pre>

List the packages in the repository
yum --disablerepo="*" --enablerepo="short_repo_name" list available

Install a specific package
yum install <packagename>

=== Debian Repositories ===
Cut-n-paste the following section (substituting distribution, architecture and project version) into /etc/apt/sources.list.d/mcp.list on your system:

# HPE Management Component Pack
deb http://downloads.linux.hpe.com/SDR/repo/mcp dist/project_ver non-free

Where:
<pre> dist bionic, xenial, trusty, precise, stretch, jessie, squeeze, wheezy
project_ver current, 11.30, 11.21, 11.05, 10.80, 10.60, 10.50, 10.40, 9.50, 9.40</pre>

Install the HPE public gpg key -- instructions at http://downloads.linux.hpe.com/SDR/keys

Update the local apt indexes
apt-get update

Search for a specific package
apt-cache search <packagename> # browse debs

Install a specific package
apt-get install <packagename>

== SmartArray P410i ==
=== Smart Storage Administration ===
The ssa utility provides a nice GUI for managing the drive arrays in the DL380G7 ... but it comes with a heavy cost in terms of software that must be installed. Beyond the '''ssa''' package itself, it requires that firefox is installed and displayable from the local machine. Given that the systems do not run in graphics mode, this means that you need to make sure that the X11 DISPLAY variable is set properly to tunnel the result back to a graphical workstation.

After installing the repository above, install the packages:
yum install -y ssa firefox xauth
... or ...
apt install -y ssa firefox-esr xauth
Be warned -- the firefox install alone grabs over 100 dependent packages ...

After installing everything, run the SSA package in local mode:
ssa -local

This will start a background process and then launch firefox pointing at that process, which listens on a random high port number. This is a very poor situation, but the tool itself is very nice and allows you (as far as I can tell) to do everything needed to configure the controller and attached disks and arrays.

For a quick status that can be put in a regular status check (cron.daily or crontab), use the '''ssacli''' command:
yum install ssacli
... or ...
apt install ssacli

ssacli ctrl slot=0 pd all show
This will show all the drives in the first controller slot (there is only one in the chassis, I think).
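
For a cron job it may be more useful to report only the drives that need attention. The filter below is a sketch that assumes each drive is reported on a line of the form <code>physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 300 GB, OK)</code>; the function name <code>failed_drives</code> is hypothetical, and the pattern should be checked against the real output of the installed ssacli version.

```shell
#!/bin/sh
# Sketch: read "ssacli ... pd all show" output on stdin and print every
# physical drive line whose status (assumed to be the last field inside
# the parentheses) is not OK. Exits non-zero when such a drive is found.
failed_drives() {
    awk '/physicaldrive/ && $0 !~ /, OK\)[ \t]*$/ { print; bad = 1 }
         END { exit bad + 0 }'
}
```

Typical use would be <code>ssacli ctrl slot=0 pd all show | failed_drives</code>, which stays silent while all drives report OK.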

=== HPE Enterprise Support ===
The HP overview is located here: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c01677092

The full enterprise support site is located at https://support.hpe.com/hpesc/public/km/product/3883933/Product#t=Documents&sort=relevancy&layout=table&f:@kmdoclanguagecode=[cv1871440]&hpe=1

== ProLiant DL380G7 ==
HPE overview: https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c02215285

HPE ProLiant DL380 G7 Server - Specifications: https://support.hpe.com/hpsc/doc/public/display?docLocale=en_US&docId=emr_na-c02215298

Search | HPE Support Center - Drivers for HPE ProLiant DL380 G7 Server Models : https://support.hpe.com/hpesc/public/km/product/4091567/Product#t=DriversandSoftware&sort=relevancy&layout=table

== ProLiant DL380p Gen8 ==

HPE Overview: https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c03235277

How-To

2021-12-19T11:17:17Z

DrEdWilliams: /* Miscellaneous How-To Instructions */

__NOTOC__
== Miscellaneous How-To Instructions ==
* [[Linux RAID Array]]
*Certificate Procedures
** [[Creating a Self-Signed Certificate]]
**[[Creating a CA and certificates]]
**[[Verify certificate with CA]]
** [[Renewing the 'localhost' Certificate]]
** [[Adding a certificate to the system store]]
* [[Repositories for Miniconda]]
* [[Mobile IPsec Clients]]
* [[NVMe Drives]]
* iSCSI Procedures
** [[Mounting iSCSI LUNs]]
** [[Systemd mounted iSCSI volumes]]
** [[Removing a persistent iSCSI connection]]
* [[Installing a recent copy of git]]
* [[OSQuery|OSQuery - SQL-like access to system information]]
*[[Mounting Google Drive and Dropbox]]
*[[NGINX Basic Authentication]]
*[[Removing a disk from LVM]]
*Changing [[Open File Limits]]

== System Update Procedures ==
* [[Upgrading Docker]]

== MediaWiki notes ==
* [[Deleting old copies of pages from the database]]

Open File Limits

2021-12-19T11:16:35Z

DrEdWilliams: Created page with "To change maximum open file limits for your terminal session, run this command: # ulimit -n 3000 After closing the terminal and creating a new session, the limits will get ba..."

To change maximum open file limits for your terminal session, run this command:
# ulimit -n 3000
After closing the terminal and creating a new session, the limits revert to the original values specified in /etc/security/limits.conf.

To change the general value for the system <code>/proc/sys/fs/file-max</code>, change the <code>fs.file-max</code> value in <code>/etc/sysctl.conf</code>:
fs.file-max = 100000
And apply it:
# sysctl -p
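
To confirm what is currently in effect, something like the following sketch can be used. The function name <code>show_file_limits</code> is hypothetical; it reports the soft and hard per-session limits and, on Linux, the system-wide value.

```shell
#!/bin/sh
# Sketch: report the soft and hard open-file limits of the current shell
# and, where /proc is available, the system-wide fs.file-max value.
show_file_limits() {
    echo "soft=$(ulimit -S -n)"
    echo "hard=$(ulimit -H -n)"
    if [ -r /proc/sys/fs/file-max ]; then
        echo "file-max=$(cat /proc/sys/fs/file-max)"
    fi
}
```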

Cluster Operations

2021-12-02T22:23:11Z

DrEdWilliams:

== Rolling Restart ==
The OpenSearch cluster is very resilient -- it tries to make sure it has all of its indices appropriately replicated at all times. It can handle nodes dropping out and being added -- but sometimes we don't want that. If nodes are simply restarted (to change settings) or the system is rebooted (for patching), we don't want to have OpenSearch move all its indices around.

This process works for a whole cluster restart (not recommended unless absolutely necessary) or individual node restarts. If individual nodes are being restarted, only do one node at a time to avoid cluster panic should all shards of an index go offline on the nodes that get restarted. The cluster '''''should''''' recover, but why stress it unnecessarily.

The process is described in the [https://www.elastic.co/guide/en/elasticsearch/reference/current/restart-cluster.html Elastic documentation], and summarized below.

=== Disable Shard Allocation ===
When you shut down a data node, the allocation process waits for <code>index.unassigned.node_left.delayed_timeout</code> (by default, one minute) before starting to replicate the shards on that node to other nodes in the cluster, which can involve a lot of I/O. Since the node is shortly going to be restarted, this I/O is unnecessary. You can avoid racing the clock by disabling allocation of replicas before shutting down data nodes:
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
"persistent": {
"cluster.routing.allocation.enable": "primaries"
}
}'

=== Flush Cache ===
This will prevent data loss by flushing any in-process activities to disk:
curl -X POST "localhost:9200/_flush/synced?pretty"

=== Shut Down/Restart Nodes ===
Now that things have settled out, we can stop/restart the opensearch services. Note that a full cluster shutdown should be avoided, since any activity on the cluster as it is going down or coming up could cause errors to the user. Use the systemd commands to stop/start/restart the nodes:
# do this on all nodes
sudo systemctl stop opensearch
... do required activity on all nodes ...
sudo systemctl start opensearch
... or ...
# restart a single node
sudo systemctl restart opensearch
If it was a complete cluster shutdown, all nodes must rejoin the cluster before proceeding. In either case, wait until the cluster (or node) achieves a "YELLOW" state before proceeding ... this means that all primary shards are accounted for, but replicas may not be.

=== Enable Shard Allocation ===
As each node re-starts, it tries to identify which shards it had before. Unless something has happened to disrupt the node's data, it will simply reactivate those shards (most likely as replica shards if it is a single-node restart). To enable the cluster to get back to normal shard management, we remove the block we set above:
curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
"persistent": {
"cluster.routing.allocation.enable": null
}
}'

=== Follow Through ===
When doing any activity that disrupts the cluster operations enough to cause it to drop into a "YELLOW" or "RED" state, wait until the cluster is "GREEN" again before proceeding. In the case of a full rolling restart (one node at a time), the cluster should be "GREEN" before proceeding to the next node.
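
Waiting for a given state can be scripted with the <code>wait_for_status</code> parameter of the cluster health API. The following is a sketch only: the function names are hypothetical, <code>HOST</code> and <code>AUTH</code> are placeholders, and HTTPS is assumed.

```shell
#!/bin/sh
# Sketch: block until the cluster reaches at least the requested status.
# HOST and AUTH are placeholders; adjust them for the cluster.
HOST=${HOST:-localhost:9200}
AUTH=${AUTH:-}

# Build the cluster health URL for a status (yellow or green) and a timeout.
health_url() {
    printf 'https://%s/_cluster/health?wait_for_status=%s&timeout=%s' \
        "$HOST" "$1" "${2:-60s}"
}

# Succeeds only if the status was reached before the timeout expired.
wait_for_status() {
    # $AUTH is deliberately unquoted so "-u user:pass" splits into two words
    curl -s $AUTH "$(health_url "$1" "$2")" | grep -q '"timed_out":false'
}
```

For example, <code>wait_for_status green 120s || echo "cluster is not GREEN yet"</code> could be run before moving on to the next node.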

This process (node rolling restart, anyway) is captured in the script <code>node-restart.sh</code>:
#!/bin/bash
#
# perform a soft restart of a node, minimizing the reindexing that would
# normally occur when a node drops/inserts

# set to one of the master nodes
MASTER=master-node

# replace with either the real admin password or a cert/key combination
AUTH="-u admin:admin"

# first shut off shard allocation
curl -XPUT -H "Content-Type: application/json" <nowiki>https://$MASTER:9200/_cluster/settings</nowiki> $AUTH -d '
{
"persistent": {
"cluster.routing.allocation.enable": "primaries"
}
}'
echo

# now flush the cache
curl -XPOST <nowiki>https://$MASTER:9200/_flush</nowiki> $AUTH
echo

# restart the opensearch process
sudo systemctl restart opensearch

# re-enable shard allocation
curl -XPUT -H "Content-Type: application/json" <nowiki>https://$MASTER:9200/_cluster/settings</nowiki> $AUTH -d '
{
"persistent": {
"cluster.routing.allocation.enable": null
}
}'
echo
This script is intended to be run on the node being restarted.

== Cluster Upgrade ==
Assuming there are no other adjustments due to features and/or bugfixes, the basic process for upgrading a cluster is to upgrade one node at a time:

# Expand tarball in the home directory of the <code>opensearch</code> user
# Copy the certificates, keys, and CAs from the <code>config</code> directory of the current version into the new one
# Copy the <code>opensearch.yml</code> file from the current version into the new one and make any adjustments
# Copy the <code>jvm.options.d/heapsize.options</code> file from the current version into the new one
# Move the <code>opensearch</code> link in the home directory from the current version into the new one
# Reload and restart the opensearch process using the '''Rolling Restart''' process described above
# Wait for the cluster to get to a "GREEN" state before moving to the next node

This assumes that the systemd unit file uses the softlink to access the opensearch executable as described in [[OpenSearch Cluster Installation]].
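
Steps 2 through 5 above could be sketched as follows for a single node. The function name <code>stage_upgrade</code> is hypothetical, and the sketch assumes that the certificates and keys in the <code>config</code> directory use <code>.crt</code> and <code>.key</code> extensions and that the new tarball has already been expanded.

```shell
#!/bin/sh
# Sketch of steps 2-5 for one node.
#   $1 = home directory of the opensearch user
#   $2 = directory name of the current version
#   $3 = directory name of the new version (already expanded)
# Assumes certificates and keys in config/ end in .crt and .key.
stage_upgrade() {
    home=$1; old=$2; new=$3
    cp -p "$home/$old"/config/*.crt "$home/$old"/config/*.key \
        "$home/$new/config/" || return 1
    cp -p "$home/$old/config/opensearch.yml" "$home/$new/config/" || return 1
    mkdir -p "$home/$new/config/jvm.options.d" || return 1
    cp -p "$home/$old/config/jvm.options.d/heapsize.options" \
        "$home/$new/config/jvm.options.d/" || return 1
    # re-point the "opensearch" softlink at the new version
    ln -sfn "$new" "$home/opensearch"
}
```

Any adjustments to the copied <code>opensearch.yml</code> still have to be made by hand before the node is restarted.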

== Certificate Updates ==
Given that the standard lifetime for certificates these days is one year, this is something that will need to be accomplished several times during the lifetime of a cluster.

This process does not require downtime for the cluster, but care must be taken to ensure that the mix of old and new certificates does not cripple the cluster. This results in up to three complete cluster restarts (using the '''Rolling Restart''' method described above), but it does not disrupt the operation of the cluster as long as nothing exceptional happens, and the cluster is allowed to stabilize between each node restart.
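
To get advance warning of an upcoming expiration, a check along these lines could be added to a periodic job. This is a sketch; the function name <code>expires_within</code> is hypothetical, and it relies on the <code>-checkend</code> option of <code>openssl x509</code>.

```shell
#!/bin/sh
# Sketch: succeed if the certificate in $1 expires within the next $2 days.
expires_within() {
    # -checkend exits non-zero when the certificate expires within N seconds
    ! openssl x509 -checkend "$(( $2 * 86400 ))" -noout -in "$1" >/dev/null
}
```

For example, <code>expires_within config/node.crt 30 && echo "renew soon"</code>, where the certificate path is a placeholder.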

=== Create New Certificates ===
Using the process outlined in [https://wiki.williams-net.org/wiki/OpenSearch_Cluster_Installation#Create_Certificates OpenSearch Cluster Creation], create the required certificates and keys. If at all possible, keep the DNs on all certificates the same as the ones they are replacing ... it reduces the number of restarts and minimizes the edits of the config files. It also helps if the node certificate/key files have the same names ... for the same reasons.

Send each node its new certificate/keys and the admin certificate/key -- or make them available for remote retrieval. If user certificates were created, send them to the system(s) where they will be used.

=== Create CA Bundle ===
Using a CA bundle with the old and new CAs will allow us to do a gradual transition and avoid downtime. Fortunately, creating a bundle for the CAs is easy:
cat <old CA> <new CA> >> rootCAbundle.crt
Again, if there is currently a bundle in place, use the same name for the bundle file and you'll save yourself a lot of editing. If the REST API uses a different CA bundle already that supports other (external) CAs, create a second CA bundle that includes the external CAs, the old internal CA, and the new internal CA.
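
A quick sanity check on the resulting bundle is to count the certificates it contains. This is a sketch with a hypothetical function name, <code>count_certs</code>; it only counts PEM headers and does not validate the certificates.

```shell
#!/bin/sh
# Sketch: count the PEM certificates contained in a bundle file.
count_certs() {
    grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}
```

A bundle holding the old and new internal CAs should report 2. Running <code>openssl verify -CAfile rootCAbundle.crt <node>.crt</code> against both an old and a new node certificate is a further check that the bundle covers the transition.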

Send this new CA bundle to all cluster nodes, dashboard instances, and logstash instances.

=== Update CA Files ===
For each cluster node:

* Copy the new CA bundle in place (in the opensearch <code>config</code> directory) making sure that permissions allow for read access to the <code>opensearch</code> user.
* DO NOT copy the node certificates yet.
* Edit <code>opensearch.yml</code> if needed:
** Update the name of the CA to the new CA bundle file (if needed) in both the <code>ssl</code> and <code>transport</code> parameters, using the appropriate bundle for each parameter (see above)
** Add the DNs (DO NOT replace them yet) for the node and admin certificates if the new DNs are different from what is in the file

* Perform a '''Rolling Restart''' of the node, verifying that there are no SSL/certificate errors, and waiting for "GREEN" status before proceeding

Now do the same for all dashboards, logstash instances, and any other clients that use the CA to validate cluster identity:

* Install the CA bundle as appropriate, ensuring that the client has read access to the CA bundle.
* Update the CA filename in the configuration file (if needed)
* DO NOT update user certificates at this point
* Restart the client (if needed) and verify proper operation

=== Update Node Certificates ===
For each node in the cluster:

* Copy the node certificate and key and the admin certificate and key into the opensearch <code>config</code> directory, verifying that the <code>opensearch</code> user has read permissions
* Edit <code>opensearch.yml</code> if needed to correct certificate file names
* DO NOT remove extra DNs from the admin or node sections of the file yet
* Perform a '''Rolling Restart''' of the node, waiting for the cluster to achieve "GREEN" status before going to the next node

=== Update Client Certificates ===
For each client that uses certificate authentication:

* Copy the client certificate bundle (.p12 file) into place
* Edit the configuration file if needed to reflect the new file
* Restart the client (if needed) to activate the new configuration

=== Clean Up ===
At this point we can remove any old DNs from the <code>opensearch.yml</code> configurations and in general remove all traces of the old certificates in the cluster nodes and clients. If a restart is warranted to verify that nothing critical was deleted, that should be done at this time.

It is not necessary to replace the new CA bundle with just the new CA certificate -- doing so serves no purpose, as the old CA cert is simply along for the ride at this point and will no longer be used once it reaches its expiration date. If it must be done, use the same process above that was used to install the CA bundle, performing a rolling restart on each node as it is completed.

== Troubleshooting ==
Setting debug-level logging for opensearch and opensearch-dashboards:
# Opensearch
log4j2.properties:
rootLogger.level = debug

# Opensearch Dashboards
opensearch_dashboards.yml:
logging.verbose: true

OpenSearch Cluster Installation

2021-11-30T22:18:01Z

DrEdWilliams: /* Install OpenSearch Dashboards */

These instructions describe the creation of a multi-node OpenSearch cluster using the tarball distribution. The configuration instructions below assume a cluster with the following components:

* Multiple master/ingest nodes
* Multiple hot data nodes
* Multiple cold data nodes

The instructions should be adaptable to any multi-node configuration, and even to a uniform node configuration (i.e. three master/ingest/data nodes) -- though that simple configuration does not need the multiple steps required for the more complex deployment.

Read through the entire document before doing anything, as there is a specific order in which things must be done for a successful installation.

== Node Preparation ==
Each node to be used in the cluster must be prepared to support the OpenSearch installation.

=== Create Opensearch User ===
Debian/Ubuntu:
sudo adduser --system --uid 968 --shell /bin/bash --gecos 'OpenSearch User' --group --disabled-password --home /opt/opensearch opensearch
Fedora/RHEL:
sudo adduser --system --uid 968 --shell /bin/bash --home-dir /opt/opensearch -m opensearch
Then add your normal user to the group:
sudo usermod -aG opensearch ewilliam

=== Set vm.max_map_count ===
Create <code>/etc/sysctl.d/vm.maxmap_count.conf</code> (as root) and add the following line:
vm.max_map_count=262144
Run the following command as root to load the new configuration:
sysctl --system
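
To confirm that the setting took effect, the running value can be compared against the required minimum. This is a sketch; the function name <code>map_count_ok</code> is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed if the given vm.max_map_count value is at least 262144.
map_count_ok() {
    [ "$1" -ge 262144 ]
}
```

For example: <code>map_count_ok "$(sysctl -n vm.max_map_count)" && echo "vm.max_map_count is OK"</code>.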

=== Systemd Unit File ===
One of the big things missing from the tarball distribution is a systemd unit file. Fortunately, it isn't difficult to create a simple one that will work.

Create a file <code>/etc/systemd/system/opensearch.service</code> with the following contents:
[Unit]
Description=OpenSearch
Documentation=<nowiki>https://opensearch.org/docs/</nowiki>
Wants=network-online.target
After=network-online.target

[Service]
Type=exec
RuntimeDirectory=opensearch
PrivateTmp=true
Environment=OPENSEARCH_HOME=/opt/opensearch/opensearch
Environment=OPENSEARCH_PATH_CONF=/opt/opensearch/opensearch/config
Environment=PID_DIR=/opt/opensearch/opensearch
Environment=OPENSEARCH_SD_NOTIFY=true
EnvironmentFile=-/opt/opensearch/opensearch/opensearch.env

WorkingDirectory=/opt/opensearch/opensearch

User=opensearch
Group=opensearch

ExecStart=/opt/opensearch/opensearch/bin/opensearch -p ${PID_DIR}/opensearch.pid --quiet

# StandardOutput is configured to redirect to journalctl since
# some error messages may be logged in standard output before
# the OpenSearch logging system is initialized. OpenSearch
# stores its logs in the directory set by path.logs and does not use
# journalctl by default. If you also want to enable journalctl
# logging, you can simply remove the "quiet" option from ExecStart.
StandardOutput=journal
StandardError=inherit

# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65535

# Specifies the maximum number of processes
LimitNPROC=4096

# Specifies the maximum size of virtual memory
LimitAS=infinity

# Specifies the maximum file size
LimitFSIZE=infinity

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0

# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM

# Send the signal only to the JVM rather than its control group
KillMode=process

# Java process is never killed
SendSIGKILL=no

# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143

# Allow a slow startup before the systemd notifier module kicks in to extend the timeout
TimeoutStartSec=75

[Install]
WantedBy=multi-user.target

== Create Certificates ==
There are three types of certificates used in OpenSearch:

* CA certificate - one for the whole cluster to validate everything else
* Node certificate/key - identifies the node for internal cluster communications (transport layer) and data ingest (http layer)
** the CN must be the FQDN of the node
** even though OpenSearch doesn't seem to use them at this point, there should be SANs for the full FQDN, the short hostname, and the node's IP address
* User certificate/key - identifies a specific OpenSearch user
** the CN is a username in OpenSearch (explicit or implicit)
** username is used to determine roles/privileges in the cluster
** Need at least one cert for admin authentication

Node and User certificates must all be signed by the cluster CA, and '''all keys must be in PKCS8 format'''. If user certificates are to be used for authentication by logstash, they must be imported (along with the key and CA) into a PKCS12 (.p12) keystore. The OpenSearch documentation has a good piece on [https://opensearch.org/docs/latest/security-plugin/configuration/generate-certificates/ how to create certificates].

In summary, create the following certificates/keys:

# CA certificate/key
# For each cluster node:
#* <node>.crt
#* <node>-pkcs8.key
# For each user:
#* <user>.crt
#* <user>-pkcs8.key
#* <user>.p12

All these should be created and available for the configuration of each cluster node below.
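
As a rough sketch of what this involves, the functions below create a CA and one CA-signed node certificate with an unencrypted PKCS8 key using <code>openssl</code>. The function names, file names, key size, validity periods, and DN layout (<code>O=OpenSearch, C=US</code>) are all illustrative assumptions; the OpenSearch documentation linked above remains the reference.

```shell
#!/bin/sh
# Sketch: create a CA and one CA-signed node certificate with a PKCS8 key.
# File names, lifetimes and the DN layout are illustrative assumptions.
make_ca() {
    openssl genrsa -out ca.key 2048 2>/dev/null
    openssl req -new -x509 -sha256 -key ca.key -days 730 \
        -subj "/C=US/O=OpenSearch/CN=Cluster CA" -out ca.crt
}

# $1 = FQDN of the node; writes <fqdn>.crt and <fqdn>-pkcs8.key
make_node_cert() {
    fqdn=$1
    openssl genrsa -out "$fqdn.tmp.key" 2048 2>/dev/null
    # convert the key to unencrypted PKCS8 format
    openssl pkcs8 -topk8 -nocrypt -in "$fqdn.tmp.key" -out "$fqdn-pkcs8.key"
    openssl req -new -key "$fqdn-pkcs8.key" \
        -subj "/C=US/O=OpenSearch/CN=$fqdn" -out "$fqdn.csr"
    # only the FQDN is added as a SAN here; extend the list as needed
    printf 'subjectAltName=DNS:%s\n' "$fqdn" > "$fqdn.ext"
    openssl x509 -req -in "$fqdn.csr" -CA ca.crt -CAkey ca.key -CAcreateserial \
        -sha256 -days 365 -extfile "$fqdn.ext" -out "$fqdn.crt" 2>/dev/null
    rm -f "$fqdn.tmp.key" "$fqdn.csr" "$fqdn.ext"
}
```

User certificates can be produced the same way with the username as the CN. A PKCS12 keystore can then be built with <code>openssl pkcs12 -export -in <user>.crt -inkey <user>-pkcs8.key -certfile <CAcert> -out <user>.p12</code>.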

== Install OpenSearch Tarball ==
The OpenSearch tarball must be installed on each node in the cluster; there is no difference at this point between the installations on the different types of nodes. Assuming the deployment will use the <code>opensearch</code> user created above ... change to the <code>opensearch</code> user (from root):
su - opensearch
cd ~
Run the following commands as user <code>opensearch</code> (from the opensearch home directory <code>/opt/opensearch</code>):
wget https://artifacts.opensearch.org/releases/bundle/opensearch/1.2.0/opensearch-1.2.0-linux-x64.tar.gz
tar -xvzf opensearch-1.2.0-linux-x64.tar.gz
== Configure OpenSearch ==
Configuration varies depending on the type of node, but these are the places that need to be touched.

=== <code>jvm.options</code> ===
Set the heap size. Common practice is that the heap should be between 4GB and half the available memory, but no higher than 32GB. A minimum configuration might be (assuming the hot data nodes have 32GB installed RAM):

* 4G for master and cold data nodes
* 16G for hot data nodes

This must be done individually for each node -- create the file <code>config/jvm.options.d/heapsize.options</code> with the heap size options:
-Xms4g
-Xmx4g
... with, of course, the proper heap size for that node.
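
The sizing rule can be sketched as a small helper that prints the two option lines for a node. The function name <code>heap_options</code> is hypothetical; it takes the installed RAM in whole gigabytes, uses half of it, and caps the result at 32GB. It does not enforce the 4GB lower bound, so small nodes still need a manual decision.

```shell
#!/bin/sh
# Sketch: print heap options for a node with $1 GB of installed RAM,
# using half of the memory and capping the heap at 32 GB.
heap_options() {
    heap=$(( $1 / 2 ))
    if [ "$heap" -gt 32 ]; then
        heap=32
    fi
    printf -- '-Xms%sg\n-Xmx%sg\n' "$heap" "$heap"
}
```

For example, <code>heap_options 32 > config/jvm.options.d/heapsize.options</code> would write a 16GB heap configuration for a hot data node.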

=== <code>opensearch.yml</code> ===
While the OpenSearch software can be rather forgiving for low-security environments, if you turn on full hostname verification, it is rather strict. The biggest thing to remember is that '''SAN entries in the node certificates aren't always respected.''' What this boils down to is that if full verification is turned on, you cannot use IP addresses in the <code>opensearch.yml</code> (or any other configuration file such as the logstash pipelines).

Each node will have a unique version of opensearch.yml due to certificate file names, the role(s) for the node, and the temperature (for data nodes). The significant parameters in the opensearch.yml file are:
* <code>cluster.name</code> - must be identical in all nodes
* <code>node.name</code> - can be the hostname, but doesn't have to be (must be unique)
* <code>path.data</code> - data nodes only -- must be unique to the node (i.e., if using shared storage, each node must have its own directory on the shared volume)
* <code>path.logs</code> - must be unique to the node (same as above)
* <code>network.host</code> -- host's FQDN where all traffic will be received
**Can be a single hostname or an array of hostnames in <code>["name", "name"]</code> format
**If you are doing transport-level host verification this CANNOT be an IP -- OpenSearch can't handle the SANs in the certificates needed to make it work
* <code>http.port</code> -- leave it as the default (<code>9200</code>) but uncomment it to lock it in
* <code>discovery.seed_hosts</code> -- FQDN for all the masters (same for all nodes) -- only used for cluster bootstrap
* <code>cluster.initial_master_nodes</code> -- put the master node names here (not the hostnames) -- same for all nodes
* <code>node.[master|data|ingest|remote]</code> - set the type of the node (specify all four)
* <code>node.attr.temp</code> - (data nodes only) defines data storage tiers (<code>hot</code>/<code>warm</code>/<code>cold</code>)
The stock configuration file has most of these specified -- edit the file to fit the configuration of the node.

The security configuration must be added next. Since we are not using the demo security as our template, all the relevant entries must be added now. Copy this block into the end of the stock opensearch.yml, editing as needed to provide the relevant certificate file names:
node.master: true
node.ingest: true
node.data: false
#node.remote: false

#node.attr.temp: hot

######## Start OpenSearch Security Configuration ########
plugins.security.ssl.transport.pemcert_filepath: <node>.crt
plugins.security.ssl.transport.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.transport.pemtrustedcas_filepath: <CAcert>
plugins.security.ssl.transport.enforce_hostname_verification: true

plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: <node>.crt
plugins.security.ssl.http.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.http.pemtrustedcas_filepath: <CAcert>

plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true

# at least one cert -- must be an EXACT copy of the DN for the certificates to be used for admin auth
plugins.security.authcz.admin_dn:
- CN=Admin,O=OpenSearch,C=US

# these must be EXACT copies of the DN for the node certificates
plugins.security.nodes_dn:
- CN=master-1.opensearch.localnet,O=OpenSearch,C=US
- CN=data-hot-1.opensearch.localnet,O=OpenSearch,C=US

# this is all boiler plate ... no modifications needed
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
# this next bit should ALL be on one line ...
plugins.security.system_indices.indices: [".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opendistro-notifications-*", ".opendistro-notebooks", ".opendistro-asynchronous-search-response*", ".replication-metadata-store"]

node.max_local_storage_nodes: 3

######## End OpenSearch Security Configuration ########
Note that there are two sections where the DNs from the certificates must be provided to enable admin authentication for security configuration, and for validating the nodes that will be part of the cluster. You can easily show the DNs for your certificates using this command:
openssl x509 -subject -nameopt RFC2253 -noout -in <cert file name>
The DN for the admin cert and the DNs for the node certs go in separate sections as shown above. Add all the cluster nodes into the nodes_dn section -- this must be the same for all nodes or they will not be able to communicate.
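
With more than a few nodes, the list entries can be generated rather than typed. The sketch below wraps the command above; the function name <code>dn_entries</code> is hypothetical. It prints one YAML list item per certificate file.

```shell
#!/bin/sh
# Sketch: print a YAML list entry with the RFC2253 DN of each certificate
# file given as an argument.
dn_entries() {
    for cert in "$@"; do
        # strip the "subject=" prefix (with or without a trailing space)
        openssl x509 -subject -nameopt RFC2253 -noout -in "$cert" \
            | sed -e 's/^subject= *//' -e 's/^/- /'
    done
}
```

Running it over all node certificates gives the lines to paste under <code>plugins.security.nodes_dn</code>; the same output can be used on every node, since the list must be identical across the cluster.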

=== Certificates ===
Copy the following certificates/keys that were created previously into the config directory of each node:

* <CAcert>
* <node>.crt
* <node>-pkcs8.key
* admin.crt
* admin-pkcs8.key
Verify that the <code>opensearch</code> user is able to at least read all of the certs/keys, or the cluster will not function.

=== Security Plugin Configuration ===
While the security configuration files exist on all the nodes in the cluster, they are only used once and only by the node where the security initialization is performed. With that in mind, pick one node (the first master node will do nicely) to edit the configuration files as needed. There are many things that can be done with these files, but two items in particular should be addressed. The files are located in the <code>plugins/opensearch-security/securityconfig</code> directory in the opensearch distribution.

==== Authentication ====
<code>config.yml</code> is where the authentication options are established. This will need to be configured to support your authentication options. The OpenSearch documentation describes the different options [https://opensearch.org/docs/latest/security-plugin/configuration/configuration/ here].

The default configuration supports basic username/password authentication, but if you wish to use certificate authentication for API clients or ingest pipelines using logstash, you need to enable both '''http''' and '''transport''' under <code>clientcert_auth_domain</code>. Proxy authentication and OpenID Connect authentication are possible as well ... see the documentation.

==== Admin user password ====
The default password for the admin user is well-known, and should be changed for a production environment. The OpenSearch documentation provides instructions on how to [https://opensearch.org/docs/latest/security-plugin/configuration/yaml/#internal_usersyml generate passwords] for the <code>internal_users.yml</code> file. Note that '''the admin password cannot be set through the dashboard or the API''' -- it must be set in this file, and then (re)loaded to take effect.

=== Security Plugin Initialization ===
Once all security configuration files are finished, they must be loaded into the OpenSearch cluster. This can only be done once enough of the cluster is up to provide a master node and at least one data node, as the security configuration is loaded into an index in the cluster. See the instructions below for specific sequencing. This command will initialize the security plugin:
plugins/opensearch-security/tools/securityadmin.sh \
-cacert config/<CA cert> -cert config/admin.crt -key config/admin-pkcs8.key \
-cd plugins/opensearch-security/securityconfig/ -host master-1.opensearch.localnet -icl -nhnv
... but again -- wait to do this until the right time below.

=== Starting The Node ===
Subject to the constraints described in the next section, the only thing left is to start the opensearch service:
sudo systemctl enable --now opensearch
Again, this must be done on each node in the correct sequence ... see below.

== Cluster Bootstrapping ==
Starting up a multi-node cluster is a bit tricky:

* Master-only nodes have no data
* Data-only nodes aren't master eligible
* Cluster initialization requires indexes to be available

'''Note that this process does NOT use the demo configuration -- put all the security plugin items in the <code>opensearch.yml</code> file from the beginning'''. This will cause the <code>opensearch-tar-install.sh</code> script to skip the security initialization -- this is intentional.

What this means is that you have to stand up multiple nodes before anything starts working ... the process should be:

# Install and start what will be the 'primary' master ... reference point for cluster startup operations
# Install and start one of the 'hot' data nodes
# Run the 'securityadmin.sh' script (using the command above) to initialize the security subsystem
# Install/start the other master nodes so that there is resiliency as nodes go up and down
# Install/start the other data nodes to allow replication of the system indexes.

== Install OpenSearch Dashboards ==
Finally, install the OpenSearch Dashboards using the tarball as well.

Run the following commands as opensearch user:
wget https://artifacts.opensearch.org/releases/bundle/opensearch-dashboards/1.2.0/opensearch-dashboards-1.2.0-linux-x64.tar.gz
tar -zxf opensearch-dashboards-1.2.0-linux-x64.tar.gz

Configure the <code>opensearch_dashboards.yml</code> file as needed for your situation.

Create file <code>/etc/systemd/system/opensearch-dashboards.service</code> (as root) with the following contents:
[Unit]
Description=OpenSearch-Dashboards

[Service]
Type=simple
User=opensearch
Group=opensearch
ExecStart=/opt/opensearch/opensearch-dashboards-1.2.0/bin/opensearch-dashboards
Restart=always
WorkingDirectory=/opt/opensearch/opensearch-dashboards-1.2.0

[Install]
WantedBy=multi-user.target
Run the following commands as root:
systemctl daemon-reload
systemctl start opensearch-dashboards.service
systemctl enable opensearch-dashboards.service

== References ==

* https://cryptopoolparty.com/blog/cardano-stake-pool-monitoring-ofo-stack/
** Describes how to use OpenSearch to monitor another application -- but portions were used to create this document
*https://github.com/opensearch-project/dashboards-reports/releases/tag/chromium-1.12.0.0
**download for headless chrome for generating PDF/PNG reports

OpenSearch Cluster Installation

2021-11-30T20:32:32Z

DrEdWilliams: /* Systemd Unit File */

These instructions describe the creation of a multi-node OpenSearch cluster using the tarball distribution. The configuration instructions below assume a cluster with the following components:

* Multiple master/ingest nodes
* Multiple hot data nodes
* Multiple cold data nodes

The instructions should be adaptable to any multi-node configuration, and even to a uniform node configuration (i.e. three master/ingest/data nodes) -- though that simple configuration does not need the multiple steps required for the more complex deployment.

Read through the entire document before doing anything, as there is a specific order in which things must be done for a successful installation.

== Node Preparation ==
Each node to be used in the cluster must be prepared to support the OpenSearch installation.

=== Create Opensearch User ===
Debian/Ubuntu:
sudo adduser --system --uid 968 --shell /bin/bash --gecos 'OpenSearch User' --group --disabled-password --home /opt/opensearch opensearch
Fedora/RHEL:
sudo adduser --system --uid 968 --shell /bin/bash --home-dir /opt/opensearch -m opensearch
Then add your normal user to the group:
sudo usermod -aG opensearch ewilliam

=== Set vm.max_map_count ===
Create <code>/etc/sysctl.d/vm.maxmap_count.conf</code> (as root) and add the following line:
vm.max_map_count=262144
Run the following command as root to load the new configuration:
sysctl --system

=== Systemd Unit File ===
One of the big things missing from the tarball distribution is a systemd unit file. Fortunately, it isn't difficult to create a simple one that will work.

Create a file <code>/etc/systemd/system/opensearch.service</code> with the following contents:
[Unit]
Description=OpenSearch
Documentation=<nowiki>https://opensearch.org/docs/</nowiki>
Wants=network-online.target
After=network-online.target

[Service]
Type=exec
RuntimeDirectory=opensearch
PrivateTmp=true
Environment=OPENSEARCH_HOME=/opt/opensearch/opensearch
Environment=OPENSEARCH_PATH_CONF=/opt/opensearch/opensearch/config
Environment=PID_DIR=/opt/opensearch/opensearch
Environment=OPENSEARCH_SD_NOTIFY=true
EnvironmentFile=-/opt/opensearch/opensearch/opensearch.env

WorkingDirectory=/opt/opensearch/opensearch

User=opensearch
Group=opensearch

ExecStart=/opt/opensearch/opensearch/bin/opensearch -p ${PID_DIR}/opensearch.pid --quiet

# StandardOutput is configured to redirect to journalctl since
# some error messages may be logged in standard output before
# the OpenSearch logging system is initialized. OpenSearch
# stores its logs in the directory set by path.logs and does not
# use journalctl by default. If you also want to enable journalctl
# logging, you can simply remove the "quiet" option from ExecStart.
StandardOutput=journal
StandardError=inherit

# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65535

# Specifies the maximum number of processes
LimitNPROC=4096

# Specifies the maximum size of virtual memory
LimitAS=infinity

# Specifies the maximum file size
LimitFSIZE=infinity

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0

# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM

# Send the signal only to the JVM rather than its control group
KillMode=process

# Java process is never killed
SendSIGKILL=no

# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143

# Allow a slow startup before the systemd notifier module kicks in to extend the timeout
TimeoutStartSec=75

[Install]
WantedBy=multi-user.target

== Create Certificates ==
There are three types of certificates used in OpenSearch:

* CA certificate - one for the whole cluster to validate everything else
* Node certificate/key - identifies the node for internal cluster communications (transport layer) and data ingest (http layer)
** the CN must be the FQDN of the node
** even though OpenSearch doesn't seem to use them at this point, there should be SANs for the full FQDN, the short hostname, and the node's IP address
* User certificate/key - identifies a specific OpenSearch user
** the CN is a username in OpenSearch (explicit or implicit)
** username is used to determine roles/privileges in the cluster
** Need at least one cert for admin authentication

Node and User certificates must all be signed by the cluster CA, and '''all keys must be in PKCS8 format'''. If user certificates are to be used for authentication by logstash, they must be imported (along with the key and CA) into a PKCS12 (.p12) keystore. The OpenSearch documentation has a good piece on [https://opensearch.org/docs/latest/security-plugin/configuration/generate-certificates/ how to create certificates].

In summary, create the following certificates/keys

# CA certificate/key
# For each cluster node:
#* <node>.crt
#* <node>-pkcs8.key
# For each user:
#* <user>.crt
#* <user>-pkcs8.key
#* <user>.p12

All these should be created and available for the configuration of each cluster node below.
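
As a rough illustration of the list above, the following sketch creates a CA, one node certificate, and one admin user certificate with <code>openssl</code>. It is not the only way to do this, and several details are assumptions for the example: the file names <code>ca.key</code>/<code>ca.crt</code>, the 730-day validity, the <code>192.0.2.10</code> placeholder address, and the <code>changeit</code> keystore password. The subject fields match the example DNs used later on this page.

```shell
#!/bin/sh
# Sketch only: file names, validity periods, the IP address, and the
# keystore password below are illustrative assumptions.
set -e

workdir=$(mktemp -d)   # scratch directory; use a protected location for real keys
cd "$workdir"

node=master-1.opensearch.localnet   # FQDN of the node (example name from this page)
node_ip=192.0.2.10                  # placeholder address, replace with the node's IP

# 1. Cluster CA key and self-signed CA certificate
openssl genrsa -out ca.key 2048
openssl req -new -x509 -sha256 -days 730 -key ca.key \
    -subj "/C=US/O=OpenSearch/CN=Cluster CA" -out ca.crt

# 2. Node key, converted to unencrypted PKCS8
openssl genrsa -out node-tmp.key 2048
openssl pkcs8 -topk8 -nocrypt -inform PEM -outform PEM \
    -in node-tmp.key -out "${node}-pkcs8.key"

# 3. Node CSR and CA-signed certificate with SANs for FQDN, short name, and IP
openssl req -new -key "${node}-pkcs8.key" \
    -subj "/C=US/O=OpenSearch/CN=${node}" -out node.csr
printf 'subjectAltName=DNS:%s,DNS:%s,IP:%s\n' \
    "$node" "${node%%.*}" "$node_ip" > node.ext
openssl x509 -req -sha256 -days 730 -in node.csr \
    -CA ca.crt -CAkey ca.key -CAcreateserial \
    -extfile node.ext -out "${node}.crt"

# 4. Admin user key (PKCS8) and CA-signed certificate
openssl genrsa -out admin-tmp.key 2048
openssl pkcs8 -topk8 -nocrypt -inform PEM -outform PEM \
    -in admin-tmp.key -out admin-pkcs8.key
openssl req -new -key admin-pkcs8.key \
    -subj "/C=US/O=OpenSearch/CN=Admin" -out admin.csr
openssl x509 -req -sha256 -days 730 -in admin.csr \
    -CA ca.crt -CAkey ca.key -CAcreateserial -out admin.crt

# 5. PKCS12 keystore (certificate, key, and CA) for clients such as logstash
openssl pkcs12 -export -in admin.crt -inkey admin-pkcs8.key \
    -certfile ca.crt -name admin -passout pass:changeit -out admin.p12

# Remove intermediate files that are no longer needed
rm -f node-tmp.key admin-tmp.key node.csr admin.csr node.ext
echo "certificates written to $workdir"
```

Steps 2 and 3 would be repeated for every node, and steps 4 and 5 for every additional user.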

== Install OpenSearch Tarball ==
The OpenSearch tarball must be installed on each node in the cluster; there is no difference at this point between the installations on the different types of nodes. Assuming the deployment will use the <code>opensearch</code> user created above ... change to the <code>opensearch</code> user (from root):
su - opensearch
cd ~
Run the following commands as user <code>opensearch</code> (from the opensearch home directory <code>/opt/opensearch</code>):
wget https://artifacts.opensearch.org/releases/bundle/opensearch/1.2.0/opensearch-1.2.0-linux-x64.tar.gz
tar -xvzf opensearch-1.2.0-linux-x64.tar.gz
== Configure OpenSearch ==
Configuration varies depending on the type of node, but these are the places that need to be touched.

=== <code>jvm.options</code> ===
Set the heap size. Common practice is that the heap should be between 4GB and half the available memory, but no higher than 32GB. A minimum configuration might be (assuming the hot data nodes have 32GB installed RAM):

* 4G for master and cold data nodes
* 16G for hot data nodes

This must be done individually for each node -- create the file <code>config/jvm.options.d/heapsize.options</code> with the heap size options:
-Xms4g
-Xmx4g
... with, of course, the proper heap size for that node.
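
The rule of thumb above can be turned into a small helper. This is only a sketch: the function names <code>heap_gb</code> and <code>write_heap_options</code> are made up for this example, and the result is the upper bound suggested by the rule (half of RAM, capped at 32GB), not a tuned value for any particular workload.

```shell
#!/bin/sh
# heap_gb <installed RAM in GB>: half of RAM, capped at 32 (hypothetical helper)
heap_gb() {
    half=$(( $1 / 2 ))
    if [ "$half" -gt 32 ]; then
        half=32
    fi
    echo "$half"
}

# write_heap_options <heap in GB> <target file>: write matching -Xms/-Xmx lines
write_heap_options() {
    printf '%s\n' "-Xms${1}g" "-Xmx${1}g" > "$2"
}

# Example (run from the OpenSearch directory on a node with 32GB RAM):
#   write_heap_options "$(heap_gb 32)" config/jvm.options.d/heapsize.options
```

Setting <code>-Xms</code> and <code>-Xmx</code> to the same value, as in the example above, keeps the heap from being resized at run time.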

=== <code>opensearch.yml</code> ===
While the OpenSearch software can be rather forgiving for low-security environments, if you turn on full hostname verification, it is rather strict. The biggest thing to remember is that '''SAN entries in the node certificates aren't always respected.''' What this boils down to is that if full verification is turned on, you cannot use IP addresses in the <code>opensearch.yml</code> (or any other configuration file such as the logstash pipelines).

Each node will have a unique version of opensearch.yml due to certificate file names, the role(s) for the node, and the temperature (for data nodes). The significant parameters in the opensearch.yml file are:
* <code>cluster.name</code> - must be identical in all nodes
* <code>node.name</code> - can be the hostname, but doesn't have to be (must be unique)
* <code>path.data</code> - data nodes only -- must be unique to the node (i.e., if using shared storage, each node must have its own directory on the shared volume)
* <code>path.logs</code> - must be unique to the node (same as above)
* <code>network.host</code> -- host's FQDN where all traffic will be received
**Can be a single hostname or an array of hostnames in <code>["name", "name"]</code> format
**If you are doing transport-level host verification this CANNOT be an IP -- OpenSearch can't handle the SANs in the certificates needed to make it work
* <code>http.port</code> -- leave it as the default (<code>9200</code>) but uncomment it to lock it in
* <code>discovery.seed_hosts</code> -- FQDN for all the masters (same for all nodes) -- only used for cluster bootstrap
* <code>cluster.initial_master_nodes</code> -- put the master node names here (not the hostnames) -- same for all nodes
* <code>node.[master|data|ingest|remote]</code> - set the type of the node (specify all four)
* <code>node.attr.temp</code> - (data nodes only) defines data storage tiers (<code>hot</code>/<code>warm</code>/<code>cold</code>)
The stock configuration file has most of these specified -- edit the file to fit the configuration of the node.

The security configuration must be added next. Since we are not using the demo security as our template, all the relevant entries must be added now. Copy this block into the end of the stock opensearch.yml, editing as needed to provide the relevant certificate file names:
node.master: true
node.ingest: true
node.data: false
#node.remote: false

#node.attr.temp: hot

######## Start OpenSearch Security Configuration ########
plugins.security.ssl.transport.pemcert_filepath: <node>.crt
plugins.security.ssl.transport.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.transport.pemtrustedcas_filepath: <CAcert>
plugins.security.ssl.transport.enforce_hostname_verification: true

plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: <node>.crt
plugins.security.ssl.http.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.http.pemtrustedcas_filepath: <CAcert>

plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true

# at least one cert -- must be an EXACT copy of the DN for the certificates to be used for admin auth
plugins.security.authcz.admin_dn:
- CN=Admin,O=OpenSearch,C=US

# these must be EXACT copies of the DN for the node certificates
plugins.security.nodes_dn:
- CN=master-1.opensearch.localnet,O=OpenSearch,C=US
- CN=data-hot-1.opensearch.localnet,O=OpenSearch,C=US

# this is all boilerplate ... no modifications needed
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
# this next bit should ALL be on one line ...
plugins.security.system_indices.indices: [".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opendistro-notifications-*", ".opendistro-notebooks", ".opendistro-asynchronous-search-response*", ".replication-metadata-store"]

node.max_local_storage_nodes: 3

######## End OpenSearch Security Configuration ########
Note that there are two sections where the DNs from the certificates must be provided to enable admin authentication for security configuration, and for validating the nodes that will be part of the cluster. You can easily show the DNs for your certificates using this command:
openssl x509 -subject -nameopt RFC2253 -noout -in <cert file name>
The DN for the admin cert and the DNs for the node certs go in separate sections as shown above. Add all the cluster nodes into the nodes_dn section -- this must be the same for all nodes or they will not be able to communicate.
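
Because the DNs must match exactly, it can help to generate the list entries from the certificate files rather than typing them. The sketch below wraps the <code>openssl</code> command above; the function names <code>dn_entry</code> and <code>nodes_dn_block</code> are invented for this example:

```shell
#!/bin/sh
# Print one certificate's subject DN as a YAML list entry ("- CN=...,O=...,C=...").
# (dn_entry is a hypothetical helper name.)
dn_entry() {
    openssl x509 -subject -nameopt RFC2253 -noout -in "$1" \
        | sed -e 's/^subject= *//' -e 's/^/- /'
}

# Print a complete nodes_dn section for the node certificates given as arguments.
nodes_dn_block() {
    echo "plugins.security.nodes_dn:"
    for cert in "$@"; do
        dn_entry "$cert"
    done
}

# Example:
#   nodes_dn_block master-1.opensearch.localnet.crt data-hot-1.opensearch.localnet.crt
```

The output can be pasted into <code>opensearch.yml</code> on every node, which also keeps the section identical across the cluster.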

=== Certificates ===
Copy the following certificates/keys that were created previously into the config directory of each node:

* <CAcert>
* <node>.crt
* <node>-pkcs8.key
* admin.crt
* admin-pkcs8.key
Verify that the <code>opensearch</code> user is able to at least read all of the certs/keys, or the cluster will not function.
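
One way to check this is to test each file for readability while running as the <code>opensearch</code> user. A minimal sketch follows; <code>check_readable</code> is a made-up helper name, and the glob patterns in the usage comment assume the file naming used on this page:

```shell
#!/bin/sh
# Report every listed file the current user cannot read.
# Returns non-zero if at least one file is unreadable or missing.
# (check_readable is a hypothetical helper name.)
check_readable() {
    status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "not readable: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example, after 'su - opensearch' and from the OpenSearch directory:
#   check_readable config/*.crt config/*.key
```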

=== Security Plugin Configuration ===
While the security configuration files exist on all the nodes in the cluster, they are only used once and only by the node where the security initialization is performed. With that in mind, pick one node (the first master node will do nicely) to edit the configuration files as needed. There are many things that can be done with these files, but two items in particular should be addressed. The files are located in the <code>plugins/opensearch-security/securityconfig</code> directory in the opensearch distribution.

==== Authentication ====
<code>config.yml</code> is where the authentication options are established. This will need to be configured to support your authentication options. The OpenSearch documentation describes the different options [https://opensearch.org/docs/latest/security-plugin/configuration/configuration/ here].

The default configuration supports basic username/password authentication, but if you wish to use certificate authentication for API clients or ingest pipelines using logstash, you need to enable both '''http''' and '''transport''' under <code>clientcert_auth_domain</code>. Proxy authentication and OpenID Connect authentication are possible as well ... see the documentation.

==== Admin user password ====
The default password for the admin user is well-known, and should be changed for a production environment. The OpenSearch documentation provides instructions on how to [https://opensearch.org/docs/latest/security-plugin/configuration/yaml/#internal_usersyml generate passwords] for the <code>internal_users.yml</code> file. Note that '''the admin password cannot be set through the dashboard or the API''' -- it must be set in this file, and then (re)loaded to take effect.

=== Security Plugin Initialization ===
Once all security configuration files are finished, they must be loaded into the OpenSearch cluster. This can only be done once enough of the cluster is up to provide a master node and at least one data node, as the security configuration is loaded into an index in the cluster. See the instructions below for specific sequencing. This command will initialize the security plugin:
plugins/opensearch-security/tools/securityadmin.sh \
-cacert config/<CA cert> -cert config/admin.crt -key config/admin-pkcs8.key \
-cd plugins/opensearch-security/securityconfig/ -host master-1.opensearch.localnet -icl -nhnv
... but again -- wait to do this until the right time below.

=== Starting The Node ===
Subject to the constraints described in the next section, the only thing left is to start the opensearch service:
sudo systemctl enable --now opensearch
Again, this must be done on each node in the correct sequence ... see below.

== Cluster Bootstrapping ==
Starting up a multi-node cluster is a bit tricky:

* Master-only nodes have no data
* Data-only nodes aren't master eligible
* Cluster initialization requires indexes to be available

'''Note that this process does NOT use the demo configuration -- put all the security plugin items in the <code>opensearch.yml</code> file from the beginning'''. This will cause the <code>opensearch-tar-install.sh</code> script to skip the security initialization -- this is intentional.

What this means is that you have to stand up multiple nodes before anything starts working ... the process should be:

# Install and start what will be the 'primary' master ... reference point for cluster startup operations
# Install and start one of the 'hot' data nodes
# Run the 'securityadmin.sh' script (using the command above) to initialize the security subsystem
# Install/start the other master nodes so that there is resiliency as nodes go up and down
# Install/start the other data nodes to allow replication of the system indexes.
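
Once the security subsystem has been initialized (step 3), the state of the cluster can be checked through the REST API using the admin certificate. The following is a sketch under some assumptions: the CA file is named <code>ca.crt</code>, the certificates are in the config directory of the primary master, and the helper names <code>wait_until</code> and <code>os_get</code> are invented for this example.

```shell
#!/bin/sh
# Assumed names: ca.crt for the CA certificate, and the example master host
# name used elsewhere on this page.
host=master-1.opensearch.localnet
conf=/opt/opensearch/opensearch/config

# Retry a command: wait_until <tries> <delay-seconds> <command...>
wait_until() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Query a REST endpoint on the primary master with the admin certificate.
os_get() {
    curl --silent --fail \
        --cacert "$conf/ca.crt" \
        --cert "$conf/admin.crt" --key "$conf/admin-pkcs8.key" \
        "https://$host:9200/$1"
}

# Example, after securityadmin.sh has completed:
#   wait_until 30 10 os_get _cluster/health && os_get '_cat/nodes?v'
```

Listing the nodes after each of steps 4 and 5 shows whether the newly started node has actually joined the cluster.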

== Install OpenSearch Dashboards ==
Finally, install the OpenSearch Dashboards using the tarball as well.

Run the following commands as opensearch user:
wget https://artifacts.opensearch.org/releases/bundle/opensearch-dashboards/1.2.0/opensearch-dashboards-1.2.0-linux-x64.tar.gz
tar -zxf opensearch-dashboards-1.2.0-linux-x64.tar.gz

Configure the <code>opensearch-dashboards.yml</code> file as needed for your situation.

Create file <code>/etc/systemd/system/opensearch-dashboards.service</code> (as root) with the following contents:
[Unit]
Description=OpenSearch-Dashboards

[Service]
Type=simple
User=opensearch
Group=opensearch
ExecStart=/opt/opensearch/opensearch-dashboards-1.2.0/bin/opensearch-dashboards
Restart=always
WorkingDirectory=/opt/opensearch/opensearch-dashboards-1.2.0

[Install]
WantedBy=multi-user.target
Run the following commands as root:
systemctl daemon-reload
systemctl start opensearch-dashboards.service
systemctl enable opensearch-dashboards.service

== References ==

* https://cryptopoolparty.com/blog/cardano-stake-pool-monitoring-ofo-stack/
** Describes how to use OpenSearch to monitor another application -- but portions were used to create this document
*https://github.com/opensearch-project/dashboards-reports/releases/tag/chromium-1.12.0.0
**download for headless chrome for generating PDF/PNG reports

Cluster Operations

2021-11-26T16:47:56Z

DrEdWilliams: /* Rolling Restart */

OpenSearch Cluster Installation

2021-11-26T15:55:06Z

DrEdWilliams: /* Systemd Unit File */

These instructions describe the creation of a multi-node OpenSearch cluster using the tarball distribution. The configuration instructions below assume a cluster with the following components:

* Multiple master/ingest nodes
* Multiple hot data nodes
* Multiple cold data nodes

The instructions should be adaptable to any multi-node configuration, and even to a uniform node configuration (e.g. three master/ingest/data nodes) -- though that simple configuration does not need the multiple steps required for the more complex deployment.

Read through the entire document before doing anything, as there is a specific order in which things must be done for a successful installation.

== Node Preparation ==
Each node to be used in the cluster must be prepared to support the OpenSearch installation.

=== Create Opensearch User ===
Debian/Ubuntu:
sudo adduser --system --uid 968 --shell /bin/bash --gecos 'OpenSearch User' --group --disabled-password --home /opt/opensearch opensearch
Fedora/RHEL:
sudo adduser --system --uid 968 --shell /bin/bash --home-dir /opt/opensearch -m opensearch
Then add your normal user to the group:
sudo usermod -aG opensearch ewilliam

=== Set vm.max_map_count ===
Create <code>/etc/sysctl.d/vm.maxmap_count.conf</code> (as root) and add the following line:
vm.max_map_count=262144
Run the following command as root to load the new configuration:
sysctl --system

=== Systemd Unit File ===
One of the big things missing from the tarball distribution is a systemd unit file. Fortunately, it isn't difficult to create a simple one that will work.

Create a file <code>/etc/systemd/system/opensearch.service</code> with the following contents:
[Unit]
Description=OpenSearch
Documentation=<nowiki>https://opensearch.org/docs/</nowiki>
Wants=network-online.target
After=network-online.target

[Service]
RuntimeDirectory=opensearch
WorkingDirectory=/opt/opensearch/opensearch
User=opensearch
Group=opensearch

ExecStart=/opt/opensearch/opensearch/opensearch-tar-install.sh

# StandardOutput is configured to redirect to journalctl since
# some error messages may be logged in standard output before
# elasticsearch logging system is initialized. Elasticsearch
# stores its logs in /var/log/elasticsearch and does not use
# journalctl by default. If you also want to enable journalctl
# logging, you can simply remove the "quiet" option from ExecStart.
StandardOutput=journal
StandardError=inherit

# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65535

# Specifies the maximum number of processes
LimitNPROC=4096

# Specifies the maximum size of virtual memory
LimitAS=infinity

# Specifies the maximum file size
LimitFSIZE=infinity

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0

# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM

# Send the signal only to the JVM rather than its control group
KillMode=process

# Java process is never killed
SendSIGKILL=no

# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143

# Allow a slow startup before the systemd notifier module kicks in to extend the timeout
TimeoutStartSec=75

[Install]
WantedBy=multi-user.target

== Create Certificates ==
There are three types of certificates used in OpenSearch:

* CA certificate - one for the whole cluster to validate everything else
* Node certificate/key - identifies the node for internal cluster communications (transport layer) and data ingest (http layer)
** the CN must be the FQDN of the node
** even though OpenSearch doesn't seem to use them at this point, there should be SANs for the full FQDN, the short hostname, and the node's IP address
* User certificate/key - identifies a specific OpenSearch user
** the CN is a username in OpenSearch (explicit or implicit)
** username is used to determine roles/privileges in the cluster
** Need at least one cert for admin authentication

Node and User certificates must all be signed by the cluster CA, and '''all keys must be in PKCS8 format'''. If user certificates are to be used for authentication by logstash, they must be imported (along with the key and CA) into a PKCS12 (.p12) keystore. The OpenSearch documentation has a good piece on [https://opensearch.org/docs/latest/security-plugin/configuration/generate-certificates/ how to create certificates].

In summary, create the following certificates/keys

# CA certificate/key
# For each cluster node:
#* <node>.crt
#* <node>-pkcs8.key
# For each user:
#* <user>.crt
#* <user>-pkcs8.key
#* <user>.p12

All these should be created and available for the configuration of each cluster node below.

== Install OpenSearch Tarball ==
The OpenSearch tarball must be installed on each node in the cluster; there is no difference at this point between the installations on the different types of nodes. Assuming the deployment will use the <code>opensearch</code> user created above ... change to the <code>opensearch</code> user (from root):
su - opensearch
cd ~
Run the following commands as user <code>opensearch</code> (from the opensearch home directory <code>/opt/opensearch</code>):
wget https://artifacts.opensearch.org/releases/bundle/opensearch/1.2.0/opensearch-1.2.0-linux-x64.tar.gz
tar -xvzf opensearch-1.2.0-linux-x64.tar.gz
== Configure OpenSearch ==
Configuration varies depending on the type of node, but these are the places that need to be touched.

=== <code>jvm.options</code> ===
Set the heap size. Common practice is that the heap should be between 4GB and half the available memory, but no higher than 32GB. A minimum configuration might be (assuming the hot data nodes have 32GB installed RAM):

* 4G for master and cold data nodes
* 16G for hot data nodes

This must be done individually for each node -- create the file <code>config/jvm.options.d/heapsize.options</code> with the heap size options:
-Xms4g
-Xmx4g
... with, of course, the proper heap size for that node.

=== <code>opensearch.yml</code> ===
While the OpenSearch software can be rather forgiving for low-security environments, if you turn on full hostname verification, it is rather strict. The biggest thing to remember is that '''SAN entries in the node certificates aren't always respected.''' What this boils down to is that if full verification is turned on, you cannot use IP addresses in the <code>opensearch.yml</code> (or any other configuration file such as the logstash pipelines).

Each node will have a unique version of opensearch.yml due to certificate file names, the role(s) for the node, and the temperature (for data nodes). The significant parameters in the opensearch.yml file are:
* <code>cluster.name</code> - must be identical in all nodes
* <code>node.name</code> - can be the hostname, but doesn't have to be (must be unique)
* <code>path.data</code> - data nodes only -- must be unique to the node (i.e., if using shared storage, each node must have its own directory on the shared volume)
* <code>path.logs</code> - must be unique to the node (same as above)
* <code>network.host</code> -- host's FQDN where all traffic will be received
**Can be a single hostname or an array of hostnames in <code>["name", "name"]</code> format
**If you are doing transport-level host verification this CANNOT be an IP -- OpenSearch can't handle the SANs in the certificates needed to make it work
* <code>http.port</code> -- leave it as the default (<code>9200</code>) but uncomment it to lock it in
* <code>discovery.seed_hosts</code> -- FQDN for all the masters (same for all nodes) -- only used for cluster bootstrap
* <code>cluster.initial_master_nodes</code> -- put the master node names here (not the hostnames) -- same for all nodes
* <code>node.[master|data|ingest|remote]</code> - set the type of the node (specify all four)
* <code>node.attr.temp</code> - (data nodes only) defines data storage tiers (<code>hot</code>/<code>warm</code>/<code>cold</code>)
The stock configuration file has most of these specified -- edit the file to fit the configuration of the node.

The security configuration must be added next. Since we are not using the demo security as our template, all the relevant entries must be added now. Copy this block into the end of the stock opensearch.yml, editing as needed to provide the relevant certificate file names:
node.master: true
node.ingest: true
node.data: false
#node.remote: false

#node.attr.temp: hot

######## Start OpenSearch Security Configuration ########
plugins.security.ssl.transport.pemcert_filepath: <node>.crt
plugins.security.ssl.transport.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.transport.pemtrustedcas_filepath: <CAcert>
plugins.security.ssl.transport.enforce_hostname_verification: true

plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: <node>.crt
plugins.security.ssl.http.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.http.pemtrustedcas_filepath: <CAcert>

plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true

# at least one cert -- must be an EXACT copy of the DN for the certificates to be used for admin auth
plugins.security.authcz.admin_dn:
- CN=Admin,O=OpenSearch,C=US

# these must be EXACT copies of the DN for the node certificates
plugins.security.nodes_dn:
- CN=master-1.opensearch.localnet,O=OpenSearch,C=US
- CN=data-hot-1.opensearch.localnet,O=OpenSearch,C=US

# this is all boilerplate ... no modifications needed
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
# this next bit should ALL be on one line ...
plugins.security.system_indices.indices: [".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opendistro-notifications-*", ".opendistro-notebooks", ".opendistro-asynchronous-search-response*", ".replication-metadata-store"]

node.max_local_storage_nodes: 3

######## End OpenSearch Security Configuration ########
Note that there are two sections where the DNs from the certificates must be provided to enable admin authentication for security configuration, and for validating the nodes that will be part of the cluster. You can easily show the DNs for your certificates using this command:
openssl x509 -subject -nameopt RFC2253 -noout -in <cert file name>
The DN for the admin cert and the DNs for the node certs go in separate sections as shown above. Add all the cluster nodes into the nodes_dn section -- this must be the same for all nodes or they will not be able to communicate.

=== Certificates ===
Copy the following certificates/keys that were created previously into the config directory of each node:

* <CAcert>
* <node>.crt
* <node>-pkcs8.key
* admin.crt
* admin-pkcs8.key
Verify that the <code>opensearch</code> user is able to at least read all of the certs/keys, or the cluster will not function.

=== Security Plugin Configuration ===
While the security configuration files exist on all the nodes in the cluster, they are only used once and only by the node where the security initialization is performed. With that in mind, pick one node (the first master node will do nicely) to edit the configuration files as needed. There are many things that can be done with these files, but two items in particular should be addressed. The files are located in the <code>plugins/opensearch-security/securityconfig</code> directory in the opensearch distribution.

==== Authentication ====
<code>config.yml</code> is where the authentication options are established. This will need to be configured to support your authentication options. The OpenSearch documentation describes the different options [https://opensearch.org/docs/latest/security-plugin/configuration/configuration/ here].

The default configuration supports basic username/password authentication, but if you wish to use certificate authentication for API clients or ingest pipelines using logstash, you need to enable both '''http''' and '''transport''' under <code>clientcert_auth_domain</code>. Proxy authentication and OpenID Connect authentication are possible as well ... see the documentation.

==== Admin user password ====
The default password for the admin user is well-known, and should be changed for a production environment. The OpenSearch documentation provides instructions on how to [https://opensearch.org/docs/latest/security-plugin/configuration/yaml/#internal_usersyml generate passwords] for the <code>internal_users.yml</code> file. Note that '''the admin password cannot be set through the dashboard or the API''' -- it must be set in this file, and then (re)loaded to take effect.

=== Security Plugin Initialization ===
Once all security configuration files are finished, they must be loaded into the OpenSearch cluster. This can only be done once enough of the cluster is up to provide a master node and at least one data node, as the security configuration is loaded into an index in the cluster. See the instructions below for specific sequencing. This command will initialize the security plugin:
plugins/opensearch-security/tools/securityadmin.sh \
-cacert config/<CA cert> -cert config/admin.crt -key config/admin-pkcs8.key \
-cd plugins/opensearch-security/securityconfig/ -host master-1.opensearch.localnet -icl -nhnv
... but again -- wait to do this until the right time below.

=== Starting The Node ===
Subject to the constraints described in the next section, the only thing left is to start the opensearch service:
sudo systemctl enable --now opensearch
Again, this must be done on each node in the correct sequence ... see below.

== Cluster Bootstrapping ==
Starting up a multi-node cluster is a bit tricky:

* Master-only nodes have no data
* Data-only nodes aren't master eligible
* Cluster initialization requires indexes to be available

'''Note that this process does NOT use the demo configuration -- put all the security plugin items in the <code>opensearch.yml</code> file from the beginning'''. This will cause the <code>opensearch-tar-install.sh</code> script to skip the security initialization -- this is intentional.

What this means is that you have to stand up multiple nodes before anything starts working ... the process should be:

# Install and start what will be the 'primary' master ... reference point for cluster startup operations
# Install and start one of the 'hot' data nodes
# Run the 'securityadmin.sh' script (using the command above) to initialize the security subsystem
# Install/start the other master nodes so that there is resiliency as nodes go up and down
# Install/start the other data nodes to allow replication of the system indexes.

== Install OpenSearch Dashboards ==
Finally, install the OpenSearch Dashboards using the tarball as well.

Run the following commands as opensearch user:
wget https://artifacts.opensearch.org/releases/bundle/opensearch-dashboards/1.2.0/opensearch-dashboards-1.2.0-linux-x64.tar.gz
tar -zxf opensearch-dashboards-1.2.0-linux-x64.tar.gz

Configure the <code>opensearch-dashboards.yml</code> file as needed for your situation.

Create file <code>/etc/systemd/system/opensearch-dashboards.service</code> (as root) with the following contents:
[Unit]
Description=OpenSearch-Dashboards

[Service]
Type=simple
User=opensearch
Group=opensearch
ExecStart=/opt/opensearch/opensearch-dashboards-1.2.0/bin/opensearch-dashboards
Restart=always
WorkingDirectory=/opt/opensearch/opensearch-dashboards-1.2.0

[Install]
WantedBy=multi-user.target
Run the following commands as root:
systemctl daemon-reload
systemctl start opensearch-dashboards.service
systemctl enable opensearch-dashboards.service

== References ==

* https://cryptopoolparty.com/blog/cardano-stake-pool-monitoring-ofo-stack/
** Describes how to use OpenSearch to monitor another application -- but portions were used to create this document
*https://github.com/opensearch-project/dashboards-reports/releases/tag/chromium-1.12.0.0
**download for headless chrome for generating PDF/PNG reports

Cluster Operations

2021-11-26T14:26:15Z

DrEdWilliams: Created page with "== Rolling Restart == The OpenSearch cluster is very resilient -- it tries to make sure it has all of its indices appropriately replicated at all times. It can handle nodes d..."

OpenSearch

2021-11-26T12:51:19Z

DrEdWilliams: /* OpenSearch How-Tos */

== WilliamsNet OpenSearch Architecture ==
For convenience, and to enable this cluster to be usable as a Test and Development System for the HPCMP OpenSearch cluster, the configuration and naming conventions here will mirror those used in the HPCMP environment:

* Indices will be named using this style: <code><nowiki><data source>-<schema version>-<system name>-YYYY.MM.DD</nowiki></code>
* Data will be routed through logstash to enable processing and augmentation prior to indexing in the cluster
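
To illustrate the naming style, the sketch below assembles an index name from its parts. The helper name <code>index_name</code> and the sample values (<code>metricbeat</code>, <code>v1</code>) are assumptions for this example, as is the use of the UTC date when no date is given; in practice the name would be produced by the logstash output configuration.

```shell
#!/bin/sh
# index_name <data source> <schema version> <system name> [YYYY.MM.DD]
# Builds a name in the style <data source>-<schema version>-<system name>-YYYY.MM.DD.
# Defaults to today's UTC date (an assumption) when no date is passed.
index_name() {
    day=${4:-$(date -u +%Y.%m.%d)}
    printf '%s-%s-%s-%s\n' "$1" "$2" "$3" "$day"
}

index_name metricbeat v1 poggin 2021.11.26   # prints metricbeat-v1-poggin-2021.11.26
```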

=== OpenSearch Cluster & Dashboards ===
The OpenSearch Cluster is composed of these node types:
{| class="wikitable"
!Node Type
!Servers
!System RAM (GB)
!Heap Size (GB)
|-
|Master/Ingest
|poggin
|16
|4
|-
|
|aravis
|8
|4
|-
|
|lamppost
|8
|4
|-
|Hot Data
|caspian
|72
|16
|-
|
|belisar
|72
|16
|-
|
|uvilas
|72
|16
|-
|Cold Data
|calormen
|16
|6
|-
|
|trumpkin
|12
|6
|-
|Dashboards
|pro1
|40
|
|-
|Monitoring
|pro1 (cerebro)
|40
|
|}
All nodes have access to the <code>/work</code> filesystem (NFS shared by '''pro6'''), and use <code>/work/osdata/<hostname></code> for logs and data (if appropriate).

All communication with (port 9200) and within (port 9300) the cluster is SSL-encrypted using certificates issued by a self-signed CA. All certs are located in <code>/work/osdata/certs</code>.

=== Aggregation and Processing ===
Logstash is installed on '''calormen''' and serves as the primary ingest route for the cluster. Additionally, another instance of logstash is installed on '''aslan''' for testing/validation purposes.

=== Data Collection ===

* All servers should have metricbeat installed, feeding data to logstash on calormen
* Servers with CUDA-capable NVIDIA GPUs have the NVIDIA collection script installed:
** '''poggin''' (Quadro GP100)
** '''aslan''' (GTX 1080 Ti)
** '''rogin''' (GTX 1080)
** '''compute1''' (GTX 1070)
** '''griffle''' (GTX 1060, GTX 950)
** '''trumpkin''' (GTX 950)

=== Cluster Monitoring ===
The OpenSearch cluster is monitored using the methods described in the [[Cluster Monitoring]] page:

* OpenSearch node logs are collected by a single instance of filebeat running on '''poggin''', accessing the logs directly from the shared filesystem. The log data is sent through logstash on '''calormen''', where the JSON log data is expanded and time-tagged
* Dashboards and visualizations are set up to view the log data and the audit data

=== Authentication ===

* User authentication to Dashboards is username/password using the OpenSearch internal user database
* API authentication uses username/password and certificate authentication methods
* Authentication of nodes in the cluster uses node certificates and the DNs contained in the <code>opensearch.yml</code> configuration files
*All certs are located in <code>/work/osdata/certs</code>

== OpenSearch How-Tos ==
Topics related to creating, maintaining, and using an elastic cluster

* [[OpenSearch Cluster|Installing an OpenSearch Cluster & Dashboards]]
* [[Data Collection Methods]]
* [[OpenSearch Cluster Monitoring|Cluster Monitoring]]
*[[Cluster Operations]]

MediaWiki:Sidebar

2021-11-26T12:49:52Z

DrEdWilliams:

Aslan

2021-11-26T12:18:42Z

DrEdWilliams:

This system is the primary linux workstation in the office and needs to have the basic apps for a graphic workstation:

* KDE Plasma Desktop
* Mate terminal (Ubuntu app)
* VideoLAN media player (https://videolan.org/vlc/download-redhat.html)
* VMware Horizon Client (for SAIC VDI access)
* Visual Studio Code (vscode - https://code.visualstudio.com/download)
* Calibre library manager (download)
* HPCMP Kerberos Kit (https://centers.hpc.mil)
** CAC software (CA CoolKey?)
* Kontena Lens (download latest)
* Thunderbird
* Slack
* Synergy (v1)
* NVIDIA drivers (default with kubuntu/nonfree)
* Zoom client
* Firefox
* rclone (for google & dropbox mounts)
*pdsh
*Oracle Java (the real thing) for the Dell Storage manager
*minio client (mc) for gitlab backup -- there is a copy in /work/bin if that will work for the platform.
*Dell EqualLogic Group Manager (java webstart - requires iSCSInet interface below and Oracle Java)

Configurations:
* mount /work and /files filesystems
*configure extra network devices
**StorageNet (10.1.0.8) - use the second ethernet interface
**iSCSInet (10.2.0.200) - done via [[VLAN Interfaces|VLAN]] 15 on the StorageNet interface
* set up [[Mounting Google Drive and Dropbox|dropbox/google drive mounts]] (gdrives script in /work/bin)
Tasks/Applications (in addition to normal housekeeping):

* cron daily:
**check-all (moved to pro6)
** Backup GitLab
***mount storage1:/backup/systems to /backup_systems
***install the cert for the gitlab instance (since it is self-signed)

Pages that need to be updated

2021-11-26T12:15:37Z

DrEdWilliams:

[[SSH Host Key Management]]

[[Diggle]] - manage configuration in GitLab

[[MediaWiki Deployment]] (needs to be created)

Update [[WilliamsNet Architecture|Architecture]] to reflect [[OpenSearch]] and [[Elastic Cluster|Elastic]] clusters

[[Synergy Installation]] (needs to be created)

[[Log Accumulator]] (needs to be created)

OpenSearch Cluster Installation

2021-11-26T01:16:44Z

DrEdWilliams: /* jvm.options */

These instructions describe the creation of a multi-node OpenSearch cluster using the tarball distribution. The configuration instructions below assume a cluster with the following components:

* Multiple master/ingest nodes
* Multiple hot data nodes
* Multiple cold data nodes

The instructions should be adaptable to any multi-node configuration, and even to a uniform node configuration (i.e. three master/ingest/data nodes) -- though that simple configuration does not need the multiple steps required for the more complex deployment.

Read through the entire document before doing anything, as there is a specific order in which things must be done for a successful installation.

== Node Preparation ==
Each node to be used in the cluster must be prepared to support the OpenSearch installation.

=== Create Opensearch User ===
Debian/Ubuntu:
sudo adduser --system --uid 968 --shell /bin/bash --gecos 'OpenSearch User' --group --disabled-password --home /opt/opensearch opensearch
Fedora/RHEL:
sudo adduser --system --uid 968 --shell /bin/bash --home-dir /opt/opensearch -m opensearch
Then add your normal user to the group:
sudo usermod -aG opensearch ewilliam

=== Set vm.max_map_count ===
Create <code>/etc/sysctl.d/vm.maxmap_count.conf</code> (as root) and add the following line:
vm.max_map_count=262144
Run the following command as root to load the new configuration:
sysctl --system
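
To confirm the setting took effect, something like the following could be used. It is a sketch: the function names are made up for illustration, and it assumes the standard Linux location <code>/proc/sys/vm/max_map_count</code>.

```shell
#!/bin/sh
# Succeeds when the given value meets the 262144 minimum used on this page.
map_count_ok() {
    [ "$1" -ge 262144 ]
}

# Report the value the running kernel is using.
report_map_count() {
    current=$(cat /proc/sys/vm/max_map_count)
    if map_count_ok "$current"; then
        echo "vm.max_map_count=$current (ok)"
    else
        echo "vm.max_map_count=$current (too low)"
        return 1
    fi
}

# Example:
#   report_map_count
```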

=== Systemd Unit File ===
One of the big things missing from the tarball distribution is a systemd unit file. Fortunately, it isn't difficult to create a simple one that will work.

Create a file <code>/etc/systemd/system/opensearch.service</code> with the following contents:
[Unit]
Description=OpenSearch
Documentation=<nowiki>https://opensearch.org/docs/</nowiki>
Wants=network-online.target
After=network-online.target

[Service]
RuntimeDirectory=opensearch
WorkingDirectory=/opt/opensearch/opensearch-1.2.0
User=opensearch
Group=opensearch

ExecStart=/opt/opensearch/opensearch-1.2.0/opensearch-tar-install.sh

# StandardOutput is redirected to the journal since some error
# messages may be written to standard output before the OpenSearch
# logging system is initialized. After that, OpenSearch writes its
# logs to the directory set by path.logs in opensearch.yml.
StandardOutput=journal
StandardError=inherit

# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65535

# Specifies the maximum number of processes
LimitNPROC=4096

# Specifies the maximum size of virtual memory
LimitAS=infinity

# Specifies the maximum file size
LimitFSIZE=infinity

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0

# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM

# Send the signal only to the JVM rather than its control group
KillMode=process

# Java process is never killed
SendSIGKILL=no

# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143

# Allow a slow startup before the systemd notifier module kicks in to extend the timeout
TimeoutStartSec=75

[Install]
WantedBy=multi-user.target

== Create Certificates ==
There are three types of certificates used in OpenSearch:

* CA certificate - one for the whole cluster to validate everything else
* Node certificate/key - identifies the node for internal cluster communications (transport layer) and data ingest (http layer)
** the CN must be the FQDN of the node
** even though OpenSearch doesn't seem to use them at this point, there should be SANs for the full FQDN, the short hostname, and the node's IP address
* User certificate/key - identifies a specific OpenSearch user
** the CN is a username in OpenSearch (explicit or implicit)
** username is used to determine roles/privileges in the cluster
** Need at least one cert for admin authentication

Node and User certificates must all be signed by the cluster CA, and '''all keys must be in PKCS8 format'''. If user certificates are to be used for authentication by logstash, they must be imported (along with the key and CA) into a PKCS12 (.p12) keystore. The OpenSearch documentation has a good piece on [https://opensearch.org/docs/latest/security-plugin/configuration/generate-certificates/ how to create certificates].

In summary, create the following certificates/keys:

# CA certificate/key
# For each cluster node:
#* <node>.crt
#* <node>-pkcs8.key
# For each user:
#* <user>.crt
#* <user>-pkcs8.key
#* <user>.p12

All these should be created and available for the configuration of each cluster node below.
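
As a rough sketch of how these files could be produced with <code>openssl</code>, following the same general steps as the OpenSearch documentation linked above: the function names, the CA file names (<code>ca.key</code>, <code>ca.crt</code>), the subject fields, the key size, and the validity period are all placeholders chosen for illustration, not values from this cluster.

```shell
#!/bin/sh
# Sketch only: file names, subject fields and validity periods are
# placeholders, not values taken from a real cluster.

# Create a self-signed CA: ca.key and ca.crt in the current directory.
make_ca() {
    openssl genrsa -out ca.key 2048 &&
    openssl req -new -x509 -sha256 -key ca.key \
        -subj "/C=US/O=OpenSearch/CN=Example CA" -days 730 -out ca.crt
}

# make_cert NAME CN [SAN]
# Creates NAME-pkcs8.key and NAME.crt, signed by the CA from make_ca.
# SAN is an optional subjectAltName value such as
#   DNS:node1.example.net,DNS:node1,IP:192.0.2.10
make_cert() {
    name=$1 cn=$2 san=${3:-}
    openssl genrsa -out "$name-temp.key" 2048 &&
    # Convert the private key to unencrypted PKCS8 format
    openssl pkcs8 -topk8 -nocrypt -inform PEM -outform PEM \
        -in "$name-temp.key" -out "$name-pkcs8.key" &&
    openssl req -new -key "$name-pkcs8.key" \
        -subj "/C=US/O=OpenSearch/CN=$cn" -out "$name.csr" || return 1
    if [ -n "$san" ]; then
        printf 'subjectAltName=%s\n' "$san" > "$name.ext"
        openssl x509 -req -in "$name.csr" -CA ca.crt -CAkey ca.key \
            -CAcreateserial -sha256 -days 730 \
            -extfile "$name.ext" -out "$name.crt"
    else
        openssl x509 -req -in "$name.csr" -CA ca.crt -CAkey ca.key \
            -CAcreateserial -sha256 -days 730 -out "$name.crt"
    fi
    status=$?
    # Remove intermediate files; only the PKCS8 key and the cert remain
    rm -f "$name-temp.key" "$name.csr" "$name.ext"
    return $status
}

# make_p12 NAME PASSWORD
# Bundles NAME.crt, NAME-pkcs8.key and the CA cert into NAME.p12.
make_p12() {
    openssl pkcs12 -export -in "$1.crt" -inkey "$1-pkcs8.key" \
        -certfile ca.crt -passout "pass:$2" -out "$1.p12"
}

# Example (hypothetical names):
#   make_ca
#   make_cert node1 node1.example.net "DNS:node1.example.net,DNS:node1,IP:192.0.2.10"
#   make_cert admin Admin
#   make_p12 admin 'choose-a-password'
```

With the subject written as <code>/C=US/O=OpenSearch/CN=...</code>, the RFC2253 form of the DN comes out as <code>CN=...,O=OpenSearch,C=US</code>, which is the form used in the <code>opensearch.yml</code> examples on this page.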

== Install OpenSearch Tarball ==
The OpenSearch tarball must be installed on each node in the cluster; there is no difference at this point between the installations on the different types of nodes. Assuming the deployment will use the <code>opensearch</code> user created above ... change to the <code>opensearch</code> user (from root):
su - opensearch
cd ~
Run the following commands as user <code>opensearch</code> (from the opensearch home directory <code>/opt/opensearch</code>):
wget https://artifacts.opensearch.org/releases/bundle/opensearch/1.2.0/opensearch-1.2.0-linux-x64.tar.gz
tar -xvzf opensearch-1.2.0-linux-x64.tar.gz
== Configure OpenSearch ==
Configuration varies depending on the type of node, but these are the places that need to be touched.

=== <code>jvm.options</code> ===
Set the heap size. Common practice is that the heap should be between 4GB and half the available memory, but no higher than 32GB. A minimum configuration might be (assuming the hot data nodes have 32GB installed RAM):

* 4G for master and cold data nodes
* 16G for hot data nodes

This must be done individually for each node -- create the file <code>config/jvm.options.d/heapsize.options</code> with the heap size options:
-Xms4g
-Xmx4g
... with, of course, the proper heap size for that node.
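
The rule of thumb above can be written down as a short helper. This is a sketch with made-up function names; it only computes the upper end of the range (half of RAM, capped at 32GB), and a smaller heap may still be the better choice for a given node.

```shell
#!/bin/sh
# Upper end of the rule of thumb: half of RAM, capped at 32 GB.
# Argument: installed RAM in whole GB.
max_heap_gb() {
    half=$(( $1 / 2 ))
    if [ "$half" -gt 32 ]; then
        half=32
    fi
    echo "$half"
}

# Print the two JVM options for a heap of the given size in GB.
heap_options() {
    printf '%s\n' "-Xms${1}g" "-Xmx${1}g"
}

# Example: write the options for a 4 GB heap
# (run from the OpenSearch installation directory):
#   heap_options 4 > config/jvm.options.d/heapsize.options
```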

=== <code>opensearch.yml</code> ===
While the OpenSearch software can be rather forgiving for low-security environments, if you turn on full hostname verification, it is rather strict. The biggest thing to remember is that '''SAN entries in the node certificates aren't always respected.''' What this boils down to is that if full verification is turned on, you cannot use IP addresses in the <code>opensearch.yml</code> (or any other configuration file such as the logstash pipelines).

Each node will have a unique version of opensearch.yml due to certificate file names, the role(s) for the node, and the temperature (for data nodes). The significant parameters in the opensearch.yml file are:
* <code>cluster.name</code> - must be identical in all nodes
* <code>node.name</code>- can be the hostname, but doesn't have to be (must be unique)
* <code>path.data</code> - data nodes only -- must be unique to the node (i.e., if using shared storage, each node must have its own directory on the shared volume)
* <code>path.logs</code> - must be unique to the node (same as above)
* <code>network.host</code> -- host's FQDN where all traffic will be received
**Can be a single hostname or an array of hostnames in <code>["name", "name"]</code> format
**If you are doing transport-level host verification this CANNOT be an IP -- OpenSearch can't handle the SANs in the certificates needed to make it work
* <code>http.port</code> -- leave it as the default (<code>9200</code>) but uncomment it to lock it in
* <code>discovery.seed_hosts</code> -- FQDN for all the masters (same for all nodes) -- only used for cluster bootstrap
* <code>cluster.initial_master_nodes</code> -- put the master node names here (not the hostnames) -- same for all nodes
* <code>node.[master|data|ingest|remote]</code> - set the type of the node (specify all four)
* <code>node.attr.temp</code> - (data nodes only) defines data storage tiers (<code>hot</code>/<code>warm</code>/<code>cold</code>)
The stock configuration file has most of these specified -- edit the file to fit the configuration of the node.
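
Since the role settings differ by node type, a helper like the following could print the lines for one node. It is a sketch under assumptions: the function name and the type keywords are made up, master nodes are assumed to double as ingest nodes (as in the cluster described here), and <code>node.remote</code> is left out.

```shell
#!/bin/sh
# Print the role settings for one node type: master, hot, warm or cold.
node_role_block() {
    case $1 in
        master)
            printf '%s\n' 'node.master: true' 'node.ingest: true' 'node.data: false'
            ;;
        hot|warm|cold)
            printf '%s\n' 'node.master: false' 'node.ingest: false' 'node.data: true' \
                "node.attr.temp: $1"
            ;;
        *)
            echo "unknown node type: $1" >&2
            return 1
            ;;
    esac
}

# Example: append the settings for a hot data node
#   node_role_block hot >> config/opensearch.yml
```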

The security configuration must be added next. Since we are not using the demo security as our template, all the relevant entries must be added now. Copy this block into the end of the stock opensearch.yml, editing as needed to provide the relevant certificate file names:
node.master: true
node.ingest: true
node.data: false
#node.remote: false

#node.attr.temp: hot

######## Start OpenSearch Security Configuration ########
plugins.security.ssl.transport.pemcert_filepath: <node>.crt
plugins.security.ssl.transport.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.transport.pemtrustedcas_filepath: <CAcert>
plugins.security.ssl.transport.enforce_hostname_verification: true

plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: <node>.crt
plugins.security.ssl.http.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.http.pemtrustedcas_filepath: <CAcert>

plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true

# at least one cert -- must be an EXACT copy of the DN for the certificates to be used for admin auth
plugins.security.authcz.admin_dn:
- CN=Admin,O=OpenSearch,C=US

# these must be EXACT copies of the DN for the node certificates
plugins.security.nodes_dn:
- CN=master-1.opensearch.localnet,O=OpenSearch,C=US
- CN=data-hot-1.opensearch.localnet,O=OpenSearch,C=US

# this is all boiler plate ... no modifications needed
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
# this next bit should ALL be on one line ...
plugins.security.system_indices.indices: [".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opendistro-notifications-*", ".opendistro-notebooks", ".opendistro-asynchronous-search-response*", ".replication-metadata-store"]

node.max_local_storage_nodes: 3

######## End OpenSearch Security Configuration ########
Note that there are two sections where the DNs from the certificates must be provided to enable admin authentication for security configuration, and for validating the nodes that will be part of the cluster. You can easily show the DNs for your certificates using this command:
openssl x509 -subject -nameopt RFC2253 -noout -in <cert file name>
The DN for the admin cert and the DNs for the node certs go in separate sections as shown above. Add all the cluster nodes into the nodes_dn section -- this must be the same for all nodes or they will not be able to communicate.
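
To avoid copy/paste mistakes when filling in those sections, the output of the <code>openssl</code> command can be turned directly into list items. A minimal sketch, with made-up function names:

```shell
#!/bin/sh
# Turn the "subject=..." line printed by openssl into a YAML list item.
dn_to_yaml_item() {
    printf '%s\n' "$1" | sed -e 's/^subject= *//' -e 's/^/- /'
}

# Print one list item per certificate file given as an argument.
nodes_dn_items() {
    for cert in "$@"; do
        dn_to_yaml_item "$(openssl x509 -subject -nameopt RFC2253 -noout -in "$cert")"
    done
}

# Example (hypothetical file names):
#   nodes_dn_items config/master-1.crt config/data-hot-1.crt
```

The <code>sed</code> expression accepts both <code>subject=CN=...</code> and <code>subject= CN=...</code>, since the spacing differs between OpenSSL versions.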

=== Certificates ===
Copy the following certificates/keys that were created previously into the config directory of each node:

* <CAcert>
* <node>.crt
* <node>-pkcs8.key
* admin.crt
* admin-pkcs8.key
Verify that the <code>opensearch</code> user is able to at least read all of the certs/keys, or the cluster will not function.
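
One way to check this is to run a small test as the <code>opensearch</code> user (for example after <code>su - opensearch</code>). The helper below is a sketch; the function name is made up.

```shell
#!/bin/sh
# Print every file that the current user cannot read. The exit status
# is non-zero if at least one such file was found.
check_readable() {
    missing=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "not readable: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example, run from the OpenSearch installation directory:
#   check_readable config/*.crt config/*.key
```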

=== Security Plugin Configuration ===
While the security configuration files exist on all the nodes in the cluster, they are only used once and only by the node where the security initialization is performed. With that in mind, pick one node (the first master node will do nicely) to edit the configuration files as needed. There are many things that can be done with these files, but two items in particular should be addressed. The files are located in the <code>plugins/opensearch-security/securityconfig</code> directory in the opensearch distribution.

==== Authentication ====
<code>config.yml</code> is where the authentication options are established. This will need to be configured to support your authentication options. The OpenSearch documentation describes the different options [https://opensearch.org/docs/latest/security-plugin/configuration/configuration/ here].

The default configuration supports basic username/password authentication, but if you wish to use certificate authentication for API clients or ingest pipelines using logstash, you need to enable both '''http''' and '''transport''' under <code>clientcert_auth_domain</code>. Proxy authentication and OpenID Connect authentication are possible as well ... see the documentation.

==== Admin user password ====
The default password for the admin user is well-known, and should be changed for a production environment. The OpenSearch documentation provides instructions on how to [https://opensearch.org/docs/latest/security-plugin/configuration/yaml/#internal_usersyml generate passwords] for the <code>internal_users.yml</code> file. Note that '''the admin password cannot be set through the dashboard or the API''' -- it must be set in this file, and then (re)loaded to take effect.

=== Security Plugin Initialization ===
Once all security configuration files are finished, they must be loaded into the OpenSearch cluster. This can only be done once enough of the cluster is up to provide a master node and at least one data node, as the security configuration is loaded into an index in the cluster. See the instructions below for specific sequencing. This command will initialize the security plugin:
plugins/opensearch-security/tools/securityadmin.sh \
-cacert config/<CA cert> -cert config/admin.crt -key config/admin-pkcs8.key \
-cd plugins/opensearch-security/securityconfig/ -host master-1.opensearch.localnet -icl -nhnv
... but again -- wait to do this until the right time below.

=== Starting The Node ===
Subject to the constraints described in the next section, the only thing left is to start the opensearch service:
sudo systemctl enable --now opensearch
Again, this must be done on each node in the correct sequence ... see below.

== Cluster Bootstrapping ==
Starting up a multi-node cluster is a bit tricky:

* Master-only nodes have no data
* Data-only nodes aren't master eligible
* Cluster initialization requires indexes to be available

'''Note that this process does NOT use the demo configuration -- put all the security plugin items in the <code>opensearch.yml</code> file from the beginning'''. This will cause the <code>opensearch-tar-install.sh</code> script to skip the security initialization -- this is intentional.

What this means is that you have to stand up multiple nodes before anything starts working ... the process should be:

# Install and start what will be the 'primary' master ... reference point for cluster startup operations
# Install and start one of the 'hot' data nodes
# Run the 'securityadmin.sh' script (using the command above) to initialize the security subsystem
# Install/start the other master nodes so that there is resiliency as nodes go up and down
# Install/start the other data nodes to allow replication of the system indexes.
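
Because each step depends on the previous nodes actually being up, a generic retry helper can be handy between steps. The sketch below is an illustration, not part of the OpenSearch tooling: <code>wait_until</code> is a made-up name, and the <code>curl</code> call in the comment only checks that the HTTP port accepts a TLS connection, not that the node is healthy.

```shell
#!/bin/sh
# wait_until TRIES DELAY COMMAND...
# Run COMMAND until it succeeds, at most TRIES times, sleeping DELAY
# seconds between attempts. Returns non-zero if it never succeeded.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (host name as used in the securityadmin.sh command on this page):
#   wait_until 30 5 curl -sk -o /dev/null https://master-1.opensearch.localnet:9200 \
#       && echo "HTTP port is answering"
```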

== Install OpenSearch Dashboards ==
Finally, install the OpenSearch Dashboards using the tarball as well.

Run the following commands as opensearch user:
wget https://artifacts.opensearch.org/releases/bundle/opensearch-dashboards/1.2.0/opensearch-dashboards-1.2.0-linux-x64.tar.gz
tar -zxf opensearch-dashboards-1.2.0-linux-x64.tar.gz

Configure the <code>opensearch-dashboards.yml</code> file as needed for your situation.

Create file <code>/etc/systemd/system/opensearch-dashboards.service</code> (as root) with the following contents:
[Unit]
Description=OpenSearch-Dashboards

[Service]
Type=simple
User=opensearch
Group=opensearch
ExecStart=/opt/opensearch/opensearch-dashboards-1.2.0/bin/opensearch-dashboards
Restart=always
WorkingDirectory=/opt/opensearch/opensearch-dashboards-1.2.0

[Install]
WantedBy=multi-user.target
Run the following commands as root:
systemctl daemon-reload
systemctl start opensearch-dashboards.service
systemctl enable opensearch-dashboards.service

== References ==

* https://cryptopoolparty.com/blog/cardano-stake-pool-monitoring-ofo-stack/
** Describes how to use OpenSearch to monitor another application -- but portions were used to create this document
*https://github.com/opensearch-project/dashboards-reports/releases/tag/chromium-1.12.0.0
**download for headless chrome for generating PDF/PNG reports

OpenSearch Cluster Installation

2021-11-25T19:15:34Z

DrEdWilliams: /* Install Opensearch Tarball */

These instructions describe the creation of a multi-node OpenSearch cluster using the tarball distribution. The configuration instructions below assume a cluster with the following components:

* Multiple master/ingest nodes
* Multiple hot data nodes
* Multiple cold data nodes

The instructions should be adaptable to any multi-node configuration, and even to a uniform node configuration (i.e. three master/ingest/data nodes) -- though that simple configuration does not need the multiple steps required for the more complex deployment.

Read through the entire document before doing anything, as there is a specific order in which things must be done for a successful installation.

== Node Preparation ==
Each node to be used in the cluster must be prepared to support the OpenSearch installation.

=== Create Opensearch User ===
Debian/Ubuntu:
sudo adduser --system --uid 968 --shell /bin/bash --gecos 'OpenSearch User' --group --disabled-password --home /opt/opensearch opensearch
Fedora/RHEL:
sudo adduser --system --uid 968 --shell /bin/bash --home-dir /opt/opensearch -m opensearch
Then add your normal user to the group:
sudo usermod -aG opensearch ewilliam

=== Set vm.max_map_count ===
Create <code>/etc/sysctl.d/vm.maxmap_count.conf</code> (as root) and add the following line:
vm.max_map_count=262144
Run the following command as root to load the new configuration:
sysctl --system

=== Systemd Unit File ===
One of the big things missing from the tarball distribution is a systemd unit file. Fortunately, it isn't difficult to create a simple one that will work.

Create a file <code>/etc/systemd/system/opensearch.service</code> with the following contents:
[Unit]
Description=OpenSearch
Documentation=<nowiki>https://opensearch.org/docs/</nowiki>
Wants=network-online.target
After=network-online.target

[Service]
RuntimeDirectory=opensearch
WorkingDirectory=/opt/opensearch/opensearch-1.2.0
User=opensearch
Group=opensearch

ExecStart=/opt/opensearch/opensearch-1.2.0/opensearch-tar-install.sh

# StandardOutput is redirected to the journal since some error
# messages may be written to standard output before the OpenSearch
# logging system is initialized. After that, OpenSearch writes its
# logs to the directory set by path.logs in opensearch.yml.
StandardOutput=journal
StandardError=inherit

# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65535

# Specifies the maximum number of processes
LimitNPROC=4096

# Specifies the maximum size of virtual memory
LimitAS=infinity

# Specifies the maximum file size
LimitFSIZE=infinity

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0

# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM

# Send the signal only to the JVM rather than its control group
KillMode=process

# Java process is never killed
SendSIGKILL=no

# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143

# Allow a slow startup before the systemd notifier module kicks in to extend the timeout
TimeoutStartSec=75

[Install]
WantedBy=multi-user.target

== Create Certificates ==
There are three types of certificates used in OpenSearch:

* CA certificate - one for the whole cluster to validate everything else
* Node certificate/key - identifies the node for internal cluster communications (transport layer) and data ingest (http layer)
** the CN must be the FQDN of the node
** even though OpenSearch doesn't seem to use them at this point, there should be SANs for the full FQDN, the short hostname, and the node's IP address
* User certificate/key - identifies a specific OpenSearch user
** the CN is a username in OpenSearch (explicit or implicit)
** username is used to determine roles/privileges in the cluster
** Need at least one cert for admin authentication

Node and User certificates must all be signed by the cluster CA, and '''all keys must be in PKCS8 format'''. If user certificates are to be used for authentication by logstash, they must be imported (along with the key and CA) into a PKCS12 (.p12) keystore. The OpenSearch documentation has a good piece on [https://opensearch.org/docs/latest/security-plugin/configuration/generate-certificates/ how to create certificates].

In summary, create the following certificates/keys

# CA certificate/key
# For each cluster node:
#* <node>.crt
#* <node>-pkcs8.key
# For each user:
#* <user>.crt
#* <user>-pkcs8.key
#* <user>.p12

All these should be created and available for the configuration of each cluster node below.

== Install OpenSearch Tarball ==
The OpenSearch tarball must be installed on each node in the cluster; there is no difference at this point between the installations on the different types of nodes. Assuming the deployment will use the <code>opensearch</code> user created above ... change to the <code>opensearch</code> user (from root):
su - opensearch
cd ~
Run the following commands as user <code>opensearch</code> (from the opensearch home directory <code>/opt/opensearch</code>):
wget https://artifacts.opensearch.org/releases/bundle/opensearch/1.2.0/opensearch-1.2.0-linux-x64.tar.gz
tar -xvzf opensearch-1.2.0-linux-x64.tar.gz
== Configure OpenSearch ==
Configuration varies depending on the type of node, but these are the places that need to be touched.

=== <code>jvm.options</code> ===
Set the heap size. Common practice is that the heap should be between 4GB and half the available memory, but no higher than 32GB. A minimum configuration might be (assuming the hot data nodes have 32GB installed RAM):

* 4G for master and cold data nodes
* 16G for hot data nodes

This must be done individually for each node.

=== <code>opensearch.yml</code> ===
While the OpenSearch software can be rather forgiving for low-security environments, if you turn on full hostname verification, it is rather strict. The biggest thing to remember is that '''SAN entries in the node certificates aren't always respected.''' What this boils down to is that if full verification is turned on, you cannot use IP addresses in the <code>opensearch.yml</code> (or any other configuration file such as the logstash pipelines).

Each node will have a unique version of opensearch.yml due to certificate file names, the role(s) for the node, and the temperature (for data nodes). The significant parameters in the opensearch.yml file are:
* <code>cluster.name</code> - must be identical in all nodes
* <code>node.name</code>- can be the hostname, but doesn't have to be (must be unique)
* <code>path.data</code> - data nodes only -- must be unique to the node (i.e., if using shared storage, each node must have its own directory on the shared volume)
* <code>path.logs</code> - must be unique to the node (same as above)
* <code>network.host</code> -- host's FQDN where all traffic will be received
**Can be a single hostname or an array of hostnames in <code>["name", "name"]</code> format
**If you are doing transport-level host verification this CANNOT be an IP -- OpenSearch can't handle the SANs in the certificates needed to make it work
* <code>http.port</code> -- leave it as the default (<code>9200</code>) but uncomment it to lock it in
* <code>discovery.seed_hosts</code> -- FQDN for all the masters (same for all nodes) -- only used for cluster bootstrap
* <code>cluster.initial_master_nodes</code> -- put the master node names here (not the hostnames) -- same for all nodes
* <code>node.[master|data|ingest|remote]</code> - set the type of the node (specify all four)
* <code>node.attr.temp</code> - (data nodes only) defines data storage tiers (<code>hot</code>/<code>warm</code>/<code>cold</code>)
The stock configuration file has most of these specified -- edit the file to fit the configuration of the node.

The security configuration must be added next. Since we are not using the demo security as our template, all the relevant entries must be added now. Copy this block into the end of the stock opensearch.yml, editing as needed to provide the relevant certificate file names:
node.master: true
node.ingest: true
node.data: false
#node.remote: false

#node.attr.temp: hot

######## Start OpenSearch Security Configuration ########
plugins.security.ssl.transport.pemcert_filepath: <node>.crt
plugins.security.ssl.transport.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.transport.pemtrustedcas_filepath: <CAcert>
plugins.security.ssl.transport.enforce_hostname_verification: true

plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: <node>.crt
plugins.security.ssl.http.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.http.pemtrustedcas_filepath: <CAcert>

plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true

# at least one cert -- must be an EXACT copy of the DN for the certificates to be used for admin auth
plugins.security.authcz.admin_dn:
- CN=Admin,O=OpenSearch,C=US

# these must be EXACT copies of the DN for the node certificates
plugins.security.nodes_dn:
- CN=master-1.opensearch.localnet,O=OpenSearch,C=US
- CN=data-hot-1.opensearch.localnet,O=OpenSearch,C=US

# this is all boiler plate ... no modifications needed
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
# this next bit should ALL be on one line ...
plugins.security.system_indices.indices: [".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opendistro-notifications-*", ".opendistro-notebooks", ".opendistro-asynchronous-search-response*", ".replication-metadata-store"]

node.max_local_storage_nodes: 3

######## End OpenSearch Security Configuration ########
Note that there are two sections where the DNs from the certificates must be provided to enable admin authentication for security configuration, and for validating the nodes that will be part of the cluster. You can easily show the DNs for your certificates using this command:
openssl x509 -subject -nameopt RFC2253 -noout -in <cert file name>
The DN for the admin cert and the DNs for the node certs go in separate sections as shown above. Add all the cluster nodes into the nodes_dn section -- this must be the same for all nodes or they will not be able to communicate.

=== Certificates ===
Copy the following certificates/keys that were created previously into the config directory of each node:

* <CAcert>
* <node>.crt
* <node>-pkcs8.key
* admin.crt
* admin-pkcs8.key
Verify that the <code>opensearch</code> user is able to at least read all of the certs/keys, or the cluster will not function.

=== Security Plugin Configuration ===
While the security configuration files exist on all the nodes in the cluster, they are only used once and only by the node where the security initialization is performed. With that in mind, pick one node (the first master node will do nicely) to edit the configuration files as needed. There are many things that can be done with these files, but two items in particular should be addressed. The files are located in the <code>plugins/opensearch-security/securityconfig</code> directory in the opensearch distribution.

==== Authentication ====
<code>config.yml</code> is where the authentication options are established. This will need to be configured to support your authentication options. The OpenSearch documentation describes the different options [https://opensearch.org/docs/latest/security-plugin/configuration/configuration/ here].

The default configuration supports basic username/password authentication, but if you wish to use certificate authentication for API clients or ingest pipelines using logstash, you need to enable both '''http''' and '''transport''' under <code>clientcert_auth_domain</code>. Proxy authentication and OpenID Connect authentication are possible as well ... see the documentation.

==== Admin user password ====
The default password for the admin user is well-known, and should be changed for a production environment. The OpenSearch documentation provides instructions on how to [https://opensearch.org/docs/latest/security-plugin/configuration/yaml/#internal_usersyml generate passwords] for the <code>internal_users.yml</code> file. Note that '''the admin password cannot be set through the dashboard or the API''' -- it must be set in this file, and then (re)loaded to take effect.

=== Security Plugin Initialization ===
Once all security configuration files are finished, they must be loaded into the OpenSearch cluster. This can only be done once enough of the cluster is up to provide a master node and at least one data node, as the security configuration is loaded into an index in the cluster. See the instructions below for specific sequencing. This command will initialize the security plugin:
plugins/opensearch-security/tools/securityadmin.sh \
-cacert config/<CA cert> -cert config/admin.crt -key config/admin-pkcs8.key \
-cd plugins/opensearch-security/securityconfig/ -host master-1.opensearch.localnet -icl -nhnv
... but again -- wait to do this until the right time below.

=== Starting The Node ===
Subject to the constraints described in the next section, the only thing left is to start the opensearch service:
sudo systemctl enable --now opensearch
Again, this must be done on each node in the correct sequence ... see below.

== Cluster Bootstrapping ==
Starting up a multi-node cluster is a bit tricky:

* Master-only nodes have no data
* Data-only nodes aren't master eligible
* Cluster initialization requires indexes to be available

'''Note that this process does NOT use the demo configuration -- put all the security plugin items in the <code>opensearch.yml</code> file from the beginning'''. This will cause the <code>opensearch-tar-install.sh</code> script to skip the security initialization -- this is intentional.

What this means is that you have to stand up multiple nodes before anything starts working ... the process should be:

# Install and start what will be the 'primary' master ... reference point for cluster startup operations
# Install and start one of the 'hot' data nodes
# Run the 'securityadmin.sh' script (using the command above) to initialize the security subsystem
# Install/start the other master nodes so that there is resiliency as nodes go up and down
# Install/start the other data nodes to allow replication of the system indexes.

== Install OpenSearch Dashboards ==
Finally, install the OpenSearch Dashboards using the tarball as well.

Run the following commands as opensearch user:
wget https://artifacts.opensearch.org/releases/bundle/opensearch-dashboards/1.2.0/opensearch-dashboards-1.2.0-linux-x64.tar.gz
tar -zxf opensearch-dashboards-1.2.0-linux-x64.tar.gz

Configure the <code>opensearch-dashboards.yml</code> file as needed for your situation.

Create file <code>/etc/systemd/system/opensearch-dashboards.service</code> (as root) with the following contents:
[Unit]
Description=OpenSearch-Dashboards

[Service]
Type=simple
User=opensearch
Group=opensearch
ExecStart=/opt/opensearch/opensearch-dashboards-1.2.0/bin/opensearch-dashboards
Restart=always
WorkingDirectory=/opt/opensearch/opensearch-dashboards-1.2.0

[Install]
WantedBy=multi-user.target
Run the following commands as root:
systemctl daemon-reload
systemctl start opensearch-dashboards.service
systemctl enable opensearch-dashboards.service

== References ==

* https://cryptopoolparty.com/blog/cardano-stake-pool-monitoring-ofo-stack/
** Describes how to use OpenSearch to monitor another application -- but portions were used to create this document
*https://github.com/opensearch-project/dashboards-reports/releases/tag/chromium-1.12.0.0
**download for headless chrome for generating PDF/PNG reports

OpenSearch Cluster Installation

2021-11-25T19:14:12Z

DrEdWilliams: /* Install Java 11 (optional) */

These instructions describe the creation of a multi-node OpenSearch cluster using the tarball distribution. The configuration instructions below assume a cluster with the following components:

* Multiple master/ingest nodes
* Multiple hot data nodes
* Multiple cold data nodes

The instructions should be adaptable to any multi-node configuration, and even to a uniform node configuration (i.e., three master/ingest/data nodes) -- though that simple configuration does not need the multiple steps required for the more complex deployment.

Read through the entire document before doing anything, as there is a specific order in which things must be done for a successful installation.

== Node Preparation ==
Each node to be used in the cluster must be prepared to support the OpenSearch installation.

=== Create Opensearch User ===
Debian/Ubuntu:
sudo adduser --system --uid 968 --shell /bin/bash --gecos 'OpenSearch User' --group --disabled-password --home /opt/opensearch opensearch
Fedora/RHEL:
sudo adduser --system --uid 968 --shell /bin/bash --home-dir /opt/opensearch -m opensearch
Then add your normal user to the group:
sudo usermod -aG opensearch ewilliam

=== Set vm.max_map_count ===
Create <code>/etc/sysctl.d/vm.maxmap_count.conf</code> (as root) and add the following line:
vm.max_map_count=262144
Run the following command as root to load the new configuration:
sysctl --system
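
After reloading, it is worth confirming that the kernel actually picked up the new value. The following is a minimal sketch; the function names <code>map_count_ok</code> and <code>check_max_map_count</code> are made up for illustration and are not part of any OpenSearch tooling.

```shell
# Hypothetical helper: succeeds when the given value meets the 262144 minimum.
map_count_ok() {
    [ "$1" -ge 262144 ]
}

# Hypothetical helper: read the live kernel value and report whether it is high enough.
check_max_map_count() {
    current=$(sysctl -n vm.max_map_count) || return 1
    if map_count_ok "$current"; then
        echo "vm.max_map_count=$current (ok)"
    else
        echo "vm.max_map_count=$current (too low)" >&2
        return 1
    fi
}
```

Calling <code>check_max_map_count</code> on each node before starting the service should catch a node where the sysctl file was missed.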

=== Systemd Unit File ===
One of the big things missing from the tarball distribution is a systemd unit file. Fortunately, it isn't difficult to create a simple one that will work.

Create a file <code>/etc/systemd/system/opensearch.service</code> with the following contents:
[Unit]
Description=OpenSearch
Documentation=<nowiki>https://opensearch.org/docs/</nowiki>
Wants=network-online.target
After=network-online.target

[Service]
RuntimeDirectory=opensearch
WorkingDirectory=/opt/opensearch/opensearch-1.2.0
User=opensearch
Group=opensearch

ExecStart=/opt/opensearch/opensearch-1.2.0/opensearch-tar-install.sh

# StandardOutput is sent to the journal because some error
# messages may be written to standard output before the
# OpenSearch logging system is initialized. OpenSearch itself
# writes its logs to the directory set by path.logs.
StandardOutput=journal
StandardError=inherit

# Specifies the maximum file descriptor number that can be opened by this process
LimitNOFILE=65535

# Specifies the maximum number of processes
LimitNPROC=4096

# Specifies the maximum size of virtual memory
LimitAS=infinity

# Specifies the maximum file size
LimitFSIZE=infinity

# Disable timeout logic and wait until process is stopped
TimeoutStopSec=0

# SIGTERM signal is used to stop the Java process
KillSignal=SIGTERM

# Send the signal only to the JVM rather than its control group
KillMode=process

# Java process is never killed
SendSIGKILL=no

# When a JVM receives a SIGTERM signal it exits with code 143
SuccessExitStatus=143

# Allow a slow startup before the systemd notifier module kicks in to extend the timeout
TimeoutStartSec=75

[Install]
WantedBy=multi-user.target

== Create Certificates ==
There are three types of certificates used in OpenSearch:

* CA certificate - one for the whole cluster to validate everything else
* Node certificate/key - identifies the node for internal cluster communications (transport layer) and data ingest (http layer)
** the CN must be the FQDN of the node
** even though OpenSearch doesn't seem to use them at this point, there should be SANs for the full FQDN, the short hostname, and the node's IP address
* User certificate/key - identifies a specific OpenSearch user
** the CN is a username in OpenSearch (explicit or implicit)
** username is used to determine roles/privileges in the cluster
** Need at least one cert for admin authentication

Node and User certificates must all be signed by the cluster CA, and '''all keys must be in PKCS8 format'''. If user certificates are to be used for authentication by logstash, they must be imported (along with the key and CA) into a PKCS12 (.p12) keystore. The OpenSearch documentation has a good piece on [https://opensearch.org/docs/latest/security-plugin/configuration/generate-certificates/ how to create certificates].

In summary, create the following certificates/keys:

# CA certificate/key
# For each cluster node:
#* <node>.crt
#* <node>-pkcs8.key
# For each user:
#* <user>.crt
#* <user>-pkcs8.key
#* <user>.p12

All these should be created and available for the configuration of each cluster node below.
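
As a rough illustration of the list above, the following sketch issues one node certificate with a PKCS8 key and bundles a user certificate into a PKCS12 keystore. It assumes that <code>ca.crt</code> and <code>ca.key</code> already exist in the current directory; the function names, the <code>/C=US/O=OpenSearch</code> subject prefix (chosen to match the example DNs on this page), the 2048-bit key size and the 730-day lifetime are all assumptions. SANs are left out for brevity.

```shell
# Hypothetical helper: build the certificate subject for a node FQDN.
node_subject() {
    printf '%s\n' "/C=US/O=OpenSearch/CN=$1"
}

# Hypothetical helper: create <fqdn>-pkcs8.key and <fqdn>.crt signed by ca.crt/ca.key.
make_node_cert() {
    fqdn="$1"
    # generate a traditional RSA key, then convert it to unencrypted PKCS8
    openssl genrsa -out "${fqdn}-temp.key" 2048 &&
    openssl pkcs8 -topk8 -nocrypt -inform PEM -outform PEM \
        -in "${fqdn}-temp.key" -out "${fqdn}-pkcs8.key" &&
    # create the signing request and have the cluster CA sign it
    openssl req -new -key "${fqdn}-pkcs8.key" \
        -subj "$(node_subject "$fqdn")" -out "${fqdn}.csr" &&
    openssl x509 -req -in "${fqdn}.csr" -CA ca.crt -CAkey ca.key \
        -CAcreateserial -sha256 -days 730 -out "${fqdn}.crt" &&
    rm -f "${fqdn}-temp.key" "${fqdn}.csr"
}

# Hypothetical helper: bundle <user>.crt, <user>-pkcs8.key and the CA into <user>.p12.
# openssl prompts for the export password.
make_user_p12() {
    user="$1"
    openssl pkcs12 -export -in "${user}.crt" -inkey "${user}-pkcs8.key" \
        -certfile ca.crt -name "$user" -out "${user}.p12"
}
```

For example, <code>make_node_cert master-1.opensearch.localnet</code> would leave <code>master-1.opensearch.localnet.crt</code> and <code>master-1.opensearch.localnet-pkcs8.key</code> in the current directory.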

== Install Opensearch Tarball ==
The OpenSearch tarball must be installed on each node in the cluster; there is no difference at this point between the installations on the different types of nodes. Assuming the deployment will use the <code>opensearch</code> user created above ... change to the <code>opensearch</code> user (from root):
su - opensearch
cd ~
Run the following commands as user <code>opensearch</code> (from the opensearch home directory <code>/opt/opensearch</code>):
wget https://artifacts.opensearch.org/releases/bundle/opensearch/1.2.0/opensearch-1.2.0-linux-x64.tar.gz
tar -xvzf opensearch-1.2.0-linux-x64.tar.gz
== Configure OpenSearch ==
Configuration varies depending on the type of node, but these are the places that need to be touched.

=== <code>jvm.options</code> ===
Set the heap size. Common practice is that the heap should be between 4GB and half the available memory, but no higher than 32GB. A minimum configuration might be (assuming the hot data nodes have 32GB installed RAM):

* 4G for master and cold data nodes
* 16G for hot data nodes

This must be done individually for each node.
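
The rule of thumb above can be sketched as a small calculation. The helper names are hypothetical; the output lines use the standard JVM <code>-Xms</code>/<code>-Xmx</code> flags, which are what the heap settings in <code>jvm.options</code> look like. The sketch only applies the "half of RAM, capped at 32GB" part of the rule and does not enforce the 4GB floor.

```shell
# Hypothetical helper: half of installed RAM (in GB), capped at 32.
heap_gb() {
    half=$(( $1 / 2 ))
    if [ "$half" -gt 32 ]; then
        half=32
    fi
    echo "$half"
}

# Hypothetical helper: print the two jvm.options lines for a heap size in GB.
heap_flags() {
    printf '%s\n' "-Xms${1}g" "-Xmx${1}g"
}
```

For a hot data node with 32GB of RAM, <code>heap_flags "$(heap_gb 32)"</code> yields the 16G setting suggested above.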

=== <code>opensearch.yml</code> ===
While the OpenSearch software can be rather forgiving for low-security environments, if you turn on full hostname verification, it is rather strict. The biggest thing to remember is that '''SAN entries in the node certificates aren't always respected.''' What this boils down to is that if full verification is turned on, you cannot use IP addresses in the <code>opensearch.yml</code> (or any other configuration file such as the logstash pipelines).

Each node will have a unique version of opensearch.yml due to certificate file names, the role(s) for the node, and the temperature (for data nodes). The significant parameters in the opensearch.yml file are:
* <code>cluster.name</code> - must be identical in all nodes
* <code>node.name</code> - can be the hostname, but doesn't have to be (must be unique)
* <code>path.data</code> - data nodes only -- must be unique to the node (i.e., if using shared storage, each node must have its own directory on the shared volume)
* <code>path.logs</code> - must be unique to the node (same as above)
* <code>network.host</code> -- host's FQDN where all traffic will be received
**Can be a single hostname or an array of hostnames in <code>["name", "name"]</code> format
**If you are doing transport-level host verification this CANNOT be an IP -- OpenSearch can't handle the SANs in the certificates needed to make it work
* <code>http.port</code> -- leave it as the default (<code>9200</code>) but uncomment it to lock it in
* <code>discovery.seed_hosts</code> -- FQDN for all the masters (same for all nodes) -- only used for cluster bootstrap
* <code>cluster.initial_master_nodes</code> -- put the master node names here (not the hostnames) -- same for all nodes
* <code>node.[master|data|ingest|remote]</code> - set the type of the node (specify all four)
* <code>node.attr.temp</code> - (data nodes only) defines data storage tiers (<code>hot</code>/<code>warm</code>/<code>cold</code>)
The stock configuration file has most of these specified -- edit the file to fit the configuration of the node.
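
Because most of these values differ only by node name and hostname, the per-node portion can be generated rather than hand-edited. This is only a sketch: the function name, the cluster name <code>opensearch-cluster</code> and the single-master seed list are placeholders, and the hostnames reuse the examples from this page.

```shell
# Hypothetical helper: print the node-specific opensearch.yml settings.
#   $1 = node name, $2 = node FQDN
node_settings() {
    name="$1"
    fqdn="$2"
    cat <<EOF
cluster.name: opensearch-cluster
node.name: ${name}
network.host: ${fqdn}
http.port: 9200
discovery.seed_hosts: ["master-1.opensearch.localnet"]
cluster.initial_master_nodes: ["master-1"]
EOF
}
```

The output of something like <code>node_settings data-hot-1 data-hot-1.opensearch.localnet</code> can be reviewed and then merged into that node's <code>opensearch.yml</code> by hand.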

The security configuration must be added next. Since we are not using the demo security as our template, all the relevant entries must be added now. Copy this block into the end of the stock opensearch.yml, editing as needed to provide the relevant certificate file names:
node.master: true
node.ingest: true
node.data: false
#node.remote: false

#node.attr.temp: hot

######## Start OpenSearch Security Configuration ########
plugins.security.ssl.transport.pemcert_filepath: <node>.crt
plugins.security.ssl.transport.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.transport.pemtrustedcas_filepath: <CAcert>
plugins.security.ssl.transport.enforce_hostname_verification: true

plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: <node>.crt
plugins.security.ssl.http.pemkey_filepath: <node>-pkcs8.key
plugins.security.ssl.http.pemtrustedcas_filepath: <CAcert>

plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true

# at least one cert -- must be an EXACT copy of the DN for the certificates to be used for admin auth
plugins.security.authcz.admin_dn:
- CN=Admin,O=OpenSearch,C=US

# these must be EXACT copies of the DN for the node certificates
plugins.security.nodes_dn:
- CN=master-1.opensearch.localnet,O=OpenSearch,C=US
- CN=data-hot-1.opensearch.localnet,O=OpenSearch,C=US

# this is all boiler plate ... no modifications needed
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
# this next bit should ALL be on one line ...
plugins.security.system_indices.indices: [".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opendistro-notifications-*", ".opendistro-notebooks", ".opendistro-asynchronous-search-response*", ".replication-metadata-store"]

node.max_local_storage_nodes: 3

######## End OpenSearch Security Configuration ########
Note that there are two sections where the DNs from the certificates must be provided to enable admin authentication for security configuration, and for validating the nodes that will be part of the cluster. You can easily show the DNs for your certificates using this command:
openssl x509 -subject -nameopt RFC2253 -noout -in <cert file name>
The DN for the admin cert and the DNs for the node certs go in separate sections as shown above. Add all the cluster nodes into the nodes_dn section -- this must be the same for all nodes or they will not be able to communicate.
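
Since the DNs must be exact copies, it may be less error-prone to generate the list entries straight from the certificate files. The sketch below wraps the <code>openssl</code> command shown above; the function names are hypothetical, and the <code>sed</code> expression simply strips the <code>subject=</code> prefix that openssl prints (with or without a trailing space, depending on the openssl version).

```shell
# Hypothetical helper: strip the "subject=" prefix from openssl output.
strip_subject() {
    printf '%s\n' "$1" | sed 's/^subject= *//'
}

# Hypothetical helper: print the RFC2253 DN of one certificate file.
cert_dn() {
    strip_subject "$(openssl x509 -subject -nameopt RFC2253 -noout -in "$1")"
}

# Hypothetical helper: print one YAML list entry per certificate file.
dn_list() {
    for cert in "$@"; do
        printf '%s\n' "- $(cert_dn "$cert")"
    done
}
```

Running <code>dn_list</code> over all the node certificates produces lines that can be pasted under <code>plugins.security.nodes_dn:</code>.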

=== Certificates ===
Copy the following certificates/keys that were created previously into the config directory of each node:

* <CAcert>
* <node>.crt
* <node>-pkcs8.key
* admin.crt
* admin-pkcs8.key
Verify that the <code>opensearch</code> user is able to at least read all of the certs/keys, or the cluster will not function.
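
One way to do that check is a small loop over the files, run while logged in as the <code>opensearch</code> user. The function name is hypothetical; it only tests read permission and does not validate the certificates themselves.

```shell
# Hypothetical helper: report every listed file the current user cannot read.
# Returns non-zero if at least one file is unreadable or missing.
check_readable() {
    status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "not readable: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, after <code>su - opensearch</code>, calling <code>check_readable config/*.crt config/*.key</code> from the OpenSearch directory lists anything that still needs its ownership or mode fixed.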

=== Security Plugin Configuration ===
While the security configuration files exist on all the nodes in the cluster, they are only used once and only by the node where the security initialization is performed. With that in mind, pick one node (the first master node will do nicely) to edit the configuration files as needed. There are many things that can be done with these files, but two items in particular should be addressed. The files are located in the <code>plugins/opensearch-security/securityconfig</code> directory in the opensearch distribution.

==== Authentication ====
<code>config.yml</code> is where the authentication options are established. This will need to be configured to support your authentication options. The OpenSearch documentation describes the different options [https://opensearch.org/docs/latest/security-plugin/configuration/configuration/ here].

The default configuration supports basic username/password authentication, but if you wish to use certificate authentication for API clients or ingest pipelines using logstash, you need to enable both '''http''' and '''transport''' under <code>clientcert_auth_domain</code>. Proxy authentication and OpenID Connect authentication are possible as well ... see the documentation.

==== Admin user password ====
The default password for the admin user is well-known, and should be changed for a production environment. The OpenSearch documentation provides instructions on how to [https://opensearch.org/docs/latest/security-plugin/configuration/yaml/#internal_usersyml generate passwords] for the <code>internal_users.yml</code> file. Note that '''the admin password cannot be set through the dashboard or the API''' -- it must be set in this file, and then (re)loaded to take effect.
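
As a sketch of that workflow: the security plugin ships a <code>hash.sh</code> tool next to <code>securityadmin.sh</code>, and running it as <code>plugins/opensearch-security/tools/hash.sh -p '<new password>'</code> is assumed here to print the hash to paste into the file. The helper below (a made-up name) just formats an <code>admin</code> entry around such a hash; the field layout follows the stock <code>internal_users.yml</code>, but check it against the file shipped with your version.

```shell
# Hypothetical helper: print an internal_users.yml entry for the admin user.
#   $1 = password hash as printed by hash.sh
admin_entry() {
    printf '%s\n' \
        'admin:' \
        "  hash: \"$1\"" \
        '  reserved: true' \
        '  backend_roles:' \
        '  - "admin"'
}
```

Pass the hash in single quotes (it contains <code>$</code> characters) so the shell does not try to expand it.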

=== Security Plugin Initialization ===
Once all security configuration files are finished, they must be loaded into the OpenSearch cluster. This can only be done once enough of the cluster is up to provide a master node and at least one data node, as the security configuration is loaded into an index in the cluster. See the instructions below for specific sequencing. This command will initialize the security plugin:
plugins/opensearch-security/tools/securityadmin.sh \
-cacert config/<CA cert> -cert config/admin.crt -key config/admin-pkcs8.key \
-cd plugins/opensearch-security/securityconfig/ -host master-1.opensearch.localnet -icl -nhnv
... but again -- wait to do this until the right time below.

=== Starting The Node ===
Subject to the constraints described in the next section, the only thing left is to start the opensearch service:
sudo systemctl enable --now opensearch
Again, this must be done on each node in the correct sequence ... see below.

== Cluster Bootstrapping ==
Starting up a multi-node cluster is a bit tricky:

* Master-only nodes have no data
* Data-only nodes aren't master eligible
* Cluster initialization requires indexes to be available

'''Note that this process does NOT use the demo configuration -- put all the security plugin items in the <code>opensearch.yml</code> file from the beginning'''. This will cause the <code>opensearch-tar-install.sh</code> script to skip the security initialization -- this is intentional.

What this means is that you have to stand up multiple nodes before anything starts working ... the process should be:

# Install and start what will be the 'primary' master ... reference point for cluster startup operations
# Install and start one of the 'hot' data nodes
# Run the 'securityadmin.sh' script (using the command above) to initialize the security subsystem
# Install/start the other master nodes so that there is resiliency as nodes go up and down
# Install/start the other data nodes to allow replication of the system indexes.
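
The timing in step 3 matters: the security initialization only works once the first master and the first data node are answering. A minimal sketch of waiting for that is shown below. The function names are hypothetical, and the check only confirms that the node returns some HTTPS response on port 9200 (an error response before initialization still counts); it does not confirm that the cluster has formed.

```shell
# Hypothetical helper: run a command up to N times, sleeping between attempts.
#   $1 = attempts, $2 = delay in seconds, remaining args = command to run
retry_until() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical helper: succeeds as soon as the node returns any HTTPS response.
# -k skips certificate verification; this is a liveness probe only.
node_answers() {
    curl -sk -o /dev/null "https://$1:9200/"
}
```

For example, <code>retry_until 30 10 node_answers master-1.opensearch.localnet</code> followed by the same call for the first hot data node gives a reasonable point at which to run <code>securityadmin.sh</code>.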

== Install OpenSearch Dashboards ==
Finally, install the OpenSearch Dashboards using the tarball as well.

Run the following commands as opensearch user:
wget https://artifacts.opensearch.org/releases/bundle/opensearch-dashboards/1.2.0/opensearch-dashboards-1.2.0-linux-x64.tar.gz
tar -zxf opensearch-dashboards-1.2.0-linux-x64.tar.gz

Configure the <code>opensearch-dashboards.yml</code> file as needed for your situation.

Create file <code>/etc/systemd/system/opensearch-dashboards.service</code> (as root) with the following contents:
[Unit]
Description=OpenSearch-Dashboards

[Service]
Type=simple
User=opensearch
Group=opensearch
ExecStart=/opt/opensearch/opensearch-dashboards-1.2.0/bin/opensearch-dashboards
Restart=always
WorkingDirectory=/opt/opensearch/opensearch-dashboards-1.2.0

[Install]
WantedBy=multi-user.target
Run the following commands as root:
systemctl daemon-reload
systemctl start opensearch-dashboards.service
systemctl enable opensearch-dashboards.service

== References ==

* https://cryptopoolparty.com/blog/cardano-stake-pool-monitoring-ofo-stack/
** Describes how to use OpenSearch to monitor another application -- but portions were used to create this document
*https://github.com/opensearch-project/dashboards-reports/releases/tag/chromium-1.12.0.0
**download for headless chrome for generating PDF/PNG reports

Pages that need to be updated

2021-11-25T13:14:46Z

DrEdWilliams:

Fedora 33 - Workstation Installation

2021-11-25T13:05:44Z

DrEdWilliams: /* System Configuration */

After doing the [[Basic Fedora 31+ Installation]], these are the steps required to get the workstation up to general functionality.

It is assumed that either the Fedora Workstation (Gnome desktop) or KDE Plasma spin is used as the base distribution and the desktop is configured and functioning properly.

== System Configuration ==
In addition to the [[Basic Fedora 31+ Installation]] and [[Common Post-Install]] instructions:
* Select KDE as the desktop environment
*Mount the /files and /backup filesystems from storage1 (NFSv4 mount)
* Configure (or verify) the network devices:
** WilliamsNet (10.0.0.0/24)
** StorageNet (10.1.0.0/24) - optional if needed (if device is available or VLAN device definition)
** iSCSINet (10.2.0.0/24) - optional if needed (may require VLAN device definition)
* [[CUDA + NVIDIA drivers]] (if NVIDIA GPU installed)

== General Utilities ==
A collection of tools that are useful in a workstation; some may already be installed, but just covering the bases:
* [http://www.webmin.com/download.html Webmin] - web-based system administration
* Google Chrome
** use the repo-based installation on the Google instructions page so you get updates
** make it the default browser by searching for 'Application' in the KDE settings app and selecting it for the default web browser.
* pdsh - parallel shell
** see [https://www.admin-magazine.com/HPC/Articles/pdsh-Parallel-Shell pdsh article] in ADMIN Magazine
** in fedora repo -- need to specify the SSH module since it isn't included by default (??):
sudo dnf install -y pdsh pdsh-rcmd-ssh
* Synergy (see [[Synergy Installation]])
* Java
** either the openjdk in the fedora repo or the 'real stuff' from Oracle
** Oracle Java 8 (the real thing) is required to support Java Web Start applications (like the EqualLogic Group Manager tool)
* MATE or gnome Terminal -- it's just better than the KDE konsole app, though konsole has gotten a lot better
sudo dnf install -y gnome-terminal

== Communications and Collaboration Tools ==
There are a lot ... but these cover much of the landscape:
* Microsoft Teams
** download client from https://teams.microsoft.com
** default install will establish a teams repo to facilitate updates
* Slack
** download client from https://slack.com/download
* Zoom
** download client from https://zoom.com
* Kopete (Jabber client)
** in fedora repo
* VMware Horizon Client (for access to SAIC VDS)
** download from SAIC VDS webpage

== Development Tools ==
* VS Code
** download client at https://code.visualstudio.com/download
** will create a repo to get updates
** if possible, copy ".vscode*" directories from an existing configuration
** '''''LOTS''''' of plugins/modules/addons/extensions ... easy to waste a lot of time configuring
* Podman -- rootless docker replacement ... in the repo
* Docker-CE -- is this even available?

== Extended/Optional Stuff ==
This is not intended to be a complete list of all the random/useful stuff I do on workstations, but just a catch-all for links and reminders

=== Enable RPM fusion ===
If not already done as part of the NVIDIA/CUDA driver debacle, install it -- there are useful things there
sudo dnf install https://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm
sudo dnf install https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm

Pages that need to be updated

2021-11-25T13:04:10Z

DrEdWilliams:

[[SSH Host Key Management]]

[[Diggle]] - manage configuration in GitLab

[[MediaWiki Deployment]] (needs to be created)

Update Architecture to reflect [[OpenSearch]] and Elastic clusters

[[Synergy Installation]] (needs to be created)

Basic Fedora 31+ Installation

2021-11-25T12:56:42Z

DrEdWilliams:

== Fedora Distribution Install ==
Fedora 31 install from DVD install disk - this process should work for most (if not all) of the variants of Fedora (Workstation, Server, KDE Plasma).

During install process:
* use custom formatting
* set partition type to 'standard'
* click on 'automatically create partitions'
* delete /home partition and increase size of / to fill the disk
* change all filesystem types to 'xfs'
* set hostname (if known)
* enable all basic network interfaces (do not rename -- it will crash the installer)

== Basic system prep ==
Most of the initial configuration is now contained in a script that can be executed directly from the config server as root on the target system:

curl -s http://config/config/fedora-basic-config.sh | bash

The contents of this script are included here for reference, though updates to the script may occur without updates to this page:

<pre>#!/bin/sh
#
# script to do the basic install of a centos7 headless server
#
# Assumptions:
# - this is run as root immediately after the install has completed
# - the hostname has been set as desired before this script is run
# - an administrator account 'ewilliam' was created during installation

CONFIG=http://kube202/config

# first -- install all the basic necessities (some may already be there)
yum -y install net-tools rsync zsh yum-utils mlocate drpm
yum -y install sshfs nfs-utils ssmtp psmisc git
yum -y remove firewalld postfix

# things that seem to not be included in the server version?
yum -y install cronie cronie-anacron
systemctl enable --now crond

# Enable automatic updates.
# Parameters can be set in the config files '/etc/dnf/automatic.conf'
yum -y install dnf-automatic
systemctl enable --now dnf-automatic.timer

# now get all the updates
yum -y update

# now disable SELinux (no comment)
sed -i 's/=enforcing/=disabled/g' /etc/selinux/config
setenforce 0

# Enable the SSH server (why it's not enabled by default is strange ...)
systemctl enable --now sshd

# make sudo passwordless for group wheel
echo "%wheel ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers.d/91-wheel-NOPASSWD

# now that zsh is installed, make it the default
chsh -s /bin/zsh
chsh -s /bin/zsh ewilliam

# copy the standard zsh config files
curl -s $CONFIG/centos7/.zshrc -o /root/.zshrc
curl -s $CONFIG/centos7/.zshrc -o /home/ewilliam/.zshrc
chown ewilliam:ewilliam /home/ewilliam/.zshrc

# load ssh known_hosts from the config server
mkdir -p /root/.ssh
curl -s $CONFIG/ssh/known_hosts -o /root/.ssh/known_hosts
mkdir -p /home/ewilliam/.ssh
curl -s $CONFIG/ssh/known_hosts -o /home/ewilliam/.ssh/known_hosts
chown -R ewilliam:ewilliam /home/ewilliam/.ssh

# get backup scripts
curl -s $CONFIG/centos7/backup -o /etc/cron.daily/backup
curl -s $CONFIG/centos7/rsync_backup.sh -o /usr/local/bin/rsync_backup.sh
chmod +x /usr/local/bin/rsync_backup.sh

# configure log server
curl -s $CONFIG/centos7/99-remotelog.conf -o /etc/rsyslog.d/99-remotelog.conf
systemctl restart rsyslog

# enable mail to the central email server
echo "root: ewilliam@williams.localnet" >> /etc/aliases
echo "ewilliam: ewilliam@williams.localnet" >> /etc/aliases

# get the standard /etc/hosts file
curl $CONFIG/hosts -o /etc/hosts

# install metricbeat
curl -s $CONFIG/centos7/elastic.repo > /etc/yum.repos.d/elastic.repo
yum -y install metricbeat
curl -s $CONFIG/centos7/metricbeat.yml > /etc/metricbeat/metricbeat.yml

# install the host/known_hosts synchronization
curl -s $CONFIG/host_check.sh > /etc/cron.hourly/host_check
chmod +x /etc/cron.hourly/host_check

# now reboot
reboot</pre>

== Next Steps ==
The rest of the installation process is not OS-dependent and is described [[Common Post-Install|here]].

Basic CentOS 7 Installation

2021-11-25T12:56:11Z

DrEdWilliams:

== CentOS Distribution Install ==
CentOS 7 minimal install from DVD install disk.

During install process:
* use custom formatting
* set partition type to 'standard'
* click on 'automatically create partitions'
* delete /home partition and increase size of / to fill the disk
* set hostname (if known)
* enable all basic network interfaces (do not rename -- it will crash the installer)

== Basic system prep ==
Most of the initial configuration is now contained in a script that can be executed directly from the config server as root on the target system:

curl -s http://config/config/centos7-basic-config.sh | bash

The contents of this script are included here for reference, though updates to the script may occur without updates to this page:

<pre>#!/bin/sh
#
# script to do the basic install of a centos7 headless server
#
# Assumptions:
# - this is run as root immediately after the install has completed
# - the hostname has been set as desired before this script is run
# - an administrator account 'ewilliam' was created during installation

CONFIG=http://config/config

# first -- install all the basic necessities (some may already be there)
yum -y install net-tools rsync zsh epel-release yum-cron yum-utils mlocate deltarpm
yum -y install sshfs nfs-utils ssmtp psmisc
yum -y remove firewalld postfix
yum autoremove -y NetworkManager NetworkManager-libnm
systemctl enable --now yum-cron
yum -y update

# now disable SELinux (no comment)
sed -i 's/=enforcing/=disabled/g' /etc/selinux/config
setenforce 0

# make sudo passwordless for group wheel
echo "%wheel ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers.d/91-wheel-NOPASSWD

# now that zsh is installed, make it the default
chsh -s /bin/zsh
chsh -s /bin/zsh ewilliam

# copy the standard zsh config files
curl -s $CONFIG/centos7/.zshrc -o /root/.zshrc
curl -s $CONFIG/centos7/.zshrc -o /home/ewilliam/.zshrc
chown ewilliam:ewilliam /home/ewilliam/.zshrc

# load ssh known_hosts from the config server
mkdir -p /root/.ssh
curl -s $CONFIG/ssh/known_hosts -o /root/.ssh/known_hosts
mkdir -p /home/ewilliam/.ssh
curl -s $CONFIG/ssh/known_hosts -o /home/ewilliam/.ssh/known_hosts

# get backup scripts
curl -s $CONFIG/centos7/backup -o /etc/cron.daily/backup
curl -s $CONFIG/centos7/rsync_backup.sh -o /usr/local/bin/rsync_backup.sh
chmod +x /usr/local/bin/rsync_backup.sh

# configure log server
curl -s $CONFIG/centos7/99-remotelog.conf -o /etc/rsyslog.d/99-remotelog.conf
systemctl restart rsyslog

# enable mail to the central email server
echo "root: ewilliam@williams.localnet" >> /etc/aliases
echo "ewilliam: ewilliam@williams.localnet" >> /etc/aliases

# get the standard /etc/hosts file
curl $CONFIG/hosts -o /etc/hosts

# install metricbeat
curl -s $CONFIG/centos7/elastic.repo > /etc/yum.repos.d/elastic.repo
yum -y install metricbeat
curl -s $CONFIG/centos7/metricbeat.yml > /etc/metricbeat/metricbeat.yml

# install the host/known_hosts synchronization
curl -s $CONFIG/host_check.sh > /etc/cron.hourly/host_check
chmod +x /etc/cron.hourly/host_check

# now reboot
reboot</pre>

== Next Steps ==
The rest of the installation process is not OS-dependent and is described [[Common Post-Install|here]].