Cluster Operations
== Rolling Restart ==

The OpenSearch cluster is very resilient -- it tries to make sure it has all of its indices appropriately replicated at all times. It can handle nodes dropping out and being added -- but sometimes we don't want that. If nodes are simply restarted (to change settings) or the system is rebooted (for patching), we don't want OpenSearch to move all its indices around.

This process works for a whole-cluster restart (not recommended unless absolutely necessary) or for individual node restarts. If individual nodes are being restarted, do only one node at a time to avoid cluster panic should all shards of an index go offline on the restarted nodes. The cluster '''''should''''' recover, but why stress it unnecessarily? The process is described in the [https://www.elastic.co/guide/en/elasticsearch/reference/current/restart-cluster.html Elastic documentation] and summarized below.

=== Disable Shard Allocation ===

When you shut down a data node, the allocation process waits for <code>index.unassigned.node_left.delayed_timeout</code> (by default, one minute) before it starts replicating the shards from that node onto other nodes in the cluster, which can involve a lot of I/O. Since the node is shortly going to be restarted, this I/O is unnecessary. You can avoid racing the clock by disabling allocation of replicas before shutting down data nodes:

 curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
 {
   "persistent": {
     "cluster.routing.allocation.enable": "primaries"
   }
 }'

=== Flush Cache ===

Flushing commits any in-memory operations to disk, which shortens shard recovery when the node comes back:

 curl -X POST "localhost:9200/_flush?pretty"

(The older <code>_flush/synced</code> endpoint is deprecated and has been removed from recent versions; a plain flush does the job here.)

=== Shut Down/Restart Nodes ===

Now that things have settled out, we can stop/restart the opensearch services. Note that a full cluster shutdown should be avoided, since any activity against the cluster as it goes down or comes up could surface errors to users. Use the systemd commands to stop/start/restart the nodes:

 # do this on all nodes
 sudo systemctl stop opensearch
 ... do required activity on all nodes ...
 sudo systemctl start opensearch

... or ...

 # restart a single node
 sudo systemctl restart opensearch

If it was a complete cluster shutdown, all nodes must rejoin the cluster before proceeding. In either case, wait until the cluster (or node) achieves a "YELLOW" state before proceeding ... this means that all primary shards are accounted for, but replicas may not be.

=== Enable Shard Allocation ===

As each node restarts, it tries to identify which shards it had before. Unless something has happened to disrupt the node's data, it will simply reactivate those shards (most likely as replica shards if it is a single-node restart). To let the cluster get back to normal shard management, we remove the block we set above:

 curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
 {
   "persistent": {
     "cluster.routing.allocation.enable": null
   }
 }'

=== Follow Through ===

When doing any activity that disrupts cluster operations enough to drop it into a "YELLOW" or "RED" state, wait until the cluster is "GREEN" again before proceeding. In the case of a full rolling restart (one node at a time), the cluster should be "GREEN" before moving on to the next node.
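The wait for a "YELLOW" or "GREEN" state can be scripted against the cluster health API, which blocks until the requested status is reached or the timeout expires. A minimal sketch, assuming the same local endpoint used above (add authentication to match your cluster's setup):

 # block for up to two minutes waiting for the cluster to report green
 curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=120s&pretty"

If the timeout expires first, the response comes back with <code>"timed_out": true</code>, so check the <code>status</code> field in the output rather than trusting the call's return alone.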
This process (the node rolling restart, anyway) is captured in the script <code>node-restart.sh</code>:

 #!/bin/bash
 #
 # perform a soft restart of a node, minimizing the reindexing that would
 # normally occur when a node drops/inserts
 
 # set to one of the master nodes
 MASTER=master-node
 
 # replace with either the real admin password or a cert/key combination
 AUTH="-u admin:admin"
 
 # first shut off shard allocation
 curl -XPUT -H "Content-Type: application/json" <nowiki>https://$MASTER:9200/_cluster/settings</nowiki> $AUTH -d '
 {
   "persistent": {
     "cluster.routing.allocation.enable": "primaries"
   }
 }'
 echo
 
 # now flush the cache
 curl -XPOST <nowiki>https://$MASTER:9200/_flush</nowiki> $AUTH
 echo
 
 # restart the opensearch process
 sudo systemctl restart opensearch
 
 # re-enable shard allocation
 curl -XPUT -H "Content-Type: application/json" <nowiki>https://$MASTER:9200/_cluster/settings</nowiki> $AUTH -d '
 {
   "persistent": {
     "cluster.routing.allocation.enable": null
   }
 }'
 echo

This script is intended to be run on the node being restarted.

== Cluster Upgrade ==

Assuming there are no other adjustments due to features and/or bugfixes, the basic process for upgrading a cluster is to upgrade one node at a time:

# Expand the tarball in the home directory of the <code>opensearch</code> user
# Copy the certificates, keys, and CAs from the <code>config</code> directory of the current version into the new one
# Copy the <code>opensearch.yml</code> file from the current version into the new one and make any adjustments
# Copy the <code>jvm.options.d/heapsize.options</code> file from the current version into the new one
# Move the <code>opensearch</code> link in the home directory from the current version to the new one
# Reload and restart the opensearch process using the '''Rolling Restart''' process described above
# Wait for the cluster to reach a "GREEN" state before moving to the next node

This assumes that the systemd unit file uses the softlink to access the opensearch executable as described in [[OpenSearch Cluster Installation]]. The file shuffle is sketched below.
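A minimal sketch of that file shuffle, assuming a tarball install in the <code>opensearch</code> user's home directory -- the version numbers and certificate file names here are placeholders, not the real ones:

 #!/bin/bash
 # run as the opensearch user, in its home directory
 OLD=opensearch-2.11.0
 NEW=opensearch-2.12.0
 
 # 1. expand the new tarball alongside the current version
 tar xzf $NEW-linux-x64.tar.gz
 
 # 2. carry over certificates, keys, and the CA bundle (names are examples)
 cp $OLD/config/*.pem $OLD/config/*.key $OLD/config/*.crt $NEW/config/
 
 # 3./4. carry over the main config and the heap-size override
 cp $OLD/config/opensearch.yml $NEW/config/
 mkdir -p $NEW/config/jvm.options.d
 cp $OLD/config/jvm.options.d/heapsize.options $NEW/config/jvm.options.d/
 
 # 5. repoint the softlink the systemd unit file uses
 ln -sfn $NEW opensearch

From there, restart the node using the '''Rolling Restart''' process and wait for "GREEN" before moving to the next node.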
== Certificate Updates ==

Given that the standard lifetime for certificates these days is one year, this is something that will need to be done several times during the lifetime of a cluster. The process does not require downtime for the cluster, but care must be taken to ensure that the mix of old and new certificates does not cripple it. It involves up to three complete cluster restarts (using the '''Rolling Restart''' method described above), but it does not disrupt the operation of the cluster as long as nothing exceptional happens and the cluster is allowed to stabilize between node restarts.

=== Create New Certificates ===

Using the process outlined in [https://wiki.williams-net.org/wiki/OpenSearch_Cluster_Installation#Create_Certificates OpenSearch Cluster Creation], create the required certificates and keys. If at all possible, keep the DNs on all certificates the same as the ones they are replacing ... it reduces the number of restarts and minimizes the edits to the config files. It also helps if the node certificate/key files keep the same names, for the same reasons.

Send each node its new certificate/keys and the admin certificate/key -- or make them available for remote retrieval. If user certificates were created, send them to the system(s) where they will be used.

=== Create CA Bundle ===

Using a CA bundle containing both the old and new CAs allows a gradual transition and avoids downtime. Fortunately, creating the bundle is easy:

 cat <old CA> <new CA> > rootCAbundle.crt

Again, if there is currently a bundle in place, use the same name for the bundle file and you'll save yourself a lot of editing. If the REST API already uses a different CA bundle that supports other (external) CAs, create a second bundle that includes the external CAs, the old internal CA, and the new internal CA.

Send the new CA bundle to all cluster nodes, dashboard instances, and logstash instances.

=== Update CA Files ===

For each cluster node:

* Copy the new CA bundle into place (in the opensearch <code>config</code> directory), making sure that permissions allow read access for the <code>opensearch</code> user.
* DO NOT copy the node certificates yet.
* Edit <code>opensearch.yml</code> if needed:
** Update the name of the CA to the new CA bundle file (if needed) in both the <code>ssl</code> and <code>transport</code> parameters, using the appropriate bundle for each parameter (see above)
** Add the DNs (DO NOT replace them yet) for the node and admin certificates if the new DNs are different from what is in the file
* Perform a '''Rolling Restart''' of the node, verifying that there are no SSL/certificate errors, and waiting for "GREEN" status before proceeding

Now do the same for all dashboards, logstash instances, and any other clients that use the CA to validate the cluster's identity:

* Install the CA bundle as appropriate, ensuring that the client has read access to it.
* Update the CA filename in the configuration file (if needed)
* DO NOT update user certificates at this point
* Restart the client (if needed) and verify proper operation

=== Update Node Certificates ===

For each node in the cluster:

* Copy the node certificate and key and the admin certificate and key into the opensearch <code>config</code> directory, verifying that the <code>opensearch</code> user has read permission
* Edit <code>opensearch.yml</code> if needed to correct the certificate file names
* DO NOT remove extra DNs from the admin or node sections of the file yet
* Perform a '''Rolling Restart''' of the node, waiting for the cluster to achieve "GREEN" status before going to the next node

=== Update Client Certificates ===

For each client that uses certificate authentication:

* Copy the client certificate bundle (.p12 file) into place
* Edit the configuration file if needed to reflect the new file
* Restart the client (if needed) to activate the new configuration

=== Clean Up ===

At this point we can remove any old DNs from the <code>opensearch.yml</code> configurations and in general remove all traces of the old certificates from the cluster nodes and clients. If a restart is warranted to verify that nothing critical was deleted, do it now.

It is not necessary to replace the new CA bundle with just the new CA certificate -- the old cert is simply along for the ride at this point, and once it reaches its expiration date it will not be used. If it must be done, use the same process that was used to install the CA bundle, performing a rolling restart on each node as it is completed.

== Troubleshooting ==

Setting debug-level logging for opensearch and opensearch-dashboards:

 # Opensearch log4j2.properties:
 rootLogger.level = debug
 
 # Opensearch Dashboards opensearch_dashboards.yml:
 logging.verbose: true
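Log levels on the cluster can also be raised on the fly through the cluster settings API, with no file edits or restarts. A minimal sketch -- the logger name here is an example; substitute the package you actually need to watch:

 # raise transport-layer logging to DEBUG (transient, so it clears on restart)
 curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
 {
   "transient": {
     "logger.org.opensearch.transport": "DEBUG"
   }
 }'

Setting the same key back to <code>null</code> returns that logger to whatever <code>log4j2.properties</code> configures.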