(storage) ceph is amazing

If you haven’t tried out ceph yet and aren’t completely satisfied with your on-prem storage system, I recommend giving it a try. (Note: it does want a lot of CPU, so heads up on that.)

** I acknowledge that I am currently excited and completely captivated by ceph.  I’m still fairly new to using ceph, so you might want to check my facts, a.k.a. I’m tempting you to start investigating. 😉

Ceph can use multiple disks across multiple servers and spread out the load, maintaining two or three copies of the data to avoid data loss. Just look at my humble 4-disk system (2 nodes with 2 disks each) with the data spread out nearly perfectly across the disks.
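If you want to see that distribution yourself, the ceph CLI will show it; a few read-only commands I keep coming back to (run from anywhere with ceph admin access, e.g. the tools pod mentioned further down):

ceph status         # overall health, capacity and any recovery activity
ceph osd df tree    # per-OSD (per-disk) utilization, grouped by host
ceph df             # usage broken down by pool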

Ceph provides cephfs (the ceph filesystem) and ceph rbd (ceph block storage).

Block storage gives you something like a raw disk; it’s useful for, say, creating a vm that you later want to hot-swap between servers.

Otherwise, the ceph filesystem is what you want to use, though typically not directly.

You’ll end up creating a pool backed by the ceph filesystem (cephfs), then creating pvcs that come from that pool. You can also use iscsi (which sits on ceph rbd) or nfs (which sits on cephfs) if you have a consumer that is not able to connect to cephfs or ceph rbd directly.
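To make that concrete, here’s a minimal sketch of the kubernetes side with the rook operator (assuming rook is installed in the rook-ceph namespace; the filesystem name ‘myfs’ and storage class name ‘rook-cephfs’ are just placeholders I picked for the example):

# create a cephfs filesystem and a storage class that provisions pvcs from it
kubectl apply -f - <<'EOF'
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 2            # 2 copies to match my 2-node lab; 3 is the usual recommendation
  dataPools:
    - name: replicated
      replicated:
        size: 2
  metadataServer:
    activeCount: 1
    activeStandby: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com   # <operator namespace>.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-replicated
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
EOF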

In my experience working with storage systems and kubernetes persistent volumes, I had no luck with RWX (ReadWriteMany), even when the providers claimed it worked (nfs from my own linux server, and longhorn, which uses nfs for rwx). Two apps, ‘plex’ and ‘nextcloud’, would consistently experience database corruption after only a few minutes.

People continually told me “NFS supports RWX” and “Just use iSCSI, that’s perfect for apps that use SQLite”. I tested those claims against a TrueNAS Core server and did not get the results that were promised.

However, with ceph you can allocate a pvc using cephfs, and this works perfectly with RWX! Awesome! And super fast! All my jenkins builds, which run in kubernetes, sped up by about 15 seconds versus TrueNAS iSCSI. Of course… this could just mean the physical disks behind ceph are faster than the physical disks behind my truenas server; I can’t be sure.
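Requesting RWX from that cephfs storage class is just an ordinary pvc (again a sketch; the storage class name matches the placeholder used above):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany          # RWX: pods on different nodes can mount it at the same time
  storageClassName: rook-cephfs
  resources:
    requests:
      storage: 10Gi
EOF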

Now I suspect that if I created pvcs using NFS and iSCSI, which sit on top of cephfs and ceph rbd, they might also support RWX. I’m very curious, but given how fast cephfs is and that everything is working perfectly, I can’t see any reason to use NFS or iSCSI (except for vm disks).

Using the helm install of ceph, you end up with a tools pod that you can exec into and run the ‘ceph’ cli. The dashboard gui is pretty great, but the real interaction with ceph happens at the command line. I’ve been in there breaking and fixing things, and the experience feels like a fully fledged product ready for production. There is so much there that I can see someone managing ceph as a career, with a deep dive available as far as you care to go. And if you break it enough and then fix it, you get to watch it move data around and recover things, which, from the geekiest perspective, is as cool as can be.
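For reference, getting at that cli looks something like this (assuming the rook defaults, i.e. everything in the rook-ceph namespace and the toolbox deployment named rook-ceph-tools):

# open a shell in the tools pod
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

# then, inside the pod, poke around
ceph -s                  # cluster status at a glance
ceph health detail       # why the cluster is (or isn’t) healthy
ceph osd df tree         # per-disk utilization
ceph -w                  # follow the cluster log live (fun to watch during recovery)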

In any case, given how long cephfs has been around and how widely it’s used in production environments, I think I’ve found my storage solution for the foreseeable future.

Ceph – Getting Started

thinking out loud: micro service – basic

Micro-service - basic

  Singleton service
    - establish outgoing socket(s)
      - re-establish if dropped
    - listens on port 80
      - re-establish if dropped
    - process network data
      - multiple outgoing websockets
      - multiple incoming websockets
      - rest api methods

  Environment variables
    - api key for outgoing socket(s)
      - if no api key, try anonymous
    - outgoing target host(s)
    - database connection values
    - incoming, allow anonymous?

  Incoming websocket
    /ws
    ... api key (optional / required)

  REST Api
    ... api key (optional / required)
    /auth - oidc logic, get api key
      - api key stored to provided db
      - default api key timeout
      - can specify expiration
    swagger - auto generate

  Cleanup
  - dropped sockets managed
  - dropped db sockets managed
  - disable watching appsettings.json

(centos) k8s-update.sh – script to upgrade a kubernetes cluster

Script to update a kubernetes cluster to the next patch or minor version.

#!/bin/bash

# if no parameter, show versions and syntax
if [ -z "$1" ]; then
  # show available versions
  yum list --showduplicates kubeadm --disableexcludes=kubernetes

  # show syntax
  echo ""
  echo "Syntax:"
  echo "$0 <version>, e.g. $0 1.26.x-0"
  exit 1
fi

# remember version
export TARGET_VERSION=$1


# configure kubectl to use admin config
export KUBECONFIG=/etc/kubernetes/admin.conf

# track first control plane node
export IS_FIRST=1

# loop through control plane nodes
#kubectl get nodes --no-headers | xargs -n 5 echo
NODES=$(kubectl get nodes --no-headers | awk '{print $1}')
for NODE in $NODES; do
  # parse kubectl node output into parameters (columns: NAME STATUS ROLES AGE VERSION)
  NODE_HOSTNAME=$(kubectl get node "$NODE" --no-headers | awk '{print $1}')
  NODE_TYPE=$(kubectl get node "$NODE" --no-headers | awk '{print $3}')
  NODE_VERSION=$(kubectl get node "$NODE" --no-headers | awk '{print $5}')

  # only work on control plane nodes in this loop
  if [[ "$NODE_TYPE" != *control-plane* ]]; then
     #echo ""
     #echo "skipping worker node"
     continue
  fi

  echo ""
  echo "***"
  echo "* Next: $NODE_HOSTNAME"

  # upgrade kubeadm
  echo "upgrade to: $TARGET_VERSION"
  ssh root@$NODE_HOSTNAME yum install -y kubeadm-$TARGET_VERSION --disableexcludes=kubernetes

  # verify the download works and has the expected version
  #ssh root@$NODE_HOSTNAME kubeadm version

  # verify the upgrade plan
  #ssh root@$NODE_HOSTNAME kubeadm upgrade plan

  # perform the update
  if [ "$IS_FIRST" == "0" ]; then
    ssh root@$NODE_HOSTNAME kubeadm upgrade node
  else
    # if this is the first control plane node its command is a little different
    ssh root@$NODE_HOSTNAME kubeadm upgrade apply --yes v$TARGET_VERSION

    # adjust tracking now that we've completed the first control plane node
    export IS_FIRST=0
  fi

  # drain node & prepare for updating
  kubectl drain $NODE_HOSTNAME --delete-emptydir-data --ignore-daemonsets

  # update kubelet & kubectl
  ssh root@$NODE_HOSTNAME yum install -y kubelet-$TARGET_VERSION kubectl-$TARGET_VERSION --disableexcludes=kubernetes

  # restart kubelet
  ssh root@$NODE_HOSTNAME systemctl daemon-reload
  ssh root@$NODE_HOSTNAME systemctl restart kubelet

  # uncordon the node
  kubectl uncordon $NODE_HOSTNAME
  
done


# loop through worker nodes
NODES=$(kubectl get nodes --no-headers | awk '{print $1}')
for NODE in $NODES; do
  # parse kubectl node output into parameters (columns: NAME STATUS ROLES AGE VERSION)
  NODE_HOSTNAME=$(kubectl get node "$NODE" --no-headers | awk '{print $1}')
  NODE_TYPE=$(kubectl get node "$NODE" --no-headers | awk '{print $3}')
  NODE_VERSION=$(kubectl get node "$NODE" --no-headers | awk '{print $5}')

  # only work on worker nodes in this loop (skip control plane nodes)
  if [[ "$NODE_TYPE" == *control-plane* ]]; then
     #echo ""
     #echo "skipping control plane node"
     continue
  fi

  echo ""
  echo "***"
  echo "* Next: $NODE_HOSTNAME"

  # upgrade kubeadm
  echo "upgrade to: $TARGET_VERSION"
  ssh root@$NODE_HOSTNAME yum install -y kubeadm-$TARGET_VERSION --disableexcludes=kubernetes

  # verify the download works and has the expected version
  #ssh root@$NODE_HOSTNAME kubeadm version

  # verify the upgrade plan
  #ssh root@$NODE_HOSTNAME kubeadm upgrade plan

  # perform the update (worker nodes always use 'kubeadm upgrade node')
  ssh root@$NODE_HOSTNAME kubeadm upgrade node

  # drain node & prepare for updating
  kubectl drain $NODE_HOSTNAME --delete-emptydir-data --ignore-daemonsets

  # update kubelet & kubectl
  ssh root@$NODE_HOSTNAME yum install -y kubelet-$TARGET_VERSION kubectl-$TARGET_VERSION --disableexcludes=kubernetes

  # restart kubelet
  ssh root@$NODE_HOSTNAME systemctl daemon-reload
  ssh root@$NODE_HOSTNAME systemctl restart kubelet

  # uncordon the node
  kubectl uncordon $NODE_HOSTNAME
  
done
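Usage ends up looking something like this (the version string here is just an example; run the script with no arguments to list the versions actually available in your repo):

./k8s-update.sh 1.26.5-0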

cephfs via argocd

Wahoo! I finally got cephfs working in my home lab the way I want it: external k8s clusters accessing a source ceph server (itself running in k8s), all managed by argocd and working with my vclusters.

Looking forward to playing around with a reliable RWX environment. Can finally update pods with 0 downtime, awesome …

I had to basically hack another guy’s script to get the resources for my argocd deployment, lol: https://github.com/rook/rook/issues/11157. Hopefully they get the changes implemented so the next person doesn’t have to. It’s hard to imagine running a gitops setup without this, but surely I can’t be the first to do it?

Gitops-based kubernetes phoenix infrastructure using argocd

A low level strategy …

  • deploy a cluster, we’ll call it ‘root’
  • deploy argocd using helm, use namespace ‘argocd-seed’
  • create a git repo named ‘k-argocd-root’, and set the paths as:
    • /appofapps/base/templates/(symlink to each app)
    • /apps/argocd-clusters
    • /apps/argocd-global
    • /apps/argocd
  • now apply the appofapps application (a sketch of one of these applicationset.yaml files follows this list) with
    • cd /appofapps
    • kubectl apply -f applicationset.yaml
  • we now have 4 argocd instances running; the first, ‘argocd-seed’, manages the other three
  • the three have the following goals:
    • argocd-clusters – used to deploy vclusters (or use clusterapi, or tanzu, etc…)
    • argocd-global – used to deploy common addons to all registered clusters
    • argocd – an argocd instance used to deploy all remaining apps to all clusters
  • create a git repo named ‘k-argocd-root-global’, and set the paths as:
    • /appofapps/base/templates/(symlink to each app)
    • /apps/cert-manager
    • /apps/external-dns
  • now apply the appofapps application with
    • cd /appofapps
    • kubectl apply -f applicationset.yaml
  • create a git repo named ‘k-argocd-root-clusters’, and set the paths as:
    • /appofapps/base/templates/(symlink to each cluster)
    • /apps/<cluster>/base/templates/<cluster>.yaml
  • now apply the appofapps application with
    • cd /appofapps
    • kubectl apply -f applicationset.yaml
  • now, whenever you wish to create a new cluster (use a pipeline to do this):
    • create yaml representing new cluster and add to:
      • /apps/<cluster>/base/templates/<cluster>.yaml
    • add the cluster to the ‘argocd-global’ argocd instance
  • a note regarding the 4th ‘argocd’ instance:
    • whenever using argocd, ‘appofapps’ & ‘apps’ exist at the root of the git repo; name the repo after the cluster to keep things organized, e.g. ‘k-argocd-<clustername>’, such as:
      • k-argocd-core/appofapps/base/templates/(symlink to each app)
      • k-argocd-core/appofapps/applicationset.yaml
      • k-argocd-core/apps/harbor/base/templates
      • k-argocd-core/apps/harbor/applicationset.yaml
      • k-argocd-core/apps/jenkins/base/templates
      • k-argocd-core/apps/jenkins/applicationset.yaml
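For reference, the simplest form of one of those appofapps applicationset.yaml files might look roughly like this (a sketch only, using a plain git directory generator; my actual repos use the base/templates layout described above, and the repo URL, names, and namespace here are placeholders):

kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: appofapps
  namespace: argocd-seed                # the argocd instance this appofapps belongs to
spec:
  generators:
    - git:
        repoURL: https://git.example.com/k-argocd-root.git   # placeholder repo url
        revision: HEAD
        directories:
          - path: apps/*                # one Application per /apps/<name> directory
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://git.example.com/k-argocd-root.git
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc   # or name: <cluster> for a registered remote cluster
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
EOF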

Conclusions

This setup gives you a gitops-based infrastructure with a phoenix-style deployment. In theory, you can delete everything and start over, and with only four ‘kubectl apply -f ./applicationset.yaml’ commands the entire infrastructure will come back up… including the additional clusters and the apps deployed to them. Perhaps a script, phoenix-rise.sh …
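Something like this, perhaps; a rough sketch only, assuming the repos described above are checked out side by side, ‘argocd-seed’ has already been installed via helm, and KUBECONFIG points at the freshly rebuilt ‘root’ cluster:

#!/bin/bash
# phoenix-rise.sh - replay the four appofapps applies after a rebuild (sketch)
set -e

# repo names as used in this post; adjust to taste
REPOS="k-argocd-root k-argocd-root-global k-argocd-root-clusters k-argocd-core"

for REPO in $REPOS; do
  echo ""
  echo "*** applying appofapps for $REPO"
  kubectl apply -f "$REPO/appofapps/applicationset.yaml"

  # crude pause so each argocd instance has a chance to come up before the
  # next applicationset that depends on it is applied
  sleep 60
done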

… except for one thing: though the clusters will come back up, a step is still required to register all the clusters with the ‘argocd-global’ instance so that it can target all the addons at all the clusters. We need a way to watch for new clusters to come up and automatically register them with ‘argocd-global’. Additionally, it might be nice to create groups to work with RBAC. Something to look into …
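Until that’s automated, the manual step is just the argocd cli pointed at the ‘argocd-global’ instance, something like this (the hostname, context, and cluster names are placeholders):

argocd login argocd-global.example.com --grpc-web
argocd cluster add my-vcluster-context --name my-vcluster --yes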

Also, applicationsets, where appropriate, need to target clusters by name (registered via dns; use external-dns for this) rather than by ip address, at least if you want to test deleting it all and having it all come back up.

Thank you

I hope you have found this discussion useful!

A little script to roll a cluster (drain, reboot, and uncordon each worker node in turn), useful if you manage your own.

#!/bin/bash
# roll-cluster 

# get list of node names (node names must be resolvable for the ssh/ping steps below)
NODES=($(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'))
for NODE in "${NODES[@]}"
do
  echo ""
  echo "[$NODE]"
  # only roll worker nodes that are currently Ready
  TMP=$(kubectl get node "$NODE" --no-headers | grep -w Ready | grep -v 'control-plane')
  if [ "$TMP" == "" ]; then
    continue
  fi

  echo "- draining $NODE"
  kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data
  echo "- sending reboot command over ssh"
  ssh root@$NODE reboot
  echo "- waiting for node to go down (no ping)"
  while [ "$(ping -c 4 "$NODE" | grep -c ' 0% packet loss')" == "1" ]; do
    sleep 1
  done
  echo "- waiting for node to show as 'Ready'"
  until kubectl get node "$NODE" --no-headers | grep -qw Ready; do
    sleep 1
  done
  echo "- uncordon $NODE"
  kubectl uncordon $NODE
done