(storage) ceph is amazing

If you haven’t tried ceph yet and aren’t completely satisfied with your on-prem storage system, I recommend giving it a try.  (Heads up: it does want a lot of cpu.)

** I acknowledge that I am currently excited and completely captivated by ceph.  I’m still fairly new to using ceph, so you might want to check my facts, a.k.a. I’m tempting you to start investigating. 😉

Ceph can use multiple disks across multiple servers and spread out the load, maintaining two or three copies of the data to avoid data loss.  Just look at my humble 4-disk system (2 nodes with 2 disks each) with the data spread out nearly perfectly across the disks.

Ceph provides cephfs (the ceph filesystem) and cephrbd (ceph block storage).

Block storage gives you something like a disk; it’s useful for, say, creating a vm that you later want to live-migrate between servers.

Otherwise, the ceph filesystem is what you want to use, though usually not directly.

You’ll end up creating a pool based on the ceph filesystem (cephfs), then creating pvcs that come from the pool. You can also use iscsi (which uses cephrbd) or nfs (which uses cephfs) if you have a consumer that can’t connect with cephfs or cephrbd directly.
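For example, under a Rook-based ceph install, requesting an RWX volume is just an ordinary pvc against the cephfs storage class. A minimal sketch (the storage class name `rook-cephfs` and the pvc name are assumptions from a typical Rook setup; yours may differ):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data               # hypothetical name
spec:
  accessModes:
    - ReadWriteMany               # RWX: multiple pods may mount read-write
  storageClassName: rook-cephfs   # assumed storage class from a typical Rook install
  resources:
    requests:
      storage: 10Gi
```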

In my experience working with storage systems and kubernetes persistent volumes, I had no luck with RWX (ReadWriteMany), even when the providers claimed it worked (nfs using my own linux server; longhorn, which uses nfs for rwx). I found two apps, plex and nextcloud, would consistently experience database corruption after only a few minutes.

People continually told me “NFS supports RWX” and “just use iSCSI, that’s perfect for apps that use SQLite”. I tested these claims against a TrueNAS Core server and did not get the promised results.

However, with ceph you can allocate a pvc using cephfs, and this works perfectly with RWX! Awesome! And super fast! All my jenkins builds, which run in kubernetes, sped up by 15 seconds vs TrueNAS iSCSI. Of course, this could just mean the physical disks behind ceph are faster than the physical disks behind my truenas server; I can’t be sure.

Now I suspect that if I created pvcs using NFS or iSCSI, which work on top of cephfs, they might also support RWX. I’m very curious, but given how fast cephfs is and that everything is working perfectly, I can’t see any reason to use NFS or iSCSI (except for vm disks).

Using the helm install of ceph you end up with a tools pod that you can exec into and run the ‘ceph’ cli. The dashboard gui is pretty great, but the real interaction with ceph happens at the command line. I’ve been in there breaking & fixing things, and the experience feels like a fully fledged product ready for production. There is so much there that I can see someone managing ceph as a career, with a deep dive available as far as you are interested in going. If you break it enough and then fix it, you get to watch it move data around recovering things, which, from the most geeky perspective, is as cool as can be.
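For a taste, these are the kinds of commands you can run from the tools pod (the namespace and deployment name below assume a typical Rook install; adjust for your setup):

```
# exec into the toolbox pod
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

# then, inside the pod:
ceph status     # overall health, monitors, osds, recovery progress
ceph osd tree   # how osds map onto hosts and disks
ceph df         # per-pool capacity and usage
```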

In any case, given how long cephfs has been around and its popularity in production environments, I think I’ve found my storage solution for the foreseeable future.

Ceph – Getting Started

(centos) k8s-update.sh – script to upgrade a kubernetes cluster

A script to update a kubernetes cluster to the next patch or minor version.

#!/bin/bash

# if no parameter, show versions and syntax
if [ -z "$1" ]; then
  # show available versions
  yum list --showduplicates kubeadm --disableexcludes=kubernetes

  # show syntax
  echo ""
  echo "Syntax:"
  echo "$0 <version>, e.g. $0 1.26.x-0"
  exit 1
fi

# remember version
export TARGET_VERSION="$1"


# configure kubectl to use admin config
export KUBECONFIG=/etc/kubernetes/admin.conf

# track first control plane node
export IS_FIRST=1

# loop through control plane nodes
NODES=$(kubectl get nodes --no-headers | awk '{print $1}')
for NODE in $NODES; do
  # parse kubectl node output (columns: NAME STATUS ROLES AGE VERSION)
  NODE_HOSTNAME=$(kubectl get node "$NODE" --no-headers | awk '{print $1}')
  NODE_TYPE=$(kubectl get node "$NODE" --no-headers | awk '{print $3}')
  NODE_VERSION=$(kubectl get node "$NODE" --no-headers | awk '{print $5}')

  # only work on control plane nodes in this loop
  if [ "$NODE_TYPE" != "control-plane" ]; then
     # skip worker nodes in this loop
     continue
  fi

  echo ""
  echo "***"
  echo "* Next: $NODE_HOSTNAME"

  # upgrade kubeadm
  echo "upgrade to: $TARGET_VERSION"
  ssh root@$NODE_HOSTNAME yum install -y kubeadm-$TARGET_VERSION --disableexcludes=kubernetes

  # verify the download works and has the expected version
  #ssh root@$NODE_HOSTNAME kubeadm version

  # verify the upgrade plan
  #ssh root@$NODE_HOSTNAME kubeadm upgrade plan

  # perform the update
  if [ "$IS_FIRST" == "0" ]; then
    ssh root@$NODE_HOSTNAME kubeadm upgrade node
  else
    # if this is the first control plane node its command is a little different
    ssh root@$NODE_HOSTNAME kubeadm upgrade apply --yes v$TARGET_VERSION

    # adjust tracking now that we've completed the first control plane node
    export IS_FIRST=0
  fi

  # drain node & prepare for updating
  kubectl drain $NODE_HOSTNAME --delete-emptydir-data --ignore-daemonsets

  # update kubelet & kubectl
  ssh root@$NODE_HOSTNAME yum install -y kubelet-$TARGET_VERSION kubectl-$TARGET_VERSION --disableexcludes=kubernetes

  # restart kubelet
  ssh root@$NODE_HOSTNAME systemctl daemon-reload
  ssh root@$NODE_HOSTNAME systemctl restart kubelet

  # uncordon the node
  kubectl uncordon $NODE_HOSTNAME
  
done


# loop through worker nodes
NODES=$(kubectl get nodes --no-headers | awk '{print $1}')
for NODE in $NODES; do
  # parse kubectl node output (columns: NAME STATUS ROLES AGE VERSION)
  NODE_HOSTNAME=$(kubectl get node "$NODE" --no-headers | awk '{print $1}')
  NODE_TYPE=$(kubectl get node "$NODE" --no-headers | awk '{print $3}')
  NODE_VERSION=$(kubectl get node "$NODE" --no-headers | awk '{print $5}')

  # skip control plane nodes in this loop; only workers here
  if [ "$NODE_TYPE" == "control-plane" ]; then
     continue
  fi

  echo ""
  echo "***"
  echo "* Next: $NODE_HOSTNAME"

  # upgrade kubeadm
  echo "upgrade to: $TARGET_VERSION"
  ssh root@$NODE_HOSTNAME yum install -y kubeadm-$TARGET_VERSION --disableexcludes=kubernetes

  # verify the download works and has the expected version
  #ssh root@$NODE_HOSTNAME kubeadm version

  # verify the upgrade plan
  #ssh root@$NODE_HOSTNAME kubeadm upgrade plan

  # perform the update; workers always use 'kubeadm upgrade node'
  # (IS_FIRST was cleared while processing the control plane nodes)
  if [ "$IS_FIRST" == "0" ]; then
    ssh root@$NODE_HOSTNAME kubeadm upgrade node
  fi

  # drain node & prepare for updating
  kubectl drain $NODE_HOSTNAME --delete-emptydir-data --ignore-daemonsets

  # update kubelet & kubectl
  ssh root@$NODE_HOSTNAME yum install -y kubelet-$TARGET_VERSION kubectl-$TARGET_VERSION --disableexcludes=kubernetes

  # restart kubelet
  ssh root@$NODE_HOSTNAME systemctl daemon-reload
  ssh root@$NODE_HOSTNAME systemctl restart kubelet

  # uncordon the node
  kubectl uncordon $NODE_HOSTNAME
  
done

cephfs via argocd

Wahoo! I finally got cephfs working in my home lab the way I want it: external k8s clusters accessing a source server (running in k8s), all managed by argocd & working with my vclusters.

Looking forward to playing around with a reliable RWX environment. I can finally update pods with 0 downtime, awesome …

I basically had to hack another guy’s script to get the resources for my argocd deployment, lol (https://github.com/rook/rook/issues/11157); hopefully they can get the changes implemented so the next person doesn’t have to. It’s hard to imagine a gitops setup without it, but I can’t be the first to do this?

A little script to roll a cluster (drain, reboot, and uncordon each node in turn), useful if you manage your own.

#!/bin/bash
# roll-cluster: drain, reboot, and uncordon each node in turn

# get list of node names
# (note: addresses[1] is assumed to be the Hostname entry; if your setup
#  orders addresses differently, use .items[*].metadata.name instead)
NODES=($(kubectl get nodes -o jsonpath='{.items[*].status.addresses[1].address}'))
for NODE in "${NODES[@]}"
do
  echo ""
  echo "[$NODE]"
  # skip control plane nodes and nodes that are not Ready
  # (' Ready' with a leading space so 'NotReady' does not match)
  TMP=$(kubectl get node "$NODE" | grep ' Ready' | grep -v 'control-plane')
  if [ "$TMP" == "" ]; then
    continue
  fi

  echo "- draining $NODE"
  kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
  echo "- sending reboot command"
  ssh root@$NODE reboot
  echo "- waiting for node to go down (no ping)"
  while [ "$(ping "$NODE" -c 4 | grep packet | grep -c ' 0\% packet loss')" == "1" ]; do
    sleep 1
  done
  echo "- waiting for node to show as ' Ready'"
  while true; do
    a=$(kubectl get node "$NODE" | grep ' Ready')
    if [ "$a" != "" ]; then
      break
    else
      sleep 1
    fi
  done
  echo "- uncordon $NODE"
  kubectl uncordon "$NODE"
done

Architecting teamtask / list: As a kubernetes controller

See the previous post for more details on the Teamtask / List algorithm.

Quick summary:  The algorithm has been called List because it covers the most basic, but frequently desired, use case of needing to perform an action on multiple items.  Given a list of 100 hostnames, for example: ping each one and see if it responds, tracking whether each hostname has been processed so that if the script has to be restarted we don’t repeat work.  It turns out, though, that if the implementation correctly implements a mutex, such as through the correct use of a database, multiple clients can help process the list, resulting in a well-orchestrated distributed processing engine (hence, Teamtask).

List has most recently been implemented as a webapi covering all the use cases one would expect, and placing it into a container as a microservice, along with a database and a respective helm chart, is the next step.  But what about beyond that?

List works by breaking a list of items into blocks for processing; by selecting an appropriate block size, the server will not carry a high cpu load or be busy.  Additionally, a main List server can hand out very large blocks, which secondary List servers can consume and break into smaller chunks for their clients, further reducing the load on the main server.  A misconfigured Job, though, say with a block size of one, millions of items to process, and hundreds of clients, could produce some unnecessary peak resource consumption.  With this in mind, what if we were to use a controller in Kubernetes itself?

Kubernetes at its core is a controller engine.  Controllers recognize yaml-defined objects such as deployments, services, and ingresses.  Kubernetes ships with a certain set of controllers related to container management, but really, we can create controllers for almost anything.  We could create a controller that knows how to play Tic-Tac-Toe, for example, defining a game with a current state in a yaml.  A controller recognizing the yaml could see that the state indicates a move needs to be made, make the move, and update the status of the object / yaml.  One can see how such a system could coordinate many games, and through the nature of kubernetes everything would scale: in the same way kubernetes manages the state of hundreds of deployments, it could manage the state of multiple instances of a game.

As an excuse to write a kubernetes controller just for fun, we could implement List.  A yaml with an api could define a List to process.  The controller could then create a block, or a few blocks, for clients to process.  The original List could define how many items to process, the block size to use, who has permissions to work on the blocks, etc …  The idea of a “client” could also be a controller within kubernetes, implemented similarly to how argocd recognizes an applicationset and, upon processing, creates one or more applications, which it also knows how to process.  By using kubernetes we could use its database and not have to deploy our own.  Resources could potentially get out of hand if misconfigured, so safety checks would need to be put in place, but with our merging algorithm as described before, the behind-the-scenes kubernetes database use and cpu use should be minimal on the list management side.

Such a controller would get all the benefits of using kubernetes; we could take advantage of the built-in error checking and status reporting of the type we see with pods and scaling.  Such an implementation would lead to some fun investigations: how exactly does kubernetes manage all the pods it manages?  Is it checking them one at a time, all at once, or a few at a time?  Whatever algorithm kubernetes uses to manage pods would be the same algorithm used to manage the list blocks.  Probably there are some built-in limits to keep things sane, and perhaps we could take advantage of those.

Maybe controllers to process blocks wouldn’t bring any benefit; perhaps it would be better to implement just the server side as a controller, with clients run as Kubernetes Jobs, or as deployments set up to scale as desired, perhaps within resource quotas.  Still, in either case, it might make sense to define a Block type which, upon processing, would get an index & size added.  The Block could show as pending, in the same way an Ingress does while waiting for a loadbalancer ip; consumers could wait for the status to change and then work on the block.  Upon completion, the status could be advanced to ‘completed’ or something similar to communicate to the controller that the block is done.
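A purely hypothetical sketch of what such a Block custom resource might look like (every name here, including the api group, is invented for illustration):

```yaml
apiVersion: teamtask.example.com/v1alpha1   # hypothetical api group
kind: Block
metadata:
  name: primenumber-block-0
spec:
  listRef: primenumber      # the parent List object this block belongs to
status:
  index: 0                  # filled in by the controller once assigned
  size: 1000
  phase: Pending            # Pending -> Assigned -> Completed
```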

How cool would it be to do a ‘kubectl -n <namespace> get blocks’ and get the blocks currently being worked on displayed in the familiar kubectl style with current status?

$ k get blocks -o wide
NAME                            READY   STATUS    RESTARTS   INDEX     SIZE
primenumber-578b4958fc-cvtbm    1/1     Ready     2          0         1000
primenumber-578b4958fc-segcfs   1/1     Ready     0          1001      1000
primenumber-578b4958fc-wersw    0/1     Pending   0          Pending   Pending

If List were implemented as a controller within Kubernetes, we could still use it outside of the kubernetes cluster without having to implement a webapi, because kubernetes itself can be accessed and used via a webapi, no kubectl required.  Sweet!!!  (Course, we might not want users to access the kubeapi directly; wrap that api!)
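For instance, since kubernetes serves custom resources at a predictable REST path, a client could list the Block objects described above with nothing but an http call (the api group is the same invented example as before; the token and server values are placeholders):

```
# custom resources live at /apis/<group>/<version>/namespaces/<namespace>/<plural>
curl -s -H "Authorization: Bearer $TOKEN" \
  "$APISERVER/apis/teamtask.example.com/v1alpha1/namespaces/default/blocks"
```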

Architecting teamtask / list: The early years

Teamtask (a.k.a. List) is a pet project / algorithm I developed back in 2000 as part of a brute-force password cracking experiment.  Actually, now that I think about it, I originally started working on the algorithm in 5th grade.

Back in my early years I wanted to password protect my computer, which wasn’t a thing back then.  I set about writing a program with a prompt for a username and password.  It worked.  I could start it up when the computer started, and though you could just ctrl-c out of it (not super sophisticated), my next thought was how someone could get around it.  I began investigating how to generate all possible passwords so they could be tested one after another until the password was guessed.  I figured out the following two algorithms, given a string of 3 characters ‘abc’ and length ‘3’:

aaa 111 000
aab 112 001
aac 113 002

abc 123
acb 132
bac 213
bca 231
cab 312
cba 321

The algorithms were thus: one generating all combinations with reuse of characters, and one without.  The first worked best for brute-force password cracking (though I didn’t know the term at the time, if it even existed).  But in 5th grade I wasn’t able to create the algorithm to generate the strings.  Later in life, though, I was able to create an algorithm for both, using iteration with a base equal to the number of characters, rather than base 10, along with factorials(!).

With these two algorithms the following became possible:  If there were 6 possible arrangements of characters, I could give a client a number, such as 1, along with a string of characters ‘abc’, and the client could translate that into ‘abc’ and do something with it (test if the password works).  Since the client only needs the index and the string of characters, we can also give out a block, such as index=0, size=3.  This results in two blocks that two different clients can work on simultaneously.  Each client takes a block, processes three combinations, then reports back the result.
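The with-reuse translation is just counting in base N over the alphabet. A small sketch in shell (the function name index_to_string is invented, not from the original program):

```shell
# index_to_string INDEX CHARS LENGTH
# Translate a numeric index into a candidate string by counting in
# base N, where N is the number of characters in the alphabet.
index_to_string() {
  index=$1; chars=$2; length=$3
  base=${#chars}
  out=""
  i=0
  while [ "$i" -lt "$length" ]; do
    pos=$(( index % base ))
    # prepend the single character at (1-based) position pos+1
    out="$(printf '%s' "$chars" | cut -c $((pos + 1)))$out"
    index=$(( index / base ))
    i=$(( i + 1 ))
  done
  printf '%s\n' "$out"
}

# with alphabet 'abc' and length 3:
index_to_string 0 abc 3   # prints aaa
index_to_string 1 abc 3   # prints aab
index_to_string 2 abc 3   # prints aac
```

A client holding a block only needs this function plus the block's starting index and size to regenerate every candidate in the block.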

Implementing the algorithm, there’s one more bit of magic.  One might initially implement the algorithm above in the following way: given 100 items to complete, broken into chunks of 10, you could add 10 records to a database to reflect these blocks pending completion:

index = 0, size = 10, status = pending
index = 10, size = 10, status = pending
...
index = 90, size = 10, status = pending

After each block completes you mark its status as ‘completed’, and once all blocks are completed you flag the Job as done.

However, this would mean that when processing more extreme lists with thousands of blocks (think testing for the largest prime number ever found), you wouldn’t want to hold the status of all blocks.  With one more algorithm this concern disappears: you merge sibling blocks.  Say you have three blocks in a row, written as (index, size, status), and clients are working on them: (0, 10, 0), (10, 10, 0), (20, 10, 0).  If the second two complete, (0, 10, 0), (10, 10, 1), (20, 10, 1), you can merge them for tracking purposes: (0, 10, 0), (10, 20, 1).  If the first then completes, you can merge again: (0, 30, 1), indicating that from position 0, 30 items have all been completed.  This conveniently means that when the whole list has been processed, you are left with one block covering the whole size in completed status (0, 100000000, 1).
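The merge step above can be sketched as a tiny filter (merge_blocks is an invented name; completed blocks arrive as “index size” lines sorted by index, status omitted since only completed blocks are fed in):

```shell
# merge_blocks: fold adjacent completed blocks into larger ones.
# Reads "index size" lines on stdin, sorted by index; whenever the end of
# the current run (index + size) equals the next block's index, the two
# are merged into one larger block.
merge_blocks() {
  awk '
    NR == 1 { start = $1; size = $2; next }
    $1 == start + size { size += $2; next }       # adjacent: extend the run
    { print start, size; start = $1; size = $2 }  # gap: emit run, start anew
    END { print start, size }
  '
}

# three adjacent completed blocks plus one detached block:
printf '0 10\n10 10\n20 10\n40 10\n' | merge_blocks
# prints:
# 0 30
# 40 10
```

When the final gap closes, the filter collapses everything to a single line covering the entire list, exactly the (0, size, completed) end state described above.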

The algorithm has evolved to include timeouts on blocks, handling the case of a client disappearing (or crashing) while working on a block; a limit on the number of blocks a client can hold at one time, to mitigate someone interfering with processing by requesting blocks and not working on them; and OIDC support, to work within an enterprise infrastructure.

Roadmap:
– implement teamtask (a.k.a. list) as a container
– implement webapp gui & mobile gui, both with a single implementation using flutter