Rearchitecting lido for kubernetes


broker-tdameritrade-www (oidc login)
broker-kraken-www (oidc login)
tradermanager-www (gui)

git-based homedir folder structure (and git repos) using lessons learned

After reinstalling everything including my main linux workbench system it became the right time to finally get my home directory into git.  Taking all lessons learned up till this point it seemed a good idea to cleanup my git repo strategy as well.  The revised strategy:

[Git repos]

- workbench-<user>

Team (i for infrastructure):
- i-ansible
- i-jenkins (needed ?)
- i-kubernetes (needed?)
- i-terraform
- i-tanzu

Project related: (source code)
- p-lido (use tagging dev/test/prod)

Jenkins project pipelines:
- j-lifecycle-cluster-decommission
- j-lifecycle-cluster-deploy
- j-lifecycle-cluster-update
- j-lido-dev
- j-lido-test
- j-lido-prod

Cluster app deployments:
- k-core
- k-dev
- k-exp
- k-prod

[Folder structure]

i-ansible (git repo)
  plays ( ~/a )

i-jenkins (git repo) (needed ?)
  pipelines ( ~/j )

i-kubernetes (git repo) (needed ?)
  manage ( ~/k )

i-terraform (git repo)
  plans (~/p)

i-tanzu (git repo)
  application.yaml (-> appofapps)
  apps (~/t)
    appofapps/ (inc all clusters)

  <gitrepo>/<user> (~/mysrc) (these are each git repos)
  <gitrepo>/<team> (~/s) (these are each git repos)
    - deploy cluster
    - create git repo
    - create adgroups
    - register with argocd global
      application.yaml (-> appofapps)
        appofapps/ (inc all apps)
      application.yaml (-> appofapps)
        appofapps/ (inc all apps)
      application.yaml (-> appofapps)
        appofapps/ (inc all apps)

workbench-<user> (git repo)

kubernetes disaster recovery

Deploying via git using argocd or flux makes disaster recovery fairly straightforward.

Using gitops means you can delete a kubernetes cluster, spin up a new one,  and have everything deployed back out in minutes.  But what about recovering the pvcs used before?

If you are using an infrastructure which implements csi, then you are able to allocate pvcs using storage managed outside of the cluster.  And, it turns out, reattaching to those pvcs is possible but you have to plan ahead.

Instead of writing yaml to spin up a pvc automatically, create the pv and pvc using manually set values.  Or spin up the pvcs automatically and then go back and modify the yaml to set recoverable values.  The howto is right up top in the csi documentation:

Similarly, it is common for applications to spin up with randomly set admin passwords and such.  However, imagine a recovery scenario where a new cluster is stood up, you don’t want a new password stood up.  Use a vault with a password and reference the vault.

These two steps do add a little work, it’s the idea of taking a little more time to do things right, and in a production environment you want this.

Infrastructure side solution:

Todo:  Create a video deleting a cluster and recovering all apps with a new cluster, recovering pvcs also (without any extra work on the recovery side).

and then we have argocd, with its ability to restore an entire cluster, even multiple clusters simultaneously simply due to its inherent declarative nature

more kubernetes magic


Helm chart check list – work flow from dev to prod (always evolving)

Helm chart setup workflow


  1. Initially get things working with a default helm install into dev.
  2. Does it have ldap or oidc integration or some other reason to need to verify onprem CA chain? Yes, figure out how to get CA chain installed
  3. Setup oidc if possible, setup ldap if needed and oidc is not available.
  4. Is it possible to set values in helm chart to get CA chain installed as part of the helm install? Yes, modify yaml, no fork helm install and add steps, see if project owner is OK with pull request or not. (better in Integrate with helm chart than manage a fork).
  5. Configure something on server you want to persist.
  6. Helm uninstall and reinstall, did you lose the setting? … figure out steps to get data to persist.
  7. Try to increase the storage size of PVCs which might need to be increased in production?  Is this possible without taking down the application?  Figure out what is required for this use case which will inevitably come up in the future.  Document and be ready.  It may be wise to implement a pipeline for this purpose.
  8. Cordon involved node(s) and drain, uncordon node(s), was data lost when things came back up? Figure out how to ensure data persists.
  9. Does server have a method to export configuration or otherwise backup the server. Configure automated backups.
  10. Is it possible to Configure server as HA? Can it be configured for HA later or must it be configured from initial setup? Can a single instance be migrated to HA. Decide if HA needs to be setup or is a single instance good enough. If HA is desired then figure out how to set that up and go through this list again.
  11. Are there options to configure metrics for the application?  Often these exist in helm installs.  (lower priority when initially working to get something up)
  12. If there is an option to use a log aggregator set that up or possibly setup a sidecar with logging.  (lower priority when initially working to get something up)
  13. Server is now ready to release into test.


  1. Configure permissions of those with accounts accessed via oidc / ldap. Note a program which supports ldap but not oidc is not as evolved. Check for a plugin/extension if oidc does not appear to be available. Oidc enables almost any identity provider & SSO, and is always preferred.
  2. Does the minimum requested CPU and Memory match what’s actually needed?
  3. Someone needs to perform some manual testing or work on automated testing.
  4. If no one ever tries restoring from a backup, there is a good chance the process might not work, might want to try that out before there is a fire.
  5. No system may be released into production without an automated method of registering its ip in dns (e.g. external-dns) and also an automated method of updating its ssl certificates (e.g. cert-man), verify these work.
  6. Be sure to test rolling up to the next release of a helm chart as well as rolling back (and all tests still pass).
  7. If all testing passes then ready for production.


  1. An update strategy needs to be established and followed just prior to release into production. Schedule: Monthly, quarterly, every 6 months, or upon release of a new version. Version: always run the latest, or version just prior to latest major release (and with all the updates).  Some programs such as WordPress can/will update plugins automatically… is this ok?
  2. Generally automation is desired to roll something out into production. When an update is ready automation should be used to update first in dev, perform automated testing, then roll out into dev with the ok of someone (or automatically rolled out into prod if all tests passed and its decided that is good enough).
  3. Also, a pipeline for rolling back to a previous version is a good idea, in case a deployment to production fails.

(pull request) Contributing to open source, helm chart for taiga, ability to import an on prem certificate authority certificate chain

OIDC is always preferred if possible.  At this time in history not all projects have OIDC support, though some can be extended via an extension or plugin to accomplish the goal.  I’ve got enough experience to help projects get over this hurdle and get OIDC working.  If I could be paid just to help out open source projects I might go for it.

Here’s a pull request for a taiga helm chart I’ve been using.  I’ve been using taiga for years via docker and am happy to be able to help out in this way now that I’m using kubernetes and helm charts.  In this case a borrowed a technique from a nextcloud helm chart and works perfectly for this taiga helm chart:

Like rebuilding a Shinto shrine

Traditionally Shinto shrines are rebuilt exactly the same next to the old shrine every so many years.  The old shrine is removed and when the time comes it will be rebuilt again.

Something similar can apply to home environments.  Recently I nuked everything and rebuilt from the ground up.  Something I’ve always done after 6 months or a year, for security reasons and to ensure I am always getting the fastest performance from infrastructure.

Such reinstalling is a natural fit for kubernetes.  There are several methods for spinning up a cluster, and after that just by the nature of kubernetes being yaml files it is easy to spin up the services you had running before, and watch them self register in the new dns and self generate certificates with the new active directory certificate authority.  Amazing.   Kubernetes is truly, a work of art.