Bloopers during development¶

Terraform¶

Delete all resources except one¶

There is no --except feature in terraform destroy command currently. If you really want to do that, and you know what you are doing, here is the workaround.

# list all resources
terraform state list

# remove that resource you don't want to destroy
# you can add more to be excluded if required
terraform state rm <resource_to_be_deleted> 

# destroy the whole stack except above excluded resource(s)
terraform destroy

So why do these commands work for your idea?

The state (*.tfstate) is used by Terraform to map real world resources to your configuration, keep track of metadata.

terraform state rm cleans a record (resource) from the state file (*.tfstate) only. It doesn't destroy the real resource.

Since you don't run terraform apply or terraform refresh, after terraform state rm, terraform doesn't know the excluded resource was created at all.

When you run terraform destroy, it has no detail about that excluded resource’s state and will not destroy it. It will destroy the rest.

By the way, later you still have chance to import the resource back with terraform import command if you want.

terraform import module.vpc.aws_vpc.vpc vpc-0933149e01f1136aa

ECS cluster with capacity providers cannot be destroyed¶

The problem is that the capacity_provider property on the aws_ecs_cluster introduces a new dependency: aws_ecs_cluster depends on aws_ecs_capacity_provider depends on aws_autoscaling_group.

This causes terraform to destroy the ECS cluster before the autoscaling group, which is the wrong way around: the autoscaling group must be destroyed first because the cluster must contain zero instances before it can be destroyed.

This leads to Terraform error out with the cluster partly alive and the capacity providers fully alive.

References:

I haven't found a proper workaround, yet...

EKS¶

Unable to delete ingress¶

kubectl delete ValidatingWebhookConfiguration aws-load-balancer-webhook
kubectl patch ingress $ingressname -n $namespace -p '{"metadata":{"finalizers":[]}}' --type=merge

Unable to delete namespace¶

To delete a namespace, Kubernetes must first delete all the resources in the namespace. Then, it must check registered API services for the status. A namespace gets stuck in Terminating status for the following reasons:

The namespace contains resources that Kubernetes can't delete.
An API service has a False status.

Scripted:

k8s-ns-finalizer <NAMESPACE>

Manual:

Save a JSON file like in the following example:

namespace=<NAMESPACE>

kubectl get namespace $namespace -o json > tempfile.json

Remove the finalizers array block from the spec section of the JSON file:

"spec": {
        "finalizers": [
            "kubernetes"
        ]
    }

After you remove the finalizers array block, the spec section of the JSON file looks like this:

"spec" : {
    }

To apply the changes, run the following command:

kubectl replace --raw "/api/v1/namespaces/$namespace/finalize" -f ./tempfile.json

Verify that the terminating namespace is removed:

kubectl get namespaces

Repeat these steps for any remaining namespaces that are stuck in the Terminating status.

EKSWorkerNode is not joining Node Group¶

This does help to identify the problem:

EKS Service behind ALB shows only HTML¶

resource "kubernetes_ingress_v1" "openssl3_ingress" {
  wait_for_load_balancer = true

  metadata {
    annotations = {
      "alb.ingress.kubernetes.io/scheme"        = "internet-facing"
      "alb.ingress.kubernetes.io/target-type"   = "ip"
      "kubernetes.io/ingress.class"             = "alb"
      "alb.ingress.kubernetes.io/inbound-cidrs" = var.access_ip
    }
    labels = {
      app = "web-app"
    }
    name      = "web-app-ingress"
    namespace = var.namespace
  }
  spec {
    rule {
      http {
        path {
          backend {
            service {
              name = "web-app-service"
              port {
                number = 80
              }
            }
          }
          path = "/*"
        }
      }
    }
  }
}

The * in path is important :-)

Route53¶

If a hosted zone is destroyed and re-provisioned, new name server records are associated with the new hosted zone. However, the domain name might still have the previous name server records associated with it.

If AWS Route 53 is used as the domain name registrar, head to Route 53 > Registered domains > ${your-domain-name} > Add or edit name servers and add the newly associated name server records from the hosted zone to the registered domain.

XDR for Containers¶

Initially, I thought I just need to leave the VPC alone when changing/destroying part of the network configuration. This was a failure...

Misc commands which helped at some point¶

aws kms delete-alias --alias-name alias/eks/playground-one-eks

EKS EC2 Autoscaler¶

In some cases creating the EKS cluster with EC2 instances failes just before finishing with the following error:

╷
│ Warning: Helm release "cluster-autoscaler" was created but has a failed status. Use the `helm` command to investigate the error, correct it, then run Terraform again.
│ 
│   with module.eks.helm_release.cluster_autoscaler[0],
│   on eks-ec2/autoscaler.tf line 25, in resource "helm_release" "cluster_autoscaler":
│   25: resource "helm_release" "cluster_autoscaler" {
│ 
╵
...
╷
│ Error: 1 error occurred:
│       * Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
│ 
│ 
│ 
│   with module.eks.helm_release.cluster_autoscaler[0],
│   on eks-ec2/autoscaler.tf line 25, in resource "helm_release" "cluster_autoscaler":
│   25: resource "helm_release" "cluster_autoscaler" {
│ 
╵

This looks like a timing issue for me which I need to investigate further. If you run into this problem just rerun

pgo --apply eks-ec2

This should complete the cluster creation within seconds then.

Autoscaler Logs:

kubectl logs -f -n kube-system -l app=cluster-autoscaler