
OpenShift Installation Baremetal on AWS

INPG - DO NOT FOLLOW BLINDLY!


Architecture Overview

                    ┌─────────────────────────────────────────┐
                    │              AWS VPC (10.0.0.0/16)       │
                    │                                          │
  Internet ─────── │  ┌─────────┐    ┌──────────────────────┐ │
                    │  │ IGW     │    │  Public Subnets       │ │
                    │  └────┬────┘    │  - Bastion            │ │
                    │       │         │  - Bootstrap (+EIP)   │ │
                    │  ┌────▼────┐    └──────────────────────┘ │
                    │  │ Ext NLB │                              │
                    │  │ api.*   │    ┌──────────────────────┐ │
                    │  └────┬────┘    │  Private Subnets      │ │
                    │       │         │  - Masters (x3)       │ │
                    │  ┌────▼────┐    │  - Workers (x3)       │ │
                    │  │ Int NLB │    └──────────────────────┘ │
                    │  │api-int.*│              │               │
                    │  └─────────┘    ┌─────────▼────────┐    │
                    │                 │   NAT Gateway     │    │
                    │                 └──────────────────┘    │
                    └─────────────────────────────────────────┘

Key Design Decisions:

  • Bootstrap and Bastion in public subnets — require direct internet access (EIP needed since MapPublicIpOnLaunch=false)
  • Masters and Workers in private subnets — outbound via NAT Gateway, no public IPs needed
  • Two NLBs — external (internet-facing) for api.*, internal for api-int.*
  • api-int must resolve to private IPs (internal NLB) — masters can’t reach public IPs without EIP
  • Route53 private hosted zone for internal DNS (api-int) — not Cloudflare

Prerequisites and AWS Infrastructure

The following AWS resources should be created before starting (ideally via Terraform):

Resource          Details
VPC               10.0.0.0/16
Public Subnets    2x (one per AZ) — for bastion, bootstrap, external LBs
Private Subnets   2x (one per AZ) — for masters, workers
Internet Gateway  Attached to VPC
NAT Gateway       In public subnet with EIP — for private subnet outbound internet
Route Tables      Public subnets → IGW, Private subnets → NAT GW
Security Groups   See Security Groups section

Important: Ensure all subnet route tables are correctly configured before launching instances. Private subnets MUST have a NAT Gateway route for 0.0.0.0/0, otherwise masters/workers cannot pull images from quay.io during installation.
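This can be verified up front; a sketch using a `<vpc-id>` placeholder (substitute your own):

```shell
# List every route table in the VPC with its 0.0.0.0/0 default route,
# so you can confirm private tables point at a NAT gateway (nat-...)
# and public tables at the IGW (igw-...). <vpc-id> is a placeholder.
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=<vpc-id>" \
  --query 'RouteTables[*].{rtb:RouteTableId,default:Routes[?DestinationCidrBlock==`0.0.0.0/0`]}' \
  --output json
```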


Set up bastion

# Generate an SSH key pair for cluster node access
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_rsa
# Set the version environment variable
export VERSION=4.20.0

# Download and extract the OpenShift CLI (oc)
curl -s https://mirror.openshift.com/pub/openshift-v4/clients/ocp/$VERSION/openshift-client-linux.tar.gz | tar zxvf - oc

# Move the oc binary to a directory on your PATH
sudo mv oc /usr/local/bin/

# Download and extract the OpenShift Installer
curl -s https://mirror.openshift.com/pub/openshift-v4/clients/ocp/$VERSION/openshift-install-linux.tar.gz | tar zxvf - openshift-install

# Move the installer binary to your PATH
sudo mv openshift-install /usr/local/bin/

# Verify that the tools are installed correctly
oc version
openshift-install version

Install aws CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html

Install dig for DNS troubleshooting:

sudo dnf install -y bind-utils

Download your Pull Secret

  1. Log in to the Red Hat OpenShift Cluster Manager: https://console.redhat.com/openshift/install/pull-secret
  2. Download or copy your pull secret to the bastion host (e.g., save it as pull-secret.txt)
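A malformed pull secret only surfaces much later, when image pulls fail, so it is worth validating immediately. A quick check (the file name pull-secret.txt is the example from step 2):

```shell
# Confirm the pull secret is well-formed JSON before using it
python3 -m json.tool pull-secret.txt > /dev/null \
  && echo "pull secret OK" \
  || echo "pull secret is NOT valid JSON"
```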

Create Target Groups and NLB

Note: Protocol must be strictly TCP (Layer 4) for all OCP NLBs — no SSL termination.

External API NLB (internet-facing)

Create target groups first (leave targets empty — register instances later):

Target Group  Protocol  Port   Health Check
ocp-api       TCP       6443   HTTPS /readyz
ocp-api-int   TCP       22623  HTTPS /readyz

# Create external API NLB in PUBLIC subnets
aws elbv2 create-load-balancer \
  --name ocp-api \
  --type network \
  --scheme internet-facing \
  --subnets <public-subnet-2a> <public-subnet-2b>

# Create target group for 6443
aws elbv2 create-target-group \
  --name ocp-api \
  --protocol TCP \
  --port 6443 \
  --vpc-id <vpc-id> \
  --target-type instance \
  --health-check-protocol HTTPS \
  --health-check-path /readyz \
  --healthy-threshold-count 2 \
  --health-check-interval-seconds 10

# Create target group for 22623
aws elbv2 create-target-group \
  --name ocp-api-int \
  --protocol TCP \
  --port 22623 \
  --vpc-id <vpc-id> \
  --target-type instance \
  --health-check-protocol HTTPS \
  --health-check-path /readyz \
  --healthy-threshold-count 2 \
  --health-check-interval-seconds 10

# Create listeners
aws elbv2 create-listener \
  --load-balancer-arn <external-nlb-arn> \
  --protocol TCP --port 6443 \
  --default-actions Type=forward,TargetGroupArn=<ocp-api-tg-arn>

aws elbv2 create-listener \
  --load-balancer-arn <external-nlb-arn> \
  --protocol TCP --port 22623 \
  --default-actions Type=forward,TargetGroupArn=<ocp-api-int-tg-arn>

Internal API NLB (internal)

Critical: Masters and workers resolve api-int to this NLB’s private IP. This is how they reach the Machine Config Server (MCS) during bootstrap without needing a public IP.

# Create internal NLB in PRIVATE subnets
aws elbv2 create-load-balancer \
  --name ocp-api-internal \
  --type network \
  --scheme internal \
  --subnets <private-subnet-2a> <private-subnet-2b>

# Create target groups for internal NLB
aws elbv2 create-target-group \
  --name ocp-api-int-6443 \
  --protocol TCP \
  --port 6443 \
  --vpc-id <vpc-id> \
  --target-type instance \
  --health-check-protocol HTTPS \
  --health-check-path /readyz \
  --healthy-threshold-count 2 \
  --health-check-interval-seconds 10

aws elbv2 create-target-group \
  --name ocp-api-int-22623 \
  --protocol TCP \
  --port 22623 \
  --vpc-id <vpc-id> \
  --target-type instance \
  --health-check-protocol HTTPS \
  --health-check-path /readyz \
  --healthy-threshold-count 2 \
  --health-check-interval-seconds 10

# Create listeners
aws elbv2 create-listener \
  --load-balancer-arn <internal-nlb-arn> \
  --protocol TCP --port 6443 \
  --default-actions Type=forward,TargetGroupArn=<ocp-api-int-6443-tg-arn>

aws elbv2 create-listener \
  --load-balancer-arn <internal-nlb-arn> \
  --protocol TCP --port 22623 \
  --default-actions Type=forward,TargetGroupArn=<ocp-api-int-22623-tg-arn>

Application Ingress NLB

# Create ingress NLB in PUBLIC subnets
aws elbv2 create-load-balancer \
  --name ocp-app-ingress \
  --type network \
  --scheme internet-facing \
  --subnets <public-subnet-2a> <public-subnet-2b>

# Target groups
aws elbv2 create-target-group \
  --name ocp-app-ingress \
  --protocol TCP --port 443 \
  --vpc-id <vpc-id> \
  --target-type instance \
  --health-check-protocol HTTP \
  --health-check-port 1936 \
  --health-check-path /healthz/ready

aws elbv2 create-target-group \
  --name ocp-app-ingress-http \
  --protocol TCP --port 80 \
  --vpc-id <vpc-id> \
  --target-type instance \
  --health-check-protocol HTTP \
  --health-check-port 1936 \
  --health-check-path /healthz/ready

Setup DNS

Cloudflare (Public DNS)

Create the following CNAME records under gineesh.com:

Name           Target                 Purpose
api.ocp420     External NLB DNS name  Kubernetes API for external clients
*.apps.ocp420  Ingress NLB DNS name   Wildcard routes (console, apps)

Do NOT put api-int in Cloudflare. Use Route53 private hosted zone instead (see below). api-int is for internal cluster communication only and must resolve to private IPs.
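From a machine outside the VPC you can sanity-check the public records; a sketch using the domain names from this guide:

```shell
# api should resolve to the external NLB's public addresses;
# api-int must NOT resolve publicly (it lives only in the private zone)
dig +short api.ocp420.gineesh.com
dig +short api-int.ocp420.gineesh.com   # expect an empty answer
```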

Route53 Private Hosted Zone (Internal DNS)

This is required so that masters and workers can resolve api-int to the private IP of the internal NLB — they cannot reach public IPs without an EIP.

# Create private hosted zone attached to your VPC
aws route53 create-hosted-zone \
  --name ocp420.gineesh.com \
  --caller-reference $(date +%s) \
  --hosted-zone-config PrivateZone=true \
  --vpc VPCRegion=ap-southeast-2,VPCId=<vpc-id>

# Note the HostedZoneId from output e.g. /hostedzone/XXXXXXXXXXXXX
# Add DNS records pointing to INTERNAL NLB
aws route53 change-resource-record-sets \
  --hosted-zone-id <private-zone-id> \
  --change-batch '{
    "Changes": [
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api-int.ocp420.gineesh.com",
          "Type": "CNAME",
          "TTL": 300,
          "ResourceRecords": [{"Value": "<internal-nlb-dns-name>"}]
        }
      },
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.ocp420.gineesh.com",
          "Type": "CNAME",
          "TTL": 300,
          "ResourceRecords": [{"Value": "<internal-nlb-dns-name>"}]
        }
      }
    ]
  }'

Verify DNS resolves to private IPs from within the VPC:

dig api-int.ocp420.gineesh.com
# Should return private IPs like 10.0.x.x — NOT public IPs

NAT Gateway for Outbound Internet

All nodes need outbound internet access to pull images from quay.io and registry.redhat.io during installation.

  • Bootstrap (public subnet) → needs an Elastic IP (EIP) directly on the instance
  • Masters/Workers (private subnets) → use NAT Gateway (no EIP needed per instance)

# Allocate EIP for NAT Gateway
aws ec2 allocate-address --domain vpc
# Note AllocationId

# Create NAT Gateway in a PUBLIC subnet
aws ec2 create-nat-gateway \
  --subnet-id <public-subnet-id> \
  --allocation-id <eip-alloc-id>

# Wait until available
aws ec2 wait nat-gateway-available \
  --nat-gateway-ids <nat-gw-id>

# Add NAT route to EACH private subnet route table
aws ec2 create-route \
  --route-table-id <private-rtb-2a> \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id <nat-gw-id>

aws ec2 create-route \
  --route-table-id <private-rtb-2b> \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id <nat-gw-id>

Important: Public subnets route via Internet Gateway (0.0.0.0/0 → igw-xxx). Bootstrap is in a public subnet and requires its own EIP because MapPublicIpOnLaunch=false — the IGW requires a public IP to SNAT outbound traffic.

# Allocate and associate EIP for bootstrap instance
aws ec2 allocate-address --domain vpc
aws ec2 associate-address \
  --instance-id <bootstrap-instance-id> \
  --allocation-id <eip-alloc-id>

Security Groups

ocp-sg (Master, Bootstrap, Worker nodes)

Inbound rules:

Protocol  Port   Source     Purpose
TCP       6443   0.0.0.0/0  Kubernetes API
TCP       22623  0.0.0.0/0  Machine Config Server
TCP       443    0.0.0.0/0  HTTPS
TCP       80     0.0.0.0/0  HTTP
TCP       22     VPC CIDR   SSH (from bastion only)
TCP       19531  VPC CIDR   Bootstrap journal (bootstrap only)

Outbound rules:

Protocol  Port  Destination  Purpose
All       All   0.0.0.0/0    Allow all outbound

Important: Inbound rules for ports 6443 and 22623 are required for NLB health checks to work. Without these, the NLB will report all targets as unhealthy even if the service is running.
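The inbound rules above can be added with the CLI; a sketch assuming a placeholder `<ocp-sg-id>` (tighten the 0.0.0.0/0 CIDRs if your security posture allows):

```shell
# API and MCS ports (also used by the NLB health checks)
aws ec2 authorize-security-group-ingress \
  --group-id <ocp-sg-id> --protocol tcp --port 6443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress \
  --group-id <ocp-sg-id> --protocol tcp --port 22623 --cidr 0.0.0.0/0
# SSH from inside the VPC only
aws ec2 authorize-security-group-ingress \
  --group-id <ocp-sg-id> --protocol tcp --port 22 --cidr 10.0.0.0/16
```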


Finding CoreOS AMI

Always use the openshift-install binary to find the correct AMI — it embeds the exact RHCOS version for your OCP release. This works offline (no internet required).

$ openshift-install coreos print-stream-json | \
  python3 -c "
import json, sys
data = json.load(sys.stdin)
amis = data['architectures']['x86_64']['images']['aws']['regions']
print('ap-southeast-2 AMI:', amis['ap-southeast-2']['image'])
"
ap-southeast-2 AMI: ami-007439e088223214a

Note: OCP 4.20 uses a new AMI naming convention: rhcos-9.6.YYYYMMDD-N-x86_64 instead of the old RHEL-9.4-RHCOS-4.xx_HVM_GA-... format. The owner ID also changed to 531415883065. Always use the installer binary output rather than searching manually.
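If you do find an AMI manually, the owner ID can be cross-checked against the installer output (the AMI ID below is the one printed above):

```shell
# Official RHCOS AMIs for OCP 4.20 are published under owner 531415883065
aws ec2 describe-images \
  --image-ids ami-007439e088223214a \
  --query 'Images[0].[OwnerId,Name]' \
  --output text
```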

To select the AMI in the AWS Console:

  1. Click “Browse more AMIs”
  2. Select “Community AMIs” tab
  3. Paste the AMI ID in the search box
  4. Ensure the console region is set to your target region

Prepare install-config.yaml

mkdir $HOME/clusterconfig

Create $HOME/clusterconfig/install-config.yaml:

apiVersion: v1
baseDomain: gineesh.com
metadata:
  name: ocp420
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0  # Must be 0 for UPI — do not change
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14   # Pod IPs — internal only, must not overlap VPC CIDR
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16     # Must match your AWS VPC CIDR
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  none: {}
fips: false
pullSecret: '{"auths": ...}'  # Paste your pull secret JSON here
sshKey: 'ssh-ed25519 AAAA...' # Your bastion public key

Key values to update:

  • machineNetwork.cidr — must match your AWS VPC CIDR
  • pullSecret — from console.redhat.com
  • sshKey — your SSH public key (from ~/.ssh/id_rsa.pub)
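Note that `openshift-install create manifests` consumes install-config.yaml from the asset directory, so keep a copy outside it in case you need to restart the install from scratch:

```shell
# The installer deletes install-config.yaml when generating manifests;
# back it up first so it can be restored for a clean re-install
cp "$HOME/clusterconfig/install-config.yaml" "$HOME/install-config.yaml.bak"
```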

Prepare Deployment

Generate the manifests

Ensure compute.replicas is set to 0 before running this.

$ openshift-install create manifests --dir $HOME/clusterconfig

INFO Consuming Install Config from target directory
WARNING Making control-plane schedulable by setting MastersSchedulable to true for Scheduler cluster settings
INFO Manifests created in: /home/ec2-user/clusterconfig/manifests and /home/ec2-user/clusterconfig/openshift

Generate the Ignition configuration files

$ openshift-install create ignition-configs --dir $HOME/clusterconfig

This generates:

clusterconfig/
├── auth/
│   ├── kubeadmin-password
│   └── kubeconfig
├── bootstrap.ign      ← large file, must be hosted externally
├── master.ign         ← pointer to MCS, small enough for user-data
├── worker.ign         ← pointer to MCS, small enough for user-data
└── metadata.json

Important: Ignition configs contain certificates valid for 24 hours. If installation takes longer than 24 hours, regenerate ignition configs from scratch.
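A quick way to see how old the ignition assets are (GNU stat, as found on the RHEL/Amazon Linux bastion):

```shell
# Ignition configs older than 24h carry expired embedded certificates
now=$(date +%s)
for f in "$HOME"/clusterconfig/*.ign; do
  [ -e "$f" ] || continue   # skip if no ignition files exist yet
  age_h=$(( (now - $(stat -c %Y "$f")) / 3600 ))
  echo "$f: ${age_h}h old"
done
```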

Host bootstrap.ign file

bootstrap.ign is too large for EC2 User Data (often >1MB vs 16KB limit). Host it on the bastion using Python’s built-in HTTP server.

mkdir -p ~/ignition-files
cp $HOME/clusterconfig/bootstrap.ign ~/ignition-files/

cd ~/ignition-files
nohup python3 -m http.server 8080 &

# Verify it's serving
curl http://localhost:8080/bootstrap.ign | head -c 100

Open firewall on bastion if needed:

sudo firewall-cmd --add-port=8080/tcp --permanent
sudo firewall-cmd --reload

Ensure port 8080 is allowed in the security group from the VPC CIDR (10.0.0.0/16). Never expose it to 0.0.0.0/0: bootstrap.ign contains sensitive cluster material.

Cleanup after bootstrap completes:

kill $(lsof -t -i:8080)
rm -rf ~/ignition-files/

Encode the Master and Worker Ignition files

base64 -w0 $HOME/clusterconfig/master.ign > $HOME/clusterconfig/master.64
base64 -w0 $HOME/clusterconfig/worker.ign > $HOME/clusterconfig/worker.64
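The pointer ignition files are small, but it costs nothing to confirm the encoded output fits within the 16 KB EC2 user-data limit before pasting:

```shell
# EC2 user data is capped at 16384 bytes; the base64 pointer configs
# should be well under that
for f in "$HOME/clusterconfig/master.64" "$HOME/clusterconfig/worker.64"; do
  [ -e "$f" ] || continue
  echo "$f: $(wc -c < "$f") bytes"
done
```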

Launch EC2 Instances

Bootstrap Node

Launch in a public subnet. In the User Data field (under Advanced Details), paste this JSON pointer (update the IP to your bastion’s private IP):

{
  "ignition": {
    "config": {
      "replace": {
        "source": "http://10.0.27.231:8080/bootstrap.ign"
      }
    },
    "version": "3.2.0"
  }
}

Recommended instance type: m5.xlarge or larger
Root volume: 120 GB gp3

Immediately after launch — assign an EIP:

aws ec2 allocate-address --domain vpc
aws ec2 associate-address \
  --instance-id <bootstrap-instance-id> \
  --allocation-id <eip-alloc-id>

Bootstrap needs a public IP because it’s in a public subnet — the IGW requires a public IP to route outbound traffic. Without this, node-image-pull.service will fail trying to reach quay.io.

Master Nodes

Launch 3 master nodes in private subnets (distribute across AZs).

In the User Data field, paste the entire base64-encoded contents of master.64:

cat $HOME/clusterconfig/master.64

Recommended instance type: m5.xlarge or larger
Root volume: 120 GB gp3

Masters must be in private subnets with NAT Gateway access. They resolve api-int to the internal NLB private IP to fetch config from the Machine Config Server (MCS).

Worker Nodes

Launch 3 worker nodes in private subnets.

In the User Data field, paste the entire base64-encoded contents of worker.64:

cat $HOME/clusterconfig/worker.64

Recommended instance type: m5.xlarge or larger
Root volume: 120 GB gp3


Register Instances to Target Groups

Register instances before or immediately after launching — masters need to reach MCS via the NLB during boot.

BOOTSTRAP_ID=<bootstrap-instance-id>
MASTER1_ID=<master1-instance-id>
MASTER2_ID=<master2-instance-id>
MASTER3_ID=<master3-instance-id>
WORKER1_ID=<worker1-instance-id>
WORKER2_ID=<worker2-instance-id>
WORKER3_ID=<worker3-instance-id>

# External NLB — bootstrap + masters on 6443 and 22623
for TG_ARN in <ext-6443-tg-arn> <ext-22623-tg-arn>; do
  aws elbv2 register-targets \
    --target-group-arn $TG_ARN \
    --targets \
      Id=$BOOTSTRAP_ID \
      Id=$MASTER1_ID \
      Id=$MASTER2_ID \
      Id=$MASTER3_ID
done

# Internal NLB — bootstrap + masters on 6443 and 22623
for TG_ARN in <int-6443-tg-arn> <int-22623-tg-arn>; do
  aws elbv2 register-targets \
    --target-group-arn $TG_ARN \
    --targets \
      Id=$BOOTSTRAP_ID \
      Id=$MASTER1_ID \
      Id=$MASTER2_ID \
      Id=$MASTER3_ID
done

# Ingress NLB — workers on 443 and 80
for TG_ARN in <ingress-443-tg-arn> <ingress-80-tg-arn>; do
  aws elbv2 register-targets \
    --target-group-arn $TG_ARN \
    --targets \
      Id=$WORKER1_ID \
      Id=$WORKER2_ID \
      Id=$WORKER3_ID
done

Monitor Installation Progress

Watch bootstrap complete

openshift-install --dir $HOME/clusterconfig \
  wait-for bootstrap-complete \
  --log-level=info

Expected success output:

INFO API v1.33.x up
INFO Waiting up to 30m0s for bootstrapping to complete...
INFO It is now safe to remove the bootstrap resources

Approve CSRs (run in a separate terminal)

Masters need two rounds of CSR approval — run this loop throughout the install:

export KUBECONFIG=$HOME/clusterconfig/auth/kubeconfig

while true; do
  oc get csr -o go-template='{{range .items}}{{if not .status.certificate}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
    | xargs --no-run-if-empty oc adm certificate approve
  echo "$(date) - CSR check done"
  sleep 15
done

Watch nodes join

watch -n 15 'oc get nodes'

Watch install complete

openshift-install --dir $HOME/clusterconfig \
  wait-for install-complete \
  --log-level=info

Post Bootstrap Cleanup

Once bootstrap is complete:

# 1. Remove bootstrap from all target groups
for TG_ARN in <ext-6443-tg-arn> <ext-22623-tg-arn> <int-6443-tg-arn> <int-22623-tg-arn>; do
  aws elbv2 deregister-targets \
    --target-group-arn $TG_ARN \
    --targets Id=<bootstrap-instance-id>
done

# 2. Release bootstrap EIP
aws ec2 disassociate-address --association-id <assoc-id>
aws ec2 release-address --allocation-id <eip-alloc-id>

# 3. Terminate bootstrap instance
aws ec2 terminate-instances --instance-ids <bootstrap-instance-id>

# 4. Stop bastion HTTP server
kill $(lsof -t -i:8080)
rm -rf ~/ignition-files/

# 5. Revoke port 8080 security group rule if added
aws ec2 revoke-security-group-ingress \
  --group-id <sg-id> \
  --protocol tcp \
  --port 8080 \
  --cidr 10.0.0.0/16

Troubleshooting

Check Bootstrap Health

ssh -i <key.pem> core@<bootstrap-private-ip>

# Check ignition completed successfully
sudo journalctl -b -u ignition* --no-pager | tail -20

# Check node-image-pull (new in OCP 4.20 — pulls RHCOS node image before bootkube)
sudo systemctl status node-image-pull.service --no-pager
sudo journalctl -b -u node-image-pull.service --no-pager | tail -30

# Check bootkube (bootstraps etcd + kube-apiserver)
sudo journalctl -b -u bootkube.service -f

# Check kubelet
sudo systemctl status kubelet

# Check all running containers
sudo crictl ps

# Check if API port is listening
ss -tlnp | grep 6443

# Test API directly (bypass LB)
curl -k https://localhost:6443/version
curl -k https://localhost:6443/readyz

Ensure Bootstrap is Completed

sudo systemctl status bootkube.service --no-pager
sudo crictl ps | grep -E "etcd|apiserver|controller|scheduler"
curl -k https://localhost:6443/version
sudo systemctl --failed

bootkube showing inactive (dead) with no failed units = success. It’s a oneshot service that exits cleanly when done.

Check Basic OC Resources

export KUBECONFIG=$HOME/clusterconfig/auth/kubeconfig
oc get nodes
oc get csr
oc get clusteroperators

node-image-pull Failures (OCP 4.20+)

If node-image-pull.service fails with i/o timeout connecting to quay.io:

# Bootstrap has no internet access — check:
# 1. Does bootstrap have a public IP?
curl http://169.254.169.254/latest/meta-data/public-ipv4

# 2. Can it reach quay.io?
curl -v https://quay.io

# 3. If no public IP — assign an EIP from bastion:
aws ec2 allocate-address --domain vpc --region ap-southeast-2
aws ec2 associate-address \
  --instance-id <bootstrap-id> \
  --allocation-id <eip-alloc-id>

If node-image-pull.service fails with ref coreos/node-image already exists (from a previous failed attempt):

sudo ostree refs --repo /ostree/repo --delete coreos/node-image
sudo rm -rf /ostree/repo/tmp/node-image
sudo systemctl restart node-image-pull.service
sudo journalctl -b -u node-image-pull.service -f

Masters Stuck in Ignition Fetch

If masters show A start job is running for Ignition (fetch) for more than 5 minutes:

# Check if MCS is reachable from bastion
curl -k https://api-int.ocp420.gineesh.com:22623/config/master | head -c 200

# Check what api-int resolves to (must be private IP)
dig api-int.ocp420.gineesh.com
# Should return 10.x.x.x — if returning public IP, fix Route53 private zone

# Check NLB target health
aws elbv2 describe-target-health \
  --target-group-arn <int-22623-tg-arn> \
  --query 'TargetHealthDescriptions[*].[Target.Id,Target.Port,TargetHealth.State]' \
  --output table

# Check master console output
aws ec2 get-console-output \
  --instance-id <master-instance-id> \
  --output text | tail -30

Re-apply Master User-Data if Ignition Config is Invalid

If console output shows error: invalid character or config is not valid:

# Stop masters
aws ec2 stop-instances \
  --instance-ids <master1-id> <master2-id> <master3-id>

aws ec2 wait instance-stopped \
  --instance-ids <master1-id> <master2-id> <master3-id>

# Re-encode and apply correct ignition
MASTER_IGN=$(base64 -w0 $HOME/clusterconfig/master.ign)

for INSTANCE in <master1-id> <master2-id> <master3-id>; do
  aws ec2 modify-instance-attribute \
    --instance-id $INSTANCE \
    --attribute userData \
    --value "$MASTER_IGN"
  echo "Updated $INSTANCE"
done

# Start masters
aws ec2 start-instances \
  --instance-ids <master1-id> <master2-id> <master3-id>

Check NLB Target Health

# Check all target groups
aws elbv2 describe-target-groups \
  --query 'TargetGroups[*].[TargetGroupName,TargetGroupArn,Port]' \
  --output table

# Check health of specific target group
aws elbv2 describe-target-health \
  --target-group-arn <tg-arn> \
  --query 'TargetHealthDescriptions[*].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]' \
  --output table

Common reasons for unhealthy targets:

  • Security group missing inbound rule for that port
  • Service not yet running on the instance (still booting)
  • Wrong health check path or protocol
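You can also reproduce the health checks by hand from the bastion; a sketch with placeholder node IPs:

```shell
# Same probe the API NLBs perform (expect 200 once kube-apiserver is up)
curl -ks -o /dev/null -w '%{http_code}\n' https://<master-private-ip>:6443/readyz

# Same probe the ingress NLB performs against a worker's router
curl -s -o /dev/null -w '%{http_code}\n' http://<worker-private-ip>:1936/healthz/ready
```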

Check LB Health Check Config

aws elbv2 describe-target-groups \
  --target-group-arns <tg-arn> \
  --query 'TargetGroups[*].[TargetGroupName,HealthCheckProtocol,HealthCheckPort,HealthCheckPath,HealthyThresholdCount,HealthCheckIntervalSeconds]' \
  --output table

Expected for port 6443: HTTPS / traffic-port / /readyz
Expected for port 22623: HTTPS / traffic-port / /readyz


TODO

  • Add Terraform code to provision all AWS infrastructure (VPC, subnets, NLBs, security groups, NAT Gateway, Route53)
  • Add AAP Hub S3 RWX storage configuration
  • Add worker node CSR approval automation via MachineApprover
  • Add cluster operator verification steps post-install