IN PROGRESS - DO NOT FOLLOW BLINDLY!
Other notes:
- OpenShift - Tools, Learning Guides and References
- OpenShift and Service Mesh
- OpenShift Installation Methods - Examples
- OpenShift Installation on GCP with Nested Virtualization
- OpenShift Local (Formerly CodeReady Containers - CRC)
Table of Contents
- Architecture Overview
- Prerequisites and AWS Infrastructure
- Set up bastion
- Download your Pull Secret
- Create Target Groups and NLB
- Setup DNS
- NAT Gateway for Outbound Internet
- Security Groups
- Finding CoreOS AMI
- Prepare install-config.yaml
- Prepare Deployment
- Launch EC2 Instances
- Register Instances to Target Groups
- Monitor Installation Progress
- Post Bootstrap Cleanup
- Troubleshooting
- TODO
Architecture Overview
┌─────────────────────────────────────────┐
│ AWS VPC (10.0.0.0/16) │
│ │
Internet ─────── │ ┌─────────┐ ┌──────────────────────┐ │
│ │ IGW │ │ Public Subnets │ │
│ └────┬────┘ │ - Bastion │ │
│ │ │ - Bootstrap (+EIP) │ │
│ ┌────▼────┐ └──────────────────────┘ │
│ │ Ext NLB │ │
│ │ api.* │ ┌──────────────────────┐ │
│ └────┬────┘ │ Private Subnets │ │
│ │ │ - Masters (x3) │ │
│ ┌────▼────┐ │ - Workers (x3) │ │
│ │ Int NLB │ └──────────────────────┘ │
│ │api-int.*│ │ │
│ └─────────┘ ┌─────────▼────────┐ │
│ │ NAT Gateway │ │
│ └──────────────────┘ │
└─────────────────────────────────────────┘
Key Design Decisions:
- Bootstrap and Bastion in public subnets — require direct internet access (EIP needed since MapPublicIpOnLaunch=false)
- Masters and Workers in private subnets — outbound via NAT Gateway, no public IPs needed
- Two NLBs — external (internet-facing) for api.*, internal for api-int.*. api-int must resolve to private IPs (internal NLB) because masters can’t reach public IPs without an EIP
- Route53 private hosted zone for internal DNS (api-int) — not Cloudflare
Prerequisites and AWS Infrastructure
The following AWS resources should be created before starting (ideally via Terraform):
| Resource | Details |
|---|---|
| VPC | 10.0.0.0/16 |
| Public Subnets | 2x (one per AZ) — for bastion, bootstrap, external LBs |
| Private Subnets | 2x (one per AZ) — for masters, workers |
| Internet Gateway | Attached to VPC |
| NAT Gateway | In public subnet with EIP — for private subnet outbound internet |
| Route Tables | Public subnets → IGW, Private subnets → NAT GW |
| Security Groups | See Security Groups section |
Important: Ensure all subnet route tables are correctly configured before launching instances. Private subnets MUST have a NAT Gateway route for
0.0.0.0/0, otherwise masters/workers cannot pull images from quay.io during installation.
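As a sanity check before launching anything, the NAT route can be confirmed from the bastion once the aws CLI is configured. This is a template: substitute your route table IDs.

```shell
# List the 0.0.0.0/0 route of each private route table; expect a nat-xxxx ID
# per table. Blank output means the NAT route is missing - add it first.
aws ec2 describe-route-tables \
  --route-table-ids <private-rtb-2a> <private-rtb-2b> \
  --query 'RouteTables[*].Routes[?DestinationCidrBlock==`0.0.0.0/0`].NatGatewayId' \
  --output text
```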
Set up bastion
$ ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_rsa
# Set the version environment variable
export VERSION=4.20.0
# Download and extract the OpenShift CLI (oc)
curl -s https://mirror.openshift.com/pub/openshift-v4/clients/ocp/$VERSION/openshift-client-linux.tar.gz | tar zxvf - oc
# Move the oc binary to a directory on your PATH
sudo mv oc /usr/local/bin/
# Download and extract the OpenShift Installer
curl -s https://mirror.openshift.com/pub/openshift-v4/clients/ocp/$VERSION/openshift-install-linux.tar.gz | tar zxvf - openshift-install
# Move the installer binary to your PATH
sudo mv openshift-install /usr/local/bin/
# Verify that the tools are installed correctly
oc version
openshift-install version
Install aws CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
Install dig for DNS troubleshooting:
sudo dnf install -y bind-utils
Download your Pull Secret
- Log in to the Red Hat OpenShift Cluster Manager: https://console.redhat.com/openshift/install/pull-secret
- Download or copy your pull secret to the bastion host (e.g., save it as
pull-secret.txt)
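A malformed pull secret is a common source of installer failures. A minimal validation sketch (the `check_pull_secret` helper and the demo file are illustrative, not part of the official tooling — point it at your real pull-secret.txt):

```shell
# Confirm the pull secret is well-formed JSON before embedding it in
# install-config.yaml (a smart quote or stray newline breaks the installer).
check_pull_secret() {
  python3 -m json.tool "$1" > /dev/null 2>&1 && echo "ok" || echo "invalid JSON"
}

# Demo against a throwaway file; in practice run:
#   check_pull_secret pull-secret.txt
printf '{"auths":{"quay.io":{"auth":"..."}}}' > /tmp/ps-demo.json
check_pull_secret /tmp/ps-demo.json   # prints "ok"
```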
Create Target Groups and NLB
Note: Protocol must be strictly TCP (Layer 4) for all OCP NLBs — no SSL termination.
External API NLB (internet-facing)
Create target groups first (leave targets empty — register instances later):
| Target Group | Protocol | Port | Health Check |
|---|---|---|---|
| ocp-api | TCP | 6443 | HTTPS /readyz |
| ocp-api-int | TCP | 22623 | HTTPS /readyz |
# Create external API NLB in PUBLIC subnets
aws elbv2 create-load-balancer \
--name ocp-api \
--type network \
--scheme internet-facing \
--subnets <public-subnet-2a> <public-subnet-2b>
# Create target group for 6443
aws elbv2 create-target-group \
--name ocp-api \
--protocol TCP \
--port 6443 \
--vpc-id <vpc-id> \
--target-type instance \
--health-check-protocol HTTPS \
--health-check-path /readyz \
--healthy-threshold-count 2 \
--health-check-interval-seconds 10
# Create target group for 22623
aws elbv2 create-target-group \
--name ocp-api-int \
--protocol TCP \
--port 22623 \
--vpc-id <vpc-id> \
--target-type instance \
--health-check-protocol HTTPS \
--health-check-path /readyz \
--healthy-threshold-count 2 \
--health-check-interval-seconds 10
# Create listeners
aws elbv2 create-listener \
--load-balancer-arn <external-nlb-arn> \
--protocol TCP --port 6443 \
--default-actions Type=forward,TargetGroupArn=<ocp-api-tg-arn>
aws elbv2 create-listener \
--load-balancer-arn <external-nlb-arn> \
--protocol TCP --port 22623 \
--default-actions Type=forward,TargetGroupArn=<ocp-api-int-tg-arn>
Internal API NLB (internal)
Critical: Masters and workers resolve api-int to this NLB’s private IP. This is how they reach the Machine Config Server (MCS) during bootstrap without needing a public IP.
# Create internal NLB in PRIVATE subnets
aws elbv2 create-load-balancer \
--name ocp-api-internal \
--type network \
--scheme internal \
--subnets <private-subnet-2a> <private-subnet-2b>
# Create target groups for internal NLB
aws elbv2 create-target-group \
--name ocp-api-int-6443 \
--protocol TCP \
--port 6443 \
--vpc-id <vpc-id> \
--target-type instance \
--health-check-protocol HTTPS \
--health-check-path /readyz \
--healthy-threshold-count 2 \
--health-check-interval-seconds 10
aws elbv2 create-target-group \
--name ocp-api-int-22623 \
--protocol TCP \
--port 22623 \
--vpc-id <vpc-id> \
--target-type instance \
--health-check-protocol HTTPS \
--health-check-path /readyz \
--healthy-threshold-count 2 \
--health-check-interval-seconds 10
# Create listeners
aws elbv2 create-listener \
--load-balancer-arn <internal-nlb-arn> \
--protocol TCP --port 6443 \
--default-actions Type=forward,TargetGroupArn=<ocp-api-int-6443-tg-arn>
aws elbv2 create-listener \
--load-balancer-arn <internal-nlb-arn> \
--protocol TCP --port 22623 \
--default-actions Type=forward,TargetGroupArn=<ocp-api-int-22623-tg-arn>
Application Ingress NLB
# Create ingress NLB in PUBLIC subnets
aws elbv2 create-load-balancer \
--name ocp-app-ingress \
--type network \
--scheme internet-facing \
--subnets <public-subnet-2a> <public-subnet-2b>
# Target groups
aws elbv2 create-target-group \
--name ocp-app-ingress \
--protocol TCP --port 443 \
--vpc-id <vpc-id> \
--target-type instance \
--health-check-protocol HTTP \
--health-check-port 1936 \
--health-check-path /healthz/ready
aws elbv2 create-target-group \
--name ocp-app-ingress-http \
--protocol TCP --port 80 \
--vpc-id <vpc-id> \
--target-type instance \
--health-check-protocol HTTP \
--health-check-port 1936 \
--health-check-path /healthz/ready
Setup DNS
Cloudflare (Public DNS)
Create the following CNAME records under gineesh.com:
| Name | Target | Purpose |
|---|---|---|
| api.ocp420 | External NLB DNS name | Kubernetes API for external clients |
| *.apps.ocp420 | Ingress NLB DNS name | Wildcard routes (console, apps) |
Do NOT put api-int in Cloudflare. Use the Route53 private hosted zone instead (see below). api-int is for internal cluster communication only and must resolve to private IPs.
Route53 Private Hosted Zone (Internal DNS)
This is required so that masters and workers can resolve api-int to the private IP of the internal NLB — they cannot reach public IPs without an EIP.
# Create private hosted zone attached to your VPC
aws route53 create-hosted-zone \
--name ocp420.gineesh.com \
--caller-reference $(date +%s) \
--hosted-zone-config PrivateZone=true \
--vpc VPCRegion=ap-southeast-2,VPCId=<vpc-id>
# Note the HostedZoneId from output e.g. /hostedzone/XXXXXXXXXXXXX
# Add DNS records pointing to INTERNAL NLB
aws route53 change-resource-record-sets \
--hosted-zone-id <private-zone-id> \
--change-batch '{
"Changes": [
{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "api-int.ocp420.gineesh.com",
"Type": "CNAME",
"TTL": 300,
"ResourceRecords": [{"Value": "<internal-nlb-dns-name>"}]
}
},
{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "api.ocp420.gineesh.com",
"Type": "CNAME",
"TTL": 300,
"ResourceRecords": [{"Value": "<internal-nlb-dns-name>"}]
}
}
]
}'
Verify DNS resolves to private IPs from within the VPC:
dig api-int.ocp420.gineesh.com
# Should return private IPs like 10.0.x.x — NOT public IPs
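The same check can be scripted. `is_private_10` is a hypothetical helper that assumes the 10.0.0.0/16 VPC CIDR used throughout this guide:

```shell
# Classify an IP against this guide's 10.x VPC range.
is_private_10() {
  case "$1" in
    10.*) echo "private" ;;
    "")   echo "unresolved" ;;
    *)    echo "public" ;;
  esac
}

# Check the first A record returned for api-int:
ip=$(dig +short api-int.ocp420.gineesh.com | head -1)
echo "api-int resolves to: ${ip:-<nothing>} ($(is_private_10 "$ip"))"
```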
NAT Gateway for Outbound Internet
All nodes need outbound internet access to pull images from quay.io and registry.redhat.io during installation.
- Bootstrap (public subnet) → needs an Elastic IP (EIP) directly on the instance
- Masters/Workers (private subnets) → use NAT Gateway (no EIP needed per instance)
# Allocate EIP for NAT Gateway
aws ec2 allocate-address --domain vpc
# Note AllocationId
# Create NAT Gateway in a PUBLIC subnet
aws ec2 create-nat-gateway \
--subnet-id <public-subnet-id> \
--allocation-id <eip-alloc-id>
# Wait until available
aws ec2 wait nat-gateway-available \
--nat-gateway-ids <nat-gw-id>
# Add NAT route to EACH private subnet route table
aws ec2 create-route \
--route-table-id <private-rtb-2a> \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id <nat-gw-id>
aws ec2 create-route \
--route-table-id <private-rtb-2b> \
--destination-cidr-block 0.0.0.0/0 \
--nat-gateway-id <nat-gw-id>
Important: Public subnets route via Internet Gateway (0.0.0.0/0 → igw-xxx). Bootstrap is in a public subnet and requires its own EIP because MapPublicIpOnLaunch=false — the IGW requires a public IP to SNAT outbound traffic.
# Allocate and associate EIP for bootstrap instance
aws ec2 allocate-address --domain vpc
aws ec2 associate-address \
--instance-id <bootstrap-instance-id> \
--allocation-id <eip-alloc-id>
Security Groups
ocp-sg (Master, Bootstrap, Worker nodes)
Inbound rules:
| Protocol | Port | Source | Purpose |
|---|---|---|---|
| TCP | 6443 | 0.0.0.0/0 | Kubernetes API |
| TCP | 22623 | 0.0.0.0/0 | Machine Config Server |
| TCP | 443 | 0.0.0.0/0 | HTTPS |
| TCP | 80 | 0.0.0.0/0 | HTTP |
| TCP | 22 | VPC CIDR | SSH (from bastion only) |
| TCP | 19531 | VPC CIDR | Bootstrap journal (bootstrap only) |
Outbound rules:
| Protocol | Port | Destination | Purpose |
|---|---|---|---|
| All | All | 0.0.0.0/0 | Allow all outbound |
Important: Inbound rules for ports 6443 and 22623 are required for NLB health checks to work. Without these, the NLB will report all targets as unhealthy even if the service is running.
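The inbound table above can be applied with a short loop. This is a sketch: `SG_ID` is a placeholder, the `aws` call is left commented so the loop can be dry-run first, and the VPC CIDR is assumed to be 10.0.0.0/16 as elsewhere in this guide.

```shell
SG_ID="sg-xxxxxxxx"   # placeholder: substitute your ocp-sg ID

# One "port cidr" pair per inbound rule from the table above.
rules="6443 0.0.0.0/0
22623 0.0.0.0/0
443 0.0.0.0/0
80 0.0.0.0/0
22 10.0.0.0/16
19531 10.0.0.0/16"

printf '%s\n' "$rules" | while read -r port cidr; do
  echo "allow tcp/${port} from ${cidr}"
  # Uncomment to apply against AWS:
  # aws ec2 authorize-security-group-ingress \
  #   --group-id "$SG_ID" --protocol tcp --port "$port" --cidr "$cidr"
done
```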
Finding CoreOS AMI
Always use the openshift-install binary to find the correct AMI — it embeds the exact RHCOS version for your OCP release. This works offline (no internet required).
$ openshift-install coreos print-stream-json | \
python3 -c "
import json, sys
data = json.load(sys.stdin)
amis = data['architectures']['x86_64']['images']['aws']['regions']
print('ap-southeast-2 AMI:', amis['ap-southeast-2']['image'])
"
ap-southeast-2 AMI: ami-007439e088223214a
Note: OCP 4.20 uses a new AMI naming convention: rhcos-9.6.YYYYMMDD-N-x86_64 instead of the old RHEL-9.4-RHCOS-4.xx_HVM_GA-... format. The owner ID also changed to 531415883065. Always use the installer binary output rather than searching manually.
To select the AMI in the AWS Console:
- Click “Browse more AMIs”
- Select “Community AMIs” tab
- Paste the AMI ID in the search box
- Ensure the console region is set to your target region
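An optional CLI cross-check (template; assumes the aws CLI is configured, and uses the example AMI ID printed above):

```shell
# Confirm the AMI name and that the owner matches the new 531415883065 ID.
aws ec2 describe-images \
  --region ap-southeast-2 \
  --image-ids ami-007439e088223214a \
  --query 'Images[0].[Name,OwnerId]' \
  --output text
```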
Prepare install-config.yaml
mkdir $HOME/clusterconfig
Create $HOME/clusterconfig/install-config.yaml:
apiVersion: v1
baseDomain: gineesh.com
metadata:
name: ocp420
compute:
- hyperthreading: Enabled
name: worker
replicas: 0 # Must be 0 for UPI — do not change
controlPlane:
hyperthreading: Enabled
name: master
replicas: 3
networking:
clusterNetwork:
- cidr: 10.128.0.0/14 # Pod IPs — internal only, must not overlap VPC CIDR
hostPrefix: 23
machineNetwork:
- cidr: 10.0.0.0/16 # Must match your AWS VPC CIDR
networkType: OVNKubernetes
serviceNetwork:
- 172.30.0.0/16
platform:
none: {}
fips: false
pullSecret: '{"auths": ...}' # Paste your pull secret JSON here
sshKey: 'ssh-ed25519 AAAA...' # Your bastion public key
Key values to update:
- machineNetwork.cidr — must match your AWS VPC CIDR
- pullSecret — from console.redhat.com
- sshKey — your SSH public key (from ~/.ssh/id_rsa.pub)
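One practical note: `openshift-install create manifests` consumes and deletes install-config.yaml, so keep a copy if you might need to rerun. A minimal sketch, demonstrated on a stand-in file (`backup_cfg` is just an illustrative helper):

```shell
# Snapshot install-config.yaml before the installer consumes it.
backup_cfg() { cp "$1" "$1.bak" && echo "backed up: $1.bak"; }

# Demo on a stand-in file; in practice run:
#   backup_cfg $HOME/clusterconfig/install-config.yaml
printf 'apiVersion: v1\n' > /tmp/install-config.yaml
backup_cfg /tmp/install-config.yaml
```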
Prepare Deployment
Generate the manifests
Ensure compute.replicas is set to 0 before running this.
$ openshift-install create manifests --dir $HOME/clusterconfig
INFO Consuming Install Config from target directory
WARNING Making control-plane schedulable by setting MastersSchedulable to true for Scheduler cluster settings
INFO Manifests created in: /home/ec2-user/clusterconfig/manifests and /home/ec2-user/clusterconfig/openshift
Generate the Ignition configuration files
$ openshift-install create ignition-configs --dir $HOME/clusterconfig
This generates:
clusterconfig/
├── auth/
│ ├── kubeadmin-password
│ └── kubeconfig
├── bootstrap.ign ← large file, must be hosted externally
├── master.ign ← pointer to MCS, small enough for user-data
├── worker.ign ← pointer to MCS, small enough for user-data
└── metadata.json
Important: Ignition configs contain certificates valid for 24 hours. If installation takes longer than 24 hours, regenerate ignition configs from scratch.
Host bootstrap.ign file
bootstrap.ign is too large for EC2 User Data (often >1MB vs 16KB limit). Host it on the bastion using Python’s built-in HTTP server.
mkdir -p ~/ignition-files
cp $HOME/clusterconfig/bootstrap.ign ~/ignition-files/
cd ~/ignition-files
nohup python3 -m http.server 8080 &
# Verify it's serving
curl http://localhost:8080/bootstrap.ign | head -c 100
Open firewall on bastion if needed:
sudo firewall-cmd --add-port=8080/tcp --permanent
sudo firewall-cmd --reload
Ensure port 8080 is allowed in the security group from the VPC CIDR (10.0.0.0/16). Never expose to 0.0.0.0/0 — bootstrap.ign contains sensitive cluster material.
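The allow rule itself can be added like this (template; `<bastion-sg-id>` is a placeholder for the bastion's security group, and this is the counterpart of the revoke command in Post Bootstrap Cleanup):

```shell
# Allow instances in the VPC to fetch bootstrap.ign from the bastion.
aws ec2 authorize-security-group-ingress \
  --group-id <bastion-sg-id> \
  --protocol tcp \
  --port 8080 \
  --cidr 10.0.0.0/16
```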
Cleanup after bootstrap completes:
kill $(lsof -t -i:8080)
rm -rf ~/ignition-files/
Encode the Master and Worker Ignition files
base64 -w0 $HOME/clusterconfig/master.ign > $HOME/clusterconfig/master.64
base64 -w0 $HOME/clusterconfig/worker.ign > $HOME/clusterconfig/worker.64
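The pointer ignitions are tiny, but it is worth verifying the encoded files actually fit the 16 KB user-data limit, especially if you ever template extra content into them. A sketch (`check_userdata_size` is an illustrative helper, demonstrated on a stand-in file):

```shell
# EC2 user data is capped at 16 KB (16384 bytes).
check_userdata_size() {
  local size
  size=$(wc -c < "$1")
  if [ "$size" -lt 16384 ]; then
    echo "$1: ${size} bytes - OK for user data"
  else
    echo "$1: ${size} bytes - TOO LARGE for user data"
  fi
}

# Demo on a stand-in file; in practice:
#   check_userdata_size $HOME/clusterconfig/master.64
printf 'dGVzdA==' > /tmp/demo.64
check_userdata_size /tmp/demo.64
```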
Launch EC2 Instances
Bootstrap Node
Launch in a public subnet. In the User Data field (under Advanced Details), paste this JSON pointer (update the IP to your bastion’s private IP):
{
"ignition": {
"config": {
"replace": {
"source": "http://10.0.27.231:8080/bootstrap.ign"
}
},
"version": "3.2.0"
}
}
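Since a single stray character here fails ignition silently at boot, it is worth validating the pointer JSON before pasting it into the console (the bastion IP shown is the example from above):

```shell
# Hold the pointer config in a variable and check it parses as JSON.
pointer='{
  "ignition": {
    "config": {
      "replace": {"source": "http://10.0.27.231:8080/bootstrap.ign"}
    },
    "version": "3.2.0"
  }
}'
printf '%s' "$pointer" | python3 -m json.tool > /dev/null \
  && echo "pointer JSON: ok" \
  || echo "pointer JSON: INVALID - fix before pasting"
```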
Recommended instance type: m5.xlarge or larger
Root volume: 120 GB gp3
Immediately after launch — assign an EIP:
aws ec2 allocate-address --domain vpc
aws ec2 associate-address \
--instance-id <bootstrap-instance-id> \
--allocation-id <eip-alloc-id>
Bootstrap needs a public IP because it’s in a public subnet — the IGW requires a public IP to route outbound traffic. Without this, node-image-pull.service will fail trying to reach quay.io.
Master Nodes
Launch 3 master nodes in private subnets (distribute across AZs).
In the User Data field, paste the entire base64-encoded contents of master.64:
cat $HOME/clusterconfig/master.64
Recommended instance type: m5.xlarge or larger
Root volume: 120 GB gp3
Masters must be in private subnets with NAT Gateway access. They resolve api-int to the internal NLB private IP to fetch config from the Machine Config Server (MCS).
Worker Nodes
Launch 3 worker nodes in private subnets.
In the User Data field, paste the entire base64-encoded contents of worker.64:
cat $HOME/clusterconfig/worker.64
Recommended instance type: m5.xlarge or larger
Root volume: 120 GB gp3
Register Instances to Target Groups
Register instances before or immediately after launching — masters need to reach MCS via the NLB during boot.
BOOTSTRAP_ID=<bootstrap-instance-id>
MASTER1_ID=<master1-instance-id>
MASTER2_ID=<master2-instance-id>
MASTER3_ID=<master3-instance-id>
WORKER1_ID=<worker1-instance-id>
WORKER2_ID=<worker2-instance-id>
WORKER3_ID=<worker3-instance-id>
# External NLB — bootstrap + masters on 6443 and 22623
for TG_ARN in <ext-6443-tg-arn> <ext-22623-tg-arn>; do
aws elbv2 register-targets \
--target-group-arn $TG_ARN \
--targets \
Id=$BOOTSTRAP_ID \
Id=$MASTER1_ID \
Id=$MASTER2_ID \
Id=$MASTER3_ID
done
# Internal NLB — bootstrap + masters on 6443 and 22623
for TG_ARN in <int-6443-tg-arn> <int-22623-tg-arn>; do
aws elbv2 register-targets \
--target-group-arn $TG_ARN \
--targets \
Id=$BOOTSTRAP_ID \
Id=$MASTER1_ID \
Id=$MASTER2_ID \
Id=$MASTER3_ID
done
# Ingress NLB — workers on 443 and 80
for TG_ARN in <ingress-443-tg-arn> <ingress-80-tg-arn>; do
aws elbv2 register-targets \
--target-group-arn $TG_ARN \
--targets \
Id=$WORKER1_ID \
Id=$WORKER2_ID \
Id=$WORKER3_ID
done
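Health won't be green immediately (nothing is listening yet), but registration can be confirmed and then watched as nodes boot (template; repeat per target group ARN):

```shell
# Watch target state move initial -> unhealthy -> healthy as the API
# comes up on bootstrap and masters.
watch -n 10 "aws elbv2 describe-target-health \
  --target-group-arn <int-6443-tg-arn> \
  --query 'TargetHealthDescriptions[*].[Target.Id,TargetHealth.State]' \
  --output table"
```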
Monitor Installation Progress
Watch bootstrap complete
openshift-install --dir $HOME/clusterconfig \
wait-for bootstrap-complete \
--log-level=info
Expected success output:
INFO API v1.33.x up
INFO Waiting up to 30m0s for bootstrapping to complete...
INFO It is now safe to remove the bootstrap resources
Approve CSRs (run in a separate terminal)
Each node needs two rounds of CSR approval (client certificate, then serving certificate) — run this loop throughout the install:
export KUBECONFIG=$HOME/clusterconfig/auth/kubeconfig
while true; do
oc get csr -o go-template='{{range .items}}{{if not .status.certificate}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
| xargs --no-run-if-empty oc adm certificate approve
echo "$(date) - CSR check done"
sleep 15
done
Watch nodes join
watch -n 15 'oc get nodes'
Watch install complete
openshift-install --dir $HOME/clusterconfig \
wait-for install-complete \
--log-level=info
Post Bootstrap Cleanup
Once bootstrap is complete:
# 1. Remove bootstrap from all target groups
for TG_ARN in <ext-6443-tg-arn> <ext-22623-tg-arn> <int-6443-tg-arn> <int-22623-tg-arn>; do
aws elbv2 deregister-targets \
--target-group-arn $TG_ARN \
--targets Id=<bootstrap-instance-id>
done
# 2. Release bootstrap EIP
aws ec2 disassociate-address --association-id <assoc-id>
aws ec2 release-address --allocation-id <eip-alloc-id>
# 3. Terminate bootstrap instance
aws ec2 terminate-instances --instance-ids <bootstrap-instance-id>
# 4. Stop bastion HTTP server
kill $(lsof -t -i:8080)
rm -rf ~/ignition-files/
# 5. Revoke port 8080 security group rule if added
aws ec2 revoke-security-group-ingress \
--group-id <sg-id> \
--protocol tcp \
--port 8080 \
--cidr 10.0.0.0/16
Troubleshooting
Check Bootstrap Health
ssh -i <key.pem> core@<bootstrap-private-ip>
# Check ignition completed successfully
sudo journalctl -b -u ignition* --no-pager | tail -20
# Check node-image-pull (new in OCP 4.20 — pulls RHCOS node image before bootkube)
sudo systemctl status node-image-pull.service --no-pager
sudo journalctl -b -u node-image-pull.service --no-pager | tail -30
# Check bootkube (bootstraps etcd + kube-apiserver)
sudo journalctl -b -u bootkube.service -f
# Check kubelet
sudo systemctl status kubelet
# Check all running containers
sudo crictl ps
# Check if API port is listening
ss -tlnp | grep 6443
# Test API directly (bypass LB)
curl -k https://localhost:6443/version
curl -k https://localhost:6443/readyz
Ensure Bootstrap is Completed
sudo systemctl status bootkube.service --no-pager
sudo crictl ps | grep -E "etcd|apiserver|controller|scheduler"
curl -k https://localhost:6443/version
sudo systemctl --failed
bootkube showing inactive (dead) with no failed units = success. It’s a oneshot service that exits cleanly when done.
Check Basic OC Resources
export KUBECONFIG=$HOME/clusterconfig/auth/kubeconfig
oc get nodes
oc get csr
oc get clusteroperators
node-image-pull Failures (OCP 4.20+)
If node-image-pull.service fails with i/o timeout connecting to quay.io:
# Bootstrap has no internet access — check:
# 1. Does bootstrap have a public IP?
curl http://169.254.169.254/latest/meta-data/public-ipv4
# 2. Can it reach quay.io?
curl -v https://quay.io
# 3. If no public IP — assign an EIP from bastion:
aws ec2 allocate-address --domain vpc --region ap-southeast-2
aws ec2 associate-address \
--instance-id <bootstrap-id> \
--allocation-id <eip-alloc-id>
If node-image-pull.service fails with ref coreos/node-image already exists (from a previous failed attempt):
sudo ostree refs --repo /ostree/repo --delete coreos/node-image
sudo rm -rf /ostree/repo/tmp/node-image
sudo systemctl restart node-image-pull.service
sudo journalctl -b -u node-image-pull.service -f
Masters Stuck in Ignition Fetch
If masters show A start job is running for Ignition (fetch) for more than 5 minutes:
# Check if MCS is reachable from bastion
curl -k https://api-int.ocp420.gineesh.com:22623/config/master | head -c 200
# Check what api-int resolves to (must be private IP)
dig api-int.ocp420.gineesh.com
# Should return 10.x.x.x — if returning public IP, fix Route53 private zone
# Check NLB target health
aws elbv2 describe-target-health \
--target-group-arn <int-22623-tg-arn> \
--query 'TargetHealthDescriptions[*].[Target.Id,Target.Port,TargetHealth.State]' \
--output table
# Check master console output
aws ec2 get-console-output \
--instance-id <master-instance-id> \
--output text | tail -30
Re-apply Master User-Data if Ignition Config is Invalid
If console output shows error: invalid character or config is not valid:
# Stop masters
aws ec2 stop-instances \
--instance-ids <master1-id> <master2-id> <master3-id>
aws ec2 wait instance-stopped \
--instance-ids <master1-id> <master2-id> <master3-id>
# Re-encode and apply correct ignition
MASTER_IGN=$(base64 -w0 $HOME/clusterconfig/master.ign)
for INSTANCE in <master1-id> <master2-id> <master3-id>; do
aws ec2 modify-instance-attribute \
--instance-id $INSTANCE \
--attribute userData \
--value "$MASTER_IGN"
echo "Updated $INSTANCE"
done
# Start masters
aws ec2 start-instances \
--instance-ids <master1-id> <master2-id> <master3-id>
Check NLB Target Health
# Check all target groups
aws elbv2 describe-target-groups \
--query 'TargetGroups[*].[TargetGroupName,TargetGroupArn,Port]' \
--output table
# Check health of specific target group
aws elbv2 describe-target-health \
--target-group-arn <tg-arn> \
--query 'TargetHealthDescriptions[*].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]' \
--output table
Common reasons for unhealthy targets:
- Security group missing inbound rule for that port
- Service not yet running on the instance (still booting)
- Wrong health check path or protocol
Check LB Health Check Config
aws elbv2 describe-target-groups \
--target-group-arns <tg-arn> \
--query 'TargetGroups[*].[TargetGroupName,HealthCheckProtocol,HealthCheckPort,HealthCheckPath,HealthyThresholdCount,HealthCheckIntervalSeconds]' \
--output table
Expected for port 6443: HTTPS / traffic-port / /readyz
Expected for port 22623: HTTPS / traffic-port / /readyz
TODO
- Add Terraform code to provision all AWS infrastructure (VPC, subnets, NLBs, security groups, NAT Gateway, Route53)
- Add AAP Hub S3 RWX storage configuration
- Add worker node CSR approval automation via MachineApprover
- Add cluster operator verification steps post-install