V2 AWS CloudFormation template problem

I mentioned this under Adding XTDB 2 to an existing clustered app deployment - #10 by seancorfield but didn’t get a response, so I’m starting a new thread:

I worked through Getting started with XTDB on AWS | XTDB and ran into a problem:

I’ve built the cluster from scratch twice now, and both times everything went fine until the last step: setting up the ECS piece mostly works, but the ECSService never seems to complete (the other 10 resources get created just fine).

MSK took about 30 minutes, as expected in the notes.

The ECSService never completes. It sits in CREATE_IN_PROGRESS for about two hours and then rolls back the entire ECS stack. All the other pieces worked fine.

This SO post lists a whole bunch of ways this step can fail, but I don’t know enough about AWS and CloudFormation templates to figure out what to look for: Cloudformation template for creating ECS service stuck in CREATE_IN_PROGRESS - Stack Overflow

Hey Sean, apologies for the lack of response, I missed this one :man_facepalming: I’ll get Dan to investigate when he’s back on Monday.

Created #3082 for this

Hey Sean - hope you’re well! Aware we have quite a time difference, so I’ve erred on the side of being verbose here :slightly_smiling_face:

Been investigating this today - started out by creating the full set of AWS stacks from scratch. All seemed to work fine for me - including setting up the xtdb-ecs stack.

A few questions/steps to check things from my end (mostly assuming usage of the AWS dashboard for the below):

  • Based on the name of the original discuss thread - “Adding XTDB 2 to an existing clustered app deployment” - curious if you set up all of the stacks in that guide, or reused components from other ones? I.e., is it a new VPC set up via the xtdb-vpc stack, or an existing VPC with existing public/private subnets and security groups?

  • If you look within EC2 → Instances - do you see any running instances that would have been created by our ECS stack/launch configuration?

    • These should be recently launched and in whichever VPC/security group you specified.
    • There should be one by default (as the default DesiredCapacity is 1).
    • They should be in the Running instance state.
    • If there are any issues setting these up (i.e. they are not in the Running state) - you should be able to get some logs from startup by doing the following:
      • Clicking on the instance and going to the instance summary page.
      • Clicking on Actions → Monitor and troubleshoot → Get system log.
  • If there is an EC2 instance present and Running, worth taking a look at the ECS cluster itself:

    • Under Elastic Container Service → Clusters → <name of cluster> (defaults to xtdb-cluster if left unchanged)
    • You should see a cluster overview at the top - we expect Registered container instances to equal 1 (if using a DesiredCapacity of 1)
    • If this is not the case - there may be some issue with EC2 registering the running instance to the ECS cluster.
  • If there is a Registered container instances count of 1 - we should look at the service definition within the current cluster:

    • Under Tasks - we should see one running task with the TaskDefinition created by xtdb-ecs.
    • Under Logs - we should see some logs from the node starting up - something like the following set of logs:
      • Starting XTDB 2.x (pre-alpha) @ “dev-SNAPSHOT” @ commit-sha
      • Creating AWS resources for watching files on bucket <bucket-name>
      • Creating SQS queue xtdb-object-store-notifs-queue-<uuid>
      • Adding relevant permissions to allow SNS topic with ARN <SNS topic ARN>
      • Subscribing SQS queue xtdb-object-store-notifs-queue-<uuid> to SNS topic with ARN <SNS topic ARN>
      • Initializing filename list from bucket <bucket-name>
      • Watching for filechanges from bucket <bucket-name>
      • HTTP server started on port: 3000
      • Node started
    • If, in the process of creating the ECS service, you see the above logs output multiple times and/or the task getting restarted a number of times - it may be that the service is failing health checks and restarting the task.
  • If all of the above is working - you should be able to send a request to the node using the LoadBalancerUrl output from the xtdb-alb stack:

    • curl -v <LoadBalancerUrl>/status
    • You should receive a 200 status code with {"latest-completed-tx":null,"latest-submitted-tx":null} in the body - if it’s anything else, I’m curious what response you get.

I have a few suspicions/further questions on the above, based on what you see:

  • If the EC2 instance hasn’t set up correctly/isn’t Running:

    • Curious to know which AWS region you are setting the template up in?
    • There may be an issue with how we select the ImageId for the instance to use based on the region.
  • If the EC2 instance has been set up correctly and is Running:

    • If it has not been registered as a container instance - curious to see if there’s anything in the System Log that may suggest why.
    • If it has been registered as a container instance and the task is constantly restarting / you do not get anything back from curl:
      • I suspect that the AWS resources (in particular, the Application Load Balancer) may not have internal access to the container/node - this would cause healthchecks to fail and the service to never be in the ‘ready’ state.
      • If you were using an existing VPC (i.e. not setting up a new one using xtdb-vpc) it might be that the security group you are using does not allow the necessary ingress/permissions - worth a look at how it is set up within xtdb-vpc for reference.
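To illustrate the ingress point above: a minimal sketch of the kind of rule a security group would need so the ALB can reach the node’s HTTP port (3000, per the startup logs). This is a hypothetical fragment for reference, not the actual xtdb-vpc template - all logical names are placeholders:

```yaml
# Hypothetical CloudFormation fragment - logical names are placeholders.
# Allows the load balancer's security group to reach the XTDB node on
# port 3000, so that ALB health checks can succeed.
NodeHttpIngress:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !Ref NodeSecurityGroup              # SG attached to the EC2/ECS instances
    IpProtocol: tcp
    FromPort: 3000
    ToPort: 3000
    SourceSecurityGroupId: !Ref ALBSecurityGroup # SG attached to the ALB
```

If a single shared security group is used for both the ALB and the instances, the equivalent rule would reference that group as both source and target.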

Thank you for the extensive reply!

It is a completely new setup, exactly following the entire XTDB/AWS guide.

Yes, an i3.large instance is running and the status checks passed. It “inferred” Amazon Linux and used this AMI: amazon/al2023-ami-ecs-hvm-2023.0.20231219-kernel-6.1-x86_64

This seems to be where it breaks down – no registered container instances:

service xt-trial-ecs-ECSService-… was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your cluster. For more information, see the Troubleshooting section of the Amazon ECS Developer Guide.

us-east-1 (the default selected for our account).

This is the tail end of the system log for the instance:

[   13.140443] cloud-init[2671]: /var/lib/cloud/instance/scripts/part-001: line 18: /opt/aws/bin/cfn-signal: No such file or directory
[   13.149319] cloud-init[2671]: 2024-01-08 20:06:24,107 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
[   13.160204] cloud-init[2671]: 2024-01-08 20:06:24,108 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.9/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
ci-info: no authorized SSH keys fingerprints found for user ec2-user.
<14>Jan  8 20:06:24 cloud-init: #############################################################
<14>Jan  8 20:06:24 cloud-init: -----BEGIN SSH HOST KEY FINGERPRINTS-----
<14>Jan  8 20:06:24 cloud-init: 256 ...ec2.internal (ECDSA)
<14>Jan  8 20:06:24 cloud-init: 256 ...ec2.internal (ED25519)
<14>Jan  8 20:06:24 cloud-init: -----END SSH HOST KEY FINGERPRINTS-----
<14>Jan  8 20:06:24 cloud-init: #############################################################
ecdsa-sha2-nistp256 ...ec2.internal
ssh-ed25519 ...ec2.internal
[   13.389839] cloud-init[2671]: Cloud-init v. 22.2.2 finished at Mon, 08 Jan 2024 20:06:24 +0000. Datasource DataSourceEc2.  Up 13.38 seconds
[FAILED] Failed to start cloud-fina… Execute cloud user/final scripts.

Amazon Linux 2023
Kernel 6.1.66-91.160.amzn2023.x86_64 on an x86_64 (-)

ip-10-192-20-71 login: 

Hey Sean, thank you for your response!

Following on from yesterday - I’ve attempted to set up the XTDB stacks on us-east-1 on two separate AWS accounts and have not run into the same problem. Looking at the EC2 instance that was set up - I did see the following:

  • It similarly grabbed and used the same AMI - amazon/al2023-ami-ecs-hvm-2023.0.20231219-kernel-6.1-x86_64
  • It did also output /opt/aws/bin/cfn-signal: No such file or directory - this shouldn’t have much to do with registering the container instances - as it is only intended to mark the ECSAutoScalingGroup as ready - but I did push a fix for it onto 2.x.
  • The LaunchConfig for the instance references the ECSCluster within its UserData (i.e. this is used to mark which ECS cluster it is registered with) - I’ve pushed a commit to 2.x to ensure there is an explicit DependsOn here, in case some kind of dependency-ordering issue caused this to not be properly added to the UserData.

The point at which the EC2 instance is registered as a container instance on the cluster can be seen within the UserData script run on EC2 initialization - in particular:

  • echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config

Within the EC2 instance system logs - I do see the echo call being output (slightly earlier within the log):

  • [ 12.825388] cloud-init[2672]: + echo ECS_CLUSTER=xtdb_cluster_new
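Putting the two pieces above together (the echo into /etc/ecs/ecs.config and the explicit DependsOn), the relevant shape of the launch configuration looks roughly like this. This is a hedged sketch, not the exact 2.x template - logical names and most property values are placeholders:

```yaml
# Hypothetical sketch - logical names and property values are placeholders.
ECSLaunchConfig:
  Type: AWS::AutoScaling::LaunchConfiguration
  DependsOn: ECSCluster   # explicit dependency, so the cluster exists before the instance boots
  Properties:
    ImageId: !Ref ECSOptimizedAMI
    InstanceType: i3.large
    UserData:
      Fn::Base64: !Sub |
        #!/bin/bash
        # Registers this instance with the ECS cluster on boot
        echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
```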

Some more thoughts:

  • Curious to know if you see something similar in your EC2 instance’s system log, with the name of your ECS cluster? If not - there may be some kind of issue there.

  • There is the following knowledge center question on container instances failing to join ECS clusters: Troubleshoot why your ECS or EC2 instance can't join the cluster | AWS re:Post. A couple of points of interest:

    • There are a number of references to VPCs, subnets, and security groups here - could be relevant to the issue.
    • “The EC2 instance doesn’t have the required AWS Identity and Access Management (IAM) permissions. Or, the ecs:RegisterContainerInstance API call is denied.”:
      • We use an AWS managed policy for the role provided to EC2 instances, 'arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role', so I believe this should not be an issue, but it might be worth checking the IAM role specified on the instance to confirm that assumption holds.
    • There is a “Resolution” section that steps through executing a task with Systems Manager to check through the list of potential issues specified in the short description section. Usage-wise, it seems you can enter the EC2 instance ID and the name of the ECS cluster it failed to register with - this could provide some useful information?

Hopefully we can find some useful information from some of the steps above :slightly_smiling_face: I appreciate there are a number of potential things that could be causing a problem here, so if you would like to pair on it at some point let me know. (My timezone is GMT)

Yup, I see this now with the updated template. Thank you. And no errors in the system log for the instance.

Seems like the problem – see below.

Confirmed this is in place as expected.

I went through the Resolutions and the troubleshooter automation said the following – not sure what to look at from this point:

It seems like the container instance doesn’t have communication with ECS service endpoint. Container instances need access to communicate with the Amazon ECS service endpoint. This can be through an interface VPC endpoint or through your container instances having public IP addresses. For more information about interface VPC endpoints, see Amazon ECS interface VPC endpoints (AWS PrivateLink) - Amazon Elastic Container Service

If you do not have an interface VPC endpoint configured and your container instances do not have public IP addresses, then they must use network address translation (NAT) to provide this access. For more information, see NAT gateways in the Amazon VPC User Guide (NAT gateways - Amazon Virtual Private Cloud) and HTTP proxy configuration in this guide HTTP proxy configuration for Linux container instances - Amazon Elastic Container Service. For more information, see Set up to use Amazon ECS - Amazon Elastic Container Service
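For reference, the interface-endpoint option the troubleshooter mentions can be sketched in CloudFormation as below. ECS container instances typically need `ecs`, `ecs-agent`, and `ecs-telemetry` interface endpoints; this hypothetical fragment shows just one of them, and the VPC/subnet/security-group references are placeholders:

```yaml
# Hypothetical CloudFormation fragment - VPC/subnet/SG references are placeholders.
# One interface endpoint; ecs, ecs-agent, and ecs-telemetry would each need one.
EcsAgentEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    ServiceName: com.amazonaws.us-east-1.ecs-agent   # region-specific service name
    VpcEndpointType: Interface
    VpcId: !Ref Vpc
    SubnetIds:
      - !Ref PrivateSubnet1
    SecurityGroupIds:
      - !Ref NodeSecurityGroup
    PrivateDnsEnabled: true
```

Alternatively, a NAT gateway in a public subnet (with a route from the private subnets) provides the same outbound access without per-service endpoints.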

I might tear down the whole lot and go back through the instructions from scratch one more time, very carefully, to ensure everything is correctly input. I’m pretty sure I got everything right per the XTDB guide, but there are a lot of inputs and outputs to track. Will reply back when I’ve completed that (again).

Well, whatever the issue was, when I tore down the entire set of stacks and rebuilt them all from scratch with the new ecs template, it actually worked this time!

We now have an XTDB “staging” cluster on AWS to play with at work.

Glad to hear that you didn’t run into issues when rebuilding! Let me know if you have any further questions from the AWS side :slight_smile: