V2 AWS CloudFormation template problem

Hey Sean - hope you’re well! Aware we have quite a time difference, so I’ve erred on the side of being verbose here :slightly_smiling_face:

Been investigating this today - started out by creating the full set of AWS stacks from scratch. All seemed to work fine for me - including setting up the xtdb-ecs stack.

A few questions/steps to check things from my end (mostly assuming usage of the AWS dashboard for the below):

  • Based on the name of the original discuss thread - “Adding XTDB 2 to an existing clustered app deployment” - curious if you set up all of the stacks in that guide, or reused components from other ones? I.e., is it a new VPC set up via the xtdb-vpc stack, or an existing VPC with existing public/private subnets and security groups? (There’s a quick CLI check just below for listing the stacks.)
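
If it’s quicker, you can list which of the guide’s stacks exist in your account from the CLI - the xtdb name filter below is just an assumption based on the guide’s default stack names, so adjust it if you renamed anything:

```
# List CloudFormation stacks with 'xtdb' in the name, plus their statuses
aws cloudformation describe-stacks \
  --query "Stacks[?contains(StackName, 'xtdb')].{Name:StackName,Status:StackStatus}" \
  --output table
```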

  • If you look within EC2 → Instances - do you see any running instances that would have been created by our ECS stack/launch configuration?

    • These should be recently launched and in whichever VPC/security group you specified.
    • There should be 1 by default (as the default DesiredCapacity is 1).
    • They should be in the Running instance state.
    • If there are any issues setting these up (i.e., they are not in the Running state) - you should be able to get some logs from the startup by doing the following (or via the CLI sketch after this item):
      • Clicking on the instance and going to the instance summary page.
      • Clicking on Actions → Monitor and troubleshoot → Get system log.
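
For the CLI route, something like the below should do the same job (the instance ID is a placeholder - grab the real one from the describe-instances output):

```
# List instances with their states and launch times - check for a recently launched one
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=pending,running,stopped" \
  --query "Reservations[].Instances[].{Id:InstanceId,State:State.Name,Launched:LaunchTime}" \
  --output table

# Fetch the system log for a given instance (same as 'Get system log' in the console)
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text
```
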
  • If there is an EC2 instance present and Running, worth taking a look at the ECS cluster itself:

    • Under Elastic Container Service → Clusters → <name of cluster> (defaults to xtdb-cluster if left unchanged)
    • Should see a cluster overview at the top - we expect Registered container instances to equal 1 (if using a DesiredCapacity of 1) - there’s a CLI check for this below, too.
    • If this is not the case - there may be an issue with the running EC2 instance registering itself with the ECS cluster.
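
From the CLI, the registered count is visible on the cluster itself (assuming you kept the default xtdb-cluster name):

```
# Check the cluster status and registered container instance count
aws ecs describe-clusters --clusters xtdb-cluster \
  --query "clusters[0].{Status:status,Registered:registeredContainerInstancesCount}"
```
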
  • If there is a Registered container instances count of 1 - we should look at the service definition within the current cluster:

    • Under Tasks - we should see one running task with the TaskDefinition created by xtdb-ecs.
    • Under Logs - we should see some logs from the node starting up - something like the following set of logs:
      • Starting XTDB 2.x (pre-alpha) @ "dev-SNAPSHOT" @ commit-sha
      • Creating AWS resources for watching files on bucket <bucket-name>
      • Creating SQS queue xtdb-object-store-notifs-queue-<uuid>
      • Adding relevant permissions to allow SNS topic with ARN <SNS topic ARN>
      • Subscribing SQS queue xtdb-object-store-notifs-queue-<uuid> to SNS topic with ARN <SNS topic ARN>
      • Initializing filename list from bucket <bucket-name>
      • Watching for filechanges from bucket <bucket-name>
      • HTTP server started on port: 3000
      • Node started
    • If, in the process of creating the ECS service, you see the above logs being output multiple times and/or the task getting restarted a number of times - it may be that the service is failing healthchecks and restarting the task (there’s a CLI sketch for digging into task status/stop reasons just below).
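
To check the task state and any stop reasons from the CLI - the task ARN comes from list-tasks, and the log group name here is a guess on my part (the real one is on the task definition created by xtdb-ecs):

```
# List running and stopped tasks on the cluster
aws ecs list-tasks --cluster xtdb-cluster --desired-status RUNNING
aws ecs list-tasks --cluster xtdb-cluster --desired-status STOPPED

# Inspect a specific task - stoppedReason is handy if tasks keep cycling
aws ecs describe-tasks --cluster xtdb-cluster --tasks <task-arn> \
  --query "tasks[0].{Last:lastStatus,Stopped:stoppedReason}"

# Tail the node's logs (log group name is an assumption - check the task definition)
aws logs tail /ecs/xtdb --follow
```
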
  • If all of the above is working - you should be able to send a request to the node using the LoadBalancerUrl output from the xtdb-alb stack:

    • curl -v <LoadBalancerUrl>/status
    • Should receive a 200 status code with {"latest-completed-tx":null,"latest-submitted-tx":null} returned - if it is anything else, curious to see what response you get?
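
You can also pull the URL straight from the stack outputs if that’s easier (assuming you kept the xtdb-alb stack name):

```
# Grab the LoadBalancerUrl output from the ALB stack and hit the status endpoint
ALB_URL=$(aws cloudformation describe-stacks --stack-name xtdb-alb \
  --query "Stacks[0].Outputs[?OutputKey=='LoadBalancerUrl'].OutputValue" --output text)
curl -v "$ALB_URL/status"
```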

I have a few suspicions/further questions on the above, based on what you see:

  • If the EC2 instance hasn’t been set up correctly/isn’t Running:

    • Curious to know which AWS region you are setting up the template in?
    • There may be an issue with how we select the ImageId for the instance to use based on the region - there’s a quick way to check what your region resolves to below.
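
For reference - the ECS-optimized AMI differs per region. I’m assuming here that an up-to-date ECS-optimized Amazon Linux 2 image is what the instance should be running; you can see what AWS currently recommends for your region via the public SSM parameter:

```
# Look up the recommended ECS-optimized Amazon Linux 2 AMI for the current region
aws ssm get-parameters \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
  --query "Parameters[0].Value" --output text
```
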
  • If the EC2 instance has been setup correctly and is Running:

    • If it has not been registered as a container instance - curious to see if there’s anything in the System Log that may suggest why.
    • If it has been registered as a container instance and the task is constantly restarting / you do not get anything back from curl:
      • I suspect that the AWS resources (in particular, the Application Load Balancer) may not have internal access to the container/node - this would cause healthchecks to fail and the service to never be in the ‘ready’ state.
      • If you were using an existing VPC (i.e., not setting up a new one using xtdb-vpc), it might be the case that the security group you are using does not allow the necessary ingress/permissions - worth a look at how it is set up within xtdb-vpc for reference (and there’s a sketch below for inspecting the rules on your group).
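
If it helps, you can dump the ingress rules on the security group in question like so (the group ID is a placeholder - swap in your own):

```
# Inspect the ingress rules on the security group the ALB/container instances use
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
  --query "SecurityGroups[0].IpPermissions"
```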