Hey Sean - hope you’re well! Aware we have quite a time difference, so I’ve erred on the side of being verbose here.
Been investigating this today - started out by creating the full set of AWS stacks from scratch. All seemed to work fine for me - including setting up the `xtdb-ecs` stack.
A few questions/steps to check things from my end (mostly assuming usage of the AWS dashboard for the below):
- Based on the name of the original discuss thread - “Adding XTDB 2 to an existing clustered app deployment” - curious if you set up all of the stacks in that guide, or reused components from other ones? Ie - is it a new VPC set up via the `xtdb-vpc` stack, or an existing VPC with existing public/private subnets and security groups?
- If you look within `EC2` → `Instances` - do you see any running instances that would have been created by our ECS stack/launch configuration?
  - These should be recently launched and on whichever VPC/security group you specified.
  - There should be 1 by default (as the default value for `DesiredCapacity` is 1).
  - They should be in the Running instance state.
  - If there are any issues setting these up (ie, they are not in the Running state) - you should be able to get some logs from the startup by doing the following:
    - Clicking on the instance and going to the instance summary page.
    - Clicking on `Actions` → `Monitor and troubleshoot` → `Get system log`.
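If the CLI is more convenient than clicking through the console, a rough equivalent of the above checks (the instance id below is a placeholder - assumes your CLI is configured against the same account/region as the stacks):

```sh
# List instances with their state, launch time and security groups
aws ec2 describe-instances \
  --query "Reservations[].Instances[].{Id:InstanceId,State:State.Name,Launched:LaunchTime,SecurityGroups:SecurityGroups[].GroupName}"

# Equivalent of Actions → Monitor and troubleshoot → Get system log
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text
```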
- If there is an EC2 instance present and Running, worth taking a look at the ECS cluster itself:
  - Under `Elastic Container Service` → `Clusters` → `<name of cluster>` (defaults to `xtdb-cluster` if left unchanged).
  - Should see a cluster overview at the top - we expect to see `Registered container instances` equal to 1 (if using a `DesiredCapacity` of 1).
  - If this is not the case - there may be some issue with EC2 registering the running instance to the ECS cluster.
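Again, the CLI equivalent if that’s quicker (assuming the cluster name was left as the default `xtdb-cluster`):

```sh
# Check how many instances have registered with the ECS cluster
aws ecs describe-clusters --clusters xtdb-cluster \
  --query "clusters[].{Name:clusterName,RegisteredInstances:registeredContainerInstancesCount,RunningTasks:runningTasksCount}"
```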
- If there is a `Registered container instances` count of 1 - we should look at the service definition within the current cluster:
  - Under `Tasks` - we should see one running task with the TaskDefinition created by `xtdb-ecs`.
  - Under `Logs` - we should see some logs from the node starting up - something like the following:
    - Starting XTDB 2.x (pre-alpha) @ "dev-SNAPSHOT" @ `commit-sha`
    - Creating AWS resources for watching files on bucket `<bucket-name>`
    - Creating SQS queue `xtdb-object-store-notifs-queue-<uuid>`
    - Adding relevant permissions to allow SNS topic with ARN `<SNS topic ARN>`
    - Subscribing SQS queue `xtdb-object-store-notifs-queue-<uuid>` to SNS topic with ARN `<SNS topic ARN>`
    - Initializing filename list from bucket `<bucket-name>`
    - Watching for file changes from bucket `<bucket-name>`
    - HTTP server started on port: 3000
    - Node started
  - If, in the process of creating the ECS service, you see the above logs being outputted multiple times and/or the task is getting restarted a number of times - it may be the case that the service is failing healthchecks and restarting the task.
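To check the task status and logs from the CLI (the log group name below is a guess - `aws logs describe-log-groups` will list the real one):

```sh
# List the tasks in the cluster, then check their status / why they stopped
aws ecs list-tasks --cluster xtdb-cluster
aws ecs describe-tasks --cluster xtdb-cluster --tasks <task-arn-from-above> \
  --query "tasks[].{LastStatus:lastStatus,DesiredStatus:desiredStatus,StoppedReason:stoppedReason}"

# Tail the container logs (log group name is a placeholder)
aws logs tail /ecs/xtdb --follow
```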
- If all of the above is working - you should be able to send a request to the node using the `LoadBalancerUrl` outputted from the `xtdb-alb` stack: `curl -v <LoadBalancerUrl>/status`
  - Should receive a status code 200 response with `{"latest-completed-tx":null,"latest-submitted-tx":null}` returned - if it is anything else, curious to see what response you get?
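For reference, you can pull that URL straight out of the stack outputs rather than copying it from the console (assuming the stack was deployed under the name `xtdb-alb`):

```sh
# Fetch the LoadBalancerUrl output from the xtdb-alb stack and hit /status
ALB_URL=$(aws cloudformation describe-stacks --stack-name xtdb-alb \
  --query "Stacks[0].Outputs[?OutputKey=='LoadBalancerUrl'].OutputValue" --output text)
curl -v "$ALB_URL/status"
```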
I have a few suspicions/further questions on the above, based on what you see:
- If the `EC2` instance hasn’t set up correctly/isn’t Running:
  - Curious to know which AWS region you are setting up the template in?
  - There may be an issue with how we select the `ImageId` for the instance to use based on the region.
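One way to sanity-check that (I’m assuming the template launches the standard ECS-optimized Amazon Linux 2 AMI - the region and instance id below are placeholders):

```sh
# AMI the instance was actually launched with
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query "Reservations[].Instances[].ImageId"

# Recommended ECS-optimized AMI for your region, via the public SSM parameter
aws ssm get-parameters \
  --names /aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id \
  --region eu-west-1 --query "Parameters[].Value"
```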
- If the `EC2` instance has been set up correctly and is Running:
  - If it has not been registered as a container instance - curious to see if there’s anything in the System Log that may suggest why.
  - If it has been registered as a container instance and the task is constantly restarting / you do not get anything back from `curl`:
    - I suspect that the AWS resources (in particular, the Application Load Balancer) may not have internal access to the container/node - this would cause healthchecks to fail and the service to never be in the ‘ready’ state.
    - If you were using an existing VPC (ie, not setting up a new one using `xtdb-vpc`) it might be the case that the security group you are using does not allow the necessary ingress/permissions - worth a look at how it is set up within `xtdb-vpc` for reference.
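If you want to check the healthcheck side directly, something like this should show whether the ALB can see a healthy target and what ingress the instance security group actually allows (the target group ARN and group id are placeholders):

```sh
# Is the ALB's target group reporting healthy targets?
aws elbv2 describe-target-groups \
  --query "TargetGroups[].{Name:TargetGroupName,Arn:TargetGroupArn}"
aws elbv2 describe-target-health --target-group-arn <target-group-arn-from-above>

# What ingress rules does the instance/container security group allow?
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
  --query "SecurityGroups[].IpPermissions"
```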