The GRAX Virtual Appliance is architected to minimize operations and maintenance. It does require routine maintenance for security and software updates, and it may also require emergency maintenance in the face of unexpected failures, outages and disasters.
This guide documents how best to maintain security and availability, and how to recover from unexpected problems.
The GRAX Virtual Appliance relies solely on an EC2 Instance Role for the GRAX service to access S3 and ElasticSearch. The Instance Role uses STS temporary credentials which rotate every 12 hours by default.
By design there are no static IAM credentials at all, so there are no keys to rotate.
Backend service, AMI and infrastructure configuration patches and upgrades are managed through CloudFormation.
Service updates are configured with the Redeploy parameter. Set Redeploy to the current unix timestamp to get the latest release.
AMI updates are configured with the ImageId parameter. By default this is set to query SSM for the latest Amazon Linux 2 AMI. If a newer AMI is available, the next CloudFormation update will trigger an image and instance replacement.
Infrastructure configuration changes are all managed through CloudFormation. The latest version of the templates is published here.
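The Redeploy update described above can be sketched in Python. This is a minimal illustration, not the official GRAX tooling: the helper names (build_update_parameters, redeploy_parameters, update_stack) are hypothetical, the caller is assumed to supply the list of parameter names defined by the stack, and the CAPABILITY_IAM capability is an assumption about what the template requires. Only Redeploy receives a new value; every other parameter keeps its previous value.

```python
import time


def build_update_parameters(existing_keys, overrides):
    """Build the Parameters list for a CloudFormation UpdateStack call.

    Keys present in `overrides` get a new value; every other key in
    `existing_keys` is marked to keep its previous value.
    """
    unknown = set(overrides) - set(existing_keys)
    if unknown:
        raise ValueError(f"unknown stack parameters: {sorted(unknown)}")

    params = []
    for key in existing_keys:
        if key in overrides:
            # CloudFormation parameter values are passed as strings.
            params.append({"ParameterKey": key, "ParameterValue": str(overrides[key])})
        else:
            params.append({"ParameterKey": key, "UsePreviousValue": True})
    return params


def redeploy_parameters(existing_keys, now=None):
    """Set Redeploy to the current unix timestamp (whole seconds)."""
    timestamp = int(time.time() if now is None else now)
    return build_update_parameters(existing_keys, {"Redeploy": timestamp})


def update_stack(stack_name, parameters):
    """Submit the update. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3  # imported lazily so the helpers above work without it

    cloudformation = boto3.client("cloudformation")
    return cloudformation.update_stack(
        StackName=stack_name,
        UsePreviousTemplate=True,
        Parameters=parameters,
        Capabilities=["CAPABILITY_IAM"],  # assumption: adjust to what the template needs
    )
```

The same build_update_parameters helper could be used to override other parameters, such as ImageId, by passing a different overrides dictionary.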
All SaaS backup data is immediately written to S3, which requires no backup or restore procedures.
RDS Aurora is configured with the default backup retention period of 1 day. Optionally you can configure periodic RDS snapshots. RDS backups and snapshots are restored through the AWS Web Management Console or through CloudFormation with the DBSnapshotIdentifier and DBPassword parameters.
ElasticSearch is configured with default hourly snapshots that are retained for 14 days. Elastic snapshots are restored through the AWS Web Management Console.
Recovery from instance or AZ failure is automated out of the box by the CloudFormation template.
A failed EC2 instance will be replaced by the multi-AZ AutoScaling Group.
A failed RDS or Elastic instance will be replaced by the multi-AZ cluster configuration.
If an internal AWS service failure affects the GRAX backend service, the ALB health check will fail, and the instance will be replaced periodically until the health check passes again.
While failures and outages can impact a single backup, GRAX is configured to take “incremental backups” (backups of all data changes since the last successful backup) on a schedule. Therefore the next hourly or daily backup after recovery will include data changed during the outage window.
Salesforce generally retains deleted data in the “Recycle Bin” for 15 days, meaning a recovery time of hours or even days should not result in any backup data loss.
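The reasoning about the outage window can be made concrete with a small calculation. The sketch below is illustrative only: the function name recovery_margin is hypothetical, and it assumes the 15-day Recycle Bin retention stated above applies to the records in question. It returns how much time remains before a record deleted just after the last successful backup would leave the Recycle Bin.

```python
from datetime import datetime, timedelta, timezone

# Retention period taken from the text above; treat it as an assumption
# for any specific Salesforce org.
RECYCLE_BIN_RETENTION = timedelta(days=15)


def recovery_margin(last_successful_backup, next_successful_backup,
                    retention=RECYCLE_BIN_RETENTION):
    """Return the time to spare before deleted records could be purged.

    A record deleted right after `last_successful_backup` stays in the
    Recycle Bin for `retention`. A negative result means the gap between
    backups was longer than the retention period.
    """
    if next_successful_backup < last_successful_backup:
        raise ValueError("next backup cannot precede the last backup")
    gap = next_successful_backup - last_successful_backup
    return retention - gap


# Example: a two-day outage between successful backups.
last_ok = datetime(2024, 1, 1, tzinfo=timezone.utc)
margin = recovery_margin(last_ok, last_ok + timedelta(days=2))
print(margin >= timedelta(0))  # True while the gap is within the retention period
```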
Common fault conditions include:
- Hitting EC2 memory or load limits
- Hitting RDS or Elastic load or capacity limits
In both of these cases:
- See Monitoring the Virtual Appliance for best practices for monitoring
- Increase the ElasticsearchInstanceType, etc. parameters to add capacity
See our support documentation for more information on how to receive support, support tiers and SLAs.