One of the many advantages of the GRAX Virtual Appliance is that your own infrastructure team can monitor (and potentially alert on) the GRAX Application components. This document covers the monitoring that GRAX Support uses to manage a Virtual Appliance and provides recommendations for your own monitoring.
The monitoring suggestions in this article are all based on AWS CloudWatch, not to be confused with AWS CloudTrail. Both services typically contribute to an organisation's overall monitoring and governance posture, so understanding the difference is important when designing your own monitoring process. In short, CloudWatch monitors resources and applications (are they running, how are they performing, and so on), whereas CloudTrail enables governance, compliance, and operational risk auditing. This article doesn't cover monitoring for governance or compliance auditing, so you can assume we are talking about CloudWatch features throughout.
Once you have completed the installation steps described in the Virtual Appliance Setup, you should validate the deployment by accessing
https://[[your domain name here]]/health with either your web browser or cURL. A 200 response indicates:
- The CloudFormation stack has reached the CREATE_COMPLETE status
- The EC2 instance has been provisioned and the GRAX Service has been started
- The Load Balancer has registered the EC2 instance
- API requests from the internet are reaching the GRAX Service
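The same check can be scripted for your own monitoring. The sketch below uses only the Python standard library; the base URL passed in is a placeholder for your own domain:

```python
import urllib.request
import urllib.error

def grax_is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True when GET <base_url>/health answers 200, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection failures and non-2xx responses (HTTPError).
        return False

# Example (hypothetical domain):
# grax_is_healthy("https://grax.example.com")
```

A non-200 response or a connection failure both return False, so the function can feed a simple cron-based alert as well as a manual check.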
There are three main components of the GRAX Service: Compute (EC2), Storage (S3), and Search (AWS Elasticsearch). The GRAX AWS Deployment page details all of the components that exist and could potentially be monitored; the three above are the critical ones for your virtual appliance. We focus on CloudWatch because it lets you configure CloudWatch Alarms that fit your own support processes. Any GRAX-deployed alarms you see should not be adjusted.
The first thing you might want to check is the status of the AWS services in your region. AWS Health events can be used to alert on AWS service outages.
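One way to act on AWS Health events is an EventBridge rule that forwards them to a notification target such as an SNS topic. As a sketch, the event pattern below matches Health events for the services in the GRAX stack; the service list is illustrative and the JSON string is what boto3's `events.put_rule(EventPattern=...)` expects:

```python
import json

# Event pattern matching AWS Health events for services GRAX depends on.
# The service list is illustrative; adjust it to your stack and region.
health_event_pattern = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["EC2", "S3", "ES", "RDS", "ELASTICLOADBALANCING"],
    },
}

# Serialized form for events.put_rule(Name=..., EventPattern=pattern_json)
pattern_json = json.dumps(health_event_pattern)
```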
If the overall service is healthy, the components broken out below should be monitored for a view into how the GRAX Application stack is performing.
The GRAX compute instances are deployed as part of an Auto Scaling group, so monitoring of these resources is best done at that level. AWS has guides on CloudWatch for Auto Scaling groups and instances.
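Group-level metrics are off by default and must be enabled on the Auto Scaling group. A sketch of the parameters for boto3's `autoscaling.enable_metrics_collection` follows; the group name is a placeholder:

```python
# Keyword arguments for autoscaling.enable_metrics_collection, which turns on
# group-level CloudWatch metrics. The group name below is a placeholder.
asg_metrics_params = {
    "AutoScalingGroupName": "grax-asg",   # placeholder: your group's name
    "Granularity": "1Minute",             # the only granularity the API supports
    "Metrics": ["GroupInServiceInstances", "GroupTotalInstances"],
}

# Usage (not run here):
# boto3.client("autoscaling").enable_metrics_collection(**asg_metrics_params)
```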
When monitoring the EC2 instances, the following fairly generic metrics may prove useful:
- CPU idle, IO wait, system usage
- Disk percentage used
- Disk IO read & write metrics
- Memory used percentage
- Network metrics
- Swap metrics
However, what you probably care about most is the state of the GRAX Application itself. These instance metrics are worth considering:
- Status Check Failed (Any) — StatusCheckFailed
- Status Check Failed (Instance) — StatusCheckFailed_Instance
- Status Check Failed (System) — StatusCheckFailed_System
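As a sketch, the aggregate status check above could be alarmed with the following CloudWatch alarm parameters, expressed as the keyword arguments for boto3's `cloudwatch.put_metric_alarm`. The alarm name and instance ID are placeholders:

```python
# Alarm parameters for the aggregate EC2 status check (StatusCheckFailed).
# Status check metrics are emitted at 1-minute granularity.
status_check_alarm = {
    "AlarmName": "grax-ec2-status-check-failed",  # illustrative name
    "Namespace": "AWS/EC2",
    "MetricName": "StatusCheckFailed",
    "Dimensions": [
        {"Name": "InstanceId", "Value": "i-0123456789abcdef0"},  # placeholder
    ],
    "Statistic": "Maximum",
    "Period": 60,
    "EvaluationPeriods": 2,
    "Threshold": 1,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "TreatMissingData": "breaching",  # a stopped instance emits no data points
}

# Usage (not run here):
# boto3.client("cloudwatch").put_metric_alarm(**status_check_alarm)
```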
For closer-to-real-time monitoring you may need to enable detailed monitoring on the instances.
GRAX provides a resource at /health that returns 200 OK when things are functional. Note that if you have deployed your Virtual Appliance with the GRAX-provided Web Application Firewall template, those rules will block any external requests to this resource.
AWS S3 is an extremely reliable data storage service that GRAX uses to store backed-up data. Options for monitoring S3 are many, and the right choice depends on your reasons for monitoring. For confirming that GRAX is functioning correctly, monitoring shouldn't need to extend beyond the service health alerts.
GRAX uses the AWS Elasticsearch managed service to provide search capabilities within your dataset. This service is one of the more dynamic in the GRAX stack and should be considered key to the monitoring approach. AWS documents how to monitor this service in the Amazon Elasticsearch Service documentation.
Of all the possible AWS Elasticsearch metrics, these are the most relevant signals to consider for your monitoring:
| Elasticsearch Metric | Alarm Value | GRAX Issue |
| --- | --- | --- |
| ClusterStatus.red | >= 1 | A red cluster status indicates that at least one primary shard and its replicas are not allocated to a node. AWS Elasticsearch will stop snapshots in this case. |
| ClusterStatus.yellow | >= 1 | This status can self-resolve depending on the cause. Low disk space can also cause this and would require intervention. |
| FreeStorageSpace | <= 20480 | Indicates that a node in the cluster has less than 20 GB of spare disk space. GRAX recommends keeping at least 25% of each node's storage free. |
| ClusterIndexWritesBlocked | >= 1 | Indicates the cluster is blocking write requests. This needs resolving and could be caused by a number of different issues (e.g. FreeStorageSpace is too low or JVMMemoryPressure is too high). |
| Nodes | <= x | Where x is the number of nodes in your cluster. GRAX defaults to 2 nodes unless otherwise configured at deployment; the CloudFormation parameter ElasticsearchInstanceCount specifies this. |
| AutomatedSnapshotFailure | >= 1 | An automated snapshot failed, meaning no snapshot has been taken in the previous 36 hours. This failure is often the result of a red cluster health status. |
| CPUUtilization / WarmCPUUtilization | >= 80% | 100% utilization can occur, but sustained high usage can cause problems. These metrics are available across all nodes in the cluster or individually. |
| JVMMemoryPressure / WarmJVMMemoryPressure | >= 80% | The cluster could encounter out-of-memory errors if usage increases. Consider scaling vertically. Amazon ES uses half of an instance's RAM for the Java heap, up to a heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, at which point you can scale horizontally by adding instances. |
| MasterCPUUtilization | >= 50% | Consider using larger instance types for your dedicated master nodes. Because of their role in cluster stability and blue/green deployments, dedicated master nodes should have lower CPU usage than data nodes. |
| MasterJVMMemoryPressure | >= 80% | Consider using larger instance types for your dedicated master nodes. Because of their role in cluster stability and blue/green deployments, dedicated master nodes should have lower CPU usage than data nodes. |
| KMSKeyError | >= 1 | The KMS encryption key used to encrypt data at rest in your domain is disabled. Re-enable it to restore normal operations. |
| KMSKeyInaccessible | >= 1 | The KMS encryption key used to encrypt data at rest in your domain has been deleted or has had its grants to Amazon ES revoked. You can't recover domains in this state, but if you have a manual snapshot, you can use it to migrate to a new domain. |
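As a sketch, the first row of the table (ClusterStatus.red) could be wired up with the `cloudwatch.put_metric_alarm` parameters below. Note that AWS Elasticsearch metrics carry both DomainName and ClientId (your AWS account ID) dimensions; the values shown are placeholders:

```python
# Alarm parameters for ClusterStatus.red >= 1 (first row of the table above).
es_red_alarm = {
    "AlarmName": "grax-es-cluster-status-red",  # illustrative name
    "Namespace": "AWS/ES",
    "MetricName": "ClusterStatus.red",
    "Dimensions": [
        {"Name": "DomainName", "Value": "grax-search"},  # placeholder domain
        {"Name": "ClientId", "Value": "123456789012"},   # placeholder account ID
    ],
    "Statistic": "Maximum",
    "Period": 60,
    "EvaluationPeriods": 1,
    "Threshold": 1,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
}

# Usage (not run here):
# boto3.client("cloudwatch").put_metric_alarm(**es_red_alarm)
```

The remaining rows follow the same shape; only MetricName, Threshold, and ComparisonOperator change.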
GRAX uses AWS RDS Postgres as a relational data store for operational data (i.e. customer data is never stored here). The RDS service is fully managed by AWS, and GRAX uses it lightly, so service outages caused by problems in this tier are unexpected. The AWS documentation provides an excellent overview of monitoring RDS as well as how to monitor with CloudWatch.
Some metrics that you might consider capturing here are:
- CPU and RAM Consumption
- Disk space consumption
- Network traffic
- Database connections
- IOPS metrics
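Of the metrics above, disk space is the one most likely to need an explicit alarm. A sketch of the `cloudwatch.put_metric_alarm` parameters follows; the instance identifier and the 5 GB threshold are illustrative:

```python
# Alarm parameters for low free storage on the RDS instance.
rds_storage_alarm = {
    "AlarmName": "grax-rds-free-storage-low",  # illustrative name
    "Namespace": "AWS/RDS",
    "MetricName": "FreeStorageSpace",
    "Dimensions": [
        {"Name": "DBInstanceIdentifier", "Value": "grax-db"},  # placeholder
    ],
    "Statistic": "Minimum",
    "Period": 300,
    "EvaluationPeriods": 1,
    "Threshold": 5 * 1024**3,  # FreeStorageSpace is reported in bytes
    "ComparisonOperator": "LessThanOrEqualToThreshold",
}

# Usage (not run here):
# boto3.client("cloudwatch").put_metric_alarm(**rds_storage_alarm)
```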