Controlling shadow IT, tracking costs, and enforcing compliance in AWS
A large government agency was preparing its Amazon Web Services (AWS) cloud hosting environment to support more than 40 applications expected to be onboarded over a short period, in anticipation of the closure of the Department's data centers.
AWS offered the ability to reduce shadow IT, track and charge costs more accurately, and drive enterprise standards more broadly. However, the Application Support Branch was focused on supporting existing data center-based workloads. Managing those environments, combined with a relative dearth of cloud-experienced engineers, meant that the customer did not have the resources to manage the growing cloud environment in the same way.
At minimum, the customer needed visibility into every virtual machine in every account, along with its associated tag data; every virtual machine needed to be tagged to a valid cost center; and users needed to be alerted in real time when they created improperly configured resources.
Due to the rapid growth of the cloud hosting environment in its initial phases, there were a large number of unlabeled resources with no owner or clear purpose: shadow IT. These could not always be safely spun down, since they could be hosting critical services, yet without accurate tagging their costs could not be tracked to a cost center. To stem the problem at its source, a mechanism was needed to prevent the creation of shadow IT resources. Furthermore, users needed feedback that explained how to adhere to the requirements of this new environment.
The solution was developed and deployed in three iterations. First, a Lambda function was developed to eliminate shadow IT. The function assumed a role in each account to collect information about all EC2 instances, including their tags. This information was aggregated into a report of all accounts' resources and stored in an S3 bucket. A simple static site was created to pull reports from the bucket, giving end users instant access to up-to-date information about the agency's IT infrastructure. The site was originally designed to be hosted in an S3 bucket, but security considerations led to a pivot and it was instead hosted on a t2.micro EC2 instance. The report bucket and front-end site also provided the flexibility to add additional reports in support of other business requirements.
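As an illustration, a minimal version of the inventory function might look like the sketch below, using boto3. The role name, account list, and bucket name are placeholders rather than details from the actual deployment.

```python
import json
import boto3

# Placeholder values; the real role name, account list, and bucket differ.
INVENTORY_ROLE_ARN = "arn:aws:iam::{account_id}:role/InventoryReadRole"
ACCOUNT_IDS = ["111111111111", "222222222222"]
REPORT_BUCKET = "agency-cloud-reports"

sts = boto3.client("sts")
s3 = boto3.client("s3")


def collect_instances(account_id):
    """Assume a read-only role in the target account and list its EC2 instances with tags."""
    creds = sts.assume_role(
        RoleArn=INVENTORY_ROLE_ARN.format(account_id=account_id),
        RoleSessionName="ec2-inventory",
    )["Credentials"]
    ec2 = boto3.client(
        "ec2",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    instances = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instances.append({
                    "AccountId": account_id,
                    "InstanceId": instance["InstanceId"],
                    "State": instance["State"]["Name"],
                    "Tags": {t["Key"]: t["Value"] for t in instance.get("Tags", [])},
                })
    return instances


def lambda_handler(event, context):
    """Aggregate instance data from every account into a single report in S3."""
    report = []
    for account_id in ACCOUNT_IDS:
        report.extend(collect_instances(account_id))
    s3.put_object(
        Bucket=REPORT_BUCKET,
        Key="reports/ec2-inventory.json",
        Body=json.dumps(report, default=str),
    )
    return {"instances": len(report)}
```

A single read-only role assumed in each account keeps the function's permissions narrow while still producing a consolidated, cross-account view for the static site to display.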
The next step was ensuring that all compute instances were tagged correctly to enable chargeback to the correct cost centers. Each application was assigned a unique billing code; however, development teams did not consistently adhere to enterprise tagging standards. To address this, an additional Lambda function was developed, triggered by EC2 creation and EC2 start events delivered to an event bus for all accounts. When an instance was created or started in any of the customer's AWS accounts, the Lambda function would verify that the instance had all required tags, including a billing code. If the instance did not have a valid billing code, the function would stop it. Reports on these actions were also stored in the reporting S3 bucket.
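A minimal sketch of the enforcement handler is shown below, assuming it receives EC2 instance state-change events from the event bus and runs with permission to describe and stop instances in the originating account; the tag key, valid billing codes, and bucket name are placeholders, and the cross-account role assumption is omitted for brevity.

```python
import datetime
import json
import boto3

# Placeholder values; the real tag key, billing codes, and bucket differ.
BILLING_TAG = "BillingCode"
VALID_CODES = {"APP-001", "APP-002"}
REPORT_BUCKET = "agency-cloud-reports"

ec2 = boto3.client("ec2")
s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Check required tags when an instance enters the pending or running state."""
    detail = event.get("detail", {})
    instance_id = detail.get("instance-id")
    if not instance_id or detail.get("state") not in ("pending", "running"):
        return

    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}

    compliant = tags.get(BILLING_TAG) in VALID_CODES
    if not compliant:
        # Stop, rather than terminate, so the owner can add a valid billing
        # code and restart the instance.
        ec2.stop_instances(InstanceIds=[instance_id])

    # Record the outcome alongside the inventory reports.
    timestamp = datetime.datetime.utcnow().isoformat()
    s3.put_object(
        Bucket=REPORT_BUCKET,
        Key=f"tag-compliance/{instance_id}-{timestamp}.json",
        Body=json.dumps({"InstanceId": instance_id, "Compliant": compliant, "Tags": tags}),
    )
```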
To notify users when improperly configured resources were remediated or needed correction, Simple Notification Service (SNS) was proposed. However, business users did not find the formatting of the emailed information accessible. Simple Email Service (SES) was not FedRAMP approved, so another solution had to be identified. A Python library, SMTPy, was identified that would allow a Lambda function to connect to the customer's SMTP server and send an HTML-formatted email. By attaching the Lambda function to the customer's VPC, the function was able to reach the SMTP server without modifying security group or network settings. This solution notified users when resources in AWS were not compliant and provided guidance on remediating the issue.
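For illustration, the same kind of notification can be sent with Python's standard smtplib and email modules, shown here as a stand-in for the library mentioned above; the relay hostname, sender address, and message content are assumptions.

```python
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Placeholder values; the real relay host and addresses differ.
SMTP_HOST = "smtp.agency.example.gov"
SMTP_PORT = 25
SENDER = "cloud-compliance@agency.example.gov"


def send_compliance_notice(recipient, instance_id, missing_tags):
    """Send an HTML-formatted remediation notice through the agency's SMTP relay."""
    msg = MIMEMultipart("alternative")
    msg["Subject"] = f"Action required: {instance_id} is missing required tags"
    msg["From"] = SENDER
    msg["To"] = recipient

    html = f"""
    <html><body>
      <p>Instance <b>{instance_id}</b> was stopped because it is missing the
         following required tags: {', '.join(missing_tags)}.</p>
      <p>Add a valid billing code tag and restart the instance.</p>
    </body></html>
    """
    msg.attach(MIMEText(html, "html"))

    # The Lambda function is attached to the customer's VPC, so the internal
    # relay is reachable without changes to security groups or network settings.
    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
        server.sendmail(SENDER, [recipient], msg.as_string())
```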
The solution addressed the customer's needs: the reporting tools give stakeholders immediate information about any VM deployed in the environments, eliminating shadow IT infrastructure, and the tagging compliance ensures that the reported information is complete and accurate. Additionally, because the Office of the Chief Information Officer (OCIO) is able to charge individual applications more accurately for their cloud spend, OCIO funds have been freed up to invest proactively in high-value enterprise solutions. Finally, providing real-time feedback to users when remediation is needed, rather than waiting for an audit, creates an opportunity to train users on cloud best practices, reducing the time to train and deploy a 'Cloud Competent' workforce.
By favoring AWS managed services and sticking to a (nearly) serverless architecture, the solution optimized cost for the customer, both in direct spend and in resource-hours saved from operational maintenance. Managed services are not impacted by customer-side outages, increasing reliability and uptime. They also improved the security posture of the solution: with no exposed network endpoints or additional authentication mechanisms, no new vulnerabilities were introduced. Implementing loosely coupled services with clear design patterns keeps the solution flexible and efficient; the reporting tool has since had several additional business reports added to address other cloud concerns, including providing visibility into other cloud providers.
Using AWS Security Token Service (STS) to assume roles across accounts, rather than maintaining separate credentials and processes in each one, reduced duplicated work and the risk of human error.