An ECR Deployment Script

Below is a simple script to build a Docker image and push it to Amazon ECR (Elastic Container Registry).

#!/usr/bin/env bash
set -e
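
# Example invocation (values are illustrative):
#   AWS_ACCOUNT=123456789012 AWS_REGION=us-east-1 REPO_NAME=my/repo ./deploy.sh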

log () {
  local bold=$(tput bold)
  local normal=$(tput sgr0)
  echo "${bold}${1}${normal}" 1>&2;
}

if [ -z "${AWS_ACCOUNT}" ];
then
  log "Missing a valid AWS_ACCOUNT env variable";
  exit 1;
else
  log "Using AWS_ACCOUNT '${AWS_ACCOUNT}'";
fi

AWS_REGION=${AWS_REGION:-us-east-1}
REPO_NAME=${REPO_NAME:-my/repo}

log "πŸ”‘ Authenticating..."
aws ecr get-login-password \
  --region ${AWS_REGION} \
  | docker login \
    --username AWS \
    --password-stdin \
    ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com

log "πŸ“¦ Building image..."
docker build -t ${REPO_NAME} .

log "🏷️ Tagging image..."
docker tag \
  ${REPO_NAME}:latest \
  ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:latest

log "πŸš€ Pushing to ECR repo..."
docker push \
  ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${REPO_NAME}:latest

log "πŸ’ƒ Deployment Successful. πŸ•Ί"

Using CloudFront as a Reverse Proxy

Alternate title: How to be master of your domain.

The basic idea of this post is to demonstrate how CloudFront can be used as a serverless reverse proxy, allowing you to host all of your application's content and services from a single domain. This minimizes a project's domain footprint while providing organization and performance benefits along the way.

Why

Within large organizations, it can be a pain to get a subdomain for a project, and we often don't want to deal with that process for every service-specific subdomain (e.g. api.my-project.big-institution.gov or thumbnails.my-project.big-institution.gov). As a result, we've settled on a pattern where we use CloudFront to proxy all of our domain's incoming requests to their appropriate services.

How it works

CloudFront supports multiple origin configurations. We can use the Path Pattern setting to direct web requests to the appropriate service based on their URL path. CloudFront behaves like a typical routing library: it sends traffic to the first behavior whose pattern matches the incoming request and falls back to a default behavior for requests that match no pattern. For example, our current infrastructure looks like this:

my-project.big-institution.gov/
β”œβ”€β”€ api/*         <- Application Load Balancer (ALB) that distributes traffic to order
β”‚                    management API service running on Elastic Container Service (ECS).
β”œβ”€β”€ stac/*        <- ALB that distributes traffic to STAC API service running on ECS.
β”‚
β”œβ”€β”€ storage/*     <- Private S3 bucket storing private data. Only URLs that have been
β”‚                    signed with our CloudFront keypair will be successful.
β”œβ”€β”€ thumbnails/*  <- Public S3 bucket storing thumbnail imagery.
β”‚
└── *             <- Public S3 website bucket storing our single page application frontend.

Single Page Applications

An S3 bucket configured for website hosting acts as the origin for our default route. If an incoming request's path does not match the routes specified elsewhere within the CloudFront distribution, it is routed to the single page application. To allow the single page application to handle any request path (i.e. not just paths of files that actually exist within the bucket, such as index.html or app.js), the bucket should be configured to return the application's HTML entrypoint (index.html) as a custom error page in response to 404 errors.

Requirements

To enable the use of a custom error page, the S3 bucket's website endpoint (i.e. <bucket-name>.s3-website-<region>.amazonaws.com, not <bucket-name>.s3.<region>.amazonaws.com) must be configured as a custom origin for the distribution. Additionally, the bucket must be configured for public access. More information: Using Amazon S3 Buckets Configured as Website Endpoints for Your Origin. Because the S3 website endpoint does not support SSL, the custom origin's Protocol Policy should be set to HTTP Only.
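
For illustration, here is a minimal CDK (v1) sketch of a bucket configured this way, in the style of the full distribution example at the end of this post. The construct id is made up, and it assumes the code runs inside a CDK construct (i.e. self is a construct scope):

from aws_cdk import aws_s3 as s3

website_bucket = s3.Bucket(
    self,
    "WebsiteBucket",
    website_index_document="index.html",
    # Returning index.html for 404s lets the SPA handle any request path
    website_error_document="index.html",
    # The S3 website endpoint requires publicly readable objects
    public_read_access=True,
)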

My bucket is private. Can CloudFront serve a website from this bucket?

If your bucket is private, the website endpoint will not work (source). You could configure CloudFront to send traffic to the bucket's REST API endpoint; however, this will prevent you from being able to use S3's custom error document feature, which is essential for hosting single page applications on S3.

CloudFront itself has support for custom error pages. Why can't I use that to enable hosting private S3 buckets as websites?

While it is true that CloudFront can route error responses to custom pages (e.g. returning the contents of s3://my-website-bucket/index.html for all 404 responses), these custom error pages apply to the entirety of your CloudFront distribution. This is likely undesirable for any API services hosted by your CloudFront distribution. For example, if a user accesses a RESTful API at http://my-website.com/api/notes/12345 and the API server responds with a 404 of {"details": "Record not found"}, the response body will be rewritten to contain the contents of s3://my-website-bucket/index.html. At the time of writing, I am unaware of any way to apply custom error pages to only certain content types; such a feature might make distribution-wide custom error pages a viable solution.

APIs

APIs are served as custom origins, with their Domain Name settings pointing to an ALB's DNS name.

Does this work with APIs run with Lambda or EC2?

Assuming that the service has a DNS name, it can be set up as an origin for CloudFront. This means that for an endpoint handled by a Lambda function, you would need to have it served behind an API Gateway or an ALB.

For the API origin, we configure its CloudFront Behavior as follows (these settings appear in the CDK example at the end of this post):

  • Disable caching by setting the default, minimum, and maximum TTL to 0 seconds.
  • Set AllowedMethods to forward all request methods (i.e. GET, HEAD, OPTIONS, PUT, PATCH, POST, and DELETE).
  • Set ForwardedValues so that the query string and the following headers are forwarded: referer, authorization, origin, accept, host.
  • Set the Origin Protocol Policy to HTTP Only.

Data from S3 Buckets

Data from a standard S3 bucket can be configured by pointing to the bucket's REST endpoint (e.g. <bucket-name>.s3.<region>.amazonaws.com). More information: Using Amazon S3 Buckets for Your Origin.

This can be a public bucket, in which case it benefits from the CDN and caching provided by CloudFront.

When using a private bucket, CloudFront can additionally act as a "trusted signer": an application with access to the CloudFront security keys can create signed URLs or cookies to grant temporary access to particular private content. In order for CloudFront to access content within a private bucket, its Origin Access Identity must be given read privileges within the bucket's policy. More information: Restricting Access to Amazon S3 Content by Using an Origin Access Identity.
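
As an aside, here is a rough sketch of generating such a signed URL with botocore's CloudFrontSigner and the cryptography package; the key pair ID, private key path, and URL are placeholders:

import datetime

from botocore.signers import CloudFrontSigner
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

KEY_PAIR_ID = "KXXXXXXXXXXXXX"  # placeholder CloudFront key pair / public key id
PRIVATE_KEY_PATH = "cloudfront_private_key.pem"  # placeholder path


def rsa_signer(message: bytes) -> bytes:
    # CloudFront signed URLs expect an RSA-SHA1 signature
    with open(PRIVATE_KEY_PATH, "rb") as f:
        key = serialization.load_pem_private_key(f.read(), password=None)
    return key.sign(message, padding.PKCS1v15(), hashes.SHA1())


signer = CloudFrontSigner(KEY_PAIR_ID, rsa_signer)
url = signer.generate_presigned_url(
    "https://my-project.big-institution.gov/storage/some-private-object.tif",
    date_less_than=datetime.datetime.utcnow() + datetime.timedelta(hours=1),
)
print(url)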

Caveats

The most substantial issue with this technique is that CloudFront cannot remove portions of a path from a request's URL before forwarding it to the origin. For example, if an API is configured as the origin for https://d1234abcde.cloudfront.net/api, the API itself must be configured to respond to URLs starting with /api. This is often a non-issue, as many server frameworks have built-in support for being hosted at a non-root path.

Configuring FastAPI to be served under a non-root path
from fastapi import FastAPI, APIRouter

API_BASE_PATH = '/api'

app = FastAPI(
    title="Example API",
    docs_url=API_BASE_PATH,
    swagger_ui_oauth2_redirect_url=f"{API_BASE_PATH}/oauth2-redirect",
    openapi_url=f"{API_BASE_PATH}/openapi.json",
)
api_router = APIRouter()
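
# Illustrative route: with the prefix applied below, this handler is reachable
# at /api/notes both through CloudFront and directly on the origin.
@api_router.get("/notes")
def list_notes():
    return []
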
app.include_router(api_router, prefix=API_BASE_PATH)

Furthermore, if you have an S3 bucket serving content from https://d1234abcde.cloudfront.net/bucket, only keys with a prefix of bucket/ will be available through that origin. In the event that keys are not prefixed with a path matching the origin's configured path pattern, there are two options:

  1. Move all of the files, likely utilizing something like S3 Batch (see #253 for more details)
  2. Use a Lambda@Edge function to rewrite the path of any incoming request for a non-cached resource to conform to the key structure of the S3 bucket's objects (a minimal sketch of such a function follows this list).
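
For option 2, here is a minimal sketch of such a function (written as a Lambda@Edge origin-request handler; the /bucket prefix is illustrative):

def handler(event, context):
    """Strip the path-pattern prefix before the request reaches the S3 origin.

    A request for /bucket/path/to/key is forwarded to the origin as
    /path/to/key, matching the bucket's actual key structure.
    """
    request = event["Records"][0]["cf"]["request"]
    prefix = "/bucket"
    if request["uri"].startswith(prefix):
        request["uri"] = request["uri"][len(prefix):] or "/"
    return request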

Summary

After learning this technique, it feels kind of obvious. I'm honestly not sure whether this is an AWS 101-level technique or something that is rarely done; however, I never knew of it before this project and therefore felt it was worth sharing.

A quick summary of some of the advantages that come with using CloudFront for all application endpoints:

  • It feels generally tidier to have all your endpoints placed behind a single domain. No more dealing with ugly ALB, API Gateway, or S3 URLs. This additionally pays off when you are dealing with multiple stages (e.g. prod and dev) of the same service 🧹.
  • SSL is managed and terminated at CloudFront. Everything after that is port 80 non-SSL traffic, simplifying the management of certificates 🔒.
  • All non-SSL traffic can be set to auto-redirect to SSL endpoints ↩️.
  • Out of the box, AWS Shield Standard is applied to CloudFront to provide protection against DDoS attacks 🏰.
  • Static content is regionally cached and served from Edge Locations closer to the viewer 🌏.
  • Dynamic content is also served from Edge Locations, which connect to the origin server via AWS' global private network. This is faster than connecting to an origin server over the public internet 🚀.
  • All data is served from the same domain origin. Goodbye CORS errors 👋!
  • Data egress costs are lower through CloudFront than when serving directly from many other AWS services. This can be ensured by selecting only Price Class 100; other price classes can be chosen if a global CDN is worth the higher egress costs 💴.

Example

An example of a reverse-proxy CloudFront Distribution written with CDK in Python
from aws_cdk import (
    aws_s3 as s3,
    aws_certificatemanager as certmgr,
    aws_iam as iam,
    aws_cloudfront as cf,
    aws_elasticloadbalancingv2 as elbv2,
    core,
)


class CloudfrontDistribution(core.Construct):
    def __init__(
        self,
        scope: core.Construct,
        id: str,
        api_lb: elbv2.ApplicationLoadBalancer,
        assets_bucket: s3.Bucket,
        website_bucket: s3.Bucket,
        domain_name: str = None,
        using_gcc_acct: bool = False,
        **kwargs,
    ) -> None:
        super().__init__(scope, id, **kwargs)

        oai = cf.OriginAccessIdentity(
            self, "Identity", comment="Allow CloudFront to access S3 Bucket",
        )
        if not using_gcc_acct:
            self.grant_oai_read(oai, assets_bucket)

        certificate = (
            certmgr.Certificate(self, "Certificate", domain_name=domain_name)
            if domain_name
            else None
        )

        self.distribution = cf.CloudFrontWebDistribution(
            self,
            core.Stack.of(self).stack_name,
            alias_configuration=(
                cf.AliasConfiguration(
                    acm_cert_ref=certificate.certificate_arn, names=[domain_name]
                )
                if certificate
                else None
            ),
            comment=core.Stack.of(self).stack_name,
            origin_configs=[
                # Frontend Website
                cf.SourceConfiguration(
                    # NOTE: Can't use S3OriginConfig because we want to treat our
                    # bucket as an S3 Website Endpoint rather than an S3 REST API
                    # Endpoint. This allows us to use a custom error document to
                    # direct all requests to a single HTML document (as required
                    # to host an SPA).
                    custom_origin_source=cf.CustomOriginConfig(
                        domain_name=website_bucket.bucket_website_domain_name,
                        origin_protocol_policy=cf.OriginProtocolPolicy.HTTP_ONLY,  # In website-mode, S3 only serves HTTP # noqa: E501
                    ),
                    behaviors=[cf.Behavior(is_default_behavior=True)],
                ),
                # API load balancer
                cf.SourceConfiguration(
                    custom_origin_source=cf.CustomOriginConfig(
                        domain_name=api_lb.load_balancer_dns_name,
                        origin_protocol_policy=cf.OriginProtocolPolicy.HTTP_ONLY,
                    ),
                    behaviors=[
                        cf.Behavior(
                            path_pattern="/api*",  # No trailing slash to permit access to root path of API # noqa: E501
                            allowed_methods=cf.CloudFrontAllowedMethods.ALL,
                            forwarded_values={
                                "query_string": True,
                                "headers": [
                                    "referer",
                                    "authorization",
                                    "origin",
                                    "accept",
                                    "host",  # Required to prevent API's redirects on trailing slashes directing users to ALB endpoint # noqa: E501
                                ],
                            },
                            # Disable caching
                            default_ttl=core.Duration.seconds(0),
                            min_ttl=core.Duration.seconds(0),
                            max_ttl=core.Duration.seconds(0),
                        )
                    ],
                ),
                # Assets
                cf.SourceConfiguration(
                    s3_origin_source=cf.S3OriginConfig(
                        s3_bucket_source=assets_bucket, origin_access_identity=oai,
                    ),
                    behaviors=[
                        cf.Behavior(
                            path_pattern="/storage/*", trusted_signers=["self"],
                        )
                    ],
                ),
            ],
        )
        self.assets_path = f"https://{self.distribution.domain_name}/storage"
        core.CfnOutput(self, "Endpoint", value=self.distribution.domain_name)

    def grant_oai_read(self, oai: cf.OriginAccessIdentity, bucket: s3.Bucket):
        """
        To grant read access to our OAI, at time of writing we can not simply use
        `bucket.grant_read(oai)`. This is due to the fact that we are looking up
        our bucket by its name. For more information, see the following:
        https://stackoverflow.com/a/60917015/728583.

        As a work-around, we can manually assign a policy statement; however,
        this does not work in situations where a policy is already applied to
        the bucket (e.g. in GCC environments).
        """
        policy_statement = iam.PolicyStatement(
            actions=["s3:GetObject*", "s3:List*"],
            resources=[bucket.bucket_arn, f"{bucket.bucket_arn}/storage*"],
            principals=[],
        )
        policy_statement.add_canonical_user_principal(
            oai.cloud_front_origin_access_identity_s3_canonical_user_id
        )
        assets_policy = s3.BucketPolicy(self, "AssetsPolicy", bucket=bucket)
        assets_policy.document.add_statements(policy_statement)

How to generate a database URI from an AWS Secret

A quick note about how to generate a database URI (or any other derived string) from an AWS SecretsManager SecretTargetAttachment (such as what's provided via an RDS DatabaseInstance's secret property).

db = rds.DatabaseInstance(
    # ...
)
db_val = lambda field: db.secret.secret_value_from_json(field).to_string()
task_definition.add_container(
    environment=dict(
        # ...
        PGRST_DB_URI=f"postgres://{db_val('username')}:{db_val('password')}@{db_val('host')}:{db_val('port')}/",
    ),
    # ...
)

Tips for working with a large number of files in S3

I would argue that S3 is basically AWS' best service. It's super cheap, it's basically infinitely scalable, and it never goes down (except for when it does). Part of its beauty is its simplicity. Give it a file and a key to identify that file, and you can have faith that it will store it without issue. Give it a key, and you can have faith that it will return the file represented by that key, assuming there is one.

However, when you've enlisted S3 to manage a large number of files (1M+), it can get complicated to do anything beyond simple writes and retrievals. Fortunately, there are a number of helpers available that make working with data at this scale manageable. This post aims to capture some common workflows that may be of use when working with huge S3 buckets.

Listing Files

The mere act of listing all of the data within a huge S3 bucket is a challenge. S3's list-objects API returns a max of 1000 items per request, meaning you'll have to work through thousands of pages of API responses to fully list all items within the bucket. To make this simpler, we can utilize S3's Inventory.

Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC) or Apache Parquet (Parquet) output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix.

Be aware that it can take up to 48 hours to generate an Inventory Report. From that point forward, reports can be generated on a regular interval.

An inventory report serves as a great first step when attempting to do any processing on an entire bucket of files. Often, you don't need to retrieve the inventory report manually from S3. Instead, it can be fed into Athena or S3 Batch Operations as described below.

However, when you do need to access the data locally, downloading and reading all of the gzipped CSV files that make up an inventory report can be somewhat tedious. The following script was written to help with this process. Its output can be piped to a local CSV file to create a single output or sent to another function for processing.

Stream S3 Inventory Report Python script
import json
import csv
import gzip

import boto3

s3 = boto3.resource('s3')


def list_keys(bucket, manifest_key):
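    # The manifest.json document lists the gzipped CSV data files that make up
    # the report; stream and decompress each one, yielding its rows
    # (bucket, key, ...metadata).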
    manifest = json.load(s3.Object(bucket, manifest_key).get()['Body'])
    for obj in manifest['files']:
        gzip_obj = s3.Object(bucket_name=bucket, key=obj['key'])
        buffer = gzip.open(gzip_obj.get()["Body"], mode='rt')
        reader = csv.reader(buffer)
        for row in reader:
            yield row


if __name__ == '__main__':
    bucket = 's3-inventory-output-bucket'
    manifest_key = 'path/to/my/inventory/2019-12-15T00-00Z/manifest.json'

    for obj_bucket, obj_key, *rest in list_keys(bucket, manifest_key):
        print(obj_bucket, obj_key, *rest)

Querying files by S3 Properties

Sometimes you may need a subset of the files within S3, based on some metadata property of the object (e.g. the key's extension). While you can use the S3 list-objects API to list files beginning with a particular prefix, you cannot filter by suffix. To get around this limitation, we can use AWS Athena to query over an S3 Inventory report (see the sketch following the Athena steps below).

1. Create a table

This example assumes that you chose CSV as the S3 Inventory Output Format. For information on other formats, review the docs.

CREATE EXTERNAL TABLE your_table_name(
  `bucket` string,
  key string,
  version_id string,
  is_latest boolean,
  is_delete_marker boolean,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string,
  object_lock_retain_until_date timestamp,
  object_lock_mode string,
  object_lock_legal_hold_status string
  )
  PARTITIONED BY (dt string)
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://destination-prefix/source-bucket/YOUR_CONFIG_ID/hive/';
2. Add the inventory report partitions
MSCK REPAIR TABLE your_table_name;
3. Query for S3 keys by their filename, size, storage class, etc.
SELECT storage_class, count(*) as count
FROM your_table_name
WHERE dt = '2019-12-22-00-00'
GROUP BY storage_class

More information about querying Storage Inventory files with Athena can be found here.
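
Coming back to the original motivation of filtering by key suffix, here is a hedged sketch of running such a query programmatically with boto3; the table, database, results bucket, and .tif extension are all illustrative:

import boto3

athena = boto3.client("athena")

# Find objects by key suffix, something the S3 list-objects API cannot do directly
query = """
    SELECT key, size
    FROM your_table_name
    WHERE dt = '2019-12-22-00-00'
      AND key LIKE '%.tif'
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(response["QueryExecutionId"])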

Processing Files

Situations may arise where you need to run all (or a large number) of the files within an S3 bucket through some operation. S3 Batch Operations (not to be confused with AWS Batch) is built to do the following:

copy objects, set object tags or access control lists (ACLs), initiate object restores from Amazon S3 Glacier, or invoke an AWS Lambda function to perform custom actions using your objects.

With that last feature, invoking an AWS Lambda function, we can use Batch Operations to process a massive number of files without dealing with any of the complexity associated with data-processing infrastructure. Instead, we provide S3 Batch Operations with a CSV or S3 Inventory manifest file and a Lambda function to run over each file.

To work with S3 Batch Operations, the Lambda function must return a particular response object describing whether the process succeeded, failed permanently, or failed temporarily and should be retried.

S3 Batch Operation Boilerplate Python script
import urllib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.resource("s3")


TMP_FAILURE = "TemporaryFailure"
FAILURE = "PermanentFailure"
SUCCESS = "Succeeded"


def process_object(src_object):
    return "TODO: Populate with processing task..."


def get_task_id(event):
    return event["tasks"][0]["taskId"]


def parse_job_parameters(event):
    # Parse job parameters from Amazon S3 batch operations
    # jobId = event["job"]["id"]
    invocationId = event["invocationId"]
    invocationSchemaVersion = event["invocationSchemaVersion"]
    return dict(
        invocationId=invocationId, invocationSchemaVersion=invocationSchemaVersion
    )


def get_s3_object(event):
    # Parse Amazon S3 Key, Key Version, and Bucket ARN
    s3Key = urllib.parse.unquote(event["tasks"][0]["s3Key"])
    s3VersionId = event["tasks"][0]["s3VersionId"]  # Unused
    s3BucketArn = event["tasks"][0]["s3BucketArn"]
    s3Bucket = s3BucketArn.split(":::")[-1]
    return s3.Object(s3Bucket, s3Key)


def build_result(status: str, msg: str):
    return dict(resultCode=status, resultString=msg)


def handler(event, context):
    task_id = get_task_id(event)
    job_params = parse_job_parameters(event)
    s3_object = get_s3_object(event)

    try:
        output = process_object(s3_object)
        # Mark as succeeded
        result = build_result(SUCCESS, output)
    except ClientError as e:
        # If request timed out, mark as a temp failure
        # and Amazon S3 Batch Operations will retry the task. If
        # any other exceptions are received, mark as permanent failure.
        errorCode = e.response["Error"]["Code"]
        errorMessage = e.response["Error"]["Message"]
        if errorCode == "RequestTimeout":
            result = build_result(
                TMP_FAILURE, "Retry request to Amazon S3 due to timeout."
            )
        else:
            result = build_result(FAILURE, f"{errorCode}: {errorMessage}")
    except Exception as e:
        # Catch all exceptions to permanently fail the task
        result = build_result(FAILURE, f"Exception: {e}")

    return {
        **job_params,
        "treatMissingKeysAs": "PermanentFailure",
        "results": [{**result, "taskId": task_id}],
    }

S3 Batch Operations will then run every key through this Lambda handler, retry temporary failures, and log its results in result files. The result files are conveniently grouped by success/failure status and linked to from a Manifest Result File.

Example Manifest Result File
{
    "Format": "Report_CSV_20180820",
    "ReportCreationDate": "2019-04-05T17:48:39.725Z",
    "Results": [
        {
            "TaskExecutionStatus": "succeeded",
            "Bucket": "my-job-reports",
            "MD5Checksum": "83b1c4cbe93fc893f54053697e10fd6e",
            "Key": "job-f8fb9d89-a3aa-461d-bddc-ea6a1b131955/results/6217b0fab0de85c408b4be96aeaca9b195a7daa5.csv"
        },
        {
            "TaskExecutionStatus": "failed",
            "Bucket": "my-job-reports",
            "MD5Checksum": "22ee037f3515975f7719699e5c416eaa",
            "Key": "job-f8fb9d89-a3aa-461d-bddc-ea6a1b131955/results/b2ddad417e94331e9f37b44f1faf8c7ed5873f2e.csv"
        }
    ],
    "ReportSchema": "Bucket, Key, VersionId, TaskStatus, ErrorCode, HTTPStatusCode, ResultMessage"
}

More information about the Complete Report format can be found here.


At time of writing, S3 Batch Operations cost $0.25 / job + $1 / million S3 objects processed.

Price to process 5 million thumbnails in 2hrs:

  • S3 Batch Operations: $0.25 + (5 * $1) = $5.25
  • Lambda: 128MB * 2000 ms * 5,000,000 = $21.83 (see the sketch below for the math)
  • S3 Get Requests: 5,000,000 / 1000 * $0.0004 = $2
  • S3 Put Requests: 5,000,000 / 1000 * $0.005 = $25
  • TOTAL: $54.08
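
For reference, a back-of-the-envelope check of the Lambda line item, assuming the standard Lambda pricing at the time ($0.0000166667 per GB-second plus $0.20 per million requests):

invocations = 5_000_000
gb_seconds = (128 / 1024) * 2.0 * invocations      # 1,250,000 GB-seconds
compute_cost = gb_seconds * 0.0000166667           # ~$20.83
request_cost = (invocations / 1_000_000) * 0.20    # $1.00
print(round(compute_cost + request_cost, 2))       # ~21.83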

Things not discussed in this post

If you are looking for more techniques on querying data stored in S3, consider the following: