Cross-Account Private Connection: Integrating AWS Glue with Amazon RDS in Different Accounts


Problem statement:

The Data Engineering team can’t access and process data from an Amazon RDS instance within AWS Glue because the two resources are located in different AWS accounts. The accounts are kept separate to define a boundary between service teams, and simple IP-based allow-listing does not work with serverless services such as AWS Glue. This document outlines how we can establish a private and secure cross-account connection between AWS Glue and Amazon RDS using an RDS Proxy in combination with a Network Load Balancer.

Setup of 2 AWS accounts, Account A hosting the Database (RDS), Account B hosting AWS Glue.
  • The Amazon RDS instance resides within Account A within a dedicated VPC (RDS VPC).
  • AWS Glue is provisioned in Account B within a private VPC.
  • Both accounts, A and B, are in the same region “eu-central-1”.
  • Goal: enable a private cross-account connection between Amazon RDS and AWS Glue.

Background:

As a business grows and concerns within the company diverge, it becomes challenging to separate resources in a way that meaningfully supports the business. One AWS option for building isolated services with simple access management is to place services in different AWS accounts, each associated with a business unit. Concerns that cut across business units then require data exchange that is secure, private, reliable, and ideally maintenance-free.

In our case, Account A represents a project that relies on the Amazon RDS instance as its main database, without any need for analytical processing (OLAP) within the project. Account B represents the analytics architecture. That’s why we need a stable, secure connection between the data processing service in Account B (AWS Glue) and Amazon RDS in Account A.

Tractive approach:

1. On Account B:

  • Add a JDBC connection to AWS Glue
  • The connection URL follows the pattern: jdbc:protocol://host:port/dbname

In our case: jdbc:mysql://endpoint_c:rds_port/rds_db_name

  • Use the username and password of the RDS database
  • Create a VPC endpoint that enables a private connection to the VPC endpoint service connected to the Network Load Balancer in Account A.
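
As a minimal sketch of the Account B steps above, the snippet below assembles the JDBC URL from the pattern and builds a dictionary in the shape that boto3's Glue create_connection call accepts as ConnectionInput. The host name, credentials, subnet, security group, and availability zone are hypothetical placeholders, and port 3306 is only the MySQL default; substitute the values of your own VPC endpoint and database.

```python
def jdbc_url(protocol, host, port, dbname):
    """Assemble a JDBC URL following jdbc:protocol://host:port/dbname."""
    return f"jdbc:{protocol}://{host}:{port}/{dbname}"


def glue_connection_input(name, url, username, password,
                          subnet_id, security_group_ids, availability_zone):
    """Build the ConnectionInput structure for a Glue JDBC connection."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": url,
            "USERNAME": username,
            "PASSWORD": password,
        },
        # Glue attaches network interfaces in this subnet of the Glue VPC.
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": list(security_group_ids),
            "AvailabilityZone": availability_zone,
        },
    }


# All values below are hypothetical placeholders, not taken from a real setup.
url = jdbc_url("mysql", "endpoint-c.example.internal", 3306, "rds_db_name")
connection_input = glue_connection_input(
    name="rds-cross-account",
    url=url,
    username="glue_reader",
    password="change-me",
    subnet_id="subnet-00000000",
    security_group_ids=["sg-00000000"],
    availability_zone="eu-central-1a",
)
print(url)
```

With boto3 installed and credentials for Account B, this dictionary could be passed as boto3.client("glue").create_connection(ConnectionInput=connection_input).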

2. On Account A:

  • A Network Load Balancer (NLB) inside a private VPC fronts the private RDS Proxy endpoint and is exposed as an endpoint service for Account B to use.
  • Endpoint B defines the endpoint service.
  • Endpoint A is the target of the NLB.
  • The NLB has a registered target group that directs traffic for the RDS instance to the RDS Proxy.
  • The RDS Proxy enables a private connection to the RDS cluster using provisioned capacity.
  • The RDS Proxy reads data from the RDS database using the credentials saved in Secrets Manager. It also protects the database, since it has its own limits on the database resources it can use.
  • Any request that arrives at the private RDS Proxy endpoint is forwarded to the RDS Proxy and then routed to the RDS database.
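
The Account A side can be sketched as three request payloads: a TCP target group for the NLB, an endpoint service backed by the NLB, and a permission that lets Account B connect to it. The function below only builds the keyword arguments; it assumes IP-type targets pointing at the RDS Proxy endpoint, and the resource names, ARN, and account ID are hypothetical.

```python
def account_a_requests(vpc_id, nlb_arn, db_port, consumer_account_id):
    """Build keyword arguments for three boto3 calls made in Account A."""
    return {
        # elbv2.create_target_group: where the NLB forwards database traffic.
        "create_target_group": {
            "Name": "rds-proxy-targets",  # hypothetical name
            "Protocol": "TCP",
            "Port": db_port,
            "VpcId": vpc_id,
            "TargetType": "ip",  # assumption: proxy endpoint IPs are registered
        },
        # ec2.create_vpc_endpoint_service_configuration: exposes the NLB
        # as an endpoint service that other accounts can connect to.
        "create_vpc_endpoint_service_configuration": {
            "NetworkLoadBalancerArns": [nlb_arn],
            "AcceptanceRequired": True,
        },
        # ec2.modify_vpc_endpoint_service_permissions: allow Account B.
        # The ServiceId is only known after the service has been created.
        "modify_vpc_endpoint_service_permissions": {
            "AddAllowedPrincipals": [f"arn:aws:iam::{consumer_account_id}:root"],
        },
    }


# Hypothetical identifiers for illustration only.
calls = account_a_requests(
    vpc_id="vpc-00000000",
    nlb_arn="arn:aws:elasticloadbalancing:eu-central-1:111111111111:"
            "loadbalancer/net/example-nlb/0000000000000000",
    db_port=3306,
    consumer_account_id="222222222222",
)
for api_call, kwargs in calls.items():
    print(api_call, kwargs)
```

Keeping AcceptanceRequired set to True means someone in Account A has to approve the connection request from Account B, which fits the goal of a clear boundary between the two teams.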

3. Back to Account B:

  • Create a Glue Crawler with a JDBC data store
  • Use the JDBC connection created in step 1
  • In the crawler settings, set the include path to db_name/table_name
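
The crawler settings above map onto the arguments of boto3's Glue create_crawler call. The sketch below only builds those arguments; the crawler name, IAM role ARN, catalog database, and connection name are hypothetical.

```python
def crawler_request(name, role_arn, catalog_database, connection_name, include_path):
    """Build keyword arguments for glue.create_crawler with one JDBC target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": catalog_database,  # Data Catalog database in Account B
        "Targets": {
            "JdbcTargets": [
                {
                    "ConnectionName": connection_name,
                    "Path": include_path,  # include path: db_name/table_name
                }
            ]
        },
    }


# Hypothetical values for illustration only.
crawler_kwargs = crawler_request(
    name="rds-cross-account-crawler",
    role_arn="arn:aws:iam::222222222222:role/example-glue-role",
    catalog_database="analytics_catalog",
    connection_name="rds-cross-account",
    include_path="db_name/table_name",
)
print(crawler_kwargs["Targets"]["JdbcTargets"][0]["Path"])
```

With boto3, the call would be boto3.client("glue").create_crawler(**crawler_kwargs). Glue also accepts the percent sign as a wildcard in the include path, so db_name/% would cover every table in the database.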

To conclude: this allows AWS Glue to access an RDS database in a different AWS account using a private connection, without any IP-based allow-listing.

Alternative Solutions

Including the proposed approach, there are four ways to give AWS Glue read access to RDS. They are compared below:

VPC Peering

Expose the RDS VPC in Account A to the Glue VPC in Account B, with no isolation between them.
Complexity: This setup needs some networking experience, as stated in this blog.
Cost: $0.02 per GB of transferred data
Dependencies: High dependency between the two accounts, so a change on one side can easily affect the other.
Security Considerations: Exposes the two VPCs to each other, but traffic stays within the private network.

Data Catalog cross-account access

Run Glue Crawlers on Account A to fill the Data Catalog there.
Then, enable cross-account access for this Data Catalog to Glue on Account B.
Complexity: Straightforward with IAM policies, as stated in this blog.
Cost: No extra cost for cross-account access.
Dependencies: High dependency on the team that owns Account A, which mostly has no data cataloging experience.

Connect through Internet Gateway

Connect Glue to a NAT Gateway and then to an Internet Gateway that is allow-listed on the RDS VPC, so traffic crosses the public internet (VPC > public web > VPC).
Complexity: Needs some networking knowledge, but is still doable, as stated in this blog.
Cost: Glue crawlers access to rely on Account A.
Dependencies: No dependency at all.
Security: Risk of connection traffic passing through the public internet (not secure).

Connect through Proxy (proposed above)

Enable Glue to establish a JDBC connection within an end-to-end private route to RDS.
Complexity: Needs more networking experience, combined with ETL concepts.
Cost: No costs for gateways, $25 for the RDS Proxy (dependent on RDS size), $20 for the Network Load Balancer
Dependencies: No dependency at all, as Glue runs a JDBC connection with read-only access on the RDS instance.
Security: Highly secure, as the whole route exists within an end-to-end private network.

Summary
For a few source tables, granting cross-account access to the Data Catalog works well and keeps costs low. For a large number of tables, such as the whole RDS instance, the proposed solution with RDS Proxy makes more sense, given the zero dependencies between the two accounts and the fixed monthly cost.
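
To put the cost trade-off between VPC peering and the proxy route in numbers, here is a small sketch that uses only the figures quoted above. It assumes the $25 and $20 amounts are monthly, as the fixed monthly cost suggests, and it ignores any other charges, so treat the result as a rough orientation rather than a price quote.

```python
# Amounts are kept in integer cents to avoid floating-point rounding.
PEERING_CENTS_PER_GB = 2            # $0.02 per GB transferred via VPC peering
PROXY_FIXED_CENTS = 2500 + 2000     # $25 RDS Proxy + $20 Network Load Balancer


def peering_cost_cents(gb_per_month):
    """Data transfer cost of VPC peering for a monthly volume in GB."""
    return gb_per_month * PEERING_CENTS_PER_GB


def break_even_gb():
    """Monthly volume at which peering costs as much as the proxy route."""
    return PROXY_FIXED_CENTS / PEERING_CENTS_PER_GB


def cheaper_option(gb_per_month):
    """Name the cheaper of the two options for a monthly volume in GB."""
    peering = peering_cost_cents(gb_per_month)
    if peering < PROXY_FIXED_CENTS:
        return "vpc_peering"
    if peering > PROXY_FIXED_CENTS:
        return "rds_proxy"
    return "equal"


print(break_even_gb())        # 2250.0
print(cheaper_option(100))    # vpc_peering
print(cheaper_option(5000))   # rds_proxy
```

Under these assumptions the fixed cost of the proxy route pays off above roughly 2,250 GB of transferred data per month; below that, the choice rests more on the dependency and isolation arguments than on price.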

P.S.: With this JDBC connection, we can avoid the variable costs of the AWS Glue Crawler by issuing the JDBC calls directly inside the Glue ETL job.
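
A sketch of what reading over JDBC directly in a Glue ETL job could look like: the function below builds the connection options for a MySQL source. The host, table, and credentials are hypothetical, and the Glue call itself appears only as a comment because it runs inside the Glue job environment.

```python
def mysql_read_options(host, port, dbname, table, user, password):
    """Build connection_options for reading one table over JDBC in Glue."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{dbname}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


# Hypothetical values for illustration only.
options = mysql_read_options(
    host="endpoint-c.example.internal",
    port=3306,
    dbname="rds_db_name",
    table="table_name",
    user="glue_reader",
    password="change-me",
)

# Inside a Glue job, with a GlueContext named glueContext, the read would be:
# frame = glueContext.create_dynamic_frame.from_options(
#     connection_type="mysql", connection_options=options)
print(options["url"])
```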

Authors:

Tractive is the world market leader in GPS tracking and activity and health monitoring solutions for cats and dogs.
