Data Migration from S3 to Redshift

In my previous job, our data are storing at S3, it give us a option to storing big amount of data in a safe and efficient way under a reasonable price, and make the data easily available to other AWS services, but we need a place where our data analyst can directly write query and do the analysis, so we decided to migrate data to Redshift, which consists of a Data Warehouse service that enables business intelligence operations.

In this story, I will focus on the details of the data migration, and show you how we make it done step by step.

Requirements:

If you are trying to build your own solution to migrate data from S3 to Redshift, make sure you meet the requirements:

  • You have an active AWS account
  • You have access to an AWS user, whose you know credentials (Access Key ID and Secret Access Key)
  • You have Python and Boto3 installed
  • You have an S3 bucket containing original data, they do not need to be CSV files, although using columnar format will allow to replicate easier the content of this story

Action:

The data to be staged on Redshift are originally stored in CSV format. The goal of a staging operation is to copy these data into the data warehouse keeping their structure. The issue will be addressed through the following steps:

  • Create an Identity Access Management (IAM) role to allow Redshift to call other services on behalf of the user
  • Create a policy granting read-only access to the S3 bucket where the data are stored
  • Attach the policy to the IAM role (at this stage, Redshift will be able to read data in the target bucket)
  • Create a Redshift cluster
  • Create a Redshift table whose structure is the same as the original CSV file structure
  • Copy data from the CSV files stored in the S3 bucket to the Redshift table

Let’s start with the first step:

Create an Identity Access Management (IAM) role to allow Redshift to call other services on behalf of the user

When using code that interacts with AWS services, one important thing is storing credentials and other relevant AWS related parameters in a separate file, that will be imported and read by the main code. This allows not exposing sensitive information (like your AWS credentials) to the public. You don’t want to receive a “surprise” bill right ;)

In this case, this information has been stored in the ‘dl.cfg’ file. This file contains four sections:

  • AWS credentials (Access Key ID and Secret Access Key)
  • Name for the IAM role and the IAM policy to be created
  • Redshift cluster settings
  • S3 path where the source file to be copied to Redshift is located.

AWS provides SDKs for the main programming languages that enable the users to interact with AWS services without using the AWS Management Console. The SDK for Python is called Boto3 and can be used by importing the corresponding library. In order to use each AWS service, it is necessary to set up a client for the corresponding service.

In this case, we will set up an IAM role, which defines a set of permissions for making AWS service requests, it is not associated with a specific user, although trusted users can assume the role. The create_role method allows the user to create an IAM role by defining its name, description and policy document (a JSON defining the permissions of the role). If a role with the same name as the one provided already exists, the code below will delete it and create a new one. In order to delete an existing role, AWS requires all the policies attached to it to be detached before the deletion.

Once we define the IAM role, the next step is to extract the ARN for the role we have just created, in order to be able to use the role later.

The role created is just an entity that can be assumed by users, but it does not grant specific permissions. What determines which specific services are accessible is defined by its attached IAM policies. IAM policies are written in JSON and consist of a list of statements, each statement defines one or more actions, an effect (Allow or Deny), and a resource which the statement is applied to.

Create a policy granting read-only access to the S3 bucket where the data are stored

In our use case, Redshift needs to be able to access S3, therefore our IAM policy will grant Redshift permission to get objects and list buckets in S3.

Attach the policy to the IAM role (at this stage, Redshift will be able to read data in the target bucket)

Once the policy has been created, we attached to the IAM role. Similar to the IAM roles, also IAM policies are identified in AWS by ARNs. Therefore, in the next code snippet, the ARN for the above-created policy is extracted and the policy is attached to the role.

Create a Redshift cluster

As already done for the IAM service, interaction with Redshift needs a specific client to be created. Again, the client can be defined using AWS credentials and specifying the wanted AWS region.

AWS allows launching a Redshift cluster via API. When creating a cluster, the following parameters need to be specified:

  • ClusterIdentifier
  • NodeType
  • MasterUsername
  • MasterUserPassword
  • Port

The parameters mentioned above are stored in the same ‘config.json’ file already used to extract AWS credentials.

One thing that may need our attention is, checking whether if any existing Redshift cluster with the same identifier as the one we want to use already exists. Cause AWS required cluster have distinct identifier.

Create a Redshift table whose structure is the same as the original CSV file structure

And here is the time to create the table and ensure when we trying to load the data, we already have a blank ‘cozy’ table waiting for there. In our case, we used Python package psycopg2 to execute Postgres SQL queries on the Redshift tables. After connecting to the database, we can use following script to access the source file in S3 and create a same structure table at Redshift side.

# Define S3 client
s3 = boto3.client( “s3”,
aws_access_key_id = access_key_id,
aws_secret_access_key = secret_access_key)
# Get object containing file to be staged
data_test= s3.get_object(Bucket = “test_data_to_migrate”,
Key = “test_dataset.csv”)
import io
import pandas as pd
# Get the columns info for the dataset
pd.read_csv(io.BytesIO(data_test[“Body”].read())).info()

At this point, the table are created in the Redshift. In general, the Redshift could already contain a table with the same name we are going to use. For this reason, unless the existing table contains relevant data, a good practice consists of deleting the existing table and after that creating a new one.

Copy data from the CSV files stored in the S3 bucket to the Redshift table

Once the table is ready, the final step consists of loading the data from S3 into the Redshift table we just created. This operation can be executed relatively easily in Redshift using the COPY command. The command allows specifying the source of the data to be copied (in this case a csv file hosted on S3) and the table destination.

There are some details checkbox you can select when you doing the loading, for example, whether should you load the first row or not, understand your data and make the wise choice ;)