Migrating PostgreSQL to Aurora with Near-Zero Downtime
The goal here is to migrate an Amazon Web Services (AWS) Relational Database Service (RDS) PostgreSQL Database to Aurora with as little downtime as possible. The following documents the initial strategy that was considered using read replica promotion, followed by the end result using AWS’s Database Migration Service (DMS).
From the Aurora Details Page:
Amazon Aurora is designed to offer greater than 99.99% availability, increasing MySQL and PostgreSQL performance and availability by tightly integrating the database engine with an SSD-backed virtualized storage layer purpose-built for database workloads. Amazon Aurora’s storage is fault-tolerant and self-healing and disk failures are repaired in the background without loss of database availability. Amazon Aurora is designed to automatically detect database crashes and restart without the need for crash recovery or to rebuild the database cache. If the entire instance fails, Amazon Aurora will automatically fail over to one of up to 15 read replicas.
Essentially, better performance, reliability, and the ability to scale.
Our configuration changes become easier. Aurora acts as a cluster, so we have two endpoints to deal with, and the replicas underneath will adjust dynamically. We use Makara to spread our database queries between our master and slaves. With RDS PostgreSQL we would have to enumerate all our slave databases so we could take advantage of the replicas. If we added a new replica, we would have to adjust our configuration to take advantage of it. With Aurora, we have two endpoints to deal with now (primary/writer and a read-only URL). These endpoints will dynamically rotate between the available instances that fall under that cluster URL. This is great as it handles a failover to a different primary database, or changing the number of replicas. Effectively, we end up with less configuration, but we still need Makara to direct our write and read queries to the two cluster URLs.
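As a rough sketch, here is what the Makara side of our database configuration could look like once it points at the two Aurora endpoints. The adapter name follows Makara's convention for PostgreSQL; the hostnames and database name below are placeholders, and this is simply the Ruby hash Rails would load from database.yml.

# Hypothetical production entry from config/database.yml, expressed as a Ruby hash.
production = {
  "adapter"  => "postgresql_makara",
  "database" => "myapp_production",
  "makara"   => {
    "connections" => [
      # Writes go to the cluster (writer) endpoint...
      { "role" => "master", "host" => "myapp.cluster-abc123.us-east-1.rds.amazonaws.com" },
      # ...reads go to the reader endpoint, which rotates across the replicas.
      { "role" => "slave",  "host" => "myapp.cluster-ro-abc123.us-east-1.rds.amazonaws.com" }
    ]
  }
}

Because the reader endpoint resolves to whichever replicas currently exist, adding or removing replicas no longer requires a configuration change.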
As mentioned earlier, there are two approaches to doing this migration. First, we’ll cover the recommended approach that AWS suggested as we’re on the RDS platform. As we’ll cover, it is easy and quick but it does incur downtime during the migration. The second approach uses DMS and was brought up to us after opening a support ticket with AWS on how to do our migration with zero downtime.
This is the recommended approach for migrating from PostgreSQL to Aurora. It is easy and carries little risk during the migration, although it does incur unavoidable downtime.
It is recommended to do a dry run of the promotion (and just throw it away afterwards) to gauge how long that process takes.
The issue here is that you have:
One good thing is that even with the downtime, the old read replicas will continue to serve up GET requests. In addition, any CDN/Varnish caching will also help serve stale data during the downtime.
This approach is much more involved and finicky, but it does offer a near zero-downtime migration strategy. The general process here is:
This is a multi-part process, so the following sections will walk through each part.
The first part is creating an Aurora read replica off the existing RDS PostgreSQL master. We used the following naming convention for the new Aurora resources:
- <service>-<environment>-aurora
- <service>-<environment>-aurora-cluster
- <service>-<environment>-aurora-<##>
At this point, we will have an Aurora read replica that is receiving replicated data changes off the master.
To isolate our newly created Aurora read replica so it can be its own master, we’re going to promote the replica. This process allows the instance to receive writes, and be the master for other replicas. We need to do this as otherwise we are limited to only one Aurora read replica off of a PostgreSQL master database. By starting the promotion process the on-going replication is broken between the PostgreSQL and Aurora databases.
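For reference, a minimal sketch of the promotion using the aws-sdk-rds gem (the cluster identifier and region are placeholders); the same thing can be done from the console or CLI.

require 'aws-sdk-rds'

rds = Aws::RDS::Client.new(region: 'us-east-1')

# Break replication from the PostgreSQL master and make the Aurora
# read replica cluster a standalone, writable cluster.
rds.promote_read_replica_db_cluster(
  db_cluster_identifier: 'myservice-production-aurora-cluster' # placeholder
)

# The promotion takes a few minutes; poll until the cluster no longer
# reports a replication source.
loop do
  cluster = rds.describe_db_clusters(
    db_cluster_identifier: 'myservice-production-aurora-cluster'
  ).db_clusters.first
  break if cluster.replication_source_identifier.nil?
  sleep 30
end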
To enable PostgreSQL to replicate to Aurora, we need to make sure certain database parameters are set.
The first thing you need to do is change the rds.logical_replication parameter to 1. As described in the documentation, this parameter actually configures additional related parameters that allow the replication to proceed:
As part of applying this parameter, AWS DMS sets the wal_level, max_wal_senders, max_replication_slots, and max_connections parameters.
Another parameter configuration that Amazon recommends is to set wal_sender_timeout to 0.
The wal_sender_timeout parameter terminates replication connections that are inactive longer than the specified number of milliseconds. Although the default is 60 seconds, we recommend that you set this parameter to zero, which disables the timeout mechanism.
Without setting the timeout parameter, the streaming of Write-Ahead Logging (WAL) files can be terminated if the sender is under load and does not send a WAL file within the timeout period. To prevent any risk of this it is best to disable it during the on-going replication phase of the migration process.
There are additional wal settings available, but we shouldn't need to configure them.
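As a sketch, both parameter changes can be applied to the source instance's custom parameter group with the aws-sdk-rds gem (the parameter group name and region are placeholders). Note that rds.logical_replication is a static parameter, so it only takes effect after a reboot.

require 'aws-sdk-rds'

rds = Aws::RDS::Client.new(region: 'us-east-1')

rds.modify_db_parameter_group(
  db_parameter_group_name: 'myservice-production-postgres', # placeholder
  parameters: [
    {
      parameter_name: 'rds.logical_replication',
      parameter_value: '1',
      apply_method: 'pending-reboot' # static parameter, requires a reboot
    },
    {
      parameter_name: 'wal_sender_timeout',
      parameter_value: '0',
      apply_method: 'immediate' # dynamic parameter
    }
  ]
)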
On the Aurora instance, we want to keep any foreign key constraints from triggering while the full loads of the tables are occurring. Fortunately, if we change the session_replication_role parameter to replica, only replica-related constraint triggers stay active. This effectively disables foreign key constraints for us.
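A similar sketch for the target side, assuming session_replication_role lives in a custom cluster-level parameter group (if it is in the instance-level group, modify_db_parameter_group works the same way; the names are placeholders).

require 'aws-sdk-rds'

rds = Aws::RDS::Client.new(region: 'us-east-1')

# Keep only replica-related triggers active on the target while DMS is
# loading tables, effectively disabling foreign key constraints.
rds.modify_db_cluster_parameter_group(
  db_cluster_parameter_group_name: 'myservice-production-aurora-params', # placeholder
  parameters: [
    {
      parameter_name: 'session_replication_role',
      parameter_value: 'replica',
      apply_method: 'immediate' # check the parameter's apply type in your group
    }
  ]
)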
DMS uses a dedicated replication instance to help facilitate the migration process. In a PostgreSQL to Aurora migration, this machine is responsible for connecting to both the source and target databases and transforming and transferring the data.
The following steps expand on what is needed to set up a replication instance for the migration.
For the instance class, we used dms.t2.large to start, while for production we used dms.c4.large. The remaining settings were left at their default values.
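A minimal sketch of creating the replication instance with the aws-sdk-databasemigrationservice gem (the identifier, subnet group, and storage size below are placeholders):

require 'aws-sdk-databasemigrationservice'

dms = Aws::DatabaseMigrationService::Client.new(region: 'us-east-1')

dms.create_replication_instance(
  replication_instance_identifier: 'myservice-production-dms', # placeholder
  replication_instance_class: 'dms.c4.large',                  # dms.t2.large for smaller environments
  allocated_storage: 100,                                      # GB; holds cached changes during the full load
  replication_subnet_group_identifier: 'default',              # placeholder
  multi_az: false,
  publicly_accessible: false
)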
We need to define the endpoints of our databases in DMS: one to represent the source database (PostgreSQL) and another to represent the target database (Aurora). The following steps are to be applied to each endpoint.
One gotcha: avoid the characters : ; + % in the endpoint's password, as they can cause connection problems.
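A sketch of defining both endpoints with the aws-sdk-databasemigrationservice gem (hostnames, credentials, and ARNs are placeholders); test_connection is a handy way to confirm the replication instance can actually reach each database.

require 'aws-sdk-databasemigrationservice'

dms = Aws::DatabaseMigrationService::Client.new(region: 'us-east-1')

source = dms.create_endpoint(
  endpoint_identifier: 'myservice-production-postgres-source', # placeholder
  endpoint_type: 'source',
  engine_name: 'postgres',
  server_name: 'myservice-production.abc123.us-east-1.rds.amazonaws.com', # placeholder
  port: 5432,
  database_name: 'myservice_production',
  username: 'dms_user',
  password: ENV.fetch('SOURCE_DB_PASSWORD')
).endpoint

target = dms.create_endpoint(
  endpoint_identifier: 'myservice-production-aurora-target', # placeholder
  endpoint_type: 'target',
  engine_name: 'aurora-postgresql',
  server_name: 'myservice-production-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com', # placeholder
  port: 5432,
  database_name: 'myservice_production',
  username: 'dms_user',
  password: ENV.fetch('TARGET_DB_PASSWORD')
).endpoint

# Verify connectivity from the replication instance to each endpoint.
[source, target].each do |endpoint|
  dms.test_connection(
    replication_instance_arn: 'arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE', # placeholder
    endpoint_arn: endpoint.endpoint_arn
  )
end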
The DMS task is where we actually configure and use the endpoints and replication instance to accomplish the migration. There are a couple of options and considerations to be made while creating a task. The following settings are the ones that matter most (a sketch of the equivalent API call follows the list):
- Migration type: Migrate existing data and replicate ongoing changes.
- Target table preparation mode: Do nothing. Truncate could work, but is likely to fail due to constraints in the database. With Do nothing, the target database needs to be truncated manually prior to running the task.
- Stop task after full load completes: Don't stop.
- Table mappings: where schema name is like 'public' and table name is like '%', include.
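A sketch of creating the task with these choices via the same gem (ARNs and identifiers are placeholders). The table mapping JSON is the API equivalent of the selection rule above, and the partial task settings mirror the Do nothing preparation mode; unspecified settings fall back to DMS defaults.

require 'aws-sdk-databasemigrationservice'
require 'json'

dms = Aws::DatabaseMigrationService::Client.new(region: 'us-east-1')

table_mappings = {
  'rules' => [
    {
      'rule-type'      => 'selection',
      'rule-id'        => '1',
      'rule-name'      => 'include-public-schema',
      'object-locator' => { 'schema-name' => 'public', 'table-name' => '%' },
      'rule-action'    => 'include'
    }
  ]
}

dms.create_replication_task(
  replication_task_identifier: 'myservice-production-to-aurora', # placeholder
  source_endpoint_arn: 'arn:aws:dms:...:endpoint:SOURCE',        # placeholder
  target_endpoint_arn: 'arn:aws:dms:...:endpoint:TARGET',        # placeholder
  replication_instance_arn: 'arn:aws:dms:...:rep:INSTANCE',      # placeholder
  migration_type: 'full-load-and-cdc', # migrate existing data and replicate ongoing changes
  table_mappings: table_mappings.to_json,
  replication_task_settings: {
    'FullLoadSettings' => { 'TargetTablePrepMode' => 'DO_NOTHING' }
  }.to_json
)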
Include LOB columns in replication is an interesting setting. LOBs are Large Objects that exist in the database, and because DMS is possibly migrating to a different database type, a transformation of data types occurs during the migration. To understand what we're dealing with, you need to look up the source database's supported types in the documentation. For example, using the DMS documentation you can see which PostgreSQL data types end up being treated as LOBs.
The following section digs a little deeper into LOBs and the considerations that need to be made.
There are two options for handling LOBs during the migration: Full or Limited LOB Mode. LOBs are potentially massive objects that reside in the database, and they normally don’t have a fixed size in the column. The following documentation describes the two options:
DMS provides you with the Full LOB Mode:
In full LOB mode, AWS DMS migrates all LOBs from source to target regardless of size. In this configuration, AWS DMS has no information about the maximum size of LOBs to expect. Thus, LOBs are migrated one at a time, piece by piece. Full LOB mode can be quite slow.
DMS also provides the Limited LOB Mode:
In limited LOB mode, you set a maximum size LOB that AWS DMS should accept. Doing so allows AWS DMS to pre-allocate memory and load the LOB data in bulk. LOBs that exceed the maximum LOB size are truncated and a warning is issued to the log file. In limited LOB mode, you get significant performance gains over full LOB mode. We recommend that you use limited LOB mode whenever possible.
Initially, it makes sense to just use Full LOB Mode as it will preserve data by migrating LOBs regardless of size. The big issue with this choice is speed: in our migration, we achieved a full migration in 2 hours using Limited LOB Mode, while we estimated about 90 hours using Full LOB Mode. The issue with using Limited LOB Mode is that you can essentially lose data if a LOB's size is greater than the Max LOB Size parameter, as it will be truncated.
To work around this concern, the plan is to determine the max LOB size in the database and set the Max LOB Size to something comfortably bigger than that value (e.g., multiply it by 2). By using a larger Max LOB Size than the largest LOB we have in the database, we are ensured to have the full data migrated without any data loss.
A manual way to check the size of a LOB column is to use the following query: SELECT max(pg_column_size(column_name)) FROM table_name;. This will return the max number of bytes used in that column, which is the value you'll want Max LOB Size to be larger than. The following is a Rails rake task that walks through every table's columns and identifies the max sizes for any LOB column (for PostgreSQL):
namespace :scripts do
  desc 'Print out the max size of LOBs in the database. Usage: bin/rake scripts:max_lob_size'
  task max_lob_size: :environment do
    output = {}

    # PostgreSQL data types that DMS treats as LOBs during the migration.
    LOB_TYPES = %w(
      hstore
      composite
      array
      jsonb
      json
      polygon
      path
      xml
      tsquery
      tsvector
      bytea
      text
    ).freeze

    ActiveRecord::Base.connection.tables.each do |table_name|
      next if table_name == 'schema_migrations' || table_name.starts_with?('awsdms_')

      columns_sql = <<~HEREDOC
        SELECT *
        FROM information_schema.columns
        WHERE table_schema = 'public'
        AND table_name = '#{table_name}'
      HEREDOC
      columns_results = ActiveRecord::Base.connection.execute(columns_sql)

      columns_results.each do |row|
        column_name = row['column_name']
        # information_schema reports some types in uppercase (e.g. ARRAY),
        # so compare case-insensitively.
        column_type = row['data_type'].downcase

        if LOB_TYPES.include?(column_type)
          size_sql = <<~HEREDOC
            SELECT max(pg_column_size(#{column_name})) FROM #{table_name};
          HEREDOC
          size_results = ActiveRecord::Base.connection.execute(size_sql)

          # Accumulate per-table results without clobbering earlier columns.
          table_hash = output[table_name.to_sym] ||= {}
          column_hash = table_hash[column_name.to_sym] = {}
          column_hash[:column_type] = column_type
          column_hash[:max_bytes_size] = size_results[0]['max'].to_i
        end
        print '.'
      end
    end

    puts "\nRaw Dump"
    pp output

    # Look across every LOB column in every table, not just the first one.
    max_bytes = output.values.flat_map(&:values).map { |col| col[:max_bytes_size] }.compact.max
    puts "Max bytes found in a LOB column is #{max_bytes}"
  end
end
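One small wrinkle: pg_column_size reports bytes, while the task's Max LOB Size setting is expressed in kilobytes. A tiny sketch (the byte count below is just an example) of turning the rake task's output into the setting, doubling it for headroom as described above:

max_bytes = 1_572_864                       # example output from the rake task above
max_lob_size_kb = ((max_bytes * 2) / 1024.0).ceil

puts "Set 'Max LOB size (KB)' to at least #{max_lob_size_kb}"
# In the task settings JSON this corresponds to something like:
#   "TargetMetadata": { "LimitedSizeLobMode": true, "LobMaxSize": <max_lob_size_kb> }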
There are some concerns with NOT NULL database constraints on columns that become LOBs during the migration. DMS has a two-step process for dealing with tables with LOBs:
- The row is first loaded with a null placeholder for each LOB column.
- The LOB data is then migrated in a second pass, updating the null placeholders.
This two-step process of dealing with LOB data types is where we have to be concerned with NOT NULL constraints. The migration will fail if any LOB columns have a NOT NULL constraint. In PostgreSQL, a common case of LOBs is jsonb and array columns. So we have to remove these constraints on the target database just until the full table load phase is done. For example, the following statement would do the trick: ALTER TABLE my_table ALTER COLUMN lob_column DROP NOT NULL;. Do not forget to add the NOT NULL constraints back after the full table load phase is done (see the helper sketch below).
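To make that less error-prone, here is a sketch of a companion rake task (hypothetical, in the same spirit as the one above) that prints the DROP statements to run on the target before the full load and the matching SET statements to run once every table has completed loading:

namespace :scripts do
  desc 'Print ALTER statements for NOT NULL constraints on LOB columns'
  task lob_not_null_constraints: :environment do
    lob_types = %w(hstore composite array jsonb json polygon path xml tsquery tsvector bytea text)

    sql = <<~HEREDOC
      SELECT table_name, column_name, data_type
      FROM information_schema.columns
      WHERE table_schema = 'public'
        AND is_nullable = 'NO'
    HEREDOC

    ActiveRecord::Base.connection.execute(sql).each do |row|
      next unless lob_types.include?(row['data_type'].downcase)

      # Run the DROP statements before the full load, and the SET statements
      # once every table reaches the "Table completed" state.
      puts "ALTER TABLE #{row['table_name']} ALTER COLUMN #{row['column_name']} DROP NOT NULL;"
      puts "ALTER TABLE #{row['table_name']} ALTER COLUMN #{row['column_name']} SET NOT NULL;"
    end
  end
end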
At this point, the migration process is pretty much ready! Let’s break the process up into before/during/after migration.
Make sure that the:
One thing that is suggested is to run Assess on the task so you can get a report of potential issues. In our case, there were a couple of "Partially supported datatypes: float8" warnings on a few columns. This ended up changing the rounding of floats (e.g., 1.4999999998 became 1.5). It is worth noting that these differences occur after the migration process is completed and changes are being made on the new database type.
Ideally, the migration is done during a low-activity period. In addition, if possible, stop any background jobs just before the migration and wait for the current jobs to finish; the jobs can resume processing after the migration. This is to reduce risk.
During the migration, keep an eye on the task's CloudWatch metrics:
- CDCIncomingChanges should be as close to 0 as possible. It will climb as the on-going replication changes are stored until the tables have fully loaded into the target database.
- FreeStorageSpace and FreeableMemory should not be dropping too low.
The first phase is full table load, where all the source data is dumped into the target. Be aware that large tables can take some time. After a table has been fully loaded into the target, the on-going replication phase starts for that table.
The on-going replication phase is where inserts/deletes/updates are replicated from the source to the target. After all the tables are in this phase (all tables in a Table Completed state), it is now safe to re-enable the LOB nullability constraints that were disabled earlier.
If you have validations enabled for the Task then the validation columns in the Table Statistics will update during the migration. These validations put additional load on the source and target database as the row data is compared after that row has been migrated. It is an on-going validation process.
Personally, I found the validation to be very flaky. It is really slow, and it also reports validation errors that are not actual errors. In our case, we didn't pay much attention to the validation failures as spot checking proved that the data was fine. There were minor cases where time columns were slightly off. I am unsure how the validation actually works (i.e., when it does the checks), as the on-going replication could be lagging behind. According to the documentation, there are some limitations that might prevent it from being that useful.
In either case, under the target database there is a new table called awsdms_validation_failures_v1 that contains information on the failures. The KEY and TABLE_NAME columns can be used to identify the record in question. It is then possible to check the source and target record and see if there are any issues. One other problem we had with spot-checking validation is that our PostgreSQL database used UUIDs for primary keys, which resulted in the KEY column containing truncated data for the UUID.
Hopefully, the migration went successfully, and both the source and target database are in sync. At this point, the zero-downtime migration can occur: simply point the application at the new database. It is advisable to wait for any replication lag or queued up CDCIncomingChanges to drain before proceeding. You might have to wait a little bit for the connections of your application to cycle over to the new database, but you can monitor this in CloudWatch, or force the cycle (i.e., restart Rails Unicorn servers). Don't forget to resume any background queues. After sufficient time, you can decommission the old database resources.
In the event that the migration doesn’t go as planned, it is possible to revert back to the old database. The only issue is that any write data that occurred during the migration process (i.e., only on the Aurora database) wouldn’t be present on the old database (i.e., PostgreSQL database). This is simply an issue that cannot be worked around, and hopefully, the low activity period reduces the amount of data loss.