Since Signeasy’s inception in 2010, the database that powers our backend systems has been MySQL on Amazon Relational Database Service (RDS), which stores user and signature records.
Each table in the database has relevant datetime columns (created time and modified time). Maintaining date and time information for events like document uploads, template uploads, signatures, and signature requests is essential to determine a document's status and the time taken to turn it around with completed signatures.
The problem
Historically, these datetime columns were of type DATETIME, and their values were stored in the US/Pacific time zone. In other words, the RDS database instance had its timezone set to US/Pacific. As we grew over the years and our user base expanded across the world, this quickly became a problem: time conversions specific to each user's region had to be handled in code. This situation was less than ideal.
To show the datetime for a record in the user's local timezone, we had to convert the datetime value from US/Pacific to UTC using MySQL queries and return the UTC value to the web and mobile apps. Any new code written in the respective services had to carry this US/Pacific-to-UTC conversion in its queries, because clients expect datetime values in UTC and render them in the user's timezone.
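For illustration, this is roughly what every read path had to do (the table and column names here are hypothetical, and CONVERT_TZ with named zones assumes the MySQL time zone tables are loaded):

-- Hypothetical query: convert the stored US/Pacific DATETIME to UTC on the way out.
SELECT id,
       CONVERT_TZ(created_time, 'US/Pacific', 'UTC') AS created_time_utc
FROM documents
WHERE user_id = 42;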
In addition, in Signeasy's microservices architecture, where each service owns its own schema, some services already stored datetime values in UTC even though the MySQL database instance timezone was set to US/Pacific. This prevented us from using MySQL functions like UNIX_TIMESTAMP(), because the server interprets a datetime argument as a value in the session/global time zone and converts it to an internal Unix timestamp value in UTC.
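A small, hypothetical illustration of that mismatch:

-- With the session time zone set to US/Pacific, UNIX_TIMESTAMP() treats its
-- argument as Pacific local time. A DATETIME that already holds a UTC value
-- therefore yields an epoch shifted by the UTC offset (7 or 8 hours).
SET time_zone = 'US/Pacific';
SELECT UNIX_TIMESTAMP('2022-05-11 23:00:00');  -- interpreted as 23:00 Pacific, not 23:00 UTC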
We realized that datetime values were not consistent across databases. This inconsistency in the records hurt developer velocity immensely. The technical debt was steadily growing and becoming difficult to live with and to scale.
As an engineer, such a challenge looks pretty straightforward when you take it up, but as you actually solve the problem, it surprises you with issues and blockers. This project was no exception.
Mitigation plan
The columns storing these datetime values were of type DATETIME, and we had to decide on the best approach to convert the values to UTC. Because MySQL's DATETIME type does not carry time zone information, there was no straightforward way to migrate the values in place. We addressed this by migrating the tables and changing each DATETIME column to a TIMESTAMP column.
Why a TIMESTAMP column:
- MySQL converts the TIMESTAMP value from the current timezone to UTC for storage and then converts it back from UTC to the current timezone on retrieval.
- A TIMESTAMP column does not compel us to make immediate code changes to remove timezone conversion from queries.
We realized that if we could migrate the tables to the TIMESTAMP column type, all we needed to do was change the database timezone to UTC and release application code with no explicit timezone conversion in queries.
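A minimal sketch of that behaviour, using a throwaway table and illustrative values:

-- TIMESTAMP values are stored internally in UTC and converted to/from the
-- session time zone, so flipping the server time zone changes only how the
-- same stored instant is presented.
CREATE TABLE tz_demo (ts TIMESTAMP NOT NULL);
SET time_zone = 'US/Pacific';
INSERT INTO tz_demo VALUES ('2022-05-11 16:00:00');  -- 16:00 PDT, stored as 23:00 UTC
SET time_zone = 'UTC';
SELECT ts FROM tz_demo;                              -- returns 2022-05-11 23:00:00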
Though it looked simple initially, we needed this migration to happen not just for one table but for multiple tables and schemas. To add to the complexity, most tables had millions of rows, which could increase query execution time.
When you use MySQL's ALTER statement to change a column's data type, it uses the COPY algorithm instead of INPLACE (reason: you cannot change a column's data type in place). The COPY algorithm alters the schema by creating a temporary table with the new schema, copying the data into it, swapping the tables, and dropping the old table. This holds write locks on the table for the duration of the query execution, which can run into hours. ALGORITHM=COPY is an expensive operation as it blocks concurrent DML, though it allows concurrent reads with LOCK=SHARED. With LOCK=EXCLUSIVE, both reads and writes are blocked until the ALTER completes.
In simpler words, migrating with plain ALTER queries meant significant downtime, which we could not afford. We needed an approach that would keep downtime minimal across the overall migration.
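To make this concrete, here is an illustrative attempt on a hypothetical table, showing why the in-place path is unavailable for a type change:

-- Changing a column's data type cannot be done INPLACE; MySQL rejects the request.
ALTER TABLE documents
  MODIFY created_time TIMESTAMP NOT NULL,
  ALGORITHM=INPLACE;
-- The only native option is a full table copy, which blocks writes for the
-- duration of the rebuild (reads are still possible with LOCK=SHARED).
ALTER TABLE documents
  MODIFY created_time TIMESTAMP NOT NULL,
  ALGORITHM=COPY, LOCK=SHARED;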
Note: This database migration was tested and performed on MySQL 5.7. Read more about online DDL operations in the MySQL documentation.
The initial solution: Percona toolkit to the rescue
Then came pt-online-schema-change by Percona. It alters a table’s structure without blocking reads or writes. This is precisely what we were looking for—not impacting our production traffic during migration.
This is what the tool does behind the scenes:
- Create an empty temporary table with the updated schema.
- Create three triggers on the original table, for insert, update, and delete operations, which ensure that data added or changed in the original table during the copy is also applied to the new table (a conceptual sketch follows below).
- Copy data from the original table to the new table in chunks.
- When the copy is complete, it atomically renames the original and new tables to swap them, then drops the original table.
No read and write locks are added to the original table during this operation, and live transactions aren’t impacted.
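As a conceptual sketch (not the tool's exact DDL, and with illustrative table and column names), the insert trigger looks something like this:

-- While rows are being copied, an AFTER INSERT trigger mirrors new rows from
-- the original table into the shadow table; similar triggers handle UPDATE
-- and DELETE.
CREATE TRIGGER pt_osc_documents_ins AFTER INSERT ON documents
FOR EACH ROW
  REPLACE INTO _documents_new (id, user_id, created_time)
  VALUES (NEW.id, NEW.user_id, NEW.created_time);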
Installation
sudo apt-get install percona-toolkit   # install from your distribution's repositories
# ...or download a package directly from Percona and install it manually:
wget percona.com/get/percona-toolkit.tar.gz
wget percona.com/get/percona-toolkit.rpm
wget percona.com/get/percona-toolkit.deb
Sample Usage
pt-online-schema-change --host <db_host> --user <db_user> --password <db_password> --alter "MODIFY created_time timestamp not null" D=<db_name>,t=<table_name> --execute
Sample execution result
Not checking slave lag because no slaves were found and --check-slave-lag was not specified.
Operation, tries, wait:
analyze_table, 10, 1
copy_rows, 10, 0.25
create_triggers, 10, 1
drop_triggers, 10, 1
swap_tables, 10, 1
Altering `<db_name>`.`<table>`...
Creating new table...
Created new table <db_name>._<table>_new OK.
Altering new table...
Altered `<db_name>`.`_<table>_new` OK.
2022-05-11T23:41:27 Creating triggers...
2022-05-11T23:41:27 Created triggers OK.
2022-05-11T23:41:27 Copying approximately 3294741 rows...
Copying `<db_name>`.`<table>`: 77% 00:08 remain
2022-05-11T23:42:08 Copied rows OK.
2022-05-11T23:42:08 Analyzing new table...
2022-05-11T23:42:08 Swapping tables...
2022-05-11T23:42:08 Swapped original and new tables OK.
2022-05-11T23:42:08 Dropping old table...
2022-05-11T23:42:08 Dropped old table `<db_name>`.`_<table>_old` OK.
2022-05-11T23:42:08 Dropping triggers...
2022-05-11T23:42:08 Dropped triggers OK.
Successfully altered `<db_name>`.`<table>`
This looked quite simple, and our initial tests with a few table migrations showed great results. The problems started with tables that had referential integrity constraints.
Before explaining the problem, let me explain how Percona handles altering a table involved in foreign key relationships. If foreign keys reference the table being altered, the tool provides a command-line argument, --alter-foreign-keys-method, to specify how those foreign keys will be updated after the atomic rename operation. As per the Percona docs:
“Foreign keys that reference the table to be altered must be treated specially to ensure that they continue to reference the correct table. When the tool renames the original table to let the new one take its place, the foreign keys “follow” the renamed table and must be changed to reference the new table instead.”
Let us understand this with an example.
Let's say there are two tables, T1 and T2. T1 is the table to be altered, and T2 has a foreign key referencing T1. When the Percona query runs, it creates a temporary table named _T1_new. To update the foreign key references after the original and new tables are swapped, Percona provides two approaches, rebuild_constraints and drop_swap, selected via the --alter-foreign-keys-method argument.
How the rebuild_constraints approach works:
- After the copy operation is done, it renames T1 to _T1_old and _T1_new to T1
- It then rebuilds the foreign key references by altering table T2, dropping fk1 and adding _fk1 pointing to the new table (roughly as sketched after this list)
- If table T2 is very large, with millions of rows, this ALTER on T2 takes time and is a blocking operation: all writes to T2 are blocked for its duration. This is where the problem occurs, since we could not afford prolonged downtime in the form of blocked writes. Imagine a scenario with multiple such cross-table references where the child tables hold vast amounts of data, and the write blocking that would cause!
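Roughly, the rebuild step amounts to something like the following, where the column and constraint definitions are illustrative:

-- Blocking ALTER on the child table: drop the constraint that still points at
-- the old, renamed table and add one pointing at the new table.
ALTER TABLE T2
  DROP FOREIGN KEY fk1,
  ADD CONSTRAINT _fk1 FOREIGN KEY (t1_id) REFERENCES T1 (id);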
Percona has another method of updating foreign keys, drop_swap, which does not block. This is how it works (a rough SQL sketch follows the list):
- After the copy operation is done, it disables the foreign key checks (FOREIGN_KEY_CHECKS=0)
- Drop table T1
- Rename _T1_new to T1
- Enable foreign key checks
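In SQL terms, the sequence is roughly the following (illustrative, not the tool's exact statements):

SET FOREIGN_KEY_CHECKS = 0;   -- child tables keep their references untouched
DROP TABLE T1;                -- the original table is gone from this point on
RENAME TABLE _T1_new TO T1;   -- the copy takes the original's place
SET FOREIGN_KEY_CHECKS = 1;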
Overall, the drop_swap approach is faster and does not block, but it is riskier for two reasons. First, for a short time between dropping the original table T1 and renaming the temporary table _T1_new, the table being altered simply does not exist, and queries against it will fail. Second, if an error occurs and the new table _T1_new cannot be renamed into the place of the old one, it is too late to abort, because the old table is gone permanently.
Enhanced solution: The best of both worlds
Since we had to alter a large set of tables, most of which had no cross-table references while a few critical ones did, we decided to use both the rebuild_constraints and drop_swap options, whichever fit each table. Understanding the risk associated with drop_swap, we scheduled a small downtime window for the tables with cross-table references and ran the Percona query with drop_swap during it.
We also had to prepare a failover strategy defining what steps to take if the rename operation failed after the original table had been dropped. We devised a failover plan built on a database restore mechanism: an RDS snapshot could be restored to the state just before the downtime window.
This was purely a precautionary measure for a worst-case scenario we had to be prepared for. We knew that situation was unlikely to arise, and even if the rename operation failed, we still had the option of trying to rename the table manually first.
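In the simplest failure case, that fallback would have been a single statement (illustrative names), attempted before resorting to a snapshot restore:

-- Retry the swap by hand if the tool's rename step failed.
RENAME TABLE _T1_new TO T1;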
Our confidence increased further when rigorous dry runs on the development and pre-production environments all succeeded without problems. As part of these dry runs, we prepared multiple Percona queries using both the rebuild_constraints and drop_swap approaches.
We found the Percona toolkit very helpful for a migration of this scale. Imagine if we had used default MySQL ALTER queries to modify each table: the impact on live traffic and the downtime we would have had to take!
Though Percona is slow because it copies rows in chunks from the original table to the new one, it does not block reads and writes while copying. We accepted short, planned downtime windows to contain the risk of the drop_swap queries, a cost that was known and manageable.
Once all the tables were migrated, we changed the timezone in RDS to UTC, as simple as that. Changing the timezone to UTC does not affect any datetime values in the tables: since those columns are now TIMESTAMP, the stored values remain in UTC.
As I mentioned earlier, the TIMESTAMP type stores datetime values in UTC and converts them to the session time zone on retrieval; with the instance timezone now set to UTC, retrieved values come back in UTC. We then released our updated code in production with no timezone conversion in queries.
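On RDS this switch is made through the time_zone parameter in the instance's parameter group; a quick sanity check after the change looks like this:

-- Confirm the global and session time zones after the parameter change;
-- both should now report UTC.
SELECT @@global.time_zone, @@session.time_zone;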
One step closer to less technical debt and an enhanced experience for our users. If you would like to connect and learn more about the engineering processes at Signeasy, feel free to write to us at [email protected] with your questions, thoughts, or suggestions!