Random thoughts: Starting replication for a very large MySQL database on Amazon EC2

In this post, I will outline steps used to start replication of a very large MySQL database from scratch.

Here is the problem that we faced:

I had set up Master-Slave replication for a MySQL database in the beginning (using the simple steps outlined here)
On both the Master and the Slave server, I had /var/lib/mysql point to an, IOPS optimized, EBS disk
Since this data was very critical, I had set up a cron job to backup the database on the slave server
I was monitoring the Master MySQL server with NewRelic and set up alerts when CPU, Memory or Disk consumption reached certain thresholds
There were three mistakes that I did (for those of you setting this up from scratch):

Not setting up NewRelic on the Slave Server
Performing the backup on the same EBS disk as the DB partition
Not "cleaning up" or archiving older backups

As a result of these mistakes, the Slave EBS disk became full and replication stopped
I was blissfully unaware of this until a few days passed
I cleaned up the Slave EBS disk, but I did not know where or when the replication stopped. So, I could not just resume the replication

At this time, here were the constraints that we were dealing with:

There were already over 10m records in some tables and 60m records in some other tables.
This was a live, production, system - so shutting it down (with a READ LOCK) was not an option
Performing the replication and doing the database back up remained to be critical
This task had to be completed in a few hours - because without it we ran the risk of being exposed

These constraints ruled out:

Following the replication steps in the beginning of this post - since we could not do a READ LOCK
Backing up the Database (on the Master) and restoring on the Slave - this would take too much time

Given these requirements and constraints, all the documented methods for resuming the Master Slave replication were ruled out. Here are steps taken to accomplish this task:

On the Master server:

Edit my.cnf

 server-id=1  
 binlog-format= mixed  
 log-bin=/var/lib/mysql/mysql-bin.log  
 datadir=/var/lib/mysql  
 innodb_flush_log_at_trx_commit=1  
 sync_binlog=1  
 binlog-do-db={DB to replicate}  
 binlog-ignore-db=mysql  
 binlog-ignore-db=information_schema  
 binlog-ignore-db=performance_schema  
 sync_binlog=1  
 innodb_flush_log_at_trx_commit=1

Restart Master MySQL server

In the mysql command line on the Master, run the following commands:

 CREATE USER replicant@{SlaveDBServerIP};  
 GRANT REPLICATION SLAVE ON *.* to 'replicant'@'{SlaveDBServerIP}' IDENTIFIED BY 'r3pl_us3r';  
 FLUSH PRIVILEGES;  
 GRANT RELOAD ON *.* TO 'replicant'@{SlaveDBServerIP};  
 GRANT SUPER ON *.* TO 'replicant'@{SlaveDBServerIP};  
 FLUSH PRIVILEGES;

Take a back up of the database in the Master MySQL server using:

 mysqldump -u {user} -p{password} --skip-lock-tables --single-transaction --flush-logs --hex-blob --master-data=2 -A > dump.sql

Get the master_log_file and MASTER_LOG_POS from the MySQL dump:

 head dump.sql -n80 | grep "MASTER_LOG_POS"

On your AWS console, take a Snapshot of the Volume housing your Master MySQL DB:
Now create a Volume using the Snapshot-Id from above in the same AZ as the SlaveDB

Attach this Volume to the SlaveDB instance

On the Slave Server:

Point /var/lib/mysql to the mysql directory in the newly attached volume

Start MySQL and verify that the MasterDB is now available on the slave server. Note that some records (towards the end) may be missing. That is ok. The steps that we will perform below will sync the DBs.

Edit my.cnf

 server-id=101  
 binlog-format = mixed  
 log-bin = /var/lib/mysql/mysql-bin.log  
 relay-log = /var/lib/mysql/mysql-relay-bin  
 log-slave-updates = 1  
 slave-skip-errors = all  
 read-only = 1  
 binlog-do-db={DBs to be replicated}

Restart the Slave

In the mysql command line, run the following commands

 change master to master_host={IP.Address.Of.Master},  
   master_user='replicant',  
   master_password='r3pl_us3r',  
   master_log_file='FROM#5ABOVE',  
   master_log_pos=FROM#5ABOVE;  
 STOP SLAVE;  
 RESET SLAVE;  
 SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;  
 START SLAVE;  
 SHOW SLAVE STATUS;

 Slave_IO_State: Waiting for master to send event  
 Master_Host: master_ip  
 Master_User: rep_user  
 Master_Port: 3306  
 Connect_Retry: 60  
 .............

Random thoughts

Friday, November 21, 2014

Starting replication for a very large MySQL database on Amazon EC2

No comments:

About Me