In this post, I will outline the steps we used to restart replication of a very large MySQL database from scratch.
Here is the problem that we faced:
- I had originally set up Master-Slave replication for a MySQL database (using the simple steps outlined here)
- On both the Master and the Slave server, I had /var/lib/mysql point to an IOPS-optimized EBS disk
- Since this data was very critical, I had set up a cron job to back up the database on the Slave server
- I was monitoring the Master MySQL server with NewRelic and had set up alerts for when CPU, memory, or disk consumption reached certain thresholds
- I made three mistakes (for those of you setting this up from scratch):
- Not setting up NewRelic on the Slave server
- Performing the backup on the same EBS disk as the DB partition
- Not "cleaning up" or archiving older backups (a safer backup job is sketched after this list)
- As a result of these mistakes, the Slave's EBS disk became full and replication stopped
- I was blissfully unaware of this until a few days had passed
- I cleaned up the Slave's EBS disk, but I did not know where or when replication had stopped, so I could not simply resume it
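In hindsight, a backup job that writes to a separate volume and prunes old dumps would have avoided two of those mistakes. Here is a minimal sketch, assuming a dedicated backup volume mounted at /mnt/backups and a 7-day retention window; the paths, retention, and credentials file are illustrative, not what we actually ran:

#!/bin/bash
# Nightly backup, written to a volume separate from the DB's EBS disk
# (paths and retention below are assumptions, for illustration only)
BACKUP_DIR=/mnt/backups
DB_NAME="{DB to replicate}"

mysqldump --defaults-extra-file=/etc/mysql/backup.cnf --single-transaction "$DB_NAME" \
  | gzip > "$BACKUP_DIR/${DB_NAME}_$(date +%F).sql.gz"

# Prune dumps older than 7 days so the disk never fills up
find "$BACKUP_DIR" -name "${DB_NAME}_*.sql.gz" -mtime +7 -delete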
At this time, here were the constraints that we were dealing with:
- Some tables already had over 10 million records, and others had over 60 million
- This was a live, production system, so shutting it down (with a READ LOCK) was not an option
- Replication and database backups both remained critical
- This task had to be completed within a few hours, because until replication was restored we had neither a working slave nor fresh backups
These constraints ruled out:
- Following the replication steps from the beginning of this post, since we could not take a READ LOCK
- Backing up the database on the Master and restoring it on the Slave, since that would take too much time
Given these requirements and constraints, all the documented methods for resuming Master-Slave replication were ruled out. Here are the steps we took to accomplish the task:
On the Master server:
- Edit my.cnf (the settings are shown below)
- Restart the Master MySQL server
- In the mysql command line on the Master, run the commands below to create the replication user
- Take a backup of the database on the Master using the mysqldump command below
- Get the MASTER_LOG_FILE and MASTER_LOG_POS values from the dump (see the grep below)
- On your AWS console, take a Snapshot of the Volume housing your Master MySQL DB (a scripted alternative is sketched at the end of this section)
- Now create a Volume using the Snapshot-Id from above, in the same AZ as the SlaveDB
- Attach this Volume to the SlaveDB instance
server-id=1
binlog-format = mixed
log-bin=/var/lib/mysql/mysql-bin.log
datadir=/var/lib/mysql
innodb_flush_log_at_trx_commit=1
sync_binlog=1
binlog-do-db={DB to replicate}
binlog-ignore-db=mysql
binlog-ignore-db=information_schema
binlog-ignore-db=performance_schema
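Once the Master is back up, it is worth confirming that binary logging is actually enabled before going further. Two standard checks from the mysql prompt:

SHOW VARIABLES LIKE 'log_bin';
SHOW MASTER STATUS;

The first should report ON; the second shows the current binlog file and position. With that confirmed, the commands from step 3 create the replication user: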
CREATE USER 'replicant'@'{SlaveDBServerIP}' IDENTIFIED BY 'r3pl_us3r';
GRANT REPLICATION SLAVE ON *.* TO 'replicant'@'{SlaveDBServerIP}';
GRANT RELOAD ON *.* TO 'replicant'@'{SlaveDBServerIP}';
GRANT SUPER ON *.* TO 'replicant'@'{SlaveDBServerIP}';
FLUSH PRIVILEGES;
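You can verify that the user and its privileges came out as intended with:

SHOW GRANTS FOR 'replicant'@'{SlaveDBServerIP}';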
mysqldump -u {user} -p{password} --skip-lock-tables --single-transaction --flush-logs --hex-blob --master-data=2 -A > dump.sql
head -n 80 dump.sql | grep "MASTER_LOG_POS"
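Because the dump was taken with --master-data=2, the replication coordinates appear near the top of the file as a comment. The grep output will look something like this (the file name and position here are illustrative):

-- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000002', MASTER_LOG_POS=107;

Note down both values; they go into the CHANGE MASTER statement on the Slave.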
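The three AWS steps at the end of the list can also be scripted instead of clicked through in the console. A rough sketch using the AWS CLI, where every ID, the AZ, and the device name are placeholders:

# Snapshot the Volume housing the Master's data directory
aws ec2 create-snapshot --volume-id vol-xxxxxxxx --description "master mysql data"
# Once the snapshot completes, create a Volume from it in the Slave's AZ
aws ec2 create-volume --snapshot-id snap-xxxxxxxx --availability-zone us-east-1b
# Attach the new Volume to the Slave instance
aws ec2 attach-volume --volume-id vol-yyyyyyyy --instance-id i-zzzzzzzz --device /dev/sdf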
On the Slave Server:
- Point /var/lib/mysql to the mysql directory in the newly attached volume (a sketch of this step follows this list)
- Start MySQL and verify that the Master's database is now available on the Slave server. Some records (towards the end) may be missing; that is OK, as the steps below will sync the DBs
- Edit my.cnf (the settings are shown below)
- Restart the Slave
- In the mysql command line, run the commands shown below
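For the first step, here is one way to bring the volume online and repoint the data directory. This is a sketch that assumes the new volume shows up as /dev/xvdf and that a symlink is acceptable; the device name and mount point are illustrative:

# Stop MySQL before touching the data directory
service mysql stop
mkdir -p /mnt/master-data
mount /dev/xvdf /mnt/master-data
# Keep the old datadir around, then point /var/lib/mysql at the snapshot copy
mv /var/lib/mysql /var/lib/mysql.old
ln -s /mnt/master-data/mysql /var/lib/mysql
chown -R mysql:mysql /mnt/master-data/mysql
service mysql start

The my.cnf settings for the Slave (step 3):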
server-id=101
binlog-format = mixed
log-bin = /var/lib/mysql/mysql-bin.log
relay-log = /var/lib/mysql/mysql-relay-bin
log-slave-updates = 1
slave-skip-errors = all
read-only = 1
binlog-do-db={DBs to be replicated}
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO MASTER_HOST='{IP.Address.Of.Master}',
MASTER_USER='replicant',
MASTER_PASSWORD='r3pl_us3r',
MASTER_LOG_FILE='FROM#5ABOVE',
MASTER_LOG_POS=FROM#5ABOVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;
SHOW SLAVE STATUS\G
The output will look similar to this:
Slave_IO_State: Waiting for master to send event
Master_Host: master_ip
Master_User: rep_user
Master_Port: 3306
Connect_Retry: 60
.............
It may take a while to reach this state, depending on how many records need to be synced.
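Beyond the header shown above, the fields that actually indicate replication health are the two thread flags and the lag counter. On a slave that has fully caught up, they read:

Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0

While the slave is still catching up, Seconds_Behind_Master will be a large number that steadily shrinks toward zero.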