Tutorial for ghostferry-copydb

This tutorial aims to provide you with a first look on how to operate ghostferry-copydb to copy data from one database to another so you can have some experience with actually running Ghostferry. A production run of a data migration will be largely similar, although you will have to consider how to appropriately perform the cutover operations with respect to the applications accessing the database. Recommendations on how to run copydb in production can be found in Running ghostferry-copydb in production.

Setup and Seed MySQL

In this tutorial, we will be using two test databases that we setup locally and we will not consider the application. With git, clone the Ghostferry repository and create the test MySQL instances:

$ git clone https://github.com/Shopify/ghostferry.git
$ cd ghostferry
$ docker-compose up -d mysql-1 mysql-2

Users without docker-compose can either install it on their machine or manually setup two localhost MySQL instances available at port 29291 and 29292 with FULL image row based replication.

Confirm that you can access both MySQL instances with the MySQL console:

# mysql --protocol=tcp -u root -P 29291
# mysql --protocol=tcp -u root -P 29292

We will be moving data from the 29291 server to the 29292 server. To do this, we must first create some test data on the 29291 server to be copied over:

# export LC_CTYPE=C # Only need this if you're mac
echo "CREATE DATABASE abc;" > /tmp/n1create.sql
echo "CREATE TABLE abc.table1 (id bigint(20) AUTO_INCREMENT, data varchar(16), primary key(id));" >> /tmp/n1create.sql
echo "CREATE TABLE abc.table2 (id bigint(20) AUTO_INCREMENT, data TEXT, primary key(id));" >> /tmp/n1create.sql
for i in `seq 1 350`; do
  echo "INSERT INTO abc.table1 (id, data) VALUES (${i}, '$(cat /dev/urandom | tr -cd 'a-z0-9' | head -c 16)');" >> /tmp/n1create.sql
  echo "INSERT INTO abc.table2 (id, data) VALUES (${i}, '$(cat /dev/urandom | tr -cd 'a-z0-9' | head -c 16)');" >> /tmp/n1create.sql
done
mysql --protocol=tcp -u root -P 29291 < /tmp/n1create.sql
rm /tmp/n1create.sql

This created two tables under the database abc. We will be moving table1 to 29292 while not copying 29291.

(Mirrors Production) Create Ghostferry Users

We then need to create a user with the appropriate permissions for Ghostferry to connect with, to perform the move on both servers. For this move, we neglect SSL connections to MySQL and thus do not require SSL for the user. In production, you may want to enable that.

On the source server, the minimum permissions required are:

mysql> CREATE USER 'ghostferry'@'%' IDENTIFIED BY 'ghostferry';
mysql> GRANT SELECT ON `abc`.* TO 'ghostferry'@'%';
mysql> GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'ghostferry'@'%';

The above example grants the permission to only the abc database. You can grant it to more or all databases in your production environment as needed.

On the target server, the minimum permissions required are:

mysql> CREATE USER 'ghostferry'@'%' IDENTIFIED BY 'ghostferry';
mysql> GRANT INSERT, UPDATE, DELETE, CREATE, SELECT ON *.* TO 'ghostferry'@'%';

We grant permission to all databases because we assume that the abc database does not exist on the target and Ghostferry will create it automatically.

(Mirrors Production) Install ghostferry-copydb

We then need to obtain the ghostferry-copydb binary on the server on which we want to execute Ghostferry on. Note that all the data moved will go through this server over its network so make sure the production server is appropriately picked. For the present tutorial, Ghostferry will simply live on the same machine.

To download the latest binaries, you currently have to compile copydb with Go 1.9 via make copydb after cloning the repository.

For testing purposes, you can also use this unofficial PPA (see this PR as well) to obtain a version of ghostferry-copydb. Note the unofficial PPA for ghsotferry-copydb is not supported and you should not use it in production.

(Mirrors Production) Setup Ghostferry Run Configuration

We will need to provide ghostferry-copydb with a configuration file such that it knows how to connect to the databases and what to copy. This is a json file which should look like the following:

{
  "Source": {
    "Host": "127.0.0.1",
    "Port": 29291,
    "User": "ghostferry",
    "Pass": "ghostferry",
    "Collation": "utf8mb4_unicode_ci",
    "Params": {
      "charset": "utf8mb4"
    }
  },

  "Target": {
    "Host": "127.0.0.1",
    "Port": 29292,
    "User": "ghostferry",
    "Pass": "ghostferry",
    "Collation": "utf8mb4_unicode_ci",
    "Params": {
      "charset": "utf8mb4"
    }
  },

  "Databases": {
    "Whitelist": ["abc"]
  },

  "Tables": {
    "Blacklist": ["table2"]
  },

  "VerifierType": "ChecksumTable"
}

Save this file to a file called examplerun.json.

Note that in the example above, the Collation and charsets are set. If you setup your own MySQL instances, you might need to change these values. We are also using the Whitelist and Blacklist to ensure that we only copy abc.table1 from the source to the target. For more information about this configuration file, see Running ghostferry-copydb in production.

Lastly, we have enabled verification to be available to use during the run. Specifically, we enabled the ChecksumTable verifier as the amount of data copied will be small. For more information about the verifiers, see Verifiers.

(Mirrors Production) Validate Ghostferry Configuration

Before actually running Ghostferry, it is good practise to validate the configuration you specified. ghostferry-copydb has a dryrun flag that will try to use the configuration you have to connect to the database. It will also scan the tables according to the black/whitelist specified and print it out in the debug logs:

$ ghostferry-copydb -dryrun -verbose examplerun.json

The verbose flag gives slightly more debug information in case there are any issues. In this case, there should not be any issues as we setup the database according to the tutorial and the output should be something like this (simplified for readibility in the tutorial):

[...]
INFO[0000] connecting to the source database             dsn="ghostferry:<masked>@[...]" tag=ferry
INFO[0000] connecting to the target database             dsn="ghostferry:<masked>@[...]" tag=ferry
[...]
INFO[0000] found binlog position, starting synchronization  file=[...] pos=[...] tag=binlog_streamer
[...]
DEBU[0000] loading tables from database                  database=abc tag=table_schema_cache
DEBU[0000] fetching table schema                         database=abc table=table1 tag=table_schema_cache
DEBU[0000] fetching table schema                         database=abc table=table2 tag=table_schema_cache
DEBU[0000] caching table schema                          database=abc table=table1 tag=table_schema_cache
INFO[0000] table schemas cached                          tables="[abc.table1]" tag=table_schema_cache
exiting due to dryrun

Note the last INFO line shows which tables will be moved as we cache their schemas in the memory. If there is a table you want to move and it does not show up there, it means the whitelist/blacklist configuration is incorrect.

(Mirrors Production) Starting Ghostferry Run

To start the ghostferry run, simply perform the same command as before except without the dryrun flag. You can also turn off the verbose flag, although it may be good practise to leave it on and redirect stdout to a file so the move can be audited at a later time. We will do this here for good practise:

$ ghostferry-copydb -verbose examplerun.json 2&>examplerun.log

To confirm that Ghostferry indeed copies changes to the source table, we can manually insert a row into abc.table1 during the run

# mysql --protocol=tcp -u root -P 29291
mysql> INSERT INTO abc.table1 (id, data) VALUES (351, "helloworld");

(Mirrors Production) Monitoring Ghostferry Run via Web UI

Once the run starts, a built-in webserver is started at port 8000 by default. This can be changed in the configuration json. Simply browse to http://localhost:8000 to view this server and in there you should find controls to:

  • Pause/Unpause: allows you to pause/unpause the data copy and binlog streaming process.
  • Allow automatic cutover: You should only press this button after you set the source database to read only. In its current implementation, it will simply allow ghostferry-copydb to finish all its processes in a correct manner, assuming that there are no more writes to the source database and all pending writes have been flushed to the binlog. In a future implementation, we may allow external scripts (configured via the json configuration) to be automatically executed with the push of this button so you can perform operations you need to perform during cutover.
  • Run Verification: This button is only available during the Wait-For-Cutover and Done phase of the move. It will run the ChecksumTable verifier we specified earlier ensure the data are identical on the source and target. You should only run this while the source is read only and when the target is not yet written to.

The page will refresh itself every 60 seconds.

For this tutorial, the run should be very short so thus you might miss most of the copying states. Take a look around and refresh a couple times to get familiar with the UI.

(Mirrors Production) Perform Cutover

In the default configuration, cutover is triggered manually. During cutover, you must stop writes to the data on the source database. For the purpose of this tutorial, we will set the source database to read only. Even though we have no applications writing to the source in this case, let’s do it anyway so we get into the habit of thinking of this step:

# mysql --protocol=tcp -u root -P 29291
mysql> FLUSH TABLES WITH READ LOCK; -- Ensure all writes are done
mysql> SET GLOBAL read_only = ON;   -- Sets the database to read only
mysql> FLUSH BINARY LOGS            -- Ensure all writes are record in binlog

The last step FLUSH BINARY LOGS is not necessarily required if you run your MySQL server with sync_binlog=1. If you’re running Ghostferry from a source that is a replica, you need to also turn on the option RunFerryFromReplica in the config json as well as other options. See https://godoc.org/github.com/Shopify/ghostferry/copydb#Config for more details.

We can then go back to the web ui and click the Allow Automatic Cutover button. In a second or two the ghostferry binlog streaming process should stop. Refresh the page until you see the state to be DONE.

(Mirrors Production) Verify Source and Target Data are Identical

At this point, the data on the source and target should be identical. To confirm this is the case, click the Run Verification button in the web ui to perform the verification in the background. Refresh the page a couple of times until it tells you the verification was successful.

Additionally, since we manually inserted a row earlier, we should be able to find it via:

# mysql --protocol=tcp -u root -P 29292
mysql> SELECT * FROM abc.table1 WHERE id = 351;

Finishing Ghostferry Run and Next Steps

At this point, the data on the source and target are verified identical and Ghostferry will no longer propagate data from 29291 to 29292. In a production situation, you can now notify all applications using the source database to use the target database.

The control server UI will stay up indefinitely. To stop it, simply press CTRL+C to interrupt the ghostferry-copydb process.

To run Ghostferry in production, you should read through Running ghostferry-copydb in production. If you need to interrupt and resume Ghostferry, you should also read through Interrupt and resuming ghostferry-copydb.