Monday, September 26, 2016

MySQL shutdown, dirty pages, and innodb_page_cleaners in 5.7

In a busy MySQL environment, shutdown is often delayed for a long time by the InnoDB page cleaner.  On some very busy systems this delay can run to hours, and you will get tired of watching this message in your error log:
                   [Note] InnoDB: Waiting for page_cleaner to finish flushing of buffer pool

Prior to 5.7, all you could do was watch the dirty pages and try to force the number down ahead of the shutdown by telling the server to flush aggressively:
                   set global innodb_max_dirty_pages_pct = 0;

You can then watch the dirty page count drop until it reaches a low point (likely never anywhere near zero on a busy system) and shut the server down at that point.  To monitor the dirty pages:
                     mysqladmin ext -i10 |grep pages_dirty
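
Putting the two together, here is a minimal shell sketch of that pre-shutdown routine.  The credentials file path and the 500-page threshold are placeholders for illustration; pick whatever "low point" your buffer pool normally settles at.

        #!/bin/bash
        # Ask InnoDB to flush dirty pages as aggressively as it can
        mysql --defaults-extra-file=/root/.my.cnf -e "SET GLOBAL innodb_max_dirty_pages_pct = 0;"

        # Poll the dirty page count until it drops below a site-specific threshold
        while true; do
            dirty=$(mysqladmin --defaults-extra-file=/root/.my.cnf ext | awk '/Innodb_buffer_pool_pages_dirty/ {print $4}')
            echo "dirty pages: $dirty"
            [ "$dirty" -lt 500 ] && break    # 500 is an arbitrary example threshold
            sleep 10
        done

        # The shutdown should now have far less flushing left to do
        mysqladmin --defaults-extra-file=/root/.my.cnf shutdown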

But in 5.7 you can have multiple page cleaner threads (the above trick still works) to help get the server into a state where a quick shutdown is possible, by setting
                     innodb_page_cleaners=n
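
In 5.7 innodb_page_cleaners is not a dynamic variable, so it has to go into the config file before the server starts.  A minimal my.cnf sketch, where the value of 4 is just an example (the server caps it at innodb_buffer_pool_instances anyway):

        [mysqld]
        # one cleaner thread per buffer pool instance is a common starting point
        innodb_buffer_pool_instances = 4
        innodb_page_cleaners         = 4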

Like all good things, there is no free lunch: if you kill the server rather than waiting for the page cleaner to do its job, you will pay for it in crash recovery at startup.

Enjoy

Sunday, September 25, 2016

Huge databases and Percona XtraDB Cluster

In my latest endeavor I have been working with Percona's XtraDB Cluster (PXC) product.  All things considered, I like it better than native Oracle MySQL clustering.  It's easy to set up and maintain, and I've found it very resilient to the noisy failures we are all accustomed to.

But one of our challenges has been dealing with slow disks (the dreaded SAN), mediocre networks, and a huge database (~4TB).  Moving this database around, taking backups, building the cluster, and recovering the cluster have all presented issues.  Here are some observations which might help you, but remember: your mileage may vary.

Avoid SST at all costs
If you lose a node in the cluster and the incremental (IST) window has passed, the node will be completely rebuilt via SST.  (As I found out in one unfortunate incident, messing up your my.cnf file can trigger an SST too.)  If your databases are gigabyte-sized, that's probably no big deal.  But in my current world it takes about 2 days to rebuild a 4TB node (don't ask), so we do everything we can to avoid an SST.  Two things help.  One, make your gcache as big as you can afford so a returning node can catch up with an incremental transfer.  Two, be very, very careful when you stop a node: check that your grastate.dat file has a valid, non-negative seqno, and save a copy somewhere in case you make a mistake (so you can restore it).
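
A minimal sketch of both points; the 10G gcache size and the data directory path are assumptions for illustration, so size the gcache to cover however long you expect a node to stay down:

        # my.cnf on every node: a bigger gcache widens the IST window
        [mysqld]
        wsrep_provider_options = "gcache.size=10G"

        # before restarting a stopped node, confirm it kept a usable position
        # (a seqno of -1 usually means the position is unknown and an SST is likely)
        cat /var/lib/mysql/grastate.dat
        cp  /var/lib/mysql/grastate.dat /root/grastate.dat.save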

Xtrabackup is your savior
If you have to do an SST, then your only real option for a huge database is xtrabackup.  We use the xtrabackup-v2 SST method in the my.cnf file; at least it does not lock up the donor node for the duration of the transfer.  I've also been using xtrabackup manually to build fresh nodes or reconstruct a broken one, and that can be faster than letting SST do it for you.
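
For reference, a minimal my.cnf sketch of the SST settings we are talking about; the user name and password are placeholders:

        [mysqld]
        # the xtrabackup-based SST method discussed above
        wsrep_sst_method = xtrabackup-v2
        # credentials the SST script uses on the donor node (placeholders)
        wsrep_sst_auth   = sstuser:sstpassword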

We also tried using mydumper/myloader for building clusters, with the big advantage that a logical reload shrinks our InnoDB data files, but it's just too slow for this database; we're stuck with xtrabackup for the sheer speed.

Flow control will kill you
We are finding in our three-node world that flow control caused by heavy updates on one node can kill you.  Obviously you can work with your developers to break up those 1M-row updates (changing the wsrep options to limit rows or data does not seem to work well in our environment).  But sometimes you have nasty DDL to run.  pt-online-schema-change works well (would you believe we ran a pt-osc that took 4 days and 18 hours?  Don't ask).  Be sure you get the latest version of pt-osc and turn on --max-flow-ctl so you don't slam your other nodes; for the above-mentioned 4+ day pt-osc we set --max-flow-ctl=0 so that other DML was not affected by our massive DDL.  Thank goodness for pt-osc!
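
A sketch of the kind of invocation I mean; the host, database, table, and ALTER statement are made-up placeholders:

        # --max-flow-ctl=0 pauses the copy whenever the cluster reports any flow control
        pt-online-schema-change \
          --alter "ADD COLUMN created_at DATETIME NULL" \
          --max-flow-ctl=0 \
          --execute \
          h=node1.example.com,D=mydb,t=big_table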

Asynchronous replication is not recommended
We have three systems right now as we move to PXC: two three-node clusters and an existing master-slave environment.  To try to keep these all in sync (very similar to herding cats) we run asynchronous replication in various configurations.  We've tried master -> cluster 1 -> cluster 2, as well as master -> cluster 1 and master -> cluster 2 in parallel.  Both have their challenges.  The key is to make sure only one system is read/write and the others have read_only=1 set.  For auto-increment columns you need to disable PXC's automatic control and manage the increment and offset manually.  I do not recommend this approach, and once we finally retire the old master/slave system I am going to try to combine these two clusters into one across the WAN.
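
A minimal my.cnf sketch of those settings; read_only goes on every node of a system that should not take writes, and the auto-increment values are set consistently by hand on each system (3 and 1 here are illustrative values for a three-writer layout):

        [mysqld]
        # reject writes from ordinary users on this system
        read_only = 1

        # manage auto-increment ranges ourselves instead of letting Galera
        # rebalance them as nodes join and leave
        wsrep_auto_increment_control = OFF
        auto_increment_increment     = 3
        auto_increment_offset        = 1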

I'll update this as I get more experience, but I thought I would share some preliminary ideas.


The "three DBAs walk into a bar" joke....

You've all heard it.

Three DBAs walk into a bar.  They leave because they can't find a table.

Ha ha.  Very good, but what about all the excitement today over NoSQL?  Sure, no table is fine, but what exactly are we using to store our data?

My experience so far with NoSQL components is limited, but from what I see they do not offer the mathematical rigor of a SQL database as envisioned by Date and Codd.  I do see, though, that for quick key/value lookups they are great.
