Applying the correct troubleshooting steps to your problems is crucial:
a) Understand what resources have been saturated.
b) Understand what, if anything, is causing an error.
c) From there, dig into the areas related to that resource and start narrowing down the issue.
d) Tackle the problems bit by bit.
Having the right tools for the job is key to success. PMM is a great example of a tool that can help you quickly identify, drill into, and fix bottlenecks.
Have realistic load tests. In this case, they had tested the new release at a concurrency level that did not match their production workload.
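To illustrate the point about concurrency, here is a minimal sketch of a load test that runs the same workload at several concurrency levels, so that results from a low-concurrency run are not mistaken for production behavior. Everything in it is an assumption for illustration: `run_query` is a hypothetical stand-in that only sleeps, and the concurrency levels are arbitrary. In a real test you would replace `run_query` with an actual database call and pick levels that mirror what production sees.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query():
    """Hypothetical stand-in for one production query; replace with a real database call."""
    time.sleep(0.01)


def timed_call(_):
    """Run one query and return how long it took, in seconds."""
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start


def load_test(concurrency, total_requests):
    """Issue total_requests queries using `concurrency` worker threads."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "requests": len(latencies),
        "avg_latency_s": sum(latencies) / len(latencies),
        "qps": len(latencies) / wall,
    }


if __name__ == "__main__":
    # Illustrative levels only; use the concurrency you actually observe in production.
    for level in (1, 8, 32):
        print(load_test(level, 64))
```

Comparing average latency and QPS across levels shows whether a query that looks fine with a single client degrades once many clients run it at the same time.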
By identifying the culprit query we were able to:
a) Drop average query time from 20s to 661ms
b) Increase QPS by 2x
c) Reduce CPU usage to 1/4 of its level prior to our intervention
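For a sense of scale, the figures above can be turned into ratios. This is a small sketch that uses only the numbers quoted in the list; the variable names are illustrative.

```python
# Figures taken from the list above.
before_s = 20.0   # average query time before the fix, in seconds
after_s = 0.661   # average query time after the fix (661 ms)

speedup = before_s / after_s
latency_reduction_pct = (before_s - after_s) / before_s * 100
cpu_reduction_pct = (1 - 1 / 4) * 100  # CPU usage fell to 1/4 of its earlier level

print(f"Average query time: about {speedup:.0f}x faster ({latency_reduction_pct:.1f}% lower)")
print(f"CPU usage: {cpu_reduction_pct:.0f}% lower")
```

In other words, the average query became roughly 30 times faster while CPU usage dropped by three quarters.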