We recently upgraded our test infrastructure backend to MongoDB 4. We were expecting to see a small but significant performance boost, and we got this… for most requests!
Strangely, and apparently at random, certain requests were taking as much as thirty seconds. Our average request times were all over the place:
Our health check response times drastically increased:
[graph]
Several people from our development and systems administration teams got stuck into finding the cause, each approaching the problem from a different angle, or axis.
Axis 0: Load Balancer and TLS Termination
We use end-to-end encryption, and we wondered if the problem was related to TLS.
We bypassed the load balancer and configured a single node with a public IP address.
Observations
- Observed no difference.
Conclusions and speculation
- The problem wasn’t caused by TLS.
Axis 1: Checking an endpoint with Siege
We homed in on the problem by using Siege to hit the same application endpoint over and over. This showed a strange pattern:
Observations
- Most requests are super fast
- If a request is not fast, it is delayed by (N * 5) seconds, where N is an integer between 1 and 6
Conclusions and speculation
- Something somewhere may have a 5 second timeout that is occasionally being triggered.
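For anyone wanting to reproduce this kind of pattern without Siege, a few lines of Python that time repeated requests against one endpoint do the same job (the URL below is a placeholder, not our real endpoint):

```python
import time
import urllib.request

# Placeholder endpoint; point this at whatever URL you are testing.
URL = "https://test.example.com/api/some-endpoint"

for i in range(50):
    start = time.perf_counter()
    with urllib.request.urlopen(URL) as response:
        response.read()
    elapsed = time.perf_counter() - start
    # Fast requests come back in milliseconds; the slow ones cluster
    # around multiples of five seconds (5, 10, 15, ...).
    print(f"request {i:02d}: {elapsed:6.2f}s")
```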
Axis 2: Load Balancer Logs
Our testing installation was running in Amazon EC2, so we loaded our load balancer logs into Amazon Athena, which let us query them with SQL.
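As a sketch, a query along these lines, run from Python with boto3 (database, table and results-bucket names are hypothetical; the table follows the standard ALB access-log schema), surfaces the slow requests:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical names: "alb_logs" is an Athena table created over the load
# balancer's access logs, "logs_db" is the database, and the S3 bucket holds results.
query = """
    SELECT time, request_url, target_processing_time
    FROM alb_logs
    WHERE target_processing_time > 5
    ORDER BY time
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://our-athena-results/"},
)
```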
Observations
- Every other request is VERY slow
Conclusions and speculation
- Perhaps Mongo’s doing something periodically that’s slowing things down.
Axis 3: MongoDB Monitoring
Having established that something was causing a delay, we looked into our database queries. Perhaps some queries are not so efficient in MongoDB 4? Mongo’s “Slow Query Log” came in handy here.
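The slow query log shares its slowms threshold with the database profiler, and both can be driven from application code; a minimal PyMongo sketch (the connection string and database name are hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://db1:27017/?replicaSet=rs0")  # hypothetical URI
db = client["app_db"]                                         # hypothetical database

# Level 1 records only operations slower than `slowms` (100 ms here) in the
# system.profile collection; the same threshold drives the slow query log.
db.command("profile", 1, slowms=100)

# Inspect the slowest recorded operations.
for op in db["system.profile"].find().sort("millis", -1).limit(10):
    print(op["op"], op.get("ns"), op["millis"], "ms")
```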
Observations
- No queries were especially slow
Conclusions and speculation
- It’s not specific queries causing the problem
- MongoDB itself seems OK
Axis 4: MongoDB performance recommendations
On starting MongoDB, its shell recommends XFS and a couple of system settings for best performance.
We didn’t think this would help all that much for this lightly used internal test installation (see this article from DZone), but we switched to XFS and applied the recommended settings anyway. We re-ran our Siege tests and saw no significant improvement: certain requests were still incredibly slow.
Axis 5: Application Layer
We suspected that the application itself could be the cause of the issue, so we checked various parts of the system, such as session handling, process pool configuration, memory limits and timeout settings.
Observations
- It exhibited the exact same problems
Conclusions and speculation
- The application code was not at fault
Axis 6: Clean Room Installation
It occurred to us that perhaps the installation we had built had some unknown problems, so we checked all the network routes, security rules and firewall configurations we could think of. We also wondered whether the issue was data related, so for comparison we stood up a completely new cluster containing no data at all and ran Siege against it.
Observations
- It exhibited the exact same problems
Conclusions and speculation
- Our test installation is not specifically at fault.
Axis 7: Replica Set Configuration
We stopped our Mongo replica set and restarted the primary as a standalone instance.
Observations
- The problem went away
Conclusions and speculation
- Something to do with replication is causing the issue
Since replica sets are integral to our ability to guarantee redundancy in our system and 99.9999% uptime, we had to discover why they were introducing delays in our response times.
Axis 8: Replica Set Configuration: No Arbiter
We removed the arbiter from the Mongo replica set. Arbiters are members of a replica set that don’t hold data; they exist only to vote in elections for a new primary.
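For reference, removing a member like this comes down to a replica set reconfiguration; a rough PyMongo sketch of what the mongo shell’s rs.remove() amounts to (URI and host names are hypothetical):

```python
from pymongo import MongoClient

# Connect via the replica set so the command is routed to the primary,
# which is where reconfiguration has to run. The URI is hypothetical.
client = MongoClient("mongodb://db1:27017,db2:27017/?replicaSet=rs0")

config = client.admin.command("replSetGetConfig")["config"]

# Drop any arbiter members and bump the config version.
config["members"] = [m for m in config["members"] if not m.get("arbiterOnly")]
config["version"] += 1

client.admin.command("replSetReconfig", config)
```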
Observations
- The problem went away
Conclusions and speculation
- Something to do with the arbiter node is causing the issue
Axis 9: Convert arbiter to full database server
We converted the arbiter into a regular database node.
Observations
- The problem recurred.
Conclusions and speculation
- The problem occurs whether the node in question is an arbiter or a full database server.
- Perhaps it’s network/security related?
Axis 10: Investigating the Firewall
We opened up a firewall rule so the application servers could communicate with the arbiter.
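If the rule lives in an EC2 security group, the change looks roughly like this hypothetical boto3 sketch, which allows the application servers’ security group to reach the arbiter on the MongoDB port:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical group IDs: the first protects the arbiter host, the second
# is attached to the application servers.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 27017,
            "ToPort": 27017,
            "UserIdGroupPairs": [{"GroupId": "sg-0fedcba9876543210"}],
        }
    ],
)
```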
Observations
- The problem went away.
Conclusions and speculation
- Something in the application is talking to the arbiter.
Diagnosis
With this clue in hand, we found the following information in MongoDB’s driver specifications:
“Second, Mongos 2.6 does not monitor arbiters, but drivers MUST do so. It costs little, and in rare cases an arbiter may be the client’s last hope to find the new replica set configuration.”
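In other words, the driver itself keeps monitoring connections open to every member of the replica set, arbiter included, so a firewall silently dropping that traffic leaves the driver waiting on timeouts even though no queries are ever sent to the arbiter. One way to watch this from the application side is to register a heartbeat listener with the driver; a minimal PyMongo sketch (the connection string is hypothetical):

```python
from pymongo import MongoClient, monitoring

class HeartbeatLogger(monitoring.ServerHeartbeatListener):
    """Prints every heartbeat the driver sends while monitoring replica set members."""

    def started(self, event):
        print("heartbeat started:", event.connection_id)

    def succeeded(self, event):
        print("heartbeat succeeded:", event.connection_id)

    def failed(self, event):
        # An unreachable arbiter shows up here, long before any query slows down.
        print("heartbeat FAILED:", event.connection_id, event.reply)

# Hypothetical replica set URI; the driver monitors every listed member,
# the arbiter included, regardless of whether queries are ever sent to it.
client = MongoClient(
    "mongodb://db1:27017,db2:27017,arbiter1:27017/?replicaSet=rs0",
    event_listeners=[HeartbeatLogger()],
)
```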