We recently upgraded our test infrastructure backend to MongoDB 4. We were expecting to see a small but significant performance boost, and we got this… for most requests!
Strangely, and apparently at random, certain requests were taking as much as thirty seconds. Our average request times were all over the place:
Our health check response times drastically increased:
[graph]
Several people from our development and systems administration teams got stuck into finding the cause, each approaching the problem from a different angle, or axis.
Axis 0: Load Balancer and TLS Termination
We use end-to-end encryption, and we wondered if the problem was related to TLS.
We bypassed the load balancer and configured a single node with a public IP address.
Observations
- Observed no difference.
Conclusions and speculation
- The problem wasn’t caused by TLS.
Axis 1: Checking an endpoint with Siege
We homed in on the problem by using Siege to hit the same application endpoint over and over. This showed a strange pattern:
Observations
- Most requests are super fast
- If a request is not fast, it is delayed by (N * 5) seconds, where N is an integer between 1 and 6
Conclusions and speculation
- Something somewhere may have a 5 second timeout that is occasionally being triggered.
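For anyone wanting to reproduce this kind of pattern without Siege, a few lines of Python that time repeated requests against one endpoint do the same job (the URL below is a placeholder, not our real endpoint):

```python
import time
import urllib.request

# Placeholder endpoint; point this at whatever URL you are testing.
URL = "https://test.example.com/api/some-endpoint"

for i in range(50):
    start = time.perf_counter()
    with urllib.request.urlopen(URL) as response:
        response.read()
    elapsed = time.perf_counter() - start
    # Fast requests come back in milliseconds; the slow ones cluster
    # around multiples of five seconds (5, 10, 15, ...).
    print(f"request {i:02d}: {elapsed:6.2f}s")
```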
Axis 2: Load Balancer Logs
Our testing installation was running in Amazon EC2, so we loaded our load balancer logs into Amazon Athena, which let us query them with SQL.
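As a sketch, a query along these lines, run from Python with boto3 (database, table and results-bucket names are hypothetical; the table follows the standard ALB access-log schema), surfaces the slow requests:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical names: "alb_logs" is an Athena table created over the load
# balancer's access logs, "logs_db" is the database, and the S3 bucket holds results.
query = """
    SELECT time, request_url, target_processing_time
    FROM alb_logs
    WHERE target_processing_time > 5
    ORDER BY time
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://our-athena-results/"},
)
```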
Observations
- Every other request is VERY slow
Conclusions and speculation
- Perhaps Mongo’s doing something periodically that’s slowing things down.
Axis 3: MongoDB Monitoring
Having established that something was causing a delay, we looked into our database queries. Perhaps some queries are not so efficient in MongoDB 4? Mongo’s “Slow Query Log” came in handy here.
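The slow query log shares its slowms threshold with the database profiler, and both can be driven from application code; a minimal PyMongo sketch (the connection string and database name are hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://db1:27017/?replicaSet=rs0")  # hypothetical URI
db = client["app_db"]                                         # hypothetical database

# Level 1 records only operations slower than `slowms` (100 ms here) in the
# system.profile collection; the same threshold drives the slow query log.
db.command("profile", 1, slowms=100)

# Inspect the slowest recorded operations.
for op in db["system.profile"].find().sort("millis", -1).limit(10):
    print(op["op"], op.get("ns"), op["millis"], "ms")
```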
Observations
- No queries were especially slow
Conclusions and speculation
- It’s not specific queries causing the problem
- MongoDB itself seems OK
Axis 4: MongoDB performance recommendations
On starting MongoDB, its shell recommends XFS and a couple of system settings for best performance.
We didn’t think this would help all that much for this lightly used internal test installation (see this article from DZone), but we switched to XFS and applied the recommended settings anyway. We re-ran our Siege tests and saw no significant improvement: certain requests were still incredibly slow.
Axis 5: Application Layer
We suspected that the application itself could be the cause of the issue, so we checked various parts of the system, such as session handling, process pool configuration, memory limits and timeout settings.
Observations
- It exhibited the exact same problems
Conclusions and speculation
- The application code was not at fault
Axis 6: Clean Room Installation
It occurred to us that perhaps the installation we had built had some unknown problems, so we checked all the network routes, security rules and firewall configurations we could think of. We also wondered whether the issue was data related, so for comparison we stood up a completely new cluster containing no data at all and ran Siege against it.
Observations
- It exhibited the exact same problems
Conclusions and speculation
- Our test installation is not specifically at fault.
Axis 7: Replica Set Configuration
We stopped our Mongo replica set and restarted the primary as a standalone instance.
Observations
- The problem went away
Conclusions and speculation
- Something to do with replication is causing the issue
Since replica sets are integral to our ability to guarantee redundancy in our system and 99.9999% uptime, we had to discover why they were introducing delays in our response times.
Axis 8: Replica Set Configuration: No Arbiter
We removed the arbiter from the Mongo replica set. Arbiters are members of a replica set that don’t hold data; they exist only to vote in elections for a new primary.
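For reference, removing a member like this comes down to a replica set reconfiguration; a rough PyMongo sketch of what the mongo shell’s rs.remove() amounts to (URI and host names are hypothetical):

```python
from pymongo import MongoClient

# Connect via the replica set so the command is routed to the primary,
# which is where reconfiguration has to run. The URI is hypothetical.
client = MongoClient("mongodb://db1:27017,db2:27017/?replicaSet=rs0")

config = client.admin.command("replSetGetConfig")["config"]

# Drop any arbiter members and bump the config version.
config["members"] = [m for m in config["members"] if not m.get("arbiterOnly")]
config["version"] += 1

client.admin.command("replSetReconfig", config)
```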
Observations
- The problem went away
Conclusions and speculation
- Something to do with the arbiter node is causing the issue
Axis 9: Convert arbiter to full database server
We converted the arbiter into a regular database node.
Observations
- The problem recurred.
Conclusions and speculation
- The problem occurs whether the node in question is an arbiter or a full database server.
- Perhaps it’s network/security related?
Axis 10: Investigating the Firewall
We opened up a firewall rule so the application servers could communicate with the arbiter.
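If the rule lives in an EC2 security group, the change looks roughly like this hypothetical boto3 sketch, which allows the application servers’ security group to reach the arbiter on the MongoDB port:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical group IDs: the first protects the arbiter host, the second
# is attached to the application servers.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 27017,
            "ToPort": 27017,
            "UserIdGroupPairs": [{"GroupId": "sg-0fedcba9876543210"}],
        }
    ],
)
```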
Observations
- The problem went away.
Conclusions and speculation
- Something in the application is talking to the arbiter.
Diagnosis
With this clue in hand, we found the following information in MongoDB’s driver specifications:
“Second, Mongos 2.6 does not monitor arbiters, but drivers MUST do so. It costs little, and in rare cases an arbiter may be the client’s last hope to find the new replica set configuration.”
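In other words, the driver itself keeps monitoring connections open to every member of the replica set, arbiter included, so a firewall silently dropping that traffic leaves the driver waiting on timeouts even though no queries are ever sent to the arbiter. One way to watch this from the application side is to register a heartbeat listener with the driver; a minimal PyMongo sketch (the connection string is hypothetical):

```python
from pymongo import MongoClient, monitoring

class HeartbeatLogger(monitoring.ServerHeartbeatListener):
    """Prints every heartbeat the driver sends while monitoring replica set members."""

    def started(self, event):
        print("heartbeat started:", event.connection_id)

    def succeeded(self, event):
        print("heartbeat succeeded:", event.connection_id)

    def failed(self, event):
        # An unreachable arbiter shows up here, long before any query slows down.
        print("heartbeat FAILED:", event.connection_id, event.reply)

# Hypothetical replica set URI; the driver monitors every listed member,
# the arbiter included, regardless of whether queries are ever sent to it.
client = MongoClient(
    "mongodb://db1:27017,db2:27017,arbiter1:27017/?replicaSet=rs0",
    event_listeners=[HeartbeatLogger()],
)
```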