We recently upgraded our test infrastructure backend to MongoDB 4. We were expecting a small but significant performance boost, and we got one… for most requests!

Strangely, and apparently at random, certain requests were taking as much as thirty seconds. Our average request times were all over the place:

Average response times after upgrading to MongoDB 4.

Our health check downtime drastically increased:


Several people from our development and systems administration teams got stuck into finding the cause, each approaching the problem from a different angle, or axis.

Axis 0: Load Balancer and TLS Termination

We use end-to-end encryption, and we wondered if the TLS layer was related to the problem.

We bypassed the load balancer and configured a single node with a public IP address.

Observations

  • Observed no difference.

Conclusions and speculation

  • The problem wasn’t caused by TLS.

Axis 1: Checking an endpoint with Siege

We narrowed the problem down by using Siege to hit the same application endpoint over and over. This revealed a strange pattern:

The odd response times are in bold.

Observations

  • Most requests are very fast
  • When a request is not fast, it is slow by (N * 5) seconds, where N is a whole number between 1 and 6

Conclusions and speculation

  • Something somewhere may have a 5 second timeout that is occasionally being triggered.
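A pattern like this is easy to check mechanically. Here is a small sketch (the helper name, the tolerance and the sample times are our own invention, not Siege output) that flags response times sitting close to a multiple of five seconds:

```python
def timeout_multiple(elapsed, unit=5.0, tolerance=0.5):
    """Return N if `elapsed` lies within `tolerance` of N * unit (N >= 1),
    else 0. Useful for spotting requests delayed by some multiple of a
    fixed timeout."""
    n = round(elapsed / unit)
    if n >= 1 and abs(elapsed - n * unit) <= tolerance:
        return n
    return 0

# Classify a batch of Siege-style response times (values invented):
times = [0.12, 5.31, 0.09, 10.02, 25.4]
suspicious = {t: timeout_multiple(t) for t in times if timeout_multiple(t)}
# suspicious maps each slow time to its multiple: {5.31: 1, 10.02: 2, 25.4: 5}
```

Seeing the slow times cluster on these multiples, rather than spreading randomly, is what pointed us at a timeout rather than a load problem.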

Axis 2: Load Balancer Logs

Our testing installation was running in Amazon EC2, so we loaded our load balancer logs into Amazon Athena, which let us query them directly:

Response time patterns, courtesy of Amazon Athena.

Observations

  • Every other request is VERY slow

Conclusions and speculation

  • Perhaps Mongo’s doing something periodically that’s slowing things down.
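Before (or instead of) loading logs into Athena, the timing fields can be spot-checked locally. This sketch assumes the classic ELB access-log layout, where the line starts with timestamp, load balancer name, client:port and backend:port, followed by three processing-time fields; the log line below is invented for illustration:

```python
def backend_time(log_line):
    """Extract the backend processing time (in seconds) from a classic ELB
    access-log line. Splitting on spaces is safe here because we only need
    the early, unquoted fields."""
    fields = log_line.split()
    return float(fields[5])  # field 6: backend_processing_time

# Invented example line in the classic ELB layout:
line = ('2018-07-02T22:23:00.186641Z my-elb 192.168.131.39:2817 10.0.0.1:80 '
        '0.000073 5.001388 0.000057 200 200 0 29 '
        '"GET http://example.com:80/ HTTP/1.1" "curl/7.38.0" - -')
# backend_time(line) -> 5.001388
```

A backend time sitting just over five seconds, with near-zero request and response processing times, is exactly the shape of entry we kept seeing.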

Axis 3: MongoDB Monitoring

Having established that something was causing a delay, we looked into our database queries. Perhaps some queries were less efficient under MongoDB 4? Mongo’s slow query log came in handy here.
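For reference, slow-query entries in Mongo’s text log (the pre-4.4 format) end with the operation’s duration, so a crude filter is easy to script. The log line below is invented for illustration; only the trailing `...ms` field matters:

```python
import re

def duration_ms(log_line):
    """Return the trailing duration (in milliseconds) of a MongoDB text
    log line, or None if the line carries no duration."""
    m = re.search(r'(\d+)ms\s*$', log_line)
    return int(m.group(1)) if m else None

# Invented log line in the pre-4.4 text format:
line = ('2019-01-15T10:00:00.000+0000 I COMMAND [conn12] command app.users '
        'command: find { find: "users" } protocol:op_msg 5023ms')
# duration_ms(line) -> 5023
```

Scanning the whole log this way confirmed what the slow query log itself told us: no individual query came anywhere near the multi-second delays we were chasing.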

Observations

  • No queries were especially slow

Conclusions and speculation

  • It’s not specific queries causing the problem
  • MongoDB itself seems OK

Axis 4: MongoDB Performance Recommendations

On starting MongoDB, its shell recommends XFS and a couple of system settings for best performance:

Mongo recommendations are displayed on start-up.

We didn’t think this would help all that much for this lightly used internal test installation (see this article from DZone), but we switched to XFS and implemented these settings. We re-ran our Siege tests and saw no significant improvement: certain requests were still incredibly slow.

Axis 5: Application Layer

We suspected some part of the application could be the cause of the issue, so we checked various parts of the system: session handling, process pool configuration, memory limits and timeout settings.

Observations

  • Every configuration we tried exhibited the exact same problems

Conclusions and speculation

  • The application code was not at fault

Axis 6: Clean Room Installation

It occurred to us that perhaps the installation we had built had some unknown problem, so we checked all the network routes, security rules and firewall configurations: everything we could think of. We also wondered whether the issue was data related. So, for comparison, we stood up a completely new cluster containing no data at all, and ran Siege against it.

Observations

  • It exhibited the exact same problems

Conclusions and speculation

  • Our test installation is not specifically at fault.

Axis 7: Replica Set Configuration

We stopped our Mongo replica set and restarted the primary as a standalone instance.

Observations

  • The problem went away

Conclusions and speculation

  • Something to do with replication is causing the issue

Since replica sets are integral to our ability to guarantee redundancy and 99.9999% uptime, we had to discover why they were introducing delays into our response times.

Axis 8: Replica Set Configuration: No Arbiter

We removed the arbiter from the Mongo replica set. Arbiters are members of a replica set that hold no data; they exist only to vote in primary elections, letting a set with an even number of data-bearing nodes still reach a majority.

Observations

  • The problem went away

Conclusions and speculation

  • Something to do with the arbiter node is causing the issue

Axis 8 (continued): Converting the Arbiter to a Full Database Server

We converted the arbiter into a regular database node.

Observations

  • The problem recurred.

Conclusions and speculation

  • The problem occurs whether the node in question is an arbiter or a full database server.
  • Perhaps it’s network/security related?
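For context, the only structural difference between the two runs is the member’s arbiterOnly flag in the replica set configuration. A sketch of the relevant fragment, shaped like the members array that rs.conf() returns (host names invented):

```python
# Fragment of a replica set config document, as rs.conf() would show it.
# An arbiter is simply a member flagged arbiterOnly; converting it to a
# full database node means re-adding it without that flag so it bears data.
members = [
    {"_id": 0, "host": "db0.internal:27017"},
    {"_id": 1, "host": "db1.internal:27017"},
    {"_id": 2, "host": "arb.internal:27017", "arbiterOnly": True},
]

data_bearing = [m["host"] for m in members if not m.get("arbiterOnly", False)]
# data_bearing -> ["db0.internal:27017", "db1.internal:27017"]
```

Since the problem tracked the node’s membership in the set rather than its arbiterOnly flag, the configuration itself was evidently not the culprit.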

Axis 9: Investigating the Firewall

We added a firewall rule so the application servers could reach the arbiter.

Observations

  • The problem went away.

Conclusions and speculation

  • Something on the application servers is trying to talk to the arbiter.

Diagnosis

With this clue in hand, we found the following in MongoDB’s Server Discovery and Monitoring driver specification:

Second, Mongos 2.6 does not monitor arbiters, but drivers MUST do so. It costs little, and in rare cases an arbiter may be the client’s last hope to find the new replica set configuration.
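In other words, every driver monitors every replica set member, arbiters included, and our firewall had been silently dropping the application servers’ connection attempts to the arbiter. A dropped connection, unlike a refused one, blocks the client for its full connect timeout, which is how fixed-size delays crept into request handling. A minimal sketch of the difference (the helper, addresses and timeout values are our own illustration, not driver code):

```python
import socket
import time

def probe(host, port, timeout=1.0):
    """Attempt a TCP connection and report the outcome and elapsed time.
    A refused connection (closed port, REJECT rule) fails almost instantly;
    a silently dropped one (firewall DROP rule) blocks for the full timeout,
    which is how fixed-size delays stack up in anything waiting on it."""
    start = time.monotonic()
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        outcome = "connected"
    except OSError as exc:  # timeout, refusal and unreachable all land here
        outcome = type(exc).__name__
    return outcome, time.monotonic() - start
```

Run against a port behind a DROP rule, probe returns only after the full timeout has elapsed; against a merely closed port, it returns almost immediately. Once the firewall rule allowed the application servers to reach the arbiter, the driver’s monitoring checks stopped stalling and response times returned to normal.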