In recent years, we have onboarded clients with significantly larger research repositories than we had previously seen. This is in part because the organizations themselves were sizeable (being top-tier university endowments and asset managers), but also because they were bringing with them many years’ worth of data, sourced from a legacy RMS. A typical data migration involves hundreds of thousands of research documents, and files totaling over 1TB in size.
This step change in data volume required us to fine-tune Bipsync to scale for enterprise performance. Our aim was to keep the system responsive and functional through even its most demanding operations.
We achieved this in several ways.
Note: this post references key parts of our tech stack. If you’re not familiar with our system, you may wish to first read this post for an overview: Bipsync RMS Technology Overview.
Our first step in any performance improvement work is to ensure that we have appropriate metrics at hand, so that we can measure the impact of our changes against the original performance levels.
The metrics we use will differ depending on the area of the stack. To trace issues with database performance we’d typically reference MongoDB’s slow query log. On the server, we watch for memory exhaustion errors in the logs. In the browser, we have each browser’s developer tools to inspect CPU and memory usage. However, this work called for more comprehensive metrics, and New Relic is invaluable in this regard.
New Relic enables us to profile the entire application stack in one go. It identifies areas where performance is slower than expected, but crucially it also specifies the exact location in the stack where the issue is caused — from the application interface through to the code running on the server, and interactions with the database. With this information we were able to pinpoint exactly where we needed to focus our efforts in order to reduce latency within the app.
The New Relic data indicated that our database design was ripe for improvement. With the less expansive datasets our database was performing adequately, but as the scale increased we found that certain queries made by the application had poor performance. This was usually caused by one of two things:
- Inappropriate or missing database indexes
- Less-than-optimal schema design
Our first step was to upgrade the version of MongoDB we were running. This unlocked several features which we were not previously able to take advantage of. One of the most important was the ability to use wildcard indexes.
Wildcard indexes are important because MongoDB limits the number of indexes that can be applied to a collection to a maximum of 64. Bipsync’s flexible schema, which is at the heart of its infinite configurability, is made up of dynamic fields — data points which are defined for each client, which can change over time. This means that we could easily have more than 64 fields which we frequently query on a given collection. In the past, this meant that if we hit the 64 index limit we had to be selective in which fields we indexed; queries involving non-indexed fields were much slower as a result.
With wildcard indexes, we were able to spread an index across multiple dynamic fields, meaning that in the majority of cases a single wildcard index is sufficient to index all the dynamic fields on a given collection. We could add additional indexes to cover specific use cases, such as when we need to support reverse sorting for some fields (like dates), but generally the number of indexes dropped considerably, and we now never come close to hitting the limit. This made an enormous difference to application speed because many database queries now execute in a fraction of the time.
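The wildcard approach can be sketched as follows. This is a minimal illustration, not Bipsync’s actual schema: the `dynamic` field name and the field count are assumptions, and the pymongo call shown in the comment is how such an index would typically be created.

```python
# Sketch: replacing many single-field indexes with one MongoDB wildcard index.
# Field and collection names here are illustrative, not Bipsync's real schema.

def single_field_indexes(fields):
    """One index spec per dynamic field -- this quickly approaches
    MongoDB's 64-indexes-per-collection limit."""
    return [[(f"dynamic.{name}", 1)] for name in fields]

def wildcard_index():
    """A single wildcard index spec covering every subfield of `dynamic`."""
    return [("dynamic.$**", 1)]

# With pymongo, the wildcard index would be created like:
#   collection.create_index([("dynamic.$**", 1)])

fields = [f"field_{i}" for i in range(100)]  # e.g. 100 client-defined fields
print(len(single_field_indexes(fields)))     # 100 separate indexes needed
print(len([wildcard_index()]))               # vs. a single wildcard index
```

Specific sort requirements (such as descending date sorts) can still get their own dedicated indexes alongside the wildcard one.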
An audit of our database schema’s design helped us realize that it was optimized for how the data was modeled (a practice known as normalization), but not for how it was read. A document database like MongoDB often encourages redundancy in data because that’s the best way to get optimized queries; joining collections together to achieve a desired result set is possible but usually results in poorer performance than if that data could be read from a collection in one go.
We subsequently denormalized elements of our schema to compile pre-computed views of certain groups of data, which adds overhead when writing data but makes the reading of it much faster. This approach follows a pattern known as CQRS and we decided it was appropriate given that reads are much more frequent than writes in our system. We saw an immediate reduction in database query times across our application, leading to a more responsive user interface.
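The write-side/read-side split described above can be sketched in a few lines. This is a hypothetical, in-memory illustration of the CQRS-style pattern, not Bipsync’s implementation; in practice the pre-computed view would live in its own MongoDB collection, and the names (`research_docs`, `doc_counts_by_fund`) are invented for the example.

```python
# Minimal CQRS-style sketch: writes pay a small extra cost to maintain a
# pre-computed read model, so reads become a cheap lookup.

research_docs = []        # write model: the source of truth
doc_counts_by_fund = {}   # read model: a denormalized, pre-computed view

def add_document(doc):
    """Write path: store the document and update the read model."""
    research_docs.append(doc)
    fund = doc["fund"]
    doc_counts_by_fund[fund] = doc_counts_by_fund.get(fund, 0) + 1

def count_for_fund(fund):
    """Read path: an O(1) lookup instead of a scan or aggregation."""
    return doc_counts_by_fund.get(fund, 0)

add_document({"fund": "Fund A", "title": "Q3 notes"})
add_document({"fund": "Fund A", "title": "Meeting memo"})
print(count_for_fund("Fund A"))  # 2
```

The trade-off is deliberate: because reads vastly outnumber writes in the system, slightly slower writes buy much faster reads.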
We also critiqued the way that we store documents. We originally stored files within our database, but by moving them to Amazon S3 instead we reduced database size significantly while affording ourselves the opportunity to build workflows around file processing in Amazon Web Services, further reducing load on the database.
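A rough sketch of that split, with file bytes going to S3 and only a small metadata record going to the database. The bucket name, key scheme, and record shape are assumptions for illustration, not Bipsync’s actual layout; the boto3 call in the comment is the standard way such an upload would be performed.

```python
# Sketch: store file bytes in S3, keep only metadata plus an S3 pointer
# in the database. Bucket name and key scheme are illustrative assumptions.
import hashlib

BUCKET = "research-files"  # hypothetical bucket name

def s3_key(client_id, filename, data):
    """Content-addressed key: re-uploading identical bytes is idempotent."""
    digest = hashlib.sha256(data).hexdigest()
    return f"{client_id}/{digest}/{filename}"

def store_document(client_id, filename, data):
    key = s3_key(client_id, filename, data)
    # With boto3 the upload would look roughly like:
    #   boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=data)
    # Only this small metadata record is written to the database:
    return {"filename": filename, "size": len(data), "s3_key": key}

record = store_document("acme", "notes.pdf", b"%PDF-1.4 ...")
print(record["s3_key"])
```

Keeping large blobs out of the database shrinks it considerably, and having the files in S3 opens the door to AWS-side processing workflows.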
In some areas we shifted processing to the server, using new collections that were optimized for the task and are therefore much faster to query. The result was a component that ran nearly ten times faster in some cases, which made a huge difference to some users’ workflows.
In other cases we realized that we had better tools for performing some of our operations. For example, we display totals for the number of research documents within a given context (like an invested company or fund). Previously, we were determining this value by using a count operation on a MongoDB collection. That method is not always recommended because in MongoDB terms it is a relatively expensive operation. So instead we shifted to using ElasticSearch, the tool we use for full-text search, to count the number of documents in a context. This is a job that is perfectly suited to ElasticSearch and it was able to arrive at the same result in a fraction of the time. This not only sped up the retrieval of these numbers but also put less strain on the database, freeing it up to perform other operations instead which again improved the overall responsiveness of the system.
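The Elasticsearch count described above uses the `_count` API, which returns a document count without fetching any documents. Below is a minimal sketch; the index name (`research`) and field name (`fund_id`) are assumptions for illustration, not the actual mapping.

```python
# Sketch: counting research documents for a context via Elasticsearch's
# _count API rather than a MongoDB count. Index/field names are assumed.

def count_query(context_field, context_id):
    """Request body for POST /research/_count -- a cheap, metadata-only
    operation on the Elasticsearch side."""
    return {"query": {"term": {context_field: context_id}}}

body = count_query("fund_id", "fund-123")
# With the official Python client this would be roughly:
#   resp = Elasticsearch(...).count(index="research", body=body)
#   total = resp["count"]
print(body["query"]["term"]["fund_id"])  # fund-123
```

Because the count never touches MongoDB, the database is left free to serve other queries.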
Bipsync’s infrastructure is hosted on Amazon EC2, which makes it a very simple process to increase the hardware profile of our servers: extra CPU, memory, disk space and so on. Combined with integrated profiling and alerting that makes us aware when a system’s resources are constrained, this means we were easily able to dedicate additional resources to installations which required it.
Our infrastructure has also been designed to scale horizontally, which means that we are able to add additional server instances to the pool if necessary. This is a handy alternative to adjusting the instance profile, because it’s often more beneficial from both a cost and complexity perspective to make more instances available to handle the load that the system is under. We’re able to flex this up and down as necessary.
Hopefully this post has communicated some of the techniques we’ve used to scale Bipsync for enterprise performance. It’s a continual process because we’re always looking for ways to improve, but we’ve made great strides, and the feedback we have received from clients is proof that the changes have had an overwhelmingly positive effect. Since making these changes we’ve onboarded many clients with similar data profiles, and we’ve been heartened to see users make the most of Bipsync without coming close to challenging its ability to operate quickly.