Bipsync has always used MongoDB as its primary data store. It’s an essential component in the technology stack because it ultimately facilitates Bipsync’s renowned configurability and flexibility. MongoDB’s popularity is increasing but it’s not as prevalent as other database solutions, and we’re sometimes asked why we chose it as our database.
The main reason is that MongoDB is a schema-less database. MongoDB’s data is stored in binary JSON (BSON) format, as opposed to the tabular model that is traditionally used by SQL databases like Microsoft’s SQL Server or PostgreSQL. This means that the data schema for MongoDB databases is flexible and dynamic, because each document within the database exists separately to the others and can have a different schema applied to it. The responsibility rests with the application, rather than the database, to define what format the data should take.
No Schema Migrations
This is especially the case with regards to database migrations, which are processes that need to be carried out when the design of a database changes. With an SQL database, these involve running SQL queries to manipulate the structure of the database, and this can often be a complex process that requires system downtime. This isn’t the case for MongoDB because the application defines the schema, so as our database architecture changes, we need only update our code, and in some cases the data itself, to reflect that.
It’s therefore much more straightforward for us to make schema changes and so we’re less hesitant to make modifications than we would be if we were using an SQL database. This means that we are able to develop and ship features very quickly. Crucially, it’s also the secret behind a key Bipsync feature: dynamic content types and fields.
Dynamic Data Definitions
Each Bipsync client is free to model their data in different ways, and MongoDB’s dynamic schema makes it possible for us to accommodate all of them. Each data point that the client wishes to store is defined as a document in the database. The definition document specifies the data type, its name, how it should be formatted when displayed to a user, and so on. Clients can model as many different types of content as they like, each with its own set of field definitions.
Because content type and field definitions are defined as data, they are confined to the client’s database and so don’t alter Bipsync’s overall database design. This is really important because it means that we don’t have lots of variations of schema across our production databases, which would greatly complicate our upgrade process and slow down development.
There are some easily deployable stock definitions which we use to configure an installation from a blueprint — such as a “hedge fund” model, an “endowment” model, or an “asset manager”. We can augment these definitions to match each client’s particular workflow or data model. Sometimes we start from scratch and completely customize an installation as per a client’s specific requirements- either approach has its merit, and we take advantage of both.
Since the creation of a new field definition document does not involve schema manipulation, they take effect instantly. Definitions can be added or removed within a matter of minutes. Installations can be moulded to fit client requirements in real-time, which dramatically reduces how long it takes to set up an installation for a client and onboard them on to it.
MongoDB has proved a perfect choice for storing data and making it generally available but we complement it with other solutions, for example for to provide full text search functionality, or to host files. While MongoDB does in theory support both these requirements, we’ve decided that other options are more optimal to our application’s needs.
Full-text search is a technique which facilitates the kind of queries we’ve come to expect from modern applications — it allows users to find documents or pieces of information based on the content of the text, rather than just searching by specific keywords or phrases. MongoDB offers full-text search but only within Atlas, its hosted database solution. We don’t use Atlas but even if we did, we’d probably still be inclined to use Elasticsearch which has powered our search functionality for a long time. It has a stronger feature set and continues to move forward technically, recently adding the ability to leverage machine learning models which seem destined to become a cornerstone of document identification in the future.
File storage is another consideration. While MongoDB offers GridFS as a way to store files within the database, that approach doesn’t scale to the volume of files that Bipsync needs to accommodate, which often involves terabytes of data. Several years ago we switched to Amazon’s S3 file storage solution. It’s not only a better tool for the job; the ease at which it can be integrated with other AWS services like Textract, Lambda and Comprehend means that we can operate on the files to mine additional information or trigger related operations within our system.
Replication for Reporting and Analytics
A recent addition to our data storage architecture can be seen in the diagram above: Amazon RDS. Our recently announced Reporting and Analytics Platform (RAP) uses a MySQL database within AWS Aurora to offer read-only access to Bipsync data either either through our dedicated Report Builder application or via a third-party tool such as Tableau, Microsoft Excel or Microsoft Power BI.
Replicating this data to a separate database gives us the opportunity to reshape the schema to something more suitable for the tasks that will be performed with it. For example, our MongoDB database schema has been designed very much with the Bipsync apps in mind. The collections, properties and indexes have been chosen such that they support our features and the platform operates as quickly as possible.
For reporting and analytics, we have different considerations. Firstly, there are thousands of permutations of queries which a client may want to run to arrive at the results they’re interested in. This means that we can’t be as prescriptive with the MySQL schema as we are with MongoDB, because we can’t predict how the data will be accessed. Our solution is to denormalize the schema as much as possible, so that the data is represented in as many formats as we think will be beneficial to querying users. This offers maximum flexibility.
Secondly, we have to consider the potential impact on the system if as this database is queried. Since these queries could be expensive to execute, or even executed in parallel with a number of other queries, there is a likelihood of high load being placed on the system. By offloading this load to Amazon RDS we ensure that this does not impact the performance of the Bipsync RMS for users in general, because our MongoDB database continues to operate independently.
Bipsync wouldn’t be Bipsync without MongoDB. It’s an innovative product which continues to offer useful features, such as support for time series data, and so we’re really happy with our choice of database solution. Where necessary we complement it with other solutions to expand the technological options available to us, and the result is a platform which is flexible, scalable, and positioned to increase in capability over the next few years.