Implementing data sync between a web API and a Core Data store

Craig Marvelley

April 29, 2015

In the last few months we implemented and refined a key feature of the Bipsync Notes iOS app: data synchronisation. The implementation of a sync solution has become a rite of passage for today’s developer since users increasingly expect that their data will be accessible and editable across all their computing devices. The task has the following requirements, which at first glance seem quite straightforward:

All content created and modified on a device is sent to a web API.
A device notifies the API of any deleted content.
The API makes all created and modified content available to all devices.
The API notifies all devices about content that has been deleted and should be purged.

However, a number of edge cases conspired to make this feature difficult to get right.

In the context of our app the term “content” actually refers to a number of things – plain text, HTML, JSON objects, files, and so on. It’s essential that the user is able to access the latest version of each content item effortlessly, and most importantly, that data is never lost.

Before beginning this task we read as much as we could on the topic, notably this issue of objc.io and the diary Brent Simmons kept while adding sync support to Vesper, itself a note taking app. Both of these sources gave us food for thought; the latter was also the inspiration for this very series. If you’re exploring the topic yourself, they are highly recommended reading.

We took a slightly different approach to those examples, which I’ll detail here. Let’s start with an overview of the various elements involved, then go into more depth on the client side since that’s where the majority of the complexity lies.

Overview

The image above illustrates how data moves between parts of the system (for more information on how our APIs are structured, check out this post). At the centre of it all is the web API. The API code is mostly domain specific so I won’t dwell on it, but imagine a sorta-RESTful affair that can return items of content both singly and in lists, alongside the usual create/update/delete endpoints one would expect. There are a few other endpoints for user authentication, app bootstrapping, searching, and so on. As a rule we’ve tried to keep the API as simple as possible: we’re intent on the app being just as functional offline as online, so the device needs to be as capable as the server.

Sending content to and from a device is largely an implementation detail. On the server, data is persisted in MongoDB and delivered to clients via HTTP responses crafted by PHP. On the device, changes are persisted in Core Data and periodically sent to the server via HTTP requests. We use the AFNetworking library to create and dispatch HTTP operations. Each time a note is created, updated or deleted, Core Data posts a notification. Our code transforms the notification data into an atomic operation which is added to an internal queue. As each operation is processed, HTTP requests are sent to the server so the corresponding entities can be created, updated or deleted according to the operation type.
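To make the notification-to-queue flow concrete, here is a minimal sketch. Python is used purely for illustration, and everything in it is an assumption rather than our actual code: the `SyncOperation` type, the `/notes` paths, the HTTP verbs, and the `send` callback are all hypothetical. The `change` dictionary is modelled loosely on the inserted, updated and deleted object sets that a Core Data change notification carries.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum


class OpType(Enum):
    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"


@dataclass(frozen=True)
class SyncOperation:
    """One atomic unit of work for the sync queue (hypothetical type)."""
    op_type: OpType
    note_id: str


# Hypothetical mapping from operation type to HTTP verb.
HTTP_VERBS = {OpType.CREATE: "POST", OpType.UPDATE: "PUT", OpType.DELETE: "DELETE"}


def operations_from_change(change):
    """Turn one change notification into a list of atomic operations.

    `change` is a dict with optional "inserted", "updated" and "deleted"
    lists of note identifiers.
    """
    ops = []
    for key, op_type in (
        ("inserted", OpType.CREATE),
        ("updated", OpType.UPDATE),
        ("deleted", OpType.DELETE),
    ):
        for note_id in change.get(key, []):
            ops.append(SyncOperation(op_type, note_id))
    return ops


def drain(queue, send):
    """Process operations in FIFO order.

    `send(verb, path)` returns True on success. On the first failure the
    operation stays at the head of the queue so it can be retried later.
    """
    while queue:
        op = queue[0]
        path = "/notes" if op.op_type is OpType.CREATE else f"/notes/{op.note_id}"
        if not send(HTTP_VERBS[op.op_type], path):
            break
        queue.popleft()
```

Leaving a failed operation at the head of the queue is one possible design choice: it preserves ordering, which matters if a later update depends on an earlier create having reached the server.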

The logic that manages the creation and execution of these operations is deserving of a dedicated post as it’s interesting in its own right. With the ability to send data back and forth between client and server the groundwork has been established. The tricky part involves determining the extent of what should be sent to the client.

Making it smart

After a period of substantial use a typical Bipsync user will have hundreds, if not thousands, of notes. Factoring in all related metadata (labels, contacts, and so on) means that users’ note data on disk will likely take up several megabytes. Add in attachments, embedded images, clipped webpages and so on and it quickly becomes clear that we have to be judicious as to what we send to the app, and when. We’re mindful of users’ data allowances, app load times, and any other side effect caused by us sending too much data to the device.

Our first step was to store a revision number against records that we send to the app. The number is incremented each time the record is modified. This allows us to identify records that the client does not yet have, and only send those to the device. Initially we’ve only added revision fields to records within collections we expect to grow – some collections will never number more than a handful of records, so it’s arguable whether the added complexity is worth the reduction in bandwidth. For early versions of the app we decided it wasn’t.
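As an illustration of how revision numbers can narrow down what gets sent, here is a small sketch. It assumes the client reports the revision it holds for each record and that revisions start at 1; both are assumptions made for the example, as are the function name and the record shape.

```python
def records_to_send(server_records, client_revisions):
    """Return the server records the client lacks or holds at an older revision.

    server_records:   dict of record id -> record dict containing "revision"
    client_revisions: dict of record id -> revision the client already has
    Both shapes are hypothetical. A record the client has never seen is
    treated as revision 0, so it is always sent.
    """
    return [
        record
        for record_id, record in server_records.items()
        if client_revisions.get(record_id, 0) < record["revision"]
    ]
```

A client holding `{"n1": 3, "n2": 1}` against a server with `n1` at revision 3, `n2` at revision 2 and a new `n3` at revision 1 would receive only `n2` and `n3`.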

You might be thinking that as collection sizes grow we’ll begin to send a substantial amount of data with each request, which will increase the time they’ll take to execute. This is true. If and when this becomes an issue we’d investigate other solutions – perhaps by utilising timestamps (though due to the nature of our data model this would likely be too simplistic an approach), or by replaying an audit log constructed through event sourcing to arrive at eventual consistency. In the meantime, in the spirit of agile development we elected to ship with the simplest approach that worked.

One from many

A sync event is actually made up of a series of discrete operations which when executed in the correct order bring the data store on the device to parity with that on the server. Currently the sequence looks like this (fetch means to bring data from the server to the device, push goes the other way):

Fetch all records which are used to contextualise notes (labels, contacts, etc.).
Push all new notes which have been created on the device.
Push all notes which already exist on the server but have been modified on the device.
Push “tombstone” records for any notes that exist on the server but have been deleted on the device.
Fetch any new or modified notes.
Fetch any “tombstone” records for notes deleted on the server.
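The sequence above can be sketched as an ordered list of steps run one after another. The step names and the `handlers` mapping are hypothetical, and stopping at the first failed step is an assumption of this sketch rather than a description of how the app behaves; the reasoning is that fetching notes after a failed push could overwrite local changes that never reached the server.

```python
# Each step is (direction, what). The order mirrors the list above.
SYNC_STEPS = [
    ("fetch", "context"),         # labels, contacts, etc.
    ("push", "new_notes"),
    ("push", "modified_notes"),
    ("push", "tombstones"),
    ("fetch", "notes"),           # new or modified notes
    ("fetch", "tombstones"),
]


def run_sync(handlers):
    """Run every sync step in order and return the steps that completed.

    `handlers` maps each (direction, what) pair to a zero-argument callable
    returning True on success. The run stops at the first failure so that
    later fetches never happen before local changes have been pushed.
    """
    completed = []
    for step in SYNC_STEPS:
        if not handlers[step]():
            break
        completed.append(step)
    return completed
```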

Most of these operations are quite straightforward. We got the idea to use tombstone records from the objc.io article I referenced earlier; since there’s no use in sending the actual content of deleted notes to the device, we instead send a list of note identifiers (the so-called “tombstones”) which mark a record as having been deleted, and purge the corresponding Core Data entities from the local store.
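A minimal sketch of applying tombstones on the client might look like the following. The local store is modelled as a plain dictionary standing in for the Core Data store, and the function name is hypothetical.

```python
def apply_tombstones(local_store, tombstone_ids):
    """Purge tombstoned notes from the local store.

    local_store:   dict of note id -> note (a stand-in for Core Data)
    tombstone_ids: iterable of note ids the server reports as deleted
    Returns the ids that were actually removed. Ids that are not present
    locally are ignored, so applying the same tombstones twice is harmless.
    """
    purged = []
    for note_id in tombstone_ids:
        if local_store.pop(note_id, None) is not None:
            purged.append(note_id)
    return purged
```

Ignoring unknown identifiers keeps the operation idempotent, which is useful if a sync is interrupted and the same tombstones are delivered again.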

This approach has been working well, though we did make a small modification post-launch to improve the usability of the application when offline. Initially we aped the way mail clients download a list of “headers” of new email items and queue the retrieval of their body content for a later time. Should the user open an email which the app has yet to download, it is fetched on demand. We did a similar thing, downloading the title and metadata of each note immediately but leaving the rich text of the body – which is relatively large – until later. This was functional but didn’t really suit the way our users tend to work, since they are often without connectivity (e.g. on an airplane) yet always want access to all of their content. Now we download everything up front, which results in longer sync times but a more predictable application: once a user has synchronised they can feel assured they have all their content to hand.

You may have noticed that the list of sync operations is broadly ordered by the following pattern: fetch non-note content; push note content; fetch note content. The logic is that, firstly, it’s always safe to fetch contextual content since it can’t yet be modified on the device (we use it for browsing / association only). We then send all data that the user has created on the device – new notes, modified notes, deleted notes – as we want to ensure that it’s not overwritten or replaced by any content that may come back from the server (more on that in a bit). Once we’re sure all data originating on the device has been sent to the server, we then fetch any new/modified/deleted notes and update our local store as appropriate.

By prioritising content that originates on the device over content that originates on the server we pass responsibility for determining what happens in the event of a conflict – a situation whereby the same record has been modified or deleted on both the device and the server – to the server, where the logic can be centralised. Currently we store the changes from both sources in version history and let the last change ‘win’; version history is accessible to the user so they are able to revert this if they wish. Another option would have been to implement a ‘forking’ approach, used by Apple’s Notes app and Dropbox among others, which would result in two documents being created at the point of divergence – each containing the changes from a single source. We haven’t ruled out implementing such an approach further down the line.
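The “last change wins, but nothing is lost” idea can be sketched as follows. This is an illustration under stated assumptions, not the server’s actual logic: the `Version` and `Note` types are hypothetical, and “last” is taken to mean the order in which changes arrive at the server rather than any timestamp comparison.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Version:
    body: str
    source: str  # e.g. "device" or "server" (hypothetical labels)


@dataclass
class Note:
    current: Version
    history: list = field(default_factory=list)


def apply_change(note, incoming):
    """Make `incoming` the current version; archive the one it replaces."""
    note.history.append(note.current)
    note.current = incoming


def revert(note, index):
    """Restore history[index] as the current version.

    The version being replaced is appended to history, so a revert can
    itself be undone. History is never truncated.
    """
    restored = note.history[index]
    note.history.append(note.current)
    note.current = restored
```

If an edit made on the web arrives first and an edit from a phone arrives second, the phone’s version becomes current while the web edit remains in history, ready to be restored with `revert`.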

Conclusion

At this point we have a solid synchronisation engine that delivers the requirements we identified earlier. We’ll doubtless refine it as the app continues to grow, but right now it’s exciting to see the same data appear across all the applications we offer.