How to build a National Data Library

How would you build this centrepiece of the UK's new "AI Action Plan" anyway? Jamie Hutton muses.

Image credit: https://unsplash.com/@vnwayne

One of the centrepieces of the government’s AI Opportunities Action Plan is the creation of a National Data Library (NDL), writes Jamie Hutton, CTO of Quantexa. It plans to identify five high-impact public datasets and provide a centralised and secure system for accessing them. The NDL will be available to public organisations, private companies and academic researchers. The use cases are broad, ranging from healthcare research and policy development to building solutions that improve public services.

The big questions are: how do you build one, and what exactly is being built?

When is a library not a library? 

The answer, of course, is when it doesn’t contain any books. Or, in this case, data. Opinions and precedents differ on whether the NDL should be a centralised data repository or a decentralised, federated data platform that connects existing databases and facilitates data exchange without storing data centrally.

The UK Biobank is an example of the former approach. It contains genetic, health and lifestyle data from half a million participants that can be accessed globally by approved academic, commercial, charitable and government researchers. 

Lessons from building it include the importance of standardising data collection to ensure interoperability, and of building public trust by communicating clearly about security, privacy and societal benefits.

Estonia’s X-Road platform is an example of the latter. X-Road is an open-source distributed information exchange that connects data across public organisations, each with its own separate information system. Estonians use X-Road when interacting with e-government services, such as health, tax, school and residency. The Estonian government estimates it saves the public, civil servants and public sector workers 1,345 years of working time every year.

Lessons from the platform reinforce the need for interoperability, as well as the importance of ensuring that data is secure and cannot be corrupted. Crucially, it is also user-centric and operates on the ‘Only Once Principle’. Estonia’s citizens don’t need to know how it works, just that it does. They provide their data only once, and it is automatically updated across all relevant systems. It is a feature similar to the NHS’s ambition to create ‘single patient records’.

A final note is that while X-Road is estimated to facilitate over one billion data exchanges a year, Estonia has a population of 1.3 million people. The UK Biobank holds half a million records. The UK population is nearly 70 million, so a National Data Library would be a considerably larger undertaking.

There are other examples and propositions of how an NDL could operate. But for the purposes of this piece, these two case studies help explain the two key divergent approaches. For more on what the NDL could look like, the Wellcome Trust – no affiliation – held a technical whitepaper challenge that produced excellent analysis on this topic.

To Build a Library, You Need Solid Foundations

If successful, the National Data Library will be the largest data unification ever undertaken globally. But regardless of what the library ends up containing, the foundations are the same: high-quality, trusted data.

The public sector has been collecting data for decades, and that data sits across different departments. These range from healthcare (NHS) to pensions (Department for Work and Pensions), from company records (HMRC and Companies House) to educational records (Department for Education). Not to mention the dozens of agencies and public bodies within and around them.

The NHS holds the largest healthcare dataset in the world and should therefore be the centrepiece of the NDL’s collection. However, even this single institution illustrates the complexity of master data management.

While our health system is referred to in the singular – the ‘NHS’ – it is a collection of departments, commissioning and provider organisations, regions and systems. There is also an increasing need to coordinate care across local authorities and social care, particularly as our population is ageing and creating pressure on hospitals. Patient data is therefore siloed across legacy systems and fragmented IT infrastructure, often without common data formats or standards, and it is often incomplete, outdated or inaccurate. All of which regularly results in duplicate entries in different (siloed) repositories. The goal for the NHS is to create a single view of truth by matching data across these multiple sources.

The MDM Transformation Challenge

Traditional Master Data Management has an inherent data-quality problem. These models take an age to ingest source system feeds that are often beset by data quality issues. They also rely on data matching, which compares each data string and applies a score across it to create a record-to-record match. 

These probabilistic matching engines use algorithms that evaluate and score the matches. This is far from ideal for patient records, which can appear in several variations and carry multiple identifying attributes. On top of which, if records have a slight data quality problem, such as missing data or name variations (e.g. Bob vs Robert), record-to-record matching can fail to identify them as the same person. It’s an important and often overlooked reason why NHS IT reforms in health and social care have previously come unstuck.
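To make the limitation concrete, here is a minimal sketch of record-to-record string scoring in Python. It is an illustration only, not how any particular MDM product works: the `similarity` and `is_match` helpers, the 0.85 threshold and the sample names are all assumptions, and the score comes from the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

# Assumed cut-off for declaring two records a match; real engines tune this.
MATCH_THRESHOLD = 0.85


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two name strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float = MATCH_THRESHOLD) -> bool:
    """Record-to-record decision based on a single string score."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    # Invented example names: a spelling slip versus a nickname.
    pairs = [
        ("Robert Smith", "Robert Smyth"),
        ("Bob Smith", "Robert Smith"),
    ]
    for left, right in pairs:
        score = similarity(left, right)
        print(f"{left!r} vs {right!r}: score={score:.2f}, match={is_match(left, right)}")
```

With this threshold, the one-letter spelling slip still clears the bar, but the nickname does not, even though a human reader would readily accept both as the same person.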

A more powerful approach to managing patient and citizen data, which is essential for unifying data across silos, is Entity Resolution (ER). ER uses a schema-agnostic model, which saves data engineering teams the time and cost of performing preliminary data conversions. It also leverages the full richness of data from all the records in the system to create the best possible view and to avoid the usual challenges associated with missing, inaccurate or incomplete data.
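The idea of drawing on every record, rather than comparing two at a time, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Quantexa's or the NHS's implementation: the records, field names, the `linked` rule (same surname plus one other shared attribute) and the `resolve` function are all hypothetical, and the clustering uses a basic union-find.

```python
from itertools import combinations

# Invented records from three imaginary source systems; None marks missing data.
records = [
    {"id": "gp-1", "name": "Bob Smith", "dob": "1980-02-14", "postcode": "LS1 4AP", "phone": None},
    {"id": "hosp-7", "name": "Robert Smith", "dob": "1980-02-14", "postcode": None, "phone": "0113 496 0001"},
    {"id": "care-3", "name": "R. Smith", "dob": None, "postcode": None, "phone": "0113 496 0001"},
    {"id": "gp-2", "name": "Roberta Smith", "dob": "1975-09-30", "postcode": "M1 2AB", "phone": None},
]


def surname(record):
    """Last token of the name, lower-cased."""
    return record["name"].split()[-1].lower()


def linked(a, b):
    """Assumed rule: same surname plus at least one other shared, non-missing attribute."""
    if surname(a) != surname(b):
        return False
    return any(a[f] is not None and a[f] == b[f] for f in ("dob", "postcode", "phone"))


def resolve(recs):
    """Group record ids into entities by following links transitively (union-find)."""
    parent = {r["id"]: r["id"] for r in recs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in combinations(recs, 2):
        if linked(a, b):
            parent[find(a["id"])] = find(b["id"])

    clusters = {}
    for r in recs:
        clusters.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(sorted(ids) for ids in clusters.values())


if __name__ == "__main__":
    for entity in resolve(records):
        print(entity)
```

In this toy data, `gp-1` and `care-3` share no attribute directly, yet both connect to `hosp-7` (one by date of birth, the other by phone number), so all three resolve to one entity while `gp-2` stays separate. A production system would need far more careful rules to avoid false links.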

Lessons from the NHS for the National Data Library

NHS England’s Federated Data Platform (FDP) is an example of current efforts to eliminate data silos. The FDP aggregates local data for use at hospital trusts, reducing the need for multiple logins and enabling faster, more coordinated care at a regional level.

In many ways the FDP is a version of the National Data Library writ small. Once data has been stitched together locally, it can then be used at a more macro level, providing a foundation for wider NHS transformation efforts. The Health Secretary now wants to create a ‘single patient record’ that contains every patient’s health records and doctors’ notes and is accessible through the NHS app. This would join up data across all regions, recognising that patient interactions with health and social care span regions and departments.

Applying Context to Improve Outcomes

As mentioned, the government will identify five public datasets to comprise the National Data Library. To support maximum value from the NDL, there should be a clear process for validating and approving use cases, because all the data in the world is only as valuable as the outcomes it enables. The government is working with the public and private sectors to build a library of problem statements and use cases through its AI Opportunities Action Plan.

One example is creating an NDL that allows the NHS to access DWP data to verify eligibility for benefits. Doing so will likely reduce instances of fraud so that money can be spent on the people who really need it. This data will also help policymakers understand links between health outcomes, employment status and socioeconomic factors. This approach allows government departments to build context around our citizens, providing important information whenever the public interact with different services.

Interoperability allows these systems to communicate, ensuring data is readily available without the need for manual processing or duplication.

How to Build a National Data Library

The government’s ambition is as laudable as it is ambitious. It remains to be seen whether and in what form a National Data Library will materialise. But as the saying goes, it’s as much about the journey as it is the destination.

Our public institutions need ambitious digital transformation projects. The journey to building a National Data Library with the principles outlined above should usher in a decade of reform that will improve outcomes for our citizens and encourage growth for the UK.  
