Automated Data Labelling
Started in December 2020
There is a lot of data going around within large organisations, much of which is unstructured: think of emails, documents and multimedia. Properly labelling and classifying all this information is a daunting task, as it involves a high degree of manual effort.
In practice, this means that data is often left unclassified, which makes it difficult to manage and protect properly. Security policies include measures to protect data, but these measures are typically applied based on the classification of the data.
Unstructured data must therefore be classified before it can be properly protected. The classification can be automated based on so-called 'data labels' that describe the characteristics of the data and the kind of information it contains. However, labelling unstructured data is in itself a very complex and time-consuming task, which makes it near-impossible to properly label, and ultimately classify, large amounts of unstructured data.
This project aims to find or develop a methodology to automatically label unstructured data with a high level of detail and precision. The goal is to come up with a flexible and scalable approach that allows new labels to be added, or existing labels to be altered, over time. (Semi-)supervised learning has shown great promise in complex text processing and classification tasks and will be examined for its suitability in labelling unstructured data sources.
Activities in Explore phase
The assumptions underlying this project are that machine learning solutions are not yet on the market, and that the method reduces subjectivity, mistakes and workload, increases the coverage of labelled documents, and improves discoverability and ultimately compliance. To validate these assumptions, the project team held a variety of meetings, talked to vendors and to privacy and compliance officers, and performed desk research.
Conclusion at the end of the Explore phase
There are solutions available on the market, but none of these seemed to fully meet the needs of the financial organizations as set out in this project. The primary gaps lie in the level of accuracy, detail and flexibility that these tools offer; many provide standard solutions that may or may not satisfy the actual needs. There is a clear incentive to explore what more advanced techniques can bring to the table.
Project results of the Proof of Concept phase
The goal of the Proof of Concept (PoC) phase was to implement and test a variety of methods that can be used to automate the data labelling process. These methods ranged from often-used word-count-based methods up to advanced neural networks that can also take contextual information into account. The advanced methods showed benefits in more complex tasks (e.g. classifying specific words or sentences, especially when the label depends on context), whereas for more basic label assignments (e.g. document classification) there was no notable difference.
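As a rough illustration of the word-count-based end of that range, the sketch below labels whole documents with a multinomial naive Bayes classifier over raw word counts. It is not the project's implementation: the class name WordCountLabeller, the toy documents and the labels "medical" and "fraud" are all hypothetical and chosen only to make the example self-contained.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class WordCountLabeller:
    """Hypothetical example: multinomial naive Bayes over a bag of words."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of documents
        for text, label in zip(documents, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Return the unnormalised log-probability of each label."""
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(doc_count / total_docs)
            for word in tokenize(text):
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data (assumed for illustration, not project data).
train_docs = [
    "patient diagnosis and treatment plan",
    "hospital invoice for patient medication",
    "suspicious transaction flagged on account",
    "account takeover and fraudulent transaction report",
]
train_labels = ["medical", "medical", "fraud", "fraud"]

labeller = WordCountLabeller().fit(train_docs, train_labels)
print(labeller.predict("treatment for the patient"))
print(labeller.predict("fraudulent account transaction"))
```

A model like this only sees which words occur and how often, not their order or context, which is one plausible reason why it can be adequate for document-level labels while context-dependent word or sentence labels call for richer models.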
The second activity in the PoC-phase was the continuation of the market exploration. The extended exploration further strengthened the view that a range of solutions are available but also that these come with their limitations. Further examination of the solutions and actual business needs is a logical next step to identify the true gaps for practical implementation and to more specifically determine what business value the more advanced labelling methods can bring.
Activities within the extended PoC phase
Instead of going directly to the pilot phase, it was deemed more suitable to extend the PoC phase. The goal of this extension is threefold:
- Further examine the market solutions in relation to the business needs. While we determined that advanced machine learning methods can perform more complex tasks, the question remains to what extent those tasks are an actual problem to solve. A viable compromise could also be to work together with a market player or to extend an existing solution with specific functionality.
- Explore new techniques. A more efficient training procedure can be realized with methods such as semi-supervised learning or active learning, the latter being an interactive approach between the learning system and the human experts.
- Find suitable datasets and gain access to that data. Any operational implementation requires access to large amounts of data.
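The active learning idea mentioned in the list above can be sketched as an uncertainty-sampling loop: the model asks a human expert to label the document it is least sure about, then retrains. This is a minimal sketch under stated assumptions, not the method used in the project. The small naive Bayes model, the function names, the toy documents and the expert_answers lookup (a stand-in for a human annotator) are all hypothetical.

```python
import math
from collections import Counter


def train(labelled):
    """Count words per label; `labelled` is a list of (text, label) pairs."""
    counts = {}
    for text, label in labelled:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def posterior(counts, text):
    """Naive Bayes posterior per label (add-one smoothing, uniform priors)."""
    vocab = {w for c in counts.values() for w in c}
    log_scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        log_scores[label] = sum(
            math.log((c[w] + 1) / (total + len(vocab)))
            for w in text.lower().split()
        )
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp.values())
    return {label: v / norm for label, v in exp.items()}


def most_uncertain(counts, pool):
    """Uncertainty sampling: pick the document whose best label is least probable."""
    return min(pool, key=lambda text: max(posterior(counts, text).values()))


def active_learning(seed, pool, ask_expert, rounds):
    """Query the expert for the least certain document, add it, and retrain."""
    labelled, pool = list(seed), list(pool)
    queried = []
    for _ in range(min(rounds, len(pool))):
        counts = train(labelled)
        query = most_uncertain(counts, pool)
        pool.remove(query)
        labelled.append((query, ask_expert(query)))
        queried.append(query)
    return train(labelled), queried


# Toy data (assumed for illustration only).
seed = [
    ("patient diagnosis report", "medical"),
    ("fraudulent transaction report", "fraud"),
]
pool = [
    "patient treatment plan",
    "transaction on account",
    "quarterly report summary",
]
# Hypothetical stand-in for the human expert: a fixed lookup of answers.
expert_answers = {
    "patient treatment plan": "medical",
    "transaction on account": "fraud",
    "quarterly report summary": "other",
}

model, queried = active_learning(seed, pool, expert_answers.get, rounds=2)
print(queried)
```

In this toy run the first query is the document the seed model cannot tell apart at all, so the expert's time goes to the example that is most informative for the model rather than to documents it already labels with some confidence.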
Results of the extended PoC phase
An important result of the PoC phase was that we found relevant operational datasets within the participating organizations. In addition, a white paper was published on this website. The paper describes the lessons learned from the PCSI in applying machine learning to labelling unstructured data for medical and fraud-related personally identifiable information (PII). We describe five value propositions for financial organizations in terms of accuracy, flexibility, complexity, resolution and explainability.
In the Pilot phase, we have started to annotate data at two different sites. We are also working together with data scientists and platform engineers from the partners' organizations to check whether our models can run on the annotated files in the production environment of the financial organizations.
Exploit phase and final result
Within this project, we've come up with a solution to protect unstructured data using data labelling. A machine learning model and pipeline are available upon request. More information can be found here, or get in touch via email@example.com.
This project is part of the trend