Automated Data Labelling
Started in December 2020
Large organisations handle a lot of data, much of which is unstructured: think of emails, documents and multimedia. Properly labelling and classifying all this information is a daunting task, as it involves a high degree of manual effort.
In practice, this means that data is often left unclassified, which makes it difficult to manage and protect properly. Security policies include measures to protect data, but these measures are typically applied based on the classification of the data.
To properly protect unstructured data, it must therefore be classified. The classification can be automated based on so-called 'data labels' that describe the characteristics of the data and the kind of information it contains. However, labelling unstructured data is a very complex and time-consuming task in itself, making it near-impossible to properly label, and ultimately classify, large amounts of unstructured data.
This project aims to find or develop a methodology to automatically label unstructured data with a high level of detail and precision. The goal is to come up with a flexible and scalable approach that allows new labels to be added and existing labels to be altered over time. (Semi-)supervised learning has shown great promise in complex text processing and classification tasks and will be examined for its suitability in labelling unstructured data sources.
Activities in Explore phase
The assumptions underlying this project are that such a methodology is not yet on the market and that it reduces subjectivity, mistakes and workload, increases the coverage of labelled documents, and improves discoverability and ultimately compliance. To validate these assumptions, the project team held a variety of meetings, talked to vendors and to privacy and compliance officers, and performed desk research.
Conclusion at the end of the Explore phase
There are solutions available on the market, but none of them seemed to fully meet the needs set out in this project. The primary gaps lie in the level of accuracy, detail and flexibility that these tools offer; many provide standard solutions that may or may not satisfy the actual needs. There is a clear incentive to explore what more advanced techniques can bring to the table.
Activities in Proof of Concept phase
The goal of the Proof of Concept (PoC) phase was to implement and test a variety of methods that can be used to automate the data labelling process. These methods ranged from commonly used word-count-based methods to advanced neural networks that can also take contextual information into account. The advanced methods showed benefits in more complex tasks (e.g. classifying specific words or sentences, especially when the label depends on context), whereas for more basic label assignments (e.g. document classification) there was no notable difference.
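To make the word-count-based end of that range concrete, the sketch below assigns a document-level label with a small multinomial Naive Bayes classifier over raw word counts. It is an illustration only, not the method used in the PoC: the class name `WordCountLabeller`, the labels `financial` and `hr`, and the training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict

PUNCTUATION = ".,;:!?()'\""


def tokenize(text):
    # Lowercase, split on whitespace and strip surrounding punctuation.
    tokens = (raw.strip(PUNCTUATION).lower() for raw in text.split())
    return [tok for tok in tokens if tok]


class WordCountLabeller:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter(labels)      # label -> number of documents
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def scores(self, document):
        # Log-probability of each label given the words in the document.
        tokens = tokenize(document)
        total_docs = sum(self.label_counts.values())
        result = {}
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            log_p = math.log(n_docs / total_docs)
            for tok in tokens:
                log_p += math.log((counts[tok] + 1) / denominator)
            result[label] = log_p
        return result

    def predict(self, document):
        label_scores = self.scores(document)
        return max(label_scores, key=label_scores.get)


# Hypothetical, hand-written training examples.
train_docs = [
    "Invoice for payment of outstanding amount",
    "Quarterly budget and payment overview",
    "Employee contract and salary details",
    "Leave request and employee onboarding",
]
train_labels = ["financial", "financial", "hr", "hr"]

labeller = WordCountLabeller().fit(train_docs, train_labels)
predicted = labeller.predict("Please find the invoice and payment terms attached")
```

Because the model only counts words, it cannot tell apart two uses of the same word in different contexts, which is where context-aware neural models are expected to help.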
The second activity in the PoC-phase was the continuation of the market exploration. The extended exploration further strengthened the view that a range of solutions are available but also that these come with their limitations. Further examination of the solutions and actual business needs is a logical next step to identify the true gaps for practical implementation and to more specifically determine what business value the more advanced labelling methods can bring.
Activities within the extended PoC phase
Instead of going directly to the pilot phase, the team decided that extending the PoC phase was more suitable. The extension has three goals:
- Further examine the market solutions in relation to the business needs. While we determined that advanced machine learning methods can perform more complex tasks, the question remains to what extent these tasks are an actual problem to solve. A viable compromise could also be to work together with a market player or to extend an existing solution with specific functionality.
- Explore drivers other than classification accuracy that may be an incentive to adopt new techniques. One example is learning or adapting to new types of labels more efficiently. A more efficient training procedure can be realised with methods such as semi-supervised learning or active learning, the latter being an interactive approach between the learning system and human experts.
- If the first two points give reasons to continue towards the pilot phase, start preparing as soon as possible. Any operational implementation requires access to large amounts of data, and finding a suitable dataset and gaining access to it are both typically time-consuming tasks.
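The active learning idea mentioned in the second point can be sketched as follows. This is a minimal illustration of one common selection strategy (least-confidence uncertainty sampling), not the project's implementation: the function name `least_confident`, the document ids and the probabilities are hypothetical, and the probabilities are assumed to come from whatever labelling model is in use.

```python
def least_confident(predictions, budget):
    """Return the ids of the `budget` documents the model is least sure about.

    `predictions` maps a document id to a dict of label -> probability.
    Confidence is the probability of the most likely label; the documents
    with the lowest confidence are sent to a human expert for labelling.
    """
    confidence = {
        doc_id: max(probabilities.values())
        for doc_id, probabilities in predictions.items()
    }
    # Sort by confidence, breaking ties on the id for a stable order.
    ranked = sorted(confidence, key=lambda doc_id: (confidence[doc_id], doc_id))
    return ranked[:budget]


# Hypothetical model output for four unlabelled documents.
predictions = {
    "doc-a": {"financial": 0.95, "hr": 0.05},
    "doc-b": {"financial": 0.55, "hr": 0.45},
    "doc-c": {"financial": 0.20, "hr": 0.80},
    "doc-d": {"financial": 0.51, "hr": 0.49},
}

to_review = least_confident(predictions, budget=2)
```

In a full loop, the expert's labels for the selected documents would be added to the training set and the model retrained, so that manual effort goes to the documents where it is expected to help most.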
This project is part of the trend
Stricter rules and enforcement on information sharing
The DNB (the supervisory authority) is paying more attention to privacy, as expressed in the GDPR, and to other regulations. For example, it is becoming more important to demonstrate compliance with privacy regulation at every step taken in capturing or modifying data.