Data classification and labelling
Started in December 2020
Large organisations handle a great deal of data, much of which is unstructured, such as emails, documents and multimedia. This data is often unclassified but may contain confidential information. Security policies generally include measures to protect data, but these measures are applied on the basis of the data's classification.
Unstructured data must therefore be classified before it can be protected. Classification is done by means of 'data labels' that indicate what kind of information the data contains. However, labelling unstructured data is complex and time-consuming, so it is often not done, with the result that much data is not properly secured.
This project aims to develop a methodology that uses '(semi-)supervised learning' to label unstructured data automatically. The goal is a flexible and scalable approach that allows new labels to be added over time.
Activities in Explore phase: The assumptions underlying this idea are that this methodology is not yet on the market, that it reduces subjectivity, mistakes and workload, that it increases the coverage of labelled documents, and that it improves discoverability and compliance. To validate these assumptions, the project team held several meetings, talked to vendors and to privacy and compliance officers, and carried out desk research.
Conclusion at the end of the Explore phase: We found the assumptions behind the idea to be valid. At this moment, there is no viable product on the market that offers flexible automated data labelling. Products are available, but they often offer a standard solution that might not fully fit, the level of trust they warrant is difficult to assess, and manual 'tweaking' is labour-intensive.
In the current Proof of Concept phase, (semi-)supervised learning will be used to train adaptive classifiers that detect sensitive data. Our goal is to use this method to train a model that can assign the right labels to unstructured (textual) data while greatly reducing the manual effort required.
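To make the idea concrete, the sketch below illustrates self-training (pseudo-labelling), one common form of semi-supervised learning. It is not the project's actual method, which is not specified here. The naive Bayes classifier, the function names, the confidence threshold of 0.8 and the sample documents are all assumptions chosen for illustration. A small set of hand-labelled documents trains an initial model, and unlabelled documents that the model labels with high confidence are then added to the training set.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately simplistic)."""
    return text.lower().split()


def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model on raw text documents."""
    word_counts = defaultdict(Counter)  # label -> token frequencies
    class_counts = Counter(labels)      # label -> number of documents
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_proba(model, doc):
    """Return a dict mapping each label to its posterior probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokenize(doc):
            if tok not in vocab:
                continue  # ignore tokens never seen during training
            # Laplace (add-one) smoothing
            count = word_counts[label][tok] + 1
            score += math.log(count / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise log scores into probabilities in a numerically stable way
    top = max(log_scores.values())
    exp_scores = {lab: math.exp(s - top) for lab, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {lab: v / norm for lab, v in exp_scores.items()}


def predict_label(model, doc):
    """Return the most probable label for a document."""
    probs = predict_proba(model, doc)
    return max(probs, key=probs.get)


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Self-training loop.

    labeled:   list of (text, label) pairs labelled by hand
    unlabeled: list of texts without labels
    Returns the final model and the texts that stayed unlabelled.
    """
    docs = [text for text, _ in labeled]
    labels = [label for _, label in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(docs, labels)
        remaining = []
        for doc in pool:
            probs = predict_proba(model, doc)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                # Confident prediction: adopt it as a pseudo-label
                docs.append(doc)
                labels.append(best)
            else:
                remaining.append(doc)
        if len(remaining) == len(pool):
            break  # no new pseudo-labels this round, so stop
        pool = remaining
    return train_nb(docs, labels), pool


# Hypothetical sample data, for illustration only
labeled_docs = [
    ("salary payslip bank account number", "confidential"),
    ("passport number and home address", "confidential"),
    ("lunch menu for the canteen", "public"),
    ("office party invitation friday", "public"),
]
unlabeled_docs = [
    "bank account and passport number",
    "canteen lunch friday",
    "payslip salary overview",
]

model, still_unlabeled = self_train(labeled_docs, unlabeled_docs)
print(predict_label(model, "salary bank account"))
print(still_unlabeled)
```

Documents that never reach the threshold stay in the returned pool, so they can be passed to a human reviewer. This is where the reduction in manual effort would come from: people label only the cases the model is unsure about. Adding a new label later would amount to supplying a few labelled examples for it and rerunning the loop.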
This project is part of the trend:
Stricter rules and enforcement on information sharing
DNB (De Nederlandsche Bank, the Dutch supervisory authority) is paying more attention to privacy, as expressed in the GDPR and other regulations. For example, it is becoming more important to demonstrate compliance with privacy regulations at every step taken in capturing or modifying data.