Protecting unstructured data – challenges and opportunities of automated labelling


Lessons learned from financial organisations


Labelling makes it much easier to manage, organize and protect data. However, the data-labelling process is time-consuming and error-prone. Automating this process can add considerable value and there are many widely available tools on the market for this purpose. Challenges remain, however, which may prevent their future large-scale use in financial organisations.

This paper describes the lessons learned from the PCSI in applying machine learning to labelling unstructured data on medical and fraud-related personal identifiable information (PII). We describe five value propositions for financial organisations in terms of accuracy, flexibility, complexity, resolution and explainability. We invite vendors and other interested parties to join us in discussing how to put these propositions for automated data labelling into practice.  

Data labelling: what is it and why do we value it? 

‘Data is the new gold’ is a popular mantra applicable to many industries. It applies in the sense that there is substantial value to be gained, but also in the sense that it is not always straightforward to extract that value. Organisations have vast amounts of data available and are creating and collecting more at an increasing rate. Much of this data is unstructured – audio files, video and e-mails, as well as most text files. The unstructured nature of the data makes proper management, organisation and protection a challenge, as it is unclear to both the computer and its users what the data can and may be used for. 

Labels can be obtained by asking end users to make judgements about a given piece of unstructured data, but labelling data in this manner is error-prone and labour-intensive, requiring end users to invest valuable time. Consider an organisation of 10,000 employees. Each employee creates or edits on average about five files containing unstructured data per day. Considering and selecting the correct labels will take at least 10 seconds per file. This means your employees together spend almost 700 hours of every five-day working week on data labelling. And the labelling is often not correct or complete: people frequently make mistakes when manually labelling data, even when they are careful. 
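
The estimate above can be reproduced with a few lines of arithmetic. The sketch below reads the five files as a daily average per employee and assumes a five-day working week; these two readings are assumptions, but together they reproduce the quoted figure of almost 700 hours.

```python
# Back-of-the-envelope estimate of the weekly time spent on manual labelling.
EMPLOYEES = 10_000
FILES_PER_EMPLOYEE_PER_DAY = 5   # assumption: "five files" is a daily average
WORKING_DAYS_PER_WEEK = 5        # assumption: five-day working week
SECONDS_PER_FILE = 10            # lower bound given in the text


def weekly_labelling_hours(employees, files_per_day, days_per_week, seconds_per_file):
    """Return the total number of hours per week spent on labelling files."""
    total_seconds = employees * files_per_day * days_per_week * seconds_per_file
    return total_seconds / 3600


hours = weekly_labelling_hours(
    EMPLOYEES, FILES_PER_EMPLOYEE_PER_DAY, WORKING_DAYS_PER_WEEK, SECONDS_PER_FILE
)
print(f"{hours:.0f} hours per week")  # prints "694 hours per week"
```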
 
This is where automated data labelling aims to offer a solution. Data labelling can be interpreted as providing data with attributes or metadata, such as ‘bank statement’ or ‘contains personal identifiable information’, to describe it. Data can be identified with the help of these labels, retrieved when necessary, and treated appropriately with regard to protection. Data can be labelled at the file or document level, but it is also possible to assign labels to sections within a file or even tag certain words. Two of the key benefits of data labelling are: (1) traceability: files can be stored and found according to the labels given, facilitating data management; and (2) data classifications: data labels provide objective support for data classifications. For example, different levels of security classifications are required according to governance regulations, such as the General Data Protection Regulation (GDPR).

Data labelling versus data classification

Data labelling and data classification are often used interchangeably to denote the assignment of attributes to data. To distinguish between the two terms, the following definitions have been adopted in this article: 

  • Data labelling: the attribute assigned by a label aims to provide an objective description of the data. For example, the file contains personal identifiable information (PII) or medical information. There should be little or no disagreement between people about whether or not to assign a particular label, and the label is not impacted by changes in regulations. 
  • Data classification: the attribute assigned by a classification is a more subjective description of the data, such as the level of security that it requires. While classifications are typically used for enforcement purposes (e.g., confidential information may not be shared through e-mail), their assignment is more open to interpretation and can change over time. Information may become less sensitive over time and two users may not assess the value equally.

Given the definitions above, it makes sense to start with the data labelling process and use that information to help determine the appropriate data classification. Labels can thus be used to discover data and to assign appropriate classifications, while those classifications can in turn be used to determine the measures that may need to be taken, such as an appropriate level of security (the flow is depicted in Figure 1). This is in contrast to many current practices, where data is immediately labelled with a security classification (e.g., public, secret). Labelling data with descriptive metadata instead, and using those labels to derive the classification, make it much easier to adapt to changing requirements (e.g., changes in regulations or classification schemes). 

Figure 1: Objective data labels can be used to determine the appropriate data classification dynamically, which in turn determines what measures should be taken for protection and sharing. 
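
The flow from labels to classification to measures can be sketched in a few lines of Python. The label names, classification levels, rules and measures below are hypothetical and chosen purely for illustration; they do not represent an actual classification scheme.

```python
# Hypothetical rules, most restrictive first: (triggering labels, classification).
CLASSIFICATION_RULES = [
    ({"contains medical information"}, "secret"),
    ({"contains PII"}, "confidential"),
]
DEFAULT_CLASSIFICATION = "public"

# Hypothetical protection measures per classification level.
MEASURES = {
    "secret": ["restricted access", "no sharing by e-mail"],
    "confidential": ["no sharing by e-mail"],
    "public": [],
}


def classify(labels):
    """Derive a classification from objective labels; the first matching rule wins."""
    for trigger_labels, classification in CLASSIFICATION_RULES:
        if trigger_labels & set(labels):
            return classification
    return DEFAULT_CLASSIFICATION


def measures_for(labels):
    """Look up the measures that follow from the derived classification."""
    return MEASURES[classify(labels)]


file_labels = {"bank statement", "contains PII"}
print(classify(file_labels))      # prints "confidential"
print(measures_for(file_labels))  # prints "['no sharing by e-mail']"
```

Because the labels are stored separately from the rules, a change in regulations only requires editing the rule table; the labels already assigned to files remain valid.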

What makes data labelling difficult?

While desirable, automated data labelling is far from straightforward, and several challenges need to be addressed:

First, unstructured data can vary greatly in content, format and even language. There are many different file types, such as e-mail, video, reports, and voice notes, which can also come in different formats, such as PDF, DOCX, WMV, and MP4. An automated data labelling solution should be as independent as possible of the file type. However, as far as we know, there is currently no tool that can label all types of data, nor will there be one in the foreseeable future [2]. 

Second, labelling data based on content faces the challenge of subjectivity. There are many scenarios in which people will easily agree on a specific and correct label, such as assigning the ‘contains PII’ label to a passport copy. But in other cases, this might not be as straightforward. Should a 50-page document containing the name and contact details of the author also be labelled as ‘contains PII’ or does that misrepresent the actual content? 

Third, automated evaluation of data and assignment of labels is difficult and will not always be perfect. What is considered ‘good’ performance depends on the context, as some labels are easier to identify than others. The need for automation is huge, as reviewing different kinds of objective labels would take up too much of employees’ valuable time. Furthermore, performance has two sides: precision (the share of assigned labels that are correct) and recall (the share of files that should carry a label and actually receive it) (Figure 2). Organisations have to make a trade-off between the two, and that trade-off may even differ between types of labels. This makes it difficult to compare and judge performance consistently.

Figure 2: Visual explanation of the evaluation metrics precision and recall. Each label requires a context-dependent decision on how to balance these in the labelling process.
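
To make the trade-off concrete, the sketch below computes precision and recall for a single label at two decision thresholds. The ground truth and model scores are invented for illustration, and returning 0.0 when a metric is undefined is a convention chosen here, not a standard.

```python
def precision_recall(truth, predicted):
    """Compute precision and recall for one label over a set of files.

    Both arguments are equally long lists of booleans: whether each file
    should carry the label (truth) and whether it was assigned (predicted).
    """
    true_pos = sum(t and p for t, p in zip(truth, predicted))
    false_pos = sum((not t) and p for t, p in zip(truth, predicted))
    false_neg = sum(t and (not p) for t, p in zip(truth, predicted))
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Invented example: does each of eight files contain PII, and how sure is the model?
truth = [True, True, True, False, False, False, True, False]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1, 0.7, 0.55]

for threshold in (0.35, 0.65):
    predicted = [score >= threshold for score in scores]
    precision, recall = precision_recall(truth, predicted)
    print(f"threshold {threshold}: precision {precision:.2f}, recall {recall:.2f}")
# threshold 0.35: precision 0.67, recall 1.00
# threshold 0.65: precision 1.00, recall 0.75
```

A low threshold finds every file that contains PII but also flags files that do not; a high threshold only assigns correct labels but misses one file. Which of the two is preferable depends on the label and on the cost of each kind of error.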

Finally, designing or training a model to classify new labels is a complex and resource-intensive task. It relies either on experts designing the required rulesets (‘if this, then that’) or on large amounts of training data to learn from. The latter is often further complicated by the fact that the training data has to be labelled manually in order for the system to be able to learn from it. This makes it difficult for a system to adapt to new or changing requirements over time, especially when dealing with very specific or high-resolution labels.
 

State-of-the-art and beyond

There is already a wide variety of automated data labelling solutions available for unstructured data. Mature solution providers such as Microsoft and Proofpoint offer services that can label based on the file’s location, metadata, keywords, dictionaries, sensitive information types such as IBAN, and specific file matches (specific contracts, customer forms). Many of these mechanisms are rule-based to a large degree, meaning that files are labelled when they fulfil pre-defined conditions designed by human experts. This works well in many cases and brings tremendous value compared to labelling files manually. However, it works well mainly for clear-cut labels, such as labelling a file as being a CV or containing a social security number, and it is not always suitable for more specific or fine-grained labels. AI-based solutions are becoming increasingly common as a means of dealing with such challenges.
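
A rule-based labeller of this kind can be sketched as follows. This is an illustration under stated assumptions, not how any particular vendor implements it: the keyword dictionary and label names are hypothetical, and the IBAN pattern is simplified (it ignores IBANs written with spaces). The check-digit test follows the standard mod-97 procedure for IBANs.

```python
import re

# Simplified IBAN pattern: country code, two check digits, 11 to 30 alphanumerics.
IBAN_PATTERN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

# Hypothetical keyword dictionary: label -> phrases that trigger it.
KEYWORD_RULES = {
    "CV": {"curriculum vitae", "work experience"},
    "mortgage": {"mortgage", "loan-to-value"},
}


def iban_checksum_ok(candidate):
    """Validate the check digits: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35) and test that the result mod 97 is 1."""
    rearranged = candidate[4:] + candidate[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def label_file(text):
    """Return the set of labels whose pre-defined conditions the text fulfils."""
    labels = set()
    if any(iban_checksum_ok(match) for match in IBAN_PATTERN.findall(text)):
        labels.add("contains IBAN")
    lowered = text.lower()
    for label, keywords in KEYWORD_RULES.items():
        if any(keyword in lowered for keyword in keywords):
            labels.add(label)
    return labels


print(sorted(label_file("Please transfer the mortgage payment to NL91ABNA0417164300.")))
# prints "['contains IBAN', 'mortgage']"
```

The strength and the weakness of the approach are both visible here: every decision can be traced to a rule, but each new label requires an expert to write and maintain another rule.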

A popular AI-based method for labelling files that do not adhere to an easily identifiable format is to use frequency-based machine learning models. Such models learn to identify files based on common words or patterns that they contain. Large sets of sample files are used for learning the words or patterns that typically occur in a given type of document, so that the models learn implicit dictionaries of words and phrases associated with those types. New files can then be analysed and compared to determine which type a specific file resembles most. For example, looking at large numbers of mortgage documents enables models to recognise and identify new documents that probably also relate to mortgages. Sufficient exemplary data enables a powerful, dynamic, and self-learning system. In practice, this typically pertains to methods based on frequencies of words or word combinations, such as Term Frequency – Inverse Document Frequency (TF-IDF) or plain word counts (as produced, for example, by scikit-learn’s CountVectorizer).
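
The idea of comparing a new file with labelled examples can be illustrated with a minimal TF-IDF sketch. It uses only the Python standard library to stay self-contained, and it assigns the label of the single most similar training text, which is one simple choice among many. The training texts are invented; in practice, libraries such as scikit-learn provide TF-IDF vectorisers combined with trained classifiers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


class TfidfLabeller:
    """Label a new text with the label of its most similar training text."""

    def fit(self, texts, labels):
        tokenized = [tokenize(text) for text in texts]
        n_docs = len(tokenized)
        doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
        # Smoothed inverse document frequency: rare terms weigh more than common ones.
        self.idf = {
            term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()
        }
        self.vectors = [self._vectorize(tokens) for tokens in tokenized]
        self.labels = list(labels)
        return self

    def _vectorize(self, tokens):
        """Turn tokens into a length-normalised TF-IDF vector; unknown words are dropped."""
        counts = Counter(tokens)
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def predict(self, text):
        query = self._vectorize(tokenize(text))

        def cosine(vec):
            # Both vectors are normalised, so the dot product is the cosine similarity.
            return sum(w * vec.get(t, 0.0) for t, w in query.items())

        best = max(range(len(self.vectors)), key=lambda i: cosine(self.vectors[i]))
        return self.labels[best]


# Invented training examples, for illustration only.
train_texts = [
    "mortgage loan interest rate property valuation",
    "mortgage application property loan repayment term",
    "curriculum vitae work experience education skills",
    "work experience references education curriculum vitae",
]
train_labels = ["mortgage", "mortgage", "CV", "CV"]

model = TfidfLabeller().fit(train_texts, train_labels)
print(model.predict("interest rate on the property loan"))  # prints "mortgage"
print(model.predict("education and work experience"))       # prints "CV"
```

Note that a text without any known words has zero similarity to every example and falls back to the first training label; a real system would abstain in that case. The sketch also shows the limitation discussed in the next section: the model only sees which words occur, not what they mean in context.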

Our five value propositions

The PCSI is exploring the potential of state-of-the-art machine learning methods that look beyond word counts as input to label a document. This is because for some labels, simply looking for the presence of certain words is not enough. A limited amount of text, specific labels, or the contextual meaning of words or sentences can make it difficult to label accurately using such methods. The state of the art in machine learning on text – Natural Language Processing – makes it possible to understand and use the semantic meaning of text, rather than simply doing word-specific analysis. We see the first implementation of such techniques in machine learning packages such as spaCy and Transformers by Hugging Face. Popular models being used for such analysis include Word2Vec, Doc2Vec and BERT (Bidirectional Encoder Representations from Transformers). In our exploration, however, these capabilities are not yet reflected in commercial tooling and services.

While current solutions offer some value, mature organisations may benefit from innovations in this field, further boosting business value through the use of state-of-the-art machine learning solutions for labelling unstructured data. To this end, we would like to suggest the following five propositions:

  1. Accuracy: most product vendors provide an estimated accuracy of around 80% for most labels. While this is a great improvement on not having anything in place, it also means that roughly 1 in 5 files are still being missed or incorrectly labelled. This estimate is typically given for the labelling of documents, such as ‘CV’ or ‘financial statement’. More complex or detailed labels will probably suffer from far lower accuracy scores. It is desirable to have a system that enables customers to make a trade-off between performance metrics such as precision and recall.
  2. Flexibility: requirements, insights, and regulations change over time and this affects labelling needs. Large numbers of existing labels may need to be updated (further emphasising the need for automation) and new types of labels may need to be added. This calls for a system that can adapt to such changing circumstances without the need for a lengthy process. Rule-based systems require experts to design and manage the rulesets, which can be a time-consuming task, especially when designing rules for new labels. While machine-learning solutions can learn independently, they do require a training set, and this can be challenging to obtain. New implementations should take this into account by facilitating the manual annotation of labels and by making use of recent advances, such as few-shot learning and active learning. 
  3. Complexity: typical examples of labels that are assigned to documents are ‘CV’ or ‘contract’. There is value to be gained from using more specific and descriptive labels, such as ‘rental agreement’ as opposed to ‘contract’. This enables more specific searches and more precise classifications. However, distinguishing between different types of contracts is a much harder task than simply determining that a document is a type of contract. The model therefore requires more complex rulesets or more specific datasets to train on.
  4. Resolution: while most tools are capable of assigning labels to documents, only a few of them provide the capability to identify and label on individual words or sentences. Doing so can add considerable value, as it provides users with more detail and the ability to search for documents and to manage them. It also enables extracting or redacting certain items of information, such as sensitive PII, with the added possibility of automation.
  5. Explainability: rule-based systems make it relatively easy to provide a clear explanation as to why a given label was assigned, for example by highlighting the exact words in the file that were used to derive that label. For machine learning-based systems, however, it is much more difficult to provide such an explanation. While learning from high-dimensional patterns is powerful, it also makes it difficult for a human to understand why a decision has been made. Explainable AI (XAI) techniques such as SHAP values can help provide insight and restore human control. They also enable the human controller to provide the system with feedback (e.g., indicating an incorrectly assigned label) from which it can learn in order to improve over time.
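
Resolution and explainability, as described in the list above, can be illustrated together with a rule-based sketch that labels individual spans of text rather than whole documents. The two patterns are hypothetical and deliberately simple, and the sketch assumes that spans do not overlap; real PII detection needs far more robust rules or models.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only, not production-grade PII detection.
SPAN_PATTERNS = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


@dataclass
class LabelledSpan:
    label: str
    start: int
    end: int
    text: str


def label_spans(text):
    """Return every labelled span, ordered by position in the text."""
    spans = [
        LabelledSpan(label, match.start(), match.end(), match.group())
        for label, pattern in SPAN_PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda span: span.start)


def redact(text, spans):
    """Replace each span with a placeholder, right to left so offsets stay valid."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label.upper()}]" + text[span.end:]
    return text


message = "Contact jan@example.com about account NL91ABNA0417164300."
found = label_spans(message)
for span in found:
    # The span itself is the explanation: it shows which text triggered the label.
    print(f"{span.label}: '{span.text}' at {span.start}-{span.end}")
print(redact(message, found))
# prints "Contact [EMAIL ADDRESS] about account [IBAN]."
```

The document-level labels follow directly from the spans, and each label comes with the exact text that caused it. Providing the same level of detail for a machine learning model is what techniques such as SHAP values aim to approximate.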

We encourage vendors to review their solution with respect to accuracy, flexibility, complexity, resolution and explainability. These aspects are essential for providing their existing and potential customers with greater transparency regarding what their product can and cannot do. We encourage them to improve their products on each of these aspects and consider the potential of adding state-of-the-art machine learning to their solution.

Takeaways

For most organisations starting out on data labelling, there are many good tools widely available on the market that will bring tremendous value. For more mature organisations, there is still considerable value to be gained, because current market solutions fall short in terms of accuracy, flexibility, resolution, explainability, and the ability to deal with complexity. State-of-the-art Natural Language Processing (NLP) techniques may help bridge this gap between user needs and what current tools offer, while also making the models more insightful. 

We invite data labelling vendors, information security specialists, and machine learning developers to join us in discussing how to align future developments in these areas with user needs in the financial sector. Do you recognise our challenges and opportunities? Are you experiencing other needs and challenges or can you see other opportunities? We would love to hear your thoughts and share ideas!

If you would like to join in our discussion, please contact the project lead, Rick van der Kleij: Rick.vanderkleij@tno.nl.

Authors: Steven Vethman (TNO), Maaike de Boer (TNO), Wouter Langenkamp (TNO), Sjoerd van Leersum (Achmea), Noor Spanjaard (ABN AMRO), Michaël Stekkinger (Achmea), Olaf Streutker (ABN AMRO), Willem van der Valk (Achmea), Ron Werther (de Volksbank), Rick van der Kleij (TNO).
