Machine learning model and pipeline available upon request
We have developed software that performs binary sentence classification to detect personally identifiable information (PII).
The aim of the PCSI Automated Data Labelling project is to help protect unstructured data using data labelling. One of the results of the project is a pipeline that is flexible in both feature extraction and classifier creation. It accepts English and Dutch text as input and offers four feature extraction approaches: CountVectorizer, word2vec, BERT and fine-tuned BERT.
The derived features are used to create a wide range of classifiers that can be compared on training time and test set performance, since the best method can differ per dataset and application. Additionally, some explainability is provided in terms of the most informative feature. In the project, the pipeline was evaluated on two synthetic datasets that contain (English) medical and (Dutch) fraud-related PII.
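To make the idea concrete, here is a minimal sketch of binary sentence classification on word-count features, reporting training time, test accuracy and the most informative tokens. It is not the project's pipeline: it uses only the Python standard library, swaps in a hand-written bag-of-words count and a multinomial Naive Bayes classifier, and runs on a handful of invented English sentences. All function names and data below are hypothetical.

```python
import math
import re
import time
from collections import Counter

# Hypothetical toy sentences (invented for illustration).
# Label 1 = sentence contains PII, label 0 = it does not.
TRAIN = [
    ("Patient John Smith was born on 12 March 1980", 1),
    ("Please contact Mary Jones at her home address", 1),
    ("The phone number of the patient is 555 0100", 1),
    ("The clinic opens at nine every morning", 0),
    ("Blood pressure should be measured twice a day", 0),
    ("The report describes a common treatment plan", 0),
]
TEST = [
    ("Contact the patient John at his home address", 1),
    ("The treatment plan should be measured every day", 0),
]


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def train_naive_bayes(examples):
    """Collect per-class token counts (bag-of-words features)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for sentence, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(sentence))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return {"words": word_counts, "classes": class_counts, "vocab": vocab}


def log_likelihood(model, token, label):
    """Laplace-smoothed log P(token | label)."""
    counts = model["words"][label]
    total = sum(counts.values()) + len(model["vocab"])
    return math.log((counts[token] + 1) / total)


def predict(model, sentence):
    """Return the label with the highest posterior log-score."""
    n_examples = sum(model["classes"].values())
    scores = {}
    for label in (0, 1):
        score = math.log(model["classes"][label] / n_examples)
        for token in tokenize(sentence):
            if token in model["vocab"]:  # ignore unseen tokens
                score += log_likelihood(model, token, label)
        scores[label] = score
    return max(scores, key=scores.get)


def most_informative(model, k=3):
    """Tokens with the highest log-odds in favour of the PII class."""
    def log_odds(token):
        return log_likelihood(model, token, 1) - log_likelihood(model, token, 0)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(model["vocab"], key=lambda t: (-log_odds(t), t))[:k]


start = time.perf_counter()
model = train_naive_bayes(TRAIN)
train_seconds = time.perf_counter() - start

accuracy = sum(predict(model, s) == y for s, y in TEST) / len(TEST)
print(f"training time: {train_seconds:.6f} s")
print(f"test accuracy: {accuracy:.2f}")
print("most informative tokens for PII:", most_informative(model))
```

Running the same train, time and evaluate loop over several feature extractors and classifiers gives the kind of side-by-side comparison described above. With a dataset this small the numbers carry no meaning beyond showing the mechanics.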
The software can be made available upon request. Interested? Get in touch with us!
Send an e-mail to email@example.com.
Our 3rd Cybertalk Session, on the trend "Stricter rules and enforcement on information sharing", dives deeper into the solution developed within our PCSI project Automated Data Labelling. Worth watching!