This question comes up in various forms, for example, when an employee is stealing intellectual property from their employer or when medical records have been breached. Most recently, a colleague asked, “Do you have any script that looks inside of a local directory for PHI or PII?” By PHI, we mean protected health information, which is sensitive information regulated by the Health Insurance Portability and Accountability Act (HIPAA). PHI is a subset of personally identifiable information (PII). In addition to PHI and PII, companies have a duty to protect their intellectual property (IP) and many other types of data.
Unfortunately, there is no magic “find sensitive data” easy button. Discovering sensitive data is challenging because the data must first be characterized, and that process depends on the nature of the data and how it is handled. The other challenge is related to how and where the data is stored. Sensitive data can be stored on USB thumb drives, computer hard drives, and cloud storage. It may sit in databases or proprietary file formats, and it may be encrypted.
Certain data, like credit card numbers, Social Security numbers, and medical record numbers (MRN), have a very specific format. If we know the format, we can create a regular expression (aka “regex”) to describe it. Most digital forensic tools, programming languages, and command line tools, such as grep, support regular expressions.
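As a rough illustration of the kind of script our colleague asked about, here is a minimal Python sketch that walks a directory and applies two regular expressions. The patterns, the function names (find_sensitive, scan_directory), and the choice to read every file as plain text are assumptions made for this example. MRN formats vary by organization, so none is shown, and this sketch will not see inside databases, proprietary file formats, or encrypted files.

```python
import re
from pathlib import Path

# Illustrative patterns only (assumed formats); real searches need tuning.
PATTERNS = {
    # U.S. Social Security number written as 123-45-6789
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 16-digit card number, optionally grouped in fours by spaces or hyphens
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def find_sensitive(text):
    """Return a list of (label, matched_text) pairs found in text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


def scan_directory(root):
    """Yield (path, label, matched_text) for every hit in files under root."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            # Treat every file as text; undecodable bytes are dropped.
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file, skip it
        for label, value in find_sensitive(text):
            yield path, label, value
```

Calling scan_directory on a folder and printing each result gives a first list of candidate files to review by hand.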
Unfortunately, not all sensitive information can be uniquely identified by regular expressions. Take, for example, a post-operative physician’s note. This will contain lots of medical jargon, medication names, and other words that are specific to the medical profession. Putting a medical record number on the document will make it easier to find using a regular expression, but what if it is a draft of the note and does not contain an MRN? We could search for medical terminology or medication names. Obviously, the better we understand the data we seek, the more effective our efforts to search for the data will be.
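When no identifier such as an MRN is present, a simple keyword search is one place to start. The sketch below flags text that contains several terms from a vocabulary list. The term list, the threshold of three terms, and the function names are hypothetical choices for illustration. A real search would draw its terms from a proper medical vocabulary or from sample documents in the case at hand.

```python
import re

# Hypothetical vocabulary; replace with terms that fit the data you seek.
MEDICAL_TERMS = {"post-operative", "incision", "sutures", "oxycodone", "analgesic"}


def matched_terms(text, terms=MEDICAL_TERMS):
    """Return the sorted list of vocabulary terms that appear in text."""
    # Lowercase the text and split it into words, keeping internal hyphens.
    words = set(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))
    return sorted(words & set(terms))


def is_candidate(text, min_terms=3):
    """Flag text that contains at least min_terms distinct vocabulary terms."""
    return len(matched_terms(text)) >= min_terms
```

Requiring several distinct terms, rather than any single one, is a crude way to avoid flagging every document that happens to mention one medical word.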
A physician’s note is clearly very sensitive, but how can software differentiate between a physician’s note and a public article in a medical journal? This is easy for a human, but how can this be done at scale across terabytes of storage? Typically, one will accept some false positives in the search results and manually sort through them, but additional technical solutions may be leveraged by those familiar with them.
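For some structured data, a small amount of extra logic can cut down the false positives before any manual review. Payment card numbers, for example, carry a check digit computed with the Luhn algorithm, so a 16-digit string that fails the check is unlikely to be a real card number. The sketch below is one such filter. The function name and the minimum length of 13 digits are assumptions made for this example, and the check does nothing for free text such as a physician’s note.

```python
def luhn_valid(candidate):
    """Return True if the digits in candidate pass the Luhn checksum."""
    # Keep only ASCII digits so spaces and hyphens in the match are ignored.
    digits = [int(c) for c in candidate if "0" <= c <= "9"]
    if len(digits) < 13:  # assumed lower bound for a card number
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A string such as 1234 5678 9012 3456 matches a 16-digit pattern but fails this check, so it can be dropped from the review pile.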
At Lucid Truth Technologies, we create innovative technical solutions to solve these types of problems. We have used natural language processing (NLP) techniques, often powered by artificial intelligence, to perform document classification, document summarization, named entity recognition, and indexing to identify the data most relevant to your case or investigation.
Lucid Truth Technologies is here to help if you face a challenge like this, regardless of where the data may be stored. Contact us today!