Your AI Data Is Valuable… But Is It Safe?
It seems like it happened overnight: artificial intelligence and machine learning have become pivotal to organizations around the world. In just a short time, AI and ML have emerged as indispensable tools, rapidly transforming the digital landscape and empowering companies to generate content, make data-driven decisions, and automate manual processes.
In the healthcare sector, for instance, AI is being used to analyze medical images, diagnose diseases, and develop personalized treatment plans for patients. In finance, ML algorithms are being leveraged for fraud monitoring, market predictions, and asset management. And across a wide variety of industries, AI technologies are being used for customer support and other functions.
This is especially true in the cybersecurity space, where AI and ML are now at the core of many businesses. New tools are revolutionizing how organizations defend themselves against cyberthreats by analyzing vast amounts of data, identifying patterns in real-time, and detecting potential security breaches. We’re seeing AI’s capabilities extend to everything from intrusion detection systems to phishing- and malware-blocking security tools that safeguard sensitive data and critical systems from harm.
In our latest fireside chat, ShardSecure Field CTO Julian Weinberger and CISO and VP of Sales Engineering Zack Link sat down to discuss the rapidly changing world of AI, including the growing importance — and cost — of AI/ML models and training data. Today, we’ll recap their discussion and explore the top concerns around data security, privacy, and sovereignty for AI/ML models.
The staggering value of AI/ML data
First, we believe that AI/ML models and training data should be treated as critical intellectual property. Companies invest many millions of dollars into creating models that, when successful, provide unique advantages in the market.
The development process for these AI/ML models involves immense effort. Teams of data scientists may spend up to 80% of their time just gathering and processing huge amounts of raw data to make it usable during the training and testing phases. The quality of the resulting models will depend in large part on the quantity and quality of the processed data. (Generally, the larger the dataset, the better the model will be at recognizing intricate patterns, avoiding biases, and making accurate predictions.)
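As a sketch of what that processing phase involves, the toy function below performs two of the most common cleaning steps: dropping malformed records and deduplicating exact repeats. The records, field names, and labels are invented purely for illustration; real pipelines are far larger and messier.

```python
# Hypothetical raw training records; field names are invented for illustration.
raw_records = [
    {"text": "Invoice #1001 approved", "label": "ham"},
    {"text": "Invoice #1001 approved", "label": "ham"},   # exact duplicate
    {"text": "", "label": "spam"},                        # malformed: empty text
    {"text": "WIN a free prize now!!!", "label": "spam"},
]

def clean(records):
    """Drop empty records and exact duplicates, keeping first occurrences."""
    seen, out = set(), []
    for r in records:
        text = r["text"].strip()
        if not text:            # drop malformed rows with no content
            continue
        key = text.lower()
        if key in seen:         # drop exact duplicates
            continue
        seen.add(key)
        out.append({"text": text, "label": r["label"]})
    return out
```

Even these two trivial steps change what the model learns from; at web scale, deduplication alone can remove a large fraction of a raw corpus.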
Since processing data not only makes up the bulk of the AI/ML model development process but also determines the success of those models, it makes sense to treat AI/ML data like critical IP. And that effort starts with protecting it from the unique cyberthreats it faces.
Unraveling the unique threats to AI/ML data
AI/ML data shares common data security challenges with other kinds of data, such as unauthorized access, breaches, and threats to data integrity. But AI/ML data also presents distinct data security challenges, including:
- Data poisoning and adversarial attacks
- Model stealing
- Online system attacks
- Cloud provider risks
- Ransomware attacks
- Data privacy and compliance challenges
We’ll cover each of these threats in more detail below.
Data poisoning and adversarial attacks
Data poisoning is a subtle and devastating way to sabotage the success of an AI/ML model. In these attacks, training data is intentionally compromised with malicious information in order to distort a model.
In one study, researchers demonstrated how easy it is to carry out data poisoning, noting that they could have poisoned a full 0.01% of the enormous LAION-400M or COYO-700M datasets for under $70. With only limited technical skills, attackers can use online services to easily compromise outputs and embed backdoors in AI/ML models.
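To make the mechanics concrete, here is a deliberately tiny sketch of label poisoning against a toy token-frequency fraud scorer. The scorer, tokens, and threshold are all invented and far simpler than any production model; the point is only to show how benign-labeled records containing fraud indicators dilute what the model learns.

```python
from collections import defaultdict

def train(examples):
    """Learn a per-token fraud rate from labeled (tokens, is_fraud) pairs."""
    counts = defaultdict(lambda: [0, 0])  # token -> [fraud_count, total_count]
    for tokens, is_fraud in examples:
        for t in set(tokens):
            counts[t][0] += int(is_fraud)
            counts[t][1] += 1
    return {t: f / n for t, (f, n) in counts.items()}

def flags_fraud(rates, tokens, threshold=0.5):
    """Flag a transaction if the mean fraud rate of its tokens exceeds the threshold."""
    return sum(rates.get(t, 0.0) for t in tokens) / len(tokens) > threshold

# Clean training history: the "wire overseas" pattern appears only in fraud.
clean_history = [(["wire", "overseas"], True)] * 4 + [(["coffee", "local"], False)] * 4
clean_rates = train(clean_history)

# Poisoning: the attacker slips benign-labeled records containing the same
# tokens into the training pipeline, diluting the learned fraud rate.
poisoned_history = clean_history + [(["wire", "overseas"], False)] * 6
poisoned_rates = train(poisoned_history)
```

After poisoning, the same transaction that the clean model flagged now sails through, because the "fraudulent" tokens no longer look suspicious to the retrained model.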
Like data poisoning, adversarial attacks also involve maliciously altering a model’s data points, but they do so with the specific goal of engineering negative outcomes. In the physical world, attackers might put tape on roadway markings or signage to confuse self-driving cars into driving dangerously. In the digital world, cybercriminals might instead introduce subtle changes to data in order to cause a model to misclassify fraudulent behavior as legitimate.
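The digital version of that attack can be sketched in a few lines. Below, a toy linear classifier flags a transaction, and an FGSM-style perturbation (stepping each feature slightly against the sign of its weight) pushes the score below the decision threshold. The weights and features are invented for illustration, not drawn from any real detector.

```python
# Toy linear fraud classifier: flag a transaction when W·x + B > 0.
# Weights and feature values are invented for illustration.
W = [1.0, 2.0, -1.0]
B = -0.5

def is_flagged(x):
    return sum(w * xi for w, xi in zip(W, x)) + B > 0

def sign(v):
    return (v > 0) - (v < 0)

x = [1.0, 0.5, 0.2]   # a transaction the model correctly flags as fraud

# FGSM-style evasion: step each feature against the sign of its weight.
# No single feature moves by more than EPS, yet the decision flips.
EPS = 0.6
x_adv = [xi - EPS * sign(w) for w, xi in zip(W, x)]
```

The perturbed transaction differs from the original by a bounded amount in every feature, which is exactly why these manipulations are hard for humans reviewing the raw data to spot.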
Model stealing
In AI model stealing, malicious actors exfiltrate data inputs and outputs in order to reverse-engineer a model and create a lookalike for a fraction of the cost. The lost revenue for the victims of model stealing can be staggering, as the original model likely cost tens of millions of dollars to build.
Model stealing also presents a major issue with models that are designed to detect crimes, protect assets, or perform other high-stakes tasks. If criminals are able to steal enough data to rebuild a wire fraud detection model, for instance, they will be able to come up with new approaches to fraud that can better avoid detection.
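In its simplest form, model extraction needs nothing but query access. The sketch below uses a hypothetical victim model that is secretly linear; by querying it at a few chosen points and fitting ordinary least squares to the responses, the attacker recovers the parameters exactly. Real models are vastly more complex, but the principle (query, record, fit a surrogate) is the same.

```python
# Hypothetical victim model exposed only through a query API. Internally it
# is a linear scorer, but the attacker sees only inputs and outputs.
def victim_predict(x):
    return 2.0 * x + 1.0   # "secret" parameters: slope 2.0, intercept 1.0

# Extraction: query the API at chosen points and record the outputs.
queries = [0.0, 1.0, 2.0, 3.0]
outputs = [victim_predict(q) for q in queries]

# For a 1-D linear model, ordinary least squares recovers both parameters.
n = len(queries)
mean_x = sum(queries) / n
mean_y = sum(outputs) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(queries, outputs))
         / sum((x - mean_x) ** 2 for x in queries))
intercept = mean_y - slope * mean_x

def stolen_predict(x):
    """The attacker's lookalike model, built from four API calls."""
    return slope * x + intercept
```

Defenses like query rate limiting and output rounding exist precisely because every answer an exposed model gives leaks a little information about its parameters.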
Online system attacks
Online system attacks are another form of data poisoning. Rather than exploiting security vulnerabilities to secretly manipulate data from within, online system attackers typically target AI models that accept user feedback. Instead of submitting helpful feedback to fine-tune the model, these attackers input malicious data to bias the system and skew its outputs.
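A minimal sketch of why feedback channels are dangerous: below, a toy reputation model scores a domain as the mean of all user votes. The class and scenario are invented for illustration, but the failure mode is real: an attacker who can submit feedback at scale can drown out genuine signals.

```python
# Toy online reputation model: a domain's safety score is the mean of all
# user feedback votes (+1 = safe, -1 = malicious). Invented for illustration.
class ReputationModel:
    def __init__(self):
        self.votes = []

    def feedback(self, vote):
        self.votes.append(vote)

    def score(self):
        return sum(self.votes) / len(self.votes)

model = ReputationModel()

# Ten genuine users report a phishing domain as malicious...
for _ in range(10):
    model.feedback(-1)

# ...then an attacker floods the feedback channel with fake "safe" votes.
for _ in range(90):
    model.feedback(+1)
```

After the flood, the score sits well above zero and the phishing domain looks trustworthy, which is why real systems weight feedback by reputation and rate-limit submissions rather than trusting raw vote counts.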
Cloud provider risks
Because the cloud offers scalable computing and customizable tools, AI/ML training data is almost always stored with cloud vendors in object stores, data lakes, and other repositories. The security challenges of the cloud, from the shared responsibility model to issues of third-party access, are well documented. From high-profile incidents involving major corporations to small-scale breaches affecting individual users, the vulnerabilities posed by unauthorized access remain a pressing concern for all organizations.
But the cloud also presents issues that are unique to the AI/ML landscape. For instance, some data owners worry that cloud admins could scrape AI/ML training data to use in their own models. Although scraping customer data would likely face legal challenges, there’s currently very little legislation to regulate how cloud providers handle AI/ML data.
Ransomware attacks
It might feel like you’re constantly hearing about ransomware, but that’s because it’s still a major threat. Large data repositories like the ones used for AI/ML training data are always a target for attackers, since they present a great opportunity to exfiltrate valuable material. Victims are also more likely to pay ransoms, since being locked out of data can disrupt AI product release schedules and raise the already significant costs of downtime.
Data privacy and compliance challenges
Lastly, AI/ML data poses major challenges for data privacy, confidentiality, and compliance. As AI technologies collect, process, and analyze vast amounts of personal and sensitive data, ensuring data privacy is critical. Depending on an organization’s size and jurisdiction, it may have to abide by regulations like the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) if its training data contains PII, PHI, or any other kind of personal information.
In industries like healthcare and finance, where AI processes sensitive data such as medical records and financial information, ensuring data confidentiality is even more important. As a result, some experts have called for a comprehensive federal data privacy law to govern AI/ML data.
How ShardSecure protects AI/ML models and training data
ShardSecure offers an innovative software solution to protect AI/ML models and training data in the cloud, including hybrid- and multi-cloud architectures. Our platform provides advanced data security for AI datasets, preventing unauthorized third-party access by cyberattackers and cloud admins alike.
ShardSecure's approach to file-level encryption offers major benefits for compliance with regulations and frameworks like the GDPR and SOC 2. It also supports strong data confidentiality and privacy for AI/ML models and training data. By providing high availability, performing regular integrity checks and self-healing processes, and offering features like object locking and immutable storage, ShardSecure ensures that valuable AI datasets remain available, accurate, and secure in the cloud.