Infosec's Dataset Problem
Is it possible for the most paranoid industry in technology to productively share data?
Two years ago I wrote a post about ML in information security. In it I cover what I think might be required to move past anomaly detection/alerting and closer to agents that can act in support of, or in place of, human operators. Since writing that post, I’ve spent more time working in the industry, more time thinking about the direction the field is moving in, and I’ve developed the beginnings of a gym environment for red teaming. At the time, I thought the best way forward was to develop more complex, closer-to-real-world environments to train my agent in. A more and more realistic simulation. More and more realistic-looking machines in more and more realistic network configurations.
The algorithm I chose for my initial experiments was PPO, or Proximal Policy Optimization. PPO is “on-policy”: only the most recent data, the data collected by the current model parameters (the current policy), is used to train the model at any given time. A side effect of that choice is that training data cannot be re-used. Even if I kept the state and action matrices from each timestep, they would not do me or anybody else any good.
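To make that concrete, here’s a minimal sketch of an on-policy update loop. It uses a toy two-armed bandit and a plain policy-gradient step as a stand-in for PPO’s clipped objective, and none of it is my actual gym environment; the point is structural: every batch is sampled under the current parameters and discarded immediately after the update.

```python
# A minimal on-policy sketch (toy bandit, vanilla policy gradient standing in
# for PPO's clipped objective). Everything here is illustrative, not my actual
# environment; the point is that each batch is only valid for the parameters
# that generated it, so it gets thrown away after every update.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)                  # policy parameters over 2 actions
true_reward = np.array([0.2, 0.8])    # hypothetical payoff probability per action

def sample_batch(n=64):
    """Collect a fresh batch with the *current* policy parameters."""
    probs = np.exp(logits) / np.exp(logits).sum()
    actions = rng.choice(2, size=n, p=probs)
    rewards = rng.binomial(1, true_reward[actions])
    return actions, rewards, probs

for step in range(200):
    actions, rewards, probs = sample_batch()       # on-policy rollouts
    baseline = rewards.mean()
    grad = np.zeros(2)
    for a, r in zip(actions, rewards):
        grad += (r - baseline) * (np.eye(2)[a] - probs)
    logits += 0.05 * grad / len(actions)
    # The batch is now stale: it was sampled under the old parameters, so it
    # gets discarded rather than stored for me (or anyone else) to reuse.
```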
At the time of developing it, this didn’t bother me much, as my unstated assumption was that sharing data is out of the question for our field. No red-teamer, whether working for a corporation or independently, would dare to share that data. Even if they wanted to, the security-critical nature of their work would mean that it just wasn’t possible. No employer would allow it, and certainly no client would agree to it. Imagine, for example, that you exported your entire Metasploit history for a given engagement into an action and state space of the kind I describe here. If you were to scrub host names from this data (leaving only numerical indicators: host 1 on subnet 1, or host 3 on subnet 2, etc.), the following information about your engagement could still be derived from those matrices (see the sketch after this list):
- How many hosts there were.
- The structure of the network, in terms of which machines shared subnets / were routable to and from each other.
- What ports were open on those machines, and what exploits were successfully run against those hosts/services.
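Here’s a hedged sketch of what such a scrubbed engagement might reduce to. The encoding below (one row per host, a handful of made-up feature columns) is hypothetical rather than my actual state space, but it shows how much topology survives once the hostnames are gone.

```python
# Hypothetical scrubbed engagement matrix: one row per host, columns are
# [subnet id, port 22 open, port 445 open, exploit A succeeded, exploit B succeeded].
# The layout is invented for illustration, not my actual state space.
import numpy as np

state = np.array([
    [1, 1, 0, 0, 0],   # host 1, subnet 1: SSH exposed, nothing landed
    [1, 1, 1, 0, 1],   # host 2, subnet 1: SMB exposed, exploit B worked
    [2, 0, 1, 1, 0],   # host 3, subnet 2: SMB exposed, exploit A worked
])

num_hosts = state.shape[0]
hosts_per_subnet = {int(s): int((state[:, 0] == s).sum()) for s in np.unique(state[:, 0])}
compromised = (np.flatnonzero(state[:, 3:].any(axis=1)) + 1).tolist()

print(f"{num_hosts} hosts, subnet layout {hosts_per_subnet}, "
      f"hosts with successful exploits: {compromised}")
```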
That some machines existed on some network and were vulnerable to some exploits wouldn’t seem to qualify as a smoking-gun security risk, but it’s certainly more than I would be comfortable with as a client. If someone with access to that data could determine the identity of the sender (the pentester) and figure out which client that tester had been working with at the time, they would have a decent mapping of the network and its holes at that timestep. And that’s just for reinforcement learning! Since writing that post, I’ve wondered whether Transformers could instead be trained directly on multi-modal data coming from the terminal/browser. It’s in vogue, and it would probably be fruitful research, but that data is even harder to get. The required fidelity is even greater, and what someone might learn from it is even more likely to prevent a sound-minded person from ever sharing it. So why bother working on it?
Similarly, I’ve recently been working on ML-based static malware classification. I’ve found that subfield plagued by a similar data problem. End-to-end deep learning solutions, at least those being published academically, are losing to their feature-engineered peers. MLSec 2021, a for-dollar-prizes competition to see who could classify malware best, was won by a Random Forest! No knock against the Secret team for their models, it’s great work, but in my experience these methods only outperform deep learning when the distribution you’re modelling is simple or the datasets are small. But why should the datasets for malware classification be small? There are enormous amounts of unique malware samples, well over [a billion](https://www.av-test.org/en/statistics/malware/) of them! And yet there is no “benchmark malware classification” dataset.
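For illustration, here’s the general shape of that feature-engineered approach: a hand-built byte-histogram extractor feeding a Random Forest. This isn’t the winning team’s pipeline, and the corpus below is synthetic stand-in data, which is rather the point; the features and the model are the easy part, the labeled binaries are not.

```python
# A sketch of "hand-built features + classic model" static classification.
# Not the MLSec winners' pipeline; the corpus is random stand-in bytes with
# synthetic labels, because the real labeled binaries are the scarce part.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def byte_histogram(blob: bytes) -> np.ndarray:
    """Normalized 256-bin histogram of raw byte values."""
    counts = np.bincount(np.frombuffer(blob, dtype=np.uint8), minlength=256)
    return counts / max(len(blob), 1)

rng = np.random.default_rng(0)
blobs = [rng.bytes(4096) for _ in range(200)]        # stand-ins for binaries
labels = rng.integers(0, 2, size=200)                # stand-ins for malware/goodware labels

X = np.stack([byte_histogram(b) for b in blobs])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(X[:5]))
```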
One of the big boons to deep learning, the thing that pushes forward technical progress, is benchmark datasets. The ideal benchmark dataset is difficult enough that substantive progress on it requires serious breakthroughs. ImageNet, for example, was a large and broad enough dataset that doing classification well required convolutional neural nets. When researchers refer to the ImageNet Moment, they’re referring to the 2012 rendition of the ImageNet classification challenge, where AlexNet won with a more than ten-percentage-point lead over all of its competitors, spawning 80,000 citations and a whole slew of technical innovation in the years to follow. But ImageNet itself was created in 2009. Would computer vision have had the same boon without ImageNet setting the bar against which all algorithms were measured? We can’t know for sure, but it’s clear that Yann LeCun’s work on CNNs in the late 80s had been largely ignored until its success in AlexNet. Perhaps the benchmark dataset and its challenge were a prerequisite.
If we can take that as an example of a benchmark’s importance, computer vision isn’t alone. DeepMind’s AlphaFold was a gigantic step forward for a very different problem: protein folding. This too is built on a longstanding competition, CASP (Critical Assessment of Protein Structure Prediction). If you’ll allow a looser definition of “benchmark dataset”, the DARPA Grand Challenge shaped the development of self-driving. The list goes on.
The MLSec competition, on the other hand, provides about fifty samples. Any model you can deliver is perfectly acceptable, but the data you collect must be your own. That considered, my opinion is that the MLSec competition is as much a dataset-collection challenge as it is a modelling challenge, if not more so. There’s some evidence to back that up. Andy Applebaum gave a very interesting talk at CactusCon this year about his own process of earning third place. At around 11:08, Andy describes trying to collect more malware/goodware for this challenge. Acquiring a dataset seems to have absorbed the vast majority of his time, and there was never enough of it.
This problem isn’t limited to these competitions. It’s true academically as well. Both the feature engineering and deep learning methods refer to datasets created with industry partners that they can’t share access to. The EMBER paper refers explicitly to performing better than MalConvNet against their test dataset. But you can’t pull the data and test that for yourself; you just have to take their word for it. Two algorithms compared on different test sets don’t prove anything; the comparisons are barely meaningful.
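A toy demonstration of why that matters: the same classifier posts very different numbers depending on how hard each lab’s hidden evaluation split happens to be, so accuracy figures quoted against private test sets don’t transfer across papers. The data below is synthetic and purely illustrative.

```python
# Synthetic illustration: one fixed classifier, two different "private" test
# sets whose difficulty differs, and two accuracy numbers that can't be
# meaningfully compared with each other.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_split(n, separation):
    """Two Gaussian classes whose overlap is controlled by `separation`."""
    X = np.vstack([rng.normal(0, 1, (n, 8)), rng.normal(separation, 1, (n, 8))])
    y = np.r_[np.zeros(n), np.ones(n)]
    return X, y

X_train, y_train = make_split(500, separation=1.0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, sep in [("lab A's easier hidden test set", 2.0),
                  ("lab B's harder hidden test set", 0.3)]:
    X_test, y_test = make_split(500, separation=sep)
    print(f"{name}: accuracy {clf.score(X_test, y_test):.2f}")
```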
This isn’t their fault, obviously. Hosting malware might be a bit of a faux pas, but it’s probably the easier half: malware authors don’t have intellectual property lawyers. The commercial goodware, on the other hand, does, and hosting the raw binaries for the sake of “research” won’t fly. So papers are published and competitions are won with datasets you can’t see, comparing test results you can’t replicate. The field suffers as a result.
From this it seems clear that without large, representative, shareable datasets, the field will not make progress, at least not publicly. Further technical achievements will belong only to those private organizations that can afford to pay large sums for access to data and guard it as the moat their products are built on.
I don’t think that’s healthy.
PhreakAI will be following EleutherAI’s lead with The Pile, gathering and hosting large datasets for infosec. These might not quite match the inference distribution, but it would be a start.
If you’re interested, join the PhreakAI Discord. It might be fun.