ROOTS 2020: No Need to Teach New Tricks to Old Malware: Winning an Evasion Challenge with XOR-based Adversarial Samples – Fabrício Ceschin

Sanna/ November 12, 2020/ ROOTS

Adversarial machine learning is so popular nowadays that Machine Learning (ML) based security solutions have become the target of many attacks and, as a consequence, need to adapt to them to remain effective. In our talk, we explore attacks on different ML models used to detect malware, as part of our experience in the Machine Learning Security Evasion Competition (MLSEC) 2020, sponsored by Microsoft and CUJO AI’s Vulnerability Research Lab, in which we finished in first and second position in the attacker’s and defender’s challenges, respectively.

During the contest’s first edition (2019), participating teams were challenged to bypass three ML models in a white-box manner. Our team bypassed all three and reported interesting insights about the models’ weaknesses. This year, the challenge evolved into an attack-and-defense format: teams could propose defensive models and attack other teams’ models in a black-box manner. Despite the increased difficulty, our team again bypassed all models, which allowed us to present interesting insights on attacking models as well as on defending them from adversarial attacks.

In particular, we showed how frequency-based models (e.g., TF-IDF) are vulnerable to the addition of dead function imports, and how models based on raw bytes are vulnerable to payload-embedding obfuscation (e.g., XOR and base64 encoding). One of the main contributions of this work is to show that adversarial attacks are more practical against real-life models than previously thought, affecting even antivirus products used by end users.
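To make the two weaknesses above more concrete, here is a minimal, simplified sketch in Python. It is not the code released by the authors. It only illustrates, under stated assumptions, why relative term frequencies shift when unused imports are appended, and why an XOR plus base64 encoding changes the raw bytes a model sees while remaining fully reversible. The import names, the single-byte key 0x5A, and the helper names term_frequencies and xor_bytes are all hypothetical, and the payload is a harmless placeholder string.

```python
import base64
from collections import Counter


def term_frequencies(tokens):
    """Return the relative frequency of each token (only the TF part of TF-IDF)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a single-byte key; applying it twice restores the input."""
    return bytes(b ^ key for b in data)


# Assumption: a toy import list for a sample, plus imports that are never called.
original_imports = ["VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"]
dead_imports = ["GetSystemMetrics", "LoadIconW", "GetWindowTextW"] * 3

# The weight of each original import drops once the padding imports are appended.
tf_before = term_frequencies(original_imports)
tf_after = term_frequencies(original_imports + dead_imports)

# Assumption: a harmless placeholder standing in for embedded content.
payload = b"example payload bytes"
key = 0x5A  # hypothetical single-byte key

# Encoding changes the byte sequence; decoding recovers the original exactly.
encoded = base64.b64encode(xor_bytes(payload, key))
decoded = xor_bytes(base64.b64decode(encoded), key)

if __name__ == "__main__":
    print("TF of VirtualAlloc before:", tf_before["VirtualAlloc"])
    print("TF of VirtualAlloc after: ", tf_after["VirtualAlloc"])
    print("payload visible in encoded bytes:", payload in encoded)
    print("round trip restores payload:", decoded == payload)
```

In this toy setting, the frequency of each original import falls from one third to one twelfth after padding, which hints at how a frequency-based feature vector can drift toward a benign-looking profile. A real TF-IDF pipeline also applies inverse document frequency weighting and normalization, which this sketch leaves out.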

We asked Fabrício a few more questions about his talk.

Please tell us the top 5 facts about your talk.

We describe the experience in an ML-based malware detection evasion challenge.
We describe our defensive ML model and discuss considerations to be made when developing a detection model.
We present the attack techniques we leveraged in the contest to bypass all ML models.
We discuss the impact of adversarial malware in practice via the detection rate of the evasive samples when inspected by real AVs.
We release code and a platform for the development of experiments with adversarial malware.

How did you come up with it? Was there something like an initial spark that set your mind on creating this talk?

Everything started back in 2019, with our first experience in the Machine Learning Security Evasion Competition (MLSEC), where we finished in second position and produced a paper reporting our experience. This year, the challenge was more complex, given that participants needed to create defense solutions (defender’s challenge) that were then attacked by everybody (attacker’s challenge). This motivated us to test a research model that we developed in 2018 and to produce new attacks in order to improve ML-based malware detectors.

Why do you think this is an important topic?

Machine learning is being applied to a wide variety of problems and all of them are exposed to adversarial attacks. Understanding these attacks is an important topic, especially in cybersecurity where solutions need to detect as many attacks as possible, given that many of them are produced on a daily basis and everyone might be exposed to them.

Is there something you want everybody to know – some good advice for our readers maybe?

We want everybody to know that ML models are easily exposed to adversarial attacks, even the most complex ones used in real-world cybersecurity defense solutions. Our talk will show it, so come to my talk! 😀

A prediction for the future – what do you think will be the next innovations or future downfalls when it comes to your field of expertise / the topic of your talk in particular?

Creating robust ML models is key to developing future defense solutions that can withstand different types of adversarial attacks, which may become even more sophisticated due to the arms race between attackers and defenders. Our prediction for the future is that both defenders and attackers will improve, but defenders must be aware of the insights produced by attackers and use them to improve their solutions.

Fabrício Ceschin is a Ph.D. student at the Federal University of Paraná, Brazil, where he received his M.S. degree in informatics. He received a Google Latin America Research Award in 2017/2018. His research interests include machine learning applied to cybersecurity, such as data streams, concept drift, and adversarial machine learning.