DeepSec 2024 Talk: Detecting Phishing using Visual Similarity – Josh Pyorre

Sanna/ October 10, 2024/ Conference

Current phishing detection methods include analyzing URL reputation and patterns, hosting infrastructure, and file signatures. However, these approaches may not always detect phishing pages that mimic the look and feel of previously observed attacks.

This talk explores an approach to detecting similar phishing pages by creating a corpus of visual fingerprints from known malicious sites. By taking screenshots, calculating hash values, and storing metadata, we can build a reference library to compare against newly crawled suspicious URLs. By combining fuzzy searches and OCR techniques with other methods, we can identify similar matches.
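To make the idea of a visual fingerprint concrete, here is a minimal sketch and not the code from the talk. The abstract does not say which hash is used, so the average hash below is an assumption, as are the names `average_hash`, `hamming`, `best_match`, the label `login-kit-a`, and the distance threshold of 10. To stay self-contained, it works on plain grayscale pixel grids; a real pipeline would first decode and downscale the screenshot with an imaging library.

```python
from typing import Dict, List, Optional, Tuple

Grid = List[List[int]]  # grayscale pixels (0-255), one inner list per row


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Block-average the image down to size x size cells, then emit one bit
    per cell: 1 if the cell is brighter than the mean of all cells.
    Assumes the image is at least size pixels wide and tall."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def best_match(candidate: int, library: Dict[str, int],
               max_distance: int = 10) -> Optional[Tuple[str, int]]:
    """Return (label, distance) of the closest known-bad hash, or None if
    nothing in the library is within max_distance bits (assumed threshold)."""
    if not library:
        return None
    label = min(library, key=lambda name: hamming(candidate, library[name]))
    distance = hamming(candidate, library[label])
    return (label, distance) if distance <= max_distance else None


def banded_page(header_rows: int, side: int = 64) -> Grid:
    """Synthetic 'screenshot': dark header band above a bright body."""
    return [[30 if y < header_rows else 220 for _ in range(side)]
            for y in range(side)]


def split_page(side: int = 64) -> Grid:
    """Synthetic unrelated layout: dark left half, bright right half."""
    return [[30 if x < side // 2 else 220 for x in range(side)]
            for _ in range(side)]


# Hypothetical reference library with a single known-bad fingerprint.
library = {"login-kit-a": average_hash(banded_page(16))}

# A near copy of the known page with a small patch of changed pixels.
variant = banded_page(16)
for y in range(40, 44):
    for x in range(40, 44):
        variant[y][x] = 200

print(best_match(average_hash(variant), library))
print(best_match(average_hash(split_page()), library))
```

Because the hash only records whether each region is brighter or darker than average, small pixel changes tend to leave most bits untouched, while a different layout flips many of them. That tolerance is what makes a fuzzy lookup possible where an exact file hash would fail.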

We asked Josh a few more questions about his talk.

Please tell us the top 5 facts about your talk.

In security, URL block lists are widely used, but I rarely see people utilizing a database of visual information to hunt for phishing attacks that are similar or part of a campaign. This presentation walks through that process.
To build a malicious dataset of screenshots and other artifacts, I built a web crawler on my home network. When it was too slow, I started running 20 of them in a docker-compose environment. When that wasn’t fast enough, I set up VPS servers around the world to run my crawlers and started sending URLs by location to their nearest crawler. I ran the crawlers non-stop for about a month, collecting data from malicious URL datasets. This is all covered in the presentation and the included code, as it’s important to talk about the engineering efforts that are behind security. Those roles are often separated, but I believe security research and security engineering need to know each other’s domains and work together.
Building the architecture to make the research possible for this presentation resulted in the creation of several really useful tools that I’ll be sharing via GitHub at the conference. One tool will geolocate and visually map the locations of a list of URLs to see where they are hosted (this was made to find out where I should build my web crawlers). Another tool allows uploads of many screenshots that will be grouped together for mass deletion or addition to a database (this made filtering through thousands of screenshots faster when building the malicious dataset).
With only two months to build everything from scratch for this presentation, I worked late into most nights, occasionally employing a home-built AI server running a local LLM to do some of the more tedious coding. I’ve reviewed and refactored any code, so it’s not just a copy/paste from AI. Most of the code I made the AI help with was JavaScript (I dislike coding JavaScript). This presentation shows creative methods of alerting and visualization. As both an artist and someone who has spent years looking at lines and lines of logs, displayed in one spreadsheet-like interface after another, I seek to creatively explore new methods of distilling and conveying information. Another reason to explore different alerting and research methods is to show that our work is not just scientific, but also artistic. It can be playful and useful at the same time.
It’s difficult to build a malicious dataset based on screenshots. There are only so many services that phishing tries to impersonate. I sent around 100,000 confirmed malicious URLs through my crawlers to get as much data as possible. These have been distilled down to the most essential screenshots. The presentation and code will also use other indicators to support convictions.
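The step of sending URLs by location to their nearest crawler can be sketched roughly as follows. This is an illustration only, not the tooling from the talk: the crawler names and coordinates, the example URLs, and the functions `haversine_km`, `nearest_crawler`, and `assign` are all hypothetical, and the host coordinates are assumed to come from a separate GeoIP lookup that is not shown.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, h)))


def nearest_crawler(host: Coord, crawlers: Dict[str, Coord]) -> str:
    """Name of the crawler geographically closest to the hosting location."""
    return min(crawlers, key=lambda name: haversine_km(host, crawlers[name]))


def assign(urls: Dict[str, Coord],
           crawlers: Dict[str, Coord]) -> Dict[str, List[str]]:
    """Build one work queue per crawler from geolocated URLs."""
    queues: Dict[str, List[str]] = {name: [] for name in crawlers}
    for url, coord in urls.items():
        queues[nearest_crawler(coord, crawlers)].append(url)
    return queues


# Hypothetical crawler locations (approximate city coordinates).
CRAWLERS = {
    "frankfurt": (50.11, 8.68),
    "singapore": (1.35, 103.82),
    "new_york": (40.71, -74.01),
}

# Hypothetical URLs with coordinates from an assumed GeoIP lookup.
URLS = {
    "http://login.example.test/a": (48.86, 2.35),     # hosted near Paris
    "http://secure.example.test/b": (35.68, 139.69),  # hosted near Tokyo
    "http://verify.example.test/c": (43.65, -79.38),  # hosted near Toronto
}

for crawler, queue in assign(URLS, CRAWLERS).items():
    print(crawler, queue)
```

Distance is only a proxy for network latency, so a production setup might also weigh crawler load or measured round-trip times; the sketch keeps to geography because that is what the text describes.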

How did you come up with it? Was there something like an initial spark that set your mind on creating this talk?

Most of my work involves researching domains, IP addresses, and URLs to determine what is bad and what’s good. I work with the semantics of URLs, hosting infrastructure, DNS, and more. All of these are fast and easy to work with, so I’ve typically avoided exploring other options.
However, when I need to see what’s actually at a URL, I typically load up Tor or a virtual machine and visit the domain to see what’s hosted. This takes some time, but provides valuable information just from visual inspection, HTML and JavaScript source analysis, and any redirections.
Since visiting every endpoint myself isn’t practical, I’m often missing an entire world of indicators that could lead to more detection. I realized it would be fun to build something where I can send a stream of URLs to be visually inspected and compared with similar, known-bad destinations, resulting in an alerting system and a potential new method of tracking specific phishing campaigns.

Why do you think this is an important topic?

This is a new way to look for malicious activities that can supplement the actions we in the security industry already employ. It also explores creative methods of exploration that can lead to new types of analysis and discovery. I initially was a little wary of this technique as it involves actually visiting a potentially malicious destination, but computing is cheap and plentiful, and it’s OK to do more than passive analysis if you take measures to obfuscate who you are and what you’re trying to do.

Is there something you want everybody to know – some good advice for our readers maybe?

I’m going to show some outside-the-box thinking and methodologies that will hopefully leave you wanting to contribute or explore new methods of analysis, detection, and reporting. I try to make doing security a fun activity.

A prediction for the future – what do you think will be the next innovations or future downfalls when it comes to your field of expertise / the topic of your talk in particular?

On the topic of phishing in general, generative AI is already allowing threat actors to make more convincing phishing emails, with text properly written in the recipient’s language. As phishing fools its victims more easily, analysis techniques will need to become more advanced and work at a larger scale.

Josh Pyorre is a Security Researcher with Cisco Talos. He’s been in security since 2000 with NASA, Mandiant, and other organizations. Josh has presented at many conferences, such as DEFCON, B-Sides, Derbycon, DeepSec, Qubit, and others. His professional interests involve network, computer and data security with a goal of maintaining and improving the security of as many systems and networks as possible. He writes dark electronic music under the name Die Vortex.