The Risk of Faulty Metrics and Statistics

It’s never a bad idea to see what the outside world looks like. If you intend to go for a walk, you will probably consult the weather report in advance. If you plan to invest money (either for fun or for savings), you will most certainly gather information about the risks involved. There are a lot of reports out there about the IT security landscape, too. While there is nothing wrong with reading reports, you must know what you read, how the data was procured and how it was processed. Not everything that talks percentages or numbers has anything to do with statistics.

Let’s talk about metrics by using an example. Imagine an Internet service provider introduced a „real-time map of Cyber attacks“. The map would show attacks on their „honeypot“ systems at 90 locations worldwide. Let’s say that the bait hosts record up to 450,000 attacks per day. Furthermore, let’s assume the data is published on a web page with real-time updates of counters, date, country of origin, attacked subsystem and target sensor type. What does this give you in terms of risk assessment? Answer: A lot of open questions.

  • What is „cyber“? What is an „attack“? What is a „cyber attack“?
    „Cyber“ could be anything. It is most definitely something digital. It usually is the Internet, but why call it „cyber“ then? If „cyber“ really means the Internet, then we need to know which one. The IPv4 Internet? The IPv6 Internet? Both? The Internet2? If only the „honeypot“ section is included, what does it look like? The answer is really important to make sense of the numbers.
    What about the „attack“ part? What is being detected? Directed exploits? Port scans? Which protocols? Is payload being injected? We are not interested in misguided web browsers or background noise.
  • How many „honeypot“ systems are in place? Does the number change?
    If you change the number of your sensors, then the number you publish no longer makes much sense. Reducing the number of sensors doesn’t actually reduce the risk for exposed systems. Likewise, increasing the number of sensors is no rational justification to yell „Fire!“ in a crowded chat room. Once you play with numbers, please establish what they really mean. Context is king.
  • What does „450,000 attacks per day“ mean?
    Since we know neither what exactly an „attack“ is nor how many sensors we are dealing with, the number 450,000 could be generated by a single port scanner. Basically we are down to the statement „cyber gizmos receive quite some packets per day“. We already knew that in 1995. And before.
  • What kind of attacks can the „honeypot“ systems detect?
    We need to know what the sensors can detect. We need to know operating system, application, protocol, description of vulnerability (please don’t let the efforts of categorisation be in vain!) and more. Without this information no one operating a real life system can sensibly estimate any risks from the dashboard.
  • How is the country of origin determined?
    If the dashboard claims the country of origin, how was the country information deduced? In a world where the proper attribution might decide where the drones go for retaliation (hint: you might get killed by accident), you want to be extra careful with the term origin! A simple WHOIS lookup on spoofed origin addresses is nice, but the dashboard doesn’t say anything about the procedure involved behind the scenes.
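The point about changing sensor counts can be made concrete with a minimal sketch. The figures 450,000 and 90 are taken from the example above; everything else is an assumption for illustration: we pretend there is exactly one sensor per location, the second day is invented, and the function name is hypothetical.

```python
def events_per_sensor(total_events: int, sensor_count: int) -> float:
    """Return the average number of recorded events per sensor."""
    if sensor_count <= 0:
        raise ValueError("sensor_count must be positive")
    return total_events / sensor_count


# Day one: the example from the text (assuming one sensor per location).
day_one = events_per_sensor(450_000, 90)

# Day two (invented): the operator doubles the sensors and the raw counter doubles.
day_two = events_per_sensor(900_000, 180)

# The headline counter doubled, the per-sensor figure did not move at all.
print(day_one, day_two)
```

Even the per-sensor figure is only as meaningful as the definition of „attack“ behind it; the normalisation removes just one of the open questions.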

So a half-hearted dashboard with colourful graphs, charts and numbers won’t get you anywhere. Once the metrics and the methods are flawed, all you get is digital manure. Picking meaningful metrics is difficult. You can’t just take anything that can be counted and put it into databases. You have to attach meaning and a link to real threats to real production networks.

Even if you disregard the technical issues and use „incidents“ instead, you may run into trouble. First of all there is the problem of publishing security incidents. You will always have a difference between detected and undetected breaches, and there will always be organisations that won’t disclose some or all of their incidents. Then there is the growth of the Internet. The number of connected systems is rising. It is pretty hard to count all active hosts and their services. We are not even talking about IPv6 where special methods of enumeration work better, given the vast address space. So if you see reports claiming that the number of attacks is rising, well, the number of connected hosts is rising, too. Even here you have to get the meaning and metrics right.
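To illustrate why a rising count alone says little, here is a small sketch that relates incidents to the number of connected hosts. All numbers are invented, and the function name is hypothetical; it also assumes you have a trustworthy host count, which, as noted above, is hard to get.

```python
def incidents_per_1000_hosts(incidents: int, hosts: int) -> float:
    """Return reported incidents per 1,000 connected hosts."""
    if hosts <= 0:
        raise ValueError("hosts must be positive")
    return 1000 * incidents / hosts


# Invented figures: incidents rise by 20 %, hosts rise by 50 %.
last_year = incidents_per_1000_hosts(2_000, 1_000_000)
this_year = incidents_per_1000_hosts(2_400, 1_500_000)

# The absolute count went up, the rate per host went down.
print(last_year, this_year)
```

This only covers reported incidents. Undetected and undisclosed breaches are missing from both years, so the rate is still a lower bound of unknown quality.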

If you have dealt with meaning, metrics and statistics, are a security researcher and have ideas on how to assess risks and the state of IT security closer to reality, then you are invited to share your findings with us at the DeepSec conference or the DeepINTEL seminar. In case you know that statistics is a part of mathematics and is a lot more than simply dividing two numbers to get a percentage, we will be very happy! ☺
