We’re often asked about the data sources behind our cyber risk research. We generally include these details in the appendix of our Information Risk Insights Studies (IRIS), but those only come around once or twice a year. Plus, we’re always looking to expand our cyber risk data sources and partners, so the list evolves over time. This post will serve as a regularly updated reference for our primary sources of frequency and loss data.

A thick fog of FUD — fear, uncertainty, and doubt — obscures cyber risk analysis. InfoSec teams struggle to measure risk exposure, prioritize the most effective mitigation, and credibly demonstrate success to stakeholders. Much of this struggle stems from a lack of reliable data on cyber risk factors.

The Cyentia Institute’s ongoing research in the Information Risk Insights Study (IRIS) series seeks to close this gap with credible analysis grounded in historical data on cyber risk trends. This research is only as good as the data on which it is built, which is why we’re always seeking new sources of reliable cyber risk intelligence.

Data Sources

Cyentia’s cyber risk research draws heavily upon three datasets:

Zywave Cyber Loss Data
Feedly for Threat Intelligence
Board Cybersecurity Incident Tracker

Zywave’s Cyber Loss Data forms the foundation of our analysis because of its breadth (over 150,000 security incidents), history (spanning decades), inclusion of financial losses, and firmographic data. The data is compiled from publicly available sources, such as breach disclosures, company filings, litigation details, and Freedom of Information Act requests. It is used extensively by cyber (re)insurers to assess risk and price premiums. More info: https://www.advisenltd.com/data/cyber-loss-data/

Feedly for Threat Intelligence is a platform that continually scours thousands of trusted open web sources to extract actionable intelligence for cybersecurity teams. The platform uses over 1,000 pre-trained AI models to filter out noise and identify cyber threats, including attacks, threat actors, TTPs, and IOCs. We use Feedly for Threat Intelligence to enrich incidents with contextual threat information. More info: https://feedly.com/threat-intelligence

Board Cybersecurity is a governance-oriented platform that aggregates and structures cyber risk signals from SEC disclosures, state AG notifications, news reports, and governance filings. The dataset currently spans over 12,000 companies and roughly 9,000 incidents. We incorporate this data into both our frequency and loss models. More info: https://www.board-cybersecurity.com/

Cyentia extensively processes all of these sources to extend and enrich the combined dataset for our research and services. This processing combines classification models, natural language processing (NLP), AI-supported evidence gathering, taxonomy mapping, malware behavioral analysis, and manual research by our analysts.

Analytical Methodology

Incident Likelihood

We model incident frequency by asking: “What’s the likelihood of a firm experiencing an incident in the next year?”

To that end, we divide our historical dataset into rolling 12-month windows and count incidents per firm in each window. This yields a large number of observations, which lets us model the annualized loss event frequency with more confidence.
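
The windowing step can be sketched roughly as follows. This is a minimal illustration rather than the actual Cyentia pipeline: the incident records, the firm names, and the one-month step between window starts are all assumptions made for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical incident records: (firm_id, incident_date)
incidents = [
    ("firm_a", date(2021, 3, 14)),
    ("firm_a", date(2021, 11, 2)),
    ("firm_a", date(2022, 6, 30)),
    ("firm_b", date(2022, 1, 9)),
]

def month_index(d):
    """Convert a date to an integer month count so months can be compared."""
    return d.year * 12 + (d.month - 1)

def rolling_counts(incidents, firms, first, last, window_months=12):
    """Count incidents per firm in each rolling window.

    Windows start at every month from `first` and must end on or before
    the month of `last`. Returns {(firm, window_start_month): count},
    including zero counts for firms with no incidents in a window.
    """
    counts = Counter()
    for firm, d in incidents:
        m = month_index(d)
        # An incident in month m belongs to every window starting in m-11..m
        for start in range(m - window_months + 1, m + 1):
            counts[(firm, start)] += 1
    starts = range(month_index(first), month_index(last) - window_months + 2)
    return {(f, s): counts.get((f, s), 0) for f in firms for s in starts}

observations = rolling_counts(
    incidents,
    firms=["firm_a", "firm_b"],
    first=date(2021, 1, 1),   # first observed month (assumed)
    last=date(2022, 12, 1),   # last observed month (assumed)
)
```

Two years of history for two firms produce 26 firm-window observations here, which shows how rolling windows multiply the number of data points. Note that overlapping windows share incidents, so the observations are not independent of one another.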

These observations are treated as samples from an underlying probability distribution (negative binomial) and are fed into random effects models to estimate distribution parameters both overall and within specific slices like industry and revenue bands. The results are closed-form estimates of the probability that an organization will experience a certain number of incidents in a given year.
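
To make the "closed-form" point concrete, here is a small sketch of how a probability falls out of a negative binomial once its parameters are known. The mean and dispersion values below are invented for illustration and are not estimates from our models; the random effects fitting step is not shown.

```python
import math

def nbinom_pmf(k, mu, r):
    """Probability of exactly k incidents in a year under a negative binomial
    with mean mu and dispersion (size) r, computed in log space for stability."""
    log_p = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )
    return math.exp(log_p)

# Hypothetical parameters for one industry/revenue slice (assumed values)
mu, r = 0.3, 0.5

p_zero = nbinom_pmf(0, mu, r)          # no incidents in the year
p_at_least_one = 1 - p_zero            # one or more incidents
p_two_or_more = 1 - p_zero - nbinom_pmf(1, mu, r)
```

Under this parameterization, a smaller r means more variation between firms for the same average rate, which is one reason a negative binomial can suit incident counts better than a Poisson distribution.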

Financial Losses

While financial losses tend to be reported less often than other data points for cyber events, those that are reported tend to reflect direct losses that are easier to quantify (e.g., response costs or lost revenue) and/or identify from public records (e.g., class action suits or SEC filings). Indirect and intangible impacts usually aren’t captured.

The good news, from a data standpoint, is that the record of losses from major security incidents—like those we analyze—is more complete than for minor events due to increased visibility and reporting. Thus, we believe that our loss dataset is sufficient to form a well-supported model of cyber event losses over the last 15 years.

To build loss distribution models, we first adjust all loss amounts for inflation and then fit log-normal models with random effects that, as with frequency, account for various slices like industry and revenue bands.
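
A stripped-down version of those two steps might look like the sketch below. It is an illustration under stated assumptions: the loss records and price index values are placeholders (not real CPI figures or real incidents), and it fits a single pooled log-normal by maximum likelihood instead of the random effects models described above.

```python
import math
import statistics

# Placeholder price index by year (NOT real CPI values)
PRICE_INDEX = {2019: 95.0, 2020: 100.0, 2021: 105.0, 2023: 120.0}
BASE_YEAR = 2023

def adjust_for_inflation(amount, year, base_year=BASE_YEAR, index=PRICE_INDEX):
    """Express a loss reported in `year` in base-year dollars."""
    return amount * index[base_year] / index[year]

def fit_lognormal(losses):
    """Maximum-likelihood log-normal fit: mean and population standard
    deviation of the log losses."""
    logs = [math.log(x) for x in losses]
    return statistics.fmean(logs), statistics.pstdev(logs)

# Hypothetical reported losses: (year, amount in dollars)
reported = [(2019, 250_000), (2020, 1_000_000), (2021, 4_000_000), (2023, 600_000)]

adjusted = [adjust_for_inflation(amount, year) for year, amount in reported]
mu, sigma = fit_lognormal(adjusted)

median_loss = math.exp(mu)                  # typical loss
mean_loss = math.exp(mu + sigma ** 2 / 2)   # average loss, pulled up by the tail
```

The gap between the median and the mean is a quick way to see how heavy the tail of a fitted log-normal is: the larger sigma gets, the more the average is driven by a few extreme losses.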

If you’re looking for cyber risk data beyond what we publish in the IRIS series, consider becoming a member (free) or contracting our Retina risk intelligence service.