This post is co-authored with Sam Perl.
The CERT Division of the Software Engineering Institute (SEI) at Carnegie Mellon University recently released the Cyobstract Python library as an open source tool. You can use it to quickly and efficiently extract artifacts from free text in a single report, from a collection of incident reports, from threat assessment summaries, or any other textual source.
Cybersecurity teams need to collect and process data from incident reports, blog posts, news feeds, threat reports, IT tickets, incident tickets, and more. Often incident analysts are looking for technical artifacts to help in their investigation of a threat and development of a mitigation. This activity is often done manually by cutting and pasting between sources and tools. Cyobstract helps extract key artifact values from any kind of text, allowing incident analysts to focus their attention on processing the data once it’s extracted. Once artifact values are extracted, they can be more easily correlated across multiple datasets.
After using Cyobstract to extract security-relevant data types, the data can be used for higher level downstream analysis and investigation of source incident reports and data. The resulting data can also be loaded into a database, such as an indicator database. Using Cyobstract, the extracted artifacts can be used to quickly and easily find commonality and similarity across reports, thereby potentially revealing patterns that might otherwise have remained hidden.
At its core, the Cyobstract library is built around a collection of robust regular expressions that target 24 specific security-relevant data types of potential interest, including the following:
- IP addresses: IPv4, IPv4 CIDR, IPv4 range, IPv6, IPv6 CIDR, and IPv6 range
- hashes: MD5, SHA1, SHA256, and ssdeep
- Internet and system-related strings: FQDN, URL, user agent strings, email addresses, filenames, filepath, and registry keys
- Internet infrastructure values: ASN, ASN owner, country, and ISP
- security analysis values: CVE, malware, and attack type
Cyobstract is capable of extracting these artifacts even if they are malformed in some way. For example, in the incident response community, it’s often standard practice for teams to “defang” indicators of compromise before storing them or sending them to another team. Defanged indicator values can be difficult for automated solutions to extract. There is no standard practice for defanging, so there are many ways it can be done. Cyobstract was built from a large collection of real incident reports from many organizations so it can handle many ways of defanging that CERT researchers have observed in the field.
In addition to the core extraction library, Cyobstract includes developer tools that teams can use to craft their own regular expressions and capture custom security data types. Analysts typically do this by using a list of names to extract; however, over time, that list can be large and slow to use in matching. Using Cyobstract, analysts can input their lists of terms to match (such as a list of common malware names) and get an optimized regex as output. This expression is significantly faster than trying to match directly to a name on a list.
The library also includes benchmarking tools that can track the effect of changes to regular expressions and present the analyst with feedback on the overall effectiveness of the individual change (e.g., this regex change found three more reports than the previous version).
The Cyobstract library can be downloaded from GitHub at https://github.com/cmu-sei/cyobstract.