Learning to Rank Strings Output for Speedier Malware Analysis

Original Post from FireEye
Author: Philip Tully

Reverse engineers, forensic investigators, and incident responders
have an arsenal of tools at their disposal to dissect malicious
software binaries. When performing malware analysis, they successively
apply these tools in order to gradually gather clues about a binary’s
function, design detection methods, and ascertain how to contain its
damage. One of the most useful initial steps is to inspect its
printable characters via the Strings
program
. A binary will often contain strings if it performs
operations like printing an error message, connecting to a URL,
creating a registry key, or copying a file to a specific location –
each of which provide crucial hints that can help drive future analysis.

Manually filtering out these relevant strings can be time consuming
and error prone, especially considering that:

  • Relevant strings occur
    disproportionately less often than irrelevant strings.
  • Larger binaries can output upwards of tens of thousands of
    individual strings.
  • The definition of “relevant” can
    vary significantly across individual human analysts.

Investigators would never want to miss an important clue that could
have reduced their time spent performing the malware analysis, or even
worse, led them to draw incomplete or incorrect conclusions. In this
blog post, we will demonstrate how the FireEye Data Science (FDS) and
FireEye Labs Reverse Engineering (FLARE) teams recently collaborated
to streamline this analyst pain point using machine learning.

Highlights

  • Running the
    Strings program on a piece of malware inevitably
    produces noisy strings mixed in with important ones, which
    can only be uncovered after sifting and scrolling through
    the entirety of its messy output. FireEye’s new machine
    learning model that automatically ranks strings based on
    their relevance for malware analysis speeds up this process
    at scale.
  • Knowing which individual strings are
    relevant often requires highly experienced analysts.
    Quality, security-relevant labeled training data can be time
    consuming and expensive to obtain, but weak supervision that
    leverages the domain expertise of reverse engineers helps
    accelerate this bottleneck.
  • Our proposed
    learning-to-rank model can efficiently prioritize Strings
    outputs from individual malware samples. On a dataset of
    relevant strings from over 7 years of malware reports
    authored by FireEye reverse engineers, it also performs well
    based on criteria commonly used to evaluate recommendation
    and search engines.

Background

Each string returned by the Strings program is represented by
sequences of 3 characters or more ending with a null terminator,
independent of any surrounding context and file formatting. These
loose criteria mean that Strings may identify sequences of
characters as strings when they are not human-interpretable. For
example, if consecutive bytes 0x31, 0x33, 0x33, 0x37, 0x00 appear
within a binary, Strings will interpret this as “1337.”
However, those ASCII characters may not actually represent that string
per se; they could instead represent a memory address, CPU
instructions, or even data utilized by the program. Strings
leaves it up to the analyst to filter out such irrelevant strings that
appear within its output. For instance, only a handful of the strings
listed in Figure 1 that originate from an example malicious binary are
relevant from a malware analyst’s point of view.


Figure 1: An example Strings output
containing 44 strings for a toy sample with a SHA-256 value of eb84360ca4e33b8bb60df47ab5ce962501ef3420bc7aab90655fd507d2ffcedd.

Ranking strings in terms of descending relevance would make an
analyst’s life much easier. They would then only need to focus their
attention on the most relevant strings located towards the top of the
list, and simply disregard everything below. However, solving the task
of automatically ranking strings is not trivial. The space of relevant
strings is unstructured and vast, and devising finely tuned rules to
robustly account for all the possible variations among them would be a
tall order.

Learning to Rank Strings Output

This task can instead be formulated in a machine learning (ML)
framework called learning to
rank (LTR)
, which has been historically applied to problems like
information retrieval, machine translation, web search, and
collaborative filtering. One way to tackle LTR problems is by using Gradient
Boosted Decision Trees (GBDTs)
. GBDTs successively learn
individual decision trees that reduce the loss using a gradient
descent procedure, and ultimately use a weighted sum of every trees’
prediction as an ensemble. GBDTs with an LTR objective function can
learn class probabilities to compute each string’s expected relevance,
which can then be used to rank a given Strings output. We
provide a high-level overview of how this works in Figure 2.

In the initial train() step of Figure 2,
over 25 thousand binaries are run through the Strings program
to generate training data consisting of over 18 million total
strings. Each training sample then corresponds to the
concatenated list of ASCII and Unicode strings output by the
Strings program on that input file. To train the model, these
raw strings are transformed into numerical vectors containing natural
language processing features like Shannon entropy and character
co-occurrence frequencies, together with domain-specific signals like
the presence of indicators of compromise (e.g. file paths, IP
addresses, URLs, etc.), format strings, imports, and other relevant landmarks.


Figure 2: The ML-based LTR framework
ranks strings based on their relevance for malware analysis. This
figure illustrates different steps of the machine learning modeling
process: the initial train() step is denoted by solid arrows and
boxes, and the subsequent predict() and sort() steps are denoted by
dotted arrows and boxes.

Each transformed string’s feature vector is associated with a
non-negative integer label that represents their relevance for malware
analysis. Labels range from 0 to 7, with higher numbers indicating
increased relevance. To generate these labels, we leverage the subject
matter knowledge of FLARE analysts to apply heuristics and impose
high-level constraints on the resulting label distributions. While
this weak
supervision approach
may generate noise and spurious errors
compared to an ideal case where every string is manually labeled, it
also provides an inexpensive and model-agnostic way to integrate
domain expertise directly into our GBDT model.

Next during the predict() step of Figure
2, we use the trained GBDT model to predict ranks for the strings
belonging to an input file that was not originally part of the
training data, and in this example query we use the Strings
output shown in Figure 1. The model predicts ranks for each string in
the query as floating-point numbers that represent expected relevance
scores, and in the final sort() step of
Figure 2, strings are sorted in descending order by these scores.
Figure 3 illustrates how this resulting prediction achieves the
desired goal of ranking strings according to their relevance for
malware analysis.


Figure 3: The resulting ranking on the
strings depicted in both Figure 1 and in the truncated query of
Figure 2. Contrast the relative ordering of the strings shown here
to those otherwise identical lists.

The predicted and sorted string rankings in Figure 3 show
network-based indicators on top of the list, followed by registry
paths and entries. These reveal the potential C2 server and malicious
behavior on the host. The subsequent output consisting of user-related
information is more likely to be benign, but still worthy of
investigation. Rounding out the list are common strings like Windows
API functions and PE artifacts that tend to raise no red flags for the
malware analyst.

Quantitative Evaluation

While it seems like the model qualitatively ranks the above strings
as expected, we would like some quantitative way to assess the model’s
performance more holistically. What evaluation criteria can we use to
convince ourselves that the model generalizes beyond the coverage of
our weak supervision sources, and to compare models that are trained
with different parameters?

We turn to the recommender systems literature, which uses the Normalized
Discounted Cumulative Gain
(NDCG) score to evaluate ranking of
items (i.e. individual strings) in a collection (i.e. a Strings
output). NDCG sounds complicated, but let’s boil it down one letter at
a time:

  • “G” is for gain, which
    corresponds to the magnitude of each string’s relevance.
  • “C” is for cumulative, which refers to the cumulative gain or
    summed total of every string’s relevance.
  • “D” is for
    discounted, which divides each string’s predicted relevance by a
    monotonically increasing function like the logarithm of its ranked
    position, reflecting the goal of having the most relevant strings
    ranked towards the top of our predictions.
  • “N” is for
    normalized, which means dividing DCG scores by ideal DCG scores
    calculated for a ground truth holdout dataset, which we obtain from
    FLARE-identified relevant strings contained within historical
    malware reports. Normalization makes it possible to compare scores
    across samples since the number of strings within different
    Strings outputs can vary widely.


Figure 4: Kernel Density Estimate of
NDCG@100 scores for Strings outputs from the holdout dataset. Scores
are calculated for the original ordering after simply running the
Strings program on each binary (gray) versus the predicted ordering
from the trained GBDT model (red).

In practice, we take the first k strings indexed by their
ranks within a single Strings output, where the k
parameter is chosen based on how many strings a malware analyst will
attend to or deem relevant on average. For our purposes we set k =
100 based on the approximate average number of relevant strings
per Strings output. NDCG@k scores are bounded between 0
and 1, with scores closer to 1 indicating better prediction
quality in which more relevant strings surface towards the top. This
measurement allows us to evaluate the predictions from a given model
versus those generated by other models and ranked with different algorithms.

To quantitatively assess model performance, we run the strings from
each sample that have ground truth FLARE reports though the predict() step of Figure 2, and compare their
predicted ranks with a baseline of the original ranking of strings
output by Strings. The divergence in distributions of NDCG@100
scores between these two approaches demonstrates that the trained GBDT
model learns a useful structure that generalizes well to the
independent holdout set (Figure 4).

Conclusion

In this blog post, we introduced an ML model that learns to rank
strings based on their relevance for malware analysis. Our results
illustrate that it can rank Strings output based both on
qualitative inspection (Figure 3) and quantitative evaluation of
NDCG@k (Figure 4). Since Strings is so commonly applied
during malware analysis at FireEye and elsewhere, this model could
significantly reduce the overall time required to investigate
suspected malicious binaries at scale. We plan on continuing to
improve its NDCG@k scores by training it with more high
fidelity labeled data, incorporating more sophisticated modeling and
featurization techniques, and soliciting further analyst feedback from
field testing.

It’s well known that malware authors go through great lengths to
conceal useful strings from analysts, and a potential blind spot to
consider for this model is that the utility of Strings itself
can be thwarted by obfuscation. However, open source tools like the FireEye
Labs Obfuscated Strings Solver (FLOSS)
can be used as an in-line
replacement for Strings. FLOSS automatically extracts printable
strings just as Strings does, but additionally reveals
obfuscated strings that have been encoded, packed, or manually
constructed on the stack. The model can be readily trained on FLOSS
outputs to rank even obfuscated strings. Furthermore, since it can be
applied to arbitrary lists of strings, the model could also be used to
rank strings extracted from live memory dumps and sandbox runs.

This work represents a collaboration between the FDS and FLARE
teams, which together build predictive
models to help find evil and improve outcomes for FireEye’s
customers and products
. If you are interested in this mission,
please consider joining the team by applying to one of
our job openings
.


Go to Source
Author: Philip Tully

Leave a Reply

Your email address will not be published. Required fields are marked *

WordPress Appliance - Powered by TurnKey Linux