Original Post from Rapid7
Author: Kwan Lin
We have a small data science team here at Rapid7, and we routinely work across a broad range of functions throughout the company, including threat research, product, and operations.
There is often a misconception that data science merely entails grabbing a mass of data, throwing it at some machine learning, and hoping for the best. In practice, we like to be more thoughtful than that.
Our approach is to treat data science as a framework that factors in a consulting phase, a data phase, and a “science” phase—a term used loosely here to encapsulate a broad range of analysis and modeling methods that may come into play depending on the challenge under consideration.
As a general rule of thumb, be wary of data scientists who don’t ask enough questions. We data scientists possess a particular set of skills acquired over a period of academic and professional growth, but it is highly improbable that we are as well-versed in a particular domain as someone who works in that space on a regular basis.
Our data science team routinely supports other teams across the company. Whenever we start engaging with our colleagues in other functions on specific problems, we need to begin first by fully understanding the problems we’re trying to solve.
The first set of questions we might ask of our non-data science colleagues could include clarification around the project’s intended purpose (i.e., “What are your goals?”). This is often the trickiest phase of consulting: While an idea might seem clear in one person’s mind, the risk of misunderstanding is ever-present.
Once the scope and intent of the project are clarified, the questioning often moves on to more practical considerations, such as the availability of appropriate data, the time horizons under consideration, and any technical constraints that might exist. Simply put, despite the best intentions of any project, it may prove to be a non-starter if critical resources such as appropriate data do not exist or cannot be made available.
In the consulting phase, we might also present a wide range of ideas. Many of those ideas may be wrong, but that’s OK. Brainstorming fuels creative progress and can open up avenues of exploration that had not yet been considered.
If any ideas seem appropriate and feasible, we might then proceed toward implementing those ideas. We often keep our counterparts engaged in an active discussion and feedback loop to constantly validate the utility of what we’re doing to ensure that the outputs remain appropriate.
After we’ve suitably understood the problem during the consulting phase, we move on to the data. The adage of “garbage in, garbage out” is highly apt in the space of data science. Without the right data, any data science output would be flawed at best, and misleading at worst.
We often want more data, both in terms of volume and variety. However, too much data does pose some severe risks. Including too much data might introduce too much noise and confound the utility of any data science analysis or model. Too much variety in the data heightens the risk of multicollinearity and misleading spurious correlations—which are bad things when we’re trying to understand causal relationships. In a technical sense, too much data could also unnecessarily raise the cost and complexity of computation, turning simple data science projects into needlessly complicated engagements.
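One simple way to screen for redundant fields before modeling is to check pairwise correlations. The following is a minimal sketch, not a full multicollinearity diagnostic: the feature names, the values, and the 0.9 cutoff are all hypothetical and chosen only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(features, threshold=0.9):
    """Return pairs of feature names whose absolute correlation meets the threshold."""
    names = sorted(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(features[a], features[b])) >= threshold:
                pairs.append((a, b))
    return pairs

# Hypothetical data: bytes_out is roughly twice packets_out, so the two
# fields carry nearly the same information.
features = {
    "packets_out": [10, 20, 30, 40, 50],
    "bytes_out": [21, 39, 61, 80, 99],
    "failed_logins": [3, 0, 4, 1, 2],
}

print(highly_correlated_pairs(features))
```

A flagged pair is a prompt to ask the subject-matter experts whether both fields are really needed, not an automatic reason to drop one.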
Those risks notwithstanding, more of the right data is good. In fact, adding more of the right types of data can often enable remarkable revelations that would not have been feasible with small or segregated sets of data. In practice, combining data often magnifies its value, such that the combined data is worth more than the mere sum of the value of the individual sets.
Once the useful sets of data are combined, there is often a necessary data-wrangling phase in which the raw form of the data is transformed into a more useful structure. This might involve manually cleaning the data values, spreading or gathering the data, creating new fields based on pre-existing fields, and so on. There is no prescribed sequence of operations that should be applied, and different data scientists might manipulate the same sets of data differently. While the field is referred to as “data science,” there is a high degree of art to the craft.
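As an illustration of two of the wrangling steps mentioned above, the sketch below gathers wide records into a long format and then derives a new field from an existing one. The host names, column names, and counts are hypothetical, and the `gather` helper is written here for the example rather than taken from any library.

```python
def gather(rows, id_key, value_keys, key_name, value_name):
    """Turn one wide row per id into one long row per (id, key) pair."""
    long_rows = []
    for row in rows:
        for key in value_keys:
            long_rows.append({
                id_key: row[id_key],
                key_name: key,
                value_name: row[key],
            })
    return long_rows

# Hypothetical wide-format records: one row per host, one column per day.
wide = [
    {"host": "web-01", "mon": 3, "tue": 5},
    {"host": "web-02", "mon": 0, "tue": 2},
]

tidy = gather(wide, "host", ["mon", "tue"], "day", "alerts")

# Create a new field based on a pre-existing field.
for row in tidy:
    row["has_alerts"] = row["alerts"] > 0

for row in tidy:
    print(row)
```

Another data scientist might reasonably keep the wide layout or derive different fields; the point is only that the raw shape is rarely the shape the analysis needs.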
Depending on the type of method that is to be applied within a given data science project, it may be necessary to break the set of data into chunks: a training set that is used to construct a model, a validation set that is used iteratively to assess the accuracy of the model and to inform whether or not a model needs to be revised, and a testing set that provides a final, impartial assessment of the overall model accuracy.
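A three-way split of this kind can be sketched in a few lines. The 60/20/20 proportions, the fixed seed, and the function name below are assumptions made for the example; appropriate proportions depend on the project and the amount of data available.

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and testing lists."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for testing")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains is held out
    return train, val, test

records = list(range(100))  # stand-in for 100 hypothetical observations
train, val, test = split_dataset(records)
print(len(train), len(val), len(test))
```

Shuffling before splitting matters when the raw data is ordered, for example by time or by source; for time-dependent problems a chronological split may be more appropriate than a random one.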
The composition of the data is highly dependent on the methods to be applied, which we will elaborate on next.
We think of this practice as a “science” because of its well-reasoned, inquisitive nature. Once we understand the challenges to be addressed from a practical standpoint, we proceed with formulating and testing hypotheses, applying systematic analytical methods to better understand the subject under consideration, and building models that can subsequently be integrated into operations or products. We recognize that there may be missteps along the way, but we consider those moments to be investments in the overall process of working toward an objectively good outcome.
Terms that are often bandied about in data science discussions include “statistics,” “support vector machines,” “random forests,” and “neural networks.” We consider these to be but some of the available tools that can be utilized to address particular problem spaces. We tend not to occupy ourselves too much with specific tools; instead, we selectively pick the right tools for the job, depending on our understanding derived from our consultations with the subject-matter experts. We often find it necessary to use combinations of different tools to get the job done correctly.
Given the volume of tool choices to pick from, it can sometimes be overwhelming to understand which tool is appropriate for a given problem. In the realm of data science, everyone should expect an inquisition. We might ask whether a project is strictly for analysis, in which case summaries, tabulation, or visualization might be sufficient to convey a deeper understanding, or for eventual automation, in which case modeling might be appropriate. If it is, in fact, a modeling task, we might ask further about the scope of the problem and the data availability, which may lead us down a path of conventional statistics or more sophisticated machine learning. We could also inquire about the intended output. Is it a quantified output? If so, we’ll consider regression methods, like linear regression. Or is it a categorical output? In that case, we’ll consider classifiers, like random forests.
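The line of questioning above can be written down as a small decision helper. This is only a sketch of the reasoning in this post: the function name and the string labels are hypothetical, and real engagements involve far more back-and-forth than two questions.

```python
def suggest_approach(purpose, output_type=None):
    """Map answers to the scoping questions onto a rough starting point.

    purpose: "analysis" or "automation"
    output_type: "quantified" or "categorical" (only needed for automation)
    """
    if purpose == "analysis":
        return "summaries, tabulation, or visualization"
    if purpose == "automation":
        if output_type == "quantified":
            return "regression methods, such as linear regression"
        if output_type == "categorical":
            return "classifiers, such as random forests"
        raise ValueError("output_type must be 'quantified' or 'categorical'")
    raise ValueError("purpose must be 'analysis' or 'automation'")

print(suggest_approach("analysis"))
print(suggest_approach("automation", "categorical"))
```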
Throughout the process of applying the science, we need to remain cognizant of potential risks. A key consideration is the risk of overfitting while formulating the model. In essence, we might see remarkably good results during the model training phase, but it might be due to the model internalizing the idiosyncrasies of the particular dataset under consideration. The risk there is that the model ceases to be generalizable and would likely perform poorly when deployed.
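To make the overfitting risk concrete, the sketch below compares a model that memorizes its training data (a 1-nearest-neighbor lookup) with a much simpler threshold rule, on synthetic data with noisy labels. Everything here is an assumption for illustration: the data is randomly generated, the 20% label noise and the sample sizes are arbitrary, and neither model is one the team is known to use.

```python
import random

def make_data(n, rng, noise=0.2):
    """Return (x, label) pairs: label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

def memorizer_predict(train, x):
    """1-nearest-neighbor: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(predict, data):
    """Fraction of points for which predict(x) matches the label."""
    return sum(predict(x) == label for x, label in data) / len(data)

rng = random.Random(42)
train = make_data(300, rng)
test = make_data(1000, rng)

# Fit the simple model: pick the threshold from a coarse grid
# that scores best on the training data.
grid = [i / 10 for i in range(1, 10)]
best_threshold = max(grid, key=lambda t: accuracy(lambda x: x > t, train))

memo_train = accuracy(lambda x: memorizer_predict(train, x), train)
memo_test = accuracy(lambda x: memorizer_predict(train, x), test)
rule_train = accuracy(lambda x: x > best_threshold, train)
rule_test = accuracy(lambda x: x > best_threshold, test)

print("memorizer: train", memo_train, "test", memo_test)
print("threshold: train", rule_train, "test", rule_test)
```

The memorizer scores perfectly on the data it was trained on because it has internalized the noise along with the signal, and that apparent advantage disappears on fresh data. Comparing training and held-out accuracy in this way is one common check for overfitting.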
The path of data science is fraught with potential pitfalls, but if we can apply a well-reasoned framework to scope and direct the flow of the work, we can improve the likelihood of delivering something useful. At the core of our engagements is an appreciation of the expertise of the non-data science colleagues we work with, from whom we draw further direction on the handling of data and the eventual selection and application of diverse methods. Such is the path to constructing useful methods that help identify malicious activities, understand the threat landscape, and optimize operations.