Creative inquiry and analysis require creative individuals with broad skills across a variety of technical fields. Accordingly, our group of data scientists comes from academic fields like computer science, physics, and applied mathematics, and from problem domains as diverse as nuclear engineering, materials research, computer vision, high-energy physics, computational fluid dynamics, and transportation logistics. We strive to frame and address problems with the nuance they deserve, rather than applying the same small roster of one-size-fits-all methods to generic problems and claiming victory if something happens to stick.
In these roles and many others, our software developers are engaged in building the kinds of environments where data science can succeed.
Part of innovation is bringing specialists from the bleeding edge of their respective fields into cross-disciplinary research, where their scientific contributions and ideas can be applied to valuable problems in other industries. In that capacity, our group engages researchers from financial, pure, and applied mathematics, engineering, science, medicine, and other fields to explore how big problems can be addressed with non-traditional solutions.
Trainees & Visitors
We’re always looking for interns, visitors, and trainees to join our team and learn from our way of doing things. Although we are happy to hear from just about anyone, we’re especially interested in engaging students of mathematics, scientific computing, and other applied and technical fields.
- Predictive Analytics
- Uncertainty Quantification
- Behavioral Analytics
- Distributed & Scientific Computing
- Analyzing Integrated Data
- Hybrid Data Architectures
- Machine Learning
- Latent Knowledge Extraction
Quantitative marketing and financial forecasting have shown that the future can be predicted and business strategy can be oriented with the resulting quantification in mind. Predictive analytics is concerned with deploying techniques from statistics, applied mathematics, and machine learning in order to infer the future state of a system based on its previous known states. Although this is often as simple as forecasting supply or demand, it can result in answers such as the forecasted distribution of patients receiving a certain level of care, the adherence of a population to a given self-care protocol (e.g., pharmaceutical compliance), or even predicting the conditions under which a service or resource will be oversubscribed in the future.
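As a minimal illustration of inferring a future state from previous known states, the sketch below fits a linear trend to a short demand history by ordinary least squares and extrapolates one step ahead. The data and the choice of a linear model are purely illustrative; real forecasting work would weigh many candidate models against the problem at hand.

```python
# Minimal sketch: one-step-ahead forecasting by fitting a linear trend
# with ordinary least squares (pure Python, illustrative data only).

def fit_linear_trend(ys):
    """Least-squares fit of y = a + b*t for t = 0, 1, ..., n-1."""
    n = len(ys)
    ts = range(n)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    b = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys)) / \
        sum((t - t_mean) ** 2 for t in ts)
    a = y_mean - b * t_mean
    return a, b

def forecast(ys, steps=1):
    """Extrapolate the fitted trend `steps` periods past the last point."""
    a, b = fit_linear_trend(ys)
    return a + b * (len(ys) - 1 + steps)

weekly_demand = [102, 108, 115, 119, 127]   # hypothetical history
print(forecast(weekly_demand, steps=1))     # point forecast for next week
```

The same structure extends to richer targets, such as forecasting a distribution over outcomes rather than a single point.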
An answer is often only as good as the context which can be provided for the original question, the data consumed during the formulation of the answer, or the techniques used over the course of that analysis. Uncertainty quantification is a discipline which attempts to take measure of uncertainty and minimize it when possible. For example, if a data set is known to have certain anomalous features (a low signal-to-noise ratio in one of the columns, missing entries, incorrectly encoded entries, etc.), one immediate option is to invest resources in cleaning that data set. However, this is not always practical or economical. Initially, it is often preferable to attempt an answer to the questions, “How much is my data affecting the resulting answer?” and, relatedly, “How much should I pay for cleaner data?”
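One common way to get a first answer to “How much is my data affecting the result?” is bootstrap resampling: recompute the estimate many times on resampled copies of the data and look at the spread. The sketch below, with illustrative readings containing one suspect outlier, is only one of many techniques used in uncertainty quantification.

```python
# Minimal sketch: bootstrap resampling to gauge how sensitive an
# estimate (here, a simple mean) is to the particular data observed.
# A wide spread suggests that cleaner or more data may be worth buying.
import random
import statistics

def bootstrap_std_of_mean(data, n_resamples=2000, seed=0):
    """Standard deviation of the mean under resampling with replacement."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(means)

readings = [9.8, 10.1, 10.0, 9.7, 14.2, 10.3]  # one suspect outlier
print(bootstrap_std_of_mean(readings))
```

Comparing this spread before and after removing the suspect entry puts a rough price on cleaning that column.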
A patient’s health doesn’t pause when a doctor isn’t observing it. However, almost all types of medical data are collected in a medical setting. As a result, researchers can conflate the health of a hospital population with that of the population at large. To get a view of the bigger picture of what else contributes to a patient’s health in the real world, our team works with collections of data both inside and outside the traditional realm of electronic health records. Massive data repositories collected from social media, criminal justice records, and census forms can fill knowledge gaps concerning a population’s behavior.
Deploying a predictive algorithm is (and should be!) more than just finding a library which implements that algorithm, and calling it from one’s favorite programming interface. Algorithm development and deployment is all about understanding how a technique works and what aspects of its function are most relevant to the challenge at hand. Computational science, as a discipline concerned with many of the mechanisms underpinning data science, has been a core skill in the development of our data science framework.
Moreover, because computational infrastructure is no longer restricted to handfuls of monolithic servers providing the core resources of an enterprise, the groups responsible for analyzing data coming from modern computer networks must have the skills to appropriately leverage clusters of compute devices. Not only must teams be aware of the methodologies for scheduling and load-balancing work on clusters, but also programming methodologies such as MapReduce, which can reduce the burden on the developer for exploiting distributed hardware.
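The MapReduce pattern mentioned above can be sketched in a few lines: independent map tasks emit key-value pairs, a shuffle phase groups values by key, and reduce tasks combine each group. The single-process word count below is illustrative only; on a real cluster a framework schedules and distributes these phases, but the logical structure is the same.

```python
# Minimal sketch of the MapReduce pattern: map, shuffle, reduce.
from collections import defaultdict
from functools import reduce

def map_phase(document):
    # Each map task runs independently on one input record.
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    # The shuffle groups all values emitted under the same key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reduce task combines the values for one key.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

docs = ["the quick fox", "the lazy dog", "the fox"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts["the"])  # 3
```

Because the map and reduce tasks are independent, the framework, not the developer, decides how to spread them across the cluster.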
This interplay between distributed and scientific computing is one of the core tenets of our technological efforts.
Ultimately, there are very few simple ways to integrate different data sources (for example, database tables alongside written documents or Tweets), making it challenging to apply the classic tools of data mining to problems which attempt to incorporate multiple views of the same context. However, seeing the same story unfold from two separate perspectives, and consuming those perspectives for quantitative analysis, is enormously valuable. The analysis of integrated data relies on multiple tools from computer vision, machine learning, and statistics to featurize datasets which contain information from multiple sources.
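In its simplest form, featurizing multi-source data means flattening each view of a record into one numeric vector that classic data-mining tools can consume. The sketch below combines a structured field from a table with a bag-of-words count over free text; the record, the fixed vocabulary, and the normalization choice are all illustrative assumptions.

```python
# Minimal sketch: merging a structured field and free text describing
# the same record into one flat feature vector.

VOCAB = ["fever", "cough", "fracture"]   # assumed fixed vocabulary

def featurize(record):
    # Structured part: a numeric column, crudely normalized.
    structured = [record["age"] / 100.0]
    # Unstructured part: bag-of-words counts over the fixed vocabulary.
    words = record["note"].lower().split()
    text = [float(words.count(term)) for term in VOCAB]
    return structured + text

row = {"age": 62, "note": "persistent cough and mild fever"}
print(featurize(row))  # [0.62, 1.0, 1.0, 0.0]
```

Once both views live in the same vector, any standard clustering or classification tool can see them together.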
If today’s trends continue, tomorrow’s data architectures will store a greater variety of data and give the end consumer more flexibility in what to do with it. We’ve engineered many of our data science pipelines with that future in mind. This is a challenge both because the end user must have a consistent functional experience throughout an engagement, and because many enterprise-scale tools still naturally lend themselves to silos.
A common challenge in data science is synthesizing unstructured data (e.g., images and documents) with structured data (e.g., databases and spreadsheets) from related domains. In that context, and many others, machine learning has proven especially valuable as a means of featurizing unstructured data in a way that is compatible with structured analysis. Beyond that single need, machine learning also offers a rich collection of unique tools with which to tackle data science problems.
Most common statistical methods make statements about significance for individual variables. This approach, however, can fail when analyzing diverse datasets. For instance, a text analysis of patient records would yield very little if confined to analyzing words in isolation. Instead, one needs tools that combine several features into new latent variables, going beyond the usual linear combinations of component analysis. Our group applies graph and text clustering techniques built on similarity metrics, Bayesian inference, and other nonlinear methods. The resulting latent variables can uncover trends that are unobtainable by observing marginals alone.
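A small example of clustering by a similarity metric rather than by any single word: the sketch below groups short notes by Jaccard similarity of their word sets, using a simple threshold-based grouping that stands in for the richer graph and Bayesian machinery described above. The notes and the threshold are illustrative.

```python
# Minimal sketch: grouping short documents by Jaccard similarity of
# their word sets, so that no single word decides membership.

def jaccard(a, b):
    """Similarity of two documents as overlap of their word sets."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.3):
    """Greedy single-link grouping: join the first group any member of
    which is similar enough, otherwise start a new group."""
    clusters = []
    for doc in docs:
        for group in clusters:
            if any(jaccard(doc, member) >= threshold for member in group):
                group.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters

notes = [
    "chest pain shortness of breath",
    "shortness of breath chest pain worsening",
    "ankle fracture after fall",
]
print(len(cluster(notes)))  # 2
```

Each resulting group behaves like a latent variable: membership reflects a whole constellation of words, not any one of them.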