Loan Repayment: Build Better Models Faster
The primary difficulty facing a data scientist approaching the Home Credit Loan problem (a machine learning competition currently running on Kaggle where the objective is to predict whether a client will repay a loan) is the size and spread of the data. Look at the complete dataset and you are confronted with 58 million rows of data spread across seven tables. Machine learning requires a single table for training, so feature engineering means consolidating all the information about each client into one table.
My first attempt at the problem used traditional manual feature engineering: I spent a total of 10 hours creating a set of features by hand. First I read other data scientists’ work, explored the data, and researched the problem area to acquire the necessary domain knowledge. Then I translated that knowledge into code, building one feature at a time. As an example of a single manual feature, I found the total number of late payments a client had on previous loans, an operation that required using 3 different tables.
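To make the manual process concrete, here is a hedged pandas sketch of that late-payments feature. The toy tables below are stand-ins, not the actual competition tables; the real data would require the same join-then-aggregate pattern across three tables:

```python
import pandas as pd

# Toy stand-ins for three related tables (hypothetical data, not the
# Home Credit schema): clients, their previous loans, and loan payments
clients = pd.DataFrame({"client_id": [1, 2]})
previous_loans = pd.DataFrame({"loan_id": [10, 11, 12],
                               "client_id": [1, 1, 2]})
payments = pd.DataFrame({"loan_id": [10, 10, 11, 12],
                         "days_late": [0, 5, 3, 0]})

# Join payments back to clients through the loans table
merged = payments.merge(previous_loans, on="loan_id")

# Count each client's late payments, then attach the count to the client table
late_counts = (merged[merged["days_late"] > 0]
               .groupby("client_id")
               .size()
               .reset_index(name="num_late_payments"))
features = clients.merge(late_counts, on="client_id", how="left").fillna(0)
```

Every hand-built feature requires a variant of this boilerplate, which is why each one took me so long to write.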
The final manual engineered features performed quite well, achieving a 65% improvement over the baseline features (relative to the top leaderboard score), indicating the importance of proper feature engineering.
However, inefficient does not even begin to describe this process. For manual feature engineering, I ended up spending over 15 minutes per feature because I used the traditional approach of making a single feature at a time.
Besides being tedious and time-consuming, manual feature engineering is:
- Problem-specific: all of the code I wrote over many hours cannot be applied to any other problem
- Error-prone: each line of code is another opportunity to make a mistake
Furthermore, the final manual engineered features are limited both by human creativity and patience: there are only so many features we can think to build and only so much time we have to make them.
The promise of automated feature engineering is to surpass these limitations by taking a set of related tables and automatically building hundreds of useful features using code that can be applied across all problems.
From Manual to Automated Feature Engineering
As implemented in Featuretools, automated feature engineering allows even a domain novice such as myself to create thousands of relevant features from a set of related data tables. All we need to know is the basic structure of our tables and the relationships between them, which we track in a single data structure called an entity set. Once we have an entity set, using a method called Deep Feature Synthesis (DFS), we’re able to build thousands of features in one function call.
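Conceptually, an entity set is just the tables plus the parent-to-child relationships keyed on shared columns. The simplified sketch below uses plain pandas to show the idea (it is not the Featuretools `EntitySet` API, and the table and column names are invented for illustration):

```python
import pandas as pd

# Simplified stand-in for an entity set: the tables themselves plus the
# parent -> child relationships between them (toy data, not Featuretools)
tables = {
    "clients": pd.DataFrame({"client_id": [1, 2]}),
    "loans": pd.DataFrame({"loan_id": [10, 11, 12],
                           "client_id": [1, 1, 2]}),
}
relationships = [
    # (parent table, parent key, child table, child key)
    ("clients", "client_id", "loans", "client_id"),
]

# With the relationships recorded once, any child-to-parent aggregation
# can be driven purely by this metadata
parent, pkey, child, ckey = relationships[0]
loan_counts = (tables[child].groupby(ckey).size()
               .reset_index(name="loan_count"))
clients_with_counts = tables[parent].merge(loan_counts,
                                           left_on=pkey, right_on=ckey,
                                           how="left")
```

This metadata-driven style is what lets a tool enumerate aggregations automatically instead of having us hand-write each join.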
DFS works using functions called “primitives” to aggregate and transform our data. These primitives can be as simple as taking a mean or a max of a column, or they can be complex and based on subject expertise because Featuretools allows us to define our own custom primitives.
Feature primitives include many operations we already would do manually by hand, but with Featuretools, instead of re-writing the code to apply these operations on different datasets, we can use the same exact syntax across any relational database. Moreover, the power of DFS comes when we stack primitives on each other to create deep features. (For more on DFS, take a look at this blog post by one of the inventors of the technique.)
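To make “stacking” concrete, here is a hedged pandas sketch of a depth-2 feature: a SUM primitive applied at the loan level, with a MAX primitive stacked on top at the client level. The data is a toy stand-in, not the competition tables:

```python
import pandas as pd

# Toy stand-ins: each client has loans, each loan has payments
loans = pd.DataFrame({"loan_id": [10, 11, 12], "client_id": [1, 1, 2]})
payments = pd.DataFrame({"loan_id": [10, 10, 11, 12],
                         "amount": [100.0, 50.0, 75.0, 200.0]})

# Depth 1: SUM primitive -- total amount paid per loan
total_per_loan = payments.groupby("loan_id")["amount"].sum()

# Depth 2: MAX primitive stacked on the SUM -- each client's largest loan total
per_client = (loans.assign(loan_total=loans["loan_id"].map(total_per_loan))
                   .groupby("client_id")["loan_total"]
                   .max()
                   .rename("MAX(SUM(payments.amount))"))
```

Stacking two simple primitives already produces a feature that would take several lines of bespoke code by hand; DFS does this systematically across every valid combination.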
Deep Feature Synthesis is flexible — allowing it to be applied to any data science problem — and powerful — revealing insights in our data by creating deep features.
I’ll spare you the few lines of code needed for the set-up, but the action of DFS happens in a single line. Here we make thousands of features for each client using all 7 tables in our dataset (ft is the imported featuretools library):
# Deep feature synthesis
feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity='clients',
                                  agg_primitives=agg_primitives,
                                  trans_primitives=trans_primitives)
Below are some of the 1820 features we automatically get from Featuretools:
- The maximum total amount paid on previous loans by a client. This is created using a SUM primitive across 3 tables.
- The percentile ranking of a client in terms of average previous credit card debt. This uses a MEAN primitive across 2 tables.
- Whether or not a client turned in two documents during the application process. This uses an AND transform primitive and 1 table.
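The second feature in the list can be sketched in pandas to show how an aggregation and a transform combine: a MEAN over each client’s credit card records, followed by a percentile-rank transform over the result. The tables and values below are invented for illustration:

```python
import pandas as pd

# Toy stand-ins for the clients table and a credit-card balance table
clients = pd.DataFrame({"client_id": [1, 2, 3]})
credit_card = pd.DataFrame({"client_id": [1, 1, 2, 2, 3],
                            "debt": [500.0, 700.0, 100.0, 300.0, 900.0]})

# MEAN aggregation: average previous credit card debt per client
mean_debt = credit_card.groupby("client_id")["debt"].mean()

# PERCENTILE transform stacked on the aggregation: each client's rank
# (as a fraction of all clients) by average debt
feature = (clients.assign(mean_debt=clients["client_id"].map(mean_debt))
                  .assign(debt_pctile=lambda d: d["mean_debt"].rank(pct=True)))
```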
Each of these features is built from simple aggregations and hence is human-interpretable. Featuretools created many of the same features I did manually, but also thousands I never would have conceived — or had the time to implement. Not every feature will be relevant to the problem, and some are highly correlated; nonetheless, having too many features is a better problem than having too few!
After a little feature selection and model optimization, these features performed slightly better in a predictive model than the manual features, with an overall development time of 1 hour — a 10x reduction compared to the manual process. Featuretools is much faster both because it requires less domain knowledge and because there are considerably fewer lines of code to write.
I’ll admit that there is a slight time cost to learning Featuretools, but it’s an investment that will pay off. After taking an hour or so to learn Featuretools, you can apply it to any machine learning problem.
The following graphs sum up my experience for the loan repayment problem:
- Development time: accounts for everything required to make the final feature engineering code: 10 hours manual vs 1 hour automated
- Number of features produced by the method: 30 features manual vs 1820 automated
- Improvement relative to baseline: the % gain over the baseline features, measured against the top public leaderboard score, for a model trained on the features: 65% manual vs 66% automated
My takeaway is that automated feature engineering will not replace the data scientist, but rather by significantly increasing efficiency, it will free her to spend more time on other aspects of the machine learning pipeline.
Furthermore, the Featuretools code I wrote for this first project could be applied to any dataset while the manual engineering code would have to be thrown away and entirely rewritten for the next dataset!