Predict Loan Repayment

16 minute read

Objective: Given historical data on loans given out with information on whether or not the borrower defaulted (charge-off), determine whether or not a borrower will pay back their loan.

Source: Udemy | Python for Data Science and Machine Learning Bootcamp
Data used in the below analysis: Subset of LendingClub DataSet from Kaggle.
Actual files used: Info | Data

#importing libraries to be used
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data_info = pd.read_csv('../DATA/lending_club_info.csv', index_col='LoanStatNew') #reading data desc

#created a function to get the desc of any column in our dataset
def feat_info(col_name):
    print(data_info.loc[col_name]['Description'])

sns.set_style('whitegrid')

#reading data in DataFrame
df = pd.read_csv('../DATA/lending_club_loan_two.csv')

#Getting details of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 loan_amnt             396030 non-null  float64
 term                  396030 non-null  object
 int_rate              396030 non-null  float64
 installment           396030 non-null  float64
 grade                 396030 non-null  object
 sub_grade             396030 non-null  object
 emp_title             373103 non-null  object
 emp_length            377729 non-null  object
 home_ownership        396030 non-null  object
 annual_inc            396030 non-null  float64
verification_status   396030 non-null  object
issue_d               396030 non-null  object
loan_status           396030 non-null  object
purpose               396030 non-null  object
title                 394275 non-null  object
dti                   396030 non-null  float64
earliest_cr_line      396030 non-null  object
open_acc              396030 non-null  float64
pub_rec               396030 non-null  float64
revol_bal             396030 non-null  float64
revol_util            395754 non-null  float64
total_acc             396030 non-null  float64
initial_list_status   396030 non-null  object
application_type      396030 non-null  object
mort_acc              358235 non-null  float64
pub_rec_bankruptcies  395495 non-null  float64
address               396030 non-null  object
dtypes: float64(12), object(15)
memory usage: 81.6+ MB

df.describe().transpose()

	count	mean	std	min	25%	50%	75%	max
loan_amnt	396030.0	14113.888089	8357.441341	500.00	8000.00	12000.00	20000.00	40000.00
int_rate	396030.0	13.639400	4.472157	5.32	10.49	13.33	16.49	30.99
installment	396030.0	431.849698	250.727790	16.08	250.33	375.43	567.30	1533.81
annual_inc	396030.0	74203.175798	61637.621158	0.00	45000.00	64000.00	90000.00	8706582.00
dti	396030.0	17.379514	18.019092	0.00	11.28	16.91	22.98	9999.00
open_acc	396030.0	11.311153	5.137649	0.00	8.00	10.00	14.00	90.00
pub_rec	396030.0	0.178191	0.530671	0.00	0.00	0.00	0.00	86.00
revol_bal	396030.0	15844.539853	20591.836109	0.00	6025.00	11181.00	19620.00	1743266.00
revol_util	395754.0	53.791749	24.452193	0.00	35.80	54.80	72.90	892.30
total_acc	396030.0	25.414744	11.886991	2.00	17.00	24.00	32.00	151.00
mort_acc	358235.0	1.813991	2.147930	0.00	0.00	1.00	3.00	34.00
pub_rec_bankruptcies	395495.0	0.121648	0.356174	0.00	0.00	0.00	0.00	8.00

Starting with Exploratory Data Analysis!

sns.countplot(x='loan_status',data=df)

Since we are trying to predict loan_status, the above plot makes it clear that our historic data is not well balanced. This means our model will not have a great precision / recall as data is unbalanced.

plt.figure(figsize=(12,5))
sns.distplot(df['loan_amnt'], bins=50, kde=False)

Most of the loan amounts is concentrated around 5000-20000 with very few values at 40000

#checking correlations among continuous features
df.corr()

	loan_amnt	int_rate	installment	annual_inc	dti	open_acc	pub_rec	revol_bal	revol_util	total_acc	mort_acc	pub_rec_bankruptcies
loan_amnt	1.000000	0.168921	0.953929	0.336887	0.016636	0.198556	-0.077779	0.328320	0.099911	0.223886	0.222315	-0.106539
int_rate	0.168921	1.000000	0.162758	-0.056771	0.079038	0.011649	0.060986	-0.011280	0.293659	-0.036404	-0.082583	0.057450
installment	0.953929	0.162758	1.000000	0.330381	0.015786	0.188973	-0.067892	0.316455	0.123915	0.202430	0.193694	-0.098628
annual_inc	0.336887	-0.056771	0.330381	1.000000	-0.081685	0.136150	-0.013720	0.299773	0.027871	0.193023	0.236320	-0.050162
dti	0.016636	0.079038	0.015786	-0.081685	1.000000	0.136181	-0.017639	0.063571	0.088375	0.102128	-0.025439	-0.014558
open_acc	0.198556	0.011649	0.188973	0.136150	0.136181	1.000000	-0.018392	0.221192	-0.131420	0.680728	0.109205	-0.027732
pub_rec	-0.077779	0.060986	-0.067892	-0.013720	-0.017639	-0.018392	1.000000	-0.101664	-0.075910	0.019723	0.011552	0.699408
revol_bal	0.328320	-0.011280	0.316455	0.299773	0.063571	0.221192	-0.101664	1.000000	0.226346	0.191616	0.194925	-0.124532
revol_util	0.099911	0.293659	0.123915	0.027871	0.088375	-0.131420	-0.075910	0.226346	1.000000	-0.104273	0.007514	-0.086751
total_acc	0.223886	-0.036404	0.202430	0.193023	0.102128	0.680728	0.019723	0.191616	-0.104273	1.000000	0.381072	0.042035
mort_acc	0.222315	-0.082583	0.193694	0.236320	-0.025439	0.109205	0.011552	0.194925	0.007514	0.381072	1.000000	0.027239
pub_rec_bankruptcies	-0.106539	0.057450	-0.098628	-0.050162	-0.014558	-0.027732	0.699408	-0.124532	-0.086751	0.042035	0.027239	1.000000

We can better visualize this with a heat map!

plt.figure(figsize=(13,8))
sns.heatmap(df.corr(), annot=True, cmap='viridis')

We can explore the high correlation of installment with loan amount. This is as per expectations as we know the amount of installment must depend on the loan taken.

feat_info('installment')

The monthly payment owed by the borrower if the loan originates.

feat_info('loan_amnt')

The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.

plt.figure(figsize=(12,7))
sns.scatterplot(x='installment',y='loan_amnt',data=df)

plt.figure(figsize=(10,5))
sns.boxplot(x='loan_status',y='loan_amnt',data=df)

df.groupby(by='loan_status')['loan_amnt'].describe()

	count	mean	std	min	25%	50%	75%	max
loan_status
Charged Off	77673.0	15126.300967	8505.090557	1000.0	8525.0	14000.0	20000.0	40000.0
Fully Paid	318357.0	13866.878771	8302.319699	500.0	7500.0	12000.0	19225.0	40000.0

df.columns

Index(['loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
       'verification_status', 'issue_d', 'loan_status', 'purpose', 'title',
       'dti', 'earliest_cr_line', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'application_type',
       'mort_acc', 'pub_rec_bankruptcies', 'address'],
      dtype='object')

df['grade'].unique()

array(['B', 'A', 'C', 'E', 'D', 'F', 'G'], dtype=object)

df['sub_grade'].unique()

array(['B4', 'B5', 'B3', 'A2', 'C5', 'C3', 'A1', 'B2', 'C1', 'A5', 'E4',
       'A4', 'A3', 'D1', 'C2', 'B1', 'D3', 'D5', 'D2', 'E1', 'E2', 'E5',
       'F4', 'E3', 'D4', 'G1', 'F5', 'G2', 'C4', 'F1', 'F3', 'G5', 'G4',
       'F2', 'G3'], dtype=object)

plt.figure(figsize=(10,5))
grade_sorted = sorted(df['grade'].unique())
sns.countplot(x='grade',data=df,hue='loan_status', order=grade_sorted)

plt.figure(figsize=(13,6))
sub_grade_sorted = sorted(df['sub_grade'].unique())
sns.countplot(x='sub_grade',data=df, order=sub_grade_sorted, palette = 'coolwarm')

plt.figure(figsize=(15,6))
sub_grade_sorted = sorted(df['sub_grade'].unique())
sns.countplot(x='sub_grade',data=df, order=sub_grade_sorted, hue='loan_status')

It seems like F and G grade loans don’t get paid back often

plt.figure(figsize=(10,6))
df_grade_FG = df[(df['grade']=='F') | (df['grade'] == 'G')]
sub_grade_sorted = sorted(df_grade_FG['sub_grade'].unique())
sns.countplot(x='sub_grade', data = df_grade_FG, order = sub_grade_sorted, hue='loan_status')

def loan_status(string):
    if string == 'Fully Paid':
        return 1
    else:
        return 0

df['loan_repaid'] = df['loan_status'].apply(loan_status)

df.head()

	loan_amnt	term	int_rate	installment	grade	sub_grade	emp_title	emp_length	home_ownership	annual_inc	...	revol_bal	revol_util	total_acc	initial_list_status	application_type	mort_acc	address	loan_repaid
0	10000.0	36 months	11.44	329.48	B	B4	Marketing	10+ years	RENT	117000.0	...	36369.0	41.8	25.0	w	INDIVIDUAL	0.0	0174 Michelle Gateway\nMendozaberg, OK 22690	1
1	8000.0	36 months	11.99	265.68	B	B5	Credit analyst	4 years	MORTGAGE	65000.0	...	20131.0	53.3	27.0	f	INDIVIDUAL	3.0	1076 Carney Fort Apt. 347\nLoganmouth, SD 05113	1
2	15600.0	36 months	10.49	506.97	B	B3	Statistician	< 1 year	RENT	43057.0	...	11987.0	92.2	26.0	f	INDIVIDUAL	0.0	87025 Mark Dale Apt. 269\nNew Sabrina, WV 05113	1
3	7200.0	36 months	6.49	220.65	A	A2	Client Advocate	6 years	RENT	54000.0	...	5472.0	21.5	13.0	f	INDIVIDUAL	0.0	823 Reid Ford\nDelacruzside, MA 00813	1
4	24375.0	60 months	17.27	609.33	C	C5	Destiny Management Inc.	9 years	MORTGAGE	55000.0	...	24584.0	69.8	43.0	f	INDIVIDUAL	1.0	679 Luna Roads\nGreggshire, VA 11650	0

5 rows × 28 columns

plt.figure(figsize=(10,6))
df.corr()['loan_repaid'].sort_values().drop('loan_repaid').plot(kind='bar')

The loan repayment is most correlated with annual_inc and mort_acc and highly uncorrelated with interest rate.

print("annual_inc:")
feat_info('annual_inc')
print('mort_acc:')
feat_info('mort_acc')
print('int_rate:')
feat_info('int_rate')

annual_inc:
The self-reported annual income provided by the borrower during registration.
mort_acc:
Number of mortgage accounts.
int_rate:
Interest Rate on the loan

Data PreProcessing!

We need to fill missing data or remove unnecessary features and convert categorical features into numerical ones using dummy variable (one hot encoding)

df.shape[0] #number of entries in our data

df.isnull().sum()

loan_amnt                   0
term                        0
int_rate                    0
installment                 0
grade                       0
sub_grade                   0
emp_title               22927
emp_length              18301
home_ownership              0
annual_inc                  0
verification_status         0
issue_d                     0
loan_status                 0
purpose                     0
title                    1755
dti                         0
earliest_cr_line            0
open_acc                    0
pub_rec                     0
revol_bal                   0
revol_util                276
total_acc                   0
initial_list_status         0
application_type            0
mort_acc                37795
pub_rec_bankruptcies      535
address                     0
loan_repaid                 0
dtype: int64

#convert into % of total data frame
df.isnull().sum()/df.shape[0]

loan_amnt               0.000000
term                    0.000000
int_rate                0.000000
installment             0.000000
grade                   0.000000
sub_grade               0.000000
emp_title               0.057892
emp_length              0.046211
home_ownership          0.000000
annual_inc              0.000000
verification_status     0.000000
issue_d                 0.000000
loan_status             0.000000
purpose                 0.000000
title                   0.004431
dti                     0.000000
earliest_cr_line        0.000000
open_acc                0.000000
pub_rec                 0.000000
revol_bal               0.000000
revol_util              0.000697
total_acc               0.000000
initial_list_status     0.000000
application_type        0.000000
mort_acc                0.095435
pub_rec_bankruptcies    0.001351
address                 0.000000
loan_repaid             0.000000
dtype: float64

We need to keep/remove features with missing values. Let’s determine that one by one.

print("emp_title:")
feat_info('emp_title')
print("emp_length:")
feat_info('emp_length')

emp_title:
The job title supplied by the Borrower when applying for the loan.*
emp_length:
Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.

df['emp_title'].nunique()

df['emp_title'].value_counts()

Teacher                                   4389
Manager                                   4250
Registered Nurse                          1856
RN                                        1846
Supervisor                                1830
                                          ...
Color tech                                   1
New York City Department of Education        1
Bartender/Food Server                        1
Commercial Ara Executive                     1
El Cortez Hotel & Casino                     1
Name: emp_title, Length: 173105, dtype: int64

There are way too many unique titles to convert them into dummy variable, we shall drop it

df.drop('emp_title', axis=1, inplace=True)

sorted(df['emp_length'].dropna().unique())

['1 year',
 '10+ years',
 '2 years',
 '3 years',
 '4 years',
 '5 years',
 '6 years',
 '7 years',
 '8 years',
 '9 years',
 '< 1 year']

emp_length_order = [ '< 1 year',
                      '1 year',
                     '2 years',
                     '3 years',
                     '4 years',
                     '5 years',
                     '6 years',
                     '7 years',
                     '8 years',
                     '9 years',
                     '10+ years']

plt.figure(figsize=(13,5))
sns.countplot(x='emp_length',data=df,order=emp_length_order)

plt.figure(figsize=(13,5))
sns.countplot(x='emp_length',data=df,order=emp_length_order, hue='loan_status')

This doesn’t really inform us if there is a strong relationship between employment length and being charged off, we need a percentage of charge offs per category or what percent of people per employment category didn’t pay back their loan.

data_for_emp_len = df[df['loan_repaid']==1].groupby(by='emp_length').count()
data_for_emp_len = pd.merge(data_for_emp_len,
                                    pd.DataFrame(df.groupby(by='emp_length').count()
                                                 ['loan_repaid']).reset_index(),on = 'emp_length')
data_for_emp_len['percentage'] = data_for_emp_len['loan_repaid_x'] / data_for_emp_len['loan_repaid_y']
data_for_emp_len.set_index('emp_length',inplace=True)
data_for_emp_len

	loan_amnt	term	int_rate	installment	grade	sub_grade	home_ownership	annual_inc	verification_status	issue_d	...	revol_util	total_acc	initial_list_status	application_type	mort_acc	pub_rec_bankruptcies	address	loan_repaid_x	loan_repaid_y	percentage
emp_length
1 year	20728	20728	20728	20728	20728	20728	20728	20728	20728	20728	...	20712	20728	20728	20728	18126	20666	20728	20728	25882	0.800865
10+ years	102826	102826	102826	102826	102826	102826	102826	102826	102826	102826	...	102766	102826	102826	102826	95511	102753	102826	102826	126041	0.815814
2 years	28903	28903	28903	28903	28903	28903	28903	28903	28903	28903	...	28886	28903	28903	28903	25355	28848	28903	28903	35827	0.806738
3 years	25483	25483	25483	25483	25483	25483	25483	25483	25483	25483	...	25468	25483	25483	25483	22220	25437	25483	25483	31665	0.804769
4 years	19344	19344	19344	19344	19344	19344	19344	19344	19344	19344	...	19333	19344	19344	19344	16526	19321	19344	19344	23952	0.807615
5 years	21403	21403	21403	21403	21403	21403	21403	21403	21403	21403	...	21391	21403	21403	21403	18691	21381	21403	21403	26495	0.807813
6 years	16898	16898	16898	16898	16898	16898	16898	16898	16898	16898	...	16884	16898	16898	16898	15002	16878	16898	16898	20841	0.810806
7 years	16764	16764	16764	16764	16764	16764	16764	16764	16764	16764	...	16747	16764	16764	16764	15284	16751	16764	16764	20819	0.805226
8 years	15339	15339	15339	15339	15339	15339	15339	15339	15339	15339	...	15327	15339	15339	15339	14142	15323	15339	15339	19168	0.800240
9 years	12244	12244	12244	12244	12244	12244	12244	12244	12244	12244	...	12235	12244	12244	12244	11192	12233	12244	12244	15314	0.799530
< 1 year	25162	25162	25162	25162	25162	25162	25162	25162	25162	25162	...	25139	25162	25162	25162	21629	25055	25162	25162	31725	0.793128

11 rows × 28 columns

plt.figure(figsize=(13,5))
data_for_emp_len['percentage'].plot(kind='bar')

The above plot shows % of poeple who paid their loans from the total number of people who took loan grouped at employee length. We can see that this is extremely same across all emp_lengths so we can drop the same.

df.drop('emp_length',axis=1,inplace=True)

df.isnull().sum()

loan_amnt                   0
term                        0
int_rate                    0
installment                 0
grade                       0
sub_grade                   0
home_ownership              0
annual_inc                  0
verification_status         0
issue_d                     0
loan_status                 0
purpose                     0
title                    1755
dti                         0
earliest_cr_line            0
open_acc                    0
pub_rec                     0
revol_bal                   0
revol_util                276
total_acc                   0
initial_list_status         0
application_type            0
mort_acc                37795
pub_rec_bankruptcies      535
address                     0
loan_repaid                 0
dtype: int64

print("purpose:")
feat_info('purpose')
print("title:")
feat_info('title')

purpose:
A category provided by the borrower for the loan request.
title:
The loan title provided by the borrower

df['purpose'].unique()

array(['vacation', 'debt_consolidation', 'credit_card',
       'home_improvement', 'small_business', 'major_purchase', 'other',
       'medical', 'wedding', 'car', 'moving', 'house', 'educational',
       'renewable_energy'], dtype=object)

df['title'].unique()

array(['Vacation', 'Debt consolidation', 'Credit card refinancing', ...,
       'Credit buster ', 'Loanforpayoff', 'Toxic Debt Payoff'],
      dtype=object)

The title is just a sub category for purpose, we can drop it!

df.drop('title', axis=1, inplace=True)

df['mort_acc'].value_counts()

0     139777
0      60416
0      49948
0      38049
0      27887
0      18194
0      11069
0       6052
0       3121
0       1656
0       865
0       479
0       264
0       146
0       107
0        61
0        37
0        22
0        18
0        15
0        13
0        10
0         7
0         4
0         4
0         3
0         2
0         2
0         2
0         2
0         1
0         1
0         1
Name: mort_acc, dtype: int64

There are many ways we could deal with this missing data. Build a simple model to fill it in, fill it in based on the mean of the other columns, or bin the columns into categories and then set NaN as its own category.

df.corr()['mort_acc'].sort_values()

int_rate               -0.082583
dti                    -0.025439
revol_util              0.007514
pub_rec                 0.011552
pub_rec_bankruptcies    0.027239
loan_repaid             0.073111
open_acc                0.109205
installment             0.193694
revol_bal               0.194925
loan_amnt               0.222315
annual_inc              0.236320
total_acc               0.381072
mort_acc                1.000000
Name: mort_acc, dtype: float64

The column mort_acc is most correlated with total_acc.

mean_mort_total = df.groupby(by='total_acc').mean()['mort_acc']
mean_mort_total

total_acc
0      0.000000
0      0.052023
0      0.066743
0      0.103289
0      0.151293
           ...   
0    1.000000
0    1.000000
0    3.000000
0    2.000000
0    0.000000
Name: mort_acc, Length: 118, dtype: float64

We’ll try to fill the missing values using this series

def fill_mort_acc(total,mort):
    if np.isnan(mort):
        return mean_mort_total[total]
    else:
        return mort

df['mort_acc'] = df.apply(lambda x : fill_mort_acc(x['total_acc'],x['mort_acc']), axis=1)

df.isnull().sum()/df.shape[0]

loan_amnt               0.000000
term                    0.000000
int_rate                0.000000
installment             0.000000
grade                   0.000000
sub_grade               0.000000
home_ownership          0.000000
annual_inc              0.000000
verification_status     0.000000
issue_d                 0.000000
loan_status             0.000000
purpose                 0.000000
dti                     0.000000
earliest_cr_line        0.000000
open_acc                0.000000
pub_rec                 0.000000
revol_bal               0.000000
revol_util              0.000697
total_acc               0.000000
initial_list_status     0.000000
application_type        0.000000
mort_acc                0.000000
pub_rec_bankruptcies    0.001351
address                 0.000000
loan_repaid             0.000000
dtype: float64

As the data covered by revol_util and pub_rec_bankruptcies account for less than 0.05% of the data, we’ll drop these.

df.drop('revol_util',axis=1,inplace=True)
df.drop('pub_rec_bankruptcies',axis=1,inplace=True)

df.isnull().sum()

loan_amnt              0
term                   0
int_rate               0
installment            0
grade                  0
sub_grade              0
home_ownership         0
annual_inc             0
verification_status    0
issue_d                0
loan_status            0
purpose                0
dti                    0
earliest_cr_line       0
open_acc               0
pub_rec                0
revol_bal              0
total_acc              0
initial_list_status    0
application_type       0
mort_acc               0
address                0
loan_repaid            0
dtype: int64

Convert categorical data into dummy variables

#getting textual columns
df.select_dtypes(include='object').columns

Index(['term', 'grade', 'sub_grade', 'home_ownership', 'verification_status',
       'issue_d', 'loan_status', 'purpose', 'earliest_cr_line',
       'initial_list_status', 'application_type', 'address'],
      dtype='object')

feat_info('term')

The number of payments on the loan. Values are in months and can be either 36 or 60.

df['term'].unique()

array([' 36 months', ' 60 months'], dtype=object)

def term_change(string):
    if string == ' 36 months':
        return 36
    else:
        return 60

df['term'] = df['term'].apply(term_change)

df['term'].unique()

array([36, 60])

Since grade is a part of sub_grade, we can drop it!

df.drop('grade',axis=1,inplace=True)

#converting sub grade into dummy variables
dummy_df = pd.get_dummies(df,columns=['sub_grade'],drop_first=True)

dummy_df.select_dtypes(include='object').columns

Index(['home_ownership', 'verification_status', 'issue_d', 'loan_status',
       'purpose', 'earliest_cr_line', 'initial_list_status',
       'application_type', 'address'],
      dtype='object')

features = ['verification_status', 'application_type','initial_list_status','purpose']
dummy_df = pd.get_dummies(dummy_df,columns=features,drop_first=True)

dummy_df.columns

Index(['loan_amnt', 'term', 'int_rate', 'installment', 'home_ownership',
       'annual_inc', 'issue_d', 'loan_status', 'dti', 'earliest_cr_line',
       'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'mort_acc', 'address',
       'loan_repaid', 'sub_grade_A2', 'sub_grade_A3', 'sub_grade_A4',
       'sub_grade_A5', 'sub_grade_B1', 'sub_grade_B2', 'sub_grade_B3',
       'sub_grade_B4', 'sub_grade_B5', 'sub_grade_C1', 'sub_grade_C2',
       'sub_grade_C3', 'sub_grade_C4', 'sub_grade_C5', 'sub_grade_D1',
       'sub_grade_D2', 'sub_grade_D3', 'sub_grade_D4', 'sub_grade_D5',
       'sub_grade_E1', 'sub_grade_E2', 'sub_grade_E3', 'sub_grade_E4',
       'sub_grade_E5', 'sub_grade_F1', 'sub_grade_F2', 'sub_grade_F3',
       'sub_grade_F4', 'sub_grade_F5', 'sub_grade_G1', 'sub_grade_G2',
       'sub_grade_G3', 'sub_grade_G4', 'sub_grade_G5',
       'verification_status_Source Verified', 'verification_status_Verified',
       'application_type_INDIVIDUAL', 'application_type_JOINT',
       'initial_list_status_w', 'purpose_credit_card',
       'purpose_debt_consolidation', 'purpose_educational',
       'purpose_home_improvement', 'purpose_house', 'purpose_major_purchase',
       'purpose_medical', 'purpose_moving', 'purpose_other',
       'purpose_renewable_energy', 'purpose_small_business',
       'purpose_vacation', 'purpose_wedding'],
      dtype='object')

dummy_df.select_dtypes(include='object').columns

Index(['home_ownership', 'issue_d', 'loan_status', 'earliest_cr_line',
       'address'],
      dtype='object')

dummy_df['home_ownership'].value_counts()

MORTGAGE    198348
RENT        159790
OWN          37746
OTHER          112
NONE            31
ANY              3
Name: home_ownership, dtype: int64

def home_owner(string):
    if string == 'NONE' or string == 'ANY':
        return 'OTHER'
    else:
        return string

dummy_df['home_ownership'] = dummy_df['home_ownership'].apply(home_owner)
dummy_df['home_ownership'].value_counts()

MORTGAGE    198348
RENT        159790
OWN          37746
OTHER          146
Name: home_ownership, dtype: int64

dummy_df = pd.get_dummies(dummy_df,columns=['home_ownership'],drop_first=True)

dummy_df.select_dtypes(include='object').columns

Index(['issue_d', 'loan_status', 'earliest_cr_line', 'address'], dtype='object')

dummy_df['address'] = dummy_df['address'].apply(lambda x : x[-5:])
dummy_df['address']                            

0         22690
1         05113
2         05113
3         00813
4         11650
          ...  
396025    30723
396026    05113
396027    70466
396028    29597
396029    48052
Name: address, Length: 396030, dtype: object

dummy_df = pd.get_dummies(dummy_df,columns=['address'],drop_first=True)
dummy_df.columns

Index(['loan_amnt', 'term', 'int_rate', 'installment', 'annual_inc', 'issue_d',
       'loan_status', 'dti', 'earliest_cr_line', 'open_acc', 'pub_rec',
       'revol_bal', 'total_acc', 'mort_acc', 'loan_repaid', 'sub_grade_A2',
       'sub_grade_A3', 'sub_grade_A4', 'sub_grade_A5', 'sub_grade_B1',
       'sub_grade_B2', 'sub_grade_B3', 'sub_grade_B4', 'sub_grade_B5',
       'sub_grade_C1', 'sub_grade_C2', 'sub_grade_C3', 'sub_grade_C4',
       'sub_grade_C5', 'sub_grade_D1', 'sub_grade_D2', 'sub_grade_D3',
       'sub_grade_D4', 'sub_grade_D5', 'sub_grade_E1', 'sub_grade_E2',
       'sub_grade_E3', 'sub_grade_E4', 'sub_grade_E5', 'sub_grade_F1',
       'sub_grade_F2', 'sub_grade_F3', 'sub_grade_F4', 'sub_grade_F5',
       'sub_grade_G1', 'sub_grade_G2', 'sub_grade_G3', 'sub_grade_G4',
       'sub_grade_G5', 'verification_status_Source Verified',
       'verification_status_Verified', 'application_type_INDIVIDUAL',
       'application_type_JOINT', 'initial_list_status_w',
       'purpose_credit_card', 'purpose_debt_consolidation',
       'purpose_educational', 'purpose_home_improvement', 'purpose_house',
       'purpose_major_purchase', 'purpose_medical', 'purpose_moving',
       'purpose_other', 'purpose_renewable_energy', 'purpose_small_business',
       'purpose_vacation', 'purpose_wedding', 'home_ownership_OTHER',
       'home_ownership_OWN', 'home_ownership_RENT', 'address_05113',
       'address_11650', 'address_22690', 'address_29597', 'address_30723',
       'address_48052', 'address_70466', 'address_86630', 'address_93700'],
      dtype='object')

#drop issue_d as we shouldn't know beforehand whether loadn would be issued or now
dummy_df.drop('issue_d',axis=1,inplace=True)

feat_info('earliest_cr_line')

The month the borrower's earliest reported credit line was opened

dummy_df['earliest_cr_line'] = dummy_df['earliest_cr_line'].apply(lambda x : int(x[-4:]))
dummy_df['earliest_cr_line']

0         1990
1         2004
2         2007
3         2006
4         1999
          ...
396025    2004
396026    2006
396027    1997
396028    1990
396029    1998
Name: earliest_cr_line, Length: 396030, dtype: int64

dummy_df.select_dtypes(include='object').columns

Index(['loan_status'], dtype='object')

We can now start with Model building!

from sklearn.model_selection import train_test_split
dummy_df.drop('loan_status',axis=1,inplace=True) #we alreday have loan_repaid in 0 and 1

X = dummy_df.drop('loan_repaid',axis=1).values
y = dummy_df['loan_repaid'].values

from sklearn.preprocessing import MinMaxScaler
X_train,X_test,y_train, y_test = train_test_split(X,y,random_state=101,test_size=0.2)

scale = MinMaxScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
from tensorflow.keras.optimizers import Adam

X_train.shape

(316824, 76)

We are creating a Sequential Model, with activation function as Rectified Linear Unit, with a Dropout layer of 20% neurons switching off, sigmoid function as activation for output. Since its a binary classification problem, we are using Binary Cross-Entropy as loss function and Adam as optimizer.

model = Sequential()

model.add(Dense(units=76, activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(units=38,activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(units=19,activation='relu'))
model.add(Dropout(0.2))

model.add(Dense(units=1,activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

model.fit(X_train,y_train,batch_size=256,epochs=50,validation_data=(X_test,y_test),
          verbose=1)

Epoch 1/50
1238/1238 [==============================] - 2s 2ms/step - loss: 0.3007 - val_loss: 0.2633
Epoch 2/50
1238/1238 [==============================] - 2s 2ms/step - loss: 0.2664 - val_loss: 0.2604
Epoch 3/50
1238/1238 [==============================] - 2s 2ms/step - loss: 0.2637 - val_loss: 0.2597
Epoch 4/50
1238/1238 [==============================] - 2s 2ms/step - loss: 0.2626 - val_loss: 0.2592
Epoch 5/50
1238/1238 [==============================] - 2s 2ms/step - loss: 0.2618 - val_loss: 0.2591
Epoch 6/50
1238/1238 [==============================] - 2s 2ms/step - loss: 0.2612 - val_loss: 0.2593
.
.
.
Epoch 50/50
1238/1238 [==============================] - 3s 2ms/step - loss: 0.2549 - val_loss: 0.2575

loss = pd.DataFrame(model.history.history)
loss.plot()

The model is overfitting. We can try to fix that using Early stopping. We can even play around with dropout layers.

from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)

n_model = Sequential()

n_model.add(Dense(units=76, activation='relu'))
n_model.add(Dropout(0.2))

n_model.add(Dense(units=38,activation='relu'))
n_model.add(Dropout(0.2))

n_model.add(Dense(units=19,activation='relu'))
n_model.add(Dropout(0.2))

n_model.add(Dense(units=1,activation='sigmoid'))
n_model.compile(loss='binary_crossentropy', optimizer='adam')

n_model.fit(X_train,y_train,batch_size=256,epochs=50,validation_data=(X_test,y_test),
          verbose=1, callbacks = [early_stop])

Epoch 1/50
1238/1238 [==============================] - 3s 2ms/step - loss: 0.3013 - val_loss: 0.2625
Epoch 2/50
1238/1238 [==============================] - 3s 2ms/step - loss: 0.2666 - val_loss: 0.2607
Epoch 3/50
1238/1238 [==============================] - 4s 3ms/step - loss: 0.2639 - val_loss: 0.2593
Epoch 4/50
1238/1238 [==============================] - 3s 3ms/step - loss: 0.2630 - val_loss: 0.2593
Epoch 5/50
1238/1238 [==============================] - 3s 2ms/step - loss: 0.2618 - val_loss: 0.2594
Epoch 6/50
1238/1238 [==============================] - 3s 2ms/step - loss: 0.2614 - val_loss: 0.2589
Epoch 7/50
.
.
.
Epoch 21/50
1238/1238 [==============================] - 3s 2ms/step - loss: 0.2580 - val_loss: 0.2587
Epoch 00021: early stopping

loss = pd.DataFrame(n_model.history.history)
loss.plot()

Evaluate the model

from sklearn.metrics import classification_report, confusion_matrix
n_predictions = n_model.predict_classes(X_test)

print(confusion_matrix(y_test,n_predictions))

[[ 6840  8653]
 [  114 63599]]

print(classification_report(y_test,n_predictions))

              precision    recall  f1-score   support

           0       0.98      0.44      0.61     15493
           1       0.88      1.00      0.94     63713

    accuracy                           0.89     79206
   macro avg       0.93      0.72      0.77     79206
weighted avg       0.90      0.89      0.87     79206

Predicting a new entry

import random
random.seed(101)
random_ind = random.randint(0,len(df))

new_customer = dummy_df.drop('loan_repaid',axis=1).iloc[random_ind]
new_customer

loan_amnt        24000.00
term                60.00
int_rate            13.11
installment        547.43
annual_inc       85000.00
                   ...   
address_30723        0.00
address_48052        0.00
address_70466        0.00
address_86630        0.00
address_93700        0.00
Name: 304691, Length: 76, dtype: float64

new_customer = scale.transform(new_customer.values.reshape(1,76))
n_model.predict_classes(new_customer)

array([[1]], dtype=int32)

We’ve predicted that we would give this person the loan. Let’s check if they returned it.

dummy_df.iloc[random_ind]['loan_repaid']

1.0

Result: We were able to pre-process our data, do some feature engineering and create a model that predicts whether a loan would get repaid or not with an accuracy of 89%.

As mentioned before, due to unbalanced data, out true metric evaluation is ideally given by the F1 score of parametric with less values, i.e. 0. Here, it is 61%.

Whether it is good or not depends on the metric and the actual business requirement of the model we created. Given the skewed data, this is still pretty good.

Share on

Twitter Facebook Google+ LinkedIn

Vanya Sahu