This data set was taken from the UCI Machine Learning Repository. It contains the medical records of 299 patients diagnosed with heart failure, each with 13 clinical features recorded during a follow-up period. The data was collected by Davide Chicco and Giuseppe Jurman in 2020 for their research paper “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone”.
The purpose of this project is to determine which clinical features are most statistically significant in predicting the death event. We focus on model selection and several other statistical prediction methods; our results are detailed below.
A limitation of this data set is its relatively small size (299 observations), which means that even after finding the best model, we may run into problems with overfitting and unstable estimates.
Thirteen (13) clinical features:
age: age of the patient (years)
anaemia: decrease of red blood cells or hemoglobin (boolean)
high blood pressure: if the patient has hypertension (boolean)
creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L). When a muscle tissue gets damaged, CPK flows into the blood. Therefore, high levels of CPK in the blood of a patient might indicate a heart failure or injury
diabetes: if the patient has diabetes (boolean)
ejection fraction: percentage of blood leaving the left ventricle at each contraction (percentage)
platelets: platelets in the blood (kiloplatelets/mL)
sex: woman(0) or man(1) (binary)
serum creatinine: level of serum creatinine in the blood (mg/dL); creatinine is a waste product generated by creatine when muscle breaks down. In particular, doctors check serum creatinine levels in the blood to assess kidney function.
serum sodium: level of serum sodium in the blood (mEq/L). An abnormally low level of sodium in the blood might be caused by heart failure.
smoking: if the patient smokes or not (boolean)
time: follow-up period (days)
[target] death event: whether the patient died during the follow-up period (boolean)
head(heartdata)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1 75 0 582 0 20
## 2 55 0 7861 0 38
## 3 65 0 146 0 20
## 4 50 1 111 0 20
## 5 65 1 160 1 20
## 6 90 1 47 0 40
## high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1 1 265000 1.9 130 1 0 4
## 2 0 263358 1.1 136 1 0 6
## 3 0 162000 1.3 129 1 1 7
## 4 0 210000 1.9 137 1 0 7
## 5 0 327000 2.7 116 0 0 8
## 6 1 204000 2.1 132 1 1 8
## DEATH_EVENT
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
sum(is.na(heartdata))
## [1] 0
sum(duplicated(heartdata))
## [1] 0
The data set contains no missing values and no duplicated observations.
str(heartdata)
## 'data.frame': 299 obs. of 13 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : int 0 0 0 1 1 1 1 1 0 1 ...
## $ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
## $ diabetes : int 0 0 0 0 1 0 0 1 0 0 ...
## $ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : int 1 0 0 0 0 1 0 0 0 1 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : int 1 1 1 1 0 1 1 1 0 1 ...
## $ smoking : int 0 0 1 0 0 1 0 1 0 1 ...
## $ time : int 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : int 1 1 1 1 1 1 1 1 1 1 ...
The variables anaemia, smoking, sex, diabetes, high_blood_pressure, and DEATH_EVENT should be categorical. This correction is made in the next block.
# Making correction in data type ----
categoricalcolumns <- c("anaemia", "smoking", "sex", "diabetes", "DEATH_EVENT", "high_blood_pressure")
for (i in categoricalcolumns) {
  heartdata[, i] <- as.factor(heartdata[, i])
}
str(heartdata)
## 'data.frame': 299 obs. of 13 variables:
## $ age : num 75 55 65 50 65 90 75 60 65 80 ...
## $ anaemia : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
## $ creatinine_phosphokinase: int 582 7861 146 111 160 47 246 315 157 123 ...
## $ diabetes : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
## $ ejection_fraction : int 20 38 20 20 20 40 15 60 65 35 ...
## $ high_blood_pressure : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
## $ platelets : num 265000 263358 162000 210000 327000 ...
## $ serum_creatinine : num 1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
## $ serum_sodium : int 130 136 129 137 116 132 137 131 138 133 ...
## $ sex : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
## $ smoking : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
## $ time : int 4 6 7 7 8 8 10 10 10 10 ...
## $ DEATH_EVENT : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
These variables are now converted to factors.
summary(heartdata)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## Min. :40.00 0:170 Min. : 23.0 0:174 Min. :14.00
## 1st Qu.:51.00 1:129 1st Qu.: 116.5 1:125 1st Qu.:30.00
## Median :60.00 Median : 250.0 Median :38.00
## Mean :60.83 Mean : 581.8 Mean :38.08
## 3rd Qu.:70.00 3rd Qu.: 582.0 3rd Qu.:45.00
## Max. :95.00 Max. :7861.0 Max. :80.00
## high_blood_pressure platelets serum_creatinine serum_sodium sex
## 0:194 Min. : 25100 Min. :0.500 Min. :113.0 0:105
## 1:105 1st Qu.:212500 1st Qu.:0.900 1st Qu.:134.0 1:194
## Median :262000 Median :1.100 Median :137.0
## Mean :263358 Mean :1.394 Mean :136.6
## 3rd Qu.:303500 3rd Qu.:1.400 3rd Qu.:140.0
## Max. :850000 Max. :9.400 Max. :148.0
## smoking time DEATH_EVENT
## 0:203 Min. : 4.0 0:203
## 1: 96 1st Qu.: 73.0 1: 96
## Median :115.0
## Mean :130.3
## 3rd Qu.:203.0
## Max. :285.0
library(ggplot2)
ggplot(heartdata, aes(x=age)) +
geom_histogram(color="white", fill="blue")
This plot shows that the ages in the sample range from 40 to 95 years, with most participants around 60 years old. The age distribution appears approximately normal.
ggplot(heartdata, aes(x=anaemia)) +
geom_bar(color="white", fill="blue")
More participants in the study did not have anaemia, though a substantial number did. We can continue with the analysis based on this data.
ggplot(heartdata, aes(x=creatinine_phosphokinase)) +
geom_histogram(color="white", fill="blue")
The majority of people in the study had creatinine phosphokinase levels below 2000, but the range was up to 8000.
ggplot(heartdata, aes(x=diabetes)) +
geom_bar(color="white", fill="blue")
The majority of participants did not have diabetes, though a substantial number did. We can continue with the analysis based on this data.
ggplot(heartdata, aes(x=ejection_fraction)) +
geom_histogram(color="white", fill="blue")
The majority of patients were under 55% for their ejection fraction. Normal EF is in the range of 55% to 70%. As the percentage falls, it tells the doctor that the heart failure is getting worse. In general, if the EF falls below 30%, it’s relatively severe. A reading of 20% or below is very severe heart failure. EF above 75% is considered too high, and can be a problem as well.
Source: https://www.webmd.com/heart-disease/heart-failure/features/ejection-fraction
ggplot(heartdata, aes(x=high_blood_pressure)) +
geom_bar(color="white", fill="blue")
Most patients did not have high blood pressure.
ggplot(heartdata, aes(x=platelets)) +
geom_histogram(color="white", fill="blue")
The platelet counts appear approximately normally distributed. A normal platelet count ranges from 150,000 to 450,000 platelets per microliter of blood.
ggplot(heartdata, aes(x=serum_creatinine)) +
geom_histogram(color="white", fill="blue")
The serum creatinine distribution is skewed to the right (long upper tail). The cutoff for poor kidney function is about 1.35 mg/dL in men and 1.04 mg/dL in women; many participants were above these cutoffs.
ggplot(heartdata, aes(x=serum_sodium)) +
geom_histogram(color="white", fill="blue")
A normal blood sodium level is between 135 and 145 milliequivalents per liter (mEq/L). Most patients were in the normal range.
Source: https://www.mayoclinic.org/diseases-conditions/hyponatremia/symptoms-causes/syc-20373711#
ggplot(heartdata, aes(x=sex)) +
geom_bar(color="white", fill="blue")
More patients in the study were male.
ggplot(heartdata, aes(x=smoking)) +
geom_bar(color="white", fill="blue")
More patients were non-smokers than smokers.
ggplot(heartdata, aes(x=time)) +
geom_histogram(color="white", fill="blue")
The follow-up time does not follow any obvious distribution.
ggplot(heartdata, aes(x=DEATH_EVENT)) +
geom_bar(color="white", fill="blue")
More patients survived than died during the follow-up period.
An important assumption of logistic regression is that the outcome is binary. This assumption is satisfied, as the table below shows.
table(heartdata$DEATH_EVENT)
##
## 0 1
## 203 96
The relationship between the logit of the outcome and each continuous predictor should be linear. The plots below suggest this assumption is also reasonably satisfied.
# Applying Model
model_test <- glm(DEATH_EVENT ~ ., family = binomial, data = heartdata)
probabilities <- predict(model_test, type = "response")

data_check <- heartdata %>% select_if(is.numeric)
predictors <- colnames(data_check)

# Bind the logit and tidy the data for plotting
mydata <- data_check %>%
  mutate(logit = log(probabilities / (1 - probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)
ggplot(mydata, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
plot(model_test, which = 4, id.n = 3) # Gives Influential Values based on observations.
model.data <- augment(model_test) %>%
  mutate(index = 1:n())
model.data %>% top_n(3, .cooksd)
## # A tibble: 3 × 20
## DEATH_EVENT age anaemia creatinine_phosphokinase diabetes ejection_fraction
## <fct> <dbl> <fct> <int> <fct> <int>
## 1 0 60 1 1082 1 45
## 2 1 54 1 427 0 70
## 3 0 65 0 56 0 25
## # … with 14 more variables: high_blood_pressure <fct>, platelets <dbl>,
## # serum_creatinine <dbl>, serum_sodium <int>, sex <fct>, smoking <fct>,
## # time <int>, .fitted <dbl>, .resid <dbl>, .std.resid <dbl>, .hat <dbl>,
## # .sigma <dbl>, .cooksd <dbl>, index <int>
ggplot(model.data, aes(index, .std.resid)) +
geom_point(aes(color = DEATH_EVENT), alpha = .5) +
theme_bw()
model.data %>%
  filter(abs(.std.resid) > 3)
## # A tibble: 0 × 20
## # … with 20 variables: DEATH_EVENT <fct>, age <dbl>, anaemia <fct>,
## # creatinine_phosphokinase <int>, diabetes <fct>, ejection_fraction <int>,
## # high_blood_pressure <fct>, platelets <dbl>, serum_creatinine <dbl>,
## # serum_sodium <int>, sex <fct>, smoking <fct>, time <int>, .fitted <dbl>,
## # .resid <dbl>, .std.resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## # index <int>
No overly influential observations (|standardized residual| > 3) were found.
All VIF values below are less than 4, so the no-multicollinearity assumption holds; none of the predictors are strongly collinear.
car::vif(model_test)
## age anaemia creatinine_phosphokinase
## 1.104307 1.114540 1.085629
## diabetes ejection_fraction high_blood_pressure
## 1.052375 1.172842 1.063014
## platelets serum_creatinine serum_sodium
## 1.045319 1.102088 1.070685
## sex smoking time
## 1.380635 1.284512 1.151810
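For intuition, the VIF of a predictor equals 1 / (1 − R²), where R² comes from regressing that predictor on all the others. A minimal synthetic sketch (not the project data):

```r
# Synthetic illustration of what a variance inflation factor measures:
# VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing predictor j on
# the remaining predictors. With two predictors, the "remaining predictors"
# is just the other one.
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.5)        # deliberately correlated with x1
r2 <- summary(lm(x1 ~ x2))$r.squared   # R^2 of x1 regressed on x2
vif_x1 <- 1 / (1 - r2)
vif_x1                                 # well above 1, since x1 and x2 correlate
```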
# Splitting into training and testing:
set.seed(4)
n <- nrow(heartdata)
split <- sample(1:n, 0.75 * n, replace = FALSE)
training <- heartdata[split, ]
test <- heartdata[-split, ]

# Applying Model:
model_logit <- glm(DEATH_EVENT ~ ., family = binomial(link = 'logit'), data = training)
# Summary:
summary(model_logit)
##
## Call:
## glm(formula = DEATH_EVENT ~ ., family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3932 -0.6453 -0.2183 0.5192 2.4608
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.317e+01 6.368e+00 2.068 0.0387 *
## age 3.689e-02 1.801e-02 2.049 0.0405 *
## anaemia1 4.735e-02 4.137e-01 0.114 0.9089
## creatinine_phosphokinase 1.246e-04 1.699e-04 0.733 0.4635
## diabetes1 7.387e-02 3.952e-01 0.187 0.8517
## ejection_fraction -8.919e-02 1.996e-02 -4.467 7.92e-06 ***
## high_blood_pressure1 -3.963e-01 4.289e-01 -0.924 0.3556
## platelets -6.989e-07 2.106e-06 -0.332 0.7400
## serum_creatinine 3.883e-01 2.198e-01 1.767 0.0773 .
## serum_sodium -7.406e-02 4.376e-02 -1.692 0.0906 .
## sex1 -7.634e-01 4.864e-01 -1.569 0.1165
## smoking1 -1.019e-01 4.741e-01 -0.215 0.8299
## time -2.212e-02 3.638e-03 -6.082 1.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 288.28 on 223 degrees of freedom
## Residual deviance: 170.23 on 211 degrees of freedom
## AIC: 196.23
##
## Number of Fisher Scoring iterations: 6
The most statistically significant variables are age, ejection_fraction, and time; time has the lowest p-value, which suggests it has a strong association with the death event.
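The coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret. A quick sketch using the fitted values from the summary:

```r
# Converting log-odds coefficients from the summary above into odds ratios.
coef_time <- -2.212e-02    # coefficient of time
coef_ef   <- -8.919e-02    # coefficient of ejection_fraction
round(exp(coef_time), 3)   # each extra follow-up day lowers the odds of death ~2.2%
round(exp(coef_ef), 3)     # each EF point lowers the odds of death ~8.5%
```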
anova(model_logit, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: DEATH_EVENT
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 223 288.28
## age 1 8.997 222 279.29 0.002704 **
## anaemia 1 0.072 221 279.22 0.788645
## creatinine_phosphokinase 1 2.153 220 277.06 0.142248
## diabetes 1 0.402 219 276.66 0.526235
## ejection_fraction 1 27.769 218 248.89 1.367e-07 ***
## high_blood_pressure 1 0.168 217 248.72 0.681951
## platelets 1 0.020 216 248.70 0.886580
## serum_creatinine 1 16.120 215 232.58 5.946e-05 ***
## serum_sodium 1 2.027 214 230.56 0.154558
## sex 1 1.060 213 229.50 0.303144
## smoking 1 0.243 212 229.25 0.622015
## time 1 59.026 211 170.23 1.556e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA additionally flags serum_creatinine as significant, on top of the three variables identified in the previous part.
bs1 <- bestglm(Xy = training, family = binomial, IC = "BIC")
summary(bs1)
## Fitting algorithm: BIC-glm
## Best Model:
## df deviance
## Null Model 220 181.6895
## Full Model 223 288.2842
##
## likelihood-ratio test - GLM
##
## data: H0: Null Model vs. H1: Best Fit BIC-glm
## X = 106.59, df = 3, p-value < 2.2e-16
bs1
## BIC
## BICq equivalent for q in (0.13864112844905, 0.639542967214341)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.56479321 0.854055815 4.173958 2.993528e-05
## ejection_fraction -0.07417862 0.017205430 -4.311349 1.622615e-05
## serum_creatinine 0.59316104 0.225618041 2.629050 8.562385e-03
## time -0.02047303 0.003266243 -6.268065 3.655623e-10
bs1$Subsets
## Intercept age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 0 TRUE FALSE FALSE FALSE FALSE FALSE
## 1 TRUE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE FALSE TRUE
## 3* TRUE FALSE FALSE FALSE FALSE TRUE
## 4 TRUE TRUE FALSE FALSE FALSE TRUE
## 5 TRUE TRUE FALSE FALSE FALSE TRUE
## 6 TRUE TRUE FALSE FALSE FALSE TRUE
## 7 TRUE TRUE FALSE FALSE FALSE TRUE
## 8 TRUE TRUE FALSE TRUE FALSE TRUE
## 9 TRUE TRUE FALSE TRUE FALSE TRUE
## 10 TRUE TRUE FALSE TRUE FALSE TRUE
## 11 TRUE TRUE FALSE TRUE TRUE TRUE
## 12 TRUE TRUE TRUE TRUE TRUE TRUE
## high_blood_pressure platelets serum_creatinine serum_sodium sex smoking
## 0 FALSE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE
## 3* FALSE FALSE TRUE FALSE FALSE FALSE
## 4 FALSE FALSE TRUE FALSE FALSE FALSE
## 5 FALSE FALSE TRUE TRUE FALSE FALSE
## 6 FALSE FALSE TRUE TRUE TRUE FALSE
## 7 TRUE FALSE TRUE TRUE TRUE FALSE
## 8 TRUE FALSE TRUE TRUE TRUE FALSE
## 9 TRUE TRUE TRUE TRUE TRUE FALSE
## 10 TRUE TRUE TRUE TRUE TRUE TRUE
## 11 TRUE TRUE TRUE TRUE TRUE TRUE
## 12 TRUE TRUE TRUE TRUE TRUE TRUE
## time logLikelihood BIC
## 0 FALSE -144.14211 288.2842
## 1 TRUE -107.36937 220.1504
## 2 TRUE -95.37721 201.5777
## 3* TRUE -90.84476 197.9245
## 4 TRUE -88.71232 199.0712
## 5 TRUE -87.23411 201.5264
## 6 TRUE -85.95420 204.3783
## 7 TRUE -85.50229 208.8861
## 8 TRUE -85.22504 213.7432
## 9 TRUE -85.16525 219.0353
## 10 TRUE -85.13704 224.3905
## 11 TRUE -85.12043 229.7690
## 12 TRUE -85.11388 235.1675
The BIC criterion selects three variables: ejection_fraction, serum_creatinine, and time.
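As a sanity check, the BIC column in the subset table follows BIC = −2·logLik + k·log(n), with n = 224 training observations and k the number of predictors beyond the intercept (assuming bestglm's counting convention):

```r
# Recomputing the BIC of the selected 3-variable model (row 3*) from the
# table above: BIC = -2 * logLik + k * log(n).
loglik <- -90.84476   # log-likelihood reported for row 3*
n <- 224              # training-set size (0.75 * 299, rounded down)
k <- 3                # predictors beyond the intercept
bic <- -2 * loglik + k * log(n)
round(bic, 4)         # 197.9245, matching the table
```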
bs2 <- bestglm(Xy = training, family = binomial, IC = "AIC")
summary(bs2)
## Fitting algorithm: AIC-glm
## Best Model:
## df deviance
## Null Model 217 171.9084
## Full Model 223 288.2842
##
## likelihood-ratio test - GLM
##
## data: H0: Null Model vs. H1: Best Fit AIC-glm
## X = 116.38, df = 6, p-value < 2.2e-16
bs2
## AIC
## BICq equivalent for q in (0.806264154642832, 0.904987365701856)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.49988349 6.15599546 2.030522 4.230352e-02
## age 0.03339734 0.01729805 1.930699 5.352025e-02
## ejection_fraction -0.08740968 0.01939957 -4.505754 6.613780e-06
## serum_creatinine 0.45475163 0.20880107 2.177918 2.941213e-02
## serum_sodium -0.07109988 0.04271560 -1.664495 9.601364e-02
## sex1 -0.70413868 0.44407036 -1.585647 1.128195e-01
## time -0.02157386 0.00350475 -6.155608 7.479016e-10
bs2$Subsets
## Intercept age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 0 TRUE FALSE FALSE FALSE FALSE FALSE
## 1 TRUE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE FALSE TRUE
## 3 TRUE FALSE FALSE FALSE FALSE TRUE
## 4 TRUE TRUE FALSE FALSE FALSE TRUE
## 5 TRUE TRUE FALSE FALSE FALSE TRUE
## 6* TRUE TRUE FALSE FALSE FALSE TRUE
## 7 TRUE TRUE FALSE FALSE FALSE TRUE
## 8 TRUE TRUE FALSE TRUE FALSE TRUE
## 9 TRUE TRUE FALSE TRUE FALSE TRUE
## 10 TRUE TRUE FALSE TRUE FALSE TRUE
## 11 TRUE TRUE FALSE TRUE TRUE TRUE
## 12 TRUE TRUE TRUE TRUE TRUE TRUE
## high_blood_pressure platelets serum_creatinine serum_sodium sex smoking
## 0 FALSE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE TRUE FALSE FALSE FALSE
## 4 FALSE FALSE TRUE FALSE FALSE FALSE
## 5 FALSE FALSE TRUE TRUE FALSE FALSE
## 6* FALSE FALSE TRUE TRUE TRUE FALSE
## 7 TRUE FALSE TRUE TRUE TRUE FALSE
## 8 TRUE FALSE TRUE TRUE TRUE FALSE
## 9 TRUE TRUE TRUE TRUE TRUE FALSE
## 10 TRUE TRUE TRUE TRUE TRUE TRUE
## 11 TRUE TRUE TRUE TRUE TRUE TRUE
## 12 TRUE TRUE TRUE TRUE TRUE TRUE
## time logLikelihood AIC
## 0 FALSE -144.14211 288.2842
## 1 TRUE -107.36937 216.7387
## 2 TRUE -95.37721 194.7544
## 3 TRUE -90.84476 187.6895
## 4 TRUE -88.71232 185.4246
## 5 TRUE -87.23411 184.4682
## 6* TRUE -85.95420 183.9084
## 7 TRUE -85.50229 185.0046
## 8 TRUE -85.22504 186.4501
## 9 TRUE -85.16525 188.3305
## 10 TRUE -85.13704 190.2741
## 11 TRUE -85.12043 192.2409
## 12 TRUE -85.11388 194.2278
The AIC criterion selects six variables: age, ejection_fraction, serum_creatinine, serum_sodium, sex, and time. This is expected, since AIC penalizes model size less heavily than BIC.
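The same kind of check works for AIC, which replaces the log(n) penalty with a constant 2 per parameter (hence the larger selected model):

```r
# Recomputing the AIC of the selected 6-variable model (row 6*) from the
# table above: AIC = -2 * logLik + 2 * k.
loglik <- -85.95420   # log-likelihood reported for row 6*
k <- 6                # predictors beyond the intercept
aic <- -2 * loglik + 2 * k
round(aic, 4)         # 183.9084, matching the table
```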
We fit both models and compared their test-set accuracy to determine which performs better.
# Variables taken from BIC
model1 <- glm(DEATH_EVENT ~ ejection_fraction + serum_creatinine + time,
              family = binomial, data = training)
summary(model1)
##
## Call:
## glm(formula = DEATH_EVENT ~ ejection_fraction + serum_creatinine +
## time, family = binomial, data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1277 -0.6797 -0.2658 0.6305 2.5867
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.564793 0.854056 4.174 2.99e-05 ***
## ejection_fraction -0.074179 0.017205 -4.311 1.62e-05 ***
## serum_creatinine 0.593161 0.225618 2.629 0.00856 **
## time -0.020473 0.003266 -6.268 3.66e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 288.28 on 223 degrees of freedom
## Residual deviance: 181.69 on 220 degrees of freedom
## AIC: 189.69
##
## Number of Fisher Scoring iterations: 5
predicted1 <- predict(model1, test, type = "response")
predicted1 <- ifelse(predicted1 >= 0.5, 1, 0)
(accuracy <- (length(which(predicted1 == test$DEATH_EVENT)) / nrow(test)) * 100)
## [1] 81.33333
# Variables taken from AIC
model2 <- glm(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium + sex + time,
              family = binomial, data = training)
summary(model2)
##
## Call:
## glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine +
## serum_sodium + sex + time, family = binomial, data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4657 -0.6626 -0.2309 0.5266 2.5327
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.499883 6.155995 2.031 0.0423 *
## age 0.033397 0.017298 1.931 0.0535 .
## ejection_fraction -0.087410 0.019400 -4.506 6.61e-06 ***
## serum_creatinine 0.454752 0.208801 2.178 0.0294 *
## serum_sodium -0.071100 0.042716 -1.664 0.0960 .
## sex1 -0.704139 0.444070 -1.586 0.1128
## time -0.021574 0.003505 -6.156 7.48e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 288.28 on 223 degrees of freedom
## Residual deviance: 171.91 on 217 degrees of freedom
## AIC: 185.91
##
## Number of Fisher Scoring iterations: 5
predicted2 <- predict(model2, test, type = "response")
predicted2 <- ifelse(predicted2 >= 0.5, 1, 0)
(accuracy2 <- (length(which(predicted2 == test$DEATH_EVENT)) / nrow(test)) * 100)
## [1] 84
The AIC-selected model achieves slightly better test accuracy than the BIC-selected model. We ran leave-one-out cross-validation (LOOCV) to verify this result.
loocv1 <- glm(DEATH_EVENT ~ ejection_fraction + serum_creatinine + time,
              family = binomial, data = heartdata)

(out1 <- cv.glm(heartdata, glmfit = loocv1, K = nrow(heartdata)))
## $call
## cv.glm(data = heartdata, glmfit = loocv1, K = nrow(heartdata))
##
## $K
## [1] 299
##
## $delta
## [1] 0.1305802 0.1305721
##
## $seed
## [1] 10403 331 442120954 887807557 135382687 1579654182
## [7] 2101590192 2041315667 1735425137 889127328 1140743486 1102298185
## [13] 1912223995 1729474762 -1972500948 -1889117649 2133119717 636053404
## [19] -208842254 487308397 -238613081 626423758 889509160 1667909419
## [25] -190296759 -2026614184 -1682674938 107955937 1351242355 -2035524894
## [31] -38640044 1517596503 623674189 1273685732 -696473910 -755456203
## [37] -1385230961 -807097706 1248395840 -1981676797 1166144289 -1957743600
## [43] 608113646 1343677337 1963727915 -1673860838 -879080484 -1785811489
## [49] -14353995 -1428957364 -1544073822 1046142205 535970295 750423838
## [55] 1193185752 410831675 -19792679 535965800 1293702550 330657041
## [61] 347807427 -1080099118 -964016284 -1377068377 -2110741507 198030836
## [67] -2140597542 -1719925723 -1333171905 -1078546746 705209168 -348898445
## [73] -124121711 -1241542720 -997209698 2106643305 -1302419557 378662570
## [79] 890448076 441825679 942534085 -160484996 -102735406 -1264927411
## [85] 899203079 1239379950 -1250093368 1932372427 364100649 627182200
## [91] -1008019546 -1509662271 -624904621 976837890 -1899369420 838569399
## [97] 35692973 -524819772 -1687226646 -1810556587 -1357779217 1917832438
## [103] -1824981984 1890041315 -959734143 866916592 -503441522 -753407751
## [109] -1188924789 -1092693318 -411984452 68241151 1891371605 113257836
## [115] -559897150 238503453 -1068990825 305625086 1019298744 -148119653
## [121] 214042105 1836844360 -1082290058 -646767183 -1660402717 -875364174
## [127] 438119684 -581800377 856217885 1748621204 472554298 -446126203
## [133] -79883169 1243733734 1586996592 613191315 -398510287 -2008683040
## [139] 274143870 870942985 123038523 -459988214 809855724 913434479
## [145] -977614683 2106696284 -182447566 -1880270547 -1731780633 95053582
## [151] -836765464 1760365291 -1360788599 186146584 1341063750 -994401631
## [157] -1188723405 -1831682910 -749816940 663431703 75247117 1475005220
## [163] 1775064842 990509685 659515855 1155592918 -587295104 -1836377917
## [169] -1815456159 1580822224 1150695470 -945959463 2048897515 -298240934
## [175] -1337635300 -1450426593 -113388683 -897279476 1171921890 -238884675
## [181] 1157994295 60841950 -1548088040 -1050156933 -804129255 550346408
## [187] -576820138 1252649169 224777347 -168380654 -989490524 -1862017433
## [193] -807145539 -1429487308 949674650 -129965851 33989247 -2035364858
## [199] 499351440 1008165427 -1640089135 153774976 1438041694 -1180131927
## [205] -2022699941 -1257884566 -1870557940 -776433841 1029065733 666194620
## [211] -2063236206 872104077 1682446023 -1560166482 551782920 -1322422773
## [217] 781461225 1451975096 347135590 -502531327 248368531 989603394
## [223] 976124404 1408017399 1704630253 1129517188 -414734678 613022741
## [229] 1406176431 669335740 -936241516 1577494146 -946250560 994570556
## [235] -233590640 801548834 -1268161016 1868906764 -1402137956 1023225362
## [241] -539153024 -56320492 339760680 620258618 -2146089008 -129002900
## [247] -696211660 -1620335982 -398706336 765972812 -1136300064 -1138090078
## [253] 1598142376 -1202435188 -473110852 98503538 477804784 -1950853052
## [259] -1978229416 1351832634 -219823248 -879868868 2014063380 -2139403838
## [265] 1913880800 1012660092 1653923536 1112704674 1686651944 372222220
## [271] -488934468 1166982450 516956096 -597716428 602823624 -1211784006
## [277] 1486565264 -1915611668 -1763458988 579722898 -333579488 1153371596
## [283] -132154400 -1755168062 -2058675672 430849708 -952757572 1431007730
## [289] 348768560 -250565468 606380728 -666211494 -336646768 -739856004
## [295] 1485819156 -267835070 214491904 590666684 1111513936 -2095668958
## [301] 770268488 -702187508 -1411450788 -1623268142 -212809728 899013460
## [307] -1185234456 503871482 1616328400 -483973844 2021861620 379545490
## [313] -1997595552 -3400436 -1291434272 1486126434 -1298455896 -1747592628
## [319] 118609788 -1271715022 -1237180816 -1976656572 -1216627752 939257978
## [325] 1814046256 -2035332420 1779926868 -1720272318 -1879053408 -868996932
## [331] -251584048 -1905924766 -1589505048 -1695868980 1795806972 -905342990
## [337] 1413774464 -1324844812 -2003782520 1229546810 -798407856 -982242196
## [343] -1964908396 -937903790 -1046069536 1954287820 1584388256 1181243778
## [349] -101844760 -2058381076 761337916 -203251598 837560432 71754276
## [355] 1282646584 1378200090 -1569582896 228570684 839142548 -2115464062
## [361] -431421504 319270204 -930431344 -1841572830 1237946376 2108472460
## [367] -367935460 807707794 13257856 89113620 -602807512 -1255795654
## [373] -1510560560 -463915924 498177716 -996024302 -1513242656 -2136110132
## [379] -1241072160 -1897690078 -904637144 1004034316 653446716 -1555577614
## [385] 337410288 -1456818364 -1515851688 1606484538 -1290778384 1935188668
## [391] -1085847020 1486466242 -773704096 -1135835780 1176134864 -1819167710
## [397] -781891416 10263948 1912906172 262277170 1376211776 -572514636
## [403] 866554184 -1520815430 1962573584 876282092 1383741524 290703122
## [409] -1577197152 -1060739124 2123177824 -518345534 -850735576 -1738184276
## [415] -588635332 41968498 1908576048 -1066827740 787496504 -2060086566
## [421] 1011144592 17495292 514018708 1064403778 972343808 1201506364
## [427] -850148784 -1405009758 -1299135672 1046166028 521476828 -1460948910
## [433] 1749388416 -1758181548 981581800 1223963386 -869513392 -868359124
## [439] -841070220 258125458 250602080 1692709004 2053329248 689281250
## [445] -1743426392 496621260 150329852 986483634 -1740553232 1542554052
## [451] -222282408 476049786 -175869520 -494392900 -1123090860 571611586
## [457] -1321680598 -1162828344 -1909499151 -895290269 -357167004 -666756206
## [463] -2099703705 -1728139215 -1210220170 -1396371820 -907590491 -2059036241
## [469] 607224424 770398790 -671588365 -1578795179 2038833458 -172079872
## [475] 191419881 -1156939685 -1747307860 -785904454 1960611439 1845582841
## [481] 1209180814 -1947774116 -2050310195 95379255 644242880 -426327394
## [487] -822846613 -769596691 816519002 1862626328 728136065 1095757619
## [493] -1479211692 997551970 -1171692553 1589005377 -576452410 -382519804
## [499] -960224843 289840799 1679904504 411783894 347195235 1730805541
## [505] 1940255938 -1125342800 -475489831 -974407861 -181396036 -2140443158
## [511] 1047845407 -1115967319 -1568368578 -510830292 -568192131 1453181383
## [517] 173034672 1424632526 1535483899 -10098595 236375946 1934720936
## [523] 170210577 1634020355 -863467772 688540466 664275335 1106448337
## [529] 921903446 -1187201228 571286085 -1917374833 -1603631096 -1513579482
## [535] -356100909 824817205 1435651474 266834272 -46217015 -815567877
## [541] 1139608524 -1225824678 -225547441 -1916296679 2123391022 652963388
## [547] 123012013 -2033603497 959875488 -1667386626 -1340014133 826255309
## [553] 1920338682 1626960184 1513413345 -1526452845 1876809140 -1808275774
## [559] 191235799 -84236511 188505830 504100836 2088470293 1822203711
## [565] 1157552856 -1982873866 -413132541 -1170863547 308546146 -355792432
## [571] 141113081 -865926101 -1463565412 1494703050 -2097711937 -1161980407
## [577] 2007768862 2033510092 1330988701 908383015 -247071408 1082213358
## [583] -555635557 -1292024067 -1262334486 1981963912 1264830513 -430222685
## [589] -664241500 -321295534 1703699111 560774897 -1173991882 -1381887276
## [595] -162011547 1937716335 -2078044504 -1643514 199547187 -963148395
## [601] 276433394 855402048 439551913 -1583273445 2099897324 -221740038
## [607] 125972527 -863231943 -1714916786 1221317276 2090825741 1162585207
## [613] 1178039424 -58400034 -650440277 -101315923 -2075209958 -499691560
## [619] -916644799 -1235268877 199691540 492263586 264517047 -384923263
## [625] -1311379578 1059699905
loocv2 <- glm(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium + sex + time,
              family = binomial, data = heartdata)

(out2 <- cv.glm(heartdata, glmfit = loocv2, K = nrow(heartdata))$delta[1])
## [1] 0.1271257
Model 2 has the lower LOOCV error rate and the better test accuracy. We conclude that age, ejection_fraction, serum_creatinine, serum_sodium, sex, and time can be used as predictor variables for DEATH_EVENT.
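Note that cv.glm's default cost is the average squared difference between the observed outcome and the predicted probability (the Brier score), not a misclassification rate. If a leave-one-out misclassification rate were wanted instead, a threshold cost function could be supplied via the cost argument; a small sketch of such a function:

```r
# A 0.5-threshold misclassification cost, suitable for cv.glm's `cost`
# argument (which receives the observed 0/1 responses and the fitted
# probabilities).
cost_misclass <- function(y, prob) mean(abs(y - prob) > 0.5)

# Toy check: two of these four predictions fall on the wrong side of 0.5.
y    <- c(0, 1, 1, 0)
prob <- c(0.2, 0.9, 0.4, 0.7)
cost_misclass(y, prob)   # 0.5
```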
We also tried other classification models, both to identify the best predictor and to verify the model-selection results.
lr <- glm(DEATH_EVENT ~ ., family = binomial, data = training)
predictedlr <- predict(lr, test, type = "response")
predictedlr <- ifelse(predictedlr >= 0.5, 1, 0)
(accuracylr <- (length(which(predictedlr == test$DEATH_EVENT)) / nrow(test)) * 100)
## [1] 81.33333
confusionMatrix(table(Predicted=predictedlr, Actual = test$DEATH_EVENT))
## Confusion Matrix and Statistics
##
## Actual
## Predicted 0 1
## 0 47 5
## 1 9 14
##
## Accuracy : 0.8133
## 95% CI : (0.7067, 0.894)
## No Information Rate : 0.7467
## P-Value [Acc > NIR] : 0.1139
##
## Kappa : 0.5387
##
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.8393
## Specificity : 0.7368
## Pos Pred Value : 0.9038
## Neg Pred Value : 0.6087
## Prevalence : 0.7467
## Detection Rate : 0.6267
## Detection Prevalence : 0.6933
## Balanced Accuracy : 0.7881
##
## 'Positive' Class : 0
##
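The sensitivity and specificity reported above can be recovered directly from the confusion matrix (caret treats class "0", survival, as the positive class here):

```r
# Recomputing sensitivity and specificity from the confusion matrix above,
# with "0" (survived) as the positive class.
cm <- matrix(c(47, 9, 5, 14), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # 47 / 56
specificity <- cm["1", "1"] / sum(cm[, "1"])   # 14 / 19
round(c(sensitivity, specificity), 4)          # 0.8393 0.7368
```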
Linear Discriminant Analysis (LDA) finds a linear combination of features that separates two or more classes. It can be used for classification or for dimensionality reduction; in this project we use it for classification.
ldamod <- lda(DEATH_EVENT ~ ., data = training)
predicted <- predict(ldamod, test)
(confu_lda <- table(Predicted = predicted$class, Actual = test$DEATH_EVENT))
## Actual
## Predicted 0 1
## 0 48 5
## 1 8 14
(accuracylda <- sum(100 * diag(confu_lda) / nrow(test)))
## [1] 82.66667
confusionMatrix(confu_lda)
## Confusion Matrix and Statistics
##
## Actual
## Predicted 0 1
## 0 48 5
## 1 8 14
##
## Accuracy : 0.8267
## 95% CI : (0.7219, 0.9043)
## No Information Rate : 0.7467
## P-Value [Acc > NIR] : 0.06804
##
## Kappa : 0.5645
##
## Mcnemar's Test P-Value : 0.57910
##
## Sensitivity : 0.8571
## Specificity : 0.7368
## Pos Pred Value : 0.9057
## Neg Pred Value : 0.6364
## Prevalence : 0.7467
## Detection Rate : 0.6400
## Detection Prevalence : 0.7067
## Balanced Accuracy : 0.7970
##
## 'Positive' Class : 0
##
ldamod1 = lda(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium + sex + time, data = training)
predicted = predict(ldamod1, test)
(confu_lda = table(Predicted = predict(ldamod1, test)$class, Actual = test$DEATH_EVENT))
## Actual
## Predicted 0 1
## 0 47 5
## 1 9 14
(accuracylda <- sum(100 * diag(confu_lda) / nrow(test)))
## [1] 81.33333
confusionMatrix(confu_lda)
## Confusion Matrix and Statistics
##
## Actual
## Predicted 0 1
## 0 47 5
## 1 9 14
##
## Accuracy : 0.8133
## 95% CI : (0.7067, 0.894)
## No Information Rate : 0.7467
## P-Value [Acc > NIR] : 0.1139
##
## Kappa : 0.5387
##
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.8393
## Specificity : 0.7368
## Pos Pred Value : 0.9038
## Neg Pred Value : 0.6087
## Prevalence : 0.7467
## Detection Rate : 0.6267
## Detection Prevalence : 0.6933
## Balanced Accuracy : 0.7881
##
## 'Positive' Class : 0
##
Quadratic Discriminant Analysis is similar to Linear Discriminant Analysis but uses a quadratic decision boundary for classification instead of a linear one, since it fits a separate covariance matrix for each class.
qda.fit <- qda(DEATH_EVENT ~ ., data = training)
qda.class <- predict(qda.fit, test)$class
(confu_qda = table(Predicted = qda.class, Actual = test$DEATH_EVENT))
## Actual
## Predicted 0 1
## 0 49 8
## 1 7 11
(accuracyqda <- sum(100 * diag(confu_qda) / nrow(test)))
## [1] 80
confusionMatrix(confu_qda)
## Confusion Matrix and Statistics
##
## Actual
## Predicted 0 1
## 0 49 8
## 1 7 11
##
## Accuracy : 0.8
## 95% CI : (0.6917, 0.8835)
## No Information Rate : 0.7467
## P-Value [Acc > NIR] : 0.1771
##
## Kappa : 0.462
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.8750
## Specificity : 0.5789
## Pos Pred Value : 0.8596
## Neg Pred Value : 0.6111
## Prevalence : 0.7467
## Detection Rate : 0.6533
## Detection Prevalence : 0.7600
## Balanced Accuracy : 0.7270
##
## 'Positive' Class : 0
##
qdamod1 = qda(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium + sex + time, data = training)
predicted = predict(qdamod1, test)
(confu_qda = table(Predicted = predict(qdamod1, test)$class, Actual = test$DEATH_EVENT))
## Actual
## Predicted 0 1
## 0 49 6
## 1 7 13
(accuracyqda <- sum(100 * diag(confu_qda) / nrow(test)))
## [1] 82.66667
confusionMatrix(confu_qda)
## Confusion Matrix and Statistics
##
## Actual
## Predicted 0 1
## 0 49 6
## 1 7 13
##
## Accuracy : 0.8267
## 95% CI : (0.7219, 0.9043)
## No Information Rate : 0.7467
## P-Value [Acc > NIR] : 0.06804
##
## Kappa : 0.5497
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8750
## Specificity : 0.6842
## Pos Pred Value : 0.8909
## Neg Pred Value : 0.6500
## Prevalence : 0.7467
## Detection Rate : 0.6533
## Detection Prevalence : 0.7333
## Balanced Accuracy : 0.7796
##
## 'Positive' Class : 0
##
With the six selected predictors, QDA performs better than LDA (82.67% vs. 81.33% test accuracy), although with all predictors included the ordering is reversed.
Decision trees are another algorithm used for classification. Tree-based methods can capture non-linear relationships, giving them better predictive power than linear supervised learning models when the data contain non-linear structure.
# Fitting the tree to the training dataset:
fit <- rpart(DEATH_EVENT ~ ., method = "class", data = training)
printcp(fit) # display the results
##
## Classification tree:
## rpart(formula = DEATH_EVENT ~ ., data = training, method = "class")
##
## Variables actually used in tree construction:
## [1] ejection_fraction platelets serum_creatinine time
##
## Root node error: 77/224 = 0.34375
##
## n= 224
##
## CP nsplit rel error xerror xstd
## 1 0.558442 0 1.00000 1.00000 0.092319
## 2 0.051948 1 0.44156 0.48052 0.072178
## 3 0.012987 3 0.33766 0.48052 0.072178
## 4 0.010000 4 0.32468 0.51948 0.074443
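In the `printcp` table, `rel error` and `xerror` are scaled relative to the root node error (77/224 = 0.34375), so the absolute misclassification rates are recovered by multiplying; for the bottom (4-split) row:

```r
root_error <- 77 / 224    # root node error reported by printcp
rel_error  <- 0.32468     # training error relative to the root, 4-split row
xerror     <- 0.51948     # cross-validated error relative to the root

abs_train <- rel_error * root_error   # absolute training error, ~0.112
abs_cv    <- xerror * root_error      # absolute CV error, ~0.179
```

The gap between the two suggests the full tree overfits slightly, which motivates the pruning step below.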
plot(fit, uniform=TRUE, main="Classification Tree")
text(fit, use.n=TRUE, all=TRUE, cex= 0.7)
# Accuracy Rate:
(conf.matrix = table(Actual = test$DEATH_EVENT, Predicted = predict(fit, test, type = "class")))
## Predicted
## Actual 0 1
## 0 44 12
## 1 4 15
sum(diag(conf.matrix))/nrow(test) * 100
## [1] 78.66667
confusionMatrix(conf.matrix)
## Confusion Matrix and Statistics
##
## Predicted
## Actual 0 1
## 0 44 12
## 1 4 15
##
## Accuracy : 0.7867
## 95% CI : (0.6768, 0.8729)
## No Information Rate : 0.64
## P-Value [Acc > NIR] : 0.004518
##
## Kappa : 0.505
##
## Mcnemar's Test P-Value : 0.080118
##
## Sensitivity : 0.9167
## Specificity : 0.5556
## Pos Pred Value : 0.7857
## Neg Pred Value : 0.7895
## Prevalence : 0.6400
## Detection Rate : 0.5867
## Detection Prevalence : 0.7467
## Balanced Accuracy : 0.7361
##
## 'Positive' Class : 0
##
# Pruning
fitp = prune(fit, cp = 0.019608)
printcp(fitp)
##
## Classification tree:
## rpart(formula = DEATH_EVENT ~ ., data = training, method = "class")
##
## Variables actually used in tree construction:
## [1] ejection_fraction serum_creatinine time
##
## Root node error: 77/224 = 0.34375
##
## n= 224
##
## CP nsplit rel error xerror xstd
## 1 0.558442 0 1.00000 1.00000 0.092319
## 2 0.051948 1 0.44156 0.48052 0.072178
## 3 0.019608 3 0.33766 0.48052 0.072178
plot(fitp) #plot smaller rpart object
text(fitp, use.n=TRUE, all=TRUE, cex=.8)
(conf.matrix.prune = table(test$DEATH_EVENT, predict(fitp, test, type = "class")))
##
## 0 1
## 0 42 14
## 1 4 15
sum(diag(conf.matrix.prune))/nrow(test) * 100
## [1] 76
confusionMatrix(conf.matrix.prune)
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 42 14
## 1 4 15
##
## Accuracy : 0.76
## 95% CI : (0.6475, 0.8511)
## No Information Rate : 0.6133
## P-Value [Acc > NIR] : 0.005283
##
## Kappa : 0.4596
##
## Mcnemar's Test P-Value : 0.033895
##
## Sensitivity : 0.9130
## Specificity : 0.5172
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.7895
## Prevalence : 0.6133
## Detection Rate : 0.5600
## Detection Prevalence : 0.7467
## Balanced Accuracy : 0.7151
##
## 'Positive' Class : 0
##
Random forests aggregate many bootstrapped decision trees and typically improve on a single tree. We first tune the number of variables tried at each split (mtry) with repeated 10-fold cross-validation via caret:
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = "random")
set.seed(4)
mtry <- sqrt(ncol(heartdata))  # rule-of-thumb value; not passed to train(), which searches randomly via tuneLength
rf_random <- train(DEATH_EVENT ~ ., data = heartdata, method = "rf", metric = 'Accuracy', tuneLength = 15, trControl = control)
print(rf_random)
## Random Forest
##
## 299 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 269, 269, 269, 268, 269, 270, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.7691954 0.3574794
## 2 0.8461884 0.6272339
## 3 0.8529675 0.6492393
## 4 0.8541169 0.6541063
## 5 0.8541144 0.6544194
## 6 0.8474478 0.6373658
## 7 0.8463367 0.6407240
## 8 0.8306612 0.6038030
## 10 0.8228068 0.5860266
## 12 0.8104672 0.5610446
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
plot(rf_random)
set.seed(4)
rf_model <- randomForest(x = training[, -13],
                         y = training$DEATH_EVENT,
                         ntree = 1000, mtry = 4)
y_pred = predict(rf_model, newdata = test[, -13])
table(y_pred, test$DEATH_EVENT)
##
## y_pred 0 1
## 0 48 4
## 1 8 15
sum(diag(table(y_pred,test$DEATH_EVENT)))/nrow(test) * 100
## [1] 84
confusionMatrix(table(y_pred,test$DEATH_EVENT))
## Confusion Matrix and Statistics
##
##
## y_pred 0 1
## 0 48 4
## 1 8 15
##
## Accuracy : 0.84
## 95% CI : (0.7372, 0.9145)
## No Information Rate : 0.7467
## P-Value [Acc > NIR] : 0.03754
##
## Kappa : 0.6046
##
## Mcnemar's Test P-Value : 0.38648
##
## Sensitivity : 0.8571
## Specificity : 0.7895
## Pos Pred Value : 0.9231
## Neg Pred Value : 0.6522
## Prevalence : 0.7467
## Detection Rate : 0.6400
## Detection Prevalence : 0.6933
## Balanced Accuracy : 0.8233
##
## 'Positive' Class : 0
##
# Splitting into training and testing for Model 2:
set.seed(4)
heartdata2 <- heartdata[, c("age", "ejection_fraction", "serum_creatinine", "serum_sodium", "sex", "time", "DEATH_EVENT")]
n = nrow(heartdata)
split <- sample(1:n, 0.75 * n, replace = FALSE)
training_model2 <- heartdata2[split, ]
test_model2 <- heartdata2[-split, ]
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = "random")
set.seed(4)
mtry <- floor(sqrt(ncol(heartdata2)))  # rule-of-thumb value; not passed to train(), which searches randomly via tuneLength
rf_random <- train(DEATH_EVENT ~ ., data = heartdata2, method = "rf", metric = 'Accuracy', tuneLength = 15, trControl = control)
print(rf_random)
## Random Forest
##
## 299 samples
## 6 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 269, 269, 269, 268, 269, 270, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.8540143 0.6458588
## 2 0.8450748 0.6386223
## 3 0.8394401 0.6255142
## 4 0.8406254 0.6299524
## 5 0.8284748 0.6003216
## 6 0.8285465 0.5998256
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
plot(rf_random)
set.seed(4)
rf_model <- randomForest(x = training_model2[, -7],
                         y = training_model2$DEATH_EVENT,
                         ntree = 1000, mtry = 1)
y_pred = predict(rf_model, newdata = test_model2[, -7])
table(y_pred, test_model2$DEATH_EVENT)
##
## y_pred 0 1
## 0 52 5
## 1 4 14
sum(diag(table(y_pred,test_model2$DEATH_EVENT)))/nrow(test_model2) * 100
## [1] 88
confusionMatrix(table(y_pred,test_model2$DEATH_EVENT))
## Confusion Matrix and Statistics
##
##
## y_pred 0 1
## 0 52 5
## 1 4 14
##
## Accuracy : 0.88
## 95% CI : (0.7844, 0.9436)
## No Information Rate : 0.7467
## P-Value [Acc > NIR] : 0.003624
##
## Kappa : 0.6772
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.9286
## Specificity : 0.7368
## Pos Pred Value : 0.9123
## Neg Pred Value : 0.7778
## Prevalence : 0.7467
## Detection Rate : 0.6933
## Detection Prevalence : 0.7600
## Balanced Accuracy : 0.8327
##
## 'Positive' Class : 0
##
K-Nearest Neighbours (KNN) classifies each test observation by the majority class among its k closest training points.
set.seed(4)
knn_mod <- knn(training, test, cl = training$DEATH_EVENT, k = 6)  # note: every column of training/test, including DEATH_EVENT, enters the distance here
table(knn_mod, test$DEATH_EVENT)
##
## knn_mod 0 1
## 0 44 10
## 1 12 9
sum(diag(table(knn_mod,test$DEATH_EVENT)))/nrow(test) * 100
## [1] 70.66667
confusionMatrix(table(knn_mod,test$DEATH_EVENT))
## Confusion Matrix and Statistics
##
##
## knn_mod 0 1
## 0 44 10
## 1 12 9
##
## Accuracy : 0.7067
## 95% CI : (0.5902, 0.8062)
## No Information Rate : 0.7467
## P-Value [Acc > NIR] : 0.8245
##
## Kappa : 0.2507
##
## Mcnemar's Test P-Value : 0.8312
##
## Sensitivity : 0.7857
## Specificity : 0.4737
## Pos Pred Value : 0.8148
## Neg Pred Value : 0.4286
## Prevalence : 0.7467
## Detection Rate : 0.5867
## Detection Prevalence : 0.7200
## Balanced Accuracy : 0.6297
##
## 'Positive' Class : 0
##
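One caveat: `knn()` measures Euclidean distance on the raw columns, so a large-scale feature such as platelets (~10^5) dominates the distance while features like serum_creatinine barely contribute. Standardizing the predictors with the training-set means and standard deviations is a common remedy. A sketch on hypothetical toy stand-in frames (this was not applied in the run above, so the reported accuracies are unaffected):

```r
library(class)  # knn()

set.seed(4)
# Toy stand-ins for the training/test frames: two predictors on very different scales
tr <- data.frame(platelets = rnorm(40, 2.6e5, 9e4), age = rnorm(40, 60, 12))
te <- data.frame(platelets = rnorm(10, 2.6e5, 9e4), age = rnorm(10, 60, 12))
cl <- factor(rep(c(0, 1), each = 20))

# Standardize test data using the TRAINING means/sds, then classify
tr_s <- scale(tr)
te_s <- scale(te,
              center = attr(tr_s, "scaled:center"),
              scale  = attr(tr_s, "scaled:scale"))
knn_scaled <- knn(tr_s, te_s, cl = cl, k = 3)
```

Reusing the training centering constants on the test set avoids leaking test-set information into the preprocessing.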
set.seed(9)
knn_mod <- knn(training_model2, test_model2, cl = training_model2$DEATH_EVENT, k = 6)
table(knn_mod, test_model2$DEATH_EVENT)
##
## knn_mod 0 1
## 0 50 5
## 1 6 14
sum(diag(table(knn_mod,test_model2$DEATH_EVENT)))/nrow(test_model2) * 100
## [1] 85.33333
confusionMatrix(table(knn_mod, test_model2$DEATH_EVENT))
## Confusion Matrix and Statistics
##
##
## knn_mod 0 1
## 0 50 5
## 1 6 14
##
## Accuracy : 0.8533
## 95% CI : (0.7527, 0.9244)
## No Information Rate : 0.7467
## P-Value [Acc > NIR] : 0.01899
##
## Kappa : 0.6189
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.8929
## Specificity : 0.7368
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.7000
## Prevalence : 0.7467
## Detection Rate : 0.6667
## Detection Prevalence : 0.7333
## Balanced Accuracy : 0.8148
##
## 'Positive' Class : 0
##
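Collecting the test-set accuracy rates reported above makes the comparison explicit (values copied from the confusion-matrix outputs; the 84% logistic figure for the six-variable model is the one quoted in the conclusions):

```r
# Test-set accuracy (%) per method; "model 2" = the six AIC-selected predictors
acc <- c("Logistic (all)"  = 81.33, "Logistic (model 2)" = 84.00,
         "LDA (all)"       = 82.67, "LDA (model 2)"      = 81.33,
         "QDA (all)"       = 80.00, "QDA (model 2)"      = 82.67,
         "Tree (unpruned)" = 78.67, "Tree (pruned)"      = 76.00,
         "RF (all)"        = 84.00, "RF (model 2)"       = 88.00,
         "KNN (all)"       = 70.67, "KNN (model 2)"      = 85.33)

sort(acc, decreasing = TRUE)   # Random Forest on the model 2 variables leads at 88%
```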
After conducting model selection, the following conclusions can be made:
The most useful predictor variables, as suggested by the AIC criterion, are age, serum_sodium, sex, ejection_fraction, serum_creatinine, and time. Using these predictors improved the accuracy over using all the variables in most of the models (LDA was the one exception, dropping from 82.67% to 81.33%).
The overall accuracy rate for logistic regression with the given six variables is 84%, better than the 81.33% obtained when all the variables are included.
The best-performing methods on the Model 2 variables were Random Forest, KNN, and Logistic Regression.
The variables used by the tree-based method for classification were ejection_fraction, platelets, serum_creatinine, and time before pruning, and ejection_fraction, serum_creatinine, and time after pruning.
Having more observations would have increased our confidence in the accuracy rates and in the model selection.
A longer follow-up period would also have made the data more reliable.
Davide Chicco, Giuseppe Jurman: "Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone". BMC Medical Informatics and Decision Making 20, 16 (2020).