1 Introduction

This data set was taken from the UCI Machine Learning Repository. It contains the medical records of 299 patients diagnosed with heart failure, each of whom had 13 clinical features recorded during a follow-up period. The data was collected by Davide Chicco and Giuseppe Jurman in 2020 for their research paper “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone”.

The purpose of this project is to identify which features are the most statistically significant predictors of the death event. We focus on model selection and compare several statistical prediction methods; our results are detailed below.

The main limitation of this data set is its small size: even after selecting the best model, 299 observations leave us prone to overfitting and to unstable accuracy estimates.

1.1 Variable Description

Thirteen (13) clinical features:

  • age: age of the patient (years)

  • anaemia: decrease of red blood cells or hemoglobin (boolean)

  • high blood pressure: if the patient has hypertension (boolean)

  • creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L). When muscle tissue gets damaged, CPK flows into the blood, so high CPK levels might indicate heart failure or injury

  • diabetes: if the patient has diabetes (boolean)

  • ejection fraction: percentage of blood leaving the left ventricle at each contraction

  • platelets: platelets in the blood (kiloplatelets/mL)

  • sex: woman (0) or man (1) (binary)

  • serum creatinine: level of serum creatinine in the blood (mg/dL). Creatinine is a waste product generated by creatine when muscle breaks down; doctors use serum creatinine levels to check kidney function.

  • serum sodium: level of serum sodium in the blood (mEq/L). An abnormally low level of sodium in the blood might be caused by heart failure.

  • smoking: if the patient smokes or not (boolean)

  • time: follow-up period (days)

  • [target] death event: if the patient deceased during the follow-up period (boolean)

2 Data Cleaning
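The code below assumes the data has already been read into a data frame called `heartdata`. A minimal sketch of that step (the CSV filename is an assumption, not taken from the original; the demonstration uses a tiny inline stand-in so it runs without the real file):

```r
# With the real file (hypothetical filename -- adjust to your download):
# heartdata <- read.csv("heart_failure_clinical_records_dataset.csv")

# Self-contained demonstration on a two-row stand-in:
tmp <- tempfile(fileext = ".csv")
writeLines(c("age,ejection_fraction,DEATH_EVENT",
             "75,20,1",
             "55,38,0"), tmp)
demo <- read.csv(tmp)
str(demo)  # 'data.frame': 2 obs. of 3 variables, all numeric
```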

head(heartdata)
##   age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1  75       0                      582        0                20
## 2  55       0                     7861        0                38
## 3  65       0                      146        0                20
## 4  50       1                      111        0                20
## 5  65       1                      160        1                20
## 6  90       1                       47        0                40
##   high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1                   1    265000              1.9          130   1       0    4
## 2                   0    263358              1.1          136   1       0    6
## 3                   0    162000              1.3          129   1       1    7
## 4                   0    210000              1.9          137   1       0    7
## 5                   0    327000              2.7          116   0       0    8
## 6                   1    204000              2.1          132   1       1    8
##   DEATH_EVENT
## 1           1
## 2           1
## 3           1
## 4           1
## 5           1
## 6           1
sum(is.na(heartdata))
## [1] 0
sum(duplicated(heartdata))
## [1] 0

The data set contains no null values and no duplicated observations.

str(heartdata)
## 'data.frame':    299 obs. of  13 variables:
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : int  0 0 0 1 1 1 1 1 0 1 ...
##  $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
##  $ diabetes                : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : int  1 0 0 0 0 1 0 0 0 1 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : int  1 1 1 1 0 1 1 1 0 1 ...
##  $ smoking                 : int  0 0 1 0 0 1 0 1 0 1 ...
##  $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : int  1 1 1 1 1 1 1 1 1 1 ...

The variables anaemia, smoking, sex, diabetes, high_blood_pressure, and DEATH_EVENT should be categorical. This correction is made in the next block.

# Correcting data types ----
categoricalcolumns <- c("anaemia", "smoking", "sex", "diabetes", "DEATH_EVENT", "high_blood_pressure")
for (i in categoricalcolumns) {
  heartdata[, i] <- as.factor(heartdata[, i])
}
str(heartdata) 
## 'data.frame':    299 obs. of  13 variables:
##  $ age                     : num  75 55 65 50 65 90 75 60 65 80 ...
##  $ anaemia                 : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 2 1 2 ...
##  $ creatinine_phosphokinase: int  582 7861 146 111 160 47 246 315 157 123 ...
##  $ diabetes                : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 1 ...
##  $ ejection_fraction       : int  20 38 20 20 20 40 15 60 65 35 ...
##  $ high_blood_pressure     : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 2 ...
##  $ platelets               : num  265000 263358 162000 210000 327000 ...
##  $ serum_creatinine        : num  1.9 1.1 1.3 1.9 2.7 2.1 1.2 1.1 1.5 9.4 ...
##  $ serum_sodium            : int  130 136 129 137 116 132 137 131 138 133 ...
##  $ sex                     : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 1 2 ...
##  $ smoking                 : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 2 1 2 ...
##  $ time                    : int  4 6 7 7 8 8 10 10 10 10 ...
##  $ DEATH_EVENT             : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

These variables are now stored as factors.

3 Descriptive Statistics

summary(heartdata)
##       age        anaemia creatinine_phosphokinase diabetes ejection_fraction
##  Min.   :40.00   0:170   Min.   :  23.0           0:174    Min.   :14.00    
##  1st Qu.:51.00   1:129   1st Qu.: 116.5           1:125    1st Qu.:30.00    
##  Median :60.00           Median : 250.0                    Median :38.00    
##  Mean   :60.83           Mean   : 581.8                    Mean   :38.08    
##  3rd Qu.:70.00           3rd Qu.: 582.0                    3rd Qu.:45.00    
##  Max.   :95.00           Max.   :7861.0                    Max.   :80.00    
##  high_blood_pressure   platelets      serum_creatinine  serum_sodium   sex    
##  0:194               Min.   : 25100   Min.   :0.500    Min.   :113.0   0:105  
##  1:105               1st Qu.:212500   1st Qu.:0.900    1st Qu.:134.0   1:194  
##                      Median :262000   Median :1.100    Median :137.0          
##                      Mean   :263358   Mean   :1.394    Mean   :136.6          
##                      3rd Qu.:303500   3rd Qu.:1.400    3rd Qu.:140.0          
##                      Max.   :850000   Max.   :9.400    Max.   :148.0          
##  smoking      time       DEATH_EVENT
##  0:203   Min.   :  4.0   0:203      
##  1: 96   1st Qu.: 73.0   1: 96      
##          Median :115.0              
##          Mean   :130.3              
##          3rd Qu.:203.0              
##          Max.   :285.0

4 Data Visualization

library(ggplot2)
ggplot(heartdata, aes(x=age)) +
geom_histogram(color="white", fill="blue")

This plot shows that the ages in the sample range from 40 to 95 years, with most participants around 60 years old. The age distribution appears approximately normal.

ggplot(heartdata, aes(x=anaemia)) +
geom_bar(color="white", fill="blue")

More people in the study did not have anaemia, though a substantial number did; both classes are well enough represented to proceed with the analysis.

ggplot(heartdata, aes(x=creatinine_phosphokinase)) +
geom_histogram(color="white", fill="blue")

The majority of people in the study had creatinine phosphokinase levels below 2,000 mcg/L, but values range up to nearly 8,000.

ggplot(heartdata, aes(x=diabetes)) +
geom_bar(color="white", fill="blue")

The majority of participants did not have diabetes, though a substantial number did; again, both classes are well represented.

ggplot(heartdata, aes(x=ejection_fraction)) +
geom_histogram(color="white", fill="blue")

The majority of patients were under 55% for their ejection fraction. Normal EF is in the range of 55% to 70%. As the percentage falls, it tells the doctor that the heart failure is getting worse. In general, if the EF falls below 30%, it’s relatively severe. A reading of 20% or below is very severe heart failure. EF above 75% is considered too high, and can be a problem as well.

Source: https://www.webmd.com/heart-disease/heart-failure/features/ejection-fraction
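These thresholds can be encoded directly with `cut()`; a sketch on a few illustrative EF values (the band labels and break placement are our own reading of the ranges above):

```r
ef <- c(15, 25, 38, 60, 80)   # illustrative ejection fractions (%)
bands <- cut(ef,
             breaks = c(-Inf, 20, 30, 55, 75, Inf),
             labels = c("very severe", "severe", "reduced", "normal", "too high"))
data.frame(ef, bands)
```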

ggplot(heartdata, aes(x=high_blood_pressure)) +
geom_bar(color="white", fill="blue")

Most patients did not have high blood pressure.

ggplot(heartdata, aes(x=platelets)) +
geom_histogram(color="white", fill="blue")

The platelet counts are approximately normally distributed. A normal platelet count ranges from 150,000 to 450,000 kiloplatelets per milliliter of blood.

ggplot(heartdata, aes(x=serum_creatinine)) +
geom_histogram(color="white", fill="blue")

The serum creatinine distribution is skewed to the right. The cutoff for poor kidney function is about 1.35 mg/dL for men and 1.04 mg/dL for women; many of the participants were above this range.

ggplot(heartdata, aes(x=serum_sodium)) +
geom_histogram(color="white", fill="blue")

A normal blood sodium level is between 135 and 145 milliequivalents per liter (mEq/L). Most patients were in the normal range.

Source: https://www.mayoclinic.org/diseases-conditions/hyponatremia/symptoms-causes/syc-20373711#

ggplot(heartdata, aes(x=sex)) +
geom_bar(color="white", fill="blue")

More patients in the study were male.

ggplot(heartdata, aes(x=smoking)) +
geom_bar(color="white", fill="blue")

More patients in the study were non-smokers than smokers.

ggplot(heartdata, aes(x=time)) +
geom_histogram(color="white", fill="blue")

The follow-up times show no clear distributional shape.

ggplot(heartdata, aes(x=DEATH_EVENT)) +
geom_bar(color="white", fill="blue")

More patients survived than died during the follow-up period.

5 Model Selection - Logistic Regression

5.1 Checking Assumptions for Logistic Regression

5.1.1 Assumption 1: Binary Outcome (DEATH_EVENT)

An important assumption of logistic regression is that the outcome is binary. DEATH_EVENT takes only the values 0 and 1, so this assumption is satisfied.

table(heartdata$DEATH_EVENT)
## 
##   0   1 
## 203  96
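Expressed as proportions, the table shows roughly a 2:1 class imbalance, which is worth keeping in mind when judging accuracy later:

```r
counts <- c("0" = 203, "1" = 96)   # survivors vs. deaths, from the table above
round(prop.table(counts), 3)
##     0     1
## 0.679 0.321
```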

5.1.2 Assumption 2: Linear Relationship between logit function and each continuous predictor

The logit of the outcome should be linearly related to each continuous predictor. The faceted scatterplots below, with loess smooths, suggest this assumption is reasonably satisfied.

# Fitting the full model (dplyr and tidyr are needed for the reshaping below)
library(dplyr)
library(tidyr)

model_test <- glm(DEATH_EVENT ~ ., family = binomial, 
              data = heartdata)
probabilities <- predict(model_test, type = "response")

data_check <- heartdata %>% select_if(is.numeric) 
predictors <- colnames(data_check)

# Bind the logit and tidying the data for plot
mydata <- data_check %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(mydata, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")

5.1.3 Assumption 3: Influential Value in predictors

plot(model_test, which = 4, id.n = 3) # Gives Influential Values based on observations.

library(broom)  # for augment()

model.data <- augment(model_test) %>% 
  mutate(index = 1:n()) 
model.data %>% top_n(3, .cooksd)
## # A tibble: 3 × 20
##   DEATH_EVENT   age anaemia creatinine_phosphokinase diabetes ejection_fraction
##   <fct>       <dbl> <fct>                      <int> <fct>                <int>
## 1 0              60 1                           1082 1                       45
## 2 1              54 1                            427 0                       70
## 3 0              65 0                             56 0                       25
## # … with 14 more variables: high_blood_pressure <fct>, platelets <dbl>,
## #   serum_creatinine <dbl>, serum_sodium <int>, sex <fct>, smoking <fct>,
## #   time <int>, .fitted <dbl>, .resid <dbl>, .std.resid <dbl>, .hat <dbl>,
## #   .sigma <dbl>, .cooksd <dbl>, index <int>
ggplot(model.data, aes(index, .std.resid)) + 
  geom_point(aes(color = DEATH_EVENT), alpha = .5) +
  theme_bw()

model.data %>% 
  filter(abs(.std.resid) > 3)
## # A tibble: 0 × 20
## # … with 20 variables: DEATH_EVENT <fct>, age <dbl>, anaemia <fct>,
## #   creatinine_phosphokinase <int>, diabetes <fct>, ejection_fraction <int>,
## #   high_blood_pressure <fct>, platelets <dbl>, serum_creatinine <dbl>,
## #   serum_sodium <int>, sex <fct>, smoking <fct>, time <int>, .fitted <dbl>,
## #   .resid <dbl>, .std.resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
## #   index <int>

No observations have an absolute standardized residual above 3, so no influential values were found.

5.1.4 Assumption 4: Multicollinearity

All the VIF values are less than 4, so the no-multicollinearity assumption holds: the predictors are not strongly correlated with one another.

car::vif(model_test)
##                      age                  anaemia creatinine_phosphokinase 
##                 1.104307                 1.114540                 1.085629 
##                 diabetes        ejection_fraction      high_blood_pressure 
##                 1.052375                 1.172842                 1.063014 
##                platelets         serum_creatinine             serum_sodium 
##                 1.045319                 1.102088                 1.070685 
##                      sex                  smoking                     time 
##                 1.380635                 1.284512                 1.151810
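Each VIF equals 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors; inverting the largest value above shows how little shared variance there is:

```r
# VIF_j = 1 / (1 - R^2_j). Inverting the largest VIF reported above (sex):
vif_sex <- 1.380635
r2_sex  <- 1 - 1 / vif_sex
round(r2_sex, 3)   # only ~28% of sex's variance is explained by the other predictors
## [1] 0.276
```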

5.2 Applying Logistic Regression Model

# Splitting into training and testing:
set.seed(4)
n = nrow(heartdata)
split <- sample(1:n, 0.75*n, replace = FALSE)
training <- heartdata[split,]
test <- heartdata[-split,]

# Applying Model:
model_logit <- glm(DEATH_EVENT ~., family = binomial(link = 'logit'), data = training)

# Summary:
summary(model_logit)
## 
## Call:
## glm(formula = DEATH_EVENT ~ ., family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3932  -0.6453  -0.2183   0.5192   2.4608  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               1.317e+01  6.368e+00   2.068   0.0387 *  
## age                       3.689e-02  1.801e-02   2.049   0.0405 *  
## anaemia1                  4.735e-02  4.137e-01   0.114   0.9089    
## creatinine_phosphokinase  1.246e-04  1.699e-04   0.733   0.4635    
## diabetes1                 7.387e-02  3.952e-01   0.187   0.8517    
## ejection_fraction        -8.919e-02  1.996e-02  -4.467 7.92e-06 ***
## high_blood_pressure1     -3.963e-01  4.289e-01  -0.924   0.3556    
## platelets                -6.989e-07  2.106e-06  -0.332   0.7400    
## serum_creatinine          3.883e-01  2.198e-01   1.767   0.0773 .  
## serum_sodium             -7.406e-02  4.376e-02  -1.692   0.0906 .  
## sex1                     -7.634e-01  4.864e-01  -1.569   0.1165    
## smoking1                 -1.019e-01  4.741e-01  -0.215   0.8299    
## time                     -2.212e-02  3.638e-03  -6.082 1.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 288.28  on 223  degrees of freedom
## Residual deviance: 170.23  on 211  degrees of freedom
## AIC: 196.23
## 
## Number of Fisher Scoring iterations: 6

The most statistically significant variables are age, ejection_fraction, and time; time has the lowest p-value, suggesting a strong association with the death event.
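Since logistic coefficients are on the log-odds scale, exponentiating them gives odds ratios (in general, `exp(coef(model_logit))`). Using the estimates reported above for these predictors:

```r
coefs <- c(age = 3.689e-02, ejection_fraction = -8.919e-02, time = -2.212e-02)
round(exp(coefs), 3)
##               age ejection_fraction              time
##             1.038             0.915             0.978
```

Each additional year of age multiplies the odds of death by about 1.04, while each extra percentage point of ejection fraction and each extra follow-up day reduce them.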

anova(model_logit, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: DEATH_EVENT
## 
## Terms added sequentially (first to last)
## 
## 
##                          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                       223     288.28              
## age                       1    8.997       222     279.29  0.002704 ** 
## anaemia                   1    0.072       221     279.22  0.788645    
## creatinine_phosphokinase  1    2.153       220     277.06  0.142248    
## diabetes                  1    0.402       219     276.66  0.526235    
## ejection_fraction         1   27.769       218     248.89 1.367e-07 ***
## high_blood_pressure       1    0.168       217     248.72  0.681951    
## platelets                 1    0.020       216     248.70  0.886580    
## serum_creatinine          1   16.120       215     232.58 5.946e-05 ***
## serum_sodium              1    2.027       214     230.56  0.154558    
## sex                       1    1.060       213     229.50  0.303144    
## smoking                   1    0.243       212     229.25  0.622015    
## time                      1   59.026       211     170.23 1.556e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The sequential ANOVA additionally flags serum_creatinine as significant, on top of the three variables identified from the coefficient table above.

5.3 Selecting a Model

5.3.1 BIC

library(bestglm)  # for best-subset selection; expects the response as the last column of Xy

bs1 <- bestglm(Xy = training, family = binomial, IC = "BIC")
summary(bs1)
## Fitting algorithm:  BIC-glm
## Best Model:
##             df deviance
## Null Model 220 181.6895
## Full Model 223 288.2842
## 
##  likelihood-ratio test - GLM
## 
## data:  H0: Null Model vs. H1: Best Fit BIC-glm
## X = 106.59, df = 3, p-value < 2.2e-16
bs1
## BIC
## BICq equivalent for q in (0.13864112844905, 0.639542967214341)
## Best Model:
##                      Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept)        3.56479321 0.854055815  4.173958 2.993528e-05
## ejection_fraction -0.07417862 0.017205430 -4.311349 1.622615e-05
## serum_creatinine   0.59316104 0.225618041  2.629050 8.562385e-03
## time              -0.02047303 0.003266243 -6.268065 3.655623e-10
bs1$Subsets
##     Intercept   age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 0        TRUE FALSE   FALSE                    FALSE    FALSE             FALSE
## 1        TRUE FALSE   FALSE                    FALSE    FALSE             FALSE
## 2        TRUE FALSE   FALSE                    FALSE    FALSE              TRUE
## 3*       TRUE FALSE   FALSE                    FALSE    FALSE              TRUE
## 4        TRUE  TRUE   FALSE                    FALSE    FALSE              TRUE
## 5        TRUE  TRUE   FALSE                    FALSE    FALSE              TRUE
## 6        TRUE  TRUE   FALSE                    FALSE    FALSE              TRUE
## 7        TRUE  TRUE   FALSE                    FALSE    FALSE              TRUE
## 8        TRUE  TRUE   FALSE                     TRUE    FALSE              TRUE
## 9        TRUE  TRUE   FALSE                     TRUE    FALSE              TRUE
## 10       TRUE  TRUE   FALSE                     TRUE    FALSE              TRUE
## 11       TRUE  TRUE   FALSE                     TRUE     TRUE              TRUE
## 12       TRUE  TRUE    TRUE                     TRUE     TRUE              TRUE
##     high_blood_pressure platelets serum_creatinine serum_sodium   sex smoking
## 0                 FALSE     FALSE            FALSE        FALSE FALSE   FALSE
## 1                 FALSE     FALSE            FALSE        FALSE FALSE   FALSE
## 2                 FALSE     FALSE            FALSE        FALSE FALSE   FALSE
## 3*                FALSE     FALSE             TRUE        FALSE FALSE   FALSE
## 4                 FALSE     FALSE             TRUE        FALSE FALSE   FALSE
## 5                 FALSE     FALSE             TRUE         TRUE FALSE   FALSE
## 6                 FALSE     FALSE             TRUE         TRUE  TRUE   FALSE
## 7                  TRUE     FALSE             TRUE         TRUE  TRUE   FALSE
## 8                  TRUE     FALSE             TRUE         TRUE  TRUE   FALSE
## 9                  TRUE      TRUE             TRUE         TRUE  TRUE   FALSE
## 10                 TRUE      TRUE             TRUE         TRUE  TRUE    TRUE
## 11                 TRUE      TRUE             TRUE         TRUE  TRUE    TRUE
## 12                 TRUE      TRUE             TRUE         TRUE  TRUE    TRUE
##      time logLikelihood      BIC
## 0   FALSE    -144.14211 288.2842
## 1    TRUE    -107.36937 220.1504
## 2    TRUE     -95.37721 201.5777
## 3*   TRUE     -90.84476 197.9245
## 4    TRUE     -88.71232 199.0712
## 5    TRUE     -87.23411 201.5264
## 6    TRUE     -85.95420 204.3783
## 7    TRUE     -85.50229 208.8861
## 8    TRUE     -85.22504 213.7432
## 9    TRUE     -85.16525 219.0353
## 10   TRUE     -85.13704 224.3905
## 11   TRUE     -85.12043 229.7690
## 12   TRUE     -85.11388 235.1675

The BIC criterion selected a three-variable model: ejection_fraction, serum_creatinine, and time.
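The BIC column can be reproduced from the log-likelihoods via BIC = -2 logL + k log(n), where bestglm counts k as the number of predictors (excluding the intercept) and n = 224 is the training-set size:

```r
n       <- 224          # nrow(training): floor(0.75 * 299)
loglik3 <- -90.84476    # best three-variable model, from the table above
bic3    <- -2 * loglik3 + 3 * log(n)
round(bic3, 4)
## [1] 197.9245
```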

5.3.2 AIC

bs2 <- bestglm(Xy = training, family = binomial, IC = "AIC")
summary(bs2)
## Fitting algorithm:  AIC-glm
## Best Model:
##             df deviance
## Null Model 217 171.9084
## Full Model 223 288.2842
## 
##  likelihood-ratio test - GLM
## 
## data:  H0: Null Model vs. H1: Best Fit AIC-glm
## X = 116.38, df = 6, p-value < 2.2e-16
bs2
## AIC
## BICq equivalent for q in (0.806264154642832, 0.904987365701856)
## Best Model:
##                      Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)       12.49988349 6.15599546  2.030522 4.230352e-02
## age                0.03339734 0.01729805  1.930699 5.352025e-02
## ejection_fraction -0.08740968 0.01939957 -4.505754 6.613780e-06
## serum_creatinine   0.45475163 0.20880107  2.177918 2.941213e-02
## serum_sodium      -0.07109988 0.04271560 -1.664495 9.601364e-02
## sex1              -0.70413868 0.44407036 -1.585647 1.128195e-01
## time              -0.02157386 0.00350475 -6.155608 7.479016e-10
bs2$Subsets
##     Intercept   age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 0        TRUE FALSE   FALSE                    FALSE    FALSE             FALSE
## 1        TRUE FALSE   FALSE                    FALSE    FALSE             FALSE
## 2        TRUE FALSE   FALSE                    FALSE    FALSE              TRUE
## 3        TRUE FALSE   FALSE                    FALSE    FALSE              TRUE
## 4        TRUE  TRUE   FALSE                    FALSE    FALSE              TRUE
## 5        TRUE  TRUE   FALSE                    FALSE    FALSE              TRUE
## 6*       TRUE  TRUE   FALSE                    FALSE    FALSE              TRUE
## 7        TRUE  TRUE   FALSE                    FALSE    FALSE              TRUE
## 8        TRUE  TRUE   FALSE                     TRUE    FALSE              TRUE
## 9        TRUE  TRUE   FALSE                     TRUE    FALSE              TRUE
## 10       TRUE  TRUE   FALSE                     TRUE    FALSE              TRUE
## 11       TRUE  TRUE   FALSE                     TRUE     TRUE              TRUE
## 12       TRUE  TRUE    TRUE                     TRUE     TRUE              TRUE
##     high_blood_pressure platelets serum_creatinine serum_sodium   sex smoking
## 0                 FALSE     FALSE            FALSE        FALSE FALSE   FALSE
## 1                 FALSE     FALSE            FALSE        FALSE FALSE   FALSE
## 2                 FALSE     FALSE            FALSE        FALSE FALSE   FALSE
## 3                 FALSE     FALSE             TRUE        FALSE FALSE   FALSE
## 4                 FALSE     FALSE             TRUE        FALSE FALSE   FALSE
## 5                 FALSE     FALSE             TRUE         TRUE FALSE   FALSE
## 6*                FALSE     FALSE             TRUE         TRUE  TRUE   FALSE
## 7                  TRUE     FALSE             TRUE         TRUE  TRUE   FALSE
## 8                  TRUE     FALSE             TRUE         TRUE  TRUE   FALSE
## 9                  TRUE      TRUE             TRUE         TRUE  TRUE   FALSE
## 10                 TRUE      TRUE             TRUE         TRUE  TRUE    TRUE
## 11                 TRUE      TRUE             TRUE         TRUE  TRUE    TRUE
## 12                 TRUE      TRUE             TRUE         TRUE  TRUE    TRUE
##      time logLikelihood      AIC
## 0   FALSE    -144.14211 288.2842
## 1    TRUE    -107.36937 216.7387
## 2    TRUE     -95.37721 194.7544
## 3    TRUE     -90.84476 187.6895
## 4    TRUE     -88.71232 185.4246
## 5    TRUE     -87.23411 184.4682
## 6*   TRUE     -85.95420 183.9084
## 7    TRUE     -85.50229 185.0046
## 8    TRUE     -85.22504 186.4501
## 9    TRUE     -85.16525 188.3305
## 10   TRUE     -85.13704 190.2741
## 11   TRUE     -85.12043 192.2409
## 12   TRUE     -85.11388 194.2278

The AIC criterion selected a six-variable model: age, ejection_fraction, serum_creatinine, serum_sodium, sex, and time. This is expected, since AIC penalizes additional parameters less heavily than BIC and therefore tends to keep more variables.
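As with BIC, the AIC column follows from AIC = -2 logL + 2k; for the selected six-variable model:

```r
loglik6 <- -85.95420    # best six-variable model, from the table above
aic6    <- -2 * loglik6 + 2 * 6
aic6
## [1] 183.9084
```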

We fit both candidate models and compared their accuracy on the held-out test set to determine the better one.

5.3.3 Model 1

#Variables taken from BIC

model1 <- glm(DEATH_EVENT ~ ejection_fraction + serum_creatinine + time, family = binomial,data = training)
summary(model1)
## 
## Call:
## glm(formula = DEATH_EVENT ~ ejection_fraction + serum_creatinine + 
##     time, family = binomial, data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1277  -0.6797  -0.2658   0.6305   2.5867  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        3.564793   0.854056   4.174 2.99e-05 ***
## ejection_fraction -0.074179   0.017205  -4.311 1.62e-05 ***
## serum_creatinine   0.593161   0.225618   2.629  0.00856 ** 
## time              -0.020473   0.003266  -6.268 3.66e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 288.28  on 223  degrees of freedom
## Residual deviance: 181.69  on 220  degrees of freedom
## AIC: 189.69
## 
## Number of Fisher Scoring iterations: 5

5.3.3.1 Accuracy Rate - Model 1

# Threshold the predicted probabilities at 0.5:
predicted1 <- predict(model1, test, type = "response")
predicted1 <- ifelse(predicted1 >= 0.5, 1, 0)
(accuracy <- mean(predicted1 == test$DEATH_EVENT) * 100)
## [1] 81.33333
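Accuracy alone hides which kinds of mistakes the model makes; a confusion matrix separates false positives from false negatives. A self-contained sketch on toy vectors (with the real data this would be `table(predicted1, test$DEATH_EVENT)`):

```r
predicted <- c(0, 0, 1, 1, 0, 1)   # toy thresholded predictions
actual    <- c(0, 1, 1, 0, 0, 1)   # toy observed outcomes
table(Predicted = predicted, Actual = actual)
mean(predicted == actual)          # toy accuracy: 4 of 6 correct
```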

5.3.4 Model 2

#Variables taken from AIC

model2 <- glm(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium+ sex + time, family = binomial, 
              data = training)
summary(model2)
## 
## Call:
## glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + 
##     serum_sodium + sex + time, family = binomial, data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4657  -0.6626  -0.2309   0.5266   2.5327  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       12.499883   6.155995   2.031   0.0423 *  
## age                0.033397   0.017298   1.931   0.0535 .  
## ejection_fraction -0.087410   0.019400  -4.506 6.61e-06 ***
## serum_creatinine   0.454752   0.208801   2.178   0.0294 *  
## serum_sodium      -0.071100   0.042716  -1.664   0.0960 .  
## sex1              -0.704139   0.444070  -1.586   0.1128    
## time              -0.021574   0.003505  -6.156 7.48e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 288.28  on 223  degrees of freedom
## Residual deviance: 171.91  on 217  degrees of freedom
## AIC: 185.91
## 
## Number of Fisher Scoring iterations: 5

5.3.4.1 Accuracy Rate - Model 2

# Threshold the predicted probabilities at 0.5:
predicted2 <- predict(model2, test, type = "response")
predicted2 <- ifelse(predicted2 >= 0.5, 1, 0)
(accuracy2 <- mean(predicted2 == test$DEATH_EVENT) * 100)
## [1] 84

The accuracy of the AIC-selected model (84%) is slightly better than that of the BIC-selected model (81.3%). We ran leave-one-out cross-validation (LOOCV) to verify the results.

5.3.5 Model Performance Metrics

5.3.5.1 Leave-One-Out Cross-Validation - Model 1

library(boot)  # for cv.glm()

loocv1 <- glm(DEATH_EVENT ~ ejection_fraction + serum_creatinine + time, family = binomial, heartdata)

(out1 <- cv.glm(heartdata, glmfit = loocv1, K = nrow(heartdata)))
## $call
## cv.glm(data = heartdata, glmfit = loocv1, K = nrow(heartdata))
## 
## $K
## [1] 299
## 
## $delta
## [1] 0.1305802 0.1305721
## [391] -1085847020  1486466242  -773704096 -1135835780  1176134864 -1819167710
## [397]  -781891416    10263948  1912906172   262277170  1376211776  -572514636
## [403]   866554184 -1520815430  1962573584   876282092  1383741524   290703122
## [409] -1577197152 -1060739124  2123177824  -518345534  -850735576 -1738184276
## [415]  -588635332    41968498  1908576048 -1066827740   787496504 -2060086566
## [421]  1011144592    17495292   514018708  1064403778   972343808  1201506364
## [427]  -850148784 -1405009758 -1299135672  1046166028   521476828 -1460948910
## [433]  1749388416 -1758181548   981581800  1223963386  -869513392  -868359124
## [439]  -841070220   258125458   250602080  1692709004  2053329248   689281250
## [445] -1743426392   496621260   150329852   986483634 -1740553232  1542554052
## [451]  -222282408   476049786  -175869520  -494392900 -1123090860   571611586
## [457] -1321680598 -1162828344 -1909499151  -895290269  -357167004  -666756206
## [463] -2099703705 -1728139215 -1210220170 -1396371820  -907590491 -2059036241
## [469]   607224424   770398790  -671588365 -1578795179  2038833458  -172079872
## [475]   191419881 -1156939685 -1747307860  -785904454  1960611439  1845582841
## [481]  1209180814 -1947774116 -2050310195    95379255   644242880  -426327394
## [487]  -822846613  -769596691   816519002  1862626328   728136065  1095757619
## [493] -1479211692   997551970 -1171692553  1589005377  -576452410  -382519804
## [499]  -960224843   289840799  1679904504   411783894   347195235  1730805541
## [505]  1940255938 -1125342800  -475489831  -974407861  -181396036 -2140443158
## [511]  1047845407 -1115967319 -1568368578  -510830292  -568192131  1453181383
## [517]   173034672  1424632526  1535483899   -10098595   236375946  1934720936
## [523]   170210577  1634020355  -863467772   688540466   664275335  1106448337
## [529]   921903446 -1187201228   571286085 -1917374833 -1603631096 -1513579482
## [535]  -356100909   824817205  1435651474   266834272   -46217015  -815567877
## [541]  1139608524 -1225824678  -225547441 -1916296679  2123391022   652963388
## [547]   123012013 -2033603497   959875488 -1667386626 -1340014133   826255309
## [553]  1920338682  1626960184  1513413345 -1526452845  1876809140 -1808275774
## [559]   191235799   -84236511   188505830   504100836  2088470293  1822203711
## [565]  1157552856 -1982873866  -413132541 -1170863547   308546146  -355792432
## [571]   141113081  -865926101 -1463565412  1494703050 -2097711937 -1161980407
## [577]  2007768862  2033510092  1330988701   908383015  -247071408  1082213358
## [583]  -555635557 -1292024067 -1262334486  1981963912  1264830513  -430222685
## [589]  -664241500  -321295534  1703699111   560774897 -1173991882 -1381887276
## [595]  -162011547  1937716335 -2078044504    -1643514   199547187  -963148395
## [601]   276433394   855402048   439551913 -1583273445  2099897324  -221740038
## [607]   125972527  -863231943 -1714916786  1221317276  2090825741  1162585207
## [613]  1178039424   -58400034  -650440277  -101315923 -2075209958  -499691560
## [619]  -916644799 -1235268877   199691540   492263586   264517047  -384923263
## [625] -1311379578  1059699905

5.3.5.2 Leave One Out Cross-Validation Model 2

loocv2 <- glm(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium + sex + time, family = binomial, heartdata)

(out2 <- cv.glm(heartdata, glmfit = loocv2, K = nrow(heartdata))$delta[1])
## [1] 0.1271257

Model 2 has the lower LOOCV error rate, and therefore better estimated accuracy, than Model 1. We can conclude from this that age, ejection_fraction, serum_creatinine, serum_sodium, sex, and time could be used as predictor variables for DEATH_EVENT.
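Conceptually, cv.glm() with K equal to the number of rows performs a leave-one-out loop: refit the model n times, each time holding out one observation and scoring the prediction for it. A minimal hand-written sketch on synthetic data (not the heart-failure data) is:

```r
# Leave-one-out CV for a logistic model, written out by hand on toy data.
# cv.glm()'s default cost for this setup is the average squared difference
# between the held-out response and the predicted probability.
set.seed(1)
toy <- data.frame(x = rnorm(40))
toy$y <- rbinom(40, 1, plogis(toy$x))

loocv_err <- mean(sapply(seq_len(nrow(toy)), function(i) {
  fit <- glm(y ~ x, family = binomial, data = toy[-i, ])     # fit without row i
  p <- predict(fit, newdata = toy[i, , drop = FALSE], type = "response")
  (toy$y[i] - p)^2                                           # held-out squared error
}))
loocv_err
```

The value plays the same role as the delta[1] component reported above: a lower value means the model generalizes better to unseen observations.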

6 Other Models:

We also tried several other models for prediction, in order to identify the best-performing one and to verify the results of model selection.

6.1 Logistic Regression

6.1.1 Using all variables:

lr <- glm(DEATH_EVENT ~ ., family = binomial, training)
predictedlr <-predict(lr, test, type="response")

# Threshold the predicted probabilities at 0.5
predictedlr <- ifelse(predictedlr >= 0.5, 1, 0)
(accuracylr <- (length(which(predictedlr==test$DEATH_EVENT)) / nrow(test)) * 100)
## [1] 81.33333
confusionMatrix(table(Predicted=predictedlr, Actual = test$DEATH_EVENT))
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted  0  1
##         0 47  5
##         1  9 14
##                                          
##                Accuracy : 0.8133         
##                  95% CI : (0.7067, 0.894)
##     No Information Rate : 0.7467         
##     P-Value [Acc > NIR] : 0.1139         
##                                          
##                   Kappa : 0.5387         
##                                          
##  Mcnemar's Test P-Value : 0.4227         
##                                          
##             Sensitivity : 0.8393         
##             Specificity : 0.7368         
##          Pos Pred Value : 0.9038         
##          Neg Pred Value : 0.6087         
##              Prevalence : 0.7467         
##          Detection Rate : 0.6267         
##    Detection Prevalence : 0.6933         
##       Balanced Accuracy : 0.7881         
##                                          
##        'Positive' Class : 0              
## 
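As a sanity check, the headline statistics in the confusionMatrix() output above can be reproduced directly from the 2x2 table, remembering that 0 is treated as the 'positive' class:

```r
# Confusion matrix from the logistic regression above (rows = Predicted, cols = Actual)
cm <- matrix(c(47, 9, 5, 14), nrow = 2,
             dimnames = list(Predicted = c(0, 1), Actual = c(0, 1)))
accuracy    <- sum(diag(cm)) / sum(cm)   # (47 + 14) / 75 = 0.8133
sensitivity <- cm[1, 1] / sum(cm[, 1])   # 47 / 56 = 0.8393 (true 0s predicted 0)
specificity <- cm[2, 2] / sum(cm[, 2])   # 14 / 19 = 0.7368 (true 1s predicted 1)
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

The same arithmetic applies to every confusion matrix reported in the sections below.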

6.2 Linear Discriminant Analysis

Linear Discriminant Analysis is a method that finds a linear combination of features that characterizes or separates two or more classes of objects or events. It can be used for linear classification or for dimensionality reduction; in this project we use it for classification.
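As a minimal illustration on synthetic data (not the heart-failure data), lda() from the MASS package separates two Gaussian classes with a linear boundary:

```r
library(MASS)

# Two well-separated Gaussian classes in two dimensions
set.seed(2)
toy <- data.frame(
  x1 = c(rnorm(50, 0), rnorm(50, 2)),
  x2 = c(rnorm(50, 0), rnorm(50, 2)),
  class = factor(rep(c(0, 1), each = 50))
)

fit  <- lda(class ~ x1 + x2, data = toy)
pred <- predict(fit, toy)$class   # predicted class labels
mean(pred == toy$class)           # training accuracy
```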

6.2.1 Using all variables:

ldamod = lda(DEATH_EVENT ~ ., data= training)

predicted = predict(ldamod, test)
(confu_lda = table(Predicted=predict(ldamod, test)$class, Actual = test$DEATH_EVENT))
##          Actual
## Predicted  0  1
##         0 48  5
##         1  8 14
(accuracylda <- sum(100 * diag(confu_lda)/ nrow(test)))
## [1] 82.66667
confusionMatrix(confu_lda)
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted  0  1
##         0 48  5
##         1  8 14
##                                           
##                Accuracy : 0.8267          
##                  95% CI : (0.7219, 0.9043)
##     No Information Rate : 0.7467          
##     P-Value [Acc > NIR] : 0.06804         
##                                           
##                   Kappa : 0.5645          
##                                           
##  Mcnemar's Test P-Value : 0.57910         
##                                           
##             Sensitivity : 0.8571          
##             Specificity : 0.7368          
##          Pos Pred Value : 0.9057          
##          Neg Pred Value : 0.6364          
##              Prevalence : 0.7467          
##          Detection Rate : 0.6400          
##    Detection Prevalence : 0.7067          
##       Balanced Accuracy : 0.7970          
##                                           
##        'Positive' Class : 0               
## 

6.2.2 Using Model 2 variables:

ldamod1 = lda(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium+ sex + time, data= training)

predicted = predict(ldamod1, test)
(confu_lda = table(Predicted=predict(ldamod1, test)$class, Actual = test$DEATH_EVENT))
##          Actual
## Predicted  0  1
##         0 47  5
##         1  9 14
(accuracylda <- sum(100 * diag(confu_lda)/ nrow(test)))
## [1] 81.33333
confusionMatrix(confu_lda)
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted  0  1
##         0 47  5
##         1  9 14
##                                          
##                Accuracy : 0.8133         
##                  95% CI : (0.7067, 0.894)
##     No Information Rate : 0.7467         
##     P-Value [Acc > NIR] : 0.1139         
##                                          
##                   Kappa : 0.5387         
##                                          
##  Mcnemar's Test P-Value : 0.4227         
##                                          
##             Sensitivity : 0.8393         
##             Specificity : 0.7368         
##          Pos Pred Value : 0.9038         
##          Neg Pred Value : 0.6087         
##              Prevalence : 0.7467         
##          Detection Rate : 0.6267         
##    Detection Prevalence : 0.6933         
##       Balanced Accuracy : 0.7881         
##                                          
##        'Positive' Class : 0              
## 

6.3 Quadratic Discriminant Analysis

Quadratic Discriminant Analysis is similar to Linear Discriminant Analysis but uses a quadratic decision boundary instead of a linear one, because each class is allowed its own covariance matrix.
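The practical difference shows up when the classes differ mainly in spread rather than location. A small synthetic sketch (not the heart-failure data), comparing qda() and lda() from MASS:

```r
library(MASS)

# Two classes with the same center but very different variances:
# a linear boundary cannot separate them, a quadratic one can.
set.seed(3)
toy <- data.frame(
  x1 = c(rnorm(100, 0, 1), rnorm(100, 0, 4)),
  x2 = c(rnorm(100, 0, 1), rnorm(100, 0, 4)),
  class = factor(rep(c(0, 1), each = 100))
)

qda_acc <- mean(predict(qda(class ~ x1 + x2, data = toy), toy)$class == toy$class)
lda_acc <- mean(predict(lda(class ~ x1 + x2, data = toy), toy)$class == toy$class)
c(qda = qda_acc, lda = lda_acc)
```

Here QDA's class-specific covariance estimates let it carve out the tight inner class, while LDA is stuck near chance.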

6.3.1 Using all variables:

qda.fit <- qda(DEATH_EVENT ~ ., data = training)
qda.class <- predict(qda.fit, test)$class
(confu_qda = table(Predicted = qda.class, Actual = test$DEATH_EVENT))
##          Actual
## Predicted  0  1
##         0 49  8
##         1  7 11
(accuracyqda <- sum(100 * diag(confu_qda)/ nrow(test)))
## [1] 80
confusionMatrix(confu_qda)
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted  0  1
##         0 49  8
##         1  7 11
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.6917, 0.8835)
##     No Information Rate : 0.7467          
##     P-Value [Acc > NIR] : 0.1771          
##                                           
##                   Kappa : 0.462           
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.8750          
##             Specificity : 0.5789          
##          Pos Pred Value : 0.8596          
##          Neg Pred Value : 0.6111          
##              Prevalence : 0.7467          
##          Detection Rate : 0.6533          
##    Detection Prevalence : 0.7600          
##       Balanced Accuracy : 0.7270          
##                                           
##        'Positive' Class : 0               
## 

6.3.2 Using Model 2 variables:

qdamod1 = qda(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + serum_sodium+ sex + time, data= training)

predicted = predict(qdamod1, test)
(confu_qda = table(Predicted=predict(qdamod1, test)$class, Actual = test$DEATH_EVENT))
##          Actual
## Predicted  0  1
##         0 49  6
##         1  7 13
(accuracyqda <- sum(100 * diag(confu_qda)/ nrow(test)))
## [1] 82.66667
confusionMatrix(confu_qda)
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted  0  1
##         0 49  6
##         1  7 13
##                                           
##                Accuracy : 0.8267          
##                  95% CI : (0.7219, 0.9043)
##     No Information Rate : 0.7467          
##     P-Value [Acc > NIR] : 0.06804         
##                                           
##                   Kappa : 0.5497          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8750          
##             Specificity : 0.6842          
##          Pos Pred Value : 0.8909          
##          Neg Pred Value : 0.6500          
##              Prevalence : 0.7467          
##          Detection Rate : 0.6533          
##    Detection Prevalence : 0.7333          
##       Balanced Accuracy : 0.7796          
##                                           
##        'Positive' Class : 0               
## 

With the Model 2 variables, QDA (82.67%) performs slightly better than LDA (81.33%).

6.4 Decision Tree

Decision trees are another algorithm used for classification. They can capture non-linear relationships, which gives them better predictive power than linear supervised-learning models when the relationships in the data are non-linear.
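To see why, consider a non-monotone pattern that no single linear boundary can capture but two axis-aligned splits handle easily. A self-contained sketch on synthetic data (not the heart-failure data), using rpart as in the rest of this section:

```r
library(rpart)

# Class 1 occupies a "band" in the middle of x: non-monotone in x,
# so a logistic regression (one linear term) cannot represent it.
set.seed(5)
toy <- data.frame(x = runif(300))
toy$y <- factor(as.integer(toy$x > 0.3 & toy$x < 0.7))

tree_acc <- mean(predict(rpart(y ~ x, data = toy, method = "class"),
                         toy, type = "class") == toy$y)

glm_fit <- glm(y ~ x, family = binomial, data = toy)
glm_acc <- mean(as.integer(predict(glm_fit, toy, type = "response") >= 0.5) ==
                (toy$y == "1"))
c(tree = tree_acc, glm = glm_acc)
```

The tree recovers the band with two splits on x, while the linear-in-x logistic model cannot do better than roughly the majority class.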

# Fitting the tree to training dataset:
fit <- rpart(DEATH_EVENT ~ ., method="class", data=training)
printcp(fit) # display the results 
## 
## Classification tree:
## rpart(formula = DEATH_EVENT ~ ., data = training, method = "class")
## 
## Variables actually used in tree construction:
## [1] ejection_fraction platelets         serum_creatinine  time             
## 
## Root node error: 77/224 = 0.34375
## 
## n= 224 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.558442      0   1.00000 1.00000 0.092319
## 2 0.051948      1   0.44156 0.48052 0.072178
## 3 0.012987      3   0.33766 0.48052 0.072178
## 4 0.010000      4   0.32468 0.51948 0.074443
plot(fit, uniform=TRUE, main="Classification Tree")
text(fit, use.n=TRUE, all=TRUE, cex= 0.7)

# Accuracy Rate:
(conf.matrix = table(Actual = test$DEATH_EVENT, Predicted = predict(fit,test,type="class")))
##       Predicted
## Actual  0  1
##      0 44 12
##      1  4 15
sum(diag(conf.matrix))/nrow(test) * 100
## [1] 78.66667
confusionMatrix(conf.matrix)
## Confusion Matrix and Statistics
## 
##       Predicted
## Actual  0  1
##      0 44 12
##      1  4 15
##                                           
##                Accuracy : 0.7867          
##                  95% CI : (0.6768, 0.8729)
##     No Information Rate : 0.64            
##     P-Value [Acc > NIR] : 0.004518        
##                                           
##                   Kappa : 0.505           
##                                           
##  Mcnemar's Test P-Value : 0.080118        
##                                           
##             Sensitivity : 0.9167          
##             Specificity : 0.5556          
##          Pos Pred Value : 0.7857          
##          Neg Pred Value : 0.7895          
##              Prevalence : 0.6400          
##          Detection Rate : 0.5867          
##    Detection Prevalence : 0.7467          
##       Balanced Accuracy : 0.7361          
##                                           
##        'Positive' Class : 0               
## 

6.4.1 Pruning:

# Pruning
fitp = prune(fit, cp = 0.019608)
printcp(fitp)
## 
## Classification tree:
## rpart(formula = DEATH_EVENT ~ ., data = training, method = "class")
## 
## Variables actually used in tree construction:
## [1] ejection_fraction serum_creatinine  time             
## 
## Root node error: 77/224 = 0.34375
## 
## n= 224 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.558442      0   1.00000 1.00000 0.092319
## 2 0.051948      1   0.44156 0.48052 0.072178
## 3 0.019608      3   0.33766 0.48052 0.072178
plot(fitp) #plot smaller rpart object
text(fitp, use.n=TRUE, all=TRUE, cex=.8)

(conf.matrix.prune = table(test$DEATH_EVENT, predict(fitp,test, type="class")))
##    
##      0  1
##   0 42 14
##   1  4 15
sum(diag(conf.matrix.prune))/nrow(test) * 100
## [1] 76
confusionMatrix(conf.matrix.prune)
## Confusion Matrix and Statistics
## 
##    
##      0  1
##   0 42 14
##   1  4 15
##                                           
##                Accuracy : 0.76            
##                  95% CI : (0.6475, 0.8511)
##     No Information Rate : 0.6133          
##     P-Value [Acc > NIR] : 0.005283        
##                                           
##                   Kappa : 0.4596          
##                                           
##  Mcnemar's Test P-Value : 0.033895        
##                                           
##             Sensitivity : 0.9130          
##             Specificity : 0.5172          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.7895          
##              Prevalence : 0.6133          
##          Detection Rate : 0.5600          
##    Detection Prevalence : 0.7467          
##       Balanced Accuracy : 0.7151          
##                                           
##        'Positive' Class : 0               
## 

6.5 Random Forest:

6.5.1 Using all variables:

6.5.1.1 Hyperparameter Tuning

control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(4)
mtry <- sqrt(ncol(heartdata)) # default value; not passed to train(), since search="random" samples mtry
rf_random <- train(DEATH_EVENT~., data=heartdata, method="rf", metric='Accuracy', tuneLength=15, trControl=control)
print(rf_random)
## Random Forest 
## 
## 299 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 269, 269, 269, 268, 269, 270, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.7691954  0.3574794
##    2    0.8461884  0.6272339
##    3    0.8529675  0.6492393
##    4    0.8541169  0.6541063
##    5    0.8541144  0.6544194
##    6    0.8474478  0.6373658
##    7    0.8463367  0.6407240
##    8    0.8306612  0.6038030
##   10    0.8228068  0.5860266
##   12    0.8104672  0.5610446
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
plot(rf_random)

6.5.1.2 Applying Model

set.seed(4)
rf_model <- randomForest(x = training[,-13], 
                         y = training$DEATH_EVENT,
                         ntree = 1000, mtry = 4)
y_pred = predict(rf_model, newdata = test[,-13])
table(y_pred,test$DEATH_EVENT)
##       
## y_pred  0  1
##      0 48  4
##      1  8 15
sum(diag(table(y_pred,test$DEATH_EVENT)))/nrow(test) * 100
## [1] 84
confusionMatrix(table(y_pred,test$DEATH_EVENT))
## Confusion Matrix and Statistics
## 
##       
## y_pred  0  1
##      0 48  4
##      1  8 15
##                                           
##                Accuracy : 0.84            
##                  95% CI : (0.7372, 0.9145)
##     No Information Rate : 0.7467          
##     P-Value [Acc > NIR] : 0.03754         
##                                           
##                   Kappa : 0.6046          
##                                           
##  Mcnemar's Test P-Value : 0.38648         
##                                           
##             Sensitivity : 0.8571          
##             Specificity : 0.7895          
##          Pos Pred Value : 0.9231          
##          Neg Pred Value : 0.6522          
##              Prevalence : 0.7467          
##          Detection Rate : 0.6400          
##    Detection Prevalence : 0.6933          
##       Balanced Accuracy : 0.8233          
##                                           
##        'Positive' Class : 0               
## 

6.5.2 Using only model 2 variables:

#Splitting into training and testing for Model 2:
set.seed(4)
heartdata2 <- heartdata[,c("age","ejection_fraction","serum_creatinine","serum_sodium","sex","time","DEATH_EVENT")]
n = nrow(heartdata)
split <- sample(1:n, 0.75*n, replace = FALSE)
training_model2 <- heartdata2[split,]
test_model2 <- heartdata2[-split,]

6.5.2.1 Hyperparameter Tuning

control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(4)
mtry <- floor(sqrt(ncol(heartdata2))) # default value; not passed to train(), since search="random" samples mtry
rf_random <- train(DEATH_EVENT~., data=heartdata2, method="rf", metric='Accuracy', tuneLength=15, trControl=control)
print(rf_random)
## Random Forest 
## 
## 299 samples
##   6 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 269, 269, 269, 268, 269, 270, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.8540143  0.6458588
##   2     0.8450748  0.6386223
##   3     0.8394401  0.6255142
##   4     0.8406254  0.6299524
##   5     0.8284748  0.6003216
##   6     0.8285465  0.5998256
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.
plot(rf_random)

6.5.2.2 Applying Model

set.seed(4)
rf_model <- randomForest(x = training_model2[,-7], 
                         y = training_model2$DEATH_EVENT,
                         ntree = 1000, mtry = 1)
y_pred = predict(rf_model, newdata = test_model2[,-7])
table(y_pred,test_model2$DEATH_EVENT)
##       
## y_pred  0  1
##      0 52  5
##      1  4 14
sum(diag(table(y_pred,test_model2$DEATH_EVENT)))/nrow(test_model2) * 100
## [1] 88
confusionMatrix(table(y_pred,test_model2$DEATH_EVENT))
## Confusion Matrix and Statistics
## 
##       
## y_pred  0  1
##      0 52  5
##      1  4 14
##                                           
##                Accuracy : 0.88            
##                  95% CI : (0.7844, 0.9436)
##     No Information Rate : 0.7467          
##     P-Value [Acc > NIR] : 0.003624        
##                                           
##                   Kappa : 0.6772          
##                                           
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.9286          
##             Specificity : 0.7368          
##          Pos Pred Value : 0.9123          
##          Neg Pred Value : 0.7778          
##              Prevalence : 0.7467          
##          Detection Rate : 0.6933          
##    Detection Prevalence : 0.7600          
##       Balanced Accuracy : 0.8327          
##                                           
##        'Positive' Class : 0               
## 

6.6 KNN Algorithm

6.6.1 Using all variables:

set.seed(4)
# Note: training and test still contain the DEATH_EVENT column itself, which
# knn() treats as a feature; dropping it (e.g. training[,-13]) would avoid label leakage.
knn_mod <- knn(training,test,cl=training$DEATH_EVENT,k=6)
table(knn_mod,test$DEATH_EVENT)
##        
## knn_mod  0  1
##       0 44 10
##       1 12  9
sum(diag(table(knn_mod,test$DEATH_EVENT)))/nrow(test) * 100
## [1] 70.66667
confusionMatrix(table(knn_mod,test$DEATH_EVENT))
## Confusion Matrix and Statistics
## 
##        
## knn_mod  0  1
##       0 44 10
##       1 12  9
##                                           
##                Accuracy : 0.7067          
##                  95% CI : (0.5902, 0.8062)
##     No Information Rate : 0.7467          
##     P-Value [Acc > NIR] : 0.8245          
##                                           
##                   Kappa : 0.2507          
##                                           
##  Mcnemar's Test P-Value : 0.8312          
##                                           
##             Sensitivity : 0.7857          
##             Specificity : 0.4737          
##          Pos Pred Value : 0.8148          
##          Neg Pred Value : 0.4286          
##              Prevalence : 0.7467          
##          Detection Rate : 0.5867          
##    Detection Prevalence : 0.7200          
##       Balanced Accuracy : 0.6297          
##                                           
##        'Positive' Class : 0               
## 

6.6.2 Using only variables from model 2:

set.seed(9)
knn_mod <- knn(training_model2,test_model2,cl=training_model2$DEATH_EVENT,k=6)
table(knn_mod,test_model2$DEATH_EVENT)
##        
## knn_mod  0  1
##       0 50  5
##       1  6 14
sum(diag(table(knn_mod,test_model2$DEATH_EVENT)))/nrow(test_model2) * 100
## [1] 85.33333
confusionMatrix(table(knn_mod,test_model2$DEATH_EVENT))
## Confusion Matrix and Statistics
## 
##        
## knn_mod  0  1
##       0 50  5
##       1  6 14
##                                           
##                Accuracy : 0.8533          
##                  95% CI : (0.7527, 0.9244)
##     No Information Rate : 0.7467          
##     P-Value [Acc > NIR] : 0.01899         
##                                           
##                   Kappa : 0.6189          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8929          
##             Specificity : 0.7368          
##          Pos Pred Value : 0.9091          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.7467          
##          Detection Rate : 0.6667          
##    Detection Prevalence : 0.7333          
##       Balanced Accuracy : 0.8148          
##                                           
##        'Positive' Class : 0               
## 

7 Conclusion

After conducting model selection, the following conclusions can be made:

  1. The most useful predictor variables, as suggested by the AIC criterion, are age, serum_sodium, sex, ejection_fraction, serum_creatinine, and time. Using these predictors improved accuracy over using all the variables in most of the models.

  2. The overall accuracy rate for logistic regression with the given six variables is 84%, which is better than the 81.33% obtained when all the variables are included.

  3. Considering the Model 2 variables, the best-performing models were Random Forest, KNN, and Logistic Regression.

  4. The variables used by the tree-based method for classification were ejection_fraction, platelets, serum_creatinine, and time before pruning, and ejection_fraction, serum_creatinine, and time after pruning.

8 Reflection

  1. Having more observations would have been better; it would have made us more confident about the accuracy rates and model selection.

  2. Having a longer follow-up period would have made the data more reliable.

9 Citations

Davide Chicco, Giuseppe Jurman: “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone”. BMC Medical Informatics and Decision Making 20, 16 (2020). Web Link