Assignment 7: Titanic Survival Prediction
This assignment is worth 80 points, four times the normal assignment weight. The goal of the project is to
predict which types of passengers were more likely to survive. The available features include Name, Age,
Gender, Fare Class, etc. A data dictionary is provided in the appendix. The data is partitioned into
(1) ProjectTrain.csv and (2) ProjectTest.csv. Use the training data to develop the models and report
performance results on the test dataset.
1) Develop Logistic Regression, LDA, QDA and KNN based survival prediction models using Pclass,
Sex, Age, SibSp, Parch, and Embarked as predictor variables. Note that some of these variables may
need to be cast as categorical (factors in R). Also, Age has many missing values. The missing values
may need to be imputed (e.g., with the mean) before this variable can be used. Try a few values of K
in KNN to determine a suitable value. Compare and interpret the True Positive (TP) and False Positive
(FP) results of the different models using test data. 40 Points
2) “Cabin” is sparsely populated. One approach to handling the missing data is to assign a special value,
“Not Available”, to all the missing values. For the Logistic Regression model, compare performance
with and without the Cabin feature using test data. 10 Points
3) Like linear regression, logistic regression (LR) has the advantage of interpretability. Research the
concepts of “Unadjusted Odds Ratio” and “Adjusted Odds Ratio”. Determine the adjusted odds ratio
for Sex, Pclass, and Embarked using LR. Interpret the results. 10 Points
4) The default threshold for assigning an observation to a class is 0.5. For the LR models, set the
threshold to 0.8, 0.5, and 0.2 in turn. Which threshold value do you think is appropriate for survival
prediction? Why? Justify your answer with respect to the misclassification rate on test data. 10 Points
5) Develop an ROC plot for the LDA model. 5 Points
6) Which features do you think are important for making the prediction? Why? Evaluate the performance
of the KNN model when only those important features are included. 5 Points
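The preprocessing and evaluation steps named in tasks 1, 3, and 4 can be sketched in base R as follows. This is only an illustrative outline, not a reference solution: the data is synthetic (made-up values generated by a hypothetical make_data helper, with only the column names taken from the assignment), and the helper names prep and evaluate are assumptions introduced for this sketch. The real analysis would read ProjectTrain.csv and ProjectTest.csv instead.

```r
# Synthetic stand-in for the project data (values are made up; only the
# column names follow the assignment).
set.seed(1)
make_data <- function(n) {
  d <- data.frame(
    Pclass   = sample(1:3, n, replace = TRUE),
    Sex      = sample(c("female", "male"), n, replace = TRUE),
    Age      = round(runif(n, 1, 70)),
    SibSp    = sample(0:3, n, replace = TRUE),
    Parch    = sample(0:2, n, replace = TRUE),
    Embarked = sample(c("C", "Q", "S"), n, replace = TRUE)
  )
  # Hypothetical survival mechanism, chosen only so the model has some signal.
  eta <- 1.5 - 0.6 * d$Pclass - 1.8 * (d$Sex == "male") - 0.01 * d$Age
  d$Survived <- rbinom(n, 1, plogis(eta))
  d$Age[sample(n, n %/% 5)] <- NA  # blank out about 20% of the ages
  d
}
train <- make_data(400)
test  <- make_data(200)

# Cast categorical predictors to factors and impute Age with the TRAIN mean,
# so that no information from the test set leaks into preprocessing.
prep <- function(d, age_mean) {
  d$Pclass   <- factor(d$Pclass, levels = 1:3)
  d$Sex      <- factor(d$Sex, levels = c("female", "male"))
  d$Embarked <- factor(d$Embarked, levels = c("C", "Q", "S"))
  d$Age[is.na(d$Age)] <- age_mean
  d
}
age_mean <- mean(train$Age, na.rm = TRUE)
train <- prep(train, age_mean)
test  <- prep(test, age_mean)

# Logistic regression on the task 1 predictors.
fit <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked,
           data = train, family = binomial)

# Adjusted odds ratios: exponentiated coefficients of the multivariable model.
odds_ratios <- exp(coef(fit))

# TP rate, FP rate and misclassification rate at a given threshold.
evaluate <- function(prob, truth, threshold) {
  pred <- as.integer(prob > threshold)
  c(threshold = threshold,
    TPR      = sum(pred == 1 & truth == 1) / sum(truth == 1),
    FPR      = sum(pred == 1 & truth == 0) / sum(truth == 0),
    misclass = mean(pred != truth))
}
prob    <- predict(fit, newdata = test, type = "response")
results <- t(sapply(c(0.8, 0.5, 0.2), evaluate,
                    prob = prob, truth = test$Survived))

print(round(odds_ratios, 3))
print(round(results, 3))
```

Lowering the threshold can only add predicted survivors, so the TP and FP rates both rise (or stay equal) as the threshold moves from 0.8 to 0.2; which trade-off is appropriate is the judgment that task 4 asks for.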
In the report, include the text of your R code.
Submit through link: eCampus -> Assignment 7
Deadline: March 18, 11:55 PM
Data Dictionary
Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
Age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
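The Age note in the data dictionary can be turned into simple flags when exploring the data. The sketch below uses made-up ages and two hypothetical flag names (is_infant, is_estimated); it assumes that any age of at least 1 ending in .5 is an estimate, which is one reading of the note rather than something the dictionary guarantees.

```r
# Made-up ages covering the cases in the Age note: fractional infant ages,
# exact ages, estimated ages (xx.5), and a missing value.
age <- c(0.42, 0.83, 22, 28.5, 40.5, NA)

# Infants: recorded age below 1 (fractional by definition).
is_infant <- !is.na(age) & age < 1

# Estimated ages: at least 1 and ending in .5 (assumption, see lead-in).
is_estimated <- !is.na(age) & age >= 1 & age %% 1 == 0.5

print(data.frame(age, is_infant, is_estimated))
```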