Data Loading
Q1. (10 points) Load the data from the source file and set up the target y and predictors X as expected
by scikit-learn.
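As a minimal sketch of this step, the snippet below reads a CSV into pandas and separates the target from the predictors. The inline CSV text, the target column name FLIGHT_STATUS, and the helper load_xy are assumptions made for illustration; the real file name and target column come from the provided data.

```python
import io

import pandas as pd

# Stand-in for the real source file. The rows and the target column name
# (FLIGHT_STATUS) are assumptions made for illustration only.
CSV_TEXT = """CRS_DEP_TIME,CARRIER,DEP_TIME,DEST,DISTANCE,FL_DATE,ORIGIN,Weather,DAY_WEEK,FLIGHT_STATUS
1455,OH,1455,JFK,184,2004-01-01,BWI,0,4,ontime
1640,DH,1640,JFK,213,2004-01-01,DCA,0,4,ontime
1245,DH,1245,LGA,229,2004-01-01,IAD,0,4,delayed
1715,DH,1709,LGA,229,2004-01-01,IAD,0,4,ontime
"""


def load_xy(source, target_col="FLIGHT_STATUS"):
    """Read a CSV and split it into predictors X (DataFrame) and target y (Series)."""
    df = pd.read_csv(source)
    y = df[target_col]
    X = df.drop(columns=[target_col])
    return X, y


# In the assignment, `source` would be the path to the provided data file.
X, y = load_xy(io.StringIO(CSV_TEXT))
print(X.shape, y.shape)
```

scikit-learn estimators accept X as a two-dimensional table (one row per flight) and y as a one-dimensional vector of labels, which is what the DataFrame and Series above provide.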
Train-Test Split
Q2. (10 points) Create an 80-20 train-test split of the data while maintaining the same target value
proportion in each of the training and testing partitions. Use the training partition for the
subsequent analysis, and use the testing partition only at the last step, for final validation of the
best model found.
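One way to keep the target proportion equal in both partitions is the stratify argument of train_test_split. The sketch below uses toy data (100 rows, 20% positive class, a single hypothetical feature) as an assumption; in the assignment, X and y would be the loaded flight data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 rows, 20% positive class.
rng = np.random.default_rng(0)
X = pd.DataFrame({"feature": rng.normal(size=100)})
y = pd.Series([1] * 20 + [0] * 80, name="delayed")

# stratify=y asks for the same class proportions in both partitions;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Share of the positive class in each partition.
print(y_train.mean(), y_test.mean())
```

The test partition is then set aside and touched only once, when the final model is evaluated.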
Optional Step: Create some charts for data exploration to gain an understanding of the data with
respect to the given prediction problem. Explain your observations along with each chart.
Data Preprocessing & Feature Selection/Engineering
Q3. (20 points) Set up a data preparation pipeline using scikit-learn to perform the following
preprocessing steps.
A. If there are missing values in the data, take appropriate measures.
B. Select (add/drop) the features as follows:
   a. The variable DEP_TIME (actual departure time) cannot be used for predicting new
      flights. Why? Briefly explain.
   b. Create a new categorical variable by binning the scheduled departure time
      (CRS_DEP_TIME) into 2-hour bins.
   c. Drop the original variable CRS_DEP_TIME from the data to be analyzed and keep the
      new categorical variable.
   d. Drop the variables DISTANCE and FL_DATE from the dataset. What would be the
      reasons to do so? Provide a possible explanation.
C. Handle the following categorical variables in the data using the one-hot encoding approach.
   How many new variables would you get as a result of this one-hot encoding? Explain.
   a. day of week
   b. carrier
   c. departure airport (origin)
   d. arrival airport (destination)
   e. scheduled (binned/categorical) departure time
D. Weather is coded as 1 if there was a weather delay. Would you need to use one-hot encoding
   for this variable? Why or why not? Explain and take the appropriate action.
Using the pipeline, create a prepared training dataset to be used for predictive modeling.
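The steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not the expected solution: the column names DAY_WEEK, CARRIER, ORIGIN and DEST, the new column DEP_BIN, the helper select_features, the toy rows, and the hhmm integer encoding of CRS_DEP_TIME (for example 1455 for 14:55) are all assumptions. Imputing with the most frequent value is only one possible measure for missing data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Column names other than DEP_TIME, CRS_DEP_TIME, DISTANCE, FL_DATE and
# Weather are assumptions; so is the hhmm integer encoding of CRS_DEP_TIME.
CATEGORICAL = ["DAY_WEEK", "CARRIER", "ORIGIN", "DEST", "DEP_BIN"]
DROPPED = ["DEP_TIME", "CRS_DEP_TIME", "DISTANCE", "FL_DATE"]


def select_features(df):
    """Bin CRS_DEP_TIME into 2-hour slots, then drop the unused columns."""
    out = df.copy()
    start = (out["CRS_DEP_TIME"] // 100 // 2) * 2  # 1455 -> hour 14 -> slot 14
    out["DEP_BIN"] = start.map(
        lambda h: f"{int(h):02d}-{int(h) + 2:02d}", na_action="ignore"
    )
    return out.drop(columns=DROPPED)


categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

prep = Pipeline([
    ("features", FunctionTransformer(select_features)),
    ("encode", ColumnTransformer(
        [
            ("cat", categorical, CATEGORICAL),
            # Weather is already 0/1, so this sketch imputes it but does
            # not one-hot encode it.
            ("weather", SimpleImputer(strategy="most_frequent"), ["Weather"]),
        ],
        sparse_threshold=0,  # always return a dense array
    )),
])

# Toy training rows (assumed values) with one missing Weather entry.
X_train = pd.DataFrame({
    "CRS_DEP_TIME": [1455, 1640, 630, 1715],
    "DEP_TIME": [1455, 1640, 641, 1709],
    "CARRIER": ["OH", "DH", "DH", "US"],
    "ORIGIN": ["BWI", "DCA", "IAD", "IAD"],
    "DEST": ["JFK", "JFK", "LGA", "LGA"],
    "DISTANCE": [184, 213, 229, 229],
    "FL_DATE": ["2004-01-01"] * 4,
    "DAY_WEEK": [4, 4, 5, 5],
    "Weather": [0, np.nan, 0, 1],
})

X_prepared = prep.fit_transform(X_train)
print(X_prepared.shape)
```

With this encoder configuration, each categorical variable contributes one column per distinct level seen in the training data, so the number of one-hot columns is the sum of the level counts. On the toy rows that is 2 + 3 + 3 + 2 + 3 = 13 columns, plus the single Weather column.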
MIST.6160: Advanced Data Mining
Copyright 2020 Prof. Amit V. Deokar. All rights reserved.
Model Training and Validation
Q5. (10 points) Select any two classification algorithms listed in Q6 below and demonstrate how to find
the “best” hyperparameters for each of these two models with grid search using 5-fold cross-validation
experimenting with 1-2 parameters in each case.
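A grid search with 5-fold cross-validation might look like the sketch below. The two algorithms (logistic regression and a decision tree), the parameter grids, and the synthetic data from make_classification are illustrative assumptions; the assignment's own list of algorithms and the prepared training data should be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared training data (an assumption).
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=0)

# The two algorithms and their grids are illustrative choices; each grid
# varies one or two hyperparameters.
searches = {
    "logreg": GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1, 10]},
        cv=5,
        scoring="accuracy",
    ),
    "tree": GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 10]},
        cv=5,
        scoring="accuracy",
    ),
}

for name, search in searches.items():
    search.fit(X_train, y_train)
    # best_params_ holds the winning combination; best_score_ is its mean
    # cross-validated score across the 5 folds.
    print(name, search.best_params_, round(search.best_score_, 3))
```

Accuracy is used as the scoring metric here only for simplicity; with an imbalanced target, another metric may be a better fit.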