Problem Statement
An automobile company aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts.
They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:
- Which variables are significant in predicting the price of a car
- How well those variables describe the price of a car
Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the American market.
Aim of the project
We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the qualitative and quantitative characteristics of the car. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyverse)
library(highcharter)
library(readxl)
library(DT)
library(tm)
library(RColorBrewer)
library(Boruta)
library(rpart)
library(rattle)
library(caret)
library(scales)
library(bigmemory)
library(naniar)
library(stringr)
library(psych)
library(mlbench)
library(randomForest)
library(car)
Car Price Data
The dataset has been taken from Kaggle.
Reading the Car Price Dataset
car_price <- read.csv(file = 'D:\\Drive E\\MSBA UCin\\Course\\Spring Sem\\BANA 6043 - Stat Computing\\Project\\Car Price\\CarPrice_Assignment.csv', header = TRUE, check.names=FALSE)
Glimpse of the Data
glimpse(car_price)
## Rows: 205
## Columns: 26
## $ car_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
## $ symboling <int> 3, 3, 1, 2, 2, 2, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0,~
## $ CarName <chr> "alfa-romero giulia", "alfa-romero stelvio", "alfa-ro~
## $ fueltype <chr> "gas", "gas", "gas", "gas", "gas", "gas", "gas", "gas~
## $ aspiration <chr> "std", "std", "std", "std", "std", "std", "std", "std~
## $ doornumber <chr> "two", "two", "two", "four", "four", "two", "four", "~
## $ carbody <chr> "convertible", "convertible", "hatchback", "sedan", "~
## $ drivewheel <chr> "rwd", "rwd", "rwd", "fwd", "4wd", "fwd", "fwd", "fwd~
## $ enginelocation <chr> "front", "front", "front", "front", "front", "front",~
## $ wheelbase <dbl> 88.6, 88.6, 94.5, 99.8, 99.4, 99.8, 105.8, 105.8, 105~
## $ carlength <dbl> 168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 192.7, 192.~
## $ carwidth <dbl> 64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 71.4, 71.4, 71.4,~
## $ carheight <dbl> 48.8, 48.8, 52.4, 54.3, 54.3, 53.1, 55.7, 55.7, 55.9,~
## $ curbweight <int> 2548, 2548, 2823, 2337, 2824, 2507, 2844, 2954, 3086,~
## $ enginetype <chr> "dohc", "dohc", "ohcv", "ohc", "ohc", "ohc", "ohc", "~
## $ cylindernumber <chr> "four", "four", "six", "four", "five", "five", "five"~
## $ enginesize <int> 130, 130, 152, 109, 136, 136, 136, 136, 131, 131, 108~
## $ fuelsystem <chr> "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpfi~
## $ boreratio <dbl> 3.47, 3.47, 2.68, 3.19, 3.19, 3.19, 3.19, 3.19, 3.13,~
## $ stroke <dbl> 2.68, 2.68, 3.47, 3.40, 3.40, 3.40, 3.40, 3.40, 3.40,~
## $ compressionratio <dbl> 9.00, 9.00, 9.00, 10.00, 8.00, 8.50, 8.50, 8.50, 8.30~
## $ horsepower <int> 111, 111, 154, 102, 115, 110, 110, 110, 140, 160, 101~
## $ peakrpm <int> 5000, 5000, 5000, 5500, 5500, 5500, 5500, 5500, 5500,~
## $ citympg <int> 21, 21, 19, 24, 18, 19, 19, 19, 17, 16, 23, 23, 21, 2~
## $ highwaympg <int> 27, 27, 26, 30, 22, 25, 25, 25, 20, 22, 29, 29, 28, 2~
## $ price <dbl> 13495.00, 16500.00, 16500.00, 13950.00, 17450.00, 152~
The dataset was imported into RStudio and found to have 205 observations and 26 variables.
Checking for Null Values
summary(is.na(car_price))
## car_ID symboling CarName fueltype
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:205 FALSE:205 FALSE:205 FALSE:205
## aspiration doornumber carbody drivewheel
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:205 FALSE:205 FALSE:205 FALSE:205
## enginelocation wheelbase carlength carwidth
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:205 FALSE:205 FALSE:205 FALSE:205
## carheight curbweight enginetype cylindernumber
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:205 FALSE:205 FALSE:205 FALSE:205
## enginesize fuelsystem boreratio stroke
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:205 FALSE:205 FALSE:205 FALSE:205
## compressionratio horsepower peakrpm citympg
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:205 FALSE:205 FALSE:205 FALSE:205
## highwaympg price
## Mode :logical Mode :logical
## FALSE:205 FALSE:205
Analysis shows that there are no missing values in our dataset. Hence, we will retain all the variables in our study.
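A more compact way to run the same check is to count the missing values in each column. The sketch below is illustrative only: `count_missing` and `toy_cars` are hypothetical names, and the toy values are made up so that the snippet runs on its own. With the real data the helper would be called on `car_price`.

```r
# Count the missing values in each column of a data frame.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Made-up toy data for illustration only (not taken from the Kaggle file)
toy_cars <- data.frame(
  fueltype = c("gas", "diesel", NA),
  price    = c(10000, NA, 20000)
)

missing_counts <- count_missing(toy_cars)
print(missing_counts)
```

A column with a count of zero has no missing values, so a result of all zeros confirms that no imputation or row removal is needed.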
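#Extracting Car Company from Car Name
#Splitting on the first space only: the default separator would also split on hyphens (e.g. "alfa-romero"),
#and extra = "merge" keeps multi-word model names together
car_details <- car_price %>% separate(CarName, into = c("CarCompany", "CarName"), sep = " ", extra = "merge", fill = "right")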
car_details
#Checking for Unique values
unique(car_details$CarCompany)
Errors in Car Company names
maxda = mazda
nissan = Nissan
porcshce = porsche
toyouta = toyota
vokswagen = vw = volkswagen
#Correcting the typos
car_details$CarCompany <- gsub("maxda", "mazda", car_details$CarCompany)
car_details$CarCompany <- gsub("nissan", "Nissan", car_details$CarCompany)
car_details$CarCompany <- gsub("porcshce", "porsche", car_details$CarCompany)
car_details$CarCompany <- gsub("toyouta", "toyota", car_details$CarCompany)
car_details$CarCompany <- gsub("vokswagen", "volkswagen", car_details$CarCompany)
car_details$CarCompany <- gsub("vw", "volkswagen", car_details$CarCompany)
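Because `gsub` replaces a pattern wherever it occurs inside a string, an exact-match lookup can be a safer way to apply the same corrections. The following is a minimal sketch of that alternative, not the approach used in this report; `corrections` and `fix_company` are hypothetical names introduced for illustration.

```r
# Lookup table: misspelled name -> corrected name (same mapping as the gsub calls above)
corrections <- c(
  maxda     = "mazda",
  nissan    = "Nissan",
  porcshce  = "porsche",
  toyouta   = "toyota",
  vokswagen = "volkswagen",
  vw        = "volkswagen"
)

# Replace only the values that match a misspelling exactly; leave the rest untouched.
fix_company <- function(x) {
  hit <- x %in% names(corrections)
  x[hit] <- unname(corrections[x[hit]])
  x
}

fixed <- fix_company(c("maxda", "vw", "toyota", "audi"))
print(fixed)
```

With the real data this would be applied as `car_details$CarCompany <- fix_company(car_details$CarCompany)`.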
Toyota appears to be the company with the largest number of models, and Mercury the company with the fewest.
Jaguar, Buick and Porsche seem to have the highest average prices. Chevrolet and Dodge have the lowest average prices.
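The two observations above come from counting models and averaging prices per company. A minimal base R sketch of that summary is shown here; `company_summary` is a hypothetical helper and the toy data frame contains made-up values, so the numbers are not results from the car price data.

```r
# Number of models and mean price per company, sorted by mean price (highest first).
company_summary <- function(df) {
  n_models  <- tapply(df$price, df$CarCompany, length)
  avg_price <- tapply(df$price, df$CarCompany, mean)
  out <- data.frame(
    CarCompany = names(n_models),
    n_models   = as.integer(n_models),
    avg_price  = as.numeric(avg_price),
    row.names  = NULL
  )
  out[order(-out$avg_price), ]
}

# Made-up toy data for illustration only
toy <- data.frame(
  CarCompany = c("a", "a", "b"),
  price      = c(10000, 20000, 30000)
)

result <- company_summary(toy)
print(result)
```

On the real data the call would be `company_summary(car_details)`, assuming the `CarCompany` and `price` columns created earlier.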
Carwidth, carlength, curbweight, enginesize and horsepower seem to have a positive correlation with price. Carheight doesn’t show any significant trend with price. Citympg and highwaympg seem to have a significant negative correlation with price.
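These visual impressions can be backed by the Pearson correlation of each numeric variable with price. The sketch below assumes a data frame with a numeric `price` column; `price_correlations` is a hypothetical helper, and the toy values are made up purely to make the snippet runnable.

```r
# Pearson correlation of every numeric column with the response, sorted from high to low.
price_correlations <- function(df, response = "price") {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  predictors   <- setdiff(numeric_cols, response)
  cors <- sapply(predictors, function(v) cor(df[[v]], df[[response]]))
  sort(cors, decreasing = TRUE)
}

# Made-up toy data: 'size' rises with price, 'mpg' falls with price
toy <- data.frame(
  size  = c(1, 2, 3, 4, 5),
  mpg   = c(50, 40, 30, 20, 10),
  body  = c("a", "b", "a", "b", "a"),   # non-numeric columns are skipped
  price = c(10000, 20000, 30000, 40000, 50000)
)

cors <- price_correlations(toy)
print(cors)
```

Values close to +1 or -1 indicate a strong linear relationship with price, while values near 0 indicate little linear trend.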
Fuel type: Diesel cars are comparatively more expensive than gas cars
Door number: Cars with four doors are slightly more expensive than cars with two doors
Aspiration: Cars with turbo aspiration are more expensive
Car body: Hardtop and convertible cars are more expensive than other types of cars
Engine location: Cars with a rear engine location are far more expensive than cars with a front engine location
Drive wheel: Cars with RWD are more expensive than 4WD or FWD cars
Engine type: Cars with engine type DOHCV or OHCV are more expensive than others
Cylinder number: Cars with a cylinder count of five or more are more expensive than others
Fuel system: Cars with MPFI are the most expensive, whereas cars with a 1BBL or 2BBL fuel system are the cheapest
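#Converting the data type of categorical variables from character to factor
df = car_details %>% select(-c(1:4)) %>% mutate_if(is.character, as.factor)
#Dropping near-zero-variance predictors; the length check guards against the case where
#none are found, because df[-integer(0)] would drop every column
var_0 = nearZeroVar(df)
if (length(var_0) > 0) df = df[-var_0]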
#One hot encoding the categorical variables
dmy <- dummyVars(" ~ .", data = df)
df_enc <- data.frame(predict(dmy, newdata = df))
head(df_enc,5)
## fueltype.diesel fueltype.gas aspiration.std aspiration.turbo doornumber.four
## 1 0 1 1 0 0
## 2 0 1 1 0 0
## 3 0 1 1 0 0
## 4 0 1 1 0 1
## 5 0 1 1 0 1
## doornumber.two carbody.convertible carbody.hardtop carbody.hatchback
## 1 1 1 0 0
## 2 1 1 0 0
## 3 1 0 0 1
## 4 0 0 0 0
## 5 0 0 0 0
## carbody.sedan carbody.wagon drivewheel.4wd drivewheel.fwd drivewheel.rwd
## 1 0 0 0 0 1
## 2 0 0 0 0 1
## 3 0 0 0 0 1
## 4 1 0 0 1 0
## 5 1 0 1 0 0
## wheelbase carlength carwidth carheight curbweight enginetype.dohc
## 1 88.6 168.8 64.1 48.8 2548 1
## 2 88.6 168.8 64.1 48.8 2548 1
## 3 94.5 171.2 65.5 52.4 2823 0
## 4 99.8 176.6 66.2 54.3 2337 0
## 5 99.4 176.6 66.4 54.3 2824 0
## enginetype.dohcv enginetype.l enginetype.ohc enginetype.ohcf enginetype.ohcv
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 1
## 4 0 0 1 0 0
## 5 0 0 1 0 0
## enginetype.rotor cylindernumber.eight cylindernumber.five cylindernumber.four
## 1 0 0 0 1
## 2 0 0 0 1
## 3 0 0 0 0
## 4 0 0 0 1
## 5 0 0 1 0
## cylindernumber.six cylindernumber.three cylindernumber.twelve
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 0
## 5 0 0 0
## cylindernumber.two enginesize fuelsystem.1bbl fuelsystem.2bbl fuelsystem.4bbl
## 1 0 130 0 0 0
## 2 0 130 0 0 0
## 3 0 152 0 0 0
## 4 0 109 0 0 0
## 5 0 136 0 0 0
## fuelsystem.idi fuelsystem.mfi fuelsystem.mpfi fuelsystem.spdi fuelsystem.spfi
## 1 0 0 1 0 0
## 2 0 0 1 0 0
## 3 0 0 1 0 0
## 4 0 0 1 0 0
## 5 0 0 1 0 0
## boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
## 1 3.47 2.68 9 111 5000 21 27 13495
## 2 3.47 2.68 9 111 5000 21 27 16500
## 3 2.68 3.47 9 154 5000 19 26 16500
## 4 3.19 3.40 10 102 5500 24 30 13950
## 5 3.19 3.40 8 115 5500 18 22 17450
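The encoded data above contains one indicator column per factor level (for example drivewheel.4wd, drivewheel.fwd and drivewheel.rwd), so the indicators of each factor always sum to one. A linear model with an intercept can therefore use at most all but one of them per factor. The sketch below illustrates the difference using base R's `model.matrix` on a made-up toy factor; it is not the `dummyVars` call used in this report.

```r
# Made-up toy factor with three levels (illustration only)
toy <- data.frame(drivewheel = factor(c("fwd", "rwd", "4wd", "fwd")))

# Full one-hot encoding: removing the intercept keeps one 0/1 column per level
one_hot <- model.matrix(~ drivewheel - 1, data = toy)

# Reference-level encoding: the first level (4wd) is absorbed into the intercept
ref_coded <- model.matrix(~ drivewheel, data = toy)

print(colnames(one_hot))
print(colnames(ref_coded))
```

Using only a subset of the indicators of each factor in a model formula avoids this redundancy.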
#Using the Recursive Feature Elimination technique to select the most impactful predictors
control <- rfeControl(functions = rfFuncs, # random forest
method = "repeatedcv", # repeated cv
repeats = 5, # number of repeats
number = 10)
result_rfe1 <- rfe(x = df_enc[,-ncol(df_enc)],
y = df_enc$price,
sizes = c(1:20),
rfeControl = control)
# Print the results
result_rfe1
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 1 2790 0.8629 2030 451.5 0.08119 342.4
## 2 2514 0.8991 1806 485.5 0.04650 333.4
## 3 2254 0.9191 1615 481.0 0.04231 285.9
## 4 2187 0.9274 1516 539.6 0.04009 315.0
## 5 2149 0.9309 1476 595.4 0.03531 326.9
## 6 2101 0.9310 1454 518.2 0.03832 285.5
## 7 2079 0.9326 1428 540.5 0.03750 289.7
## 8 2059 0.9347 1415 538.9 0.03547 280.3
## 9 2050 0.9350 1421 522.4 0.03563 270.2
## 10 2044 0.9357 1412 521.5 0.03426 271.6
## 11 2028 0.9371 1402 543.2 0.03392 284.3
## 12 2022 0.9373 1398 537.3 0.03396 291.7
## 13 2012 0.9381 1393 535.1 0.03391 290.8
## 14 1995 0.9394 1380 546.4 0.03300 300.3
## 15 1992 0.9389 1383 538.3 0.03339 302.5
## 16 2000 0.9387 1388 523.5 0.03384 287.8
## 17 1990 0.9395 1381 520.9 0.03283 289.2
## 18 1990 0.9395 1387 521.8 0.03270 299.2
## 19 1990 0.9393 1386 519.9 0.03390 294.2
## 20 1973 0.9403 1377 519.4 0.03221 293.1 *
## 49 1976 0.9402 1391 504.6 0.03205 282.9
##
## The top 5 variables (out of 20):
## enginesize, curbweight, horsepower, wheelbase, carlength
# Print the selected features
predictors(result_rfe1)
## [1] "enginesize" "curbweight" "horsepower"
## [4] "wheelbase" "carlength" "carwidth"
## [7] "highwaympg" "citympg" "peakrpm"
## [10] "fuelsystem.mpfi" "stroke" "cylindernumber.four"
## [13] "carheight" "boreratio" "compressionratio"
## [16] "drivewheel.rwd" "carbody.hatchback" "fuelsystem.2bbl"
## [19] "aspiration.std" "drivewheel.fwd"
#Plotting the top 20 predictors identified using RFE technique
varimp_data <- data.frame(feature = row.names(varImp(result_rfe1))[1:20],
importance = varImp(result_rfe1)[1:20, 1])
varimp_data
## feature importance
## 1 enginesize 20.754652
## 2 curbweight 15.958237
## 3 horsepower 12.903263
## 4 wheelbase 11.409220
## 5 carlength 10.724714
## 6 carwidth 10.654571
## 7 highwaympg 10.563011
## 8 citympg 9.753462
## 9 peakrpm 7.676091
## 10 fuelsystem.mpfi 6.574579
## 11 stroke 6.483638
## 12 cylindernumber.four 6.139136
## 13 carheight 6.011663
## 14 boreratio 5.729216
## 15 compressionratio 5.568195
## 16 drivewheel.rwd 4.949179
## 17 carbody.hatchback 4.780230
## 18 carbody.convertible 4.633970
## 19 fuelsystem.2bbl 4.537069
## 20 carbody.sedan 4.501732
ggplot(data = varimp_data,
aes(x = reorder(feature, -importance), y = importance, fill = feature)) +
geom_bar(stat="identity") + labs(x = "Features", y = "Variable Importance") +
geom_text(aes(label = round(importance, 2)), vjust=1.6, color="white", size=4) +
theme_bw() + theme(legend.position = "none") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
We now have the top 20 predictors identified through the recursive feature elimination technique that explain most of the variance in the response variable.
Model 1
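Fitting a linear model using 20 predictors based on the RFE selection. Compared with the list printed by predictors(), this model uses aspiration.turbo in place of its complement aspiration.std, and enginetype.ohc in place of carbody.hatchback.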
#Fitting a linear model using all the top 20 predictors obtained from RFE
fit1 = lm(price ~
enginesize+
curbweight+
horsepower+
wheelbase +
carlength +
highwaympg+
carwidth +
citympg +
peakrpm +
stroke +
fuelsystem.mpfi +
cylindernumber.four +
carheight +
boreratio +
compressionratio +
drivewheel.rwd +
drivewheel.fwd +
aspiration.turbo +
fuelsystem.2bbl +
enginetype.ohc, data = df_enc)
summary(fit1)
##
## Call:
## lm(formula = price ~ enginesize + curbweight + horsepower + wheelbase +
## carlength + highwaympg + carwidth + citympg + peakrpm + stroke +
## fuelsystem.mpfi + cylindernumber.four + carheight + boreratio +
## compressionratio + drivewheel.rwd + drivewheel.fwd + aspiration.turbo +
## fuelsystem.2bbl + enginetype.ohc, data = df_enc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7686.5 -1119.8 -93.5 1140.8 11527.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.934e+04 1.478e+04 -3.338 0.001023 **
## enginesize 1.234e+02 1.626e+01 7.587 1.58e-12 ***
## curbweight 6.589e-01 1.829e+00 0.360 0.719028
## horsepower -1.479e+00 1.854e+01 -0.080 0.936477
## wheelbase 2.467e+01 9.828e+01 0.251 0.802101
## carlength -3.829e+01 5.453e+01 -0.702 0.483383
## highwaympg 2.496e+02 1.609e+02 1.552 0.122454
## carwidth 4.652e+02 2.404e+02 1.935 0.054536 .
## citympg -2.557e+02 1.716e+02 -1.490 0.137845
## peakrpm 2.667e+00 6.727e-01 3.964 0.000105 ***
## stroke -4.169e+03 8.875e+02 -4.697 5.14e-06 ***
## fuelsystem.mpfi 3.398e+00 8.326e+02 0.004 0.996748
## cylindernumber.four -4.440e+03 8.549e+02 -5.194 5.43e-07 ***
## carheight 2.151e+02 1.347e+02 1.597 0.112041
## boreratio 1.713e+03 1.324e+03 1.294 0.197333
## compressionratio 1.386e+02 9.735e+01 1.424 0.156095
## drivewheel.rwd 1.616e+03 1.222e+03 1.323 0.187515
## drivewheel.fwd -5.349e+02 1.226e+03 -0.436 0.663166
## aspiration.turbo 1.986e+03 8.675e+02 2.289 0.023197 *
## fuelsystem.2bbl -3.172e+01 8.435e+02 -0.038 0.970045
## enginetype.ohc 2.518e+03 6.458e+02 3.898 0.000136 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2878 on 184 degrees of freedom
## Multiple R-squared: 0.8829, Adjusted R-squared: 0.8702
## F-statistic: 69.4 on 20 and 184 DF, p-value: < 2.2e-16
summary(fit1)$sigma
## [1] 2877.935
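From the above result we can see that Model 1 (fit1) has a high adjusted R-squared of 87.02%. The overall F-test p-value is below 0.05, which establishes that the model as a whole is statistically significant, although many individual coefficients (for example curbweight, horsepower and wheelbase) are not significant at the 5% level.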
#Checking the Variance Inflation Factor (VIF) to check for multicollinearity among the predictors
sort(vif(fit1),decreasing = TRUE)
## citympg highwaympg curbweight horsepower
## 31.029971 30.230279 22.333410 13.233624
## enginesize carlength drivewheel.fwd wheelbase
## 11.295916 11.146046 9.032550 8.626071
## drivewheel.rwd carwidth fuelsystem.mpfi fuelsystem.2bbl
## 8.616762 6.551885 4.260030 3.843818
## compressionratio boreratio cylindernumber.four aspiration.turbo
## 3.682896 3.167896 3.148117 2.754874
## carheight peakrpm enginetype.ohc stroke
## 2.668219 2.535464 2.072398 1.907721
From the VIF values above we can see that six predictors (citympg, highwaympg, curbweight, horsepower, enginesize and carlength) have a VIF greater than 10, which indicates strong multicollinearity in the model. Based on the correlation plot from EDA section 4.3, some variables are highly correlated with each other, so we will now remove such redundant variables.
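For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand with base R. Since the car price CSV is not bundled here, it uses R's built-in `mtcars` data as a stand-in, and `vif_manual` is a hypothetical helper rather than the `car::vif` function used above.

```r
# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in example on the built-in mtcars data (not the car price data)
vifs <- vif_manual(mtcars, c("wt", "hp", "disp"))
print(round(sort(vifs, decreasing = TRUE), 2))
```

A VIF of 10 corresponds to an R² of 0.9, meaning that 90% of the variance of that predictor is explained by the other predictors.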
Model 2
Fitting a second linear model after eliminating predictors from Model 1 with high correlation and multicollinearity.
#Fitting a new linear model by eliminating predictors with high correlation and multicollinearity
fit2 = lm(price ~
enginesize+
highwaympg+
carwidth +
peakrpm +
stroke +
fuelsystem.mpfi +
cylindernumber.four +
carheight +
boreratio +
compressionratio +
drivewheel.rwd +
drivewheel.fwd +
aspiration.turbo +
fuelsystem.2bbl +
enginetype.ohc, data = df_enc)
summary(fit2)
##
## Call:
## lm(formula = price ~ enginesize + highwaympg + carwidth + peakrpm +
## stroke + fuelsystem.mpfi + cylindernumber.four + carheight +
## boreratio + compressionratio + drivewheel.rwd + drivewheel.fwd +
## aspiration.turbo + fuelsystem.2bbl + enginetype.ohc, data = df_enc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7580.8 -1088.3 38.8 1191.2 11367.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50812.509 13301.621 -3.820 0.000181 ***
## enginesize 123.084 9.846 12.501 < 2e-16 ***
## highwaympg 38.761 66.434 0.583 0.560291
## carwidth 455.386 176.955 2.573 0.010835 *
## peakrpm 2.730 0.573 4.764 3.77e-06 ***
## stroke -4131.680 859.041 -4.810 3.08e-06 ***
## fuelsystem.mpfi 64.719 775.197 0.083 0.933553
## cylindernumber.four -4699.975 792.792 -5.928 1.43e-08 ***
## carheight 178.890 99.831 1.792 0.074745 .
## boreratio 2107.794 1227.038 1.718 0.087473 .
## compressionratio 113.568 90.262 1.258 0.209872
## drivewheel.rwd 1690.436 1132.113 1.493 0.137061
## drivewheel.fwd -497.899 1127.398 -0.442 0.659258
## aspiration.turbo 2007.499 672.315 2.986 0.003201 **
## fuelsystem.2bbl -115.751 823.391 -0.141 0.888352
## enginetype.ohc 2625.564 609.949 4.305 2.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2862 on 189 degrees of freedom
## Multiple R-squared: 0.8811, Adjusted R-squared: 0.8716
## F-statistic: 93.35 on 15 and 189 DF, p-value: < 2.2e-16
summary(fit2)$sigma
## [1] 2862.18
From the above result we can see that Model 2 (fit2) has a high adjusted R-squared of 87.16%, which is slightly higher than Model 1 in spite of having fewer predictors. The overall F-test p-value is also below 0.05, establishing that the model as a whole is statistically significant.
#Checking VIF values for the new model
sort(vif(fit2),decreasing = TRUE)
## drivewheel.fwd drivewheel.rwd highwaympg enginesize
## 7.719823 7.482301 5.212094 4.186127
## fuelsystem.mpfi fuelsystem.2bbl carwidth compressionratio
## 3.733600 3.703595 3.588381 3.200929
## boreratio cylindernumber.four enginetype.ohc peakrpm
## 2.750362 2.737323 1.868860 1.860507
## stroke aspiration.turbo carheight
## 1.807206 1.673051 1.481834
Now we can see that all the VIF values are below 10, so multicollinearity is no longer a serious concern in this model.
Diagnostics for Model 2
par(mfrow=c(2,2))
plot(fit2,which = 1:4)
We can see from the residuals vs fitted plot that there is heteroskedasticity in the error terms, as the variance is not constant. In the Normal Q-Q plot the error terms seem to be roughly normal, but the tails deviate from the reference line. This could be due to the presence of some outliers, which we can see in the Cook’s distance plot: observations 17, 50 and 129 stand out as highly influential points that may be skewing the results.
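Influential observations can also be listed numerically rather than read off the plot. The sketch below flags points whose Cook's distance exceeds 4/n, a common rule of thumb that is an assumption here and not a threshold used in this report. It runs on the built-in `mtcars` data as a stand-in, and `flag_influential` is a hypothetical helper.

```r
# Return the Cook's distances that exceed the 4/n rule-of-thumb cutoff.
flag_influential <- function(fit) {
  d <- cooks.distance(fit)
  cutoff <- 4 / length(d)
  d[d > cutoff]
}

# Stand-in model on the built-in mtcars data (not the car price data)
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
influential <- flag_influential(fit_demo)
print(influential)
```

Applied to the model above, the call would be `flag_influential(fit2)`. Flagged rows should be inspected before deciding whether to keep or remove them.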
Residual Histogram for Model 2
res_hat = fit2$residuals
res_hist <- ggplot(data.frame(res_hat), aes(res_hat)) + geom_histogram()
res_hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Checking if the Model 2 satisfies the assumption of zero mean error
t.test(res_hat)
##
## One Sample t-test
##
## data: res_hat
## t = -6.3548e-16, df = 204, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -379.3744 379.3744
## sample estimates:
## mean of x
## -1.222751e-13
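As the p-value is greater than 0.05, we cannot reject the hypothesis that the mean of the residuals is zero. Note that the residuals of a least-squares fit with an intercept always average to zero by construction, so this check is expected to pass. To improve the model, we will now transform the data for a better fit.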
Box Cox Power Transformation
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
bc <- boxcox(price ~ enginesize+
highwaympg+
carwidth +
peakrpm +
stroke +
fuelsystem.mpfi +
cylindernumber.four +
carheight +
boreratio +
compressionratio +
drivewheel.rwd +
drivewheel.fwd +
aspiration.turbo +
fuelsystem.2bbl +
enginetype.ohc,data=df_enc)
lambda = bc$x[which.max(bc$y)]
lambda
## [1] -0.1414141
The Box-Cox estimate of lambda above (about -0.14) is close to 0, the value that corresponds to a log transformation, so we will now log-transform the response variable to better fit the data.
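The link between lambda and the log transformation follows from the Box-Cox family itself: the transformed response is (y^lambda - 1) / lambda for lambda different from 0, and log(y) when lambda equals 0. The following minimal sketch implements that definition; `box_cox` is a hypothetical helper written for illustration and is separate from `MASS::boxcox`, which estimates lambda.

```r
# Box-Cox transformation of a strictly positive vector y.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) {
    log(y)                      # limiting case lambda = 0
  } else {
    (y^lambda - 1) / lambda
  }
}

y <- c(5, 10, 20)
# For a lambda very close to 0 the transformation is almost identical to log(y)
gap <- max(abs(box_cox(y, 1e-6) - log(y)))
print(gap)
```

This is why an estimated lambda near 0 is usually rounded to 0 and the simpler, more interpretable log transformation is used.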
Log Transformation of Response Variable and modelling
Model 3
#Transforming the response variable using log transformation and running a linear model
df_transform = df_enc%>%mutate(ln_price = log(price))
fit3 = lm(ln_price ~
enginesize+
highwaympg+
carwidth +
peakrpm +
stroke +
fuelsystem.mpfi +
cylindernumber.four +
carheight +
boreratio +
compressionratio +
drivewheel.rwd +
drivewheel.fwd +
aspiration.turbo +
fuelsystem.2bbl +
enginetype.ohc, data = df_transform)
summary(fit3)
##
## Call:
## lm(formula = ln_price ~ enginesize + highwaympg + carwidth +
## peakrpm + stroke + fuelsystem.mpfi + cylindernumber.four +
## carheight + boreratio + compressionratio + drivewheel.rwd +
## drivewheel.fwd + aspiration.turbo + fuelsystem.2bbl + enginetype.ohc,
## data = df_transform)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47978 -0.09447 -0.01227 0.11872 0.45642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.685e+00 7.742e-01 7.343 6.06e-12 ***
## enginesize 4.325e-03 5.730e-04 7.548 1.82e-12 ***
## highwaympg -9.535e-03 3.867e-03 -2.466 0.014555 *
## carwidth 3.319e-02 1.030e-02 3.222 0.001497 **
## peakrpm 1.037e-04 3.335e-05 3.110 0.002160 **
## stroke -1.722e-01 5.000e-02 -3.444 0.000705 ***
## fuelsystem.mpfi 1.061e-01 4.512e-02 2.352 0.019698 *
## cylindernumber.four -2.695e-01 4.614e-02 -5.841 2.23e-08 ***
## carheight 8.255e-03 5.810e-03 1.421 0.157017
## boreratio 2.072e-01 7.142e-02 2.901 0.004156 **
## compressionratio 1.356e-02 5.253e-03 2.581 0.010608 *
## drivewheel.rwd 1.196e-01 6.589e-02 1.815 0.071137 .
## drivewheel.fwd -2.435e-02 6.562e-02 -0.371 0.710965
## aspiration.turbo 1.301e-01 3.913e-02 3.324 0.001065 **
## fuelsystem.2bbl -6.807e-02 4.792e-02 -1.421 0.157110
## enginetype.ohc 1.576e-01 3.550e-02 4.440 1.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1666 on 189 degrees of freedom
## Multiple R-squared: 0.8987, Adjusted R-squared: 0.8907
## F-statistic: 111.8 on 15 and 189 DF, p-value: < 2.2e-16
summary(fit3)$sigma
## [1] 0.1665837
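As seen from the results above, the adjusted R-squared of Model 3 is 89.07%, which is higher than for Models 1 and 2, although it is measured on the log-price scale and so is not strictly comparable. The overall F-test p-value is also below 0.05, suggesting that the model as a whole is statistically significant. The residual standard error of the model is 0.167; because it is in log-price units, it cannot be compared directly with the residual standard errors of 2878 and 2862 (in price units) for Models 1 and 2.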
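Because Model 3 is fitted on log(price), one way to compare it with the untransformed models is to back-transform its fitted values and compute the error on the original scale. The sketch below shows the idea on the built-in `mtcars` data as a stand-in; `rmse` is a hypothetical helper, and the simple `exp()` back-transformation ignores retransformation bias, which is an assumption made to keep the example short.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Stand-in models on the built-in mtcars data (not the car price data)
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Both errors are expressed in the original units of the response
rmse_raw      <- rmse(mtcars$mpg, fitted(fit_raw))
rmse_log_back <- rmse(mtcars$mpg, exp(fitted(fit_log)))
print(c(raw = rmse_raw, log_back_transformed = rmse_log_back))
```

For the models in this report the equivalent comparison would use `fitted(fit2)` and `exp(fitted(fit3))` against `df_enc$price`.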
Residual diagnostics for Model 3 (fit3)
par(mfrow=c(2,2))
plot(fit3,which = 1:4)
From the above plot of residuals vs fitted values we can see that the residuals have nearly constant variance across the range of fitted values, which is consistent with homoskedasticity. The Normal Q-Q plot also suggests that the error terms are approximately normally distributed.
res_hat = fit3$residuals
res_hist <- ggplot(data.frame(res_hat), aes(res_hat)) + geom_histogram()
res_hist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the residual histogram above we can verify that the residuals are fairly normally distributed.
#Checking assumption of zero mean error for the transformed model
t.test(res_hat)
##
## One Sample t-test
##
## data: res_hat
## t = -3.3183e-16, df = 204, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.02208023 0.02208023
## sample estimates:
## mean of x
## -3.71607e-18
As the p-value is greater than 0.05, we cannot reject the hypothesis that the mean of the residuals is zero (for a least-squares fit with an intercept this holds by construction).
We have successfully built a linear regression model to predict the price of a car using its qualitative and quantitative features. Based on our analysis we conclude that our log-transformed model (Model 3: fit3) is the best-fitting model for explaining the variability in car prices. We checked the model residuals, R-squared, p-values and residual standard error to determine the performance of this model. The predictors used for car price prediction, along with their coefficient (beta) values, are listed below.
fit3$coefficients
## (Intercept) enginesize highwaympg carwidth
## 5.6845656973 0.0043252506 -0.0095349933 0.0331882984
## peakrpm stroke fuelsystem.mpfi cylindernumber.four
## 0.0001037302 -0.1722059792 0.1061207294 -0.2695336516
## carheight boreratio compressionratio drivewheel.rwd
## 0.0082554381 0.2072041061 0.0135593813 0.1195799635
## drivewheel.fwd aspiration.turbo fuelsystem.2bbl enginetype.ohc
## -0.0243514601 0.1300767849 -0.0680742920 0.1576154110
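Since the response of Model 3 is log(price), each coefficient is easier to read as an approximate percentage change in price for a one-unit increase in the predictor, computed as 100 * (exp(beta) - 1). The sketch below applies this to three coefficients copied from the output above; `pct_effect` is a hypothetical helper.

```r
# Percentage change in the response for a one-unit increase in a predictor of a log-linear model.
pct_effect <- function(beta) {
  100 * (exp(beta) - 1)
}

# Coefficients copied from the fit3 output above
coefs <- c(
  aspiration.turbo    = 0.1300767849,
  cylindernumber.four = -0.2695336516,
  enginesize          = 0.0043252506
)

effects <- pct_effect(coefs)
print(round(effects, 1))
```

For example, holding the other predictors fixed, the turbo coefficient corresponds to a price that is roughly 14% higher than for a standard-aspiration car.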
In our project we used the Recursive Feature Elimination (RFE) technique to select and identify the predictors that best explain the variability in the response variable. It is a backward selection algorithm implemented in the ‘caret’ package in R. We also used the ‘mlbench’ and ‘randomForest’ packages to implement the algorithm.
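The backward selection idea can be illustrated with a much simpler loop than the one in ‘caret’. The sketch below repeatedly refits a linear model and drops the predictor with the smallest absolute t-statistic. This is a deliberately simplified stand-in: the analysis in this report ranks predictors by random forest importance and evaluates each subset size with repeated cross-validation, neither of which is reproduced here. `backward_eliminate` is a hypothetical helper, and the built-in `mtcars` data is used in place of the car price data.

```r
# Simplified backward elimination: drop the weakest predictor until 'keep' remain.
backward_eliminate <- function(data, response, keep) {
  stopifnot(keep >= 1)
  predictors <- setdiff(names(data), response)
  dropped <- character(0)
  while (length(predictors) > keep) {
    fit <- lm(reformulate(predictors, response = response), data = data)
    # Absolute t-statistics of the predictors (the first row is the intercept)
    t_abs <- abs(coef(summary(fit))[-1, "t value"])
    weakest <- names(t_abs)[which.min(t_abs)]
    predictors <- setdiff(predictors, weakest)
    dropped <- c(dropped, weakest)
  }
  list(kept = predictors, dropped = dropped)
}

# Stand-in example on the built-in mtcars data (not the car price data)
demo_data <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec", "drat")]
demo <- backward_eliminate(demo_data, response = "mpg", keep = 3)
print(demo)
```

The order of `dropped` shows which predictors were judged least useful first, which is the same kind of ranking that RFE produces.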