Car Price Prediction

1. Introduction

Problem Statement

An automobile company aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

- Which variables are significant in predicting the price of a car
- How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large data set of different types of cars across the American market.

Aim of the project

We are required to model the price of cars with the available independent variables. It will be used by the management to understand how exactly the prices vary with the qualitative and quantitative characteristics of the car. They can accordingly manipulate the design of the cars, the business strategy etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

2. Packages Required

library(dplyr)
library(ggplot2)
library(GGally)
library(tidyverse)
library(highcharter)
library(readxl)
library(DT)
library(tm)
library(RColorBrewer)
library(Boruta)
library(rpart)
library(rattle)
library(caret)
library(scales)
library(bigmemory)
library(naniar)
library(stringr)
library(psych)
library(mlbench)
library(randomForest)
library(car)

3. Data Preparation

3.1 Reading the Data

Car Price Data: the dataset was taken from Kaggle.

Reading the Car Price Dataset

car_price <- read.csv(file = 'D:\\Drive E\\MSBA UCin\\Course\\Spring Sem\\BANA 6043 - Stat Computing\\Project\\Car Price\\CarPrice_Assignment.csv', header = TRUE, check.names=FALSE)

Glimpse of the Data

glimpse(car_price)

## Rows: 205
## Columns: 26
## $ car_ID           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16~
## $ symboling        <int> 3, 3, 1, 2, 2, 2, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0,~
## $ CarName          <chr> "alfa-romero giulia", "alfa-romero stelvio", "alfa-ro~
## $ fueltype         <chr> "gas", "gas", "gas", "gas", "gas", "gas", "gas", "gas~
## $ aspiration       <chr> "std", "std", "std", "std", "std", "std", "std", "std~
## $ doornumber       <chr> "two", "two", "two", "four", "four", "two", "four", "~
## $ carbody          <chr> "convertible", "convertible", "hatchback", "sedan", "~
## $ drivewheel       <chr> "rwd", "rwd", "rwd", "fwd", "4wd", "fwd", "fwd", "fwd~
## $ enginelocation   <chr> "front", "front", "front", "front", "front", "front",~
## $ wheelbase        <dbl> 88.6, 88.6, 94.5, 99.8, 99.4, 99.8, 105.8, 105.8, 105~
## $ carlength        <dbl> 168.8, 168.8, 171.2, 176.6, 176.6, 177.3, 192.7, 192.~
## $ carwidth         <dbl> 64.1, 64.1, 65.5, 66.2, 66.4, 66.3, 71.4, 71.4, 71.4,~
## $ carheight        <dbl> 48.8, 48.8, 52.4, 54.3, 54.3, 53.1, 55.7, 55.7, 55.9,~
## $ curbweight       <int> 2548, 2548, 2823, 2337, 2824, 2507, 2844, 2954, 3086,~
## $ enginetype       <chr> "dohc", "dohc", "ohcv", "ohc", "ohc", "ohc", "ohc", "~
## $ cylindernumber   <chr> "four", "four", "six", "four", "five", "five", "five"~
## $ enginesize       <int> 130, 130, 152, 109, 136, 136, 136, 136, 131, 131, 108~
## $ fuelsystem       <chr> "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpfi", "mpfi~
## $ boreratio        <dbl> 3.47, 3.47, 2.68, 3.19, 3.19, 3.19, 3.19, 3.19, 3.13,~
## $ stroke           <dbl> 2.68, 2.68, 3.47, 3.40, 3.40, 3.40, 3.40, 3.40, 3.40,~
## $ compressionratio <dbl> 9.00, 9.00, 9.00, 10.00, 8.00, 8.50, 8.50, 8.50, 8.30~
## $ horsepower       <int> 111, 111, 154, 102, 115, 110, 110, 110, 140, 160, 101~
## $ peakrpm          <int> 5000, 5000, 5000, 5500, 5500, 5500, 5500, 5500, 5500,~
## $ citympg          <int> 21, 21, 19, 24, 18, 19, 19, 19, 17, 16, 23, 23, 21, 2~
## $ highwaympg       <int> 27, 27, 26, 30, 22, 25, 25, 25, 20, 22, 29, 29, 28, 2~
## $ price            <dbl> 13495.00, 16500.00, 16500.00, 13950.00, 17450.00, 152~

The dataset was imported into RStudio and found to have 205 observations and 26 variables.

3.2 Data Cleaning

Checking for Null Values

summary(is.na(car_price))

##    car_ID        symboling        CarName         fueltype      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:205       FALSE:205       FALSE:205       FALSE:205      
##  aspiration      doornumber       carbody        drivewheel     
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:205       FALSE:205       FALSE:205       FALSE:205      
##  enginelocation  wheelbase       carlength        carwidth      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:205       FALSE:205       FALSE:205       FALSE:205      
##  carheight       curbweight      enginetype      cylindernumber 
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:205       FALSE:205       FALSE:205       FALSE:205      
##  enginesize      fuelsystem      boreratio         stroke       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:205       FALSE:205       FALSE:205       FALSE:205      
##  compressionratio horsepower       peakrpm         citympg       
##  Mode :logical    Mode :logical   Mode :logical   Mode :logical  
##  FALSE:205        FALSE:205       FALSE:205       FALSE:205      
##  highwaympg        price        
##  Mode :logical   Mode :logical  
##  FALSE:205       FALSE:205

Analysis shows that there are no missing values in our dataset. Hence, we will retain all the variables in our study.
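A more compact way to report missing values is a per-column count. The sketch below assumes a small made-up data frame (`toy`) standing in for `car_price`; its values are illustrative only.

```r
# Toy stand-in for car_price; values are made up for illustration.
toy <- data.frame(
  price      = c(13495, 16500, NA),
  horsepower = c(111, NA, 154),
  fueltype   = c("gas", "gas", "diesel")
)

# Number of missing values in each column, and in the whole table
na_per_column <- colSums(is.na(toy))
na_total <- sum(na_per_column)

print(na_per_column)
print(na_total)
```

Applied to `car_price`, every count would be zero, which matches the summary above.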

4. Exploratory Data Analysis and Visualization

4.1 Number of car models by company

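#Extracting Car Company from Car Name
#Splitting on the first space only, so hyphenated makes such as "alfa-romero" stay intact
car_details <- car_price %>% separate(CarName, c("CarCompany", "CarName"), sep = " ", extra = "merge", fill = "right")
car_details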

#Checking for Unique values
unique(car_details$CarCompany)

Errors in car company names

- maxda = mazda
- nissan = Nissan
- porcshce = porsche
- toyouta = toyota
- vokswagen = vw = volkswagen

#Correcting the typos
car_details$CarCompany <- gsub("maxda", "mazda", car_details$CarCompany)
car_details$CarCompany <- gsub("nissan", "Nissan", car_details$CarCompany)
car_details$CarCompany <- gsub("porcshce", "porsche", car_details$CarCompany)
car_details$CarCompany <- gsub("toyouta", "toyota", car_details$CarCompany)
car_details$CarCompany <- gsub("vokswagen", "volkswagen", car_details$CarCompany)
car_details$CarCompany <- gsub("vw", "volkswagen", car_details$CarCompany)
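Because `gsub()` replaces the pattern wherever it occurs inside a string, an exact-match lookup can be a safer way to recode names. The following is a minimal sketch, assuming a hypothetical lookup vector `fixes` and a toy `company` vector in place of `car_details$CarCompany`. Unlike the code above, it lower-cases every name, so "Nissan" becomes "nissan".

```r
# Hypothetical lookup table: misspelling -> canonical (lower-case) name
fixes <- c(maxda = "mazda", porcshce = "porsche", toyouta = "toyota",
           vokswagen = "volkswagen", vw = "volkswagen")

# Toy stand-in for car_details$CarCompany
company <- c("maxda", "Nissan", "nissan", "vw", "toyota", "porcshce")

company <- tolower(company)          # unifies "Nissan" and "nissan"
hit <- company %in% names(fixes)     # exact matches only, no substrings
company[hit] <- unname(fixes[company[hit]])

print(unique(company))
```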

Toyota appears to have the largest number of models, and Mercury the smallest.
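The counts behind a chart like this can be obtained with `table()`. This is a sketch on a made-up `company` vector, not the code that produced the original figure.

```r
# Toy stand-in for the cleaned car_details$CarCompany column (one entry per car)
company <- c("toyota", "toyota", "toyota", "mazda", "mazda", "mercury")

# Count rows per company and sort from most to fewest models
models_per_company <- sort(table(company), decreasing = TRUE)
print(models_per_company)

# Simple bar chart of the counts
barplot(models_per_company, las = 2, ylab = "Number of models")
```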

4.2 Mean Price of Cars by company

Jaguar, Buick and Porsche seem to have the highest average prices. Chevrolet and Dodge have the lowest.
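Average price per company can be computed with `aggregate()`. The data frame below is a toy example with illustrative prices, not values taken from the dataset.

```r
# Toy stand-in for car_details; prices are illustrative only
toy <- data.frame(
  CarCompany = c("jaguar", "jaguar", "chevrolet", "chevrolet", "dodge"),
  price      = c(32000, 36000, 5000, 6000, 6400)
)

# Mean price per company, sorted from most to least expensive
mean_price <- aggregate(price ~ CarCompany, data = toy, FUN = mean)
mean_price <- mean_price[order(-mean_price$price), ]
print(mean_price)
```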

4.3 Correlation between numeric variables

Carwidth, carlength, curbweight, enginesize and horsepower seem to have a positive correlation with price. Carheight doesn’t show any clear trend with price. Citympg and highwaympg seem to have a strong negative correlation with price.
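To back up a visual reading of the correlation plot, the correlations with the response can be listed numerically. The helper `cor_with_target()` below is hypothetical, and the built-in `mtcars` data (with `mpg` as the target) is used only as a stand-in; with this report's data the call would use `price` as the target.

```r
# Correlation of every numeric column with a chosen target column,
# sorted from most positive to most negative.
cor_with_target <- function(data, target) {
  numeric_cols <- data[sapply(data, is.numeric)]
  r <- cor(numeric_cols, use = "pairwise.complete.obs")[, target]
  sort(r[names(r) != target], decreasing = TRUE)
}

# Stand-in example: mtcars ships with R; mpg plays the role of price here
target_cor <- cor_with_target(mtcars, "mpg")
print(round(target_cor, 2))
```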

4.4 Univariate analysis for categorical variables using BoxPlot

- Fuel type: Diesel cars are comparatively more expensive than gas cars.
- Door number: Cars with four doors are slightly more expensive than cars with two doors.
- Aspiration: Cars with turbo aspiration are more expensive.
- Car body: Hardtop and convertible cars are more expensive than other body types.
- Engine location: Cars with a rear engine are far more expensive than cars with a front engine.
- Drive wheel: RWD cars are more expensive than 4WD or FWD cars.
- Engine type: Cars with DOHCV or OHCV engines are more expensive than others.
- Cylinder number: Cars with five or more cylinders are more expensive than others.
- Fuel system: Cars with MPFI are the most expensive, whereas cars with 1BBL or 2BBL fuel systems are the cheapest.
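Box plots of this kind can be drawn in a loop over the categorical columns. The sketch below uses a made-up data frame with illustrative prices and only two of the categorical variables; it is not the code behind the original plots.

```r
# Toy stand-in for car_details; prices are illustrative only
toy <- data.frame(
  fueltype   = c("gas", "gas", "gas", "diesel", "diesel", "diesel"),
  aspiration = c("std", "turbo", "std", "turbo", "std", "turbo"),
  price      = c(7000, 9500, 8200, 14000, 11000, 16500)
)

# One box plot of price per categorical variable, side by side
categorical <- c("fueltype", "aspiration")
par(mfrow = c(1, length(categorical)))
for (v in categorical) {
  boxplot(reformulate(v, response = "price"), data = toy,
          main = v, ylab = "price")
}

# Group medians give the same comparison in numbers
median_price <- tapply(toy$price, toy$fueltype, median)
print(median_price)
```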

5. Data Modelling

5.1 One hot encoding of the categorical variables

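#Converting the data type of categorical variables from character to factor
df = car_details%>%select(-c(1:4))%>%mutate_if(is.character,as.factor)
#Dropping near-zero-variance columns; the guard avoids df[-integer(0)], which would drop every column
var_0 = nearZeroVar(df)
if (length(var_0) > 0) df = df[-var_0]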

#One hot encoding the categorical variables 
dmy <- dummyVars(" ~ .", data = df)
df_enc <- data.frame(predict(dmy, newdata = df))
head(df_enc,5)

##   fueltype.diesel fueltype.gas aspiration.std aspiration.turbo doornumber.four
## 1               0            1              1                0               0
## 2               0            1              1                0               0
## 3               0            1              1                0               0
## 4               0            1              1                0               1
## 5               0            1              1                0               1
##   doornumber.two carbody.convertible carbody.hardtop carbody.hatchback
## 1              1                   1               0                 0
## 2              1                   1               0                 0
## 3              1                   0               0                 1
## 4              0                   0               0                 0
## 5              0                   0               0                 0
##   carbody.sedan carbody.wagon drivewheel.4wd drivewheel.fwd drivewheel.rwd
## 1             0             0              0              0              1
## 2             0             0              0              0              1
## 3             0             0              0              0              1
## 4             1             0              0              1              0
## 5             1             0              1              0              0
##   wheelbase carlength carwidth carheight curbweight enginetype.dohc
## 1      88.6     168.8     64.1      48.8       2548               1
## 2      88.6     168.8     64.1      48.8       2548               1
## 3      94.5     171.2     65.5      52.4       2823               0
## 4      99.8     176.6     66.2      54.3       2337               0
## 5      99.4     176.6     66.4      54.3       2824               0
##   enginetype.dohcv enginetype.l enginetype.ohc enginetype.ohcf enginetype.ohcv
## 1                0            0              0               0               0
## 2                0            0              0               0               0
## 3                0            0              0               0               1
## 4                0            0              1               0               0
## 5                0            0              1               0               0
##   enginetype.rotor cylindernumber.eight cylindernumber.five cylindernumber.four
## 1                0                    0                   0                   1
## 2                0                    0                   0                   1
## 3                0                    0                   0                   0
## 4                0                    0                   0                   1
## 5                0                    0                   1                   0
##   cylindernumber.six cylindernumber.three cylindernumber.twelve
## 1                  0                    0                     0
## 2                  0                    0                     0
## 3                  1                    0                     0
## 4                  0                    0                     0
## 5                  0                    0                     0
##   cylindernumber.two enginesize fuelsystem.1bbl fuelsystem.2bbl fuelsystem.4bbl
## 1                  0        130               0               0               0
## 2                  0        130               0               0               0
## 3                  0        152               0               0               0
## 4                  0        109               0               0               0
## 5                  0        136               0               0               0
##   fuelsystem.idi fuelsystem.mfi fuelsystem.mpfi fuelsystem.spdi fuelsystem.spfi
## 1              0              0               1               0               0
## 2              0              0               1               0               0
## 3              0              0               1               0               0
## 4              0              0               1               0               0
## 5              0              0               1               0               0
##   boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
## 1      3.47   2.68                9        111    5000      21         27 13495
## 2      3.47   2.68                9        111    5000      21         27 16500
## 3      2.68   3.47                9        154    5000      19         26 16500
## 4      3.19   3.40               10        102    5500      24         30 13950
## 5      3.19   3.40                8        115    5500      18         22 17450
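The encoding above creates one indicator column for every level of each factor. In a regression with an intercept, including all levels of the same factor makes the columns perfectly collinear, so at least one level per factor has to be left out. A full-rank encoding does this automatically. The sketch below uses base R's `model.matrix()` on a toy data frame; with caret, `dummyVars()` has a `fullRank = TRUE` argument for the same purpose.

```r
# Toy stand-in for df; values are illustrative only
toy <- data.frame(
  fueltype   = factor(c("gas", "diesel", "gas", "gas")),
  drivewheel = factor(c("rwd", "fwd", "4wd", "fwd")),
  enginesize = c(130, 152, 109, 136)
)

# model.matrix() uses treatment contrasts by default: k - 1 indicator
# columns per factor. The intercept column is dropped with [, -1].
X <- model.matrix(~ ., data = toy)[, -1]
print(colnames(X))
```

Here the reference levels (`diesel` and `4wd`, the first levels alphabetically) are absorbed into the intercept.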

5.2 Recursive Feature Elimination

#Using the Recursive Feature Elimination technique to select the most impactful predictors
control <- rfeControl(functions = rfFuncs, # random forest
                      method = "repeatedcv", # repeated cv
                      repeats = 5, # number of repeats
                      number = 10)

result_rfe1 <- rfe(x = df_enc[,-ncol(df_enc)], 
                   y = df_enc$price, 
                   sizes = c(1:20),
                   rfeControl = control)
 
# Print the results
result_rfe1

## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance over subset size:
## 
##  Variables RMSE Rsquared  MAE RMSESD RsquaredSD MAESD Selected
##          1 2790   0.8629 2030  451.5    0.08119 342.4         
##          2 2514   0.8991 1806  485.5    0.04650 333.4         
##          3 2254   0.9191 1615  481.0    0.04231 285.9         
##          4 2187   0.9274 1516  539.6    0.04009 315.0         
##          5 2149   0.9309 1476  595.4    0.03531 326.9         
##          6 2101   0.9310 1454  518.2    0.03832 285.5         
##          7 2079   0.9326 1428  540.5    0.03750 289.7         
##          8 2059   0.9347 1415  538.9    0.03547 280.3         
##          9 2050   0.9350 1421  522.4    0.03563 270.2         
##         10 2044   0.9357 1412  521.5    0.03426 271.6         
##         11 2028   0.9371 1402  543.2    0.03392 284.3         
##         12 2022   0.9373 1398  537.3    0.03396 291.7         
##         13 2012   0.9381 1393  535.1    0.03391 290.8         
##         14 1995   0.9394 1380  546.4    0.03300 300.3         
##         15 1992   0.9389 1383  538.3    0.03339 302.5         
##         16 2000   0.9387 1388  523.5    0.03384 287.8         
##         17 1990   0.9395 1381  520.9    0.03283 289.2         
##         18 1990   0.9395 1387  521.8    0.03270 299.2         
##         19 1990   0.9393 1386  519.9    0.03390 294.2         
##         20 1973   0.9403 1377  519.4    0.03221 293.1        *
##         49 1976   0.9402 1391  504.6    0.03205 282.9         
## 
## The top 5 variables (out of 20):
##    enginesize, curbweight, horsepower, wheelbase, carlength

# Print the selected features
predictors(result_rfe1)

##  [1] "enginesize"          "curbweight"          "horsepower"         
##  [4] "wheelbase"           "carlength"           "carwidth"           
##  [7] "highwaympg"          "citympg"             "peakrpm"            
## [10] "fuelsystem.mpfi"     "stroke"              "cylindernumber.four"
## [13] "carheight"           "boreratio"           "compressionratio"   
## [16] "drivewheel.rwd"      "carbody.hatchback"   "fuelsystem.2bbl"    
## [19] "aspiration.std"      "drivewheel.fwd"

#Plotting the top 20 predictors identified using RFE technique
varimp_data <- data.frame(feature = row.names(varImp(result_rfe1))[1:20],
                          importance = varImp(result_rfe1)[1:20, 1])

varimp_data

##                feature importance
## 1           enginesize  20.754652
## 2           curbweight  15.958237
## 3           horsepower  12.903263
## 4            wheelbase  11.409220
## 5            carlength  10.724714
## 6             carwidth  10.654571
## 7           highwaympg  10.563011
## 8              citympg   9.753462
## 9              peakrpm   7.676091
## 10     fuelsystem.mpfi   6.574579
## 11              stroke   6.483638
## 12 cylindernumber.four   6.139136
## 13           carheight   6.011663
## 14           boreratio   5.729216
## 15    compressionratio   5.568195
## 16      drivewheel.rwd   4.949179
## 17   carbody.hatchback   4.780230
## 18 carbody.convertible   4.633970
## 19     fuelsystem.2bbl   4.537069
## 20       carbody.sedan   4.501732

ggplot(data = varimp_data, 
       aes(x = reorder(feature, -importance), y = importance, fill = feature)) +
  geom_bar(stat="identity") + labs(x = "Features", y = "Variable Importance") + 
  geom_text(aes(label = round(importance, 2)), vjust=1.6, color="white", size=4) + 
  theme_bw() + theme(legend.position = "none") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

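We now have the top 20 predictors identified through the recursive feature elimination technique that explain most of the variance in the response variable. Note that the table and plot take the first 20 rows of varImp(), so they list carbody.convertible and carbody.sedan, whereas predictors() returns aspiration.std and drivewheel.fwd in their place.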
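Typing a long predictor list by hand makes it easy for the fitted formula to drift from the selected set. One option is to build the formula programmatically with `reformulate()`. In this sketch, `selected` is a shortened, hypothetical stand-in for the character vector returned by `predictors(result_rfe1)`.

```r
# Hypothetical stand-in for predictors(result_rfe1), shortened for brevity
selected <- c("enginesize", "curbweight", "horsepower", "drivewheel.rwd")

# Build price ~ enginesize + curbweight + ... from the character vector
model_formula <- reformulate(selected, response = "price")
print(model_formula)
```

The resulting formula can be passed straight to `lm()` together with the encoded data frame.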

5.3 Multiple Linear Regression Model

Model 1

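Fitting a linear model using 20 predictors based on the RFE selection. Eighteen are taken directly from the list returned by predictors(); the formula uses aspiration.turbo in place of aspiration.std (its complement) and enginetype.ohc in place of carbody.hatchback.

#Fitting a linear model using 20 predictors based on the RFE selection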
fit1 = lm(price ~ 
enginesize+         
curbweight+             
horsepower+             
wheelbase   +           
carlength   +           
highwaympg+             
carwidth    +           
citympg     +       
peakrpm     +       
stroke      +       
fuelsystem.mpfi +               
cylindernumber.four +           
carheight   +           
boreratio   +           
compressionratio    +       
drivewheel.rwd      +       
drivewheel.fwd      +       
aspiration.turbo    +           
fuelsystem.2bbl     +       
enginetype.ohc, data = df_enc)

summary(fit1)

## 
## Call:
## lm(formula = price ~ enginesize + curbweight + horsepower + wheelbase + 
##     carlength + highwaympg + carwidth + citympg + peakrpm + stroke + 
##     fuelsystem.mpfi + cylindernumber.four + carheight + boreratio + 
##     compressionratio + drivewheel.rwd + drivewheel.fwd + aspiration.turbo + 
##     fuelsystem.2bbl + enginetype.ohc, data = df_enc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7686.5 -1119.8   -93.5  1140.8 11527.4 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.934e+04  1.478e+04  -3.338 0.001023 ** 
## enginesize           1.234e+02  1.626e+01   7.587 1.58e-12 ***
## curbweight           6.589e-01  1.829e+00   0.360 0.719028    
## horsepower          -1.479e+00  1.854e+01  -0.080 0.936477    
## wheelbase            2.467e+01  9.828e+01   0.251 0.802101    
## carlength           -3.829e+01  5.453e+01  -0.702 0.483383    
## highwaympg           2.496e+02  1.609e+02   1.552 0.122454    
## carwidth             4.652e+02  2.404e+02   1.935 0.054536 .  
## citympg             -2.557e+02  1.716e+02  -1.490 0.137845    
## peakrpm              2.667e+00  6.727e-01   3.964 0.000105 ***
## stroke              -4.169e+03  8.875e+02  -4.697 5.14e-06 ***
## fuelsystem.mpfi      3.398e+00  8.326e+02   0.004 0.996748    
## cylindernumber.four -4.440e+03  8.549e+02  -5.194 5.43e-07 ***
## carheight            2.151e+02  1.347e+02   1.597 0.112041    
## boreratio            1.713e+03  1.324e+03   1.294 0.197333    
## compressionratio     1.386e+02  9.735e+01   1.424 0.156095    
## drivewheel.rwd       1.616e+03  1.222e+03   1.323 0.187515    
## drivewheel.fwd      -5.349e+02  1.226e+03  -0.436 0.663166    
## aspiration.turbo     1.986e+03  8.675e+02   2.289 0.023197 *  
## fuelsystem.2bbl     -3.172e+01  8.435e+02  -0.038 0.970045    
## enginetype.ohc       2.518e+03  6.458e+02   3.898 0.000136 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2878 on 184 degrees of freedom
## Multiple R-squared:  0.8829, Adjusted R-squared:  0.8702 
## F-statistic:  69.4 on 20 and 184 DF,  p-value: < 2.2e-16

summary(fit1)$sigma

## [1] 2877.935

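From the above result we can see that Model 1 (fit1) has a high adjusted R-squared of 87.02%. The F-test p-value is below 0.05, so the model as a whole is statistically significant, that is, at least one coefficient differs from zero. Individually, however, only 6 of the 20 predictors are significant at the 5% level, which hints at redundancy among them.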

#Checking the Variance Inflation Factor (VIF) to check for multicollinearity among the predictors
sort(vif(fit1),decreasing = TRUE)

##             citympg          highwaympg          curbweight          horsepower 
##           31.029971           30.230279           22.333410           13.233624 
##          enginesize           carlength      drivewheel.fwd           wheelbase 
##           11.295916           11.146046            9.032550            8.626071 
##      drivewheel.rwd            carwidth     fuelsystem.mpfi     fuelsystem.2bbl 
##            8.616762            6.551885            4.260030            3.843818 
##    compressionratio           boreratio cylindernumber.four    aspiration.turbo 
##            3.682896            3.167896            3.148117            2.754874 
##           carheight             peakrpm      enginetype.ohc              stroke 
##            2.668219            2.535464            2.072398            1.907721

From the VIF values above we can see that six predictors (citympg, highwaympg, curbweight, horsepower, enginesize and carlength) have a VIF greater than 10, which indicates strong multicollinearity among the predictors. Based on the correlation plot from EDA section 4.3, some variables are highly correlated with each other, so we will now remove such redundant variables.
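For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The hypothetical helper `manual_vif()` below reproduces this definition in base R; `mtcars` is used only as a stand-in dataset.

```r
# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on the remaining predictors.
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Stand-in example on the built-in mtcars data
vifs <- manual_vif(mtcars, c("wt", "hp", "disp", "drat"))
print(sort(vifs, decreasing = TRUE))
```

A VIF of 10 corresponds to an auxiliary R² of 0.9, which is why 10 is a common rule-of-thumb cutoff.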

Model 2

Fitting a second linear model after eliminating predictors from Model 1 with high correlation and multicollinearity.

#Fitting a new linear model by eliminating predictors with high correlation and multicollinearity
fit2 = lm(price ~ 
enginesize+         
highwaympg+             
carwidth    +           
peakrpm     +       
stroke      +       
fuelsystem.mpfi +               
cylindernumber.four +           
carheight   +           
boreratio   +           
compressionratio    +       
drivewheel.rwd      +       
drivewheel.fwd      +       
aspiration.turbo    +           
fuelsystem.2bbl     +       
enginetype.ohc, data = df_enc)

summary(fit2)

## 
## Call:
## lm(formula = price ~ enginesize + highwaympg + carwidth + peakrpm + 
##     stroke + fuelsystem.mpfi + cylindernumber.four + carheight + 
##     boreratio + compressionratio + drivewheel.rwd + drivewheel.fwd + 
##     aspiration.turbo + fuelsystem.2bbl + enginetype.ohc, data = df_enc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7580.8 -1088.3    38.8  1191.2 11367.0 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -50812.509  13301.621  -3.820 0.000181 ***
## enginesize             123.084      9.846  12.501  < 2e-16 ***
## highwaympg              38.761     66.434   0.583 0.560291    
## carwidth               455.386    176.955   2.573 0.010835 *  
## peakrpm                  2.730      0.573   4.764 3.77e-06 ***
## stroke               -4131.680    859.041  -4.810 3.08e-06 ***
## fuelsystem.mpfi         64.719    775.197   0.083 0.933553    
## cylindernumber.four  -4699.975    792.792  -5.928 1.43e-08 ***
## carheight              178.890     99.831   1.792 0.074745 .  
## boreratio             2107.794   1227.038   1.718 0.087473 .  
## compressionratio       113.568     90.262   1.258 0.209872    
## drivewheel.rwd        1690.436   1132.113   1.493 0.137061    
## drivewheel.fwd        -497.899   1127.398  -0.442 0.659258    
## aspiration.turbo      2007.499    672.315   2.986 0.003201 ** 
## fuelsystem.2bbl       -115.751    823.391  -0.141 0.888352    
## enginetype.ohc        2625.564    609.949   4.305 2.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2862 on 189 degrees of freedom
## Multiple R-squared:  0.8811, Adjusted R-squared:  0.8716 
## F-statistic: 93.35 on 15 and 189 DF,  p-value: < 2.2e-16

summary(fit2)$sigma

## [1] 2862.18

From the above result we can see that Model 2 (fit2) has a high adjusted R-squared of 87.16%, slightly higher than Model 1 in spite of having fewer predictors. The F-test p-value is also below 0.05, so the model as a whole is statistically significant, although several individual coefficients are not.

#Checking VIF values for the new model
sort(vif(fit2),decreasing = TRUE)

##      drivewheel.fwd      drivewheel.rwd          highwaympg          enginesize 
##            7.719823            7.482301            5.212094            4.186127 
##     fuelsystem.mpfi     fuelsystem.2bbl            carwidth    compressionratio 
##            3.733600            3.703595            3.588381            3.200929 
##           boreratio cylindernumber.four      enginetype.ohc             peakrpm 
##            2.750362            2.737323            1.868860            1.860507 
##              stroke    aspiration.turbo           carheight 
##            1.807206            1.673051            1.481834

All the VIF values are now below 10, so by this rule of thumb multicollinearity is no longer a serious concern.

5.4 Residual Diagnostics

Diagnostics for Model 2

par(mfrow=c(2,2))
plot(fit2,which = 1:4)

We can see from the residuals vs fitted plot that there is heteroskedasticity in the error terms, as the variance is not constant. In the Normal Q-Q plot the error terms look roughly normal, but the tails deviate from the reference line. This could be due to a few outliers, which the Cook’s distance plot highlights: observations 17, 50 and 129 have the highest influence on the fit and may be skewing the results.
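The visual impression of non-constant variance can be complemented with a formal test. The sketch below is a simple auxiliary-regression check in the spirit of the studentized Breusch-Pagan test, using the fitted values as the only regressor. It is an illustrative variant rather than the method used in this report, and it runs on the built-in `mtcars` data as a stand-in. Packaged alternatives include `ncvTest()` in car and `bptest()` in lmtest.

```r
# Stand-in model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on fitted values
aux <- lm(resid(fit)^2 ~ fitted(fit))

# Test statistic n * R^2, compared with a chi-square with 1 df
stat <- nobs(fit) * summary(aux)$r.squared
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

print(c(statistic = stat, p_value = p_value))
```

A small p-value would suggest that the residual variance changes with the fitted values.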

Residual Histogram for Model 2

res_hat = fit2$residuals
res_hist <- ggplot(data.frame(res_hat), aes(res_hat)) + geom_histogram()
res_hist

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Checking whether Model 2 satisfies the assumption of zero mean error

t.test(res_hat)

## 
##  One Sample t-test
## 
## data:  res_hat
## t = -6.3548e-16, df = 204, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -379.3744  379.3744
## sample estimates:
##     mean of x 
## -1.222751e-13

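As the p-value is greater than 0.05, we fail to reject the null hypothesis that the mean of the residuals is zero. For a least-squares fit with an intercept the residuals average to zero by construction, so this check is expected to pass. To improve the model, we will now transform the response for a better fit.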
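The zero-mean property follows from the least-squares normal equations: the residuals are orthogonal to every column of the design matrix, including the intercept column of ones. A quick numerical check, using `mtcars` as a stand-in dataset:

```r
# Stand-in model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Mean of the residuals: zero up to floating-point error
mean_resid <- mean(resid(fit))

# Residuals are orthogonal to each design-matrix column (X'e = 0)
cross <- crossprod(model.matrix(fit), resid(fit))

print(mean_resid)
print(cross)
```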

5.5 Data Transformation

Box Cox Power Transformation

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

bc <- boxcox(price ~ enginesize+            
highwaympg+             
carwidth    +           
peakrpm     +       
stroke      +       
fuelsystem.mpfi +               
cylindernumber.four +           
carheight   +           
boreratio   +           
compressionratio    +       
drivewheel.rwd      +       
drivewheel.fwd      +       
aspiration.turbo    +           
fuelsystem.2bbl     +       
enginetype.ohc,data=df_enc)

lambda = bc$x[which.max(bc$y)]
lambda

## [1] -0.1414141

The Box-Cox estimate of lambda above is close to 0, the value that corresponds to a log transformation, so we will now log-transform the response variable to better fit the data.
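Whether lambda = 0 is actually supported can be checked with an approximate 95% confidence interval from the profile log-likelihood, which is the same interval `boxcox()` marks on its plot. The sketch below assumes the built-in `mtcars` data as a stand-in and a finer lambda grid than the default.

```r
library(MASS)

# Profile log-likelihood over a grid of lambda values (no plot)
bc <- boxcox(mpg ~ wt + hp, data = mtcars,
             lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Point estimate: lambda with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[bc$y >= cutoff])

print(lambda_hat)
print(lambda_ci)
```

If the interval contains 0, a log transformation is consistent with the data.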

Log Transformation of Response Variable and modelling

Model 3

#Transforming the response variable using log transformation and running a linear model
df_transform = df_enc%>%mutate(ln_price = log(price))
fit3 = lm(ln_price ~ 
enginesize+         
highwaympg+             
carwidth    +           
peakrpm     +       
stroke      +       
fuelsystem.mpfi +               
cylindernumber.four +           
carheight   +           
boreratio   +           
compressionratio    +       
drivewheel.rwd      +       
drivewheel.fwd      +       
aspiration.turbo    +           
fuelsystem.2bbl     +       
enginetype.ohc, data = df_transform)

summary(fit3)

## 
## Call:
## lm(formula = ln_price ~ enginesize + highwaympg + carwidth + 
##     peakrpm + stroke + fuelsystem.mpfi + cylindernumber.four + 
##     carheight + boreratio + compressionratio + drivewheel.rwd + 
##     drivewheel.fwd + aspiration.turbo + fuelsystem.2bbl + enginetype.ohc, 
##     data = df_transform)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47978 -0.09447 -0.01227  0.11872  0.45642 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.685e+00  7.742e-01   7.343 6.06e-12 ***
## enginesize           4.325e-03  5.730e-04   7.548 1.82e-12 ***
## highwaympg          -9.535e-03  3.867e-03  -2.466 0.014555 *  
## carwidth             3.319e-02  1.030e-02   3.222 0.001497 ** 
## peakrpm              1.037e-04  3.335e-05   3.110 0.002160 ** 
## stroke              -1.722e-01  5.000e-02  -3.444 0.000705 ***
## fuelsystem.mpfi      1.061e-01  4.512e-02   2.352 0.019698 *  
## cylindernumber.four -2.695e-01  4.614e-02  -5.841 2.23e-08 ***
## carheight            8.255e-03  5.810e-03   1.421 0.157017    
## boreratio            2.072e-01  7.142e-02   2.901 0.004156 ** 
## compressionratio     1.356e-02  5.253e-03   2.581 0.010608 *  
## drivewheel.rwd       1.196e-01  6.589e-02   1.815 0.071137 .  
## drivewheel.fwd      -2.435e-02  6.562e-02  -0.371 0.710965    
## aspiration.turbo     1.301e-01  3.913e-02   3.324 0.001065 ** 
## fuelsystem.2bbl     -6.807e-02  4.792e-02  -1.421 0.157110    
## enginetype.ohc       1.576e-01  3.550e-02   4.440 1.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1666 on 189 degrees of freedom
## Multiple R-squared:  0.8987, Adjusted R-squared:  0.8907 
## F-statistic: 111.8 on 15 and 189 DF,  p-value: < 2.2e-16

summary(fit3)$sigma

## [1] 0.1665837

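As seen from the results above, the adjusted R-squared of Model 3 is 89.07%, which is higher than for Models 1 and 2, although it is measured on the log-price scale, so the comparison is indicative rather than exact. The overall F-test p-value is also less than 0.05, suggesting that the model as a whole is statistically significant. The residual standard error is 0.167, but because the response is log(price) this is in log units and is not directly comparable with the residual standard errors of Models 1 and 2, which are in price units.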
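To compare a log-response model with a raw-response model on equal terms, the log-scale predictions can be back-transformed and the error computed in the original units. The sketch below does this in-sample on the built-in `mtcars` data as a stand-in; `rmse()` is a hypothetical helper.

```r
# Hypothetical helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Stand-in models: raw response vs log-transformed response
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
fit_log <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Both errors are expressed in the original units of the response
rmse_raw <- rmse(mtcars$mpg, fitted(fit_raw))
rmse_log <- rmse(mtcars$mpg, exp(fitted(fit_log)))

print(c(raw = rmse_raw, log_backtransformed = rmse_log))
```

Note that `exp()` of a log-scale prediction estimates the conditional median rather than the mean when the log-scale errors are normal, so back-transformed predictions tend to sit slightly below the mean.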

Residual diagnostics for Model 3 (fit3)

par(mfrow=c(2,2))
plot(fit3,which = 1:4)

From the above plot of residuals vs fitted values we can see that the residuals have near-constant variance across the range of fitted values, so the homoskedasticity assumption looks reasonable. The Normal Q-Q plot also suggests that the error terms are approximately normally distributed.

res_hat = fit3$residuals
res_hist <- ggplot(data.frame(res_hat), aes(res_hat)) + geom_histogram()
res_hist

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From residual histogram above we can verify that residuals are fairly normally distributed.

#Checking assumption of zero mean error for the transformed model
t.test(res_hat)

## 
##  One Sample t-test
## 
## data:  res_hat
## t = -3.3183e-16, df = 204, p-value = 1
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.02208023  0.02208023
## sample estimates:
##    mean of x 
## -3.71607e-18

As the p-value is greater than 0.05, we fail to reject the null hypothesis that the mean of the residuals is zero.

6. Conclusion

We have successfully built a linear regression model to predict the price of a car from its qualitative and quantitative features. Based on our analysis we conclude that our log-transformed model (Model 3: fit3) is the best-fitting model for explaining the variability in car price. We checked the model residuals, R-squared value, p-values and residual standard error to assess the performance of this model. The predictors used for car price prediction, along with their coefficient (beta) values, are listed below:

fit3$coefficients

##         (Intercept)          enginesize          highwaympg            carwidth 
##        5.6845656973        0.0043252506       -0.0095349933        0.0331882984 
##             peakrpm              stroke     fuelsystem.mpfi cylindernumber.four 
##        0.0001037302       -0.1722059792        0.1061207294       -0.2695336516 
##           carheight           boreratio    compressionratio      drivewheel.rwd 
##        0.0082554381        0.2072041061        0.0135593813        0.1195799635 
##      drivewheel.fwd    aspiration.turbo     fuelsystem.2bbl      enginetype.ohc 
##       -0.0243514601        0.1300767849       -0.0680742920        0.1576154110
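Because the response is log(price), each coefficient describes a multiplicative effect: a one-unit increase in a predictor changes the expected price by roughly 100 * (exp(beta) - 1) percent, holding the other predictors fixed. The sketch below applies this to three of the coefficients printed above.

```r
# Three coefficients copied from the fit3 output above
coefs <- c(enginesize          = 0.0043252506,
           aspiration.turbo    = 0.1300767849,
           cylindernumber.four = -0.2695336516)

# Approximate percentage change in price per one-unit increase
pct_change <- 100 * (exp(coefs) - 1)
print(round(pct_change, 2))
```

For the indicator variables, a "one-unit increase" means switching from 0 to 1, for example from standard to turbo aspiration.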

7. Appendix

In our project we used the Recursive Feature Elimination technique to select and identify the predictors that best explain the variability in the response variable. It is a backward selection algorithm provided by the ‘caret’ package in R. We also used the ‘mlbench’ and ‘randomForest’ libraries to implement the algorithm.