
Car Price Prediction (InoVision Data Analytics Group)


In the rapidly growing used-car market, accurately determining a car’s true value is crucial for buyers, sellers, and dealerships. Conducted by the InoVision Data Analytics Group, this project uses machine learning to predict the resale prices of cars from key features such as fuel type, transmission, production year, and total mileage (kilometers driven).

Project Goals
Develop an intelligent model for predicting the prices of used cars.
Identify influential factors on resale prices (e.g., automatic transmission, number of previous owners, etc.).
Provide a comparative analysis of different machine learning models to find the optimal approach.
Dataset
Number of records: 299
Number of features: 9, including:
Year of manufacture (Year)
Current market price (Present Price)
Mileage (Kms Driven)
Fuel type (Fuel Type)
Seller type (Seller Type)
Transmission type (Transmission)
Number of previous owners (Owner)
Car model (for this project, model names are changed to examples like Toyota Camry, Honda Civic, BMW 3 Series, Mercedes C-Class, etc.)
Technologies used: Python, Pandas, Seaborn, Scikit-learn, Matplotlib
Analysis and Preprocessing
Data Cleaning:
Removing duplicate records and handling any missing entries (this dataset had no missing values).
Categorical Encoding:
Converting fuel type (Petrol, Diesel, CNG), seller type (Dealer, Individual), and transmission type (Manual, Automatic) into numerical values for machine learning.
Feature Impact Analysis:
Exploratory analysis revealed that fuel type, transmission type, and year of manufacture are significant in determining the resale price.
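The categorical encoding step can be sketched on a few rows. The snippet below is only an illustration: the rows are made up, and the one-hot variant is an alternative shown for comparison, not something the project itself reports using.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the project's dataset.
sample = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Seller_Type": ["Dealer", "Individual", "Dealer", "Dealer"],
    "Transmission": ["Manual", "Automatic", "Manual", "Manual"],
})

# Option 1: integer codes, the scheme described above.
fuel_codes = {"Petrol": 0, "Diesel": 1, "CNG": 2}
label_encoded = sample.assign(Fuel_Type=sample["Fuel_Type"].map(fuel_codes))

# Option 2 (assumption, not the project's choice): one-hot columns,
# which avoid implying an order such as Petrol < Diesel < CNG.
# drop_first=True drops one level so the remaining columns are not redundant.
one_hot = pd.get_dummies(sample, columns=["Fuel_Type"], drop_first=True, dtype=int)

print(label_encoded["Fuel_Type"].tolist())
print(sorted(one_hot.columns))
```

Integer codes are compact and work well for two-level features such as transmission. For a three-level feature like fuel type, a linear model treats the codes as evenly spaced quantities, which is why one-hot encoding is often worth trying.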
Machine Learning Models
1. Linear Regression
R² on training data: ~0.87
R² on test data: ~0.85
Provides decent performance with a simple structure; may require enhancements for more complex scenarios.
2. Lasso Regression
R² on training data: ~0.84
R² on test data: ~0.79
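Lower R² than Linear Regression on both splits. Lasso's L1 penalty is intended to curb overfitting by shrinking coefficients, although the train/test gap here (about 0.05) is not smaller than Linear Regression's (about 0.02).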
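Both models are scored with R², the share of variance in the target that the predictions explain. As a minimal sketch, the metric can be computed by hand and checked against scikit-learn's `r2_score`. The numbers below are made up for illustration and are not project results.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted prices, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: error left after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean price.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_pred))
```

An R² of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values are possible on test data when the model does worse than the mean.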
Key Results and Conclusions
Cars with automatic transmission and diesel fuel (e.g., a BMW 3 Series Diesel) often have higher resale values.
The number of previous owners (Owner) negatively affects the selling price.
Linear Regression showed better performance in this project, though more advanced models such as Random Forest or XGBoost could potentially yield higher accuracy.
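One way to follow up on the Random Forest suggestion is to compare models with k-fold cross-validation, which is less sensitive to a single train/test split on a dataset of this size. The sketch below runs on synthetic stand-in data (an assumption, since the CSV is not reproduced here), so its scores say nothing about the car dataset. The fold count and forest size are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (NOT the car dataset): 299 rows, 7 numeric features,
# with one deliberately nonlinear term in the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(299, 7))
y_demo = (
    2.0 * X_demo[:, 0]
    - 1.0 * X_demo[:, 1]
    + 0.5 * X_demo[:, 2] ** 2
    + rng.normal(scale=0.3, size=299)
)

# Shuffled 5-fold split so every row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

With the real data, `X_demo` and `y_demo` would be replaced by the encoded feature matrix and the selling price column.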

# InoVision - Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn import metrics

# =======================================================
# Assume that the CSV file includes car models such as
# "Toyota Camry", "Honda Civic", "BMW 3 Series", "Kia Optima", etc.
# =======================================================

# inovision - Loading the dataset (with modified car model names)
car_dataset = pd.read_csv("car_data_modified.csv")

# inovision - Data Cleaning (removing duplicates)
car_dataset.drop_duplicates(inplace=True)

# inovision - Encoding categorical features
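# .map() applies each mapping to its own column; a category missing from a mapping becomes NaN
car_dataset['Fuel_Type'] = car_dataset['Fuel_Type'].map({'Petrol': 0, 'Diesel': 1, 'CNG': 2})
car_dataset['Seller_Type'] = car_dataset['Seller_Type'].map({'Dealer': 0, 'Individual': 1})
car_dataset['Transmission'] = car_dataset['Transmission'].map({'Manual': 0, 'Automatic': 1})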

# inovision - Splitting data into features (X) and target (Y)
X = car_dataset.drop(['Car_Name','Selling_Price'], axis=1)
Y = car_dataset['Selling_Price']

# -- Sample parameter changes --
# For example, change test_size from 0.1 to 0.2 and random_state to 10
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=10
)

# inovision - Linear Regression model
lin_reg_model = LinearRegression()
lin_reg_model.fit(X_train, Y_train)

# inovision - Predict on training data
train_preds_lin = lin_reg_model.predict(X_train)
train_r2_lin = metrics.r2_score(Y_train, train_preds_lin)
print("Linear Regression (Train) R²:", train_r2_lin)

# inovision - Predict on test data
test_preds_lin = lin_reg_model.predict(X_test)
test_r2_lin = metrics.r2_score(Y_test, test_preds_lin)
print("Linear Regression (Test) R²:", test_r2_lin)

# inovision - Visualization (sample)
plt.figure(figsize=(6,4))
plt.scatter(Y_test, test_preds_lin, color='blue')
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Linear Regression: Actual vs. Predicted (inovision)")
plt.show()

# inovision - Lasso Regression model
# (sample change: set alpha to 0.01)
lasso_model = Lasso(alpha=0.01)
lasso_model.fit(X_train, Y_train)

# inovision - Predict on training data (Lasso)
train_preds_lasso = lasso_model.predict(X_train)
train_r2_lasso = metrics.r2_score(Y_train, train_preds_lasso)
print("Lasso Regression (Train) R²:", train_r2_lasso)

# inovision - Predict on test data (Lasso)
test_preds_lasso = lasso_model.predict(X_test)
test_r2_lasso = metrics.r2_score(Y_test, test_preds_lasso)
print("Lasso Regression (Test) R²:", test_r2_lasso)

# inovision - Visualization for Lasso
plt.figure(figsize=(6,4))
plt.scatter(Y_test, test_preds_lasso, color='red')
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices (Lasso)")
plt.title("Lasso Regression: Actual vs. Predicted (inovision)")
plt.show()
Final Notes
This project (by the InoVision Data Analytics Group) demonstrates how Linear Regression and Lasso Regression can effectively predict the resale prices of various car models (e.g., Toyota Camry, Honda Civic, etc.).
Removing duplicate entries keeps repeated listings from skewing the fit, and encoding categorical values is what allows the regression models to use those columns at all.
Adjusting parameters (such as test_size, random_state, and alpha) shows how fine-tuning can influence results.
For higher accuracy, advanced models like Random Forest or XGBoost and additional feature engineering can be explored.
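The effect of tuning `alpha` can be made concrete with a small sweep. The following is a sketch on synthetic data (an assumption, not the car dataset) in which only two of five features carry signal. The alpha grid is illustrative, and the split settings mirror the `test_size` and `random_state` values used in the script above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data (NOT the car dataset): only the first two features matter.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

# Each entry: (alpha, number of non-zero coefficients, test R2).
results = []
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X_tr, y_tr)
    n_nonzero = int(np.count_nonzero(model.coef_))
    test_r2 = model.score(X_te, y_te)  # score() returns R2 for regressors
    results.append((alpha, n_nonzero, test_r2))
    print(f"alpha={alpha:<6} non-zero coefficients={n_nonzero} test R2={test_r2:.3f}")
```

Small alpha values behave much like ordinary linear regression, while a large enough alpha shrinks every coefficient to zero and leaves only the intercept. A useful alpha usually sits between those extremes and is typically chosen by cross-validation.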