Sales Prediction: Predicting Sales of the Major Store Chain Rossmann

Problem Description

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

You are provided with historical sales data for 1,115 Rossmann stores. The task is to forecast the "Sales" column for the test set. Note that some stores in the dataset were temporarily closed for refurbishment.

Data Fields

Most of the fields are self-explanatory. The following are descriptions for those that aren't.

  • Id - an Id that represents a (Store, Date) tuple within the test set

  • Store - a unique Id for each store

  • Sales - the turnover for any given day (this is what you are predicting)

  • Customers - the number of customers on a given day

  • Open - an indicator for whether the store was open: 0 = closed, 1 = open

  • StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = none

  • SchoolHoliday - indicates whether the (Store, Date) pair was affected by the closure of public schools

  • StoreType - differentiates between 4 store models: a, b, c, d

  • Assortment - describes the assortment level: a = basic, b = extra, c = extended

  • CompetitionDistance - distance in meters to the nearest competitor store

  • CompetitionOpenSince[Month/Year] - the approximate year and month when the nearest competitor opened

  • Promo - indicates whether a store is running a promo on that day

  • Promo2 - a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating

  • Promo2Since[Year/Week] - the year and calendar week when the store started participating in Promo2

  • PromoInterval - describes the consecutive intervals in which Promo2 starts anew, naming the months when each round begins. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August and November of any given year for that store
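The two "Since" field pairs are easier to use once assembled into actual dates, and PromoInterval is easier to query once split into months. A minimal sketch on toy rows (the column names follow the field list above; the values are illustrative, not the real data):

```python
import pandas as pd

# Toy rows mimicking the store table (illustrative values only)
toy_store = pd.DataFrame({
    "CompetitionOpenSinceMonth": [9, 11, None],
    "CompetitionOpenSinceYear": [2008, 2007, None],
    "PromoInterval": [None, "Feb,May,Aug,Nov", None],
})

# Assemble year + month into a single timestamp (NaT where unknown)
toy_store["CompetitionOpenSince"] = pd.to_datetime(
    dict(year=toy_store["CompetitionOpenSinceYear"],
         month=toy_store["CompetitionOpenSinceMonth"],
         day=1),
    errors="coerce",
)

# Split the comma-separated PromoInterval into a list of start months
toy_store["Promo2StartMonths"] = toy_store["PromoInterval"].str.split(",")

print(toy_store[["CompetitionOpenSince", "Promo2StartMonths"]])
```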

Loading Libraries & Data

In [41]:
!pip3 install -r requirements.txt
Requirement already satisfied: dbrepo==1.7.3 in /usr/local/lib/python3.11/site-packages (from -r requirements.txt (line 1)) (1.7.3)
[... remaining pip output omitted: every pinned requirement and its dependencies were already satisfied ...]

[notice] A new release of pip is available: 24.2 -> 25.0.1
[notice] To update, run: python3.11 -m pip install --upgrade pip
In [42]:
%pip install python-dotenv
Requirement already satisfied: python-dotenv in /usr/local/lib/python3.11/site-packages (1.1.0)

Note: you may need to restart the kernel to use updated packages.
In [1]:
from dbrepo.RestClient import RestClient
import os
from dotenv import load_dotenv, dotenv_values 
load_dotenv() 

client = RestClient(endpoint="https://test.dbrepo.tuwien.ac.at", username=os.getenv("DBREPO_USERNAME"), password=os.getenv("DBREPO_PASSWORD"))
store = client.get_identifier_data(identifier_id="9627ec46-4ee6-4969-b14a-bda555fe34db") 
store

#store https://test.dbrepo.tuwien.ac.at/pid/9627ec46-4ee6-4969-b14a-bda555fe34db https://handle.test.datacite.org/10.82556/nqeg-gy34
#train https://test.dbrepo.tuwien.ac.at/pid/b1c59499-9c6e-42c2-af8f-840181e809db https://handle.test.datacite.org/10.82556/yb6j-jw41
#test https://test.dbrepo.tuwien.ac.at/pid/7cbb845c-21dd-4b60-b990-afa8754a0dd9 https://handle.test.datacite.org/10.82556/jerg-4b84
Out[1]:
      store storetype assortment competitiondistance competitionopensincemonth competitionopensinceyear promo2 promo2sinceweek promo2sinceyear     promointerval
0         1         c          a                1270                         9                     2008  false            None            None              None
1         2         a          a                 570                        11                     2007   true              13            2010   Jan,Apr,Jul,Oct
2         3         a          a               14130                        12                     2006   true              14            2011   Jan,Apr,Jul,Oct
3         4         c          c                 620                         9                     2009  false            None            None              None
4         5         a          a               29910                         4                     2015  false            None            None              None
...     ...       ...        ...                 ...                       ...                      ...    ...             ...             ...               ...
1110   1111         a          a                1900                         6                     2014   true              31            2013   Jan,Apr,Jul,Oct
1111   1112         c          c                1880                         4                     2006  false            None            None              None
1112   1113         a          c                9260                      None                     None  false            None            None              None
1113   1114         a          c                 870                      None                     None  false            None            None              None
1114   1115         d          c                5350                      None                     None   true              22            2012  Mar,Jun,Sept,Dec

1115 rows × 10 columns

In [7]:
import pandas as pd
data_chunks = []

page = 0
size = 10000
while True:
    chunk = client.get_table_data(
        database_id="18021ccb-88bd-41af-98db-835cb7dc7354",
        table_id="d81e3014-4ad0-4ea5-91df-8b1e90e87ff7",
        page=page,
        size=size
    )
    
    if chunk.empty:
        break
    data_chunks.append(chunk)
    page += 1

store_train = pd.concat(data_chunks, ignore_index=True)
store_train.head()
Out[7]:
   id  store  dayofweek       date  sales  customers  open  promo stateholiday  schoolholiday
0   0      1          5 2015-07-31   5263        555  true   true            0              1
1   1      2          5 2015-07-31   6064        625  true   true            0              1
2   2      3          5 2015-07-31   8314        821  true   true            0              1
3   3      4          5 2015-07-31  13995       1498  true   true            0              1
4   4      5          5 2015-07-31   4822        559  true   true            0              1
In [8]:
store_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 10 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   id             1017209 non-null  object
 1   store          1017209 non-null  object
 2   dayofweek      1017209 non-null  object
 3   date           1017209 non-null  object
 4   sales          1017209 non-null  object
 5   customers      1017209 non-null  object
 6   open           1017209 non-null  object
 7   promo          1017209 non-null  object
 8   stateholiday   1017209 non-null  object
 9   schoolholiday  1017209 non-null  object
dtypes: object(10)
memory usage: 77.6+ MB
In [9]:
store_train['open'] = store_train['open'].map({'true': 1, 'false': 0})
store_train['promo'] = store_train['promo'].map({'true': 1, 'false': 0})

dtype_mapping = {
    'id': 'int64',
    'store': 'int64',
    'dayofweek': 'int64',
    'date': 'datetime64[ns]',
    'sales': 'int64',
    'customers': 'int64',
    'open': 'bool',
    'promo': 'bool',
    'schoolholiday': 'int64'
}

for column, dtype in dtype_mapping.items():
    if dtype == 'bool':
        store_train[column] = store_train[column].astype('int').astype('bool')
    elif dtype == 'datetime64[ns]':
        store_train[column] = pd.to_datetime(store_train[column], errors='coerce')
    else:
        store_train[column] = pd.to_numeric(store_train[column], errors='coerce')
In [10]:
store_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 10 columns):
 #   Column         Non-Null Count    Dtype         
---  ------         --------------    -----         
 0   id             1017209 non-null  int64         
 1   store          1017209 non-null  int64         
 2   dayofweek      1017209 non-null  int64         
 3   date           1017209 non-null  datetime64[ns]
 4   sales          1017209 non-null  int64         
 5   customers      1017209 non-null  int64         
 6   open           1017209 non-null  bool          
 7   promo          1017209 non-null  bool          
 8   stateholiday   1017209 non-null  object        
 9   schoolholiday  1017209 non-null  int64         
dtypes: bool(2), datetime64[ns](1), int64(6), object(1)
memory usage: 64.0+ MB
In [11]:
data_chunks = []

page = 0
size = 10000 
while True:
    chunk = client.get_table_data(
        database_id="18021ccb-88bd-41af-98db-835cb7dc7354",
        table_id="17e50f8e-b94b-407d-97fc-abd9b9cb422e",
        page=page,
        size=size
    )
    
    if chunk.empty:
        break 
    data_chunks.append(chunk)
    page += 1 

store_test = pd.concat(data_chunks, ignore_index=True)
store_test.head()
Out[11]:
   id  store  dayofweek       date  open promo stateholiday  schoolholiday
0   1      1          4 2015-09-17   1.0  true            d              0
1   2      3          4 2015-09-17   1.0  true            d              0
2   3      7          4 2015-09-17   1.0  true            d              0
3   4      8          4 2015-09-17   1.0  true            d              0
4   5      9          4 2015-09-17   1.0  true            d              0
In [12]:
store_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41088 entries, 0 to 41087
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             41088 non-null  object
 1   store          41088 non-null  object
 2   dayofweek      41088 non-null  object
 3   date           41088 non-null  object
 4   open           41077 non-null  object
 5   promo          41088 non-null  object
 6   stateholiday   41088 non-null  object
 7   schoolholiday  41088 non-null  object
dtypes: object(8)
memory usage: 2.5+ MB
In [13]:
store_test['open'] = pd.to_numeric(store_test['open'], errors='coerce')
store_test['open'] = store_test['open'].fillna(0).astype(int)
store_test['promo'] = store_test['promo'].map({'true': 1, 'false': 0}).fillna(0).astype(int)

dtype_mapping = {
    'id': 'int64',
    'store': 'int64',
    'dayofweek': 'int64',
    'date': 'datetime64[ns]',
    'schoolholiday': 'int64'
}

for column, dtype in dtype_mapping.items():
    if 'int' in dtype:
        store_test[column] = pd.to_numeric(store_test[column], errors='coerce')
    else:
        store_test[column] = pd.to_datetime(store_test[column], errors='coerce')
In [14]:
store_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41088 entries, 0 to 41087
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   id             41088 non-null  int64         
 1   store          41088 non-null  int64         
 2   dayofweek      41088 non-null  int64         
 3   date           41088 non-null  datetime64[ns]
 4   open           41088 non-null  int64         
 5   promo          41088 non-null  int64         
 6   stateholiday   41088 non-null  object        
 7   schoolholiday  41088 non-null  int64         
dtypes: datetime64[ns](1), int64(6), object(1)
memory usage: 2.5+ MB
In [15]:
#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import matplotlib
import matplotlib.pylab as pylab

%matplotlib inline
matplotlib.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 8,6

%pip install statsmodels

import math
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import LassoLars
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import ElasticNet
Requirement already satisfied: statsmodels in /usr/local/lib/python3.11/site-packages (0.14.4)
[... dependency lines omitted: all statsmodels dependencies were already satisfied ...]

Note: you may need to restart the kernel to use updated packages.

Analysing the Rossmann Dataset

In [16]:
store_train.head()
Out[16]:
   id  store  dayofweek       date  sales  customers  open  promo stateholiday  schoolholiday
0   0      1          5 2015-07-31   5263        555  True   True            0              1
1   1      2          5 2015-07-31   6064        625  True   True            0              1
2   2      3          5 2015-07-31   8314        821  True   True            0              1
3   3      4          5 2015-07-31  13995       1498  True   True            0              1
4   4      5          5 2015-07-31   4822        559  True   True            0              1
In [17]:
store_train.tail()
Out[17]:
              id  store  dayofweek       date  sales  customers   open  promo stateholiday  schoolholiday
1017204  1017204   1111          2 2013-01-01      0          0  False  False            1              1
1017205  1017205   1112          2 2013-01-01      0          0  False  False            1              1
1017206  1017206   1113          2 2013-01-01      0          0  False  False            1              1
1017207  1017207   1114          2 2013-01-01      0          0  False  False            1              1
1017208  1017208   1115          2 2013-01-01      0          0  False  False            1              1
Checking Information about the Dataset
In [18]:
store_train.shape
Out[18]:
(1017209, 10)
In [19]:
#Checking Null Values
store_train.isnull().sum()
Out[19]:
id               0
store            0
dayofweek        0
date             0
sales            0
customers        0
open             0
promo            0
stateholiday     0
schoolholiday    0
dtype: int64

Summary Statistics of the Dataset

In [20]:
# Summary statistics
store_train.describe()
Out[20]:
                 id         store     dayofweek                           date         sales     customers  schoolholiday
count  1.017209e+06  1.017209e+06  1.017209e+06                        1017209  1.017209e+06  1.017209e+06   1.017209e+06
mean   5.086040e+05  5.584297e+02  3.998341e+00  2014-04-11 01:30:42.846061824  5.773819e+03  6.331459e+02   1.786467e-01
min    0.000000e+00  1.000000e+00  1.000000e+00            2013-01-01 00:00:00  0.000000e+00  0.000000e+00   0.000000e+00
25%    2.543020e+05  2.800000e+02  2.000000e+00            2013-08-17 00:00:00  3.727000e+03  4.050000e+02   0.000000e+00
50%    5.086040e+05  5.580000e+02  4.000000e+00            2014-04-02 00:00:00  5.744000e+03  6.090000e+02   0.000000e+00
75%    7.629060e+05  8.380000e+02  6.000000e+00            2014-12-12 00:00:00  7.856000e+03  8.370000e+02   0.000000e+00
max    1.017208e+06  1.115000e+03  7.000000e+00            2015-07-31 00:00:00  4.155100e+04  7.388000e+03   1.000000e+00
std    2.936431e+05  3.219087e+02  1.997391e+00                            NaN  3.849926e+03  4.644117e+02   3.830564e-01
In [21]:
# Number of stores in the dataset
store_train.store.nunique()
Out[21]:
1115
In [22]:
# Value_counts of StateHoliday Column
store_train['stateholiday'].value_counts()
Out[22]:
stateholiday
0    986159
1     31050
Name: count, dtype: int64
In [23]:
print(store_train.date.min(),'initial')
print(store_train.date.max(),'final')
2013-01-01 00:00:00 initial
2015-07-31 00:00:00 final

This tells us we have almost three years of data, from January 2013 to July 2015.
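The span can also be computed directly from the two boundary dates printed above:

```python
import pandas as pd

start = pd.Timestamp("2013-01-01")
end = pd.Timestamp("2015-07-31")
span = end - start
print(span.days)  # 941 days, roughly 2.6 years
```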

In [24]:
# extract year, month, day and week of year from "Date"

store_train['date'] = pd.to_datetime(store_train['date'])
store_train['year'] = store_train['date'].dt.year
store_train['month'] = store_train['date'].dt.month
store_train['day'] = store_train['date'].dt.day
# Timestamp.weekofyear was removed in pandas 2.x; use isocalendar().week instead
store_train['weekofyear'] = store_train['date'].dt.isocalendar().week.astype(int)
In [25]:
store_train.sort_values(by=['date','store'],inplace=True,ascending=[False,True])
store_train
Out[25]:
              id  store  dayofweek       date  sales  customers   open  promo stateholiday  schoolholiday  year  month  day  weekofyear
0              0      1          5 2015-07-31   5263        555   True   True            0              1  2015      7   31          31
1              1      2          5 2015-07-31   6064        625   True   True            0              1  2015      7   31          31
2              2      3          5 2015-07-31   8314        821   True   True            0              1  2015      7   31          31
3              3      4          5 2015-07-31  13995       1498   True   True            0              1  2015      7   31          31
4              4      5          5 2015-07-31   4822        559   True   True            0              1  2015      7   31          31
...          ...    ...        ...        ...    ...        ...    ...    ...          ...            ...   ...    ...  ...         ...
1017204  1017204   1111          2 2013-01-01      0          0  False  False            1              1  2013      1    1           1
1017205  1017205   1112          2 2013-01-01      0          0  False  False            1              1  2013      1    1           1
1017206  1017206   1113          2 2013-01-01      0          0  False  False            1              1  2013      1    1           1
1017207  1017207   1114          2 2013-01-01      0          0  False  False            1              1  2013      1    1           1
1017208  1017208   1115          2 2013-01-01      0          0  False  False            1              1  2013      1    1           1

1017209 rows × 14 columns

EDA on the Rossmann Dataset

Heatmap of the Rossmann Dataset

In [26]:
store_train['open'] = store_train['open'].astype(int)
store_train['promo'] = store_train['promo'].astype(int)
store_train = store_train.drop(['date'], axis=1)
In [29]:
numeric_features = store_train.select_dtypes(include=[np.number])
correlation_map = numeric_features.corr()
mask = np.tril(np.ones_like(correlation_map, dtype=bool))
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_map, mask=mask, vmax=1.0, vmin=-1.0,
            square=True, annot=True, fmt='.2f', linewidths=0.5, cbar_kws={"shrink": .5})
plt.title('Correlation Heatmap', fontsize=16)
plt.show()

As we can see in the graph given below, stores are mainly closed on Sunday.

In [30]:
sns.countplot(x='dayofweek',hue='open',data=store_train)
Out[30]:
<Axes: xlabel='dayofweek', ylabel='count'>
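The same observation can be checked numerically by computing the share of open days per weekday (7 = Sunday). A self-contained sketch on toy rows; the notebook itself would group store_train instead:

```python
import pandas as pd

# Toy sample mimicking store_train (illustrative values, not the real data)
toy = pd.DataFrame({
    "dayofweek": [1, 1, 2, 7, 7, 7, 7],
    "open":      [1, 1, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 column per group = fraction of open days per weekday
open_rate = toy.groupby("dayofweek")["open"].mean()
print(open_rate)
```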

Average sales are nearly twice as high when a promo is running.

In [31]:
#Impact of promo on sales
store_train['sales'] = pd.to_numeric(store_train['sales'], errors='coerce')

Promo_sales = pd.DataFrame(store_train.groupby('promo').agg({'sales':'mean'}))

sns.barplot(x=Promo_sales.index, y=Promo_sales['sales'])
plt.xlabel('Promo')
plt.ylabel('Average Sales')
plt.title('Average Sales by Promo')
plt.show()

As we can see, sales rise sharply every year in November and especially December, in the run-up to Christmas.

In [32]:
sns.catplot(x="month", y="sales", data=store_train, kind="point", aspect=2, height=10)

plt.xlabel('Month')
plt.ylabel('Sales')
plt.title('Sales by Month')
plt.show()
In [33]:
# Value Counts of SchoolHoliday Column
store_train.schoolholiday.value_counts()
Out[33]:
schoolholiday
0    835488
1    181721
Name: count, dtype: int64

As the pie chart shows, only about 18% of sales days are affected by a school holiday; most sales are unaffected by school holidays.

In [34]:
labels = 'Not-Affected' , 'Affected'
sizes = store_train.schoolholiday.value_counts()
colors = ['gold', 'silver']
explode = (0.1, 0.0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.title("Sales Affected by Schoolholiday or Not ?",fontsize=20)
plt.plot()
fig=plt.gcf()
fig.set_size_inches(6,6)
plt.show()

Transforming the StateHoliday Variable

As the pie chart shows, only 3% of sales days fall on a state holiday, so sales are barely affected by state holidays.

In [35]:
sizes = store_train['stateholiday'].value_counts()
labels = sizes.index
colors = sns.color_palette('pastel')[0:len(labels)]
explode = [0.1] + [0 for _ in range(len(labels)-1)]

fig, ax = plt.subplots()
ax.pie(sizes, explode=explode, labels=labels, colors=colors,
       autopct='%1.1f%%', shadow=True, startangle=180)
ax.axis('equal')
plt.title("State Holiday Types", fontsize=20)
fig.set_size_inches(6,6)
plt.show()

Since sales aren't much affected by state holidays, I'm removing this column.

In [36]:
store_train.drop('stateholiday',inplace=True,axis=1)

Histogram of sales. The spike at 0 appears because stores were closed much of the time.

In [37]:
# distribution of sales (histplot replaces the deprecated distplot)
fig, ax = plt.subplots()
fig.set_size_inches(11, 7)
sns.histplot(store_train['sales'], bins=40)
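If the spike at zero is unwanted, closed days can be filtered out via the open flag before plotting, leaving the distribution of actual trading days. A self-contained sketch on toy rows (illustrative values only):

```python
import pandas as pd

# Toy sample mimicking store_train (illustrative values, not the real data)
toy = pd.DataFrame({
    "sales": [0, 0, 5263, 6064, 8314],
    "open":  [False, False, True, True, True],
})

# Keep only days when the store was open before plotting the distribution
open_sales = toy.loc[toy["open"], "sales"]
print(open_sales.tolist())
```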

Sales vs Customers

In [38]:
store_train['sales'] = pd.to_numeric(store_train['sales'], errors='coerce')
store_train['customers'] = pd.to_numeric(store_train['customers'], errors='coerce')

sns.lmplot(x='sales', y='customers', data=store_train, palette='seismic', height=5, aspect=1, line_kws={'color':'blue'})

plt.title('Linear Relation Between Sales and Customers', fontsize=15)
plt.show()

Analysing the Store Dataset

In [39]:
store.head(5)
Out[39]:
   store storetype assortment competitiondistance competitionopensincemonth competitionopensinceyear promo2 promo2sinceweek promo2sinceyear    promointerval
0      1         c          a                1270                         9                     2008  false            None            None             None
1      2         a          a                 570                        11                     2007   true              13            2010  Jan,Apr,Jul,Oct
2      3         a          a               14130                        12                     2006   true              14            2011  Jan,Apr,Jul,Oct
3      4         c          c                 620                         9                     2009  false            None            None             None
4      5         a          a               29910                         4                     2015  false            None            None             None
In [40]:
store.tail()
Out[40]:
      store storetype assortment competitiondistance competitionopensincemonth competitionopensinceyear promo2 promo2sinceweek promo2sinceyear     promointerval
1110   1111         a          a                1900                         6                     2014   true              31            2013   Jan,Apr,Jul,Oct
1111   1112         c          c                1880                         4                     2006  false            None            None              None
1112   1113         a          c                9260                      None                     None  false            None            None              None
1113   1114         a          c                 870                      None                     None  false            None            None              None
1114   1115         d          c                5350                      None                     None   true              22            2012  Mar,Jun,Sept,Dec
Checking Information about the Dataset
In [41]:
store.shape
Out[41]:
(1115, 10)
In [42]:
# Checking info of the data: dtypes, rows and columns
store.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1115 entries, 0 to 1114
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   store                      1115 non-null   object
 1   storetype                  1115 non-null   object
 2   assortment                 1115 non-null   object
 3   competitiondistance        1112 non-null   object
 4   competitionopensincemonth  761 non-null    object
 5   competitionopensinceyear   761 non-null    object
 6   promo2                     1115 non-null   object
 7   promo2sinceweek            571 non-null    object
 8   promo2sinceyear            571 non-null    object
 9   promointerval              571 non-null    object
dtypes: object(10)
memory usage: 87.2+ KB
In [43]:
#Checking Null Values
store.isnull().sum()
Out[43]:
store                          0
storetype                      0
assortment                     0
competitiondistance            3
competitionopensincemonth    354
competitionopensinceyear     354
promo2                         0
promo2sinceweek              544
promo2sinceyear              544
promointerval                544
dtype: int64

Heatmap for null values

In [44]:
# creating heatmap for null values
plt.figure(figsize=(10,6))
sns.heatmap(store.isnull(),yticklabels= False, cbar= False, cmap= 'gnuplot')
Out[44]:
<Axes: >

Distribution Of Different Store Types

In [45]:
sizes = store.storetype.value_counts()
labels = sizes.index  # take labels from value_counts so they stay aligned with the counts
colors = ['orange', 'green', 'red', 'pink']
explode = (0.1, 0.0, 0.15, 0.0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.title("Distribution of different StoreTypes")
fig = plt.gcf()
fig.set_size_inches(6, 6)
plt.show()

Remove features with high percentages of missing values¶

We can see that some features have a high percentage of missing values, so they won't be reliable indicators; we will remove features with more than 30% missing values.¶
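The same 30% rule can be written generically instead of listing columns by hand. This is just a sketch on a toy frame (the helper name and data are made up for illustration, not taken from the store table):

```python
import pandas as pd
import numpy as np

def cols_above_missing_threshold(df, threshold=0.3):
    """Return the columns whose fraction of missing values exceeds `threshold`."""
    missing_frac = df.isnull().mean()  # mean of a boolean mask = fraction of NaNs
    return list(missing_frac[missing_frac > threshold].index)

# toy frame: 'b' is 75% missing, 'c' only 25%
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, np.nan, np.nan, np.nan],
    "c": [1, 2, np.nan, 4],
})
to_drop = cols_above_missing_threshold(df)
df = df.drop(to_drop, axis=1)
```

On the store dataset this would select the same five columns dropped manually in the next cell.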

In [46]:
# remove features
store = store.drop(['competitionopensincemonth', 'competitionopensinceyear','promo2sinceweek',
                     'promo2sinceyear', 'promointerval'], axis=1)

Replace missing values in features with low percentages of missing values¶

In [47]:
# CompetitionDistance is distance in meters to the nearest competitor store
# let's first have a look at its distribution

sns.histplot(pd.to_numeric(store.competitiondistance, errors='coerce').dropna(), kde=True)
plt.title("Distribution of Store Competition Distance")
Out[47]:
Text(0.5, 1.0, 'Distribution of Store Competition Distance')

The distribution is right skewed, so we'll replace missing values with the median.¶
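A tiny illustration of why the median is the safer fill value for a right-skewed variable (toy numbers assumed for the example, not the real distances):

```python
import pandas as pd

# one large value on the right tail, as with competition distances
s = pd.Series([100.0, 200.0, 300.0, 400.0, 20000.0])

skew = s.skew()      # positive: the distribution is right-skewed
mean = s.mean()      # dragged upward by the tail
median = s.median()  # robust central value, better suited for imputation
```

Here the mean lands far above four of the five observations, while the median stays in the bulk of the data.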

In [48]:
# replace missing values in CompetitionDistance with median for the store dataset
store['competitiondistance'] = pd.to_numeric(store['competitiondistance'], errors='coerce')
store['competitiondistance'] = store['competitiondistance'].fillna(store['competitiondistance'].median())

Checking stores with their assortment type

In [49]:
#checking stores with their assortment type 
sns.set_style("whitegrid")
fig, ax = plt.subplots()
fig.set_size_inches(11, 7)
store_type=sns.countplot(x='storetype',hue='assortment', data=store,palette="inferno")

for p in store_type.patches:
    store_type.annotate(f'\n{p.get_height()}', (p.get_x()+0.15, p.get_height()),ha='center', va='top', color='white', size=10)

We can see that there are no significant differences in sales across these three years.

In [50]:
#plotting year vs sales
sns.catplot(x='year',y='sales',data=store_train, height=4, aspect=4 );

Merging Two Datasets¶

In [51]:
store_train['store'] = store_train['store'].astype(int)
store['store'] = store['store'].astype(int)

train = pd.merge(store_train, store, on='store', how='left')
test = pd.merge(store_test, store, on='store', how='left')
train
Out[51]:
              id  store  dayofweek  sales  customers  open  promo  schoolholiday  year  month  day  weekofyear storetype assortment  competitiondistance promo2
0              0      1          5   5263        555     1      1              1  2015      7   31          31         c          a               1270.0  false
1              1      2          5   6064        625     1      1              1  2015      7   31          31         a          a                570.0   true
2              2      3          5   8314        821     1      1              1  2015      7   31          31         a          a              14130.0   true
3              3      4          5  13995       1498     1      1              1  2015      7   31          31         c          c                620.0  false
4              4      5          5   4822        559     1      1              1  2015      7   31          31         a          a              29910.0  false
...          ...    ...        ...    ...        ...   ...    ...            ...   ...    ...  ...         ...       ...        ...                  ...    ...
1017204  1017204   1111          2      0          0     0      0              1  2013      1    1           1         a          a               1900.0   true
1017205  1017205   1112          2      0          0     0      0              1  2013      1    1           1         c          c               1880.0  false
1017206  1017206   1113          2      0          0     0      0              1  2013      1    1           1         a          c               9260.0  false
1017207  1017207   1114          2      0          0     0      0              1  2013      1    1           1         a          c                870.0  false
1017208  1017208   1115          2      0          0     0      0              1  2013      1    1           1         d          c               5350.0   true

1017209 rows × 16 columns

In [52]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1017209 entries, 0 to 1017208
Data columns (total 16 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   id                   1017209 non-null  int64  
 1   store                1017209 non-null  int64  
 2   dayofweek            1017209 non-null  int64  
 3   sales                1017209 non-null  int64  
 4   customers            1017209 non-null  int64  
 5   open                 1017209 non-null  int64  
 6   promo                1017209 non-null  int64  
 7   schoolholiday        1017209 non-null  int64  
 8   year                 1017209 non-null  int64  
 9   month                1017209 non-null  int64  
 10  day                  1017209 non-null  int64  
 11  weekofyear           1017209 non-null  int64  
 12  storetype            1017209 non-null  object 
 13  assortment           1017209 non-null  object 
 14  competitiondistance  1017209 non-null  float64
 15  promo2               1017209 non-null  object 
dtypes: float64(1), int64(12), object(3)
memory usage: 124.2+ MB

EDA On Merged Dataset¶

Heatmap Of Merged Dataset¶

In [53]:
numeric_columns = train.select_dtypes(include=[np.number])

corr_matrix = numeric_columns.corr().abs()

plt.subplots(figsize=(20, 12))
sns.heatmap(corr_matrix, annot=True)
plt.show()
In [54]:
train["avg_customer_sales"] = train.sales/train.customers
In [55]:
f, ax = plt.subplots(2, 3, figsize = (20,10))

store.groupby("storetype")["store"].count().plot(kind = "bar", ax = ax[0, 0], title = "Total StoreTypes in the Dataset")
train.groupby("storetype")["sales"].sum().plot(kind = "bar", ax = ax[0,1], title = "Total Sales of the StoreTypes")
train.groupby("storetype")["customers"].sum().plot(kind = "bar", ax = ax[0,2], title = "Total nr Customers of the StoreTypes")
train.groupby("storetype")["sales"].mean().plot(kind = "bar", ax = ax[1,0], title = "Average Sales of StoreTypes")
train.groupby("storetype")["avg_customer_sales"].mean().plot(kind = "bar", ax = ax[1,1], title = "Average Spending per Customer")
train.groupby("storetype")["customers"].mean().plot(kind = "bar", ax = ax[1,2], title = "Average Customers per StoreType")

plt.subplots_adjust(hspace = 0.3)
plt.show()

As we can see from the graphs, StoreType A has the most stores, sales, and customers. However, StoreType D has the highest average spending per customer. StoreType B, with only 17 stores, has the highest average number of customers.

Let's move on to the promotions¶

In [56]:
train['promo2'] = train['promo2'].map({'true': 1, 'false': 0}).fillna(0).astype(int)
test['promo2'] = test['promo2'].map({'true': 1, 'false': 0}).fillna(0).astype(int)

sns.catplot(data=train, x="month", y="sales", 
            col='promo', # per store type in cols
            hue='promo2', 
            row="year", 
            kind="point", 
            height=5, aspect=1.5)

plt.subplots_adjust(top=0.9)
plt.suptitle("Sales by Month, Promo, and Year", fontsize=16)

plt.show()
# So, as expected, if the stores are running a promotion, sales are higher.
# Store-level promotions (Promo) also lift sales more than the seasonal Promo2; however, there is no visible yearly trend.

As we can see, sales are higher when a promo is running.

In [57]:
sns.catplot(data=train, x="dayofweek", y="sales", hue="promo", kind="point", height=6, aspect=1.5)

plt.title("Sales by Day of Week and Promo", fontsize=16)
plt.show()
In [58]:
print("""There are no promotions on weekends. However, sales are much higher when stores run a promotion.
Sales spike on Sundays for the stores that open then, which is no wonder.""")
print("There are", train[(train.open == 1) & (train.dayofweek == 7)].store.unique().shape[0], "stores open on Sundays")
There are no promotions on weekends. However, sales are much higher when stores run a promotion.
Sales spike on Sundays for the stores that open then, which is no wonder.
There are 33 stores open on Sundays

Let's see the trends on a yearly basis.

In [59]:
sns.catplot(data=train, x="month", y="sales", col="year", hue="storetype", kind="point", height=6, aspect=1.5)

plt.subplots_adjust(top=0.9)
plt.suptitle("Sales by Month, Year, and Store Type", fontsize=16)

plt.show()
# We can see seasonality, but no trend: sales stay roughly constant from year to year.

What about CompetitionDistance? What kind of impact does it have on sales?

In [60]:
train['competitiondistance'] = pd.to_numeric(train['competitiondistance'], errors='coerce')
train = train.dropna(subset=['competitiondistance'])
# The observations are continuous numbers, so we bin them into categories. Let's create a new variable.
train["competitiondistance_cat"] = pd.cut(train["competitiondistance"], bins=5)

print(train['competitiondistance_cat'].head())
0     (-55.84, 15188.0]
1     (-55.84, 15188.0]
2     (-55.84, 15188.0]
3     (-55.84, 15188.0]
4    (15188.0, 30356.0]
Name: competitiondistance_cat, dtype: category
Categories (5, interval[float64, right]): [(-55.84, 15188.0] < (15188.0, 30356.0] < (30356.0, 45524.0] < (45524.0, 60692.0] < (60692.0, 75860.0]]
In [61]:
f, ax = plt.subplots(1,2, figsize = (15,5))

train.groupby(by = "competitiondistance_cat", observed=False).sales.mean().plot(kind = "bar", title = "Average Total Sales by Competition Distance", ax = ax[0])
train.groupby(by = "competitiondistance_cat", observed=False).customers.mean().plot(kind = "bar", title = "Average Total Customers by Competition Distance", ax = ax[1])

# It is pretty clear: if the competition is far away, the stores perform better (both sales and customers)
Out[61]:
<Axes: title={'center': 'Average Total Customers by Competition Distance'}, xlabel='competitiondistance_cat'>
In [62]:
train.drop(['avg_customer_sales','competitiondistance_cat'],axis=1,inplace=True)

The box plot shows that we have very high outliers in sales

In [63]:
#checking outliers in sales
sns.boxplot(train['sales'])
Out[63]:
<Axes: ylabel='sales'>

Removing Outliers Of Sales Column

In [64]:
#removing outliers
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
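A quick sanity check of the IQR fences on a toy series (values assumed for illustration; the function is repeated so the sketch is self-contained):

```python
import pandas as pd

def remove_outlier(df_in, col_name):
    # same 1.5*IQR rule as in the notebook cell above
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    return df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]

toy = pd.DataFrame({"sales": [10, 11, 12, 13, 14, 1000]})
clean = remove_outlier(toy, "sales")  # the 1000 row falls outside the high fence
```

Note that rows exactly on a fence are also dropped, since the comparisons are strict; with sales data that edge case is harmless.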
In [65]:
# defining new variable after removing outliers
train= remove_outlier(train, 'sales')

Conclusion of the analysis:¶

Sales are highly correlated with the number of customers.

The best-selling and most crowded store type is A.

StoreType B has the lowest average sales per customer, so customers likely visit this type only for small purchases.

StoreType D has the highest average basket value.

Promo runs only on weekdays.

For all stores, promotions increase both sales and customers.

More stores are open during school holidays than state holidays.

Stores that are open during school holidays have higher sales than on normal days.

Sales increase during Christmas week; this might be because people buy more beauty products for the Christmas celebrations.

Promo2 does not seem to be correlated with any significant change in sales.

Missing values in CompetitionOpenSinceYear/Month do not indicate an absence of competition, since CompetitionDistance is non-null even where those two fields are null.

Drop Subsets Of Data That Might Cause Bias¶

In [66]:
# where stores are closed, they won't generate sales, so we will remove that part of the dataset
train = train[train.open != 0]
In [67]:
# Open isn't a variable anymore, so we'll drop it too
train = train.drop('open', axis=1)
In [68]:
# Check if there are any open stores with zero sales
train['store'] = pd.to_numeric(train['store'], errors='coerce')
train[train.sales == 0]['store'].sum()
Out[68]:
np.int64(31460)
In [69]:
# see the share of total sales contributed by these zero-sales rows
train[train.sales == 0]['sales'].sum()/train.sales.sum()
Out[69]:
np.float64(0.0)
In [70]:
# remove this part of data to avoid bias
train = train[train.sales != 0]
In [91]:
train_new=train.copy()
In [92]:
train_new = pd.get_dummies(train_new, columns=['storetype', 'assortment'], dtype=int)
In [93]:
train_new.head()
Out[93]:
   id  store  dayofweek  sales  customers  promo  schoolholiday  year  month  day  weekofyear  competitiondistance  promo2  storetype_a  storetype_b  storetype_c  storetype_d  assortment_a  assortment_b  assortment_c
0   0      1          5   5263        555      1              1  2015      7   31          31               1270.0       0            0            0            1            0             1             0             0
1   1      2          5   6064        625      1              1  2015      7   31          31                570.0       1            1            0            0            0             1             0             0
2   2      3          5   8314        821      1              1  2015      7   31          31              14130.0       1            1            0            0            0             1             0             0
3   3      4          5  13995       1498      1              1  2015      7   31          31                620.0       0            0            0            1            0             0             0             1
4   4      5          5   4822        559      1              1  2015      7   31          31              29910.0       0            1            0            0            0             1             0             0

From the plot it can be seen that most of the sales occur on the first and last days of the week

In [94]:
train_new['dayofweek'] = pd.to_numeric(train_new['dayofweek'], errors='coerce')

colors = sns.color_palette("husl", 7)  # 7 unique colors for 7 days

plt.figure(figsize=(15, 8))
sns.barplot(x='dayofweek', y='sales', data=train_new, hue='dayofweek', palette=colors, legend=False)

plt.xlabel('Day of the Week')
plt.ylabel('Sales')
plt.title('Sales in Terms of Days of the Week')
plt.show()

Setting Features and Target Variables¶

In [95]:
X = train_new.drop(['sales', 'store', 'year'] , axis = 1)
y= train_new.sales
In [96]:
X.shape
Out[96]:
(817644, 17)
In [97]:
X.head()
Out[97]:
   id  dayofweek  customers  promo  schoolholiday  month  day  weekofyear  competitiondistance  promo2  storetype_a  storetype_b  storetype_c  storetype_d  assortment_a  assortment_b  assortment_c
0   0          5        555      1              1      7   31          31               1270.0       0            0            0            1            0             1             0             0
1   1          5        625      1              1      7   31          31                570.0       1            1            0            0            0             1             0             0
2   2          5        821      1              1      7   31          31              14130.0       1            1            0            0            0             1             0             0
3   3          5       1498      1              1      7   31          31                620.0       0            0            0            1            0             0             0             1
4   4          5        559      1              1      7   31          31              29910.0       0            1            0            0            0             1             0             0
In [98]:
y.head()
Out[98]:
0     5263
1     6064
2     8314
3    13995
4     4822
Name: sales, dtype: int64

Splitting Dataset Into Training Set and Test Set

In [99]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=0)
In [100]:
columns=X_train.columns

Implementing Supervised Machine Learning algorithms.¶

In [114]:
import joblib

def plot_feature_importance(importance,names,model_type):

  #Create arrays from feature importance and feature names
  feature_importance = np.array(importance)
  feature_names = np.array(names)

  #Create a DataFrame using a Dictionary
  data={'feature_names':feature_names,'feature_importance':feature_importance}
  fi_df = pd.DataFrame(data)

  #Sort the DataFrame in order decreasing feature importance
  fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)

  #Define size of bar plot
  plt.figure(figsize=(10,8))
  #Plot seaborn bar chart
  colors = sns.color_palette("viridis", len(fi_df))
  sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'], hue=fi_df['feature_names'], palette=colors, legend=False)
  #Add chart labels
  plt.title(model_type + ' FEATURE IMPORTANCE')
  plt.xlabel('FEATURE IMPORTANCE')
  plt.ylabel('FEATURE NAMES')
  plt.show()

def rmse(x, y):
    return sqrt(mean_squared_error(x, y))

def mape(x, y): 
    return np.mean(np.abs((x - y) / x)) * 100

def model_train_evaluate(model, model_name, X_train, y_train, X_test, y_test, save_model=False, visualize=False):
    model.fit(X_train, y_train)
    
    if save_model:
        joblib.dump(model, f'{model_name}_model.pkl')
    
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    train_rmse = rmse(y_train, y_train_pred)
    test_rmse = rmse(y_test, y_test_pred)
    
    train_mape = mape(y_train, y_train_pred)
    test_mape = mape(y_test, y_test_pred)
    
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    
    results = {
        'Model': model_name,
        'Train_Score': train_score,
        'Test_Score': test_score,
        'Train_RMSE': train_rmse,
        'Test_RMSE': test_rmse,
        'Train_MAPE': train_mape,
        'Test_MAPE': test_mape
    }

    if visualize:
        rf_prd = y_test_pred[:100]
        rf_real = y_test[:100]
        
        dataset_rf = pd.DataFrame({'Real': rf_real, 'Predicted': rf_prd})
        
        dataset_rf['diff'] = (dataset_rf['Real'] - dataset_rf['Predicted']).abs()

        print("Sample of 4 observations from the dataset:")
        print(dataset_rf.sample(4))
        print("Statistical description of the predictions and actual values:")
        print(dataset_rf.describe())

        dataset_rf.plot.hist(subplots=True, legend=False, figsize=(10, 6))
        plt.suptitle(f'Histograms of Actual and Predicted Sales for {model_name}')
        plt.show()
        
        sns.lmplot(x='Real', y='Predicted', data=dataset_rf, line_kws={'color': 'black'})
        plt.title(f'{model_name} - Real vs Predicted')
        plt.show()

    if visualize and model_name=="Random Forest Regression":
        plot_feature_importance(model.feature_importances_, columns[:], model_name)
    
    return results

1. Linear Regression (OLS)¶

In [102]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In [105]:
# Fitting Multiple Linear Regression to the Training set
results = []

lin_reg = LinearRegression()
results.append(model_train_evaluate(lin_reg, "Linear Regression", X_train, y_train, X_test, y_test, save_model=True, visualize=True))
Sample of 4 observations from the dataset:
        Real    Predicted        diff
384625  8749  9345.056106  596.056106
631227  5293  4779.596754  513.403246
513766  5793  4939.425473  853.574527
591127  3376  3319.097584   56.902416
Statistical description of the predictions and actual values:
               Real     Predicted         diff
count    100.000000    100.000000   100.000000
mean    6775.070000   6778.161351   923.811435
std     2534.335345   2342.036112   734.235102
min     1072.000000   1527.067231    10.571062
25%     5004.750000   5149.826876   374.958548
50%     6548.000000   6760.217687   828.035068
75%     8598.250000   7980.485089  1308.631333
max    13001.000000  14235.858687  3861.033679

Inferences On Linear Regression Coefficients¶

In [87]:
X = sm.add_constant(X) ## let's add an intercept (beta_0) to our model
model = sm.OLS(y, X).fit() ## sm.OLS(output, input)
predictions = model.predict(X)

# Print out the statistics
model.summary()
Out[87]:
OLS Regression Results
Dep. Variable: sales R-squared: 0.784
Model: OLS Adj. R-squared: 0.784
Method: Least Squares F-statistic: 1.976e+05
Date: Mon, 28 Apr 2025 Prob (F-statistic): 0.00
Time: 10:09:04 Log-Likelihood: -6.9209e+06
No. Observations: 817644 AIC: 1.384e+07
Df Residuals: 817628 BIC: 1.384e+07
Df Model: 15
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -833.3709 5.867 -142.052 0.000 -844.869 -821.872
id -0.0004 4.34e-06 -98.138 0.000 -0.000 -0.000
dayofweek -21.3047 0.781 -27.294 0.000 -22.835 -19.775
customers 7.2230 0.005 1479.958 0.000 7.213 7.233
promo 1088.0630 2.753 395.296 0.000 1082.668 1093.458
schoolholiday 29.3647 3.307 8.880 0.000 22.883 35.846
month 31.1657 1.424 21.881 0.000 28.374 33.957
day 0.6876 0.151 4.553 0.000 0.392 0.984
weekofyear -3.7812 0.328 -11.529 0.000 -4.424 -3.138
competitiondistance 0.0240 0.000 141.247 0.000 0.024 0.024
promo2 310.4796 2.646 117.330 0.000 305.293 315.666
storetype_a 107.9113 4.771 22.619 0.000 98.561 117.262
storetype_b -2105.0012 12.659 -166.286 0.000 -2129.812 -2080.190
storetype_c -41.5152 5.298 -7.835 0.000 -51.900 -31.131
storetype_d 1205.2343 5.013 240.438 0.000 1195.410 1215.059
assortment_a 1164.9609 6.439 180.928 0.000 1152.341 1177.581
assortment_b -3450.2511 15.307 -225.399 0.000 -3480.253 -3420.249
assortment_c 1451.9194 6.622 219.246 0.000 1438.940 1464.899
Omnibus: 49881.055 Durbin-Watson: 1.736
Prob(Omnibus): 0.000 Jarque-Bera (JB): 152029.909
Skew: 0.297 Prob(JB): 0.00
Kurtosis: 5.027 Cond. No. 6.50e+20


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 6.71e-25. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

2. LARS Lasso Regression¶

In [106]:
lasso = LassoLars(alpha=0.3, fit_intercept=False)
results.append(model_train_evaluate(lasso, "Lasso Regression", X_train, y_train, X_test, y_test, save_model=True, visualize=True))
Sample of 4 observations from the dataset:
         Real     Predicted         diff
430308  10009   9475.100647   533.899353
553403   7685   7920.228289   235.228289
906696   7159   7527.339134   368.339134
360297  13001  14209.177165  1208.177165
Statistical description of the predictions and actual values:
               Real     Predicted         diff
count    100.000000    100.000000   100.000000
mean    6775.070000   6776.844833   924.102549
std     2534.335345   2336.927414   731.734395
min     1072.000000   1537.371381    10.125225
25%     5004.750000   5149.483863   378.671831
50%     6548.000000   6759.511505   822.487562
75%     8598.250000   7978.073258  1306.251548
max    13001.000000  14209.177165  3821.358716

3. Decision Tree Regression¶

In [107]:
tree = DecisionTreeRegressor()
results.append(model_train_evaluate(tree, "Decision Tree Regression", X_train, y_train, X_test, y_test, save_model=True, visualize=True))
Sample of 4 observations from the dataset:
         Real  Predicted    diff
158975   4185     4281.0    96.0
362622  10403    10181.0   222.0
31014    4789     3845.0   944.0
27779    7213     5996.0  1217.0
Statistical description of the predictions and actual values:
               Real     Predicted        diff
count    100.000000    100.000000   100.00000
mean    6775.070000   6635.510000   477.56000
std     2534.335345   2405.050761   474.93022
min     1072.000000   1176.000000     5.00000
25%     5004.750000   5122.500000   142.25000
50%     6548.000000   6555.000000   282.00000
75%     8598.250000   8406.000000   602.50000
max    13001.000000  12228.000000  2883.00000

Decision Tree With Hyper Parameter Tuning¶

In [134]:
# # Another script that takes too long to run: searching for the right parameters for the tree
# tree = DecisionTreeRegressor()

# params = {
#          'min_samples_split':[2,3,5,7],
#          'min_samples_leaf':[6,8,10],
#          }

# grid = RandomizedSearchCV(estimator=tree, param_distributions=params, verbose=True, cv=10)
# # choosing 10 K-folds makes sure I went through all of the data and didn't miss any pattern

# grid.fit(X_train, y_train)
# grid.best_params_

I trained the model with these hyperparameters offline; to avoid re-running the search every time, I record the result here.

Here are our best parameters for the Decision Tree:

{ min_samples_split=5, min_samples_leaf=8 }

In [108]:
tree_tuned = DecisionTreeRegressor(min_samples_leaf=8,min_samples_split=5)
results.append(model_train_evaluate(tree_tuned, "Tuned Decision Tree Regression", X_train, y_train, X_test, y_test, save_model=True, visualize=True))
Sample of 4 observations from the dataset:
          Real     Predicted        diff
97497     9158   9119.461538   38.538462
544880   10150  10482.875000  332.875000
1013343   3132   3293.153846  161.153846
580379    6369   6810.000000  441.000000
Statistical description of the predictions and actual values:
               Real     Predicted         diff
count    100.000000    100.000000   100.000000
mean    6775.070000   6822.308835   428.686999
std     2534.335345   2495.903674   427.737712
min     1072.000000   1194.000000     6.875000
25%     5004.750000   5275.322917   124.783333
50%     6548.000000   6792.153846   295.442308
75%     8598.250000   8648.602679   564.884615
max    13001.000000  12954.571429  2122.636364

4. Support Vector Regression¶

In [ ]:
#%%time
#from sklearn.svm import SVR
#svr=SVR()
#results.append(model_train_evaluate(svr, "Support Vector Regression", X_train, y_train, X_test, y_test, save_model=True, visualize=True))
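The kernel SVR above was left commented out, presumably because its training time grows roughly quadratically with the number of rows. A hedged alternative sketch, on synthetic stand-in data: `LinearSVR` scales linearly with sample count and pairs naturally with feature scaling.

```python
# Sketch: a linear SVR pipeline as a cheaper stand-in for kernel SVR on large data.
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

X, y = make_regression(n_samples=2000, n_features=10, noise=5.0, random_state=0)

# scaling matters for SVMs; LinearSVR keeps training time manageable
model = make_pipeline(StandardScaler(), LinearSVR(max_iter=5000, random_state=0))
model.fit(X, y)
print(round(model.score(X, y), 3))
```

Another common option is fitting the kernel SVR on a random subsample to estimate whether the extra cost is worth it.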

5. K-Nearest Neighbors Regression¶

In [109]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors = 30)

results.append(model_train_evaluate(knn, "k-Nearest Neighbors Regression", X_train, y_train, X_test, y_test, save_model=True, visualize=True))
Sample of 4 observations from the dataset:
        Real    Predicted         diff
580379  6369  4190.800000  2178.200000
27779   7213  6329.800000   883.200000
557319  9581  8049.000000  1532.000000
746746  9630  7318.533333  2311.466667
Statistical description of the predictions and actual values:
               Real     Predicted         diff
count    100.000000    100.000000   100.000000
mean    6775.070000   6627.179000  1177.540333
std     2534.335345   1786.970913   788.663992
min     1072.000000   2555.533333    75.233333
25%     5004.750000   5286.408333   584.700000
50%     6548.000000   6622.183333  1104.216667
75%     8598.250000   7830.908333  1515.850000
max    13001.000000  10637.933333  3972.533333

6. Random Forest With Hyper Parameter Tuning¶

In [143]:
# # Hyperparameter search for the random forest; left commented out because it takes very long to run.
# rfr = RandomForestRegressor(n_jobs=-1)

# params = {
#          'n_estimators': [40, 50, 60, 70, 80, 90],
#          'min_samples_split': [2, 3, 6, 8],
#          'min_samples_leaf': [1, 2, 3, 4],
#          'max_depth': [None, 5, 15, 30]
#          }

# # With this many combinations to search, RandomizedSearchCV is a better option than GridSearchCV.
# grid = RandomizedSearchCV(estimator=rfr, param_distributions=params, verbose=True, cv=10)

# # 10 folds ensure every part of the training data is used for validation.
# grid.fit(X_train, y_train)
# grid.best_params_

The search was run once and its result recorded here, so it does not have to be re-run every time.

The best parameters found for the Random Forest:

{ n_estimators=80, min_samples_split=2, min_samples_leaf=1, max_depth=None }

In [112]:
#%%time
rdf = RandomForestRegressor(n_estimators=80,min_samples_split=2, min_samples_leaf=1,max_depth=None,n_jobs=-1)
results.append(model_train_evaluate(rdf, "Random Forest Regression", X_train, y_train, X_test, y_test, save_model=True, visualize=True))
Sample of 4 observations from the dataset:
          Real   Predicted      diff
360297   13001  12808.2250  192.7750
513766    5793   5471.0000  322.0000
178721    2026   2195.9625  169.9625
1015614   4888   4768.1125  119.8875
Statistical description of the predictions and actual values:
               Real     Predicted         diff
count    100.000000    100.000000   100.000000
mean    6775.070000   6747.943250   388.222250
std     2534.335345   2484.149906   354.888827
min     1072.000000   1110.562500     7.700000
25%     5004.750000   5134.940625   131.493750
50%     6548.000000   6518.968750   289.362500
75%     8598.250000   8476.671875   562.512500
max    13001.000000  12808.225000  2043.125000
/var/folders/4p/r3zy7pns7t38_zwr2sjgxj8c0000gp/T/ipykernel_99967/4195536892.py:20: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.
  sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'], palette=colors)

Feature Importance On Random Forest Regressor¶

As the results show, Random Forest has the highest test score.

Customers, CompetitionDistance, StoreType_d, and Promo are the four most important features in our sales prediction.
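For reference, here is a minimal sketch of how such an importance ranking is obtained from a fitted forest. The column names below are illustrative stand-ins for the real features; `feature_importances_` is the standard scikit-learn attribute (one value per column, summing to 1).

```python
# Sketch: extracting and ranking feature importances from a RandomForestRegressor.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=4, random_state=0)
cols = ["customers", "competitiondistance", "storetype_d", "promo"]  # illustrative names
X = pd.DataFrame(X, columns=cols)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# rank features from most to least important
fi_df = (pd.DataFrame({"feature_names": cols, "feature_importance": rf.feature_importances_})
           .sort_values("feature_importance", ascending=False))
print(fi_df)
```

A dataframe of this shape is what the `sns.barplot` call above visualizes.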

In [118]:
score_df = pd.DataFrame(results)
score_df
Out[118]:
                            Model  Train_Score  Test_Score   Train_RMSE    Test_RMSE  Train_MAPE  Test_MAPE
0               Linear Regression     0.783317    0.784890  1148.244212  1146.361297   14.118264  14.140345
1                Lasso Regression     0.783298    0.784866  1148.295014  1146.425353   14.122957  14.144870
2        Decision Tree Regression     1.000000    0.911921     0.000000   733.546886    0.000000   7.677361
3  Tuned Decision Tree Regression     0.963351    0.930715   472.228732   650.595860    4.996006   6.879156
4  k-Nearest Neighbors Regression     0.676349    0.650872  1403.331871  1460.439759   18.066527  18.854862
5        Random Forest Regression     0.993726    0.956172   195.389935   517.446935    2.106154   5.630561
In [121]:
# Melting the DataFrame for visualization purposes
score_df_melted_scores = score_df.melt(id_vars=['Model'], value_vars=['Train_Score', 'Test_Score'], 
                                var_name='Metric_Type', value_name='Value')
score_df_melted_rmse = score_df.melt(id_vars=['Model'], value_vars=['Train_RMSE', 'Test_RMSE'], 
                                var_name='Metric_Type', value_name='Value')
score_df_melted_mape = score_df.melt(id_vars=['Model'], value_vars=['Train_MAPE', 'Test_MAPE'], 
                                var_name='Metric_Type', value_name='Value')
In [122]:
plt.figure(figsize=(12, 6))
sns.set_palette("Set1")

sns.barplot(x='Model', y='Value', hue='Metric_Type', data=score_df_melted_scores)

plt.title('Train vs Test Scores for Different Models', fontsize=16)
plt.xlabel('Model', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.xticks(rotation=45)
plt.legend(title='Metric Type')

plt.tight_layout()
plt.show()

Conclusion¶

The Random Forest regressor achieved the lowest MAPE, 5.63%, showing that it is a highly accurate model. Where MAE measures the average absolute magnitude of the model's errors, MAPE expresses how far the predictions are from the actual values on average, as a percentage of those values.
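To make the metrics concrete, here is a minimal sketch of both error measures applied to the four sample predictions shown earlier for the untuned decision tree (the helper function names are illustrative):

```python
# Sketch: RMSE and MAPE computed by hand on four (actual, predicted) pairs.
import numpy as np

def rmse(y_true, y_pred):
    # root of the mean squared error: penalizes large absolute errors
    return np.sqrt(np.mean((np.asarray(y_true, float) - np.asarray(y_pred, float)) ** 2))

def mape(y_true, y_pred):
    # mean absolute percentage error: average relative deviation, in percent
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

y_true = [4185, 10403, 4789, 7213]
y_pred = [4281, 10181, 3845, 5996]
print(round(rmse(y_true, y_pred), 2), round(mape(y_true, y_pred), 2))
```

MAPE being scale-free is what makes the 5.63% figure easy to interpret across stores with very different sales volumes.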

In [123]:
plt.figure(figsize=(12, 6))
sns.set_palette("Set1")

sns.barplot(x='Model', y='Value', hue='Metric_Type', data=score_df_melted_rmse)

plt.title('Train vs Test RMSE for Different Models', fontsize=16)
plt.xlabel('Model', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.xticks(rotation=45)
plt.legend(title='Metric Type')

plt.tight_layout()
plt.show()
In [124]:
plt.figure(figsize=(12, 6))
sns.set_palette("Set1")

sns.barplot(x='Model', y='Value', hue='Metric_Type', data=score_df_melted_mape)

plt.title('Train vs Test MAPE for Different Models', fontsize=16)
plt.xlabel('Model', fontsize=12)
plt.ylabel('Value', fontsize=12)
plt.xticks(rotation=45)
plt.legend(title='Metric Type')

plt.tight_layout()
plt.show()
In [127]:
test
Out[127]:
          id  store  dayofweek        date  open  promo stateholiday  schoolholiday storetype assortment  competitiondistance  promo2
0          1      1          4  2015-09-17     1      1            d              0         c          a               1270.0       0
1          2      3          4  2015-09-17     1      1            d              0         a          a              14130.0       1
2          3      7          4  2015-09-17     1      1            d              0         a          c              24000.0       0
3          4      8          4  2015-09-17     1      1            d              0         a          a               7520.0       0
4          5      9          4  2015-09-17     1      1            d              0         a          c               2030.0       0
...      ...    ...        ...         ...   ...    ...          ...            ...       ...        ...                  ...     ...
41083  41084   1111          6  2015-08-01     1      0            d              0         a          a                190.0       1
41084  41085   1112          6  2015-08-01     1      0            d              0         c          c               1880.0       0
41085  41086   1113          6  2015-08-01     1      0            d              0         a          c               9260.0       0
41086  41087   1114          6  2015-08-01     1      0            d              0         a          c                870.0       0
41087  41088   1115          6  2015-08-01     1      0            d              1         d          c               5350.0       1

41088 rows × 12 columns

In [ ]:
store_test_processed = test.copy()

# the test set has no 'customers' column, so impute the training-set median
customers_median = store_train['customers'].median()
store_test_processed['customers'] = customers_median
store_test_processed['customers'] = store_test_processed['customers'].astype(int)

store_test_processed.drop('stateholiday', inplace=True, axis=1)

# extract date parts with the vectorized .dt accessor instead of row-wise .apply
store_test_processed['date'] = pd.to_datetime(store_test_processed['date'])
store_test_processed['year'] = store_test_processed['date'].dt.year
store_test_processed['month'] = store_test_processed['date'].dt.month
store_test_processed['day'] = store_test_processed['date'].dt.day
store_test_processed['weekofyear'] = store_test_processed['date'].dt.isocalendar().week.astype(int)

# 'year' is constant across the test window, so drop it along with the unused columns
store_test_processed.drop(['year', 'open', 'date', 'store'], axis=1, inplace=True)

store_test_processed = pd.get_dummies(store_test_processed, columns=['storetype', 'assortment'], drop_first=True)

store_types = ['storetype_a', 'storetype_b', 'storetype_c', 'storetype_d']
assortments = ['assortment_a', 'assortment_b', 'assortment_c']

# cast the boolean dummy columns to 0/1 ints (astype avoids the replace() downcasting FutureWarning)
for col in store_types + assortments:
    if col in store_test_processed.columns:
        store_test_processed[col] = store_test_processed[col].astype(int)
In [130]:
store_test_processed
Out[130]:
          id  dayofweek  promo  schoolholiday  competitiondistance  promo2  customers  month  day  weekofyear  storetype_b  storetype_c  storetype_d  assortment_b  assortment_c
0          1          4      1              0               1270.0       0        609      9   17          38            0            1            0             0             0
1          2          4      1              0              14130.0       1        609      9   17          38            0            0            0             0             0
2          3          4      1              0              24000.0       0        609      9   17          38            0            0            0             0             1
3          4          4      1              0               7520.0       0        609      9   17          38            0            0            0             0             0
4          5          4      1              0               2030.0       0        609      9   17          38            0            0            0             0             1
...      ...        ...    ...            ...                  ...     ...        ...    ...  ...         ...          ...          ...          ...           ...           ...
41083  41084          6      0              0                190.0       1        609      8    1          31            0            0            0             0             0
41084  41085          6      0              0               1880.0       0        609      8    1          31            0            1            0             0             1
41085  41086          6      0              0               9260.0       0        609      8    1          31            0            0            0             0             1
41086  41087          6      0              0                870.0       0        609      8    1          31            0            0            0             0             1
41087  41088          6      0              1               5350.0       1        609      8    1          31            0            0            1             0             1

41088 rows × 15 columns

In [131]:
missing_columns = ['storetype_a', 'assortment_a']

for col in missing_columns:
    if col not in store_test_processed.columns:
        store_test_processed[col] = 0
In [132]:
store_test_processed
Out[132]:
          id  dayofweek  promo  schoolholiday  competitiondistance  promo2  customers  month  day  weekofyear  storetype_b  storetype_c  storetype_d  assortment_b  assortment_c  storetype_a  assortment_a
0          1          4      1              0               1270.0       0        609      9   17          38            0            1            0             0             0            0             0
1          2          4      1              0              14130.0       1        609      9   17          38            0            0            0             0             0            0             0
2          3          4      1              0              24000.0       0        609      9   17          38            0            0            0             0             1            0             0
3          4          4      1              0               7520.0       0        609      9   17          38            0            0            0             0             0            0             0
4          5          4      1              0               2030.0       0        609      9   17          38            0            0            0             0             1            0             0
...      ...        ...    ...            ...                  ...     ...        ...    ...  ...         ...          ...          ...          ...           ...           ...          ...           ...
41083  41084          6      0              0                190.0       1        609      8    1          31            0            0            0             0             0            0             0
41084  41085          6      0              0               1880.0       0        609      8    1          31            0            1            0             0             1            0             0
41085  41086          6      0              0               9260.0       0        609      8    1          31            0            0            0             0             1            0             0
41086  41087          6      0              0                870.0       0        609      8    1          31            0            0            0             0             1            0             0
41087  41088          6      0              1               5350.0       1        609      8    1          31            0            0            1             0             1            0             0

41088 rows × 17 columns

In [133]:
rdfreg = joblib.load('Random Forest Regression_model.pkl')
y_pred = rdfreg.predict(store_test_processed)

y_pred_rounded = y_pred.round().astype(int)

submission = pd.DataFrame({
    'Id': store_test_processed['id'], 
    'Sales': y_pred_rounded
})

submission.to_csv('sample_submission.csv', index=False)
/usr/local/lib/python3.11/site-packages/sklearn/utils/validation.py:2732: UserWarning: X has feature names, but RandomForestRegressor was fitted without feature names
  warnings.warn(
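The UserWarning above appears because the saved model was fitted on a plain array, so it matches input columns by position rather than by name. A hedged sketch of a safer pattern (the `train_cols` list and the small frame below are illustrative): record the column order used at training time and reindex the test frame before predicting, so positional agreement is guaranteed.

```python
# Sketch: align test-set columns to the training order before calling predict().
import pandas as pd

train_cols = ["customers", "promo", "competitiondistance"]  # assumed training order

test_df = pd.DataFrame({
    "competitiondistance": [1270.0, 14130.0],
    "customers": [609, 609],
    "promo": [1, 1],
})

# reindex reorders (and would insert NaN columns for anything missing)
aligned = test_df.reindex(columns=train_cols)
print(list(aligned.columns))  # -> ['customers', 'promo', 'competitiondistance']
```

With the columns aligned this way, `rdfreg.predict(aligned)` would be safe even if the dummy columns were created in a different order than during training.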