Approaching a Sales Forecast Problem (Part-1)
Prashant Brahmbhatt
September 12, 2020

This post assumes that you already have a decent understanding of what a time-series problem is but don’t quite know how to tackle one. All of us who have studied Data Science or Machine Learning have at some point encountered a time-series problem, be it weather prediction or sales forecasting. And since time-series problems don’t follow quite the same flow as other machine learning problems, it is sometimes hard to get started on solving them. Even trying out different solutions doesn’t amount to much understanding if you’re not sure what exactly you just did with whatever snippet of code you ran.

Here we will take a sample problem relating to Sales Forecasting and try to get reasonable results with a proper understanding of each step.

Note: We will be focusing on modeling a univariate problem here for simplicity’s sake. Multivariate analysis will be tackled in a future blog. So let’s get started!

We will be taking a sample dataset from Kaggle as our example problem. You can find the data and its details here.

And if you are well versed and just want to take a peek at the code with minimal explanation, you can head straight to the GitHub repository here.

Firstly, we should understand the problem statement. We have historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and we are tasked with predicting the department-wide sales.


The Dataset Structure

We have mainly four relevant CSV files:

  • stores.csv: This file contains anonymized information about the 45 stores, indicating the type and size of the store.
  • train.csv: This is the historical training data, which covers 2010-02-05 to 2012-11-01.
  • test.csv: This file is identical to train.csv, except we have withheld the weekly sales. We must predict the sales for each triplet of store, department, and date in this file.
  • features.csv: This file contains additional data related to the store, department, and regional activity for the given dates.

Taking a look at our data files train.csv, stores.csv and features.csv respectively:

train.csv stores.csv features.csv

From the above, we can see that all three of our datasets can easily be combined into one single set to work on, which will be very handy. We can combine train.csv and features.csv using [Store, Date] as the join keys. Then we can combine the resulting data with stores.csv using [Store] as the key. After combining all, our final dataset will look something like this:
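The two joins can be sketched with pandas as below. The tiny frames are made-up stand-ins for the real CSV files (toy values, only a few of the real columns), and using `how="left"` is an assumption rather than something the original code is known to do.

```python
import pandas as pd

# Toy stand-ins for train.csv, features.csv and stores.csv (values are made up).
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Date": ["2010-02-05", "2010-02-05", "2010-02-05"],
    "Weekly_Sales": [100.0, 200.0, 300.0],
    "IsHoliday": [False, False, False],
})
features = pd.DataFrame({
    "Store": [1, 2],
    "Date": ["2010-02-05", "2010-02-05"],
    "Temperature": [40.0, 45.0],
    "IsHoliday": [False, False],
})
stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [1000, 2000]})

# Step 1: attach the per-store, per-date features to every training row.
merged = train.merge(features, on=["Store", "Date"], how="left")

# Step 2: attach the static store information.
merged = merged.merge(stores, on="Store", how="left")

print(merged.columns.tolist())
```

Because IsHoliday exists in both train and features, pandas keeps both copies and adds the `_x` and `_y` suffixes to tell them apart.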


All the resulting columns will be:

Observe that we have two duplicate columns, IsHoliday_x and IsHoliday_y, which denote the holiday status as a boolean. So after confirming that they are identical, we can remove one of them and rename the remaining one to IsHoliday.
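A minimal sketch of that cleanup, on a toy frame that only carries the relevant columns:

```python
import pandas as pd

# Toy frame with the duplicated holiday flags produced by the merge.
merged = pd.DataFrame({
    "Store": [1, 1, 2],
    "IsHoliday_x": [False, True, False],
    "IsHoliday_y": [False, True, False],
})

# Only drop the duplicate once we know the two columns agree row by row.
columns_match = merged["IsHoliday_x"].equals(merged["IsHoliday_y"])
if columns_match:
    merged = (
        merged.drop(columns="IsHoliday_y")
              .rename(columns={"IsHoliday_x": "IsHoliday"})
    )

print(merged.columns.tolist())
```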

Now we should get a sense of how our data is doing in terms of missing values.


The plot has brighter bars in the middle, showing us the pattern of missing values in our final dataset. All of the columns with missing data are the MarkDown columns. Markdowns are reductions in price, so a missing value here may simply denote the absence of a discount. But since we are focusing on univariate analysis, we can safely drop these columns.
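If you prefer numbers to a plot, the same check can be done by counting missing values per column. This is a sketch on toy data; selecting the columns by the `MarkDown` name prefix is an assumption about how they are named.

```python
import numpy as np
import pandas as pd

# Toy frame: the target column plus two discount columns with gaps.
df = pd.DataFrame({
    "Weekly_Sales": [100.0, 200.0, 300.0],
    "MarkDown1": [np.nan, 5.0, np.nan],
    "MarkDown2": [np.nan, np.nan, np.nan],
})

# Count missing values per column and show only the columns that have any.
missing = df.isna().sum()
print(missing[missing > 0])

# Drop every column whose name starts with "MarkDown".
markdown_cols = [col for col in df.columns if col.startswith("MarkDown")]
df = df.drop(columns=markdown_cols)
```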


From the data sample in the above picture, we can see that there are observations for which the sales are negative. Since we don’t have any specific context for these values, we can remove them as they are more likely to be incorrect or anomalous observations. After removing these observations, our final data has shape (420285, 20).
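Filtering out the negative rows is a one-line boolean mask. The sketch below uses toy values and keeps zero-sales rows, since only negative values are described as anomalous.

```python
import pandas as pd

df = pd.DataFrame({
    "Dept": [1, 2, 3, 4],
    "Weekly_Sales": [100.0, -25.0, 0.0, 300.0],
})

# Flag the negative observations, report how many there are, then drop them.
is_negative = df["Weekly_Sales"] < 0
print(f"Dropping {is_negative.sum()} of {len(df)} rows")
df = df.loc[~is_negative].reset_index(drop=True)
```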

Since our focus is on the univariate problem, we only require the target column itself so we can skip the detailed analysis of the independent columns and right away begin with building models using the Weekly_Sales column.


Before we can start with the modeling part we have to resample the data. It is a necessary step here because the purpose of resampling is to get our data at fixed, regular intervals. We don’t need every individual data point; instead we build one aggregated observation per interval.

We have a date column for each observation, so we can obtain the sampled set based on this column. Since our dataset covers a bit less than three years, we can sample it in three ways: quarterly, monthly, or weekly.

quarterly_sampling monthly_sampling weekly_sampling

The sampled data above is 70% of the total data, as we have held out 30% of the points for testing purposes. As for choosing the interval, we should go with weekly sampling: quarterly or monthly sampling leaves too few data points, which makes the underlying patterns too abstract.

Also, we need to use an aggregate function while resampling. When we have several observations for the same date, the aggregate function decides how they are combined into one. The aggregate function we have used is the mean.

Our data contains multiple observations for a single date, each corresponding to a particular store and department. Our aggregate function takes the sales of all the stores and departments on a particular date and averages them to compose that one observation.

Date 2010-02-07 16876.145176

In the above observation, the sales given for the date are the average sales of all the departments and stores on that day. For now, we are targeting a general model for the entire Walmart sales; later on, we can use that model to forecast a particular department.
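Here is a small sketch of weekly resampling with the mean, followed by a 70/30 split. The data is a toy example, and splitting chronologically (first 70% for training) is an assumption about how the hold-out was made.

```python
import pandas as pd

# Toy data: two stores reporting on two Fridays.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2010-02-05", "2010-02-05", "2010-02-12", "2010-02-12"]),
    "Store": [1, 2, 1, 2],
    "Weekly_Sales": [100.0, 300.0, 200.0, 600.0],
})

# "W" bins the rows into weeks ending on Sunday and labels each bin with that
# Sunday; mean() averages every store/department that falls into the bin.
weekly = df.set_index("Date")["Weekly_Sales"].resample("W").mean()
print(weekly)

# Chronological split: the earliest 70% of weeks for training, the rest for testing.
split = int(len(weekly) * 0.7)
train, test = weekly.iloc[:split], weekly.iloc[split:]
```

This is also why the Friday 2010-02-05 observations show up under the date 2010-02-07 in the sampled series.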


After getting the sampled data we are left with the stationarity check. If you’re not sufficiently familiar with what stationarity is, just know for the time being that it is a property we want our time-series data to have. We welcome neither a trend (a continuously increasing or continuously decreasing pattern) nor any other form of non-stationarity.

You can think of it this way: just as we prefer our distributions to be normal, we want our series to be stationary. If our data had been non-stationary, we would have had to deal with it using what we know as differencing, but we don’t need it here.

For better intuition for both of these concepts, there’ll be a future blog to catch up with.


In our trend component, we see that there isn’t a continuously increasing or decreasing pattern, which suggests that a trend is unlikely. But in statistics we don’t rely on our eyes, we rely on numbers. So of course there’s a test for stationarity.

The famous Augmented Dickey-Fuller Test is used to check for stationarity. We will perform the test using a significance level of 5%.

---Results of Dickey Fuller----

ADF Statistic: -5.927437652782698
p-value: 2.424917229727546e-07
Critical Values:
    1%: -3.47864788917503
    5%: -2.882721765644168
    10%: -2.578065326612056

As the results show, the p-value of the test is far below the significance level, even at 1%. We can therefore reject the null hypothesis of a unit root and treat our series as stationary.


Seasonality is another attribute of a time-series that shows the recurring patterns in the series. These are the patterns that can be observed at regular intervals in a series.

For example, the sales of woolens are likely to show a seasonal hike in sales during the winter months of each year. Or maybe the Sports TV ratings during a world cup, which would correspond to a seasonal pattern of four years.


Here in our data, we can also observe a hike at the end of each year. The significance of seasonality is that we want our model to capture it well. Why, you might ask?

Because more or less, that’s our overall target. We are looking to find patterns, we want our model to find subtle patterns that can be used to predict future values. And seasonality is the most obvious pattern, so any decent model should be able to capture the seasonality well, at least.



A parameter that we have to decide in many time-series models is how many time lags to use for future forecasts. It simply means deciding how many of the past observations contribute significantly to future results. But can’t we take all of the past data? The more the better is the usual scenario in Machine Learning, right?

Well NO!

In time-series, the more lags we include, the more complex our model becomes. And if we stack up unimportant variables, we just make our model more complex and increase our computation time, all to no avail. Moreover, a model that fits the past too well using a large number of lags tends to make worse predictions for the future.

So we have to decide how many time lags we are going to take in each of our models. And yes! We do have a test for it.

We can use Auto-Correlation and Partial Auto-Correlation plots to get an idea of the number of lags that we could try with our model.

Log-Likelihood Ratio Test (LLR)

Now that we have talked about time lags and how choosing them sets the complexity of our model, we wouldn’t rely only on our eyes to judge that complexity, would we?

The Log-Likelihood Ratio Test gives us a measure of whether our more complex model is significantly better than the simpler one. We provide the LLR function with the simpler and the more complex model. If the resulting p-value is below 0.05, it suggests that the complex model is significantly better than the simpler one; otherwise, the extra complexity is not justified.

Note that we always have to place the simpler model first in the LLR test otherwise we’ll always get 1.0 which would be wrong.
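The LLR function itself is not shown in this post, so the version below is a sketch of a common form. It takes the two log-likelihoods directly (fitted statsmodels results expose this as `.llf`); the name `llr_test` and the `df` argument, the number of extra parameters in the complex model, are assumptions.

```python
from scipy.stats import chi2

def llr_test(llf_simple, llf_complex, df=1):
    """Return the p-value of a log-likelihood ratio test.

    llf_simple / llf_complex: log-likelihoods of the two fitted models.
    df: how many more parameters the complex model has than the simple one.
    """
    lr_statistic = 2 * (llf_complex - llf_simple)
    # Survival function of the chi-squared distribution gives the p-value.
    return round(float(chi2.sf(lr_statistic, df)), 3)

# Simple model first: a clear improvement in log-likelihood gives a small p-value.
p_correct_order = llr_test(-100.0, -95.0)

# Arguments swapped: the statistic turns negative and the p-value is stuck at 1.0.
p_swapped = llr_test(-95.0, -100.0)

print(p_correct_order, p_swapped)
```

The swapped call shows why the order matters: a negative statistic lies below the support of the chi-squared distribution, so the p-value comes out as 1.0 regardless of the models.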


The simplest model other than Naive Forecasting is the Auto-Regression (AR) model. As the name suggests, it is similar to regression models: it regresses the current value on its own past lags and estimates a coefficient for each lag that reflects its importance. To decide the number of lags for AR, we use the Partial Auto-Correlation plot.


The above plots suggest that we have 3–4 significant values, so we can try models with 3 lags up to at least 5 lags.

Let us observe some of the AR models and compare their performance.



  • Since the p-values (P>|z|) are significant for both the constant and the coefficient, we can try higher lag models. Throughout the blog, we will use these p-values to check for significant lags for the model.





  • We see that we have 4 significant values and 3 non-significant ones.
  • Also, our last lag is non-significant, so we should stop here.

Still, we can have a look at how a model with one more lag would perform.



  • As expected we can observe that our last lag is non-significant so AR(6) is probably a reach.

We can use our LLR Test to check on the complexities and significance of the models.

LLR For AR(1) and AR(3): 0.093

LLR For AR(3) and AR(5): 0.0

LLR For AR(5) and AR(6): 0.734

  • The above test suggests that a model more complex than AR(3) is significantly better.
  • It also suggests that using AR(6) over AR(5) would not improve the model.

But we will consider the results of AR(6) for comparison.

The AR Model Results


We can see that the results of AR(5) seem alright, AR(3) lacks sufficient depth, and AR(6) doesn’t seem much better than AR(5). But we see that the model is still not quite able to predict the future well. It starts off badly, performs okay-ish in the middle, and then flattens out towards the end.

We can do some analysis of the residuals (errors) of the model to get a better insight into the results.

Ideally, the residual series should be white noise, meaning the residuals should not show any particular pattern. If a pattern is observed in the residuals, our model has failed to capture it and it now shows up in the error terms. Let us start by checking the stationarity of the residuals.

---Results of Dickey Fuller----

ADF Statistic: -9.89429414692825
p-value: 3.491045332468848e-17
Critical Values:
    1%: -3.498198082189098
    5%: -2.891208211860468
    10%: -2.5825959973472097

  • The residual series is stationary, but being stationary is not the same as being white noise.
  • We should also look at the ACF plot of the residuals, where none of the points should be significant.


  • We can see that there are no significant points in the residuals.


  • There seems to be an unaccounted pattern at the end of the year in the residuals.

So we should try other models as well.


Another simple model is the Moving Average (MA) model. Despite the name, it is not the same thing as the rolling (or running) mean used for smoothing: an MA(q) model expresses the current value as the mean of the series plus a weighted combination of the last q error terms.

For choosing the complexity of the MA model, we use the Auto-Correlation plot.


Let us observe some of the MA models and compare their performance.









  • Our last lag is not significant anymore, so we will stop here.

Checking complexities.

LLR For MA(1) and MA(3): 0.015

LLR For MA(3) and MA(4): 0.0

LLR For MA(4) and MA(5): 0.176

Comparing the above models’ performances.


We can observe that the MA forecasts stay close to the mean of the series and do not pick up any of its patterns. This suggests we need models more complex than AR and MA.

We will continue with more models in Part 2 of the blog. Stay tuned…

Until next time! Ciao!

