Exploring E-commerce Transactional Data — Part 1

One thing I enjoy doing is looking at the dynamics of a business as illustrated by its data. From time to time I pick up a scenario that is different from what I am used to (i.e., banking and financial services) and look for patterns in it.

Today, I will be talking about one specific kind of business: a hybrid (B2C and B2B) E-commerce platform. E-commerce platforms are a means of selling goods of any kind over the internet. These can be physical goods that need to be shipped somewhere (think about your latest Amazon order) or completely digital goods that can be accessed instantly.

The E-commerce website I studied is a UK-based retailer that primarily sells gifts. With that context in mind, I wondered how the gift-giving culture of a particular region (such as the UK) could influence the seasonal purchasing trends on this website, and I tried to describe that phenomenon.

I always like to map out my assumptions and hypotheses about the phenomena I want to model before I start doing exploratory data analysis on it. The image below illustrates how that went for this project.

Hypothesis map for the task of the original dataset. Image by the author.

By looking at the map, I posed four questions that could shed some light on the purchasing behavior of the clients of this platform over the course of the year. Below are my observations for each one of these.

One way to measure activity on an E-commerce platform is the WAU (weekly active users) metric. It is calculated from the dataset by counting, for each week, the number of unique clients who made one or more purchases on the platform. This is a good measure of activity because each client counts once per week regardless of order size, which smooths out the effect of clients that make bulk purchases (and may be other businesses rather than regular consumers).
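The WAU calculation described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the notebook's actual code: the row layout `(invoice_id, customer_id, invoice_date)` and the toy values are assumptions, and weeks are taken to be ISO calendar weeks.

```python
from collections import defaultdict
from datetime import date

# Toy invoice rows: (invoice_id, customer_id, invoice_date).
# The layout and the values are made up for illustration.
rows = [
    ("A1", "c1", date(2017, 11, 20)),
    ("A2", "c1", date(2017, 11, 22)),  # same client, same week: counted once
    ("A3", "c2", date(2017, 11, 24)),
    ("A4", "c3", date(2017, 11, 27)),  # falls in the following ISO week
]


def weekly_active_users(rows):
    """Map (ISO year, ISO week) to the number of distinct purchasing customers."""
    customers_by_week = defaultdict(set)
    for _invoice, customer, day in rows:
        iso_year, iso_week, _weekday = day.isocalendar()
        customers_by_week[(iso_year, iso_week)].add(customer)
    return {
        week: len(customers)
        for week, customers in sorted(customers_by_week.items())
    }


wau = weekly_active_users(rows)
print(wau)
```

Because customers are collected in a set per week, a client who places ten orders in a week contributes exactly as much as one who places a single order.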

Weekly Active Users (WAU) over the period (2016–2017) — Image by the author

The graph above suggests a growing trend of active users on the platform over the last few months of the year (from October to December), including weeks that contain commercial holidays (such as Black Friday and Mother’s Day).

Looking at popular products can help us understand what the website is most known for and what its customers are generally interested in. To illustrate this, we will look at the number of invoices in which each product appears, rather than the quantity sold (again, to smooth out bulk purchases).

20 most popular products in the retailer

Looking at the most popular products, I wondered what their contribution to the total sales on the website is. After removing revenue from postage and other shipping-related items, it turns out that the 20 most popular products alone represent around 6.5% of the total revenue made on the website from November 2016 to December 2017.
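As a rough sketch of the two measurements above (popularity as the number of invoices containing a product, and the revenue share of the most popular products), consider the following. The line-item layout, the toy values, and the `"POSTAGE"` label used to exclude shipping rows are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy line items: (invoice_id, product, quantity, unit_price). Made-up values.
items = [
    ("A1", "heart holder", 2, 3.0),
    ("A2", "heart holder", 1, 3.0),
    ("A3", "heart holder", 1, 3.0),
    ("A3", "cakestand", 2, 12.0),
    ("A3", "POSTAGE", 1, 10.0),  # shipping line, excluded below
]
EXCLUDED = {"POSTAGE"}  # assumed label for shipping-related rows


def product_stats(items):
    """Map product to (number of distinct invoices, total revenue)."""
    invoices = defaultdict(set)
    revenue = defaultdict(float)
    for invoice, product, quantity, unit_price in items:
        if product in EXCLUDED:
            continue
        invoices[product].add(invoice)
        revenue[product] += quantity * unit_price
    return {p: (len(invoices[p]), revenue[p]) for p in invoices}


def revenue_share(stats, top_n):
    """Share of total revenue made by the top_n products ranked by invoice count."""
    ranked = sorted(stats, key=lambda p: stats[p][0], reverse=True)
    total = sum(rev for _count, rev in stats.values())
    return sum(stats[p][1] for p in ranked[:top_n]) / total


stats = product_stats(items)
share = revenue_share(stats, top_n=1)
print(stats, round(share, 3))
```

In this toy data the most frequently ordered product brings in only a third of the revenue, which is the same kind of gap between popularity and monetary value discussed next.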

Some of the most popular products sorted by monetary value

If we look more closely, we can see something quite interesting:

In E-commerce scenarios, the most popular products are not necessarily the most profitable ones.

This is clear when we see that, even though the “white hanging heart-light holder” product sold 3 times as much as the “regency cakestand 3 tier” product, it made only 1/4 of the revenue. Is this behavior seasonal? Do we have products that become more relevant at a particular point of the year? We will investigate that more closely in the following section.

To analyze such behavior, I defined a column in the dataset that calculates the time in days until the next holiday for each invoice, considering the UK’s most relevant holidays in gift-giving culture.

These are holidays on which people tend to buy each other gifts of all kinds, such as Christmas, Mother’s Day, and Father’s Day. Along with those, I considered days of the year that usually see heightened E-commerce activity, such as Black Friday and Boxing Day.
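A minimal sketch of the "days until the next holiday" column might look like this. The article does not give the exact calendar that was used, so the hand-written 2017 dates below are an assumption, as is the choice to return `None` when no later holiday is listed.

```python
from bisect import bisect_left
from datetime import date

# Hand-listed 2017 dates, assumed for illustration only.
HOLIDAYS = sorted([
    date(2017, 3, 26),   # Mother's Day (UK)
    date(2017, 6, 18),   # Father's Day
    date(2017, 11, 24),  # Black Friday
    date(2017, 12, 25),  # Christmas
    date(2017, 12, 26),  # Boxing Day
])


def days_to_next_holiday(day, holidays=HOLIDAYS):
    """Days from `day` to the first holiday on or after it, or None if none is left."""
    i = bisect_left(holidays, day)  # index of the first holiday >= day
    if i == len(holidays):
        return None
    return (holidays[i] - day).days


print(days_to_next_holiday(date(2017, 12, 1)))
```

Applying this function to every invoice date yields the column whose distribution is plotted below. A real calendar would need to cover every year in the data.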

Distribution of time to next holiday in days for the number of Invoices

Measuring the time to the next holiday for each invoice yields the image above, which shows that people tend to buy more from this retailer close to a holiday than far from one (a right-skewed distribution).

In fact, about 50% of all orders on the website happen less than 27 days before a commemorative date. It looks like people from the UK are indeed quite punctual.
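The "about 50% of orders within 27 days" figure is essentially a statement about the median of that column. A small sketch, using made-up values rather than the real data:

```python
from statistics import median

# Days to the next holiday for a handful of invoices. Made-up values.
days = [2, 5, 9, 14, 27, 30, 41, 60]


def share_within(days, threshold):
    """Fraction of invoices placed fewer than `threshold` days before a holiday."""
    return sum(d < threshold for d in days) / len(days)


print(median(days), share_within(days, 27))
```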

But if we want to know which kinds of products sell the most at a particular point in time, we should look at the behavior of each product or category of products independently.

Seasonal Trends of the top-20 products in the platform

The plot above shows the time series for each of the top 20 products on the platform. Notice the line for “paper chain kit 50’s christmas”. At the beginning of the year, sales for this product are virtually zero.

At around week 44, it starts to grow quite rapidly as people start looking into Christmas decorations, peaking around week 48, when it surpasses all other products in terms of number of orders.
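The per-product weekly series behind a plot like this can be sketched as follows. The row layout and the toy values are assumptions, and for simplicity the example keys on the ISO week number alone, which only works when the data covers a single year.

```python
from collections import Counter
from datetime import date

# Toy line items: (invoice_id, product, invoice_date). Made-up values.
orders = [
    ("A1", "paper chain kit", date(2017, 10, 30)),  # ISO week 44
    ("A2", "paper chain kit", date(2017, 11, 27)),  # ISO week 48
    ("A3", "paper chain kit", date(2017, 11, 28)),  # ISO week 48
    ("A3", "paper chain kit", date(2017, 11, 28)),  # repeated line, same invoice
    ("A4", "cakestand", date(2017, 11, 28)),
]


def weekly_orders(orders):
    """Count distinct invoices per (product, ISO week)."""
    seen = {
        (product, day.isocalendar()[1], invoice)
        for invoice, product, day in orders
    }
    return Counter((product, week) for product, week, _invoice in seen)


def peak_week(counts, product):
    """ISO week in which `product` appears in the most invoices."""
    weeks = {week: n for (p, week), n in counts.items() if p == product}
    return max(weeks, key=weeks.get)


counts = weekly_orders(orders)
print(peak_week(counts, "paper chain kit"))
```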

Understanding this kind of behavior allows E-commerce retailers to stock up at the right time, minimizing the cost of holding unnecessary products in stock throughout the year.

We have come full circle to the topic of active users. Now that we know we have more active customers towards the end of the year, how can we quantify the average effect of having more active customers on the platform?

For that, we will use regression analysis. Specifically, we will fit a regression with the weekly active users on the platform as the independent variable, trying to predict the number of products sold in a particular week.

Regression results before and after preprocessing

After some preprocessing of the data, we are able to achieve a reasonable regression model for our use case, with the slope of the model being around 4. That indicates that for every additional active user in a given week, we expect about 4 more products to be sold. If we “activate” 100 clients, we should expect sales to increase by around 400 items.
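To make the interpretation of the slope concrete, here is a small ordinary least squares sketch. The weekly figures are made up, and the preprocessing applied in the original analysis is not reproduced here; the point is only how a slope near 4 translates into an expected change in items sold.

```python
# Made-up weekly figures: active users and items sold in the same week.
wau = [100, 150, 200, 250]
items_sold = [400, 610, 790, 1000]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(wau, items_sold)

# Expected change in items sold if 100 additional clients become active.
extra_items = slope * 100
print(round(slope, 2), round(intercept, 2), round(extra_items))
```

The intercept does not matter for this question: the expected change from activating more clients depends only on the slope.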

At this point, we have become more familiar with the E-commerce setting and, by analyzing transactional data, we can make the business operation of such a website more predictable.

Predictability in businesses allows us to better manage them and reduce inefficiencies, illustrating yet another superpower data science gives us.

This is the first article in a series of explorations of E-commerce data. It was originally written for the Udacity Data Scientist Nanodegree, and the entire Jupyter notebook with the code used in this analysis can be found here.

Storyteller and Data Scientist. Passionate about Data Products that empower people. Currently at BTG Pactual. Brown University, Electrical Engineering.
