Introduction to segmentation


Market segmentation is one of the key components of strategic marketing and is fundamental to its success: the best performing companies drive their business based on segmentation (Lilien and Rangaswamy 2003, p. 61).

At Brain Food we believe in the importance of companies getting to know their customers and offering them a more personalized service. Segmentation makes it possible to identify the needs of each segment, which customers are the most profitable and which are more likely to respond to a strategic marketing campaign. This blog shows how segments can be generated from transactional data, although the analysis can be enriched with other customer variables (such as demographics). We have experience in customer segmentation in industries such as medicine, retail, fast food, automotive and sports clubs, among others, where we have profiled customers according to different behavioral and socio-demographic variables and identified opportunities to offer better services, build loyalty, strengthen the bond with customers and increase sales.

This blog presents two methods for segmenting the customers of an online store in the UK. Transactional data is very common within the industry, and the objective is to segment these customers according to their buying behavior. For this, an RFM (Recency, Frequency, Monetary Value) analysis is used to understand the buying behavior of each customer and to extract different segments.

The first method segments customers according to the values of the RFM variables, and the second is based on the K-Means model. The difference between the two is that the first separates customers according to fixed ranges of each variable, while the second creates segments according to how close customers are to one another in terms of purchasing behavior.


The data used in the analysis correspond to a transnational dataset containing the transactions that occurred between 12/01/2010 and 12/09/2011 for a UK-based registered online retailer[1]. The following table shows some rows randomly extracted from the dataset.

The columns are described below:

  • InvoiceNo: Transaction number
  • StockCode: SKU code
  • Description: SKU description
  • Quantity: Quantity sold
  • InvoiceDate: Date of the transaction
  • UnitPrice: Unit price of the SKU
  • CustomerID: Customer ID
  • Country: Country of purchase
  • TotalSum: Total of sale

In order to segment the customers, it is necessary to extract variables from the transactional data. For this purpose, an RFM (Recency, Frequency, Monetary Value) analysis is performed:

  • Recency: Days since the last transaction made by the customer
  • Frequency: Number of transactions made
  • Monetary Value: Total amount spent by the customer
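As a rough illustration, the three variables above could be computed along these lines in plain Python. The sample rows, the field order and the snapshot date are hypothetical, and counting distinct invoices as the Frequency is an assumption; the original analysis may have counted rows instead.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (customer_id, invoice_no, invoice_date, total_sum).
transactions = [
    (1, "A1", date(2011, 12, 1), 20.0),
    (1, "A2", date(2011, 12, 8), 35.5),
    (2, "B1", date(2011, 6, 15), 10.0),
    (3, "C1", date(2011, 11, 30), 5.0),
    (3, "C1", date(2011, 11, 30), 7.5),  # second line of the same invoice
]


def compute_rfm(rows, snapshot_date):
    """Return {customer_id: (recency_days, frequency, monetary_value)}."""
    last_purchase = {}
    invoices = defaultdict(set)
    spend = defaultdict(float)
    for customer_id, invoice_no, invoice_date, total_sum in rows:
        # Recency needs the most recent purchase date per customer.
        if customer_id not in last_purchase or invoice_date > last_purchase[customer_id]:
            last_purchase[customer_id] = invoice_date
        # Assumption: Frequency counts distinct invoices, not individual rows.
        invoices[customer_id].add(invoice_no)
        spend[customer_id] += total_sum
    return {
        cid: ((snapshot_date - last_purchase[cid]).days, len(invoices[cid]), spend[cid])
        for cid in last_purchase
    }


# Assumption: the snapshot date is the day after the last day in the data.
rfm = compute_rfm(transactions, snapshot_date=date(2011, 12, 10))
```

Here `rfm[1]` holds the Recency, Frequency and Monetary Value of customer 1 as a tuple.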

Once these variables have been calculated for each of the customers in the base, we obtain a table as shown below for some customers.

Segmentation methods

From the RFM analysis, two segmentation methods will be explored: one based on separating the customers according to the ranges of the variables, and another based on a clustering model that automatically generates the different segments. For the first method, the variables are separated into quartiles. The following table shows the values for the 25%, 50% and 75% percentiles.

The table shows the average, standard deviation, minimum, maximum and the 25%, 50% and 75% percentiles. With these values we can assign a score for each of the variables. For the Recency variable a lower value is better, since the last purchase is more recent. Therefore, if the customer's value is at or below the 25% percentile, they will have a score of 4. For Frequency and Monetary Value, they will have a score of 4 if they are above the 75% percentile. In summary, the scores for each customer are assigned as follows:

  • Recency:
    • 4: value <= 25%
    • 3: 25% < value <= 50%
    • 2: 50% < value <= 75%
    • 1: value > 75%
  • Frequency and Monetary Value:
    • 1: value <= 25%
    • 2: 25% < value <= 50%
    • 3: 50% < value <= 75%
    • 4: value > 75%
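A minimal sketch of this scoring, using only the standard library. The sample values are hypothetical, and two choices are assumptions: a value exactly on a cut point is placed in the lower quartile, and the percentiles are computed with the "inclusive" method of `statistics.quantiles`.

```python
from statistics import quantiles


def quartile_score(value, cuts, lower_is_better=False):
    """Return a 1-4 score from the quartile that `value` falls in."""
    q1, q2, q3 = cuts
    if value <= q1:
        rank = 1
    elif value <= q2:
        rank = 2
    elif value <= q3:
        rank = 3
    else:
        rank = 4
    # For Recency, a low value is good, so the ranking is reversed.
    return 5 - rank if lower_is_better else rank


# Hypothetical values for eight customers.
recency = [2, 10, 35, 60, 120, 200, 300, 365]
frequency = [1, 2, 4, 8, 15, 30, 60, 120]

r_cuts = quantiles(recency, n=4, method="inclusive")    # 25%, 50%, 75%
f_cuts = quantiles(frequency, n=4, method="inclusive")

r_scores = [quartile_score(v, r_cuts, lower_is_better=True) for v in recency]
f_scores = [quartile_score(v, f_cuts) for v in frequency]
```

The same `quartile_score` call without `lower_is_better` would apply to Monetary Value.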

After assigning a score to each variable, each customer's segment is given by the combination of the three scores. In addition, an overall score is obtained by adding the three RFM scores. Finally, a table is obtained as follows:

It is often not convenient to work with a large number of segments. In this case there are as many segments as combinations of the scores that appear in the data, so there would be 61 segments (of a possible 4 × 4 × 4 = 64). To reduce this number, the segments can be divided according to the global score. The following table shows the averages of each variable according to the global score.

From the table above we see that, on average, the values of each variable improve as the score increases. Although we now have 10 segments (global scores from 3 to 12), we can create a smaller number of segments by grouping the scores. For this, the following separation is considered:

  • Gold: Score over 9 (10 to 12)
  • Silver: Score between 6 and 9, inclusive
  • Bronze: Score less than 6 (3 to 5)
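The grouping above can be sketched as a small function. The function name is hypothetical, and representing the detailed segment as the three scores joined into a string is an assumption made for illustration.

```python
def rfm_tier(r, f, m):
    """Map three 1-4 scores to (segment code, global score, tier)."""
    total = r + f + m  # global score, from 3 to 12
    if total > 9:
        tier = "Gold"
    elif total >= 6:
        tier = "Silver"
    else:
        tier = "Bronze"
    # Assumption: the detailed segment is the three scores side by side.
    return f"{r}{f}{m}", total, tier


best = rfm_tier(4, 4, 4)
middle = rfm_tier(3, 3, 3)
worst = rfm_tier(1, 2, 2)
```

A customer with scores 3, 3 and 3 has a global score of 9 and therefore stays in Silver, since Gold requires a score over 9.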

With this classification, the general segments are created and the following table is obtained:

Now the customer base has three segments, which can be offered different products and benefits in order to move Bronze customers up to Silver and Gold. The following table shows the average values of the RFM variables for the three segments.

Segmenting in this way yields three segments that support strategic decisions based on customers' purchasing behavior. For the Gold segment, the last purchase was 26 days ago on average, and its frequency is much higher than that of the other two segments, with an average of 192 purchases in the period. It is also the second largest segment, which is good news. The Silver segment buys considerably less frequently, with 36 purchases on average, spends considerably less than Gold, with 724, and made its last purchase 100 days ago on average. Finally, the Bronze segment is clearly the worst of all: on average its customers have made 10 purchases, their last purchase was 218 days ago and their total spend, at 199, is significantly lower than that of the other two segments.

While this type of segmentation makes sense and allows us to draw conclusions about customers, quartile separations may not be the most appropriate way to separate them. There are several methodologies that are more data-driven than the one analyzed. The following shows how to do segmentation using the K-Means method.

The algorithm consists of 5 steps:

  1. Specify the number of segments, K.
  2. Randomly select K observations within the database and take them as the initial centroids[2] of the segments.
  3. Assign all points to the nearest centroid using the Euclidean distance[3] and generate K segments.
  4. Recalculate the centroid of each of the K segments as the mean of all points belonging to it, which minimizes the sum of squared distances between the centroid and those points.
  5. Repeat from step 3 until convergence of the algorithm is achieved, that is, until the assignments no longer change.

[2] The centroids in this case are the representative points of the segment.

[3] The Euclidean distance is calculated as d(A,B) = √[(X_A – X_B)²+(Y_A – Y_B)²] for two vectors A and B with coordinates on the X and Y axes.
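The steps of the algorithm can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used for the analysis: the function name, the `seed` and `max_iter` parameters and the sample points are all hypothetical, and in practice a library implementation would normally be used.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Pick K observations at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Convergence: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a segment ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical, already scaled points forming two well-separated groups.
points = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5), (10.0, 10.0), (11.0, 10.5), (12.0, 11.5)]
centroids, labels = kmeans(points, k=2)
```

With the three RFM variables, each point would be a tuple of three values, and `math.dist` extends the distance formula to that third coordinate.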

For this algorithm to work properly, the variables need to be on the same scale. In addition, as it is a distance-based method, it does not work very well when there are values far away from the mean. This is why it is necessary to transform the data before using the algorithm. The histograms for the RFM variables are shown below.

There are several ways to transform the data to avoid values too far away from the mean. In this case, the logarithmic transformation is used, since the values are positive. The following figure shows the result.

The last transformation necessary to be able to use the algorithm is to bring the variables to the same scale. This is achieved by standardizing the variables, i.e., subtracting the mean and dividing by the standard deviation. The result is shown in the following figure.
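The two transformations can be sketched together as shown here. The function name and the sample values are hypothetical, and using the population standard deviation is an assumption.

```python
import math
from statistics import mean, pstdev


def log_standardize(values):
    """Log-transform strictly positive values, then rescale to mean 0 and std 1."""
    # math.log raises ValueError for zero or negative inputs.
    logged = [math.log(v) for v in values]
    mu = mean(logged)
    sigma = pstdev(logged)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in logged]


# Hypothetical, right-skewed Monetary Value figures.
monetary = [10.0, 20.0, 40.0, 80.0, 5000.0]
scaled = log_standardize(monetary)
```

Applying the same function to Recency, Frequency and Monetary Value separately leaves the three variables on a comparable scale.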

Now that the variables are on the same scale and there are no values very far from the mean, it is possible to use the K-Means algorithm. To find out how many segments should be chosen, the elbow method is used. This is based on running the algorithm with different numbers of segments and looking at the total sum of the squared distances from each point to the centroid of the segment it belongs to. We look for the point after which the decrease is no longer sharp, that is, where the elbow is located. This is shown in the following figure.
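A small sketch of the quantity plotted in the elbow method, assuming a plain K-Means written for illustration. The function name, parameters and sample points are hypothetical.

```python
import math
import random


def kmeans_sse(points, k, seed=0, max_iter=100):
    """Run a plain K-Means and return the total within-segment sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a segment ends up empty
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    # Sum of squared distances from each point to its own centroid.
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Hypothetical points forming two well-separated groups.
points = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5), (10.0, 10.0), (11.0, 10.5), (12.0, 11.5)]
sse = {k: kmeans_sse(points, k) for k in range(1, 5)}
```

On this toy data the sum drops sharply from K = 1 to K = 2 and has little room to fall after that, so the elbow would be at 2. On the real data the same curve is inspected visually.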

From the elbow method, it is observed that the optimal number of segments to extract is 3. Then, each customer is assigned to the segment whose centroid is closest, and we obtain the following table.

After having extracted the segments, it is necessary to profile and describe them. For this, we can review the so-called “Snake Plot” which is the graph of the average of each of the standardized variables to see the difference between each of the segments.
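The values drawn in such a plot are simply the per-segment averages of the standardized variables. A minimal sketch, with a hypothetical function name and hypothetical rows in the order Recency, Frequency, Monetary Value:

```python
from collections import defaultdict
from statistics import mean


def segment_means(rows, labels):
    """Average each column of `rows` within each segment label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: tuple(mean(col) for col in zip(*members))
        for label, members in sorted(grouped.items())
    }


# Hypothetical standardized (Recency, Frequency, Monetary Value) rows.
rows = [(-1.0, 1.0, 1.5), (-1.0, 2.0, 0.5), (1.0, -1.0, -1.0)]
labels = [0, 0, 1]
profile = segment_means(rows, labels)
```

Each entry of `profile` is one line of the plot, with one point per variable.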

The graph shows different values by segment. Segment 0 has an average Recency well below the overall average and quite high Frequency and Monetary Value, so it is a segment that buys quite frequently, whose last purchase was recent and that spends much more than the other segments. On the other hand, segment 1 buys with low frequency, spends very little compared to segment 0 and its last purchase is not very recent. Finally, segment 2 is quite close to the average: its last purchase is similar to the average, it does not buy as frequently as segment 0 but more frequently than segment 1, and it spends more than segment 1 but not as much as segment 0.

The table below shows the average of the variables with their original values and the total number of customers per segment. As shown in the Snake Plot above, segment 0 is the best segment according to the RFM analysis and the one with the lowest number of customers. The segment with the highest number of customers is segment 2, which is located between segment 0 and segment 1.

In addition to the average, another way of looking at the segments is to review the deviation of the average of the variables with respect to the average of the total number of customers.
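One way to compute this deviation, assuming it is defined as the segment average divided by the overall average, minus one. The function name and the sample figures are hypothetical.

```python
from statistics import mean


def relative_deviation(segment_values, all_values):
    """Segment mean relative to the overall mean, as a fraction (-0.5 = 50% less)."""
    return mean(segment_values) / mean(all_values) - 1


# Hypothetical Recency values in days: the whole base and one segment of it.
all_recency = [10, 18, 30, 342]
segment_recency = [10, 18]
deviation = relative_deviation(segment_recency, all_recency)
```

Here the segment averages 14 days against an overall average of 100, so the deviation is about -0.86, which would be read as a Recency 86% lower than the general average.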

From the figure we can see that segment 0 has a purchase recency 86% lower than the general average, a frequency 188% higher than the general average and a total expenditure 221% higher than the average. On the other hand, segment 1 behaves in the opposite way, with 84% higher recency, 84% lower purchase frequency and 86% lower spending. Finally, segment 2 is the closest to the average, with 25% lower recency, 27% lower purchase frequency and 43% lower total expenditure.

For simplicity, we will call segment 0 Gold, segment 1 Bronze and segment 2 Silver, following the terminology of the previous method. The objectives differ by segment: for Gold, it is to keep customers in that segment; for Silver, it is to upgrade them to Gold and prevent them from dropping to Bronze; and for Bronze, it is to upgrade them to either Silver or Gold. Marketing strategies should focus on meeting these objectives, for example, by making exclusive offers to Silver to boost consumption and to Bronze to activate them. For Gold, one option would be to review the date of the last purchase and, if customers have been inactive for too long, contact them to reactivate them.


Customer segmentation is a fundamental part of business operations: it helps companies know their customers and offer them better products and benefits. It helps to adapt the marketing strategy and is used by a large number of successful companies. Transactional data is the most commonly found within the industry and contains information on customer behavior.

Two segmentation methods are shown in this blog: one based solely on separating each of the RFM variables by quartiles and assigning scores accordingly, and a second based on the distances between customers according to the values of the variables. The results obtained are relatively different, since separating by quartiles does not necessarily group similar customers together. In any case, both methods yield three segments. One contains the customers who are most beneficial to the store, since they spend more, buy more frequently and are still active, as their last purchase is recent. Then there is an intermediate segment and finally a segment that is not beneficial.

With the help of these segmentations, one could choose a segment to perform actions in order to move customers from one segment to another that generates greater benefits to the company. For example, moving from the intermediate segment to the best segment or reactivating those that have been inactive for too long.

As recommendations, more variables can be added. In this case, the customer's country could have been included. The total time since the first transaction, which is called Tenure and is sometimes added to the RFM analysis, could also have been added as a variable. In addition, there are other methods of segment extraction, such as hierarchical segmentation, autoencoders and other clustering methods.

[1] UCI Machine Learning Repository: Online Retail Data Set