Segmenting without parameterization

How do I search for segments in my data, if my variables are not only numerical?

Generally, the segments created by most Data Scientists come from continuous data, such as weight, height, age or income.

Projected into dimensions, segments are chosen by the proximity between points, just as the K-Means algorithm does. However, in the real world, when we try to create segments from our data, it is usually not clean or in the format we want. It may also be the case that the nature of our data is simply not appropriate.

Data can be continuous, categorical, or binary. Categorical data, as the name implies, presents different categories within a dimension, for example: nationality. Within the nationality dimension, we can be Chilean, German or Nepalese. Some Data Scientists convert this categorical data to numerical data by parameterizing it with distance, language, or some other method that may or may not make sense depending on the conclusions we want to draw. However, there are times when it simply makes no sense to convert categorical data to continuous data. What is the distance between a bear and a buffalo? Their weight? Their height? Their coordinates? How do I parameterize that?

On the other hand, we have binary data. This is represented by numbers, but exclusively 0 or 1, and is used to indicate whether or not something satisfies a characteristic, variable or method. For example: is this person Nepalese or not? 0 if not, 1 if so. This way of storing data, known as dummy (or one-hot) encoding, is not necessarily the most efficient. For the example above, instead of a single column called Nationality, we would have one column for German, another for Chilean, and another for Nepalese, each holding 0 or 1 depending on whether the row satisfies that category, so the dummy columns of each row necessarily sum to 1.
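As a minimal sketch of this encoding with pandas (the sample values are invented for illustration):

```python
import pandas as pd

# Hypothetical Nationality column, as in the example above
df = pd.DataFrame({"Nationality": ["Chilean", "German", "Nepalese", "Chilean"]})

# Dummy (one-hot) encoding: one 0/1 column per category
dummies = pd.get_dummies(df["Nationality"], dtype=int)
print(dummies)

# Each row sums to exactly 1 across the dummy columns
print(dummies.sum(axis=1))
```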

So, how do I search for segments in my data, if my variables are not numerical, and it makes no sense to parameterize them?

K-Modes is an algorithm that follows the same procedure as K-Means, replacing the Euclidean distance (in 2, 3 or 10 dimensions) with the Hamming distance. This distance measures the difference between two rows by counting the number of variables (columns) in which they differ.

Consider two rows, Juan and Pamela, each described by four categorical variables R1 through R4. If they share the same category in R1 and R2 but differ in R3 and R4, the Hamming distance between them is 2. The K-Modes algorithm iterates for a certain number of segments, where each variable describing a segment is defined by the mode: the value most repeated among the rows that make up that segment. Today we will not go into how it iterates, but you can find information here. With a clear understanding of how K-Modes works, we can see that it handles categorical and even binary data well (two possible answers for each binary variable).
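A minimal sketch of this distance in Python; the category values for Juan and Pamela are invented, and only their mismatches in R3 and R4 come from the example above:

```python
import pandas as pd

# Hypothetical category values; only the fact that Juan and Pamela
# differ in R3 and R4 comes from the example above
juan = pd.Series({"R1": "a", "R2": "b", "R3": "c", "R4": "d"})
pamela = pd.Series({"R1": "a", "R2": "b", "R3": "x", "R4": "y"})

# Hamming distance: number of variables in which the two rows differ
print((juan != pamela).sum())  # 2
```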

Once K-Modes is implemented, how many segments should my data have? Two approaches are proposed, both frequently used with K-Means but little developed for K-Modes. These are the Elbow Method, which minimizes the difference between rows belonging to the same cluster, and the Silhouette Method, which maximizes the difference between distinct clusters. By developing both, the Data Scientist can make decisions weighing the trade-off between them: the similarity of the elements within a segment versus the difference between segments. Now, what does this look like in practice?

Elbow Method

The first step is to produce a DataFrame with an additional column indicating the segment to which each row belongs. Then the cost is calculated: take all the rows belonging to a segment and count, for each row, in how many variables it differs from the segment description (in K-Means, the centroid) to which it belongs. Returning to the example above, if Pamela's values are the segment description and Juan belongs to this segment, his cost is two. Then we move on to the next person belonging to the segment whose "centroid" is described by Pamela's variables.

Once the cost of each segment is calculated, the costs of all segments are added together for a given number of clusters. For example, for 3 clusters, take the cost of segment 1, then segment 2 and finally segment 3, and add them together. Then the same is done for 4 clusters, 5 clusters, and so on. In this way, the costs for each number of segments can be compared.
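As an illustration, here is a hedged sketch of that loop using the open-source kmodes package, whose cost_ attribute is exactly this summed mismatch count; the toy data and parameter choices are invented:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

# Toy categorical data standing in for the real DataFrame
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "R1": rng.choice(["a", "b", "c"], size=60),
    "R2": rng.choice(["x", "y"], size=60),
    "R3": rng.choice(["p", "q", "r"], size=60),
})

costs = []
ks = range(2, 15)
for k in ks:
    km = KModes(n_clusters=k, init="Huang", n_init=5, random_state=0)
    km.fit_predict(df)
    costs.append(km.cost_)  # summed mismatches between rows and their segment modes

plt.plot(list(ks), costs, marker="o")
plt.xlabel("Number of segments (k)")
plt.ylabel("Cost")
plt.title("Elbow Method")
plt.show()
```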

Thus we see a decreasing trend in cost as the number of segments increases. This is logical: if there were as many segments as rows, the cost would be zero (each row would be its own segment, so there would be no difference between the row and the segment description).

Silhouette Method

First, the closest segment to each segment must be defined. To do this, the Hamming distance between the segment we are "standing" on and each of the others is calculated, looking for the minimum. For example, for segment 1 the most similar is segment 4, while for segment 4 it is segment 8. Once the closest segment to each one is defined, the data within them are compared.
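A small sketch of this step, assuming the segment descriptions are available as an array (for example, cluster_centroids_ from a kmodes fit); the values below are invented:

```python
import numpy as np

# Hypothetical segment descriptions (modes), one row per segment
centroids = np.array([
    ["a", "x", "p"],
    ["a", "y", "q"],
    ["b", "x", "p"],
    ["c", "y", "r"],
])

# Pairwise Hamming distances between segment descriptions
dist = (centroids[:, None, :] != centroids[None, :, :]).sum(axis=2)

# Ignore each segment's distance to itself, then take the closest one
np.fill_diagonal(dist, centroids.shape[1] + 1)
nearest = dist.argmin(axis=1)
print(nearest)  # nearest[i] = index of the segment closest to segment i
```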

The comparison consists of finding the Hamming distance between each row in a cluster and that cluster's description: for example, between Juan and the description of segment 1, given by Pamela's values. We normalize this distance by the number of variables and average it over the cluster's rows, obtaining A. Then we do the same with respect to the closest cluster, in this case segment 4, represented by Domingo, normalizing and averaging to obtain B. This process is done for all the clusters.

Then we perform the following operation for each cluster:

C = (B − A) / max(A, B)

Clearly B > A, since B represents the distance from the data to its nearest neighboring cluster, while A represents the distance within its own cluster.

When C approaches 1, it means that this cluster is very different from its closest cluster (B >> A). On the other hand, when C approaches 0, it means that we have very similar clusters (B ≈ A). We can average C over all clusters, and then iterate over the number of segments we want. Below is the average C for 2 to 14 clusters.
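One practical way to approximate this, sketched below, is scikit-learn's silhouette_score with the Hamming metric on integer-encoded columns; note that scikit-learn averages the score over rows rather than per cluster, a close variant of the C described above:

```python
import numpy as np
import pandas as pd
from kmodes.kmodes import KModes
from sklearn.metrics import silhouette_score

# Same invented toy data as in the Elbow sketch
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "R1": rng.choice(["a", "b", "c"], size=60),
    "R2": rng.choice(["x", "y"], size=60),
    "R3": rng.choice(["p", "q", "r"], size=60),
})

# Integer-encode each column so the 'hamming' metric (fraction of
# differing variables, i.e. normalized Hamming distance) can be applied
encoded = df.apply(lambda col: pd.factorize(col)[0])

sil = []
for k in range(2, 15):
    km = KModes(n_clusters=k, init="Huang", n_init=5, random_state=0)
    labels = km.fit_predict(df)
    sil.append(silhouette_score(encoded, labels, metric="hamming"))
```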

With these two graphs in hand, we can begin to make informed decisions about the optimal number of clusters.

Useful tips

  • Group each variable into a few categories. For example, if a variable includes Walmart, Amazon, Facebook, Twitter, Instagram, Microsoft, Google and Adobe, group them as social network, e-commerce and SaaS (or however makes more sense for your segmentation), as sketched after this list.
  • Reduce the number of variables as much as possible, eliminating those that add the most cost to the segmentation. You can iterate with the Elbow Method and the Silhouette Method, testing different variables and evaluating performance.
  • Implement intermediate procedures that calculate the cost per cluster, in order to know which clusters are more solid than others, allowing well-defined commercial strategies.
  • Take any measure that reduces the combinatorics between variables.
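As a sketch of the first tip, with an invented column name and mapping:

```python
import pandas as pd

# Hypothetical high-cardinality variable and an invented regrouping
df = pd.DataFrame({"Company": ["Amazon", "Twitter", "Adobe", "Walmart", "Google"]})

group_map = {
    "Facebook": "social network", "Twitter": "social network", "Instagram": "social network",
    "Walmart": "e-commerce", "Amazon": "e-commerce",
    "Microsoft": "saas", "Google": "saas", "Adobe": "saas",
}
df["Company_group"] = df["Company"].map(group_map)
print(df)
```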

Final words

Thanks to K-Modes we can segment our clients, using the Elbow Method to make sure the elements belonging to a segment are as similar as possible to each other, and the Silhouette Method to check that the segments differ from one another as much as possible. Evaluating this trade-off is not always easy, but with efficient code it is possible to test and iterate on the result to reach an optimal number of segments in the data.

With implementations like this you can achieve a clean, optimal segmentation with highly interpretable results, which translates into clear and powerful initiatives for your customers, products and/or suppliers.